# SGLang Inference Server
Deploy an SGLang OpenAI-compatible inference server on Spheron GPU instances. SGLang features RadixAttention for KV cache reuse across requests and native support for constrained decoding and structured output.
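To see what a constrained-decoding request looks like, here is a sketch of a request body for the OpenAI-compatible endpoint. The `response_format` shape follows OpenAI's `json_schema` convention; the schema name and field names here are illustrative, so verify the exact format against the SGLang docs for the version you deploy.

```python
import json

# Hypothetical structured-output request: constrain the model to emit JSON
# matching a schema. Field names ("capital") are illustrative only.
payload = {
    "model": "meta-llama/Llama-3.1-8B-Instruct",
    "messages": [{"role": "user", "content": "Give the capital of France as JSON."}],
    "response_format": {
        "type": "json_schema",
        "json_schema": {
            "name": "capital",
            "schema": {
                "type": "object",
                "properties": {"capital": {"type": "string"}},
                "required": ["capital"],
            },
        },
    },
}

# Serialized body you would POST to http://localhost:30000/v1/chat/completions
body = json.dumps(payload)
```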
## Recommended hardware
| Model Size | Recommended GPU | Instance Type | Notes |
|---|---|---|---|
| 7B–13B | RTX 4090 (24GB) | Dedicated or Spot | Single-GPU, fast iteration |
| 30B–70B | A100 80GB (1×) | Dedicated | Full-precision or AWQ |
| 70B+ | H100 80GB (2× or more) | Dedicated / Cluster | Use `--tp` for tensor parallelism |
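The GPU recommendations above roughly track weight memory: fp16/bf16 weights take about 2 bytes per parameter, and the KV cache, activations, and CUDA overhead add on top. A quick back-of-envelope helper (an illustrative rule of thumb, not an SGLang API):

```python
def fp16_weight_gb(params_billion: float) -> float:
    """Approximate weight memory for fp16/bf16 models: ~2 bytes per parameter.

    Illustrative only: KV cache, activations, and CUDA context add on top,
    which is why a model wants noticeably more VRAM than its weight size.
    """
    return params_billion * 2.0

# A 7B model needs roughly 14 GB for weights alone, so it fits a 24 GB RTX 4090;
# a 70B model needs ~140 GB, hence multiple H100s with --tp.
```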
## Manual setup
Use these steps to set up the server manually after SSH-ing into your instance. This works on any provider regardless of cloud-init support.
### Step 1: Connect to your instance
```bash
ssh <user>@<ipAddress>
```

Replace `<user>` with the username shown in the instance details panel in the dashboard (e.g., `ubuntu` for Spheron AI instances) and `<ipAddress>` with your instance's public IP.
### Step 2: Install SGLang
```bash
sudo apt-get update -y
sudo apt-get install -y python3-pip
pip install "sglang[all]"
```

### Step 3: Start the server
Run the server in the foreground to verify it works:
```bash
python3 -m sglang.launch_server \
  --model-path meta-llama/Llama-3.1-8B-Instruct \
  --port 30000 \
  --tp 1
```

Press Ctrl+C to stop. Replace `meta-llama/Llama-3.1-8B-Instruct` with your target model and adjust `--tp` to match the number of GPUs.
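Model download and weight loading can take a while, so it helps to poll the server until it is ready before sending requests. A minimal sketch, assuming SGLang exposes a `/health` route on the serving port (check the docs for your version):

```python
import time
import urllib.error
import urllib.request

def wait_for_server(base_url: str = "http://localhost:30000", timeout: float = 300.0) -> bool:
    """Poll the /health endpoint until the server responds with 200 or the timeout elapses."""
    deadline = time.time() + timeout
    while time.time() < deadline:
        try:
            with urllib.request.urlopen(f"{base_url}/health", timeout=2) as resp:
                if resp.status == 200:
                    return True
        except (urllib.error.URLError, OSError):
            # Server not up yet (connection refused) or still loading weights.
            time.sleep(0.5)
    return False
```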
### Step 4: Run as a background service
To keep the server running after you close your SSH session, create a systemd service:
```bash
sudo tee /etc/systemd/system/sglang.service > /dev/null << 'EOF'
[Unit]
Description=SGLang Inference Server
After=network.target

[Service]
Type=simple
ExecStart=/usr/bin/python3 -m sglang.launch_server \
  --model-path meta-llama/Llama-3.1-8B-Instruct \
  --port 30000 \
  --tp 1
Restart=on-failure
RestartSec=10

[Install]
WantedBy=multi-user.target
EOF

sudo systemctl daemon-reload
sudo systemctl enable sglang
sudo systemctl start sglang
```

## Accessing the server
### SSH tunnel (recommended)
```bash
ssh -L 30000:localhost:30000 <user>@<ipAddress>
```

### List available models
```bash
curl http://localhost:30000/v1/models
```

### Usage example
```python
from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:30000/v1",
    api_key="not-needed",
)

response = client.chat.completions.create(
    model="meta-llama/Llama-3.1-8B-Instruct",
    messages=[{"role": "user", "content": "Explain RadixAttention briefly."}],
)
print(response.choices[0].message.content)
```

## Performance flags
| Flag | Description | Recommended Value |
|---|---|---|
| `--tp` | Tensor parallel degree | Match GPU count |
| `--chunked-prefill-size` | Chunked prefill token budget | 512 or 1024 |
| `--enable-torch-compile` | Enable torch.compile kernel fusion | Enable when startup time is acceptable (slower startup, faster inference) |
| `--mem-fraction-static` | Fraction of GPU memory reserved for model weights and KV cache | 0.85 |
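`--mem-fraction-static` caps how much GPU memory SGLang reserves up front; the KV-cache headroom is roughly what remains after the model weights. A back-of-envelope sketch of that relationship (an illustrative heuristic, not SGLang's actual allocator accounting):

```python
def kv_cache_budget_gb(
    gpu_mem_gb: float,
    mem_fraction_static: float = 0.85,
    model_weights_gb: float = 16.0,
) -> float:
    """Rough KV-cache headroom: the statically reserved fraction of GPU memory
    minus the weights. Illustrative only; SGLang's real accounting also covers
    activations and runtime buffers.
    """
    return max(0.0, gpu_mem_gb * mem_fraction_static - model_weights_gb)

# An 8B model (~16 GB fp16 weights) on an A100 80GB at the default 0.85
# leaves roughly 52 GB for KV cache under this heuristic.
```

Lowering `--mem-fraction-static` trades KV-cache capacity (and thus concurrent batch size) for headroom against out-of-memory errors.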
### Check server logs
```bash
journalctl -u sglang -f
```

## Cloud-init startup script (optional)
If your provider supports cloud-init, you can paste this into the Startup Script field when deploying to automate the setup above.
```yaml
#cloud-config
runcmd:
  - apt-get update -y
  - apt-get install -y python3-pip
  - pip install "sglang[all]"
  - |
    cat > /etc/systemd/system/sglang.service << 'EOF'
    [Unit]
    Description=SGLang Inference Server
    After=network.target

    [Service]
    Type=simple
    ExecStart=/usr/bin/python3 -m sglang.launch_server \
      --model-path meta-llama/Llama-3.1-8B-Instruct \
      --port 30000 \
      --tp 1
    Restart=on-failure
    RestartSec=10

    [Install]
    WantedBy=multi-user.target
    EOF
  - systemctl daemon-reload
  - systemctl enable sglang
  - systemctl start sglang
```

Replace `meta-llama/Llama-3.1-8B-Instruct` with your target model and adjust `--tp` to match the number of GPUs.
## What's next
- vLLM Inference Server: Alternative for production API workloads
- Inference Frameworks: Compare all serving stacks
- Networking: SSH tunneling and port access
- Cost Optimization: GPU tier selection for inference workloads