
SGLang Inference Server

Deploy an SGLang OpenAI-compatible inference server on Spheron GPU instances. SGLang features RadixAttention for KV cache reuse across requests and native support for constrained decoding and structured output.

Recommended hardware

| Model Size | Recommended GPU | Instance Type | Notes |
|---|---|---|---|
| 7B–13B | RTX 4090 (24 GB) | Dedicated or Spot | Single GPU, fast iteration |
| 30B–70B | A100 80 GB (1×) | Dedicated | Full precision or AWQ |
| 70B+ | H100 80 GB (2× or more) | Dedicated / Cluster | Use `--tp` for tensor parallelism |
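The sizing above follows from a back-of-envelope weight-memory calculation: parameter count times bytes per parameter, with KV cache and CUDA overhead on top. A minimal sketch (the helper name is ours, not part of SGLang):

```python
def weight_memory_gb(params_billion: float, bytes_per_param: float = 2.0) -> float:
    """Rough VRAM needed just for model weights (FP16/BF16 = 2 bytes/param).
    KV cache, activations, and CUDA overhead come on top of this."""
    return params_billion * 1e9 * bytes_per_param / 1e9

# An 8B model in BF16 needs ~16 GB for weights alone, which is why a
# 24 GB RTX 4090 covers the 7B-13B range (with quantization toward the
# upper end), while 70B models need A100/H100-class memory or multi-GPU.
print(weight_memory_gb(8))   # 16.0
print(weight_memory_gb(70))  # 140.0 -> multi-GPU or quantization
```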

Manual setup

Use these steps to set up the server manually after SSH-ing into your instance. This works on any provider regardless of cloud-init support.

Step 1: Connect to your instance

ssh <user>@<ipAddress>

Replace <user> with the username shown in the instance details panel in the dashboard (e.g., ubuntu for Spheron AI instances) and <ipAddress> with your instance's public IP.

Step 2: Install SGLang

sudo apt-get update -y
sudo apt-get install -y python3-pip
sudo pip install "sglang[all]"

Installing with sudo places the package in the system site-packages, so the systemd service in Step 4 (which runs as root) can import it. If you install without sudo, the package lands in your user's ~/.local and the service will fail with a module-not-found error.

Step 3: Start the server

Run the server in the foreground to verify it works:

python3 -m sglang.launch_server \
  --model-path meta-llama/Llama-3.1-8B-Instruct \
  --port 30000 \
  --tp 1

Press Ctrl+C to stop. Replace meta-llama/Llama-3.1-8B-Instruct with your target model and adjust --tp to match the number of GPUs.
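Startup can take several minutes while the model weights download and load, so it helps to poll the server before sending traffic. A small readiness-check sketch using only the standard library (it assumes the server answers HTTP on the port you chose; the helper name is ours):

```python
import time
import urllib.error
import urllib.request

def wait_for_server(url: str = "http://localhost:30000/v1/models",
                    timeout_s: float = 300.0) -> bool:
    """Poll the server until it responds with HTTP 200 or the timeout
    expires. Model loading can take minutes for large checkpoints, so
    the default timeout is generous."""
    deadline = time.monotonic() + timeout_s
    while time.monotonic() < deadline:
        try:
            with urllib.request.urlopen(url, timeout=2) as resp:
                if resp.status == 200:
                    return True
        except (urllib.error.URLError, OSError):
            pass  # server not up yet; keep polling
        time.sleep(2)
    return False
```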

Step 4: Run as a background service

To keep the server running after you close your SSH session, create a systemd service:

sudo tee /etc/systemd/system/sglang.service > /dev/null << 'EOF'
[Unit]
Description=SGLang Inference Server
After=network.target
 
[Service]
Type=simple
ExecStart=/usr/bin/python3 -m sglang.launch_server \
  --model-path meta-llama/Llama-3.1-8B-Instruct \
  --port 30000 \
  --tp 1
Restart=on-failure
RestartSec=10
 
[Install]
WantedBy=multi-user.target
EOF
 
sudo systemctl daemon-reload
sudo systemctl enable sglang
sudo systemctl start sglang
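If you redeploy with different models or GPU counts often, it can be convenient to render the unit file from a template instead of editing it by hand. A sketch (the template mirrors the unit file above; the function name is ours):

```python
UNIT_TEMPLATE = """\
[Unit]
Description=SGLang Inference Server
After=network.target

[Service]
Type=simple
ExecStart=/usr/bin/python3 -m sglang.launch_server \\
  --model-path {model} \\
  --port {port} \\
  --tp {tp}
Restart=on-failure
RestartSec=10

[Install]
WantedBy=multi-user.target
"""

def render_unit(model: str, port: int = 30000, tp: int = 1) -> str:
    """Fill the systemd unit template with model, port, and tensor
    parallel degree."""
    return UNIT_TEMPLATE.format(model=model, port=port, tp=tp)

# Write the result to /etc/systemd/system/sglang.service (e.g. via sudo tee),
# then daemon-reload and restart as above.
print(render_unit("meta-llama/Llama-3.1-8B-Instruct", tp=1))
```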

Accessing the server

SSH tunnel (recommended)

ssh -L 30000:localhost:30000 <user>@<ipAddress>

List available models

curl http://localhost:30000/v1/models

Usage example

from openai import OpenAI
 
client = OpenAI(
    base_url="http://localhost:30000/v1",
    api_key="not-needed",
)
 
response = client.chat.completions.create(
    model="meta-llama/Llama-3.1-8B-Instruct",
    messages=[{"role": "user", "content": "Explain RadixAttention briefly."}],
)
print(response.choices[0].message.content)
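Since the intro mentions SGLang's constrained decoding and structured output, here is a sketch of a schema-constrained request built as a plain payload dict. The `response_format` field names follow the OpenAI chat-completions convention; whether your SGLang version accepts this exact shape is an assumption worth checking against its docs:

```python
import json

def json_schema_request(model: str, prompt: str, schema: dict,
                        name: str = "answer") -> dict:
    """Build a chat-completions payload asking for JSON output constrained
    to `schema`. Field layout follows the OpenAI response_format convention
    (an assumption; verify against your SGLang version)."""
    return {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "response_format": {
            "type": "json_schema",
            "json_schema": {"name": name, "schema": schema},
        },
    }

payload = json_schema_request(
    "meta-llama/Llama-3.1-8B-Instruct",
    "Name one GPU suitable for a 7B model and its VRAM in GB.",
    {"type": "object",
     "properties": {"gpu": {"type": "string"}, "vram_gb": {"type": "number"}},
     "required": ["gpu", "vram_gb"]},
)
print(json.dumps(payload, indent=2))
```

POST this to `http://localhost:30000/v1/chat/completions`, or pass the `response_format` value via the OpenAI client's `response_format=` argument.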

Performance flags

| Flag | Description | Recommended Value |
|---|---|---|
| `--tp` | Tensor parallel degree | Match GPU count |
| `--chunked-prefill-size` | Chunked prefill token budget | `512` or `1024` |
| `--enable-torch-compile` | Enable `torch.compile` for kernel fusion | Slower startup, faster inference |
| `--mem-fraction-static` | Fraction of GPU memory reserved for model weights and KV cache | `0.85` |
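The flags above compose into a single launch command. A sketch that assembles the argv so the tuning knobs live in one place (the helper and its defaults are ours; the flags are the ones from the table):

```python
from typing import List, Optional

def launch_cmd(model: str, tp: int, port: int = 30000,
               chunked_prefill: Optional[int] = 1024,
               mem_fraction: Optional[float] = 0.85,
               torch_compile: bool = False) -> List[str]:
    """Assemble the sglang.launch_server argv from the performance flags."""
    cmd = ["python3", "-m", "sglang.launch_server",
           "--model-path", model, "--port", str(port), "--tp", str(tp)]
    if chunked_prefill is not None:
        cmd += ["--chunked-prefill-size", str(chunked_prefill)]
    if mem_fraction is not None:
        cmd += ["--mem-fraction-static", str(mem_fraction)]
    if torch_compile:
        cmd.append("--enable-torch-compile")
    return cmd

# e.g. a 2-GPU tensor-parallel launch:
print(" ".join(launch_cmd("meta-llama/Llama-3.1-8B-Instruct", tp=2)))
```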

Check server logs

journalctl -u sglang -f

Cloud-init startup script (optional)

If your provider supports cloud-init, you can paste this into the Startup Script field when deploying to automate the setup above.

#cloud-config
runcmd:
  - apt-get update -y
  - apt-get install -y python3-pip
  - pip install "sglang[all]"
  - |
    cat > /etc/systemd/system/sglang.service << 'EOF'
    [Unit]
    Description=SGLang Inference Server
    After=network.target
 
    [Service]
    Type=simple
    ExecStart=/usr/bin/python3 -m sglang.launch_server \
      --model-path meta-llama/Llama-3.1-8B-Instruct \
      --port 30000 \
      --tp 1
    Restart=on-failure
    RestartSec=10
 
    [Install]
    WantedBy=multi-user.target
    EOF
  - systemctl daemon-reload
  - systemctl enable sglang
  - systemctl start sglang

Replace meta-llama/Llama-3.1-8B-Instruct with your target model and adjust --tp to match the number of GPUs.

What's next