# SGLang Inference Server
Deploy an SGLang OpenAI-compatible inference server on Spheron GPU instances. SGLang features RadixAttention for KV cache reuse across requests and native support for constrained decoding and structured output.
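To see what a constrained-decoding request looks like, here is a sketch of a request body for the OpenAI-compatible endpoint. The `response_format` shape follows OpenAI's `json_schema` convention; the schema name and field names here are illustrative, so verify the exact format against the SGLang docs for the version you deploy.

```python
import json

# Hypothetical structured-output request: constrain the model to emit JSON
# matching a schema. Field names ("capital") are illustrative only.
payload = {
    "model": "meta-llama/Llama-3.1-8B-Instruct",
    "messages": [{"role": "user", "content": "Give the capital of France as JSON."}],
    "response_format": {
        "type": "json_schema",
        "json_schema": {
            "name": "capital",
            "schema": {
                "type": "object",
                "properties": {"capital": {"type": "string"}},
                "required": ["capital"],
            },
        },
    },
}

# Serialized body you would POST to http://localhost:30000/v1/chat/completions
body = json.dumps(payload)
```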
## Recommended hardware
| Model Size | Recommended GPU | Instance Type | Notes |
|---|---|---|---|
| 7B–13B | RTX 4090 (24GB) | Dedicated or Spot | Single-GPU, fast iteration |
| 30B–70B | A100 80GB (1×) | Dedicated | Full-precision or AWQ |
| 70B+ | H100 80GB (2× or more) | Dedicated / Cluster | Use `--tp` for tensor parallelism |
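The GPU recommendations above roughly track weight memory: fp16/bf16 weights take about 2 bytes per parameter, and the KV cache, activations, and CUDA overhead add on top. A quick back-of-envelope helper (an illustrative rule of thumb, not an SGLang API):

```python
def fp16_weight_gb(params_billion: float) -> float:
    """Approximate weight memory for fp16/bf16 models: ~2 bytes per parameter.

    Illustrative only: KV cache, activations, and CUDA context add on top,
    which is why a model wants noticeably more VRAM than its weight size.
    """
    return params_billion * 2.0

# A 7B model needs roughly 14 GB for weights alone, so it fits a 24 GB RTX 4090;
# a 70B model needs ~140 GB, hence multiple H100s with --tp.
```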
## Manual setup
Use these steps to set up the server manually after SSH-ing into your instance. This works on any provider regardless of cloud-init support.
### Step 1: Connect to your instance
```bash
ssh <user>@<ipAddress>
```

Replace `<user>` with the username shown in the instance details panel in the dashboard (e.g., `ubuntu` for Spheron AI instances) and `<ipAddress>` with your instance's public IP.
### Step 2: Install SGLang
```bash
sudo apt-get update -y
sudo apt-get install -y python3-pip
pip install "sglang[all]"
```

### Step 3: Start the server
Run the server in the foreground to verify it works:
```bash
python3 -m sglang.launch_server \
  --model-path meta-llama/Llama-3.1-8B-Instruct \
  --port 30000 \
  --tp 1
```

Press Ctrl+C to stop. Replace `meta-llama/Llama-3.1-8B-Instruct` with your target model and adjust `--tp` to match the number of GPUs.
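Model download and weight loading can take a while, so it helps to poll the server until it is ready before sending requests. A minimal sketch, assuming SGLang exposes a `/health` route on the serving port (check the docs for your version):

```python
import time
import urllib.error
import urllib.request

def wait_for_server(base_url: str = "http://localhost:30000", timeout: float = 300.0) -> bool:
    """Poll the /health endpoint until the server responds with 200 or the timeout elapses."""
    deadline = time.time() + timeout
    while time.time() < deadline:
        try:
            with urllib.request.urlopen(f"{base_url}/health", timeout=2) as resp:
                if resp.status == 200:
                    return True
        except (urllib.error.URLError, OSError):
            # Server not up yet (connection refused) or still loading weights.
            time.sleep(0.5)
    return False
```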
### Step 4: Run as a background service
To keep the server running after you close your SSH session, create a systemd service:
```bash
sudo tee /etc/systemd/system/sglang.service > /dev/null << 'EOF'
[Unit]
Description=SGLang Inference Server
After=network.target

[Service]
Type=simple
ExecStart=/usr/bin/python3 -m sglang.launch_server \
  --model-path meta-llama/Llama-3.1-8B-Instruct \
  --port 30000 \
  --tp 1
Restart=on-failure
RestartSec=10

[Install]
WantedBy=multi-user.target
EOF

sudo systemctl daemon-reload
sudo systemctl enable sglang
sudo systemctl start sglang
```

## Accessing the server
### SSH tunnel (recommended)
```bash
ssh -L 30000:localhost:30000 <user>@<ipAddress>
```

### List available models
```bash
curl http://localhost:30000/v1/models
```

### Usage example
```python
from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:30000/v1",
    api_key="not-needed",
)

response = client.chat.completions.create(
    model="meta-llama/Llama-3.1-8B-Instruct",
    messages=[{"role": "user", "content": "Explain RadixAttention briefly."}],
)
print(response.choices[0].message.content)
```

## Performance flags
| Flag | Description | Recommended Value |
|---|---|---|
| `--tp` | Tensor parallel degree | Match GPU count |
| `--chunked-prefill-size` | Chunked prefill token budget | 512 or 1024 |
| `--enable-torch-compile` | Enable torch.compile kernel fusion | Enable when startup time is acceptable (slower startup, faster inference) |
| `--mem-fraction-static` | Fraction of GPU memory reserved for model weights and KV cache | 0.85 |
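`--mem-fraction-static` caps how much GPU memory SGLang reserves up front; the KV-cache headroom is roughly what remains after the model weights. A back-of-envelope sketch of that relationship (an illustrative heuristic, not SGLang's actual allocator accounting):

```python
def kv_cache_budget_gb(
    gpu_mem_gb: float,
    mem_fraction_static: float = 0.85,
    model_weights_gb: float = 16.0,
) -> float:
    """Rough KV-cache headroom: the statically reserved fraction of GPU memory
    minus the weights. Illustrative only; SGLang's real accounting also covers
    activations and runtime buffers.
    """
    return max(0.0, gpu_mem_gb * mem_fraction_static - model_weights_gb)

# An 8B model (~16 GB fp16 weights) on an A100 80GB at the default 0.85
# leaves roughly 52 GB for KV cache under this heuristic.
```

Lowering `--mem-fraction-static` trades KV-cache capacity (and thus concurrent batch size) for headroom against out-of-memory errors.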
### Check server logs
```bash
journalctl -u sglang -f
```

## Cloud-init startup script (optional)
If your provider supports cloud-init, you can paste this into the Startup Script field when deploying to automate the setup above.
```yaml
#cloud-config
runcmd:
  - apt-get update -y
  - apt-get install -y python3-pip
  - pip install "sglang[all]"
  - |
    cat > /etc/systemd/system/sglang.service << 'EOF'
    [Unit]
    Description=SGLang Inference Server
    After=network.target

    [Service]
    Type=simple
    ExecStart=/usr/bin/python3 -m sglang.launch_server \
      --model-path meta-llama/Llama-3.1-8B-Instruct \
      --port 30000 \
      --tp 1
    Restart=on-failure
    RestartSec=10

    [Install]
    WantedBy=multi-user.target
    EOF
  - systemctl daemon-reload
  - systemctl enable sglang
  - systemctl start sglang
```

Replace `meta-llama/Llama-3.1-8B-Instruct` with your target model and adjust `--tp` to match the number of GPUs.
## What's next
- vLLM Inference Server: Alternative for production API workloads
- Inference Frameworks: Compare all serving stacks
- Networking: SSH tunneling and port access
- Cost Optimization: GPU tier selection for inference workloads