Qwen3 Dense & MoE
Deploy Qwen3 dense and Mixture-of-Experts (MoE) models on Spheron GPU instances using vLLM. Qwen3 introduces a thinking mode that can be toggled at inference time with the `/think` and `/no_think` soft switches in the prompt.
Recommended hardware
| Model | Recommended GPU | Instance Type | Notes |
|---|---|---|---|
| Qwen3-8B | RTX 4090 (24GB) | Dedicated or Spot | Dense, single-GPU |
| Qwen3-14B | A100 40GB | Dedicated | Dense |
| Qwen3-32B | A100 80GB | Dedicated | Dense |
| Qwen3-30B-A3B | A100 80GB | Dedicated | MoE, 3B active params |
| Qwen3-235B-A22B | 8× H100 80GB | Cluster | MoE, `--tensor-parallel-size 8` |
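As a sanity check on the table above: in bfloat16 every parameter occupies 2 bytes, so the weights alone need roughly 2 GB per billion parameters, and the serving process needs extra headroom for the KV cache, activations, and CUDA context. A minimal sketch (the 20% `overhead` factor is an illustrative assumption, not a measured figure):

```python
def min_vram_gb(params_billions: float, bytes_per_param: int = 2,
                overhead: float = 1.2) -> float:
    """Rough serving estimate: bfloat16 weights (2 bytes/param) plus ~20%
    illustrative headroom for KV cache, activations, and CUDA context."""
    return params_billions * bytes_per_param * overhead

for name, size in [("Qwen3-8B", 8), ("Qwen3-14B", 14), ("Qwen3-32B", 32)]:
    print(f"{name}: ~{min_vram_gb(size):.0f} GB VRAM")
```

By this estimate Qwen3-8B (~19 GB) fits a 24 GB RTX 4090, while Qwen3-32B (~77 GB) needs the A100 80GB, consistent with the recommendations above.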
Manual setup
Use these steps to set up the server manually after SSH-ing into your instance. This works on any provider regardless of cloud-init support.
Step 1: Connect to your instance
```bash
ssh <user>@<ipAddress>
```
Replace `<user>` with the username shown in the instance details panel (e.g., `ubuntu` for Spheron AI instances) and `<ipAddress>` with your instance's public IP.
Step 2: Install vLLM
```bash
sudo apt-get update -y
sudo apt-get install -y python3-pip
pip install vllm
```
Step 3: Start the server
Run the server in the foreground to verify it works:
```bash
python3 -m vllm.entrypoints.openai.api_server \
  --model Qwen/Qwen3-32B \
  --port 8000 \
  --dtype bfloat16
```
Press Ctrl+C to stop.
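The first start can take several minutes while vLLM downloads the model weights; the server begins answering only after loading finishes. A small polling helper (an illustrative sketch using the server's `/health` endpoint) avoids sending requests too early:

```python
import time
import urllib.request

def wait_for_server(base_url: str = "http://localhost:8000",
                    total_timeout: float = 600.0,
                    poll_interval: float = 5.0) -> bool:
    """Poll the /health endpoint until it responds or the deadline passes."""
    deadline = time.monotonic() + total_timeout
    while time.monotonic() < deadline:
        try:
            with urllib.request.urlopen(f"{base_url}/health", timeout=5) as resp:
                if resp.status == 200:
                    return True
        except OSError:
            pass  # connection refused or still loading -- keep polling
        time.sleep(poll_interval)
    return False

# On the instance: wait_for_server() blocks until vLLM is ready to serve.
```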
Step 4: Run as a background service
To keep the server running after you close your SSH session, create a systemd service:
```bash
sudo tee /etc/systemd/system/vllm-qwen3.service > /dev/null << 'EOF'
[Unit]
Description=Qwen3 vLLM Inference Server
After=network.target

[Service]
Type=simple
ExecStart=/usr/bin/python3 -m vllm.entrypoints.openai.api_server \
  --model Qwen/Qwen3-32B \
  --port 8000 \
  --dtype bfloat16
Restart=on-failure
RestartSec=10

[Install]
WantedBy=multi-user.target
EOF

sudo systemctl daemon-reload
sudo systemctl enable vllm-qwen3
sudo systemctl start vllm-qwen3
```
Qwen3-235B-A22B MoE (8× H100)
For the large MoE model, replace the `ExecStart` command with:
```bash
/usr/bin/python3 -m vllm.entrypoints.openai.api_server \
  --model Qwen/Qwen3-235B-A22B \
  --port 8000 \
  --dtype bfloat16 \
  --tensor-parallel-size 8 \
  --gpu-memory-utilization 0.95
```
As a rough sanity check: 235B parameters in bfloat16 take about 470 GB of weights, or ~59 GB per GPU across 8 tensor-parallel ranks, so `--gpu-memory-utilization 0.95` is needed to leave room for the KV cache on 80 GB cards.
Accessing the server
SSH tunnel
```bash
ssh -L 8000:localhost:8000 <user>@<ipAddress>
```
Usage example: thinking mode
Qwen3 supports toggleable chain-of-thought reasoning via the `/think` and `/no_think` soft switches, which can be placed in the system or user prompt.
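When thinking mode is on, Qwen3 emits its reasoning inside `<think>...</think>` tags ahead of the final answer. If you only want the answer, a small post-processing helper (an illustrative sketch) can strip the reasoning block:

```python
import re

_THINK_RE = re.compile(r"<think>.*?</think>\s*", flags=re.DOTALL)

def strip_thinking(text: str) -> str:
    """Remove <think>...</think> reasoning blocks, keeping the final answer."""
    return _THINK_RE.sub("", text).strip()

raw = "<think>x² + 3x - 10 factors as (x + 5)(x - 2).</think>\nx = -5 or x = 2"
print(strip_thinking(raw))  # → x = -5 or x = 2
```

Note that depending on the vLLM version and whether a reasoning parser is enabled, the reasoning may instead be returned in a separate field rather than inline; this helper covers the inline-tags case.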
```python
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="not-needed")

# Enable thinking mode (the default for Qwen3 instruct models)
response_think = client.chat.completions.create(
    model="Qwen/Qwen3-32B",
    messages=[
        {"role": "system", "content": "/think"},
        {"role": "user", "content": "Solve: if x² + 3x - 10 = 0, what is x?"},
    ],
)

# Disable thinking mode for faster responses
response_no_think = client.chat.completions.create(
    model="Qwen/Qwen3-32B",
    messages=[
        {"role": "system", "content": "/no_think"},
        {"role": "user", "content": "What is the capital of France?"},
    ],
)

print("With thinking:", response_think.choices[0].message.content[:200])
print("Without thinking:", response_no_think.choices[0].message.content)
```
Cloud-init startup script (optional)
If your provider supports cloud-init, you can paste this into the Startup Script field when deploying to automate the setup above.
Qwen3-32B (A100 80GB)
```yaml
#cloud-config
runcmd:
  - apt-get update -y
  - apt-get install -y python3-pip
  - pip install vllm
  - |
    cat > /etc/systemd/system/vllm-qwen3.service << 'EOF'
    [Unit]
    Description=Qwen3 vLLM Inference Server
    After=network.target

    [Service]
    Type=simple
    ExecStart=/usr/bin/python3 -m vllm.entrypoints.openai.api_server \
      --model Qwen/Qwen3-32B \
      --port 8000 \
      --dtype bfloat16
    Restart=on-failure
    RestartSec=10

    [Install]
    WantedBy=multi-user.target
    EOF
  - systemctl daemon-reload
  - systemctl enable vllm-qwen3
  - systemctl start vllm-qwen3
```
Qwen3-235B-A22B MoE (8× H100)
Replace the `ExecStart` line with:
```bash
/usr/bin/python3 -m vllm.entrypoints.openai.api_server \
  --model Qwen/Qwen3-235B-A22B \
  --port 8000 \
  --dtype bfloat16 \
  --tensor-parallel-size 8 \
  --gpu-memory-utilization 0.95
```
What's next
- vLLM Inference Server: vLLM configuration details
- DeepSeek R1 & V3: Another reasoning model option
- Instance Types: Multi-GPU setup for MoE models
- Cost Optimization: Spot instances for development