# DeepSeek R1 & V3

Deploy DeepSeek R1 and DeepSeek V3 reasoning models on Spheron GPU instances using vLLM. DeepSeek R1 exposes chain-of-thought reasoning via `<think>` blocks; distilled variants (7B–32B) run on single GPUs.
## Recommended hardware
| Model | Recommended GPU | Instance Type | Notes |
|---|---|---|---|
| R1-Distill-7B | RTX 4090 (24GB) | Dedicated or Spot | Single-GPU, fast iteration |
| R1-Distill-14B | A100 40GB | Dedicated | Full precision |
| R1-Distill-32B | A100 80GB (INT4) | Dedicated | AWQ quantization |
| DeepSeek-V3 671B | 8× H100 80GB (FP8) | Cluster | `--tensor-parallel-size 8` |
| DeepSeek-R1 671B | 8× H100 80GB (FP8) | Cluster | `--tensor-parallel-size 8` |
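As a rough sanity check on these pairings, the weights alone need about params × bytes-per-param of VRAM. The back-of-envelope sketch below ignores KV cache, activations, and CUDA overhead, so real usage is higher; the helper name `weight_vram_gb` is just for illustration:

```python
def weight_vram_gb(params_billion: float, bytes_per_param: float) -> float:
    """Approximate VRAM needed for model weights alone, in GiB."""
    return params_billion * 1e9 * bytes_per_param / 1024**3

# bf16 = 2 bytes/param, fp8 = 1, int4 (AWQ) = 0.5
print(f"7B   @ bf16: {weight_vram_gb(7, 2):.0f} GiB")    # fits a 24 GB RTX 4090
print(f"32B  @ int4: {weight_vram_gb(32, 0.5):.0f} GiB") # fits an 80 GB A100 with headroom
print(f"671B @ fp8:  {weight_vram_gb(671, 1):.0f} GiB")  # needs 8× H100 80GB
```

This is why the 671B models require a cluster: even at FP8, the weights approach the combined 640 GB of an 8× H100 node.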
## Manual setup
Use these steps to set up the server manually after SSH-ing into your instance. This works on any provider regardless of cloud-init support.
### Step 1: Connect to your instance

```bash
ssh <user>@<ipAddress>
```

Replace `<user>` with the username shown in the instance details panel (e.g., `ubuntu` for Spheron AI instances) and `<ipAddress>` with your instance's public IP.
### Step 2: Install vLLM

```bash
sudo apt-get update -y
sudo apt-get install -y python3-pip
pip install vllm
```

### Step 3: Start the server
Run the server in the foreground to verify it works:
```bash
python3 -m vllm.entrypoints.openai.api_server \
  --model deepseek-ai/DeepSeek-R1-Distill-Qwen-7B \
  --port 8000 \
  --dtype bfloat16
```

Press Ctrl+C to stop.
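Before stopping the server, you can confirm it is up by querying the OpenAI-compatible `/v1/models` endpoint. A minimal check, assuming the default port 8000 (the helper name `list_models` is just for illustration):

```python
import json
import urllib.error
import urllib.request

def list_models(base_url="http://localhost:8000", timeout=5):
    """Return the model IDs the vLLM server is serving, or None if unreachable."""
    try:
        with urllib.request.urlopen(f"{base_url}/v1/models", timeout=timeout) as resp:
            data = json.load(resp)
        return [m["id"] for m in data.get("data", [])]
    except (urllib.error.URLError, OSError):
        return None

models = list_models()
if models is None:
    print("Server not reachable yet -- it may still be downloading weights.")
else:
    print("Serving:", models)
```

The first startup can take several minutes while vLLM downloads the model weights from Hugging Face, so a `None` result right after launch is normal.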
### Step 4: Run as a background service
To keep the server running after you close your SSH session, create a systemd service:
```bash
sudo tee /etc/systemd/system/vllm-deepseek.service > /dev/null << 'EOF'
[Unit]
Description=DeepSeek R1 vLLM Inference Server
After=network.target

[Service]
Type=simple
ExecStart=/usr/bin/python3 -m vllm.entrypoints.openai.api_server \
  --model deepseek-ai/DeepSeek-R1-Distill-Qwen-7B \
  --port 8000 \
  --dtype bfloat16
Restart=on-failure
RestartSec=10

[Install]
WantedBy=multi-user.target
EOF

sudo systemctl daemon-reload
sudo systemctl enable vllm-deepseek
sudo systemctl start vllm-deepseek
```

For the full DeepSeek-R1 671B model on 8× H100, replace the ExecStart command with:
```bash
/usr/bin/python3 -m vllm.entrypoints.openai.api_server \
  --model deepseek-ai/DeepSeek-R1 \
  --port 8000 \
  --dtype fp8 \
  --tensor-parallel-size 8 \
  --gpu-memory-utilization 0.95
```

## Accessing the server
### SSH tunnel
```bash
ssh -L 8000:localhost:8000 <user>@<ipAddress>
```

## Usage example: reasoning block parsing
DeepSeek R1 outputs reasoning inside `<think>...</think>` tags before the final answer.
```python
from openai import OpenAI
import re

client = OpenAI(base_url="http://localhost:8000/v1", api_key="not-needed")

response = client.chat.completions.create(
    model="deepseek-ai/DeepSeek-R1-Distill-Qwen-7B",
    messages=[{"role": "user", "content": "What is 17 × 23? Show your reasoning."}],
)
content = response.choices[0].message.content

# Extract reasoning and final answer
think_match = re.search(r"<think>(.*?)</think>", content, re.DOTALL)
if think_match:
    reasoning = think_match.group(1).strip()
    answer = content[think_match.end():].strip()
    print("Reasoning:", reasoning)
    print("Answer:", answer)
else:
    print(content)
```

## Cloud-init startup script (optional)
If your provider supports cloud-init, you can paste this into the Startup Script field when deploying to automate the setup above.
### R1-Distill-7B (RTX 4090)
```yaml
#cloud-config
write_files:
  - path: /etc/systemd/system/vllm-deepseek.service
    content: |
      [Unit]
      Description=DeepSeek R1 vLLM Inference Server
      After=network.target

      [Service]
      Type=simple
      ExecStart=/usr/bin/python3 -m vllm.entrypoints.openai.api_server \
        --model deepseek-ai/DeepSeek-R1-Distill-Qwen-7B \
        --port 8000 \
        --dtype bfloat16
      Restart=on-failure
      RestartSec=10

      [Install]
      WantedBy=multi-user.target

runcmd:
  - apt-get update -y
  - apt-get install -y python3-pip
  - pip install vllm
  - systemctl daemon-reload
  - systemctl enable vllm-deepseek
  - systemctl start vllm-deepseek
```

### DeepSeek-R1 671B (8× H100, FP8)
Replace the ExecStart line with:
```bash
/usr/bin/python3 -m vllm.entrypoints.openai.api_server \
  --model deepseek-ai/DeepSeek-R1 \
  --port 8000 \
  --dtype fp8 \
  --tensor-parallel-size 8 \
  --gpu-memory-utilization 0.95
```

## What's next
- vLLM Inference Server: vLLM configuration details
- Llama 3.1 / 3.2 / 3.3: Meta Llama model guides
- Instance Types: H100 NVLink cluster for full 671B
- Cost Optimization: GPU tier selection for inference workloads