# LMDeploy
Deploy LMDeploy with the TurboMind inference engine on Spheron A100 or H100 instances. LMDeploy supports AWQ quantization for memory-efficient inference and exposes an OpenAI-compatible API.
## Recommended hardware
| Model Size | Recommended GPU | Instance Type | Notes |
|---|---|---|---|
| 7B (AWQ) | RTX 4090 (24GB) | Dedicated or Spot | ~8GB VRAM with W4A16 AWQ |
| 7B (FP16) | A100 40GB | Dedicated | Full precision |
| 30B+ | A100 80GB (1–2×) | Dedicated | Use `--tp 2` for tensor parallelism |
| 70B+ | H100 80GB (2× or more) | Cluster | TurboMind multi-GPU |
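The VRAM notes in the table can be sanity-checked from first principles: weight memory is roughly parameter count times bits per weight. A minimal sketch (real usage is higher because of the KV cache, activations, and AWQ's per-group scales):

```python
def weight_gib(params_billion: float, bits_per_weight: float) -> float:
    """Approximate weight memory in GiB: params * (bits / 8) bytes."""
    return params_billion * 1e9 * bits_per_weight / 8 / 2**30

# 7B at FP16: ~13 GiB of weights; at W4A16 AWQ: ~3.3 GiB,
# which is how a quantized 7B fits the ~8GB budget in the table above.
fp16_gib = weight_gib(7, 16)
awq_gib = weight_gib(7, 4)
print(f"FP16: {fp16_gib:.1f} GiB, W4A16: {awq_gib:.1f} GiB")
```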
## Manual setup
Use these steps to set up the server manually after SSH-ing into your instance. This works on any provider regardless of cloud-init support.
### Step 1: Connect to your instance
```bash
ssh <user>@<ipAddress>
```

Replace `<user>` with the username shown in the instance details panel in the dashboard (e.g., `ubuntu` for Spheron AI instances) and `<ipAddress>` with your instance's public IP.
### Step 2: Install LMDeploy
```bash
sudo apt-get update -y
sudo apt-get install -y python3-pip
pip install lmdeploy
```

### Step 3: Start the server
Run the server in the foreground to verify it works:
```bash
python3 -m lmdeploy serve api_server \
    Qwen/Qwen2.5-7B-Instruct \
    --server-port 23333 \
    --backend turbomind
```

Press Ctrl+C to stop. Replace `Qwen/Qwen2.5-7B-Instruct` with your target model.
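Model loading can take a while, so requests sent immediately after startup may fail. One way to wait for readiness is to poll the OpenAI-compatible `/v1/models` endpoint with the standard library; this helper is a sketch (the URL matches the port used above, and the timeout values are arbitrary choices):

```python
import json
import time
import urllib.error
import urllib.request

def wait_for_server(base_url: str, timeout_s: float = 300.0) -> bool:
    """Poll /v1/models until the server lists at least one model or time runs out."""
    deadline = time.monotonic() + timeout_s
    while time.monotonic() < deadline:
        try:
            with urllib.request.urlopen(f"{base_url}/v1/models", timeout=5) as resp:
                if resp.status == 200:
                    return len(json.load(resp).get("data", [])) > 0
        except (urllib.error.URLError, OSError):
            pass  # server not up yet; retry
        time.sleep(2)
    return False

# e.g. wait_for_server("http://localhost:23333")
```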
### Step 4: Run as a background service
To keep the server running after you close your SSH session, create a systemd service:
```bash
sudo tee /etc/systemd/system/lmdeploy.service > /dev/null << 'EOF'
[Unit]
Description=LMDeploy Inference Server
After=network.target

[Service]
Type=simple
ExecStart=/usr/bin/python3 -m lmdeploy serve api_server \
    Qwen/Qwen2.5-7B-Instruct \
    --server-port 23333 \
    --backend turbomind
Restart=on-failure
RestartSec=10

[Install]
WantedBy=multi-user.target
EOF

sudo systemctl daemon-reload
sudo systemctl enable lmdeploy
sudo systemctl start lmdeploy
```

## AWQ quantization
To convert a model to AWQ 4-bit before serving (reduces VRAM by ~50%):
```bash
lmdeploy lite auto_awq \
    Qwen/Qwen2.5-7B-Instruct \
    --calib-dataset ptb \
    --calib-samples 128 \
    --work-dir ./qwen-7b-awq
```

Then serve the quantized model:
```bash
lmdeploy serve api_server ./qwen-7b-awq \
    --model-format awq \
    --server-port 23333 \
    --backend turbomind
```

## Accessing the server
### SSH tunnel

From your local machine, forward the API port over SSH:

```bash
ssh -L 23333:localhost:23333 <user>@<ipAddress>
```

### Usage example
```python
from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:23333/v1",
    api_key="not-needed",
)

response = client.chat.completions.create(
    model="Qwen/Qwen2.5-7B-Instruct",
    messages=[{"role": "user", "content": "What is AWQ quantization?"}],
)
print(response.choices[0].message.content)
```

## Performance flags
| Flag | Description |
|---|---|
| `--backend turbomind` | Use the TurboMind engine (default, fastest) |
| `--tp` | Tensor parallel degree (number of GPUs to shard the model across) |
| `--cache-max-entry-count` | Fraction of free GPU memory reserved for the KV cache |
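`--cache-max-entry-count` is applied to the GPU memory left free after weights are loaded, and lowering it is the usual fix for out-of-memory errors at startup. To get a feel for what the cache actually holds, the per-context size follows from the attention shape. A back-of-envelope sketch (the 28-layer / 4-KV-head / 128-dim shape is an assumed Qwen2.5-7B-class GQA configuration, for illustration only):

```python
def kv_cache_gib(layers: int, kv_heads: int, head_dim: int,
                 tokens: int, bytes_per_elem: int = 2) -> float:
    """K and V tensors: 2 * layers * kv_heads * head_dim elements per token."""
    return 2 * layers * kv_heads * head_dim * tokens * bytes_per_elem / 2**30

# Assumed shape for a Qwen2.5-7B-class model with grouped-query attention.
full_context = kv_cache_gib(layers=28, kv_heads=4, head_dim=128, tokens=32_768)
print(f"32k-token KV cache: {full_context:.2f} GiB")  # 1.75 GiB at FP16
```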
## Check server logs

```bash
journalctl -u lmdeploy -f
```

## Cloud-init startup script (optional)
If your provider supports cloud-init, you can paste this into the Startup Script field when deploying to automate the setup above.
```yaml
#cloud-config
runcmd:
  - apt-get update -y
  - apt-get install -y python3-pip
  - pip install lmdeploy
  - |
    cat > /etc/systemd/system/lmdeploy.service << 'EOF'
    [Unit]
    Description=LMDeploy Inference Server
    After=network.target

    [Service]
    Type=simple
    ExecStart=/usr/bin/python3 -m lmdeploy serve api_server \
      Qwen/Qwen2.5-7B-Instruct \
      --server-port 23333 \
      --backend turbomind
    Restart=on-failure
    RestartSec=10

    [Install]
    WantedBy=multi-user.target
    EOF
  - systemctl daemon-reload
  - systemctl enable lmdeploy
  - systemctl start lmdeploy
```

Replace `Qwen/Qwen2.5-7B-Instruct` with your target model.
## What's next
- vLLM Inference Server: Wider model compatibility
- Inference Frameworks: Compare all serving stacks
- Networking: SSH tunneling and port access
- Instance Types: A100 vs H100 for inference workloads