# LMDeploy
Deploy LMDeploy with the TurboMind inference engine on Spheron A100 or H100 instances. LMDeploy supports AWQ quantization for memory-efficient inference and exposes an OpenAI-compatible API.
## Recommended hardware
| Model Size | Recommended GPU | Instance Type | Notes |
|---|---|---|---|
| 7B (AWQ) | RTX 4090 (24GB) | Dedicated or Spot | ~8GB VRAM with W4A16 AWQ |
| 7B (FP16) | A100 40GB | Dedicated | Full precision |
| 30B+ | A100 80GB (1–2×) | Dedicated | Use `--tp 2` for tensor parallelism |
| 70B+ | H100 80GB (2× or more) | Cluster | TurboMind multi-GPU |
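The VRAM notes in the table can be sanity-checked from first principles: weight memory is roughly parameter count times bits per weight. A minimal sketch (real usage is higher because of the KV cache, activations, and AWQ's per-group scales):

```python
def weight_gib(params_billion: float, bits_per_weight: float) -> float:
    """Approximate weight memory in GiB: params * (bits / 8) bytes."""
    return params_billion * 1e9 * bits_per_weight / 8 / 2**30

# 7B at FP16: ~13 GiB of weights; at W4A16 AWQ: ~3.3 GiB,
# which is how a quantized 7B fits the ~8GB budget in the table above.
fp16_gib = weight_gib(7, 16)
awq_gib = weight_gib(7, 4)
print(f"FP16: {fp16_gib:.1f} GiB, W4A16: {awq_gib:.1f} GiB")
```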
## Manual setup
Use these steps to set up the server manually after SSH-ing into your instance. This works on any provider regardless of cloud-init support.
### Step 1: Connect to your instance
```bash
ssh <user>@<ipAddress>
```

Replace `<user>` with the username shown in the instance details panel in the dashboard (e.g., `ubuntu` for Spheron AI instances) and `<ipAddress>` with your instance's public IP.
### Step 2: Install LMDeploy
```bash
sudo apt-get update -y
sudo apt-get install -y python3-pip
pip install lmdeploy
```

### Step 3: Start the server
Run the server in the foreground to verify it works:
```bash
python3 -m lmdeploy serve api_server \
    Qwen/Qwen2.5-7B-Instruct \
    --server-port 23333 \
    --backend turbomind
```

Press Ctrl+C to stop. Replace `Qwen/Qwen2.5-7B-Instruct` with your target model.
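Model loading can take a while, so requests sent immediately after startup may fail. One way to wait for readiness is to poll the OpenAI-compatible `/v1/models` endpoint with the standard library; this helper is a sketch (the URL matches the port used above, and the timeout values are arbitrary choices):

```python
import json
import time
import urllib.error
import urllib.request

def wait_for_server(base_url: str, timeout_s: float = 300.0) -> bool:
    """Poll /v1/models until the server lists at least one model or time runs out."""
    deadline = time.monotonic() + timeout_s
    while time.monotonic() < deadline:
        try:
            with urllib.request.urlopen(f"{base_url}/v1/models", timeout=5) as resp:
                if resp.status == 200:
                    return len(json.load(resp).get("data", [])) > 0
        except (urllib.error.URLError, OSError):
            pass  # server not up yet; retry
        time.sleep(2)
    return False

# e.g. wait_for_server("http://localhost:23333")
```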
### Step 4: Run as a background service
To keep the server running after you close your SSH session, create a systemd service:
```bash
sudo tee /etc/systemd/system/lmdeploy.service > /dev/null << 'EOF'
[Unit]
Description=LMDeploy Inference Server
After=network.target

[Service]
Type=simple
ExecStart=/usr/bin/python3 -m lmdeploy serve api_server \
    Qwen/Qwen2.5-7B-Instruct \
    --server-port 23333 \
    --backend turbomind
Restart=on-failure
RestartSec=10

[Install]
WantedBy=multi-user.target
EOF

sudo systemctl daemon-reload
sudo systemctl enable lmdeploy
sudo systemctl start lmdeploy
```

## AWQ quantization
To convert a model to AWQ 4-bit before serving (reduces VRAM by ~50%):
```bash
lmdeploy lite auto_awq \
    Qwen/Qwen2.5-7B-Instruct \
    --calib-dataset ptb \
    --calib-samples 128 \
    --work-dir ./qwen-7b-awq
```

Then serve the quantized model:
```bash
lmdeploy serve api_server ./qwen-7b-awq \
    --model-format awq \
    --server-port 23333 \
    --backend turbomind
```

## Accessing the server
### SSH tunnel

From your local machine, forward the API port over SSH:

```bash
ssh -L 23333:localhost:23333 <user>@<ipAddress>
```

### Usage example
```python
from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:23333/v1",
    api_key="not-needed",
)

response = client.chat.completions.create(
    model="Qwen/Qwen2.5-7B-Instruct",
    messages=[{"role": "user", "content": "What is AWQ quantization?"}],
)
print(response.choices[0].message.content)
```

## Performance flags
| Flag | Description |
|---|---|
| `--backend turbomind` | Use the TurboMind engine (default, fastest) |
| `--tp` | Tensor parallel degree (number of GPUs to shard the model across) |
| `--cache-max-entry-count` | Fraction of free GPU memory reserved for the KV cache |
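`--cache-max-entry-count` is applied to the GPU memory left free after weights are loaded, and lowering it is the usual fix for out-of-memory errors at startup. To get a feel for what the cache actually holds, the per-context size follows from the attention shape. A back-of-envelope sketch (the 28-layer / 4-KV-head / 128-dim shape is an assumed Qwen2.5-7B-class GQA configuration, for illustration only):

```python
def kv_cache_gib(layers: int, kv_heads: int, head_dim: int,
                 tokens: int, bytes_per_elem: int = 2) -> float:
    """K and V tensors: 2 * layers * kv_heads * head_dim elements per token."""
    return 2 * layers * kv_heads * head_dim * tokens * bytes_per_elem / 2**30

# Assumed shape for a Qwen2.5-7B-class model with grouped-query attention.
full_context = kv_cache_gib(layers=28, kv_heads=4, head_dim=128, tokens=32_768)
print(f"32k-token KV cache: {full_context:.2f} GiB")  # 1.75 GiB at FP16
```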
## Check server logs

```bash
journalctl -u lmdeploy -f
```

## Cloud-init startup script (optional)
If your provider supports cloud-init, you can paste this into the Startup Script field when deploying to automate the setup above.
```yaml
#cloud-config
runcmd:
  - apt-get update -y
  - apt-get install -y python3-pip
  - pip install lmdeploy
  - |
    cat > /etc/systemd/system/lmdeploy.service << 'EOF'
    [Unit]
    Description=LMDeploy Inference Server
    After=network.target

    [Service]
    Type=simple
    ExecStart=/usr/bin/python3 -m lmdeploy serve api_server \
      Qwen/Qwen2.5-7B-Instruct \
      --server-port 23333 \
      --backend turbomind
    Restart=on-failure
    RestartSec=10

    [Install]
    WantedBy=multi-user.target
    EOF
  - systemctl daemon-reload
  - systemctl enable lmdeploy
  - systemctl start lmdeploy
```

Replace `Qwen/Qwen2.5-7B-Instruct` with your target model.
## What's next
- vLLM Inference Server: Wider model compatibility
- Inference Frameworks: Compare all serving stacks
- Networking: SSH tunneling and port access
- Instance Types: A100 vs H100 for inference workloads