Mistral & Mixtral
Deploy Mistral AI models on Spheron GPU instances using vLLM. This guide covers Mistral 7B for single-GPU deployment, the Mixtral 8x7B mixture-of-experts model (~90 GB of VRAM in bfloat16, so it needs 2× A100 80GB), and Mistral Small 3.1 24B.
Recommended hardware
| Model | Recommended GPU | Instance Type | Notes |
|---|---|---|---|
| Mistral-7B-Instruct-v0.3 | RTX 4090 (24GB) | Dedicated or Spot | ~15 GB of weights in bfloat16; fits in 24 GB with KV cache |
| Mixtral-8x7B-Instruct-v0.1 | 2× A100 80GB | Dedicated | MoE, ~90 GB VRAM in bfloat16 |
| Mistral-Small-3.1-24B | A100 80GB | Dedicated | Full precision (~55GB VRAM) |
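As a rule of thumb, model weights need about 2 bytes per parameter in bfloat16, plus headroom for the KV cache and activations. A quick sketch of the arithmetic behind the table above (parameter counts are approximate):

```python
def bf16_weight_gib(params_billion: float) -> float:
    """Approximate VRAM (GiB) for model weights in bfloat16: 2 bytes/param."""
    return params_billion * 1e9 * 2 / 2**30

# Mixtral 8x7B has ~46.7B total parameters; all experts stay resident in
# VRAM at inference even though only 2 are active per token.
print(f"Mistral 7B:   {bf16_weight_gib(7.2):.0f} GiB")
print(f"Mixtral 8x7B: {bf16_weight_gib(46.7):.0f} GiB")
```

This is why Mixtral needs two A100 80GB cards despite activating only ~13B parameters per token: the full expert set must be loaded.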
Manual setup
Use these steps to set up the server manually after SSH-ing into your instance. This works on any provider regardless of cloud-init support.
Step 1: Connect to your instance
```shell
ssh <user>@<ipAddress>
```
Replace `<user>` with the username shown in the instance details panel (e.g., `ubuntu` for Spheron AI instances) and `<ipAddress>` with your instance's public IP.
Step 2: Install vLLM
```shell
sudo apt-get update -y
sudo apt-get install -y python3-pip
pip install vllm
```
Step 3: Start the server
Run the server in the foreground to verify it works:
```shell
python3 -m vllm.entrypoints.openai.api_server \
  --model mistralai/Mistral-7B-Instruct-v0.3 \
  --port 8000 \
  --dtype bfloat16
```
Press Ctrl+C to stop.
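Once the server is up, you can smoke-test it through its OpenAI-compatible REST API using only the standard library. A minimal sketch (the helper name `build_chat_request` is ours, not part of vLLM):

```python
import json
from urllib.request import Request, urlopen

def build_chat_request(base_url: str, model: str, prompt: str) -> Request:
    """Build a POST request for the OpenAI-compatible /v1/chat/completions endpoint."""
    payload = {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": 64,
    }
    return Request(
        f"{base_url}/v1/chat/completions",
        data=json.dumps(payload).encode(),
        headers={"Content-Type": "application/json"},
    )

req = build_chat_request(
    "http://localhost:8000", "mistralai/Mistral-7B-Instruct-v0.3", "Say hello."
)
# Uncomment once the server is running:
# with urlopen(req, timeout=60) as resp:
#     print(json.load(resp)["choices"][0]["message"]["content"])
```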
Step 4: Run as a background service
To keep the server running after you close your SSH session, create a systemd service:
```shell
sudo tee /etc/systemd/system/vllm-mistral.service > /dev/null << 'EOF'
[Unit]
Description=Mistral vLLM Inference Server
After=network.target

[Service]
Type=simple
ExecStart=/usr/bin/python3 -m vllm.entrypoints.openai.api_server \
  --model mistralai/Mistral-7B-Instruct-v0.3 \
  --port 8000 \
  --dtype bfloat16
Restart=on-failure
RestartSec=10

[Install]
WantedBy=multi-user.target
EOF

sudo systemctl daemon-reload
sudo systemctl enable vllm-mistral
sudo systemctl start vllm-mistral
```
Mixtral 8x7B (2× A100 80GB)
For the Mixtral MoE model, replace the ExecStart command with:
```shell
/usr/bin/python3 -m vllm.entrypoints.openai.api_server \
  --model mistralai/Mixtral-8x7B-Instruct-v0.1 \
  --port 8000 \
  --dtype bfloat16 \
  --tensor-parallel-size 2
```
Accessing the server
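Many providers firewall everything except SSH by default, so it is worth checking whether port 8000 is reachable directly before deciding how to connect. A small sketch:

```python
import socket

def port_open(host: str, port: int, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False

# Replace 127.0.0.1 with your instance's public IP. False usually means the
# provider's firewall blocks the port; use an SSH tunnel in that case.
print(port_open("127.0.0.1", 8000))
```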
SSH tunnel
```shell
ssh -L 8000:localhost:8000 <user>@<ipAddress>
```
Usage example: function calling
```python
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="not-needed")

tools = [
    {
        "type": "function",
        "function": {
            "name": "search_web",
            "description": "Search the web for information",
            "parameters": {
                "type": "object",
                "properties": {"query": {"type": "string"}},
                "required": ["query"],
            },
        },
    }
]

response = client.chat.completions.create(
    model="mistralai/Mistral-7B-Instruct-v0.3",
    messages=[{"role": "user", "content": "Search for the latest GPU benchmarks."}],
    tools=tools,
    tool_choice="auto",
)

if response.choices[0].message.tool_calls:
    tool_call = response.choices[0].message.tool_calls[0]
    print(f"Function: {tool_call.function.name}")
    print(f"Arguments: {tool_call.function.arguments}")
```
Cloud-init startup script (optional)
If your provider supports cloud-init, you can paste this into the Startup Script field when deploying to automate the setup above.
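Once the instance boots, you can confirm the startup script ran by inspecting the result file cloud-init writes to `/run/cloud-init/result.json` after all boot stages finish. A sketch:

```python
import json
from pathlib import Path
from typing import Optional

def cloud_init_errors(path: str = "/run/cloud-init/result.json") -> Optional[list]:
    """Return the errors cloud-init recorded, or None if it hasn't finished.

    cloud-init writes result.json after all boot stages complete; an empty
    list means every runcmd step, including the vLLM setup, ran cleanly.
    """
    p = Path(path)
    if not p.exists():
        return None
    return json.loads(p.read_text())["v1"]["errors"]

errors = cloud_init_errors()
if errors is None:
    print("cloud-init has not finished yet (or result.json is elsewhere)")
elif errors:
    print("cloud-init reported errors:", errors)
else:
    print("startup script completed cleanly")
```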
Mistral 7B (RTX 4090)
```yaml
#cloud-config
runcmd:
  - apt-get update -y
  - apt-get install -y python3-pip
  - pip install vllm
  - |
    cat > /etc/systemd/system/vllm-mistral.service << 'EOF'
    [Unit]
    Description=Mistral vLLM Inference Server
    After=network.target

    [Service]
    Type=simple
    ExecStart=/usr/bin/python3 -m vllm.entrypoints.openai.api_server \
      --model mistralai/Mistral-7B-Instruct-v0.3 \
      --port 8000 \
      --dtype bfloat16
    Restart=on-failure
    RestartSec=10

    [Install]
    WantedBy=multi-user.target
    EOF
  - systemctl daemon-reload
  - systemctl enable vllm-mistral
  - systemctl start vllm-mistral
```
Mixtral 8x7B (2× A100 80GB)
Replace the ExecStart line with:
```shell
/usr/bin/python3 -m vllm.entrypoints.openai.api_server \
  --model mistralai/Mixtral-8x7B-Instruct-v0.1 \
  --port 8000 \
  --dtype bfloat16 \
  --tensor-parallel-size 2
```
What's next
- vLLM Inference Server: vLLM configuration details
- Llama 3.1 / 3.2 / 3.3: Meta Llama model guides
- Instance Types: GPU selection for MoE models
- Cost Optimization: Spot instances for Mistral 7B