# llama.cpp Server

Deploy llama.cpp as an OpenAI-compatible HTTP server on Spheron GPU instances. llama.cpp supports GGUF-quantized models and can offload layers between CPU and GPU, making it ideal for consumer-grade GPUs and quantized inference.
## Recommended hardware
| Model Size | Quantization | VRAM Required | Recommended GPU |
|---|---|---|---|
| 7B | Q4_K_M | ~4GB | RTX 4090 or any 8GB GPU |
| 7B | Q8_0 | ~8GB | RTX 4090 |
| 7B | F16 | ~14GB | RTX 4090 (24GB) |
| 13B | Q4_K_M | ~8GB | RTX 4090 |
| 30B | Q4_K_M | ~20GB | RTX 4090 (24GB, tight) |
| 70B | Q4_K_M | ~40GB | A100 80GB |
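The VRAM figures above can be sanity-checked with a back-of-the-envelope formula: parameter count times bits per weight, plus headroom for the KV cache and activations. The sketch below assumes approximate bits-per-weight values for each quantization and a ~20% overhead factor; both are rules of thumb, not exact figures.

```python
# Rough VRAM estimate: params * bits-per-weight / 8, plus ~20% overhead
# for KV cache and activations. Bits-per-weight values are approximations.
QUANT_BITS = {"Q4_K_M": 4.5, "Q8_0": 8.5, "F16": 16.0}

def vram_gb(params_billions: float, quant: str, overhead: float = 1.2) -> float:
    weights_gb = params_billions * QUANT_BITS[quant] / 8
    return round(weights_gb * overhead, 1)

print(vram_gb(7, "Q4_K_M"))   # 7B at Q4_K_M, roughly matches the table
print(vram_gb(70, "Q4_K_M"))  # 70B at Q4_K_M
```

Context length also matters: a long context grows the KV cache, so treat these numbers as a lower bound when planning capacity.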
## Manual setup

Use these steps to set up the server manually after SSH-ing into your instance. This works on any provider regardless of cloud-init support.
### Step 1: Connect to your instance

```bash
ssh <user>@<ipAddress>
```

Replace `<user>` with your instance username (e.g., `root` or `ubuntu`) and `<ipAddress>` with your instance's public IP.
### Step 2: Install dependencies

```bash
sudo apt-get update -y
sudo apt-get install -y python3-pip cmake build-essential
pip install "llama-cpp-python[server]" --extra-index-url https://abetlen.github.io/llama-cpp-python/whl/cu121
```

### Step 3: Download a GGUF model
```bash
mkdir -p /opt/llama-models

# Example: download a GGUF model using huggingface-cli
pip install huggingface_hub
huggingface-cli download TheBloke/Llama-2-7B-Chat-GGUF llama-2-7b-chat.Q4_K_M.gguf \
  --local-dir /opt/llama-models
```

Or copy your own GGUF file to `/opt/llama-models/model.gguf`.
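After downloading, it can be worth confirming the file is actually a GGUF model and not a truncated or HTML error response: valid GGUF files begin with the 4-byte magic `b"GGUF"`. A small sketch (the path shown is an example):

```python
# Sanity-check that a downloaded file really is a GGUF model.
# Valid GGUF files start with the 4-byte ASCII magic b"GGUF".
def is_gguf(path: str) -> bool:
    with open(path, "rb") as f:
        return f.read(4) == b"GGUF"

# Usage (example path):
# print(is_gguf("/opt/llama-models/model.gguf"))
```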
### Step 4: Start the server

Run the server in the foreground to verify it works:

```bash
python3 -m llama_cpp.server \
  --model /opt/llama-models/model.gguf \
  --n_gpu_layers -1 \
  --port 8080
```

Press Ctrl+C to stop.
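Large models can take a while to load before the server starts answering. A minimal readiness poll against the OpenAI-compatible `/v1/models` endpoint, run from a second terminal, is one way to check; this is a sketch using only the standard library, and the host/port are assumptions matching the command above:

```python
# Poll the server's /v1/models endpoint until it responds with HTTP 200,
# or give up after `timeout` seconds. Useful while the model is loading.
import time
import urllib.request

def wait_for_server(base_url: str = "http://localhost:8080",
                    timeout: float = 60.0) -> bool:
    deadline = time.time() + timeout
    while time.time() < deadline:
        try:
            with urllib.request.urlopen(f"{base_url}/v1/models", timeout=5) as resp:
                return resp.status == 200
        except OSError:
            time.sleep(1)  # not up yet; retry
    return False
```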
### Step 5: Run as a background service

To keep the server running after you close your SSH session, create a systemd service:
```bash
sudo tee /etc/systemd/system/llama-cpp.service > /dev/null << 'EOF'
[Unit]
Description=llama.cpp Inference Server
After=network.target

[Service]
Type=simple
ExecStart=/usr/bin/python3 -m llama_cpp.server \
  --model /opt/llama-models/model.gguf \
  --n_gpu_layers -1 \
  --port 8080
Restart=on-failure
RestartSec=10

[Install]
WantedBy=multi-user.target
EOF

sudo systemctl daemon-reload
sudo systemctl enable llama-cpp
sudo systemctl start llama-cpp
```

Replace `/opt/llama-models/model.gguf` with your downloaded GGUF file path. `--n_gpu_layers -1` offloads all layers to the GPU; set a smaller number for mixed CPU+GPU inference.
## Accessing the server

### SSH tunnel

```bash
ssh -L 8080:localhost:8080 <user>@<ipAddress>
```

### Test completion
```bash
curl http://localhost:8080/v1/completions \
  -H "Content-Type: application/json" \
  -d '{
    "prompt": "Explain GGUF quantization briefly.",
    "max_tokens": 100
  }'
```

### Usage example
```python
from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:8080/v1",
    api_key="not-needed",
)

response = client.chat.completions.create(
    model="model",
    messages=[{"role": "user", "content": "What is GGUF quantization?"}],
)
print(response.choices[0].message.content)
```

## GPU layer offload
| `--n_gpu_layers` | Behavior |
|---|---|
| -1 | All layers on GPU (fastest) |
| 0 | CPU only (no GPU) |
| 20 | First 20 layers on GPU, rest on CPU |
Use partial offload when VRAM is insufficient for the full model.
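Picking a partial offload value can be reduced to simple arithmetic: assume the model's layers are roughly equal in size, reserve some VRAM headroom for the KV cache, and offload as many whole layers as fit. The sketch below is a back-of-the-envelope estimate only; the headroom value and equal-layer assumption are simplifications.

```python
# Estimate a --n_gpu_layers value when the full model does not fit in VRAM.
# Assumes layers are roughly equal in size and reserves headroom for KV cache.
def layers_that_fit(model_gb: float, n_layers: int, vram_gb: float,
                    headroom_gb: float = 1.5) -> int:
    per_layer_gb = model_gb / n_layers
    usable = max(vram_gb - headroom_gb, 0.0)
    return min(n_layers, int(usable / per_layer_gb))

# e.g. a ~20GB model with 60 layers on an 8GB GPU:
print(layers_that_fit(20.0, 60, 8.0))
```

Start from the estimate, watch the server logs for out-of-memory errors, and adjust downward if loading fails.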
## Check server logs

```bash
journalctl -u llama-cpp -f
```

## Cloud-init startup script (optional)
If your provider supports cloud-init, you can paste this into the Startup Script field when deploying to automate the setup above.
```yaml
#cloud-config
runcmd:
  - apt-get update -y
  - apt-get install -y python3-pip cmake build-essential
  - pip install "llama-cpp-python[server]" --extra-index-url https://abetlen.github.io/llama-cpp-python/whl/cu121
  - pip install huggingface_hub
  - mkdir -p /opt/llama-models
  - huggingface-cli download TheBloke/Llama-2-7B-Chat-GGUF llama-2-7b-chat.Q4_K_M.gguf --local-dir /opt/llama-models
  - mv /opt/llama-models/llama-2-7b-chat.Q4_K_M.gguf /opt/llama-models/model.gguf
  - |
    cat > /etc/systemd/system/llama-cpp.service << 'EOF'
    [Unit]
    Description=llama.cpp Inference Server
    After=network.target

    [Service]
    Type=simple
    ExecStart=/usr/bin/python3 -m llama_cpp.server \
      --model /opt/llama-models/model.gguf \
      --n_gpu_layers -1 \
      --port 8080
    Restart=on-failure
    RestartSec=10

    [Install]
    WantedBy=multi-user.target
    EOF
  - systemctl daemon-reload
  - systemctl enable llama-cpp
  - systemctl start llama-cpp
```

The script downloads `llama-2-7b-chat.Q4_K_M.gguf` (~4GB) as `model.gguf` before starting the service. To use a different model, replace the `huggingface-cli download` and `mv` lines with your preferred model download, ensuring the final file is saved to `/opt/llama-models/model.gguf`.
## What's next
- vLLM Inference Server: Higher throughput for production workloads
- Inference Frameworks: Compare all serving stacks
- Networking: SSH tunneling and port access
- Instance Types: Choosing the right GPU for your model