llama.cpp Server

Deploy llama.cpp as an OpenAI-compatible HTTP server on Spheron GPU instances. llama.cpp runs GGUF-quantized models and can split inference between CPU and GPU by offloading a configurable number of layers, making it well suited to consumer-grade GPUs and quantized inference.

Recommended hardware

| Model Size | Quantization | VRAM Required | Recommended GPU |
| --- | --- | --- | --- |
| 7B | Q4_K_M | ~4GB | RTX 4090 or any 8GB GPU |
| 7B | Q8_0 | ~8GB | RTX 4090 |
| 7B | F16 | ~14GB | RTX 4090 (24GB) |
| 13B | Q4_K_M | ~8GB | RTX 4090 |
| 30B | Q4_K_M | ~20GB | RTX 4090 (24GB, tight) |
| 70B | Q4_K_M | ~40GB | A100 80GB |
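The VRAM figures above follow a simple rule of thumb: the weight footprint is roughly parameter count times bits per weight, divided by 8, plus a little headroom for the KV cache. A minimal sketch (the bits-per-weight values and the 0.5GB overhead pad are approximations; exact GGUF file sizes vary by quant mix and context length):

```python
# Rough VRAM estimate for a GGUF model: weights + a small pad for KV cache.
# Bits-per-weight values are approximations; actual GGUF quant mixes vary.
BITS_PER_WEIGHT = {"Q4_K_M": 4.85, "Q8_0": 8.5, "F16": 16.0}

def estimate_vram_gb(params_billion: float, quant: str, overhead_gb: float = 0.5) -> float:
    weights_gb = params_billion * BITS_PER_WEIGHT[quant] / 8
    return round(weights_gb + overhead_gb, 1)

print(estimate_vram_gb(7, "Q4_K_M"))   # roughly 4-5 GB, matching the table
print(estimate_vram_gb(70, "Q4_K_M"))  # roughly 40+ GB
```

Larger contexts grow the KV cache, so budget more overhead for long-context workloads.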

Manual setup

Use these steps to set up the server manually after SSH-ing into your instance. This works on any provider regardless of cloud-init support.

Step 1: Connect to your instance

ssh <user>@<ipAddress>

Replace <user> with your instance username (e.g., root or ubuntu) and <ipAddress> with your instance's public IP.

Step 2: Install dependencies

sudo apt-get update -y
sudo apt-get install -y python3-pip cmake build-essential
pip install "llama-cpp-python[server]" --extra-index-url https://abetlen.github.io/llama-cpp-python/whl/cu121

Step 3: Download a GGUF model

mkdir -p /opt/llama-models
# Example: download a GGUF model using huggingface-cli
pip install huggingface_hub
huggingface-cli download TheBloke/Llama-2-7B-Chat-GGUF llama-2-7b-chat.Q4_K_M.gguf \
  --local-dir /opt/llama-models

Or copy your own GGUF file to /opt/llama-models/model.gguf.

Step 4: Start the server

Run the server in the foreground to verify it works:

python3 -m llama_cpp.server \
  --model /opt/llama-models/model.gguf \
  --n_gpu_layers -1 \
  --port 8080

Press Ctrl+C to stop.

Step 5: Run as a background service

To keep the server running after you close your SSH session, create a systemd service:

sudo tee /etc/systemd/system/llama-cpp.service > /dev/null << 'EOF'
[Unit]
Description=llama.cpp Inference Server
After=network.target
 
[Service]
Type=simple
ExecStart=/usr/bin/python3 -m llama_cpp.server \
  --model /opt/llama-models/model.gguf \
  --n_gpu_layers -1 \
  --port 8080
Restart=on-failure
RestartSec=10
 
[Install]
WantedBy=multi-user.target
EOF
 
sudo systemctl daemon-reload
sudo systemctl enable llama-cpp
sudo systemctl start llama-cpp

Replace /opt/llama-models/model.gguf with the path to your downloaded GGUF file. --n_gpu_layers -1 offloads all layers to the GPU; set it to a smaller positive number for mixed CPU+GPU inference when the full model does not fit in VRAM.

Accessing the server

SSH tunnel

ssh -L 8080:localhost:8080 <user>@<ipAddress>

Test completion

curl http://localhost:8080/v1/completions \
  -H "Content-Type: application/json" \
  -d '{
    "prompt": "Explain GGUF quantization briefly.",
    "max_tokens": 100
  }'

Usage example

from openai import OpenAI
 
client = OpenAI(
    base_url="http://localhost:8080/v1",
    api_key="not-needed",
)
 
response = client.chat.completions.create(
    model="model",
    messages=[{"role": "user", "content": "What is GGUF quantization?"}],
)
print(response.choices[0].message.content)

GPU layer offload

| --n_gpu_layers | Behavior |
| --- | --- |
| -1 | All layers on GPU (fastest) |
| 0 | CPU only (no GPU) |
| 20 | First 20 layers on GPU, rest on CPU |

Use partial offload when VRAM is insufficient for the full model.
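To pick a partial offload value, divide the model's weight footprint by its layer count and see how many layers your spare VRAM covers. A rough sketch (the 1GB reserve for the KV cache and CUDA buffers is an assumption; tune it for your context size):

```python
def layers_that_fit(vram_gb: float, model_size_gb: float, n_layers: int,
                    reserve_gb: float = 1.0) -> int:
    """Estimate how many transformer layers fit in VRAM, keeping a
    reserve for the KV cache and CUDA buffers."""
    per_layer_gb = model_size_gb / n_layers
    fit = int((vram_gb - reserve_gb) / per_layer_gb)
    return max(0, min(fit, n_layers))

# e.g. a 13B Q4_K_M model (~8GB on disk, 40 layers) on a 6GB card:
print(layers_that_fit(6, 8, 40))  # 25
```

Pass the result as `--n_gpu_layers`; if the server still runs out of memory at load time, lower it a few layers at a time.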

Check server logs

journalctl -u llama-cpp -f

Cloud-init startup script (optional)

If your provider supports cloud-init, you can paste this into the Startup Script field when deploying to automate the setup above.

#cloud-config
runcmd:
  - apt-get update -y
  - apt-get install -y python3-pip cmake build-essential
  - pip install "llama-cpp-python[server]" --extra-index-url https://abetlen.github.io/llama-cpp-python/whl/cu121
  - pip install huggingface_hub
  - mkdir -p /opt/llama-models
  - huggingface-cli download TheBloke/Llama-2-7B-Chat-GGUF llama-2-7b-chat.Q4_K_M.gguf --local-dir /opt/llama-models
  - mv /opt/llama-models/llama-2-7b-chat.Q4_K_M.gguf /opt/llama-models/model.gguf
  - |
    cat > /etc/systemd/system/llama-cpp.service << 'EOF'
    [Unit]
    Description=llama.cpp Inference Server
    After=network.target
 
    [Service]
    Type=simple
    ExecStart=/usr/bin/python3 -m llama_cpp.server \
      --model /opt/llama-models/model.gguf \
      --n_gpu_layers -1 \
      --port 8080
    Restart=on-failure
    RestartSec=10
 
    [Install]
    WantedBy=multi-user.target
    EOF
  - systemctl daemon-reload
  - systemctl enable llama-cpp
  - systemctl start llama-cpp

The script downloads llama-2-7b-chat.Q4_K_M.gguf (~4GB) as model.gguf before starting the service. To use a different model, replace the huggingface-cli download and mv lines with your preferred model download, ensuring the final file is saved to /opt/llama-models/model.gguf.

What's next