# llama.cpp Server

Deploy llama.cpp as an OpenAI-compatible HTTP server on Spheron GPU instances. llama.cpp supports GGUF-quantized models and can offload layers between CPU and GPU, making it ideal for consumer-grade GPUs and quantized inference.
## Recommended hardware
| Model Size | Quantization | VRAM Required | Recommended GPU |
|---|---|---|---|
| 7B | Q4_K_M | ~4GB | RTX 4090 or any 8GB GPU |
| 7B | Q8_0 | ~8GB | RTX 4090 |
| 7B | F16 | ~14GB | RTX 4090 (24GB) |
| 13B | Q4_K_M | ~8GB | RTX 4090 |
| 30B | Q4_K_M | ~20GB | RTX 4090 (24GB, tight) |
| 70B | Q4_K_M | ~40GB | A100 80GB |
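The VRAM figures above can be sanity-checked with a back-of-the-envelope formula: parameter count times bits per weight, plus headroom for the KV cache and activations. The sketch below assumes approximate bits-per-weight values for each quantization and a ~20% overhead factor; both are rules of thumb, not exact figures.

```python
# Rough VRAM estimate: params * bits-per-weight / 8, plus ~20% overhead
# for KV cache and activations. Bits-per-weight values are approximations.
QUANT_BITS = {"Q4_K_M": 4.5, "Q8_0": 8.5, "F16": 16.0}

def vram_gb(params_billions: float, quant: str, overhead: float = 1.2) -> float:
    weights_gb = params_billions * QUANT_BITS[quant] / 8
    return round(weights_gb * overhead, 1)

print(vram_gb(7, "Q4_K_M"))   # 7B at Q4_K_M, roughly matches the table
print(vram_gb(70, "Q4_K_M"))  # 70B at Q4_K_M
```

Context length also matters: a long context grows the KV cache, so treat these numbers as a lower bound when planning capacity.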
## Manual setup

Use these steps to set up the server manually after SSH-ing into your instance. This works on any provider regardless of cloud-init support.
### Step 1: Connect to your instance

```bash
ssh <user>@<ipAddress>
```

Replace `<user>` with your instance username (e.g., `root` or `ubuntu`) and `<ipAddress>` with your instance's public IP.
### Step 2: Install dependencies

```bash
sudo apt-get update -y
sudo apt-get install -y python3-pip cmake build-essential
pip install "llama-cpp-python[server]" --extra-index-url https://abetlen.github.io/llama-cpp-python/whl/cu121
```

### Step 3: Download a GGUF model
```bash
mkdir -p /opt/llama-models

# Example: download a GGUF model using huggingface-cli
pip install huggingface_hub
huggingface-cli download TheBloke/Llama-2-7B-Chat-GGUF llama-2-7b-chat.Q4_K_M.gguf \
  --local-dir /opt/llama-models
```

Or copy your own GGUF file to `/opt/llama-models/model.gguf`.
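After downloading, it can be worth confirming the file is actually a GGUF model and not a truncated or HTML error response: valid GGUF files begin with the 4-byte magic `b"GGUF"`. A small sketch (the path shown is an example):

```python
# Sanity-check that a downloaded file really is a GGUF model.
# Valid GGUF files start with the 4-byte ASCII magic b"GGUF".
def is_gguf(path: str) -> bool:
    with open(path, "rb") as f:
        return f.read(4) == b"GGUF"

# Usage (example path):
# print(is_gguf("/opt/llama-models/model.gguf"))
```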
### Step 4: Start the server

Run the server in the foreground to verify it works:

```bash
python3 -m llama_cpp.server \
  --model /opt/llama-models/model.gguf \
  --n_gpu_layers -1 \
  --port 8080
```

Press Ctrl+C to stop.
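Large models can take a while to load before the server starts answering. A minimal readiness poll against the OpenAI-compatible `/v1/models` endpoint, run from a second terminal, is one way to check; this is a sketch using only the standard library, and the host/port are assumptions matching the command above:

```python
# Poll the server's /v1/models endpoint until it responds with HTTP 200,
# or give up after `timeout` seconds. Useful while the model is loading.
import time
import urllib.request

def wait_for_server(base_url: str = "http://localhost:8080",
                    timeout: float = 60.0) -> bool:
    deadline = time.time() + timeout
    while time.time() < deadline:
        try:
            with urllib.request.urlopen(f"{base_url}/v1/models", timeout=5) as resp:
                return resp.status == 200
        except OSError:
            time.sleep(1)  # not up yet; retry
    return False
```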
### Step 5: Run as a background service

To keep the server running after you close your SSH session, create a systemd service:
```bash
sudo tee /etc/systemd/system/llama-cpp.service > /dev/null << 'EOF'
[Unit]
Description=llama.cpp Inference Server
After=network.target

[Service]
Type=simple
ExecStart=/usr/bin/python3 -m llama_cpp.server \
  --model /opt/llama-models/model.gguf \
  --n_gpu_layers -1 \
  --port 8080
Restart=on-failure
RestartSec=10

[Install]
WantedBy=multi-user.target
EOF

sudo systemctl daemon-reload
sudo systemctl enable llama-cpp
sudo systemctl start llama-cpp
```

Replace `/opt/llama-models/model.gguf` with your downloaded GGUF file path. `--n_gpu_layers -1` offloads all layers to the GPU; set a smaller number for mixed CPU+GPU inference.
## Accessing the server

### SSH tunnel

```bash
ssh -L 8080:localhost:8080 <user>@<ipAddress>
```

### Test completion
```bash
curl http://localhost:8080/v1/completions \
  -H "Content-Type: application/json" \
  -d '{
    "prompt": "Explain GGUF quantization briefly.",
    "max_tokens": 100
  }'
```

### Usage example
```python
from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:8080/v1",
    api_key="not-needed",
)

response = client.chat.completions.create(
    model="model",
    messages=[{"role": "user", "content": "What is GGUF quantization?"}],
)
print(response.choices[0].message.content)
```

## GPU layer offload
| `--n_gpu_layers` | Behavior |
|---|---|
| -1 | All layers on GPU (fastest) |
| 0 | CPU only (no GPU) |
| 20 | First 20 layers on GPU, rest on CPU |
Use partial offload when VRAM is insufficient for the full model.
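Picking a partial offload value can be reduced to simple arithmetic: assume the model's layers are roughly equal in size, reserve some VRAM headroom for the KV cache, and offload as many whole layers as fit. The sketch below is a back-of-the-envelope estimate only; the headroom value and equal-layer assumption are simplifications.

```python
# Estimate a --n_gpu_layers value when the full model does not fit in VRAM.
# Assumes layers are roughly equal in size and reserves headroom for KV cache.
def layers_that_fit(model_gb: float, n_layers: int, vram_gb: float,
                    headroom_gb: float = 1.5) -> int:
    per_layer_gb = model_gb / n_layers
    usable = max(vram_gb - headroom_gb, 0.0)
    return min(n_layers, int(usable / per_layer_gb))

# e.g. a ~20GB model with 60 layers on an 8GB GPU:
print(layers_that_fit(20.0, 60, 8.0))
```

Start from the estimate, watch the server logs for out-of-memory errors, and adjust downward if loading fails.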
## Check server logs

```bash
journalctl -u llama-cpp -f
```

## Cloud-init startup script (optional)
If your provider supports cloud-init, you can paste this into the Startup Script field when deploying to automate the setup above.
```yaml
#cloud-config
runcmd:
  - apt-get update -y
  - apt-get install -y python3-pip cmake build-essential
  - pip install "llama-cpp-python[server]" --extra-index-url https://abetlen.github.io/llama-cpp-python/whl/cu121
  - pip install huggingface_hub
  - mkdir -p /opt/llama-models
  - huggingface-cli download TheBloke/Llama-2-7B-Chat-GGUF llama-2-7b-chat.Q4_K_M.gguf --local-dir /opt/llama-models
  - mv /opt/llama-models/llama-2-7b-chat.Q4_K_M.gguf /opt/llama-models/model.gguf
  - |
    cat > /etc/systemd/system/llama-cpp.service << 'EOF'
    [Unit]
    Description=llama.cpp Inference Server
    After=network.target

    [Service]
    Type=simple
    ExecStart=/usr/bin/python3 -m llama_cpp.server \
      --model /opt/llama-models/model.gguf \
      --n_gpu_layers -1 \
      --port 8080
    Restart=on-failure
    RestartSec=10

    [Install]
    WantedBy=multi-user.target
    EOF
  - systemctl daemon-reload
  - systemctl enable llama-cpp
  - systemctl start llama-cpp
```

The script downloads `llama-2-7b-chat.Q4_K_M.gguf` (~4GB) as `model.gguf` before starting the service. To use a different model, replace the `huggingface-cli download` and `mv` lines with your preferred model download, ensuring the final file is saved to `/opt/llama-models/model.gguf`.
## What's next
- vLLM Inference Server: Higher throughput for production workloads
- Inference Frameworks: Compare all serving stacks
- Networking: SSH tunneling and port access
- Instance Types: Choosing the right GPU for your model