# LocalAI
Deploy LocalAI on a Spheron GPU instance. LocalAI is an OpenAI-compatible drop-in replacement that supports LLMs, Whisper speech-to-text, and Stable Diffusion image generation through a single Docker container with NVIDIA GPU passthrough.
## Recommended hardware
| Workload | Recommended GPU | Instance Type | Notes |
|---|---|---|---|
| LLM inference (7B Q4) | RTX 4090 (24GB) | Dedicated or Spot | ~4GB VRAM |
| LLM + image generation | A100 40GB | Dedicated | Separate VRAM budgets |
| Whisper only | Any GPU | Spot | CPU-capable, GPU accelerated |
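The VRAM figures above follow a simple rule of thumb: a Q4-quantized GGUF model weighs roughly half a byte per parameter, plus overhead for the KV cache and runtime buffers. A rough sketch of that estimate (the 20% overhead factor is an assumption, not a measured value):

```python
def estimate_q4_vram_gb(params_billion: float, overhead: float = 0.2) -> float:
    """Rough VRAM estimate for a Q4-quantized GGUF model.

    Q4 quantization stores ~0.5 bytes per parameter; `overhead` (assumed
    20%) accounts for the KV cache and runtime buffers.
    """
    weights_gb = params_billion * 0.5  # 1e9 params * 0.5 bytes = 0.5 GB per billion
    return weights_gb * (1 + overhead)

print(f"7B Q4: ~{estimate_q4_vram_gb(7):.1f} GB")  # roughly matches the table's ~4GB
```

Actual usage varies with context length and backend, so treat this as a sizing aid rather than a guarantee.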
## Supported model types
| Type | Example Models | Notes |
|---|---|---|
| LLMs (GGUF) | Llama, Mistral, Qwen | Download GGUF files to /models |
| Speech-to-text | Whisper base/small/large | Auto-downloaded |
| Image generation | Stable Diffusion 1.5, SDXL | Requires 8–16GB VRAM |
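Each model type above is served through the corresponding OpenAI-compatible endpoint, which is what makes LocalAI a drop-in replacement. A small lookup sketch (the paths follow the OpenAI API; whether a given model is wired to an endpoint depends on your LocalAI configuration):

```python
# OpenAI-compatible endpoint paths for each model type LocalAI serves.
ENDPOINTS = {
    "llm": "/v1/chat/completions",                 # GGUF chat models
    "speech_to_text": "/v1/audio/transcriptions",  # Whisper
    "image": "/v1/images/generations",             # Stable Diffusion
}

def endpoint_for(model_type: str, base_url: str = "http://localhost:8080") -> str:
    """Return the full URL to call for a given model type."""
    try:
        return base_url + ENDPOINTS[model_type]
    except KeyError:
        raise ValueError(f"unknown model type: {model_type}") from None

print(endpoint_for("speech_to_text"))
# http://localhost:8080/v1/audio/transcriptions
```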
## Manual setup
Use these steps to set up the server manually after SSH-ing into your instance. This works on any provider regardless of cloud-init support.
### Step 1: Connect to your instance
```bash
ssh <user>@<ipAddress>
```

Replace `<user>` with the username shown in the instance details panel in the dashboard (e.g., `ubuntu` for Spheron AI instances) and `<ipAddress>` with your instance's public IP.
### Step 2: Install Docker and NVIDIA container toolkit
```bash
sudo apt-get update -y
sudo apt-get install -y docker.io nvidia-container-toolkit
sudo nvidia-ctk runtime configure --runtime=docker
sudo systemctl restart docker
```

### Step 3: Create the models directory
```bash
sudo mkdir -p /opt/localai/models
```

Place GGUF model files in `/opt/localai/models/` before or after startup. LocalAI auto-discovers any `.gguf` file placed in the models directory.
### Step 4: Start the LocalAI container
```bash
docker run -d \
  --gpus all \
  --name localai \
  -p 8080:8080 \
  -v /opt/localai/models:/models \
  -e DEBUG=true \
  quay.io/go-skynet/local-ai:latest-gpu-nvidia-cuda-12
```

Verify it is running:
```bash
docker ps
docker logs -f localai
```

## Accessing the server
### SSH tunnel
```bash
ssh -L 8080:localhost:8080 <user>@<ipAddress>
```

### List available models
```bash
curl http://localhost:8080/v1/models
```

### Usage example
```python
from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:8080/v1",
    api_key="not-needed",
)
response = client.chat.completions.create(
    model="my-model.gguf",  # filename of the GGUF file in /models
    messages=[{"role": "user", "content": "Hello from LocalAI!"}],
)
print(response.choices[0].message.content)
```

### Check container logs
```bash
docker logs -f localai
```

## Cloud-init startup script (optional)
If your provider supports cloud-init, you can paste this into the Startup Script field when deploying to automate the setup above.
```yaml
#cloud-config
runcmd:
  - apt-get update -y
  - apt-get install -y docker.io nvidia-container-toolkit
  - nvidia-ctk runtime configure --runtime=docker
  - systemctl restart docker
  - mkdir -p /opt/localai/models
  - |
    docker run -d \
      --gpus all \
      --name localai \
      -p 8080:8080 \
      -v /opt/localai/models:/models \
      -e DEBUG=true \
      quay.io/go-skynet/local-ai:latest-gpu-nvidia-cuda-12
```

Place GGUF model files in `/opt/localai/models/` before or after startup.
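On first boot, cloud-init has to install packages and pull a multi-gigabyte container image, so the API will not be reachable for several minutes. A small helper to poll the port before sending the first request (host and port are whatever you exposed above):

```python
import socket
import time

def wait_for_port(host: str, port: int, timeout: float = 600.0) -> bool:
    """Poll until a TCP connection to host:port succeeds or the timeout elapses."""
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        try:
            with socket.create_connection((host, port), timeout=5):
                return True
        except OSError:
            time.sleep(2)
    return False

# Example: block until the container is reachable, then start making requests.
# wait_for_port("localhost", 8080, timeout=600)
```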
## What's next
- vLLM Inference Server: Higher throughput for production API workloads
- Inference Frameworks: Compare all serving stacks
- Networking: SSH tunneling and port access
- Templates & Images: Additional startup script templates