# LocalAI
Deploy LocalAI on a Spheron GPU instance. LocalAI is an OpenAI-compatible drop-in replacement that supports LLMs, Whisper speech-to-text, and Stable Diffusion image generation through a single Docker container with NVIDIA GPU passthrough.
## Recommended hardware
| Workload | Recommended GPU | Instance Type | Notes |
|---|---|---|---|
| LLM inference (7B Q4) | RTX 4090 (24GB) | Dedicated or Spot | ~4GB VRAM |
| LLM + image generation | A100 40GB | Dedicated | Separate VRAM budgets |
| Whisper only | Any GPU | Spot | CPU-capable, GPU accelerated |
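The VRAM figures above follow a simple rule of thumb: a Q4-quantized GGUF model weighs roughly half a byte per parameter, plus overhead for the KV cache and runtime buffers. A rough sketch of that estimate (the 20% overhead factor is an assumption, not a measured value):

```python
def estimate_q4_vram_gb(params_billion: float, overhead: float = 0.2) -> float:
    """Rough VRAM estimate for a Q4-quantized GGUF model.

    Q4 quantization stores ~0.5 bytes per parameter; `overhead` (assumed
    20%) accounts for the KV cache and runtime buffers.
    """
    weights_gb = params_billion * 0.5  # 1e9 params * 0.5 bytes = 0.5 GB per billion
    return weights_gb * (1 + overhead)

print(f"7B Q4: ~{estimate_q4_vram_gb(7):.1f} GB")  # roughly matches the table's ~4GB
```

Actual usage varies with context length and backend, so treat this as a sizing aid rather than a guarantee.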
## Supported model types
| Type | Example Models | Notes |
|---|---|---|
| LLMs (GGUF) | Llama, Mistral, Qwen | Download GGUF files to /models |
| Speech-to-text | Whisper base/small/large | Auto-downloaded |
| Image generation | Stable Diffusion 1.5, SDXL | Requires 8–16GB VRAM |
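Each model type above is served through the corresponding OpenAI-compatible endpoint, which is what makes LocalAI a drop-in replacement. A small lookup sketch (the paths follow the OpenAI API; whether a given model is wired to an endpoint depends on your LocalAI configuration):

```python
# OpenAI-compatible endpoint paths for each model type LocalAI serves.
ENDPOINTS = {
    "llm": "/v1/chat/completions",                 # GGUF chat models
    "speech_to_text": "/v1/audio/transcriptions",  # Whisper
    "image": "/v1/images/generations",             # Stable Diffusion
}

def endpoint_for(model_type: str, base_url: str = "http://localhost:8080") -> str:
    """Return the full URL to call for a given model type."""
    try:
        return base_url + ENDPOINTS[model_type]
    except KeyError:
        raise ValueError(f"unknown model type: {model_type}") from None

print(endpoint_for("speech_to_text"))
# http://localhost:8080/v1/audio/transcriptions
```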
## Manual setup
Use these steps to set up the server manually after SSH-ing into your instance. This works on any provider regardless of cloud-init support.
### Step 1: Connect to your instance
```bash
ssh <user>@<ipAddress>
```

Replace `<user>` with the username shown in the instance details panel in the dashboard (e.g., `ubuntu` for Spheron AI instances) and `<ipAddress>` with your instance's public IP.
### Step 2: Install Docker and NVIDIA container toolkit
```bash
sudo apt-get update -y
sudo apt-get install -y docker.io nvidia-container-toolkit
sudo nvidia-ctk runtime configure --runtime=docker
sudo systemctl restart docker
```

### Step 3: Create the models directory
```bash
sudo mkdir -p /opt/localai/models
```

Place GGUF model files in `/opt/localai/models/` before or after startup. LocalAI auto-discovers any `.gguf` file placed in the models directory.
### Step 4: Start the LocalAI container
```bash
docker run -d \
  --gpus all \
  --name localai \
  -p 8080:8080 \
  -v /opt/localai/models:/models \
  -e DEBUG=true \
  quay.io/go-skynet/local-ai:latest-gpu-nvidia-cuda-12
```

Verify it is running:
```bash
docker ps
docker logs -f localai
```

## Accessing the server
### SSH tunnel
```bash
ssh -L 8080:localhost:8080 <user>@<ipAddress>
```

### List available models
```bash
curl http://localhost:8080/v1/models
```

### Usage example
```python
from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:8080/v1",
    api_key="not-needed",
)
response = client.chat.completions.create(
    model="my-model.gguf",  # filename of the GGUF file in /models
    messages=[{"role": "user", "content": "Hello from LocalAI!"}],
)
print(response.choices[0].message.content)
```

### Check container logs
```bash
docker logs -f localai
```

## Cloud-init startup script (optional)
If your provider supports cloud-init, you can paste this into the Startup Script field when deploying to automate the setup above.
```yaml
#cloud-config
runcmd:
  - apt-get update -y
  - apt-get install -y docker.io nvidia-container-toolkit
  - nvidia-ctk runtime configure --runtime=docker
  - systemctl restart docker
  - mkdir -p /opt/localai/models
  - |
    docker run -d \
      --gpus all \
      --name localai \
      -p 8080:8080 \
      -v /opt/localai/models:/models \
      -e DEBUG=true \
      quay.io/go-skynet/local-ai:latest-gpu-nvidia-cuda-12
```

Place GGUF model files in `/opt/localai/models/` before or after startup.
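On first boot, cloud-init has to install packages and pull a multi-gigabyte container image, so the API will not be reachable for several minutes. A small helper to poll the port before sending the first request (host and port are whatever you exposed above):

```python
import socket
import time

def wait_for_port(host: str, port: int, timeout: float = 600.0) -> bool:
    """Poll until a TCP connection to host:port succeeds or the timeout elapses."""
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        try:
            with socket.create_connection((host, port), timeout=5):
                return True
        except OSError:
            time.sleep(2)
    return False

# Example: block until the container is reachable, then start making requests.
# wait_for_port("localhost", 8080, timeout=600)
```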
## What's next
- vLLM Inference Server: Higher throughput for production API workloads
- Inference Frameworks: Compare all serving stacks
- Networking: SSH tunneling and port access
- Templates & Images: Additional startup script templates