# LLM & AI Guides
Guides for running language model and AI inference workloads on Spheron GPU instances, from interactive chat interfaces to high-throughput OpenAI-compatible API servers.
## Choosing the right instance for inference
| Workload | Recommended Type | Why |
|---|---|---|
| Interactive chat, testing | Spot (RTX 4090) | Cost-effective for low-traffic usage |
| Production API (7B–13B) | Dedicated (H100 80GB) | Consistent latency, single-GPU throughput |
| Large models (30B+) | Dedicated (2× A100 80GB) | Multi-GPU tensor parallelism |
| 70B+ models | Cluster (H100 NVLink) | NVLink bandwidth for maximum throughput |
Use Spot instances for experiments and development; switch to Dedicated for production traffic.
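A rough way to pick a row from the table above is to estimate the VRAM a model needs. The sketch below uses a common rule of thumb (weights = parameters × bytes per parameter, plus ~20% overhead for KV cache and activations); the exact overhead depends on context length and batch size, so treat the numbers as a starting point, not a guarantee.

```python
def estimate_vram_gb(params_billion: float, bytes_per_param: float = 2.0,
                     overhead: float = 1.2) -> float:
    """Rough VRAM estimate: weight memory (params x bytes per param)
    plus ~20% overhead for KV cache and activations. A heuristic only."""
    return params_billion * bytes_per_param * overhead

# A 13B model in FP16 (~2 bytes/param) needs roughly 31 GB:
# fits a single H100 80GB with headroom.
print(round(estimate_vram_gb(13), 1))

# A 70B model in FP16 needs roughly 168 GB: multi-GPU territory.
print(round(estimate_vram_gb(70), 1))
```

Quantization lowers `bytes_per_param` (e.g. ~0.5 for 4-bit), which is why quantized 70B models can fit on far fewer GPUs.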
## Inference frameworks
Choose the right serving stack for your use case. See the Inference Frameworks index for a comparison.
### vLLM Inference Server
OpenAI-compatible inference server using vLLM on H100 or A100. Includes a systemd service for persistence across reboots, SSH tunnel access, and performance-tuning flags (`--tensor-parallel-size`, `--dtype`, `--max-model-len`).
Best for: Production API workloads; drop-in replacement for the OpenAI API.
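Because the server speaks the OpenAI chat-completions protocol, any OpenAI client can talk to it by pointing at the instance. The sketch below builds the request payload; the model name and `localhost:8000` endpoint are placeholder assumptions (vLLM's default port), not values from this guide.

```python
import json

# Assumes a vLLM server on its default port; adjust for your SSH tunnel.
BASE_URL = "http://localhost:8000/v1/chat/completions"

def build_chat_request(prompt: str,
                       model: str = "meta-llama/Llama-3.1-8B-Instruct",
                       max_tokens: int = 256,
                       temperature: float = 0.7) -> dict:
    """Build an OpenAI-style chat-completions payload for the vLLM server."""
    return {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": max_tokens,
        "temperature": temperature,
    }

payload = build_chat_request("Explain tensor parallelism in one sentence.")
print(json.dumps(payload, indent=2))
```

POST this JSON to `BASE_URL` (e.g. with `requests` or the `openai` SDK with `base_url` overridden) and the response follows the standard OpenAI schema.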
### Ollama + Open WebUI
Browser-based chat interface backed by Ollama on an RTX 4090. Docker Compose setup with NVIDIA GPU passthrough; pull any model with a single command.
Best for: Interactive local model usage; exploring models without writing code.
### SGLang
Agentic LLM serving with RadixAttention for KV cache reuse, constrained decoding, and OpenAI-compatible API.
Best for: Agentic pipelines, multi-turn workloads, and structured output generation.
### TensorRT-LLM + Triton
NVIDIA-optimized engine compilation via TensorRT-LLM with Triton Inference Server for production-grade serving.
Best for: Maximum throughput on NVIDIA GPUs; production deployments requiring low latency.
### llama.cpp Server
Lightweight, portable server for GGUF models with CPU/GPU layer offload.
Best for: Running quantized models on consumer GPUs; mixed CPU/GPU inference.
### LMDeploy
LMDeploy TurboMind inference toolkit with AWQ quantization support.
Best for: Memory-efficient deployment with AWQ-quantized models on A100/H100.
### LocalAI
OpenAI-compatible drop-in replacement via Docker with support for LLMs, Whisper, and Stable Diffusion.
Best for: Multi-modal local inference with a single OpenAI-compatible endpoint.
## Text models
### DeepSeek R1 & V3
DeepSeek reasoning models, from 7B distillations to the full 671B FP8 multi-GPU deployment. Includes `<think>` reasoning-block parsing.
Best for: Complex reasoning tasks, math, and code generation.
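These reasoning models emit their chain of thought inside a `<think>…</think>` block before the final answer, so applications usually split the two. A minimal sketch of that parsing, assuming the standard `<think>` tag convention:

```python
import re

# Matches a DeepSeek-style reasoning block and any whitespace after it.
THINK_RE = re.compile(r"<think>(.*?)</think>\s*", re.DOTALL)

def split_reasoning(text: str) -> tuple[str, str]:
    """Return (reasoning, answer); reasoning is '' if no <think> block."""
    m = THINK_RE.search(text)
    if not m:
        return "", text.strip()
    answer = THINK_RE.sub("", text, count=1).strip()
    return m.group(1).strip(), answer

raw = "<think>2 + 2 is basic arithmetic.</think>The answer is 4."
reasoning, answer = split_reasoning(raw)
print(reasoning)  # 2 + 2 is basic arithmetic.
print(answer)     # The answer is 4.
```

In production you would typically log or hide the reasoning and show only the answer to users.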
### Llama 4 Scout & Maverick
Meta's latest multimodal MoE models (Scout 109B and Maverick 400B) with long-context and image understanding.
Best for: State-of-the-art multimodal reasoning with large context windows.
### Llama 3.1 / 3.2 / 3.3
The Meta Llama 3.x family: 8B on an RTX 4090, 70B on 2× A100, and 405B on 8× H100 with tensor parallelism and function calling.
Best for: General-purpose chat, instruction following, and function calling.
### Qwen3 Dense & MoE
Qwen3 text models with a thinking-mode toggle, from 7B dense through the 235B-A22B MoE on multi-GPU.
Best for: Reasoning tasks with controllable chain-of-thought via `/think` and `/no_think` tokens.
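Per the soft-switch convention mentioned above, the toggle is just a token appended to the user message. A hypothetical helper (the function name is ours; only the `/think` and `/no_think` switches come from the model's documented behavior):

```python
def with_thinking(prompt: str, think: bool = True) -> str:
    """Append the Qwen3 soft switch that enables or disables
    chain-of-thought for this turn."""
    switch = "/think" if think else "/no_think"
    return f"{prompt} {switch}"

# Deep reasoning for a hard question, fast path for a trivial one.
print(with_thinking("Prove that sqrt(2) is irrational."))
print(with_thinking("What is the capital of France?", think=False))
```

The switch applies per turn, so a chat application can decide dynamically whether each request is worth the extra reasoning tokens.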
### Mistral & Mixtral
Mistral 7B, Mixtral 8x7B MoE, and Mistral Small 3.1 24B with function calling support.
Best for: Efficient inference with strong instruction following and function calling.
### Gemma 3
Google DeepMind Gemma 3 in 4B, 12B, and 27B (INT4). Available under the Gemma Terms of Use; commercial use is permitted after accepting Google's license agreement on HuggingFace.
Best for: Low-latency inference on smaller GPUs; research and commercial projects.
### Phi-4 & Phi-4 Multimodal
Microsoft Phi-4 14B SLM and Phi-4-multimodal with image input support. MIT license.
Best for: Efficient SLM inference; multimodal tasks on a single RTX 4090.
## Multimodal models
### Qwen3-Omni-30B-A3B
Multimodal language model with 30B parameters supporting text, audio, images, and video inputs. 32K context window (single GPU) on A100/H100.
Best for: Multimodal tasks requiring audio, vision, and text processing in a single model.
### Qwen3-VL 4B & 8B
Vision-language models available in 4B and 8B parameter variants. 256K context, multimodal reasoning, and GUI automation capabilities on RTX 4090 or A100.
Best for: Image understanding, visual reasoning, and GUI automation tasks.
### InternVL3
InternVL3 vision-language model series, 8B through 78B, deployed via vLLM.
Best for: High-accuracy visual question answering and multimodal reasoning.
### LLaVA-NeXT
LLaVA-NeXT vision-language model (7B/13B) with improved visual reasoning, deployed via vLLM.
Best for: Image-to-text tasks; accessible VLM on RTX 4090 or A100.
### Pixtral-12B
Mistral's Pixtral-12B multimodal model, deployed via vLLM on RTX 4090 (24GB).
Best for: Compact multimodal inference with Mistral-quality text generation.
### Baidu ERNIE-4.5-VL-28B-A3B
Advanced vision-language model from Baidu with 28B active parameters (MoE architecture). Strong visual reasoning and STEM task performance on RTX 4090 or A6000.
Best for: Visual reasoning, multimodal understanding, and STEM-domain tasks.
## Specialized models
### Chandra OCR
Specialized OCR model for document processing with 83.1% accuracy, outperforming GPT-4o on document tasks. Supports vLLM deployment for high-throughput document pipelines.
Best for: Document digitization, text extraction, and OCR pipelines.
### SoulX-Podcast-1.7B
Multi-speaker podcast generation model (1.7B parameters). Generates dialogues of 60+ minutes with speaker switching, zero-shot voice cloning, and paralinguistic cues.
Best for: Audio content generation, podcast production, and voice synthesis.
### JanusCoderV-8B
8B multimodal code-intelligence model. Generates HTML/CSS/React from screenshots, charts, and mockups. Trained on JANUSCODE-800K, the largest multimodal code dataset.
Best for: Visual-to-code translation, layout bug fixing, and UI mockup generation.
## What's next
- Instance Types: Spot vs Dedicated vs Cluster
- Networking: SSH tunneling and port access
- Cost Optimization: Reducing inference costs with Spot instances
- Templates & Images: Copy-ready startup scripts