LLM & AI Guides

Guides for running language model and AI inference workloads on Spheron GPU instances, from interactive chat interfaces to high-throughput OpenAI-compatible API servers.

Choosing the right instance for inference

| Workload | Recommended Type | Why |
| --- | --- | --- |
| Interactive chat, testing | Spot (RTX 4090) | Cost-effective for low-traffic usage |
| Production API (7B–13B) | Dedicated (H100 80GB) | Consistent latency, single-GPU throughput |
| Large models (30B+) | Dedicated (2× A100 80GB) | Multi-GPU tensor parallelism |
| 70B+ models | Cluster (H100 NVLink) | NVLink bandwidth for maximum throughput |

Use Spot instances for experiments and development; switch to Dedicated for production traffic.
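A quick way to sanity-check which row of the table you fall into is to estimate weight memory: parameter count × bytes per parameter, plus headroom for KV cache and activations. A minimal sketch, where the 20% overhead factor is an illustrative assumption rather than a measured figure:

```python
# Rough VRAM sizing from parameter count and dtype. The overhead
# factor approximates KV cache / activation headroom and is an
# assumption for illustration, not a benchmarked value.
BYTES_PER_PARAM = {"fp16": 2, "bf16": 2, "int8": 1, "int4": 0.5}

def min_vram_gb(params_billions: float, dtype: str = "fp16",
                overhead: float = 1.2) -> float:
    """Estimate minimum GPU memory (GB) to hold a model's weights."""
    weights_gb = params_billions * BYTES_PER_PARAM[dtype]
    return weights_gb * overhead

# A 13B model in fp16 fits on a single H100 80GB; a 70B model does not,
# which is why the table moves 30B+ models to multi-GPU instances.
print(min_vram_gb(13))           # fits on one 80GB GPU
print(min_vram_gb(70))           # exceeds 80GB -> multi-GPU
print(min_vram_gb(70, "int4"))   # quantized, fits again
```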

Inference frameworks

Choose the right serving stack for your use case. See the Inference Frameworks index for a comparison.

vLLM Inference Server

OpenAI-compatible inference server using vLLM on H100 or A100. Includes a systemd service for persistence across reboots, SSH tunnel access, and performance tuning flags (--tensor-parallel-size, --dtype, --max-model-len).

Best for: Production API workloads; drop-in replacement for the OpenAI API.
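Because the server speaks the OpenAI chat-completions protocol, any OpenAI-style client can talk to it. A minimal sketch of the request shape; the model name and endpoint below are illustrative assumptions, not values from the guide:

```python
import json

# Default vLLM endpoint when serving locally; adjust host/port to
# match your SSH tunnel. Model name is an illustrative assumption.
VLLM_URL = "http://localhost:8000/v1/chat/completions"

def build_chat_request(model: str, prompt: str,
                       max_tokens: int = 256) -> dict:
    """Build an OpenAI-compatible chat completion request body."""
    return {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": max_tokens,
    }

payload = build_chat_request("meta-llama/Llama-3.1-8B-Instruct", "Hello!")
# POST json.dumps(payload) to VLLM_URL with any HTTP client.
print(json.dumps(payload, indent=2))
```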

Ollama + Open WebUI

Browser-based chat interface backed by Ollama on an RTX 4090. Docker Compose setup with NVIDIA GPU passthrough; pull any model with a single command.

Best for: Interactive local model usage; exploring models without writing code.
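A minimal Docker Compose sketch of this pairing; image tags, ports, and volume names are common upstream defaults, not Spheron-specific values:

```yaml
# Sketch of an Ollama + Open WebUI stack with NVIDIA GPU passthrough.
# Tags and ports are illustrative defaults; see the full guide for
# the exact configuration.
services:
  ollama:
    image: ollama/ollama
    volumes:
      - ollama:/root/.ollama
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              count: all
              capabilities: [gpu]
  open-webui:
    image: ghcr.io/open-webui/open-webui:main
    ports:
      - "3000:8080"
    environment:
      - OLLAMA_BASE_URL=http://ollama:11434
    depends_on:
      - ollama
volumes:
  ollama:
```

With the stack up, `docker exec ollama ollama pull <model>` fetches any model from the Ollama library.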

SGLang

Agentic LLM serving with RadixAttention for KV cache reuse, constrained decoding, and OpenAI-compatible API.

Best for: Agentic pipelines, multi-turn workloads, and structured output generation.

TensorRT-LLM + Triton

NVIDIA-optimized engine compilation via TensorRT-LLM with Triton Inference Server for production-grade serving.

Best for: Maximum throughput on NVIDIA GPUs; production deployments requiring low latency.

llama.cpp Server

Lightweight, portable GGUF model serving with CPU+GPU layer offload.

Best for: Running quantized models on consumer GPUs; mixed CPU/GPU inference.

LMDeploy

Inference toolkit built on the TurboMind engine, with AWQ quantization support.

Best for: Memory-efficient deployment with AWQ-quantized models on A100/H100.

LocalAI

OpenAI-compatible drop-in replacement via Docker with support for LLMs, Whisper, and Stable Diffusion.

Best for: Multi-modal local inference with a single OpenAI-compatible endpoint.

Text models

DeepSeek R1 & V3

DeepSeek reasoning models, from 7B distillations to the full 671B FP8 multi-GPU deployment. Includes <think> reasoning block parsing.

Best for: Complex reasoning tasks, math, and code generation.
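The reasoning block can be separated from the final answer with plain string handling. A minimal sketch, following R1's `<think>`/`</think>` tag convention:

```python
def split_reasoning(text: str) -> tuple[str, str]:
    """Split a DeepSeek R1 completion into (reasoning, answer).

    Returns an empty reasoning string when no <think> block is present.
    """
    start, end = text.find("<think>"), text.find("</think>")
    if start == -1 or end == -1:
        return "", text.strip()
    reasoning = text[start + len("<think>"):end].strip()
    answer = text[end + len("</think>"):].strip()
    return reasoning, answer

r, a = split_reasoning("<think>2+2 is 4</think>The answer is 4.")
# r == "2+2 is 4", a == "The answer is 4."
```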

Llama 4 Scout & Maverick

Meta's latest multimodal MoE models (Scout 109B and Maverick 400B) with long-context and image understanding.

Best for: State-of-the-art multimodal reasoning with large context windows.

Llama 3.1 / 3.2 / 3.3

Meta Llama 3 family: 8B on RTX 4090, 70B on 2× A100, 405B on 8× H100 with tensor parallelism and function calling.

Best for: General-purpose chat, instruction following, and function calling.
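Function calling uses the OpenAI-style `tools` array alongside the messages. A hedged sketch; the weather tool below is a made-up example, not part of the guide:

```python
# OpenAI-style tools payload for Llama 3.x function calling.
# The get_weather function is an illustrative placeholder.
def build_tool_request(model: str, prompt: str) -> dict:
    return {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "tools": [{
            "type": "function",
            "function": {
                "name": "get_weather",
                "description": "Look up current weather for a city.",
                "parameters": {
                    "type": "object",
                    "properties": {"city": {"type": "string"}},
                    "required": ["city"],
                },
            },
        }],
    }

req = build_tool_request("meta-llama/Llama-3.3-70B-Instruct",
                         "What's the weather in Oslo?")
```

The model replies with a `tool_calls` entry naming the function and its JSON arguments, which your application executes and feeds back as a `tool` message.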

Qwen3 Dense & MoE

Qwen3 text models with thinking mode toggle, 7B dense through 235B-A22B MoE on multi-GPU.

Best for: Reasoning tasks with controllable chain-of-thought via /think and /no_think tokens.
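The toggle is a soft switch appended to the user turn. A minimal sketch:

```python
def qwen3_message(prompt: str, thinking: bool) -> dict:
    """Append Qwen3's soft-switch token to enable or disable
    chain-of-thought for this turn."""
    suffix = " /think" if thinking else " /no_think"
    return {"role": "user", "content": prompt + suffix}

msg = qwen3_message("Prove that 17 is prime.", thinking=False)
# msg["content"] ends with "/no_think"
```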

Mistral & Mixtral

Mistral 7B, Mixtral 8x7B MoE, and Mistral Small 3.1 24B with function calling support.

Best for: Efficient inference with strong instruction following and function calling.

Gemma 3

Google DeepMind Gemma 3 in 4B, 12B, and 27B (INT4). Available under the Gemma Terms of Use; commercial use is permitted after accepting Google's license agreement on HuggingFace.

Best for: Low-latency inference on smaller GPUs; research and commercial projects.

Phi-4 & Phi-4 Multimodal

Microsoft Phi-4 14B SLM and Phi-4-multimodal with image input support. MIT license.

Best for: Efficient SLM inference; multimodal tasks on a single RTX 4090.

Multimodal models

Qwen3-Omni-30B-A3B

Multimodal language model with 30B parameters supporting text, audio, images, and video inputs. 32K context window (single GPU) on A100/H100.

Best for: Multimodal tasks requiring audio, vision, and text processing in a single model.

Qwen3-VL 4B & 8B

Vision-language models available in 4B and 8B parameter variants. 256K context, multimodal reasoning, and GUI automation capabilities on RTX 4090 or A100.

Best for: Image understanding, visual reasoning, and GUI automation tasks.

InternVL3

InternVL3 vision-language model series, 8B through 78B, deployed via vLLM.

Best for: High-accuracy visual question answering and multimodal reasoning.

LLaVA-Next

LLaVA-NeXT vision-language model (7B/13B) with improved visual reasoning, deployed via vLLM.

Best for: Image-to-text tasks; accessible VLM on RTX 4090 or A100.

Pixtral-12B

Mistral's Pixtral-12B multimodal model, deployed via vLLM on RTX 4090 (24GB).

Best for: Compact multimodal inference with Mistral-quality text generation.

Baidu ERNIE-4.5-VL-28B-A3B

Advanced vision-language model from Baidu with 28B active parameters (MoE architecture). Strong visual reasoning and STEM task performance on RTX 4090 or A6000.

Best for: Visual reasoning, multimodal understanding, and STEM-domain tasks.

Specialized models

Chandra OCR

Specialized OCR model for document processing with 83.1% accuracy, outperforming GPT-4o on document tasks. Supports vLLM deployment for high-throughput document pipelines.

Best for: Document digitization, text extraction, and OCR pipelines.

Soulx Podcast-1.7B

Multi-speaker podcast generation model (1.7B parameters). Generates 60+ minute dialogues with speaker switching, zero-shot voice cloning, and paralinguistics.

Best for: Audio content generation, podcast production, and voice synthesis.

Janus CoderV-8B

8B multimodal code intelligence model. Generates HTML/CSS/React from screenshots, charts, and mockups. Trained on JANUSCODE-800K, the largest multimodal code dataset.

Best for: Visual-to-code translation, layout bug fixing, and UI mockup generation.

What's next