Text Models
Guides for deploying large language models (LLMs) on Spheron GPU instances. All models are served via vLLM's OpenAI-compatible API unless otherwise noted.
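Since every guide below serves its model through vLLM's OpenAI-compatible API, requests share one shape. A minimal sketch of that request body follows; the endpoint URL, port, and model name are placeholders, not values from any specific guide — substitute your own deployment's details.

```python
import json

# vLLM's OpenAI-compatible server listens on port 8000 by default.
# Replace <instance-ip> with your Spheron instance's address.
BASE_URL = "http://<instance-ip>:8000/v1"

def build_chat_request(model: str, prompt: str, max_tokens: int = 256) -> dict:
    """Build a JSON payload for POST {BASE_URL}/chat/completions."""
    return {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": max_tokens,
    }

payload = build_chat_request("meta-llama/Llama-3.1-8B-Instruct", "Hello!")
print(json.dumps(payload, indent=2))
```

The same payload works for every model on this page; only the `model` field changes.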
VRAM requirements
| Model | Parameters | Min VRAM | Recommended GPU |
|---|---|---|---|
| DeepSeek-R1-Distill-7B | 7B | 16GB | RTX 4090 (24GB) |
| Llama 3.1/3.2/3.3 8B | 8B | 16GB | RTX 4090 (24GB) |
| Mistral 7B | 7B | 14GB | RTX 4090 (24GB) |
| Gemma 3 12B | 12B | 24GB | RTX 4090 (24GB) |
| Phi-4 14B | 14B | 24GB | RTX 4090 (24GB) |
| Qwen3-32B | 32B | 64GB | A100 80GB |
| DeepSeek-R1-Distill-32B (INT4) | 32B | 20GB | A100 40GB |
| Llama 3.1 70B | 70B | 140GB | 2× A100 80GB |
| Mixtral 8x7B | ~47B | 90GB | 2× A100 80GB |
| Llama 4 Scout 109B (INT4) | 109B (17B active) | 40GB | H100 80GB |
| DeepSeek-R1 671B (FP8) | 671B | 640GB | 8× H100 80GB |
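The "Min VRAM" column roughly tracks weight size at the serving precision. A back-of-the-envelope estimate is parameters × bytes per parameter, plus headroom for activations and KV cache. The sketch below uses a 1.2× headroom factor, which is a rule-of-thumb assumption, not a vLLM constant:

```python
def estimate_vram_gb(params_billion: float, bits_per_param: int = 16,
                     overhead: float = 1.2) -> float:
    """Rough VRAM estimate: params x bytes/param x headroom.

    The 1.2x overhead factor (activations, KV cache, CUDA context) is an
    assumption for illustration; real usage varies with context length.
    """
    weight_bytes = params_billion * 1e9 * (bits_per_param / 8)
    return round(weight_bytes * overhead / 1e9, 1)

# An 8B model in FP16: 16GB of weights, ~19.2GB with headroom.
print(estimate_vram_gb(8, 16))
# A 32B model in INT4: 16GB of weights, ~19.2GB with headroom,
# consistent with the 20GB row for DeepSeek-R1-Distill-32B (INT4).
print(estimate_vram_gb(32, 4))
```

This is why the table's 16GB minimums pair with a 24GB card: the extra VRAM absorbs KV cache growth at longer contexts.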
Available guides
DeepSeek R1 & V3
DeepSeek's reasoning models, from 7B distillations on RTX 4090 to the full 671B FP8 deployment on 8× H100. Features chain-of-thought reasoning exposed via `<think>` blocks.
Best for: Complex reasoning, math, and code generation tasks.
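Because R1-style models interleave reasoning and answer in one completion, clients usually separate the two. A minimal parsing sketch, assuming the `<think>...</think>` format described above (streamed responses and servers that strip the opening tag need more defensive handling):

```python
import re

def split_reasoning(text: str) -> tuple[str, str]:
    """Split DeepSeek-R1-style output into (reasoning, answer)."""
    match = re.search(r"<think>(.*?)</think>", text, re.DOTALL)
    if not match:
        # No think block found: treat the whole text as the answer.
        return "", text.strip()
    reasoning = match.group(1).strip()
    answer = text[match.end():].strip()
    return reasoning, answer

reasoning, answer = split_reasoning("<think>2+2 is 4.</think>The answer is 4.")
```

Show only `answer` to end users; `reasoning` is useful for debugging and evals.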
Llama 4 Scout & Maverick
Meta's latest multimodal MoE models (Scout: 109B total / 17B active; Maverick: 400B total / 17B active) with long context windows and native image understanding.
Best for: State-of-the-art multimodal reasoning and large context windows.
Llama 3.1 / 3.2 / 3.3
Meta's Llama 3.x family: 8B on a single RTX 4090, 70B on 2× A100, 405B on 8× H100. Strong instruction following and function calling.
Best for: General-purpose chat, instruction following, and function calling.
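Function calling with these models goes through the standard OpenAI `tools` field in the chat request. A hedged sketch of one tool definition — the `get_weather` function and its parameters are hypothetical, invented here for illustration:

```python
# OpenAI-format tool definition, passed as `tools=[get_weather_tool]`
# in the chat completion request body. The function itself is hypothetical.
get_weather_tool = {
    "type": "function",
    "function": {
        "name": "get_weather",
        "description": "Return the current weather for a city.",
        "parameters": {
            "type": "object",
            "properties": {
                "city": {"type": "string", "description": "City name"},
            },
            "required": ["city"],
        },
    },
}
```

The model responds with a `tool_calls` entry naming the function and its JSON arguments; your client executes the call and sends the result back as a `tool` role message.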
Qwen3 Dense & MoE
Alibaba Qwen3 text models with toggleable chain-of-thought reasoning. Dense models from 0.6B to 32B; MoE models up to 235B-A22B.
Best for: Reasoning tasks with chain-of-thought that can be toggled via the /think and /no_think soft switches.
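The soft switches are plain text appended to the user prompt. A small helper sketch, assuming the OpenAI chat message shape — whether the switch is honored depends on the model's chat template, so treat this as illustrative:

```python
def user_message(content: str, thinking: bool = True) -> dict:
    """Build a Qwen3 user message, appending /no_think to disable
    chain-of-thought. Effectiveness depends on the serving stack's
    chat template honoring the soft switch."""
    if not thinking:
        content = f"{content} /no_think"
    return {"role": "user", "content": content}

msg = user_message("Summarize this paragraph.", thinking=False)
```

With thinking disabled, the model skips the reasoning block and answers directly, which cuts latency for simple requests.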
Mistral & Mixtral
Mistral 7B, Mixtral 8x7B MoE, and Mistral Small 3.1 24B with efficient inference and function calling support.
Best for: Efficient inference with strong instruction following and function calling.
Gemma 3
Google DeepMind Gemma 3 in 4B, 12B, and 27B (INT4 option). Available under the Gemma Terms of Use; commercial use is permitted after accepting Google's license agreement.
Best for: Low-latency inference on smaller GPUs; research and commercial projects.
Phi-4 & Phi-4 Multimodal
Microsoft Phi-4 14B small language model and Phi-4-multimodal with image input support. MIT license.
Best for: Efficient SLM inference; multimodal tasks on a single RTX 4090.
What's next
- Inference Frameworks: Choose the right serving stack (vLLM, SGLang, llama.cpp, etc.)
- Multimodal Models: Vision-language model guides
- Specialized Models: OCR, audio generation, and code intelligence
- Instance Types: GPU selection for text models
- Cost Optimization: Reducing inference costs with Spot instances