Text Models

Guides for deploying large language models (LLMs) on Spheron GPU instances. All models are served via vLLM's OpenAI-compatible API unless otherwise noted.
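Once an instance is running, any OpenAI-compatible client can talk to it. A minimal sketch of the request shape, assuming vLLM's default port (8000) and a placeholder model id — substitute the endpoint and model from your own deployment:

```python
import json

# Build a chat completion request for a vLLM server's OpenAI-compatible
# endpoint. BASE_URL and the model id are placeholders, not values from
# any specific Spheron deployment.
BASE_URL = "http://localhost:8000/v1"  # assumed default vLLM port

payload = {
    "model": "meta-llama/Llama-3.1-8B-Instruct",  # example model id
    "messages": [
        {"role": "user", "content": "Hello!"},
    ],
    "max_tokens": 128,
}
body = json.dumps(payload)

# To actually send it (requires the `requests` package and a running server):
# import requests
# resp = requests.post(f"{BASE_URL}/chat/completions",
#                      headers={"Content-Type": "application/json"},
#                      data=body)
# print(resp.json()["choices"][0]["message"]["content"])
```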

VRAM requirements

| Model | Parameters | Min VRAM | Recommended GPU |
| --- | --- | --- | --- |
| DeepSeek-R1-Distill-7B | 7B | 16GB | RTX 4090 (24GB) |
| Llama 3.1/3.2/3.3 8B | 8B | 16GB | RTX 4090 (24GB) |
| Mistral 7B | 7B | 14GB | RTX 4090 (24GB) |
| Gemma 3 12B | 12B | 24GB | RTX 4090 (24GB) |
| Phi-4 14B | 14B | 24GB | RTX 4090 (24GB) |
| Qwen3-32B | 32B | 64GB | A100 80GB |
| DeepSeek-R1-Distill-32B (INT4) | 32B | 20GB | A100 40GB |
| Llama 3.1 70B | 70B | 140GB | 2× A100 80GB |
| Mixtral 8x7B | ~47B | 90GB | 2× A100 80GB |
| Llama 4 Scout 109B (INT4) | 17B active | 40GB | H100 80GB |
| DeepSeek-R1 671B (FP8) | 671B | 640GB | 8× H100 80GB |
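As a rule of thumb, the Min VRAM figures above track weight size — parameters × bytes per parameter — plus headroom for the KV cache and activations. A rough back-of-the-envelope helper (a sketch for sanity-checking, not a sizing tool):

```python
def estimate_weight_vram_gb(params_billion: float, bits_per_param: int = 16) -> float:
    """Rough VRAM needed just for the model weights.

    Real deployments also need room for the KV cache and activations,
    so treat this as a lower bound, not a fit guarantee.
    """
    bytes_per_param = bits_per_param / 8
    # 1B params at 1 byte/param is ~1 GB of weights.
    return params_billion * bytes_per_param

# 7B model in FP16 -> ~14 GB of weights, matching the Mistral 7B row.
print(estimate_weight_vram_gb(7))       # 14.0
# 32B model in INT4 -> ~16 GB of weights; the 20GB row above includes overhead.
print(estimate_weight_vram_gb(32, 4))   # 16.0
```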

Available guides

DeepSeek R1 & V3

DeepSeek's reasoning models, from 7B distillations that run on an RTX 4090 to the full 671B FP8 deployment on 8× H100. Chain-of-thought reasoning is exposed via <think> blocks.

Best for: Complex reasoning, math, and code generation tasks.
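The reasoning arrives inline in the completion text, so clients typically separate the <think> block from the final answer before display. A minimal sketch of that parsing (the sample strings are illustrative):

```python
import re

def split_reasoning(text: str) -> tuple[str, str]:
    """Separate a DeepSeek-R1-style <think>...</think> block from the answer.

    Returns (reasoning, answer); reasoning is empty if no block is present.
    """
    m = re.search(r"<think>(.*?)</think>", text, flags=re.DOTALL)
    if not m:
        return "", text.strip()
    reasoning = m.group(1).strip()
    answer = text[m.end():].strip()
    return reasoning, answer

# Illustrative completion text, not real model output.
reasoning, answer = split_reasoning("<think>2+2 is 4.</think>The answer is 4.")
```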

Llama 4 Scout & Maverick

Meta's latest multimodal MoE models (Scout at 109B total / 17B active, Maverick at 400B total / 17B active) with long context windows and native image understanding.

Best for: State-of-the-art multimodal reasoning and large context windows.

Llama 3.1 / 3.2 / 3.3

Meta Llama 3 family: 8B on a single RTX 4090, 70B on 2× A100, 405B on 8× H100. Strong instruction following and function calling.

Best for: General-purpose chat, instruction following, and function calling.
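Function calling with these models uses the standard OpenAI `tools` schema on the same chat endpoint. A sketch of such a request payload, with one hypothetical tool — `get_weather` and the model id are placeholders:

```python
import json

# A `tools` array in the OpenAI function-calling format. The function
# name and parameters are illustrative only.
tools = [{
    "type": "function",
    "function": {
        "name": "get_weather",  # hypothetical tool
        "description": "Get the current weather for a city.",
        "parameters": {
            "type": "object",
            "properties": {"city": {"type": "string"}},
            "required": ["city"],
        },
    },
}]

payload = {
    "model": "meta-llama/Llama-3.1-8B-Instruct",  # example model id
    "messages": [{"role": "user", "content": "What's the weather in Paris?"}],
    "tools": tools,
}
body = json.dumps(payload)
# POST this to /v1/chat/completions; a tool-capable model responds with a
# `tool_calls` entry naming the function and its JSON-encoded arguments.
```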

Qwen3 Dense & MoE

Alibaba's Qwen3 text models with toggleable chain-of-thought reasoning. Dense models up to 32B; MoE models up to 235B-A22B (235B total / 22B active).

Best for: Reasoning tasks with controllable chain-of-thought via /think and /no_think tokens.
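The soft switch is just a token appended to the user message. A minimal sketch of building the message list either way (the helper name is ours):

```python
def qwen3_messages(user_prompt: str, thinking: bool) -> list[dict]:
    """Build a chat message list with Qwen3's soft switch appended.

    `/think` enables chain-of-thought for this turn; `/no_think` disables it.
    """
    suffix = " /think" if thinking else " /no_think"
    return [{"role": "user", "content": user_prompt + suffix}]

# Disable reasoning for a quick factual turn.
msgs = qwen3_messages("Explain transformers briefly.", thinking=False)
```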

Mistral & Mixtral

Mistral 7B, Mixtral 8x7B MoE, and Mistral Small 3.1 24B with efficient inference and function calling support.

Best for: Efficient inference with strong instruction following and function calling.

Gemma 3

Google DeepMind Gemma 3 in 4B, 12B, and 27B (INT4 option). Available under the Gemma Terms of Use; commercial use is permitted after accepting Google's license agreement.

Best for: Low-latency inference on smaller GPUs; research and commercial projects.

Phi-4 & Phi-4 Multimodal

Microsoft Phi-4 14B small language model and Phi-4-multimodal with image input support. MIT license.

Best for: Efficient SLM inference; multimodal tasks on a single RTX 4090.

What's next