Text Models
Guides for deploying large language models (LLMs) on Spheron GPU instances. All models are served via vLLM's OpenAI-compatible API unless otherwise noted.
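Since every guide below serves its model through vLLM's OpenAI-compatible API, requests share one shape. A minimal sketch of that request body follows; the endpoint URL, port, and model name are placeholders, not values from any specific guide — substitute your own deployment's details.

```python
import json

# vLLM's OpenAI-compatible server listens on port 8000 by default.
# Replace <instance-ip> with your Spheron instance's address.
BASE_URL = "http://<instance-ip>:8000/v1"

def build_chat_request(model: str, prompt: str, max_tokens: int = 256) -> dict:
    """Build a JSON payload for POST {BASE_URL}/chat/completions."""
    return {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": max_tokens,
    }

payload = build_chat_request("meta-llama/Llama-3.1-8B-Instruct", "Hello!")
print(json.dumps(payload, indent=2))
```

The same payload works for every model on this page; only the `model` field changes.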
VRAM requirements
| Model | Parameters | Min VRAM | Recommended GPU |
|---|---|---|---|
| DeepSeek-R1-Distill-7B | 7B | 16GB | RTX 4090 (24GB) |
| Llama 3.1/3.2/3.3 8B | 8B | 16GB | RTX 4090 (24GB) |
| Mistral 7B | 7B | 14GB | RTX 4090 (24GB) |
| Gemma 3 12B | 12B | 24GB | RTX 4090 (24GB) |
| Phi-4 14B | 14B | 24GB | RTX 4090 (24GB) |
| Qwen3-32B | 32B | 64GB | A100 80GB |
| DeepSeek-R1-Distill-32B (INT4) | 32B | 20GB | A100 40GB |
| Llama 3.1 70B | 70B | 140GB | 2× A100 80GB |
| Mixtral 8x7B | ~47B | 90GB | 2× A100 80GB |
| Llama 4 Scout 109B (INT4) | 109B (17B active) | 40GB | H100 80GB |
| DeepSeek-R1 671B (FP8) | 671B | 640GB | 8× H100 80GB |
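The "Min VRAM" column roughly tracks weight size at the serving precision. A back-of-the-envelope estimate is parameters × bytes per parameter, plus headroom for activations and KV cache. The sketch below uses a 1.2× headroom factor, which is a rule-of-thumb assumption, not a vLLM constant:

```python
def estimate_vram_gb(params_billion: float, bits_per_param: int = 16,
                     overhead: float = 1.2) -> float:
    """Rough VRAM estimate: params x bytes/param x headroom.

    The 1.2x overhead factor (activations, KV cache, CUDA context) is an
    assumption for illustration; real usage varies with context length.
    """
    weight_bytes = params_billion * 1e9 * (bits_per_param / 8)
    return round(weight_bytes * overhead / 1e9, 1)

# An 8B model in FP16: 16GB of weights, ~19.2GB with headroom.
print(estimate_vram_gb(8, 16))
# A 32B model in INT4: 16GB of weights, ~19.2GB with headroom,
# consistent with the 20GB row for DeepSeek-R1-Distill-32B (INT4).
print(estimate_vram_gb(32, 4))
```

This is why the table's 16GB minimums pair with a 24GB card: the extra VRAM absorbs KV cache growth at longer contexts.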
Available guides
DeepSeek R1 & V3
DeepSeek's reasoning models, from 7B distillations on RTX 4090 to the full 671B FP8 deployment on 8× H100. Features chain-of-thought reasoning exposed via `<think>` blocks.
Best for: Complex reasoning, math, and code generation tasks.
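Because R1-style models interleave reasoning and answer in one completion, clients usually separate the two. A minimal parsing sketch, assuming the `<think>...</think>` format described above (streamed responses and servers that strip the opening tag need more defensive handling):

```python
import re

def split_reasoning(text: str) -> tuple[str, str]:
    """Split DeepSeek-R1-style output into (reasoning, answer)."""
    match = re.search(r"<think>(.*?)</think>", text, re.DOTALL)
    if not match:
        # No think block found: treat the whole text as the answer.
        return "", text.strip()
    reasoning = match.group(1).strip()
    answer = text[match.end():].strip()
    return reasoning, answer

reasoning, answer = split_reasoning("<think>2+2 is 4.</think>The answer is 4.")
```

Show only `answer` to end users; `reasoning` is useful for debugging and evals.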
Llama 4 Scout & Maverick
Meta's latest multimodal MoE models (Scout: 109B total / 17B active; Maverick: 400B total / 17B active) with long context windows and native image understanding.
Best for: State-of-the-art multimodal reasoning and large context windows.
Llama 3.1 / 3.2 / 3.3
Meta's Llama 3.x family: 8B on a single RTX 4090, 70B on 2× A100, 405B on 8× H100. Strong instruction following and function calling.
Best for: General-purpose chat, instruction following, and function calling.
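Function calling with these models goes through the standard OpenAI `tools` field in the chat request. A hedged sketch of one tool definition — the `get_weather` function and its parameters are hypothetical, invented here for illustration:

```python
# OpenAI-format tool definition, passed as `tools=[get_weather_tool]`
# in the chat completion request body. The function itself is hypothetical.
get_weather_tool = {
    "type": "function",
    "function": {
        "name": "get_weather",
        "description": "Return the current weather for a city.",
        "parameters": {
            "type": "object",
            "properties": {
                "city": {"type": "string", "description": "City name"},
            },
            "required": ["city"],
        },
    },
}
```

The model responds with a `tool_calls` entry naming the function and its JSON arguments; your client executes the call and sends the result back as a `tool` role message.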
Qwen3 Dense & MoE
Alibaba Qwen3 text models with toggleable chain-of-thought reasoning. Dense models from 0.6B to 32B; MoE models up to 235B-A22B.
Best for: Reasoning tasks with chain-of-thought that can be toggled via the /think and /no_think soft switches.
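The soft switches are plain text appended to the user prompt. A small helper sketch, assuming the OpenAI chat message shape — whether the switch is honored depends on the model's chat template, so treat this as illustrative:

```python
def user_message(content: str, thinking: bool = True) -> dict:
    """Build a Qwen3 user message, appending /no_think to disable
    chain-of-thought. Effectiveness depends on the serving stack's
    chat template honoring the soft switch."""
    if not thinking:
        content = f"{content} /no_think"
    return {"role": "user", "content": content}

msg = user_message("Summarize this paragraph.", thinking=False)
```

With thinking disabled, the model skips the reasoning block and answers directly, which cuts latency for simple requests.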
Mistral & Mixtral
Mistral 7B, Mixtral 8x7B MoE, and Mistral Small 3.1 24B with efficient inference and function calling support.
Best for: Efficient inference with strong instruction following and function calling.
Gemma 3
Google DeepMind Gemma 3 in 4B, 12B, and 27B (INT4 option). Available under the Gemma Terms of Use; commercial use is permitted after accepting Google's license agreement.
Best for: Low-latency inference on smaller GPUs; research and commercial projects.
Phi-4 & Phi-4 Multimodal
Microsoft Phi-4 14B small language model and Phi-4-multimodal with image input support. MIT license.
Best for: Efficient SLM inference; multimodal tasks on a single RTX 4090.
What's next
- Inference Frameworks: Choose the right serving stack (vLLM, SGLang, llama.cpp, etc.)
- Multimodal Models: Vision-language model guides
- Specialized Models: OCR, audio generation, and code intelligence
- Instance Types: GPU selection for text models
- Cost Optimization: Reducing inference costs with Spot instances