Multimodal Models
Guides for deploying vision-language models (VLMs) on Spheron GPU instances. Every model accepts text and image inputs (Qwen3-Omni additionally handles audio and video), and all are served through vLLM's OpenAI-compatible multimodal API.
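As a minimal sketch of what a multimodal request body looks like against that API: the helper below builds an OpenAI-style chat completion payload with an inline base64 image. The model ID and the placeholder image bytes are hypothetical; substitute the model your instance actually serves, and POST the body to `http://<instance-ip>:8000/v1/chat/completions`.

```python
import base64
import json

def build_vision_request(model: str, prompt: str, image_bytes: bytes,
                         mime: str = "image/png") -> dict:
    """Build an OpenAI-style chat completion body with an inline image.

    vLLM accepts images as public URLs or as base64 data URLs inside an
    `image_url` content part.
    """
    data_url = f"data:{mime};base64,{base64.b64encode(image_bytes).decode()}"
    return {
        "model": model,
        "messages": [{
            "role": "user",
            "content": [
                {"type": "text", "text": prompt},
                {"type": "image_url", "image_url": {"url": data_url}},
            ],
        }],
        "max_tokens": 256,
    }

# Hypothetical model ID and placeholder bytes, for illustration only.
body = build_vision_request("Qwen/Qwen3-VL-8B-Instruct",
                            "Describe this image.", b"\x89PNG...")
print(json.dumps(body)[:60])
```

The same body works with any model on this page, since vLLM exposes them all behind the same endpoint; only the `model` field changes.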
VRAM requirements
| Model | Parameters | Min VRAM | Recommended GPU |
|---|---|---|---|
| Qwen3-VL 4B | 4B | 10GB | RTX 4090 (24GB) |
| Qwen3-VL 8B | 8B | 18GB | RTX 4090 (24GB) |
| LLaVA-NeXT 7B | 7B | 14GB | RTX 4090 (24GB) |
| InternVL3-8B | 8B | 18GB | RTX 4090 (24GB) |
| Pixtral-12B | 12B | 24GB | RTX 4090 (24GB) |
| LLaVA-NeXT 13B | 13B | 28GB | A100 40GB |
| Baidu ERNIE-4.5-VL-28B-A3B | 28B total / 3B active (MoE) | 48GB | A100 80GB |
| Qwen3-Omni-30B-A3B | 30B total / 3B active (MoE) | 40GB | A100 80GB |
| InternVL3-78B | 78B | 80GB × 2 | 2× A100 80GB |
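The "Min VRAM" column roughly tracks weight size plus runtime overhead. A back-of-the-envelope estimator (a hypothetical helper, assuming fp16/bf16 weights at 2 bytes per parameter plus ~15% for KV cache, activations, and CUDA context) can help when sizing an instance for a model not listed above:

```python
def estimate_min_vram_gb(params_billion: float,
                         bytes_per_param: int = 2,   # fp16/bf16 weights
                         overhead: float = 1.15) -> float:
    """Rough minimum-VRAM estimate: weight memory plus ~15% overhead
    for KV cache, activations, and CUDA context.

    A planning heuristic, not a guarantee -- the table's figures also
    reflect per-model memory profiles and quantization options.
    """
    return round(params_billion * bytes_per_param * overhead, 1)

for p in (4, 8, 13):
    print(f"{p}B -> ~{estimate_min_vram_gb(p)} GB")
```

For MoE models, estimate from total parameters, not active ones: all expert weights must reside in VRAM even though only a subset fires per token.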
Available guides
Qwen3-Omni-30B-A3B
Mixture-of-experts multimodal model with 30B total parameters (3B active per token) supporting text, audio, image, and video inputs. 32K context window on a single GPU.
Best for: Multimodal tasks requiring audio, vision, and text processing in a single model.
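For audio input, vLLM extends the OpenAI content-part schema with an `audio_url` part type (support varies by vLLM version and model; check your deployment's docs). A sketch of the message shape, using placeholder bytes rather than a real WAV file:

```python
import base64

def audio_part(wav_bytes: bytes) -> dict:
    """Build an `audio_url` content part with an inline base64 data URL.

    vLLM's extension to the OpenAI schema for audio-capable models;
    confirm the part type against the vLLM version you deploy.
    """
    url = "data:audio/wav;base64," + base64.b64encode(wav_bytes).decode()
    return {"type": "audio_url", "audio_url": {"url": url}}

# Mixed audio + text turn; the bytes here are a placeholder, not a real WAV.
message = {
    "role": "user",
    "content": [
        {"type": "text", "text": "Transcribe and summarize this clip."},
        audio_part(b"RIFF....WAVE"),
    ],
}
```

Image and audio parts can be combined in the same `content` list, which is what makes a single Qwen3-Omni deployment usable for mixed-modality requests.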
Qwen3-VL 4B & 8B
Vision-language models in 4B and 8B variants. 256K context, multimodal reasoning, and GUI automation.
Best for: Image understanding, visual reasoning, and GUI automation tasks.
InternVL3
InternVL3 series (1B–78B) deployed via vLLM. Strong visual question answering and multimodal reasoning.
Best for: High-accuracy visual QA and multimodal reasoning across model scales.
LLaVA-NeXT
LLaVA-NeXT 7B and 13B with improved visual reasoning, served via vLLM.
Best for: Accessible image-to-text inference on RTX 4090 or A100.
Pixtral-12B
Mistral's Pixtral-12B multimodal model on RTX 4090 (24GB) via vLLM.
Best for: Compact multimodal inference with Mistral-quality text generation.
Baidu ERNIE-4.5-VL-28B-A3B
Mixture-of-experts vision-language model from Baidu with 28B total parameters (3B active per token). Strong STEM and visual reasoning performance.
Best for: Visual reasoning, multimodal understanding, and STEM-domain tasks.
What's next
- LLM Inference Overview: Text model and framework guides
- vLLM Inference Server: Serving stack used by most guides here
- Instance Types: GPU selection for VLMs
- Cost Optimization: Reducing inference costs with Spot instances