Multimodal Models

Guides for deploying vision-language models (VLMs) on Spheron GPU instances. All models accept both text and image inputs and are served via vLLM's OpenAI-compatible multimodal API.
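As a sketch of what that API accepts (an illustrative example, not taken from any single guide — the model name and image URL are placeholders), a mixed text-and-image request uses the OpenAI chat-completions "content parts" format:

```python
import json

def build_vision_request(model: str, prompt: str, image_url: str) -> str:
    """Build a JSON body for vLLM's OpenAI-compatible /v1/chat/completions
    endpoint, with one user turn mixing a text part and an image part."""
    body = {
        "model": model,
        "messages": [
            {
                "role": "user",
                "content": [
                    {"type": "text", "text": prompt},
                    {"type": "image_url", "image_url": {"url": image_url}},
                ],
            }
        ],
    }
    return json.dumps(body)

# POST this body to http://<your-instance>:8000/v1/chat/completions
# with the header "Content-Type: application/json".
payload = build_vision_request(
    "Qwen/Qwen3-VL-8B-Instruct",          # placeholder model name
    "Describe this image.",
    "https://example.com/sample.jpg",      # placeholder image URL
)
```

Because the endpoint is OpenAI-compatible, the official `openai` Python client also works against it by pointing `base_url` at the instance.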

VRAM requirements

| Model | Parameters | Min VRAM | Recommended GPU |
| --- | --- | --- | --- |
| Qwen3-VL 4B | 4B | 10GB | RTX 4090 (24GB) |
| Qwen3-VL 8B | 8B | 18GB | RTX 4090 (24GB) |
| LLaVA-NeXT 7B | 7B | 14GB | RTX 4090 (24GB) |
| InternVL3-8B | 8B | 18GB | RTX 4090 (24GB) |
| Pixtral-12B | 12B | 24GB | RTX 4090 (24GB) |
| LLaVA-NeXT 13B | 13B | 28GB | A100 40GB |
| Baidu ERNIE-4.5-VL-28B-A3B | 28B (3B active) | 48GB | A100 80GB |
| Qwen3-Omni-30B-A3B | 30B (3B active) | 40GB | A100 80GB |
| InternVL3-78B | 78B | 80GB × 2 | 2× A100 80GB |
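As a rough sanity check on the dense-model minimums above (an illustrative rule of thumb, not from the individual guides): BF16/FP16 weights occupy about 2 bytes per parameter, so each minimum roughly tracks weight size plus a few GB of headroom for the vision encoder, activations, and KV cache.

```python
def bf16_weight_gb(params_billion: float) -> float:
    """Approximate BF16/FP16 weight memory: ~2 bytes per parameter,
    i.e. about 2 GB per billion parameters."""
    return 2.0 * params_billion

# An 8B model needs ~16GB for weights alone, consistent with the
# 18GB minimum once runtime overhead is added.
print(bf16_weight_gb(8))   # → 16.0
```

The MoE entries (ERNIE-4.5-VL, Qwen3-Omni) do not follow this dense-model estimate exactly, since all expert weights must still fit in memory and the guides may rely on quantization or other memory-saving settings.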

Available guides

Qwen3-Omni-30B-A3B

Multimodal MoE language model with 30B total parameters (3B active) supporting text, audio, image, and video inputs. 32K context window (single GPU).

Best for: Multimodal tasks requiring audio, vision, and text processing in a single model.

Qwen3-VL 4B & 8B

Vision-language models in 4B and 8B variants. 256K context, multimodal reasoning, and GUI automation.

Best for: Image understanding, visual reasoning, and GUI automation tasks.

InternVL3

InternVL3 series (1B–78B) deployed via vLLM. Strong visual question answering and multimodal reasoning.

Best for: High-accuracy visual QA and multimodal reasoning across model scales.

LLaVA-NeXT

LLaVA-NeXT 7B and 13B with improved visual reasoning, served via vLLM.

Best for: Accessible image-to-text inference on RTX 4090 or A100.

Pixtral-12B

Mistral's Pixtral-12B multimodal model on RTX 4090 (24GB) via vLLM.

Best for: Compact multimodal inference with Mistral-quality text generation.

Baidu ERNIE-4.5-VL-28B-A3B

28B-parameter MoE vision-language model from Baidu with 3B active parameters per token. Strong STEM and visual reasoning performance.

Best for: Visual reasoning, multimodal understanding, and STEM-domain tasks.

What's next