Multimodal Models

Guides for deploying vision-language models (VLMs) on Spheron GPU instances. All models accept both text and image inputs and are served via vLLM's OpenAI-compatible multimodal API.
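As a sketch of what that API accepts (an illustrative example, not taken from any single guide — the model name and image URL are placeholders), a mixed text-and-image request uses the OpenAI chat-completions "content parts" format:

```python
import json

def build_vision_request(model: str, prompt: str, image_url: str) -> str:
    """Build a JSON body for vLLM's OpenAI-compatible /v1/chat/completions
    endpoint, with one user turn mixing a text part and an image part."""
    body = {
        "model": model,
        "messages": [
            {
                "role": "user",
                "content": [
                    {"type": "text", "text": prompt},
                    {"type": "image_url", "image_url": {"url": image_url}},
                ],
            }
        ],
    }
    return json.dumps(body)

# POST this body to http://<your-instance>:8000/v1/chat/completions
# with the header "Content-Type: application/json".
payload = build_vision_request(
    "Qwen/Qwen3-VL-8B-Instruct",          # placeholder model name
    "Describe this image.",
    "https://example.com/sample.jpg",      # placeholder image URL
)
```

Because the endpoint is OpenAI-compatible, the official `openai` Python client also works against it by pointing `base_url` at the instance.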

VRAM requirements

| Model | Parameters | Min VRAM | Recommended GPU |
| --- | --- | --- | --- |
| Qwen3-VL 4B | 4B | 10GB | RTX 4090 (24GB) |
| Qwen3-VL 8B | 8B | 18GB | RTX 4090 (24GB) |
| LLaVA-NeXT 7B | 7B | 14GB | RTX 4090 (24GB) |
| InternVL3-8B | 8B | 18GB | RTX 4090 (24GB) |
| Pixtral-12B | 12B | 24GB | RTX 4090 (24GB) |
| LLaVA-NeXT 13B | 13B | 28GB | A100 40GB |
| Baidu ERNIE-4.5-VL-28B-A3B | 28B (3B active) | 48GB | A100 80GB |
| Qwen3-Omni-30B-A3B | 30B (3B active) | 40GB | A100 80GB |
| InternVL3-78B | 78B | 80GB × 2 | 2× A100 80GB |
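As a rough sanity check on the dense-model minimums above (an illustrative rule of thumb, not from the individual guides): BF16/FP16 weights occupy about 2 bytes per parameter, so each minimum roughly tracks weight size plus a few GB of headroom for the vision encoder, activations, and KV cache.

```python
def bf16_weight_gb(params_billion: float) -> float:
    """Approximate BF16/FP16 weight memory: ~2 bytes per parameter,
    i.e. about 2 GB per billion parameters."""
    return 2.0 * params_billion

# An 8B model needs ~16GB for weights alone, consistent with the
# 18GB minimum once runtime overhead is added.
print(bf16_weight_gb(8))   # → 16.0
```

The MoE entries (ERNIE-4.5-VL, Qwen3-Omni) do not follow this dense-model estimate exactly, since all expert weights must still fit in memory and the guides may rely on quantization or other memory-saving settings.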

Available guides

Qwen3-Omni-30B-A3B

Multimodal MoE language model with 30B total parameters (3B active) supporting text, audio, image, and video inputs. 32K context window (single GPU).

Best for: Multimodal tasks requiring audio, vision, and text processing in a single model.

Qwen3-VL 4B & 8B

Vision-language models in 4B and 8B variants. 256K context, multimodal reasoning, and GUI automation.

Best for: Image understanding, visual reasoning, and GUI automation tasks.

InternVL3

InternVL3 series (1B–78B) deployed via vLLM. Strong visual question answering and multimodal reasoning.

Best for: High-accuracy visual QA and multimodal reasoning across model scales.

LLaVA-NeXT

LLaVA-NeXT 7B and 13B with improved visual reasoning, served via vLLM.

Best for: Accessible image-to-text inference on RTX 4090 or A100.

Pixtral-12B

Mistral's Pixtral-12B multimodal model on RTX 4090 (24GB) via vLLM.

Best for: Compact multimodal inference with Mistral-quality text generation.

Baidu ERNIE-4.5-VL-28B-A3B

28B-parameter MoE vision-language model from Baidu with 3B active parameters per token. Strong STEM and visual reasoning performance.

Best for: Visual reasoning, multimodal understanding, and STEM-domain tasks.

What's next