Baidu ERNIE-4.5-VL-28B-A3B-Thinking
Deploy Baidu ERNIE-4.5-VL-28B-A3B-Thinking on a Spheron GPU instance. This multimodal reasoning model uses a Mixture-of-Experts architecture with 28B total parameters and 3B active per token. It supports visual reasoning, STEM problem solving, chart analysis, and video understanding, and is released under the Apache 2.0 license.
Overview
ERNIE-4.5-VL-28B-A3B-Thinking includes a "Thinking" mode for multi-step chain-of-thought reasoning over visual inputs. The model has 28B total parameters in a Mixture-of-Experts design, with 3B activated per token. Performance is competitive with GPT-4o and Gemini 2.5 Pro on visual reasoning, STEM, chart analysis, and video understanding tasks.
- Released: November 11, 2025 by Baidu
- Architecture: ERNIE-4.5-VL-28B-A3B + reasoning fine-tuning (GSPO, IcePop)
- Training: Visual-language reasoning datasets with multimodal RL
Key capabilities
- Visual reasoning: Multi-step reasoning, chart analysis, causal relationships
- STEM reasoning: Math, science, engineering from images
- Visual grounding: Object localization, industrial QC/automation
- Dynamic detail focus: Zooms into regions, chain-of-thought over visuals
- Tool calling: Image search, cropping, web lookup integration
- Video understanding: Temporal awareness, event localization, frame tracking
Use cases: Multimodal agents, document automation, visual search, education, video analysis
Requirements
Hardware:
- GPU: A100 80GB (recommended), RTX A6000 48GB (minimum for a single card), or 2× RTX 4090 with tensor parallelism
- RAM: 32GB+
- Storage: 60GB free
- VRAM: 48GB+ per card for bfloat16 (80GB recommended); 20GB+ with 4-bit quantization
Software:
- Ubuntu 22.04 LTS
- CUDA 12.1+
- Python 3.11
- Conda/Miniconda
Deploy on Spheron
- Sign up at app.spheron.ai
- Add credits (card/crypto)
- Deploy → Select A100 80GB (or 2× RTX 4090 for multi-GPU) → Region → Ubuntu 22.04 → SSH key → Deploy
ssh -i <private-key-path> root@<your-vm-ip>
New to Spheron? See Getting Started and SSH Setup.
Installation
Install Miniconda
curl -O https://repo.anaconda.com/miniconda/Miniconda3-latest-Linux-x86_64.sh
bash Miniconda3-latest-Linux-x86_64.sh -b -p $HOME/miniconda3
$HOME/miniconda3/bin/conda init bash
source ~/.bashrc
Create Python environment
conda create -n ernie python=3.11 -y && conda activate ernie
Install dependencies
pip install torch torchvision torchaudio einops timm pillow
pip install git+https://github.com/huggingface/transformers
pip install git+https://github.com/huggingface/accelerate
pip install git+https://github.com/huggingface/diffusers
pip install huggingface_hub
pip install sentencepiece bitsandbytes protobuf decord numpy
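Before installing Jupyter, it is worth a quick sanity check that PyTorch can see the GPU. Run this inside the ernie environment (for example in a python shell):
import torch

# Confirm the CUDA build of PyTorch is installed and list visible GPUs
print(torch.__version__, torch.version.cuda)
print("CUDA available:", torch.cuda.is_available())
for i in range(torch.cuda.device_count()):
    print(i, torch.cuda.get_device_name(i))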
Install Jupyter
conda install -c conda-forge --override-channels notebook -y
conda install -c conda-forge --override-channels ipywidgets -y
jupyter notebook --allow-root
Access Jupyter from your local machine
SSH port forwarding from your local machine:
ssh -L 8888:localhost:8888 -p <YOUR_SERVER_PORT> -i <PATH_TO_SSH_KEY> root@<YOUR_SERVER_IP>
Copy the Jupyter URL from the server terminal to your browser.
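Optionally, pre-download the model weights before starting the notebook so the first model load does not wait on a download in the tens of gigabytes. A minimal sketch using huggingface_hub (weights land in the default Hugging Face cache under ~/.cache/huggingface):
from huggingface_hub import snapshot_download

# Download the full checkpoint into the local Hugging Face cache
snapshot_download(repo_id="baidu/ERNIE-4.5-VL-28B-A3B-Thinking")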
Run model
Load model
Open a notebook and run:
import torch
from transformers import AutoProcessor, AutoTokenizer, AutoModelForCausalLM
model_path = 'baidu/ERNIE-4.5-VL-28B-A3B-Thinking'
# Load the model in bfloat16; device_map="auto" spreads layers across available GPUs
model = AutoModelForCausalLM.from_pretrained(
    model_path,
    device_map="auto",
    dtype=torch.bfloat16,
    trust_remote_code=True
)
processor = AutoProcessor.from_pretrained(model_path, trust_remote_code=True)
model.add_image_preprocess(processor)
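The bfloat16 load above needs roughly 48GB+ of VRAM. On a smaller card, the 4-bit path mentioned in the requirements is an option via bitsandbytes; the sketch below is an assumption rather than an officially documented configuration, and quantization support for this model's custom MoE vision-language code should be verified before relying on the outputs:
from transformers import BitsAndBytesConfig

# Hypothetical 4-bit NF4 load (sketch); compute still runs in bfloat16
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)
model = AutoModelForCausalLM.from_pretrained(
    model_path,
    device_map="auto",
    quantization_config=bnb_config,
    trust_remote_code=True,
)
model.add_image_preprocess(processor)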
Run inference
# A single user turn combining a text question with an image URL
messages = [
    {
        "role": "user",
        "content": [
            {
                "type": "text",
                "text": "What is in the image, and what color is the dog?"
            },
            {
                "type": "image_url",
                "image_url": {
                    "url": "https://images.pexels.com/photos/58997/pexels-photo-58997.jpeg"
                }
            },
        ]
    },
]
text = processor.tokenizer.apply_chat_template(
    messages,
    tokenize=False,
    add_generation_prompt=True,
)
image_inputs, video_inputs = processor.process_vision_info(messages)
inputs = processor(
    text=[text],
    images=image_inputs,
    videos=video_inputs,
    padding=True,
    return_tensors="pt",
)
device = next(model.parameters()).device
inputs = inputs.to(device)
# Generate up to 1024 new tokens from the prepared multimodal inputs
generated_ids = model.generate(
    inputs=inputs['input_ids'].to(device),
    **inputs,
    max_new_tokens=1024,
    use_cache=False
)
output_text = processor.decode(generated_ids[0][len(inputs['input_ids'][0]):])
print(output_text)
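Because this is a thinking model, the decoded text generally contains the chain-of-thought before the final answer. If the reasoning is delimited by a </think> tag (an assumption; inspect your actual output to confirm the delimiter), you can separate it like this:
# Split the reasoning trace from the final answer, assuming a </think> delimiter;
# fall back to the full decoded text if the tag is not present.
if "</think>" in output_text:
    reasoning, answer = output_text.split("</think>", 1)
    print(answer.strip())
else:
    print(output_text.strip())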
What's next
- Model on HuggingFace
- Multimodal Models: Other vision-language model guides
- Getting Started: Spheron deployment basics
- vLLM Inference Server: Serve multimodal models via an OpenAI-compatible API (see the request sketch below)
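If you later serve the model behind vLLM's OpenAI-compatible server, a chat request from Python could look like the sketch below. It assumes a server listening on localhost:8000, that your vLLM build supports this model, and that the requests package is installed:
import requests

# Illustrative OpenAI-style chat completion request against a local vLLM server
resp = requests.post(
    "http://localhost:8000/v1/chat/completions",
    json={
        "model": "baidu/ERNIE-4.5-VL-28B-A3B-Thinking",
        "messages": [
            {"role": "user", "content": [
                {"type": "text", "text": "Describe this image."},
                {"type": "image_url", "image_url": {"url": "https://images.pexels.com/photos/58997/pexels-photo-58997.jpeg"}},
            ]},
        ],
    },
    timeout=300,
)
print(resp.json()["choices"][0]["message"]["content"])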