Baidu ERNIE-4.5-VL-28B-A3B-Thinking
Deploy Baidu ERNIE-4.5-VL-28B-A3B-Thinking on a Spheron GPU instance. This multimodal reasoning model uses a Mixture-of-Experts architecture with 28B total parameters and 3B active per token. It supports visual reasoning, STEM problem solving, chart analysis, and video understanding, and is released under the Apache 2.0 license.
Overview
ERNIE-4.5-VL-28B-A3B-Thinking includes a "Thinking" mode for multi-step chain-of-thought reasoning over visual inputs. The model has 28B total parameters in a Mixture-of-Experts design, with 3B activated per token. Performance is competitive with GPT-4o and Gemini 2.5 Pro on visual reasoning, STEM, chart analysis, and video understanding tasks.
- Released: November 11, 2025 by Baidu
- Architecture: ERNIE-4.5-VL-28B-A3B + reasoning fine-tuning (GSPO, IcePop)
- Training: Visual-language reasoning datasets with multimodal RL
Key capabilities
- Visual reasoning: Multi-step reasoning, chart analysis, causal relationships
- STEM reasoning: Math, science, engineering from images
- Visual grounding: Object localization, industrial QC/automation
- Dynamic detail focus: Zooms into regions, chain-of-thought over visuals
- Tool calling: Image search, cropping, web lookup integration
- Video understanding: Temporal awareness, event localization, frame tracking
Use cases: Multimodal agents, document automation, visual search, education, video analysis
Requirements
Hardware:
- GPU: A100 80GB (recommended), RTX A6000 48GB (minimum for a single card), or 2× RTX 4090 with tensor parallelism
- RAM: 32GB+
- Storage: 60GB free
- VRAM: 48GB+ per card for bfloat16 (80GB recommended); 20GB+ with 4-bit quantization
Software:
- Ubuntu 22.04 LTS
- CUDA 12.1+
- Python 3.11
- Conda/Miniconda
Deploy on Spheron
- Sign up at app.spheron.ai
- Add credits (card/crypto)
- Deploy → Select A100 80GB (or 2× RTX 4090 for multi-GPU) → Region → Ubuntu 22.04 → SSH key → Deploy
ssh -i <private-key-path> root@<your-vm-ip>
New to Spheron? See Getting Started and SSH Setup.
Installation
Install Miniconda
curl -O https://repo.anaconda.com/miniconda/Miniconda3-latest-Linux-x86_64.sh
bash Miniconda3-latest-Linux-x86_64.sh -b -p $HOME/miniconda3
$HOME/miniconda3/bin/conda init bash
source ~/.bashrc
Create Python environment
conda create -n ernie python=3.11 -y && conda activate ernie
Install dependencies
pip install torch torchvision torchaudio einops timm pillow
pip install git+https://github.com/huggingface/transformers
pip install git+https://github.com/huggingface/accelerate
pip install git+https://github.com/huggingface/diffusers
pip install huggingface_hub
pip install sentencepiece bitsandbytes protobuf decord numpy
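Before installing Jupyter, it is worth a quick sanity check that PyTorch can see the GPU. Run this inside the ernie environment (for example in a python shell):
import torch

# Confirm the CUDA build of PyTorch is installed and list visible GPUs
print(torch.__version__, torch.version.cuda)
print("CUDA available:", torch.cuda.is_available())
for i in range(torch.cuda.device_count()):
    print(i, torch.cuda.get_device_name(i))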
Install Jupyter
conda install -c conda-forge --override-channels notebook -y
conda install -c conda-forge --override-channels ipywidgets -y
jupyter notebook --allow-root
Access Jupyter from your local machine
SSH port forwarding from your local machine:
ssh -L 8888:localhost:8888 -p <YOUR_SERVER_PORT> -i <PATH_TO_SSH_KEY> root@<YOUR_SERVER_IP>
Copy the Jupyter URL from the server terminal to your browser.
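Optionally, pre-download the model weights before starting the notebook so the first model load does not wait on a download in the tens of gigabytes. A minimal sketch using huggingface_hub (weights land in the default Hugging Face cache under ~/.cache/huggingface):
from huggingface_hub import snapshot_download

# Download the full checkpoint into the local Hugging Face cache
snapshot_download(repo_id="baidu/ERNIE-4.5-VL-28B-A3B-Thinking")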
Run model
Load model
Open a notebook and run:
import torch
from transformers import AutoProcessor, AutoTokenizer, AutoModelForCausalLM
model_path = 'baidu/ERNIE-4.5-VL-28B-A3B-Thinking'
# Load the model in bfloat16; device_map="auto" spreads layers across available GPUs
model = AutoModelForCausalLM.from_pretrained(
    model_path,
    device_map="auto",
    dtype=torch.bfloat16,
    trust_remote_code=True
)
processor = AutoProcessor.from_pretrained(model_path, trust_remote_code=True)
model.add_image_preprocess(processor)
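The bfloat16 load above needs roughly 48GB+ of VRAM. On a smaller card, the 4-bit path mentioned in the requirements is an option via bitsandbytes; the sketch below is an assumption rather than an officially documented configuration, and quantization support for this model's custom MoE vision-language code should be verified before relying on the outputs:
from transformers import BitsAndBytesConfig

# Hypothetical 4-bit NF4 load (sketch); compute still runs in bfloat16
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)
model = AutoModelForCausalLM.from_pretrained(
    model_path,
    device_map="auto",
    quantization_config=bnb_config,
    trust_remote_code=True,
)
model.add_image_preprocess(processor)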
Run inference
# A single user turn combining a text question with an image URL
messages = [
    {
        "role": "user",
        "content": [
            {
                "type": "text",
                "text": "What is in the image, and what color is the dog?"
            },
            {
                "type": "image_url",
                "image_url": {
                    "url": "https://images.pexels.com/photos/58997/pexels-photo-58997.jpeg"
                }
            },
        ]
    },
]
text = processor.tokenizer.apply_chat_template(
    messages,
    tokenize=False,
    add_generation_prompt=True,
)
image_inputs, video_inputs = processor.process_vision_info(messages)
inputs = processor(
    text=[text],
    images=image_inputs,
    videos=video_inputs,
    padding=True,
    return_tensors="pt",
)
device = next(model.parameters()).device
inputs = inputs.to(device)
# Generate up to 1024 new tokens from the prepared multimodal inputs
generated_ids = model.generate(
    inputs=inputs['input_ids'].to(device),
    **inputs,
    max_new_tokens=1024,
    use_cache=False
)
output_text = processor.decode(generated_ids[0][len(inputs['input_ids'][0]):])
print(output_text)
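Because this is a thinking model, the decoded text generally contains the chain-of-thought before the final answer. If the reasoning is delimited by a </think> tag (an assumption; inspect your actual output to confirm the delimiter), you can separate it like this:
# Split the reasoning trace from the final answer, assuming a </think> delimiter;
# fall back to the full decoded text if the tag is not present.
if "</think>" in output_text:
    reasoning, answer = output_text.split("</think>", 1)
    print(answer.strip())
else:
    print(output_text.strip())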
What's next
- Model on HuggingFace
- Multimodal Models: Other vision-language model guides
- Getting Started: Spheron deployment basics
- vLLM Inference Server: Serve multimodal models via an OpenAI-compatible API (see the request sketch below)
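If you later serve the model behind vLLM's OpenAI-compatible server, a chat request from Python could look like the sketch below. It assumes a server listening on localhost:8000, that your vLLM build supports this model, and that the requests package is installed:
import requests

# Illustrative OpenAI-style chat completion request against a local vLLM server
resp = requests.post(
    "http://localhost:8000/v1/chat/completions",
    json={
        "model": "baidu/ERNIE-4.5-VL-28B-A3B-Thinking",
        "messages": [
            {"role": "user", "content": [
                {"type": "text", "text": "Describe this image."},
                {"type": "image_url", "image_url": {"url": "https://images.pexels.com/photos/58997/pexels-photo-58997.jpeg"}},
            ]},
        ],
    },
    timeout=300,
)
print(resp.json()["choices"][0]["message"]["content"])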