Qwen3-Omni-30B-A3B
Deploy Qwen3-Omni-30B-A3B on a Spheron A100 or H100 instance. This multimodal language model processes text, audio, images, and video with a 32K context window (single GPU). It differs from Qwen3-VL, which handles vision and language only.
Key capabilities
- Multimodal inputs: Text, audio, images, video
- Audio understanding: Speech recognition, audio analysis
- Vision-language: Image and video understanding
- Context window: 32K tokens (single GPU), up to 65K (multi-GPU)
- Multilingual: 119+ languages and dialects
Use cases: Audio transcription, multimodal chat, content analysis, accessibility tools
Requirements
Hardware:
- GPU: A100 or H100 (the 30B model needs significant VRAM)
- VRAM: 24GB+ minimum, 40GB+ recommended
- RAM: 32GB+
- Storage: 60GB (SSD recommended)
Software:
- Ubuntu 22.04 LTS
- CUDA 12.1+
- Python 3.11
- Conda/Miniconda
Deploy on Spheron
- Sign up at app.spheron.ai
- Add credits (card/crypto)
- Deploy → A100 or H100 → Region → Ubuntu 22.04 → SSH key → Deploy
```bash
ssh -i <private-key-path> root@<your-vm-ip>
```
New to Spheron? See Getting Started and SSH Setup.
Installation
Install Miniconda
```bash
curl -O https://repo.anaconda.com/miniconda/Miniconda3-latest-Linux-x86_64.sh
bash Miniconda3-latest-Linux-x86_64.sh -b -p $HOME/miniconda3
$HOME/miniconda3/bin/conda init bash
source ~/.bashrc
```
Create environment
```bash
conda create -n qwen python=3.11 -y && conda activate qwen
```
Accept ToS if prompted:
```bash
conda tos accept --override-channels --channel https://repo.anaconda.com/pkgs/main
conda tos accept --override-channels --channel https://repo.anaconda.com/pkgs/r
```
Install PyTorch (CUDA 12.1)
```bash
pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu121
```
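Before installing the remaining dependencies, it is worth confirming that PyTorch can actually see the GPU. A minimal sanity check, run inside the qwen environment:
```python
import torch

# Healthy install: prints True and the A100/H100 device name.
print(torch.cuda.is_available())
print(torch.cuda.get_device_name(0))
```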
Install dependencies
```bash
pip install git+https://github.com/huggingface/transformers
pip install git+https://github.com/huggingface/accelerate
pip install huggingface_hub
pip install einops timm pillow sentencepiece protobuf decord numpy requests
pip install bitsandbytes
pip install qwen-omni-utils -U
```
Create test.py
Create the inference script:
```python
import torch
from transformers import Qwen3OmniMoeForConditionalGeneration, Qwen3OmniMoeProcessor

# Load the model on available devices
model = Qwen3OmniMoeForConditionalGeneration.from_pretrained(
    "Qwen/Qwen3-Omni-30B-A3B-Instruct",
    dtype="auto",
    device_map="auto"
)

# Optional: Enable flash_attention_2 for better performance and memory efficiency,
# especially in multi-image, video, or audio tasks.
# model = Qwen3OmniMoeForConditionalGeneration.from_pretrained(
#     "Qwen/Qwen3-Omni-30B-A3B-Instruct",
#     dtype=torch.bfloat16,
#     attn_implementation="flash_attention_2",
#     device_map="auto",
# )

# Load the processor
processor = Qwen3OmniMoeProcessor.from_pretrained("Qwen/Qwen3-Omni-30B-A3B-Instruct")

# Define input messages (image + text prompt)
messages = [
    {
        "role": "user",
        "content": [
            {
                "type": "image",
                "image": "https://qianwen-res.oss-cn-beijing.aliyuncs.com/Qwen-VL/assets/demo.jpeg",
            },
            {"type": "text", "text": "Describe this image."},
        ],
    }
]

# Prepare inputs for inference
inputs = processor.apply_chat_template(
    messages,
    tokenize=True,
    add_generation_prompt=True,
    return_dict=True,
    return_tensors="pt"
).to(model.device)

# Generate model output
generated_ids = model.generate(**inputs, max_new_tokens=128)

# Extract generated tokens (excluding prompt tokens)
generated_ids_trimmed = [
    output[len(input_ids):] for input_ids, output in zip(inputs.input_ids, generated_ids)
]

# Decode output text
output_text = processor.batch_decode(
    generated_ids_trimmed,
    skip_special_tokens=True,
    clean_up_tokenization_spaces=False
)
print(output_text)
```
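The same pipeline handles audio by changing only the messages payload. A minimal sketch following Qwen's chat-message convention for audio content; the URL below is a placeholder, so substitute a real audio file path or URL:
```python
# Hypothetical audio-transcription variant: swap in this messages payload and
# keep the rest of test.py (apply_chat_template, generate, batch_decode) unchanged.
messages = [
    {
        "role": "user",
        "content": [
            # Placeholder URL; replace with a real .wav/.mp3 file or URL.
            {"type": "audio", "audio": "https://example.com/sample.wav"},
            {"type": "text", "text": "Transcribe this audio clip."},
        ],
    }
]
```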
Run script
```bash
conda activate qwen
python3 test.py
```
Configuration
- Model variants: for a smaller omni model from the previous generation, use Qwen/Qwen2.5-Omni-7B.
- Precision: dtype=torch.float16 or torch.bfloat16 (A100/H100).
- Attention: add attn_implementation="flash_attention_2" if supported.
- Device mapping: device_map="auto" (default, recommended) or device_map={"": 0} for a single GPU (combined in the sketch after this list).
- Local files: use a local image path instead of a URL.
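Putting these options together, a hedged loading sketch; it assumes the flash-attn package is installed, so drop attn_implementation if it is not:
```python
import torch
from transformers import Qwen3OmniMoeForConditionalGeneration

# Sketch: explicit bf16 precision, FlashAttention-2, model pinned to GPU 0.
model = Qwen3OmniMoeForConditionalGeneration.from_pretrained(
    "Qwen/Qwen3-Omni-30B-A3B-Instruct",
    dtype=torch.bfloat16,
    attn_implementation="flash_attention_2",
    device_map={"": 0},  # pin all weights to GPU 0 instead of auto-sharding
)
```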
Troubleshooting
Issue: Out of memory (OOM)
Symptoms: CUDA OOM error during model load or inference.
Resolution: Reduce max_new_tokens, switch to dtype=torch.float16, or enable bitsandbytes quantization (see the sketch below).
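A hedged sketch of 4-bit loading with bitsandbytes; this assumes the architecture quantizes cleanly, so verify output quality before relying on it:
```python
import torch
from transformers import BitsAndBytesConfig, Qwen3OmniMoeForConditionalGeneration

# NF4 4-bit quantization cuts weight VRAM roughly 4x versus bf16.
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)
model = Qwen3OmniMoeForConditionalGeneration.from_pretrained(
    "Qwen/Qwen3-Omni-30B-A3B-Instruct",
    quantization_config=bnb_config,
    device_map="auto",
)
```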
Issue: Slow model loading
Symptoms: Model takes several minutes to load.
Resolution: Cache the model locally (pre-download sketch below), use NVMe storage, and pass use_safetensors=True to from_pretrained.
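To avoid the first from_pretrained call blocking on the network, the weights can be pre-fetched into the local Hugging Face cache, for example:
```python
from huggingface_hub import snapshot_download

# Downloads all model files (tens of GB) into the local HF cache;
# later from_pretrained calls then load straight from disk.
snapshot_download("Qwen/Qwen3-Omni-30B-A3B-Instruct")
```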
Issue: CUDA errors
Symptoms: CUDA version mismatch errors.
Resolution: Verify that your PyTorch build matches the installed CUDA driver. Run nvidia-smi to check the driver's reported CUDA version; a quick PyTorch-side check follows.
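A minimal check of what PyTorch was built with, to compare against nvidia-smi:
```python
import torch

# torch.version.cuda is the CUDA toolkit the wheel was built with (e.g. "12.1");
# the driver's "CUDA Version" field from nvidia-smi must be at least this.
print("torch:", torch.__version__)
print("built with CUDA:", torch.version.cuda)
print("GPU visible:", torch.cuda.is_available())
```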
What's next
- Qwen3 Models on HuggingFace
- Multimodal Models: Other vision-language model guides
- Getting Started: Spheron deployment basics
- Instance Types: GPU selection for large multimodal models