InternVL3
Deploy InternVL3 on Spheron GPU instances using vLLM. InternVL3 is a vision-language model series from 1B to 78B parameters with strong performance on visual question answering and multimodal reasoning benchmarks.
Recommended hardware
| Model | Recommended GPU | Instance Type | Notes |
|---|---|---|---|
| InternVL3-8B | RTX 4090 (24GB) | Dedicated or Spot | Single-GPU |
| InternVL3-14B | A100 40GB | Dedicated | Unquantized 16-bit weights |
| InternVL3-38B | A100 80GB | Dedicated | Single-GPU |
| InternVL3-78B | 2× A100 80GB | Dedicated / Cluster | --tensor-parallel-size 2 |
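The table's sizing can be sanity-checked with a rough weight-memory estimate. The sketch below is not part of the official sizing guidance: it assumes 16-bit weights (what `--dtype auto` selects for these checkpoints) and uses the nominal parameter counts from the model names, which differ slightly from the real checkpoints. The function names are hypothetical.

```python
# Rough estimate of GPU memory consumed by model weights alone.
# Assumption: parameter counts are the nominal sizes taken from the model names.
BYTES_PER_PARAM = {"float32": 4, "bfloat16": 2, "float16": 2}


def weight_memory_gib(params_billion: float, dtype: str = "bfloat16") -> float:
    """Approximate weight footprint in GiB for a given parameter count."""
    return params_billion * 1e9 * BYTES_PER_PARAM[dtype] / 2**30


def per_gpu_weight_gib(params_billion: float, num_gpus: int = 1,
                       dtype: str = "bfloat16") -> float:
    """Tensor parallelism splits the weights roughly evenly across GPUs."""
    return weight_memory_gib(params_billion, dtype) / num_gpus


# (model, nominal parameters in billions, number of GPUs from the table)
for name, params, gpus in [
    ("InternVL3-8B", 8, 1),
    ("InternVL3-14B", 14, 1),
    ("InternVL3-38B", 38, 1),
    ("InternVL3-78B", 78, 2),
]:
    print(f"{name}: ~{per_gpu_weight_gib(params, gpus):.1f} GiB of weights per GPU")
```

Weights are only part of the footprint. The KV cache and activations need additional memory, so when the estimate is close to the card's capacity, a shorter context window (vLLM's `--max-model-len` flag) may be necessary.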
Manual setup
Use these steps to set up the server manually after SSH-ing into your instance. This works on any provider regardless of cloud-init support.
Step 1: Connect to your instance
ssh <user>@<ipAddress>
Replace <user> with the username shown in the instance details panel (e.g., ubuntu for Spheron AI instances) and <ipAddress> with your instance's public IP.
Step 2: Install vLLM
sudo apt-get update -y
sudo apt-get install -y python3-pip
pip install vllm
Step 3: Start the server
Run the server in the foreground to verify it works:
python3 -m vllm.entrypoints.openai.api_server \
  --model OpenGVLab/InternVL3-8B \
  --port 8000 \
  --dtype auto \
  --trust-remote-code
Press Ctrl+C to stop. For InternVL3-78B on 2× A100, add --tensor-parallel-size 2.
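The first start can take a while because the model weights are downloaded and loaded before the server accepts requests. To verify the server from a second SSH session, a small readiness check can poll it until it answers. This is a minimal sketch, assuming the installed vLLM version exposes the `/health` route on its OpenAI-compatible server; `wait_until_ready` and its parameters are hypothetical names, and only the standard library is used.

```python
import time
import urllib.request


def wait_until_ready(base_url="http://localhost:8000", timeout_s=600.0,
                     interval_s=5.0, opener=urllib.request.urlopen,
                     sleep=time.sleep):
    """Poll GET {base_url}/health until it returns HTTP 200.

    Returns True once the server answers, False if timeout_s elapses first.
    `opener` and `sleep` are injectable so the function can be tested offline.
    """
    deadline = time.monotonic() + timeout_s
    while True:
        try:
            with opener(f"{base_url}/health", timeout=5) as resp:
                if resp.status == 200:
                    return True
        except OSError:
            # Covers connection refused, timeouts, and urllib's URLError/HTTPError:
            # the server is not up yet, so fall through and retry.
            pass
        if time.monotonic() >= deadline:
            return False
        sleep(interval_s)
```

Calling `wait_until_ready()` with the defaults checks port 8000 on the instance itself; it returns `False` if the server has not come up within ten minutes.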
Step 4: Run as a background service
To keep the server running after you close your SSH session, create a systemd service. Replace <user> in the unit with your SSH username so the service runs under the account that installed vLLM in Step 2; without it, the service runs as root and may not find packages that pip installed into your user directory.
sudo tee /etc/systemd/system/vllm-internvl.service > /dev/null << 'EOF'
[Unit]
Description=InternVL3 vLLM Inference Server
After=network.target
[Service]
Type=simple
User=<user>
ExecStart=/usr/bin/python3 -m vllm.entrypoints.openai.api_server \
  --model OpenGVLab/InternVL3-8B \
  --port 8000 \
  --dtype auto \
  --trust-remote-code
Restart=on-failure
RestartSec=10
[Install]
WantedBy=multi-user.target
EOF
sudo systemctl daemon-reload
sudo systemctl enable vllm-internvl
sudo systemctl start vllm-internvl
Accessing the server
SSH tunnel
ssh -L 8000:localhost:8000 <user>@<ipAddress>
Usage example: image input
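Images are sent to the server as base64 data URLs inside an `image_url` content part. The request in this section hardcodes `image/jpeg`; for other formats, a helper along these lines could pick the MIME type from the file extension. This is a sketch only: `to_data_url` is a hypothetical name, and falling back to JPEG for unrecognized extensions is an assumption.

```python
import base64
import mimetypes
from pathlib import Path


def to_data_url(path, default_mime="image/jpeg"):
    """Encode an image file as a data URL suitable for an image_url content part."""
    # Guess the MIME type from the file extension, e.g. ".png" -> "image/png".
    mime, _ = mimetypes.guess_type(str(path))
    if mime is None or not mime.startswith("image/"):
        # Assumption: treat unknown extensions as JPEG.
        mime = default_mime
    encoded = base64.b64encode(Path(path).read_bytes()).decode("ascii")
    return f"data:{mime};base64,{encoded}"
```

The returned string can be passed as the `url` value in the request that follows.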
import base64
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="not-needed")

with open("image.jpg", "rb") as f:
    image_b64 = base64.b64encode(f.read()).decode()

response = client.chat.completions.create(
    model="OpenGVLab/InternVL3-8B",
    messages=[
        {
            "role": "user",
            "content": [
                {"type": "image_url", "image_url": {"url": f"data:image/jpeg;base64,{image_b64}"}},
                {"type": "text", "text": "Describe what you see in this image."},
            ],
        }
    ],
)
print(response.choices[0].message.content)
Check server logs
journalctl -u vllm-internvl -f
Cloud-init startup script (optional)
If your provider supports cloud-init, you can paste this into the Startup Script field when deploying to automate the setup above.
#cloud-config
runcmd:
  - apt-get update -y
  - apt-get install -y python3-pip
  - pip install vllm
  - |
    cat > /etc/systemd/system/vllm-internvl.service << 'EOF'
    [Unit]
    Description=InternVL3 vLLM Inference Server
    After=network.target
    [Service]
    Type=simple
    ExecStart=/usr/bin/python3 -m vllm.entrypoints.openai.api_server \
      --model OpenGVLab/InternVL3-8B \
      --port 8000 \
      --dtype auto \
      --trust-remote-code
    Restart=on-failure
    RestartSec=10
    [Install]
    WantedBy=multi-user.target
    EOF
  - systemctl daemon-reload
  - systemctl enable vllm-internvl
  - systemctl start vllm-internvl
For InternVL3-78B on 2× A100, add --tensor-parallel-size 2 to the ExecStart command.
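The only part of the command that changes between model sizes is the model ID and, for multi-GPU instances, the tensor-parallel flag. As an illustration, the ExecStart line could be generated rather than edited by hand. This is a minimal sketch: `build_vllm_command` is a hypothetical helper, and the model ID `OpenGVLab/InternVL3-78B` is assumed by analogy with the 8B ID used in this guide.

```python
import shlex


def build_vllm_command(model, port=8000, tensor_parallel_size=1,
                       python="/usr/bin/python3"):
    """Return the vLLM server command as an argument list."""
    args = [
        python, "-m", "vllm.entrypoints.openai.api_server",
        "--model", model,
        "--port", str(port),
        "--dtype", "auto",
        "--trust-remote-code",
    ]
    # Only multi-GPU deployments need the tensor-parallel flag.
    if tensor_parallel_size > 1:
        args += ["--tensor-parallel-size", str(tensor_parallel_size)]
    return args


# Single-line command for InternVL3-78B on 2x A100 (model ID is an assumption).
print(shlex.join(build_vllm_command("OpenGVLab/InternVL3-78B",
                                    tensor_parallel_size=2)))
```

The printed line can be pasted after `ExecStart=` in the unit file; systemd accepts the command on a single line as well as with backslash continuations.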
What's next
- Multimodal Models: Other vision-language model guides
- vLLM Inference Server: vLLM configuration details
- Networking: SSH tunneling and port access
- Instance Types: Multi-GPU setup for large VLMs