
LLaVA-NeXT

Deploy LLaVA-NeXT on Spheron GPU instances using vLLM. LLaVA-NeXT (Large Language and Vision Assistant, Next) improves on the original LLaVA with stronger visual reasoning, support for higher input image resolutions, and better OCR.

Recommended hardware

Model            Recommended GPU    Instance Type       Notes
LLaVA-NeXT 7B    RTX 4090 (24GB)    Dedicated or Spot   llava-v1.6-mistral-7b-hf
LLaVA-NeXT 13B   A100 40GB          Dedicated           llava-v1.6-vicuna-13b-hf

Manual setup

Use these steps to set up the server manually after SSH-ing into your instance. This works on any provider regardless of cloud-init support.

Step 1: Connect to your instance

ssh <user>@<ipAddress>

Replace <user> with the username shown in the instance details panel (e.g., ubuntu for Spheron AI instances) and <ipAddress> with your instance's public IP.
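
Once connected, it is worth confirming the GPU is visible before installing anything (most GPU images ship with the NVIDIA driver preinstalled, but this assumption is worth checking):

nvidia-smi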

Step 2: Install vLLM

sudo apt-get update -y
sudo apt-get install -y python3-pip
pip install vllm
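
The vLLM install pulls in large CUDA dependencies and can take a few minutes. You can confirm it imports cleanly before starting the server:

python3 -c "import vllm; print(vllm.__version__)"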

Step 3: Start the server

Run the server in the foreground first to verify it works. The first start downloads the model weights from Hugging Face, which can take several minutes:

python3 -m vllm.entrypoints.openai.api_server \
  --model llava-hf/llava-v1.6-mistral-7b-hf \
  --port 8000 \
  --dtype auto

Press Ctrl+C to stop.
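
Once the logs show the server listening on port 8000, you can hit the OpenAI-compatible model list endpoint from a second SSH session to confirm it is serving:

curl http://localhost:8000/v1/models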

Step 4: Run as a background service

To keep the server running after you close your SSH session, create a systemd service:

sudo tee /etc/systemd/system/vllm-llava.service > /dev/null << 'EOF'
[Unit]
Description=LLaVA-NeXT vLLM Inference Server
After=network.target
 
[Service]
Type=simple
ExecStart=/usr/bin/python3 -m vllm.entrypoints.openai.api_server \
  --model llava-hf/llava-v1.6-mistral-7b-hf \
  --port 8000 \
  --dtype auto
Restart=on-failure
RestartSec=10
 
[Install]
WantedBy=multi-user.target
EOF
 
sudo systemctl daemon-reload
sudo systemctl enable vllm-llava
sudo systemctl start vllm-llava
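
Verify that the service started and is answering requests:

sudo systemctl status vllm-llava --no-pager
curl http://localhost:8000/v1/models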

Accessing the server

SSH tunnel

Forward local port 8000 to the instance so you can call the API from your machine without exposing the port publicly:

ssh -L 8000:localhost:8000 <user>@<ipAddress>
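
With the tunnel open, a quick text-only request from your local machine confirms end-to-end connectivity before moving on to image input:

curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "llava-hf/llava-v1.6-mistral-7b-hf",
    "messages": [{"role": "user", "content": "Say hello"}]
  }'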

Usage example: image input

Run this from your local machine with the tunnel open. It uses the openai Python package (pip install openai):

import base64
from openai import OpenAI
 
# Point the client at the tunneled vLLM server; vLLM ignores the API key
# unless the server was started with --api-key.
client = OpenAI(base_url="http://localhost:8000/v1", api_key="not-needed")
 
# Encode a local image as base64 so it can be sent inline as a data URL.
with open("image.jpg", "rb") as f:
    image_b64 = base64.b64encode(f.read()).decode()
 
response = client.chat.completions.create(
    model="llava-hf/llava-v1.6-mistral-7b-hf",
    messages=[
        {
            "role": "user",
            "content": [
                {"type": "image_url", "image_url": {"url": f"data:image/jpeg;base64,{image_b64}"}},
                {"type": "text", "text": "What is shown in this image?"},
            ],
        }
    ],
)
print(response.choices[0].message.content)
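
Depending on your vLLM version, the server can also fetch a plain http(s) URL passed in image_url, which skips the base64 step for images that are already hosted somewhere reachable from the instance.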

Check server logs

journalctl -u vllm-llava -f
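
To dump the most recent entries instead of streaming them:

journalctl -u vllm-llava -n 100 --no-pager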

Cloud-init startup script (optional)

If your provider supports cloud-init, you can paste this into the Startup Script field when deploying to automate the setup above.

#cloud-config
runcmd:
  - apt-get update -y
  - apt-get install -y python3-pip
  - pip install vllm
  - |
    cat > /etc/systemd/system/vllm-llava.service << 'EOF'
    [Unit]
    Description=LLaVA-NeXT vLLM Inference Server
    After=network.target
 
    [Service]
    Type=simple
    ExecStart=/usr/bin/python3 -m vllm.entrypoints.openai.api_server \
      --model llava-hf/llava-v1.6-mistral-7b-hf \
      --port 8000 \
      --dtype auto
    Restart=on-failure
    RestartSec=10
 
    [Install]
    WantedBy=multi-user.target
    EOF
  - systemctl daemon-reload
  - systemctl enable vllm-llava
  - systemctl start vllm-llava
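
The first boot can take several minutes while pip installs vLLM and the service downloads the model weights on its first start. Once you can SSH in, check progress with:

cloud-init status --wait
sudo systemctl status vllm-llava --no-pager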

What's next