
LLaVA-NeXT

Deploy LLaVA-NeXT on Spheron GPU instances using vLLM. LLaVA-NeXT (Large Language and Vision Assistant, Next) improves on the original LLaVA with stronger visual reasoning, support for higher input image resolutions, and better OCR.

Recommended hardware

Model            Recommended GPU    Instance Type       Notes
LLaVA-NeXT 7B    RTX 4090 (24GB)    Dedicated or Spot   llava-v1.6-mistral-7b-hf
LLaVA-NeXT 13B   A100 40GB          Dedicated           llava-v1.6-vicuna-13b-hf

Manual setup

Use these steps to set up the server manually after SSH-ing into your instance. This works on any provider regardless of cloud-init support.

Step 1: Connect to your instance

ssh <user>@<ipAddress>

Replace <user> with the username shown in the instance details panel (e.g., ubuntu for Spheron AI instances) and <ipAddress> with your instance's public IP.
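
Once connected, it is worth confirming the GPU is visible before installing anything (most GPU images ship with the NVIDIA driver preinstalled, but this assumption is worth checking):

nvidia-smi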

Step 2: Install vLLM

sudo apt-get update -y
sudo apt-get install -y python3-pip
pip install vllm
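
The vLLM install pulls in large CUDA dependencies and can take a few minutes. You can confirm it imports cleanly before starting the server:

python3 -c "import vllm; print(vllm.__version__)"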

Step 3: Start the server

Run the server in the foreground first to verify it works. The first start downloads the model weights from Hugging Face, which can take several minutes:

python3 -m vllm.entrypoints.openai.api_server \
  --model llava-hf/llava-v1.6-mistral-7b-hf \
  --port 8000 \
  --dtype auto

Press Ctrl+C to stop.
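
Once the logs show the server listening on port 8000, you can hit the OpenAI-compatible model list endpoint from a second SSH session to confirm it is serving:

curl http://localhost:8000/v1/models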

Step 4: Run as a background service

To keep the server running after you close your SSH session, create a systemd service:

sudo tee /etc/systemd/system/vllm-llava.service > /dev/null << 'EOF'
[Unit]
Description=LLaVA-NeXT vLLM Inference Server
After=network.target
 
[Service]
Type=simple
ExecStart=/usr/bin/python3 -m vllm.entrypoints.openai.api_server \
  --model llava-hf/llava-v1.6-mistral-7b-hf \
  --port 8000 \
  --dtype auto
Restart=on-failure
RestartSec=10
 
[Install]
WantedBy=multi-user.target
EOF
 
sudo systemctl daemon-reload
sudo systemctl enable vllm-llava
sudo systemctl start vllm-llava
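
Verify that the service started and is answering requests:

sudo systemctl status vllm-llava --no-pager
curl http://localhost:8000/v1/models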

Accessing the server

SSH tunnel

Forward local port 8000 to the instance so you can call the API from your machine without exposing the port publicly:

ssh -L 8000:localhost:8000 <user>@<ipAddress>
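
With the tunnel open, a quick text-only request from your local machine confirms end-to-end connectivity before moving on to image input:

curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "llava-hf/llava-v1.6-mistral-7b-hf",
    "messages": [{"role": "user", "content": "Say hello"}]
  }'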

Usage example: image input

Run this from your local machine with the tunnel open. It uses the openai Python package (pip install openai):

import base64
from openai import OpenAI
 
# Point the client at the tunneled vLLM server; vLLM ignores the API key
# unless the server was started with --api-key.
client = OpenAI(base_url="http://localhost:8000/v1", api_key="not-needed")
 
# Encode a local image as base64 so it can be sent inline as a data URL.
with open("image.jpg", "rb") as f:
    image_b64 = base64.b64encode(f.read()).decode()
 
response = client.chat.completions.create(
    model="llava-hf/llava-v1.6-mistral-7b-hf",
    messages=[
        {
            "role": "user",
            "content": [
                {"type": "image_url", "image_url": {"url": f"data:image/jpeg;base64,{image_b64}"}},
                {"type": "text", "text": "What is shown in this image?"},
            ],
        }
    ],
)
print(response.choices[0].message.content)
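
Depending on your vLLM version, the server can also fetch a plain http(s) URL passed in image_url, which skips the base64 step for images that are already hosted somewhere reachable from the instance.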

Check server logs

journalctl -u vllm-llava -f
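
To dump the most recent entries instead of streaming them:

journalctl -u vllm-llava -n 100 --no-pager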

Cloud-init startup script (optional)

If your provider supports cloud-init, you can paste this into the Startup Script field when deploying to automate the setup above.

#cloud-config
runcmd:
  - apt-get update -y
  - apt-get install -y python3-pip
  - pip install vllm
  - |
    cat > /etc/systemd/system/vllm-llava.service << 'EOF'
    [Unit]
    Description=LLaVA-NeXT vLLM Inference Server
    After=network.target
 
    [Service]
    Type=simple
    ExecStart=/usr/bin/python3 -m vllm.entrypoints.openai.api_server \
      --model llava-hf/llava-v1.6-mistral-7b-hf \
      --port 8000 \
      --dtype auto
    Restart=on-failure
    RestartSec=10
 
    [Install]
    WantedBy=multi-user.target
    EOF
  - systemctl daemon-reload
  - systemctl enable vllm-llava
  - systemctl start vllm-llava
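
The first boot can take several minutes while pip installs vLLM and the service downloads the model weights on its first start. Once you can SSH in, check progress with:

cloud-init status --wait
sudo systemctl status vllm-llava --no-pager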

What's next