# Llama 4 Scout & Maverick
Deploy Meta Llama 4 Scout and Maverick on Spheron GPU instances using vLLM. Llama 4 introduces a Mixture-of-Experts (MoE) architecture with native multimodal support for text and images.
## Recommended hardware
| Model | Parameters | Recommended GPU | Instance Type | Notes |
|---|---|---|---|---|
| Llama 4 Scout | 109B (17B active) | H100 80GB (FP8) | Dedicated | MoE, 16 experts |
| Llama 4 Maverick | 400B (17B active) | 8× H200 141GB | Cluster | Requires multi-GPU |
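Maverick does not fit on a single GPU, so vLLM must shard it across the node with tensor parallelism (`--tensor-parallel-size`). A minimal launch sketch for an 8-GPU node follows; the model repo id shown is an assumption based on Meta's published FP8 checkpoint naming, so confirm the exact Hugging Face id before use:

```bash
# Sketch: Llama 4 Maverick sharded across 8 GPUs with tensor parallelism.
# The --model repo id is assumed; verify it on Hugging Face first.
HF_TOKEN=<your-hf-token> python3 -m vllm.entrypoints.openai.api_server \
  --model meta-llama/Llama-4-Maverick-17B-128E-Instruct-FP8 \
  --port 8000 \
  --tensor-parallel-size 8 \
  --gpu-memory-utilization 0.95
```

`--tensor-parallel-size` must match the number of GPUs visible to the process; vLLM splits each layer's weights evenly across them.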
## Manual setup
Use these steps to set up the server manually after SSH-ing into your instance. This works on any provider regardless of cloud-init support.
### Step 1: Connect to your instance

```bash
ssh <user>@<ipAddress>
```

Replace `<user>` with the username shown in the instance details panel (e.g., `ubuntu` for Spheron AI instances) and `<ipAddress>` with your instance's public IP.
### Step 2: Install vLLM

```bash
sudo apt-get update -y
sudo apt-get install -y python3-pip
pip install vllm
```

### Step 3: Start the server
Run the server in the foreground to verify it works:
```bash
HF_TOKEN=<your-hf-token> python3 -m vllm.entrypoints.openai.api_server \
  --model meta-llama/Llama-4-Scout-17B-16E-Instruct \
  --port 8000 \
  --quantization fp8 \
  --gpu-memory-utilization 0.95
```

Press `Ctrl+C` to stop. Replace `<your-hf-token>` with your Hugging Face token. Note that FP8 is a quantization scheme, not a dtype: vLLM's `--dtype` flag only accepts values like `auto`, `float16`, or `bfloat16`, so FP8 weights are requested with `--quantization fp8`.
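Once the server finishes loading the model, you can confirm it is serving from a second SSH session (assuming the default port 8000 used above):

```bash
# List the models the server is exposing
curl http://localhost:8000/v1/models

# Send a minimal chat completion request
curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "meta-llama/Llama-4-Scout-17B-16E-Instruct",
    "messages": [{"role": "user", "content": "Say hello."}]
  }'
```

Both endpoints are part of vLLM's OpenAI-compatible API, so any OpenAI client can be pointed at them.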
### Step 4: Run as a background service
To keep the server running after you close your SSH session, create a restricted token file and a systemd service:
```bash
# Store the token in a file readable only by root
sudo mkdir -p /etc/vllm
sudo install -m 600 /dev/null /etc/vllm/hf-token
echo "HF_TOKEN=<your-hf-token>" | sudo tee /etc/vllm/hf-token > /dev/null
```
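Before wiring the file into systemd, you can confirm it is readable by root only:

```bash
# Expect "600 root root": mode 600, owned by root
sudo stat -c '%a %U %G' /etc/vllm/hf-token
```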
```bash
sudo tee /etc/systemd/system/vllm-llama4.service > /dev/null << 'EOF'
[Unit]
Description=Llama 4 Scout vLLM Inference Server
After=network.target

[Service]
Type=simple
EnvironmentFile=/etc/vllm/hf-token
ExecStart=/usr/bin/python3 -m vllm.entrypoints.openai.api_server \
  --model meta-llama/Llama-4-Scout-17B-16E-Instruct \
  --port 8000 \
  --quantization fp8 \
  --gpu-memory-utilization 0.95
Restart=on-failure
RestartSec=10

[Install]
WantedBy=multi-user.target
EOF
```
```bash
sudo systemctl daemon-reload
sudo systemctl enable vllm-llama4
sudo systemctl start vllm-llama4
```

Replace `<your-hf-token>` with your Hugging Face token. Using `EnvironmentFile=` with a mode-600 file prevents other local users from reading the token via `systemctl show`.
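With the unit installed, standard systemd tooling applies for checking health and following logs (unit name `vllm-llama4` from above):

```bash
# Is the service running?
sudo systemctl status vllm-llama4

# Follow the server logs; the model download and startup can take several minutes
sudo journalctl -u vllm-llama4 -f
```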
## Accessing the server

### SSH tunnel
```bash
ssh -L 8000:localhost:8000 <user>@<ipAddress>
```

### Usage example: multimodal image input
```python
import base64

from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="not-needed")

with open("image.jpg", "rb") as f:
    image_b64 = base64.b64encode(f.read()).decode()

response = client.chat.completions.create(
    model="meta-llama/Llama-4-Scout-17B-16E-Instruct",
    messages=[
        {
            "role": "user",
            "content": [
                {"type": "image_url", "image_url": {"url": f"data:image/jpeg;base64,{image_b64}"}},
                {"type": "text", "text": "What is in this image?"},
            ],
        }
    ],
)
print(response.choices[0].message.content)
```

## Cloud-init startup script (optional)
If your provider supports cloud-init, you can paste this into the Startup Script field when deploying to automate the setup above.
### Llama 4 Scout (H100, FP8)
```yaml
#cloud-config
write_files:
  - path: /etc/systemd/system/vllm-llama4.service
    content: |
      [Unit]
      Description=Llama 4 Scout vLLM Inference Server
      After=network.target

      [Service]
      Type=simple
      EnvironmentFile=/etc/vllm/hf-token
      ExecStart=/usr/bin/python3 -m vllm.entrypoints.openai.api_server \
        --model meta-llama/Llama-4-Scout-17B-16E-Instruct \
        --port 8000 \
        --quantization fp8 \
        --gpu-memory-utilization 0.95
      Restart=on-failure
      RestartSec=10

      [Install]
      WantedBy=multi-user.target
runcmd:
  - apt-get update -y
  - apt-get install -y python3-pip
  - pip install vllm
  - mkdir -p /etc/vllm
  - install -m 600 /dev/null /etc/vllm/hf-token
  - echo "HF_TOKEN=<your-hf-token>" >> /etc/vllm/hf-token
  - systemctl daemon-reload
  - systemctl enable vllm-llama4
  - systemctl start vllm-llama4
```

Replace `<your-hf-token>` with your Hugging Face token.
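After the instance boots, you can verify the startup script ran (assuming a cloud-init enabled image):

```bash
# Block until cloud-init finishes, then report its status
cloud-init status --wait

# Inspect the first-boot log if something failed
sudo less /var/log/cloud-init-output.log

# Confirm the inference service came up
systemctl is-active vllm-llama4
```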
## What's next
- Llama 3.1 / 3.2 / 3.3: Previous Llama generation guides
- vLLM Inference Server: vLLM configuration details
- Instance Types: H100/H200 cluster requirements
- Multimodal Models: Other vision-language model guides