
Llama 3.1 / 3.2 / 3.3

Deploy Meta's Llama 3 family on Spheron GPU instances using vLLM. The series spans models from 8B to 405B parameters, with strong instruction-following and function-calling capabilities.

Recommended hardware

| Model | Recommended GPU | Instance Type | Notes |
| --- | --- | --- | --- |
| Llama 3.1/3.2/3.3 8B | RTX 4090 (24GB) | Dedicated or Spot | Single-GPU |
| Llama 3.1 70B | 2× A100 80GB | Dedicated | --tensor-parallel-size 2 |
| Llama 3.1 405B | 8× H100 80GB | Cluster | --tensor-parallel-size 8 --dtype fp8 |
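The hardware picks above follow from a rough back-of-the-envelope estimate of weight memory (parameters × bytes per parameter); KV cache and activations add overhead on top, which is why each tier leaves headroom:

```python
def weight_gb(params_billion: float, bytes_per_param: float) -> float:
    """Approximate model weight footprint in GB (weights only, no KV cache)."""
    # 1 billion params at 1 byte each is ~1 GB (decimal)
    return params_billion * bytes_per_param

# BF16 = 2 bytes/param, FP8 = 1 byte/param
print(weight_gb(8, 2))    # 16 GB  -> fits a 24 GB RTX 4090
print(weight_gb(70, 2))   # 140 GB -> needs 2x A100 80GB (160 GB total)
print(weight_gb(405, 2))  # 810 GB -> exceeds 8x H100 80GB (640 GB total)
print(weight_gb(405, 1))  # 405 GB -> the FP8 variant fits on one 8x H100 node
```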

Manual setup

Use these steps to set up the server manually after SSH-ing into your instance. This works on any provider regardless of cloud-init support.

Step 1: Connect to your instance

ssh <user>@<ipAddress>

Replace <user> with the username shown in the instance details panel (e.g., ubuntu for Spheron AI instances) and <ipAddress> with your instance's public IP.

Step 2: Install vLLM

sudo apt-get update -y
sudo apt-get install -y python3-pip
pip install vllm

Step 3: Start the server

Run the server in the foreground to verify it works:

HF_TOKEN=<your-hf-token> python3 -m vllm.entrypoints.openai.api_server \
  --model meta-llama/Meta-Llama-3.1-8B-Instruct \
  --port 8000 \
  --dtype bfloat16

Press Ctrl+C to stop. Replace <your-hf-token> with your Hugging Face access token; the meta-llama repositories are gated, so the token's account must have accepted Meta's license for the model.
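vLLM can take several minutes to download and load weights before the API starts answering. A minimal readiness probe, assuming the default port 8000 used above:

```python
import socket

def server_ready(host: str, port: int, timeout: float = 2.0) -> bool:
    """Return True if something is accepting TCP connections on host:port."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False

if __name__ == "__main__":
    print("vLLM listening:", server_ready("localhost", 8000))
```

Once this reports True, a GET to http://localhost:8000/v1/models should list the loaded model.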

Step 4: Run as a background service

To keep the server running after you close your SSH session, create a restricted token file and a systemd service:

# Store the token in a file readable only by root
sudo mkdir -p /etc/vllm
sudo install -m 600 /dev/null /etc/vllm/hf-token
echo "HF_TOKEN=<your-hf-token>" | sudo tee -a /etc/vllm/hf-token > /dev/null
 
sudo tee /etc/systemd/system/vllm-llama3.service > /dev/null << 'EOF'
[Unit]
Description=Llama 3 vLLM Inference Server
After=network.target
 
[Service]
Type=simple
EnvironmentFile=/etc/vllm/hf-token
ExecStart=/usr/bin/python3 -m vllm.entrypoints.openai.api_server \
  --model meta-llama/Meta-Llama-3.1-8B-Instruct \
  --port 8000 \
  --dtype bfloat16
Restart=on-failure
RestartSec=10
 
[Install]
WantedBy=multi-user.target
EOF
 
sudo systemctl daemon-reload
sudo systemctl enable vllm-llama3
sudo systemctl start vllm-llama3

Replace <your-hf-token> with your Hugging Face token. Keeping the token in a root-only (mode 600) EnvironmentFile=, rather than an inline Environment= directive, prevents other local users from reading it via systemctl show. Check the service with sudo systemctl status vllm-llama3 and follow its logs with sudo journalctl -u vllm-llama3 -f.

Llama 3.1 70B (2× A100 80GB)

For the 70B model, replace the ExecStart command with:

/usr/bin/python3 -m vllm.entrypoints.openai.api_server \
  --model meta-llama/Meta-Llama-3.1-70B-Instruct \
  --port 8000 \
  --dtype bfloat16 \
  --tensor-parallel-size 2

Llama 3.1 405B (8× H100)

The 405B model's BF16 weights alone take roughly 810 GB (405B parameters × 2 bytes), so it cannot run on a single 8× H100 node. Use the official FP8 quantized variant (~405 GB of weights), which fits within the 640 GB of total VRAM across 8× H100 80GB.

/usr/bin/python3 -m vllm.entrypoints.openai.api_server \
  --model meta-llama/Meta-Llama-3.1-405B-Instruct-FP8 \
  --port 8000 \
  --dtype fp8 \
  --tensor-parallel-size 8 \
  --gpu-memory-utilization 0.95

Accessing the server

SSH tunnel

ssh -L 8000:localhost:8000 <user>@<ipAddress>

This forwards local port 8000 to the instance, so requests to http://localhost:8000 on your machine reach the server without exposing it publicly.

Usage example: function calling (Llama 3.1/3.3)

Note: for tool_choice="auto" to return structured tool calls, vLLM generally needs the server started with --enable-auto-tool-choice and a matching parser (for Llama 3.1, --tool-call-parser llama3_json).

from openai import OpenAI
import json
 
client = OpenAI(base_url="http://localhost:8000/v1", api_key="not-needed")
 
tools = [
    {
        "type": "function",
        "function": {
            "name": "get_weather",
            "description": "Get current weather for a city",
            "parameters": {
                "type": "object",
                "properties": {"city": {"type": "string"}},
                "required": ["city"],
            },
        },
    }
]
 
response = client.chat.completions.create(
    model="meta-llama/Meta-Llama-3.1-8B-Instruct",
    messages=[{"role": "user", "content": "What is the weather in Paris?"}],
    tools=tools,
    tool_choice="auto",
)
 
if response.choices[0].message.tool_calls:
    tool_call = response.choices[0].message.tool_calls[0]
    print(f"Function: {tool_call.function.name}")
    print(f"Arguments: {tool_call.function.arguments}")
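The server only emits the tool call; your code is responsible for executing it. A minimal sketch of dispatching the call to a local stub (get_weather here is a hypothetical local implementation, not part of vLLM):

```python
import json

def get_weather(city: str) -> dict:
    # A real app would call a weather API; stubbed for illustration
    return {"city": city, "forecast": "unknown"}

TOOLS = {"get_weather": get_weather}

def dispatch(name: str, arguments: str) -> dict:
    """Decode the JSON arguments string from a tool call and invoke the tool."""
    return TOOLS[name](**json.loads(arguments))

result = dispatch("get_weather", '{"city": "Paris"}')
print(result)  # {'city': 'Paris', 'forecast': 'unknown'}
```

In a full loop you would append the result as a role="tool" message and call the model again so it can phrase a final answer.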

Cloud-init startup script (optional)

If your provider supports cloud-init, you can paste this into the Startup Script field when deploying to automate the setup above.

Llama 3.1 8B (RTX 4090)

#cloud-config
runcmd:
  - apt-get update -y
  - apt-get install -y python3-pip
  - pip install vllm
  - mkdir -p /etc/vllm
  - install -m 600 /dev/null /etc/vllm/hf-token
  - echo "HF_TOKEN=<your-hf-token>" >> /etc/vllm/hf-token
  - |
    cat > /etc/systemd/system/vllm-llama3.service << 'EOF'
    [Unit]
    Description=Llama 3 vLLM Inference Server
    After=network.target
 
    [Service]
    Type=simple
    EnvironmentFile=/etc/vllm/hf-token
    ExecStart=/usr/bin/python3 -m vllm.entrypoints.openai.api_server \
      --model meta-llama/Meta-Llama-3.1-8B-Instruct \
      --port 8000 \
      --dtype bfloat16
    Restart=on-failure
    RestartSec=10
 
    [Install]
    WantedBy=multi-user.target
    EOF
  - systemctl daemon-reload
  - systemctl enable vllm-llama3
  - systemctl start vllm-llama3

Llama 3.1 70B (2× A100 80GB)

Replace the ExecStart line with:

/usr/bin/python3 -m vllm.entrypoints.openai.api_server \
  --model meta-llama/Meta-Llama-3.1-70B-Instruct \
  --port 8000 \
  --dtype bfloat16 \
  --tensor-parallel-size 2

Llama 3.1 405B (8× H100)

/usr/bin/python3 -m vllm.entrypoints.openai.api_server \
  --model meta-llama/Meta-Llama-3.1-405B-Instruct-FP8 \
  --port 8000 \
  --dtype fp8 \
  --tensor-parallel-size 8 \
  --gpu-memory-utilization 0.95

What's next