Gemma 3
Deploy Gemma 3 from Google DeepMind on Spheron GPU instances using vLLM. Gemma 3 is released under the Gemma Terms of Use, which permits commercial use and modification after accepting Google's license agreement.
Recommended hardware
| Model | Recommended GPU | Instance Type | Notes |
|---|---|---|---|
| Gemma-3-4B-IT | Any 8GB GPU | Spot | Lightweight |
| Gemma-3-12B-IT | RTX 4090 (24GB) | Dedicated or Spot | Full precision |
| Gemma-3-27B-IT | A100 80GB | Dedicated | Full precision |
| Gemma-3-27B-IT (INT4) | RTX 4090 (24GB) | Dedicated or Spot | AWQ 4-bit |
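The INT4 row assumes a pre-quantized AWQ checkpoint. As a rough sketch, the launch command mirrors Step 3 below but adds vLLM's --quantization awq flag; the model ID here is a placeholder, so substitute a published AWQ quantization of Gemma-3-27B-IT from the HuggingFace Hub:

```bash
# Sketch: serving a 4-bit AWQ quantization of Gemma-3-27B-IT on a 24GB GPU.
# <awq-repo>/gemma-3-27b-it-awq is a placeholder, not a real repository name.
HF_TOKEN=<your-hf-token> python3 -m vllm.entrypoints.openai.api_server \
  --model <awq-repo>/gemma-3-27b-it-awq \
  --quantization awq \
  --port 8000
```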
Manual setup
Use these steps to set up the server manually after SSH-ing into your instance. This works on any provider regardless of cloud-init support.
Step 1: Connect to your instance
```bash
ssh <user>@<ipAddress>
```
Replace <user> with the username shown in the instance details panel (e.g., ubuntu for Spheron AI instances) and <ipAddress> with your instance's public IP.
Step 2: Install vLLM
```bash
sudo apt-get update -y
sudo apt-get install -y python3-pip
pip install vllm
```
Step 3: Start the server
Run the server in the foreground to verify it works:
```bash
HF_TOKEN=<your-hf-token> python3 -m vllm.entrypoints.openai.api_server \
  --model google/gemma-3-12b-it \
  --port 8000 \
  --dtype bfloat16
```
Press Ctrl+C to stop. Replace <your-hf-token> with a HuggingFace token for an account that has accepted the Gemma license; the token is required because the model repository is gated. For Gemma-3-27B on an A100 80GB, replace the model with google/gemma-3-27b-it.
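Once the weights finish loading, you can sanity-check the server from a second terminal on the instance. A minimal check against the OpenAI-compatible models endpoint:

```bash
# Should return a JSON model list containing google/gemma-3-12b-it
curl http://localhost:8000/v1/models
```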
Step 4: Run as a background service
To keep the server running after you close your SSH session, create a restricted token file and a systemd service:
```bash
# Store the token in a file readable only by root
sudo mkdir -p /etc/vllm
sudo install -m 600 /dev/null /etc/vllm/hf-token
echo "HF_TOKEN=<your-hf-token>" | sudo tee -a /etc/vllm/hf-token > /dev/null
```
```bash
sudo tee /etc/systemd/system/vllm-gemma3.service > /dev/null << 'EOF'
[Unit]
Description=Gemma 3 vLLM Inference Server
After=network.target

[Service]
Type=simple
EnvironmentFile=/etc/vllm/hf-token
ExecStart=/usr/bin/python3 -m vllm.entrypoints.openai.api_server \
  --model google/gemma-3-12b-it \
  --port 8000 \
  --dtype bfloat16
Restart=on-failure
RestartSec=10

[Install]
WantedBy=multi-user.target
EOF

sudo systemctl daemon-reload
sudo systemctl enable vllm-gemma3
sudo systemctl start vllm-gemma3
```
Replace <your-hf-token> with your HuggingFace token. Using EnvironmentFile= with a mode-600 file prevents other local users from reading the token via systemctl show. Note that the unit runs as root: if pip install vllm in Step 2 ran as a regular user, either reinstall system-wide (sudo pip install vllm) or add User=<user> to the [Service] section so /usr/bin/python3 can find the vllm package.
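Standard systemd tooling covers monitoring, for example:

```bash
# Check that the service is active
sudo systemctl status vllm-gemma3

# Follow the vLLM logs (model loading can take several minutes)
sudo journalctl -u vllm-gemma3 -f
```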
Accessing the server
SSH tunnel
```bash
ssh -L 8000:localhost:8000 <user>@<ipAddress>
```
Run this from your local machine; requests to localhost:8000 are then forwarded to port 8000 on the instance.
Usage example
```python
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="not-needed")
response = client.chat.completions.create(
    model="google/gemma-3-12b-it",
    messages=[{"role": "user", "content": "Explain the Apache 2.0 license in one paragraph."}],
)
print(response.choices[0].message.content)
```
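The same endpoint also works without the Python client. As a sketch, here is the equivalent raw HTTP request through the tunnel, with streaming enabled:

```bash
curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
        "model": "google/gemma-3-12b-it",
        "messages": [{"role": "user", "content": "Say hello in one sentence."}],
        "stream": true
      }'
```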
Cloud-init startup script (optional)
If your provider supports cloud-init, you can paste this into the Startup Script field when deploying to automate the setup above.
Gemma-3-12B (RTX 4090)
```yaml
#cloud-config
write_files:
  - path: /etc/systemd/system/vllm-gemma3.service
    content: |
      [Unit]
      Description=Gemma 3 vLLM Inference Server
      After=network.target

      [Service]
      Type=simple
      EnvironmentFile=/etc/vllm/hf-token
      ExecStart=/usr/bin/python3 -m vllm.entrypoints.openai.api_server \
        --model google/gemma-3-12b-it \
        --port 8000 \
        --dtype bfloat16
      Restart=on-failure
      RestartSec=10

      [Install]
      WantedBy=multi-user.target

runcmd:
  - apt-get update -y
  - apt-get install -y python3-pip
  - pip install vllm
  - mkdir -p /etc/vllm
  - install -m 600 /dev/null /etc/vllm/hf-token
  - echo "HF_TOKEN=<your-hf-token>" >> /etc/vllm/hf-token
  - systemctl daemon-reload
  - systemctl enable vllm-gemma3
  - systemctl start vllm-gemma3
```
Replace <your-hf-token> with your HuggingFace token.
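Cloud-init runs on first boot, so the model download may still be in progress when you first SSH in. One way to check, assuming a standard cloud-init installation:

```bash
# Blocks until cloud-init finishes, then reports the result
cloud-init status --wait

# Inspect the run's output if something failed
sudo tail -f /var/log/cloud-init-output.log
```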
Gemma-3-27B (A100 80GB)
Replace the model in ExecStart with google/gemma-3-27b-it.
What's next
- vLLM Inference Server: vLLM configuration details
- Phi-4 & Phi-4 Multimodal: Another efficient small model option
- Instance Types: GPU selection for small models
- Cost Optimization: Spot instances for Gemma 3