TensorRT-LLM + Triton Inference Server
Deploy TensorRT-LLM with Triton Inference Server on Spheron H100 instances for maximum NVIDIA GPU throughput. TensorRT-LLM compiles model weights into an optimized engine before inference, yielding best-in-class token generation rates.
Recommended hardware
| Model Size | Recommended GPU | Instance Type | Notes |
|---|---|---|---|
| 7B–13B | H100 80GB (1×) | Dedicated | FP8 precision, highest throughput |
| 30B–70B | H100 80GB (2× or 4×) | Cluster | Multi-GPU engine with --tp_size |
| 70B+ | H100 NVLink (8×) | Cluster | Requires NVLink for tensor parallelism |
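As a rough sanity check on the table above: FP16 weights cost about two bytes per parameter, and the tensor-parallel degree must leave per-GPU headroom for KV cache and activations. A minimal sketch of that arithmetic (the 60% weight-budget fraction is an illustrative assumption, not a TensorRT-LLM setting):

```python
import math

def min_tp_size(params_billion: float, bytes_per_param: int = 2,
                gpu_mem_gb: int = 80, weight_fraction: float = 0.6) -> int:
    """Smallest power-of-two GPU count whose per-GPU share of the model
    weights fits within weight_fraction of each GPU's memory."""
    weights_gb = params_billion * bytes_per_param  # 1e9 params * bytes/param ~= GB
    budget_gb = gpu_mem_gb * weight_fraction
    needed = max(1, math.ceil(weights_gb / budget_gb))
    # Tensor-parallel degrees are powers of two in practice
    return 2 ** math.ceil(math.log2(needed))

print(min_tp_size(8))    # 8B in FP16 fits on a single H100 80GB
print(min_tp_size(70))   # 70B in FP16 wants 4-way tensor parallelism
```

This matches the table: 7B-13B models fit on one H100, while 70B-class models need 2-4 GPUs with `--tp_size` set accordingly.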
Manual setup
Use these steps to set up the server manually after SSH-ing into your instance. This works on any provider regardless of cloud-init support.
Step 1: Connect to your instance
ssh <user>@<ipAddress>

Replace <user> with the username shown in the instance details panel in the dashboard (e.g., ubuntu for Spheron AI instances) and <ipAddress> with your instance's public IP.
Step 2: Install dependencies
sudo apt-get update -y
sudo apt-get install -y docker.io nvidia-container-toolkit git python3-pip
sudo nvidia-ctk runtime configure --runtime=docker
sudo systemctl restart docker

Step 3: Pull the TensorRT-LLM Docker image
docker pull nvcr.io/nvidia/tritonserver:25.08-trtllm-python-py3

Step 4: Prepare directories and download model
sudo mkdir -p /opt/trtllm/engines /opt/trtllm/model_repo /opt/trtllm/hf_model
pip install huggingface_hub
# Store the token in a file readable only by your user
sudo mkdir -p /etc/trtllm
sudo install -m 600 -o "$USER" /dev/null /etc/trtllm/hf-token
echo "export HF_TOKEN=<your_hf_token>" | sudo tee /etc/trtllm/hf-token > /dev/null
# Source the file so HF_TOKEN is exported into the environment of huggingface-cli
. /etc/trtllm/hf-token
huggingface-cli download meta-llama/Meta-Llama-3.1-8B-Instruct \
--local-dir /opt/trtllm/hf_model

Replace <your_hf_token> with your HuggingFace token. The token lives in /etc/trtllm/hf-token (mode 600) and is sourced via . /etc/trtllm/hf-token, so it never appears inline on the command line.
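The same download can also be scripted with the huggingface_hub Python API instead of the CLI. A minimal sketch that reads the token file written above (load_hf_token is a hypothetical helper written here, not part of huggingface_hub; it accepts both `HF_TOKEN=...` and `export HF_TOKEN=...` line formats):

```python
def load_hf_token(path: str = "/etc/trtllm/hf-token") -> str:
    """Parse the token out of a KEY=value line such as 'export HF_TOKEN=hf_xxx'."""
    with open(path) as f:
        for line in f:
            line = line.strip().removeprefix("export ").strip()
            if line.startswith("HF_TOKEN="):
                return line.split("=", 1)[1]
    raise ValueError(f"no HF_TOKEN entry in {path}")

if __name__ == "__main__":
    # snapshot_download mirrors the whole repo into local_dir,
    # equivalent to the huggingface-cli download command above
    from huggingface_hub import snapshot_download
    snapshot_download(
        repo_id="meta-llama/Meta-Llama-3.1-8B-Instruct",
        local_dir="/opt/trtllm/hf_model",
        token=load_hf_token(),
    )
```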
Step 5: Build the TensorRT engine
docker run --rm --gpus all \
-v /opt/trtllm:/workspace \
nvcr.io/nvidia/tritonserver:25.08-trtllm-python-py3 \
bash -c "
python3 -m tensorrt_llm.commands.build \
--model_dir /workspace/hf_model \
--output_dir /workspace/engines/llama-8b \
--dtype float16 \
--tp_size 1
"

Step 6: Configure Triton model repository
git clone --depth 1 https://github.com/triton-inference-server/tensorrtllm_backend.git /opt/trtllm/tensorrtllm_backend
cp -r /opt/trtllm/tensorrtllm_backend/all_models/inflight_batcher_llm/* /opt/trtllm/model_repo/
docker run --rm \
-v /opt/trtllm:/workspace \
nvcr.io/nvidia/tritonserver:25.08-trtllm-python-py3 \
bash -c "
python3 /workspace/tensorrtllm_backend/tools/fill_template.py \
--in_place \
/workspace/model_repo/tensorrt_llm/config.pbtxt \
'decoupled_mode:false,engine_dir:/engines/llama-8b,max_tokens_in_paged_kv_cache:2560,max_attention_window_size:2560,kv_cache_free_gpu_mem_fraction:0.5,exclude_input_in_output:true,enable_kv_cache_reuse:false,batching_strategy:inflight_fused_batching,max_beam_width:1' && \
python3 /workspace/tensorrtllm_backend/tools/fill_template.py \
--in_place \
/workspace/model_repo/preprocessing/config.pbtxt \
'tokenizer_dir:/hf_model,tokenizer_type:auto,triton_max_batch_size:64,preprocessing_instance_count:1' && \
python3 /workspace/tensorrtllm_backend/tools/fill_template.py \
--in_place \
/workspace/model_repo/postprocessing/config.pbtxt \
'tokenizer_dir:/hf_model,tokenizer_type:auto,triton_max_batch_size:64,postprocessing_instance_count:1' && \
python3 /workspace/tensorrtllm_backend/tools/fill_template.py \
--in_place \
/workspace/model_repo/ensemble/config.pbtxt \
'triton_max_batch_size:64' && \
python3 /workspace/tensorrtllm_backend/tools/fill_template.py \
--in_place \
/workspace/model_repo/tensorrt_llm_bls/config.pbtxt \
'triton_max_batch_size:64,decoupled_mode:false,bls_instance_count:1,accumulate_tokens:false'
"

Step 7: Start Triton server
docker run -d --gpus all \
-p 8000:8000 -p 8001:8001 -p 8002:8002 \
-v /opt/trtllm/model_repo:/opt/tritonserver/model_repo \
-v /opt/trtllm/engines:/engines \
-v /opt/trtllm/hf_model:/hf_model \
--name triton \
nvcr.io/nvidia/tritonserver:25.08-trtllm-python-py3 \
tritonserver --model-repository=/opt/tritonserver/model_repo

Monitor startup:
docker logs -f triton

Accessing the server
SSH tunnel
ssh -L 8000:localhost:8000 <user>@<ipAddress>

Health check
curl http://localhost:8000/v2/health/ready

Usage example
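Engine loading can take a while after container start, so before sending generation requests you may want to block until the server reports ready. A minimal polling sketch using only the Python standard library (the 120-second default is an arbitrary choice):

```python
import time
import urllib.error
import urllib.request

def wait_until_ready(url: str = "http://localhost:8000/v2/health/ready",
                     timeout_s: float = 120.0, interval_s: float = 2.0) -> bool:
    """Poll Triton's readiness endpoint until it returns HTTP 200 or we give up."""
    deadline = time.monotonic() + timeout_s
    while time.monotonic() < deadline:
        try:
            with urllib.request.urlopen(url, timeout=5) as resp:
                if resp.status == 200:
                    return True
        except (urllib.error.URLError, OSError):
            pass  # server not up yet, or the engine is still loading
        time.sleep(interval_s)
    return False
```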
import requests
response = requests.post(
"http://localhost:8000/v2/models/ensemble/generate",
json={
"text_input": "Explain GPU tensor parallelism.",
"max_tokens": 200,
"bad_words": "",
"stop_words": "",
},
)
response.raise_for_status()
print(response.json()["text_output"])

Engine build flags
| Flag | Description |
|---|---|
| --dtype | Weight precision: float16, bfloat16, float32 (FP8 requires a pre-quantized checkpoint rather than a --dtype value) |
| --tp_size | Tensor parallel degree (match GPU count) |
| --max_batch_size | Maximum concurrent requests |
| --max_input_len | Maximum input sequence length in tokens |
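These flags interact with the KV-cache settings passed to fill_template earlier (max_tokens_in_paged_kv_cache, kv_cache_free_gpu_mem_fraction): each cached token costs 2 (K and V) × layers × KV heads × head dim × bytes per element. A back-of-the-envelope sketch using Llama-3.1-8B's published architecture (32 layers, 8 KV heads via GQA, head dim 128); the 60 GB free-memory figure is an assumption, not a measured value:

```python
def kv_bytes_per_token(n_layers: int, n_kv_heads: int, head_dim: int,
                       bytes_per_elem: int = 2) -> int:
    """Bytes of KV cache one token occupies: a K and a V vector per layer."""
    return 2 * n_layers * n_kv_heads * head_dim * bytes_per_elem

# Llama-3.1-8B with an FP16 cache
per_token = kv_bytes_per_token(32, 8, 128)
print(per_token)  # 131072 bytes, i.e. 128 KiB per cached token

# Tokens that fit if the cache may use 50% of free GPU memory
# (kv_cache_free_gpu_mem_fraction:0.5), assuming ~60 GB free on an
# 80 GB H100 after loading the 8B FP16 weights
free_bytes = 60 * 1024**3
print(int(free_bytes * 0.5 / per_token))  # 245760 tokens
```

This is why the template above caps max_tokens_in_paged_kv_cache explicitly: raising it trades KV-cache capacity (longer contexts, more concurrent sequences) against memory left for activations.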
Check container logs
docker logs -f triton

Cloud-init startup script (optional)
If your provider supports cloud-init, you can paste this into the Startup Script field when deploying. The script pulls the NVIDIA Triton + TensorRT-LLM Docker image, builds an engine for Llama-3.1-8B-Instruct in FP16, configures the Triton model repository from the official tensorrtllm_backend templates, and starts the Triton HTTP server on port 8000.
#cloud-config
runcmd:
- apt-get update -y
- apt-get install -y docker.io nvidia-container-toolkit git python3-pip
- nvidia-ctk runtime configure --runtime=docker
- systemctl restart docker
- docker pull nvcr.io/nvidia/tritonserver:25.08-trtllm-python-py3
- mkdir -p /opt/trtllm/engines /opt/trtllm/model_repo /opt/trtllm/hf_model /etc/trtllm
- pip install huggingface_hub
- install -m 600 /dev/null /etc/trtllm/hf-token
  - echo "export HF_TOKEN=<your_hf_token>" > /etc/trtllm/hf-token
- sh -c '. /etc/trtllm/hf-token && huggingface-cli download meta-llama/Meta-Llama-3.1-8B-Instruct --local-dir /opt/trtllm/hf_model'
- |
docker run --rm --gpus all \
-v /opt/trtllm:/workspace \
nvcr.io/nvidia/tritonserver:25.08-trtllm-python-py3 \
bash -c "
python3 -m tensorrt_llm.commands.build \
--model_dir /workspace/hf_model \
--output_dir /workspace/engines/llama-8b \
--dtype float16 \
--tp_size 1
"
- git clone --depth 1 https://github.com/triton-inference-server/tensorrtllm_backend.git /opt/trtllm/tensorrtllm_backend
- cp -r /opt/trtllm/tensorrtllm_backend/all_models/inflight_batcher_llm/* /opt/trtllm/model_repo/
- |
docker run --rm \
-v /opt/trtllm:/workspace \
nvcr.io/nvidia/tritonserver:25.08-trtllm-python-py3 \
bash -c "
python3 /workspace/tensorrtllm_backend/tools/fill_template.py \
--in_place \
/workspace/model_repo/tensorrt_llm/config.pbtxt \
'decoupled_mode:false,engine_dir:/engines/llama-8b,max_tokens_in_paged_kv_cache:2560,max_attention_window_size:2560,kv_cache_free_gpu_mem_fraction:0.5,exclude_input_in_output:true,enable_kv_cache_reuse:false,batching_strategy:inflight_fused_batching,max_beam_width:1' && \
python3 /workspace/tensorrtllm_backend/tools/fill_template.py \
--in_place \
/workspace/model_repo/preprocessing/config.pbtxt \
'tokenizer_dir:/hf_model,tokenizer_type:auto,triton_max_batch_size:64,preprocessing_instance_count:1' && \
python3 /workspace/tensorrtllm_backend/tools/fill_template.py \
--in_place \
/workspace/model_repo/postprocessing/config.pbtxt \
'tokenizer_dir:/hf_model,tokenizer_type:auto,triton_max_batch_size:64,postprocessing_instance_count:1' && \
python3 /workspace/tensorrtllm_backend/tools/fill_template.py \
--in_place \
/workspace/model_repo/ensemble/config.pbtxt \
'triton_max_batch_size:64' && \
python3 /workspace/tensorrtllm_backend/tools/fill_template.py \
--in_place \
/workspace/model_repo/tensorrt_llm_bls/config.pbtxt \
'triton_max_batch_size:64,decoupled_mode:false,bls_instance_count:1,accumulate_tokens:false'
"
- |
docker run -d --gpus all \
-p 8000:8000 -p 8001:8001 -p 8002:8002 \
-v /opt/trtllm/model_repo:/opt/tritonserver/model_repo \
-v /opt/trtllm/engines:/engines \
-v /opt/trtllm/hf_model:/hf_model \
--name triton \
nvcr.io/nvidia/tritonserver:25.08-trtllm-python-py3 \
tritonserver --model-repository=/opt/tritonserver/model_repo

What's next
- vLLM Inference Server: Easier setup for most use cases
- Inference Frameworks: Compare all serving stacks
- Instance Types: H100 NVLink cluster requirements
- Networking: SSH tunneling and port access