
TensorRT-LLM + Triton Inference Server

Deploy TensorRT-LLM with Triton Inference Server on Spheron H100 instances for maximum throughput on NVIDIA GPUs. TensorRT-LLM compiles model weights into an optimized engine ahead of inference, yielding best-in-class token generation rates.

Recommended hardware

| Model Size | Recommended GPU | Instance Type | Notes |
|------------|-----------------|---------------|-------|
| 7B–13B | H100 80GB (1×) | Dedicated | FP8 precision, highest throughput |
| 30B–70B | H100 80GB (2× or 4×) | Cluster | Multi-GPU engine with --tp_size |
| 70B+ | H100 NVLink (8×) | Cluster | Requires NVLink for tensor parallelism |
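The sizing above follows from a rough rule of thumb: weight memory is parameter count times bytes per parameter, plus headroom for KV cache and activations. A back-of-envelope sketch (illustrative numbers, not an official sizing rule):

```python
# Back-of-envelope weight-memory estimate (excludes KV cache,
# activations, and CUDA context overhead -- real usage is higher).
def weight_memory_gib(num_params: float, bytes_per_param: float) -> float:
    return num_params * bytes_per_param / 1024**3

# A 70B model in FP16 (2 bytes/param) vs FP8 (1 byte/param):
fp16_gib = weight_memory_gib(70e9, 2)  # ~130 GiB: needs 2x or 4x H100 80GB
fp8_gib = weight_memory_gib(70e9, 1)   # ~65 GiB: weights alone fit one H100
```

FP8 roughly halves weight memory relative to FP16, which is part of why lower precision unlocks smaller GPU counts.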

Manual setup

Use these steps to set up the server manually after SSH-ing into your instance. This works on any provider regardless of cloud-init support.

Step 1: Connect to your instance

ssh <user>@<ipAddress>

Replace <user> with the username shown in the instance details panel in the dashboard (e.g., ubuntu for Spheron AI instances) and <ipAddress> with your instance's public IP.

Step 2: Install dependencies

sudo apt-get update -y
sudo apt-get install -y docker.io nvidia-container-toolkit git python3-pip
sudo nvidia-ctk runtime configure --runtime=docker
sudo systemctl restart docker

Step 3: Pull the TensorRT-LLM Docker image

docker pull nvcr.io/nvidia/tritonserver:25.08-trtllm-python-py3

Step 4: Prepare directories and download model

sudo mkdir -p /opt/trtllm/engines /opt/trtllm/model_repo /opt/trtllm/hf_model
pip install huggingface_hub
 
# Store the token in a file readable only by root
sudo mkdir -p /etc/trtllm
sudo install -m 600 /dev/null /etc/trtllm/hf-token
echo "export HF_TOKEN=<your_hf_token>" | sudo tee /etc/trtllm/hf-token > /dev/null
 
# Load and use the token without exposing it in the process list or shell history
. /etc/trtllm/hf-token
huggingface-cli download meta-llama/Meta-Llama-3.1-8B-Instruct \
  --local-dir /opt/trtllm/hf_model

Replace <your_hf_token> with your HuggingFace access token. The token is stored in /etc/trtllm/hf-token (mode 600, readable only by root) and loaded by sourcing the file with . /etc/trtllm/hf-token, so it never appears on the command line or in your shell history.
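If you later need the token from Python (for example, to pass it to huggingface_hub programmatically) rather than sourcing it in a shell, the same KEY=VALUE file can be parsed directly. A minimal sketch; load_env_file is an illustrative helper, not part of any library:

```python
# Read a KEY=VALUE env file such as /etc/trtllm/hf-token from Python.
# Tolerates blank lines, comments, and an optional "export " prefix.
def load_env_file(path: str) -> dict[str, str]:
    env = {}
    with open(path) as f:
        for line in f:
            line = line.strip()
            if not line or line.startswith("#") or "=" not in line:
                continue
            key, _, value = line.partition("=")
            env[key.removeprefix("export ").strip()] = value.strip()
    return env
```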

Step 5: Build the TensorRT engine

docker run --rm --gpus all \
  -v /opt/trtllm:/workspace \
  nvcr.io/nvidia/tritonserver:25.08-trtllm-python-py3 \
  bash -c "
    pip install tensorrt_llm -U && \
    python3 -m tensorrt_llm.commands.build \
      --model_dir /workspace/hf_model \
      --output_dir /workspace/engines/llama-8b \
      --dtype float16 \
      --tp_size 1
  "
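The build invocation above is just an argument list, so scripting variants (different dtype, tp_size, or extra flags from the table further down) is straightforward. A hedged sketch that assembles the same command as a Python argv list; flag names are taken verbatim from this guide:

```python
# Assemble the engine-build command from this guide as an argv list,
# suitable for subprocess.run(...) inside the container.
def trtllm_build_cmd(model_dir: str, output_dir: str,
                     dtype: str = "float16", tp_size: int = 1,
                     **extra: object) -> list[str]:
    cmd = [
        "python3", "-m", "tensorrt_llm.commands.build",
        "--model_dir", model_dir,
        "--output_dir", output_dir,
        "--dtype", dtype,
        "--tp_size", str(tp_size),
    ]
    for flag, value in extra.items():  # e.g. max_batch_size=64
        cmd += [f"--{flag}", str(value)]
    return cmd
```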

Step 6: Configure Triton model repository

git clone --depth 1 https://github.com/triton-inference-server/tensorrtllm_backend.git /opt/trtllm/tensorrtllm_backend
cp -r /opt/trtllm/tensorrtllm_backend/all_models/inflight_batcher_llm/* /opt/trtllm/model_repo/
 
docker run --rm \
  -v /opt/trtllm:/workspace \
  nvcr.io/nvidia/tritonserver:25.08-trtllm-python-py3 \
  bash -c "
    python3 /workspace/tensorrtllm_backend/tools/fill_template.py \
      --in_place \
      /workspace/model_repo/tensorrt_llm/config.pbtxt \
      'decoupled_mode:false,engine_dir:/engines/llama-8b,max_tokens_in_paged_kv_cache:2560,max_attention_window_size:2560,kv_cache_free_gpu_mem_fraction:0.5,exclude_input_in_output:true,enable_kv_cache_reuse:false,batching_strategy:inflight_fused_batching,max_beam_width:1' && \
    python3 /workspace/tensorrtllm_backend/tools/fill_template.py \
      --in_place \
      /workspace/model_repo/preprocessing/config.pbtxt \
      'tokenizer_dir:/hf_model,tokenizer_type:auto,triton_max_batch_size:64,preprocessing_instance_count:1' && \
    python3 /workspace/tensorrtllm_backend/tools/fill_template.py \
      --in_place \
      /workspace/model_repo/postprocessing/config.pbtxt \
      'tokenizer_dir:/hf_model,tokenizer_type:auto,triton_max_batch_size:64,postprocessing_instance_count:1' && \
    python3 /workspace/tensorrtllm_backend/tools/fill_template.py \
      --in_place \
      /workspace/model_repo/ensemble/config.pbtxt \
      'triton_max_batch_size:64' && \
    python3 /workspace/tensorrtllm_backend/tools/fill_template.py \
      --in_place \
      /workspace/model_repo/tensorrt_llm_bls/config.pbtxt \
      'triton_max_batch_size:64,decoupled_mode:false,bls_instance_count:1,accumulate_tokens:false'
  "
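fill_template.py takes its substitutions as a single comma-separated key:value string, which gets hard to eyeball at this length. A small sketch of how the format decomposes (assuming, as holds for every value above, that no value contains a comma); parse_params is an illustrative helper, not part of the backend tooling:

```python
# Decompose the 'key:value,key:value' substitution string passed to
# fill_template.py into a dict for inspection or reuse.
def parse_params(spec: str) -> dict[str, str]:
    # split on ":" at most once so values like "/engines/llama-8b" survive
    return dict(item.split(":", 1) for item in spec.split(","))
```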

Step 7: Start Triton server

docker run -d --gpus all \
  -p 8000:8000 -p 8001:8001 -p 8002:8002 \
  -v /opt/trtllm/model_repo:/opt/tritonserver/model_repo \
  -v /opt/trtllm/engines:/engines \
  -v /opt/trtllm/hf_model:/hf_model \
  --name triton \
  nvcr.io/nvidia/tritonserver:25.08-trtllm-python-py3 \
  tritonserver --model-repository=/opt/tritonserver/model_repo

Monitor startup:

docker logs -f triton

Accessing the server

SSH tunnel

ssh -L 8000:localhost:8000 <user>@<ipAddress>

Health check

curl http://localhost:8000/v2/health/ready
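Engine loading can take a while, so automation should poll the readiness endpoint rather than assume the server is up. A minimal, dependency-free sketch; the probe callable (in practice, an HTTP GET against /v2/health/ready that returns True on status 200) is injected so the loop itself stays simple:

```python
import time

# Poll until probe() returns True or the timeout elapses.
def wait_until_ready(probe, timeout_s: float = 600,
                     interval_s: float = 5) -> bool:
    deadline = time.monotonic() + timeout_s
    while time.monotonic() < deadline:
        if probe():
            return True
        time.sleep(interval_s)
    return False
```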

Usage example

import requests
 
response = requests.post(
    "http://localhost:8000/v2/models/ensemble/generate",
    json={
        "text_input": "Explain GPU tensor parallelism.",
        "max_tokens": 200,
        "bad_words": "",
        "stop_words": "",
    },
    timeout=120,  # long generations can take a while
)
response.raise_for_status()
print(response.json()["text_output"])

Engine build flags

| Flag | Description |
|------|-------------|
| --dtype | Weight precision: float16, bfloat16, float8 |
| --tp_size | Tensor parallel degree (match GPU count) |
| --max_batch_size | Maximum concurrent requests |
| --max_input_len | Maximum input sequence length |
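Choosing --tp_size is mostly a memory-fit question. A back-of-envelope heuristic consistent with the hardware table above; min_tp_size, DTYPE_BYTES, and the 70% headroom figure are illustrative assumptions, not official guidance:

```python
# Smallest power-of-two GPU count whose combined memory fits the model
# weights, keeping (1 - headroom) of each GPU free for KV cache etc.
DTYPE_BYTES = {"float16": 2, "bfloat16": 2, "float8": 1}

def min_tp_size(num_params: float, dtype: str,
                gpu_mem_gib: float = 80, headroom: float = 0.7) -> int:
    need_gib = num_params * DTYPE_BYTES[dtype] / 1024**3
    tp = 1
    while tp * gpu_mem_gib * headroom < need_gib:
        tp *= 2
    return tp
```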

Check container logs

docker logs -f triton

Cloud-init startup script (optional)

If your provider supports cloud-init, you can paste this into the Startup Script field when deploying. This script pulls the NVIDIA Triton + TensorRT-LLM Docker image, builds an engine for Llama-3-8B in FP16, configures the Triton model repository from the official tensorrtllm_backend templates, and starts the Triton HTTP server on port 8000.

#cloud-config
runcmd:
  - apt-get update -y
  - apt-get install -y docker.io nvidia-container-toolkit git python3-pip
  - nvidia-ctk runtime configure --runtime=docker
  - systemctl restart docker
  - docker pull nvcr.io/nvidia/tritonserver:25.08-trtllm-python-py3
  - mkdir -p /opt/trtllm/engines /opt/trtllm/model_repo /opt/trtllm/hf_model /etc/trtllm
  - pip install huggingface_hub
  - install -m 600 /dev/null /etc/trtllm/hf-token
  - echo "export HF_TOKEN=<your_hf_token>" > /etc/trtllm/hf-token
  - sh -c '. /etc/trtllm/hf-token && huggingface-cli download meta-llama/Meta-Llama-3.1-8B-Instruct --local-dir /opt/trtllm/hf_model'
  - |
    docker run --rm --gpus all \
      -v /opt/trtllm:/workspace \
      nvcr.io/nvidia/tritonserver:25.08-trtllm-python-py3 \
      bash -c "
        pip install tensorrt_llm -U && \
        python3 -m tensorrt_llm.commands.build \
          --model_dir /workspace/hf_model \
          --output_dir /workspace/engines/llama-8b \
          --dtype float16 \
          --tp_size 1
      "
  - git clone --depth 1 https://github.com/triton-inference-server/tensorrtllm_backend.git /opt/trtllm/tensorrtllm_backend
  - cp -r /opt/trtllm/tensorrtllm_backend/all_models/inflight_batcher_llm/* /opt/trtllm/model_repo/
  - |
    docker run --rm \
      -v /opt/trtllm:/workspace \
      nvcr.io/nvidia/tritonserver:25.08-trtllm-python-py3 \
      bash -c "
        python3 /workspace/tensorrtllm_backend/tools/fill_template.py \
          --in_place \
          /workspace/model_repo/tensorrt_llm/config.pbtxt \
          'decoupled_mode:false,engine_dir:/engines/llama-8b,max_tokens_in_paged_kv_cache:2560,max_attention_window_size:2560,kv_cache_free_gpu_mem_fraction:0.5,exclude_input_in_output:true,enable_kv_cache_reuse:false,batching_strategy:inflight_fused_batching,max_beam_width:1' && \
        python3 /workspace/tensorrtllm_backend/tools/fill_template.py \
          --in_place \
          /workspace/model_repo/preprocessing/config.pbtxt \
          'tokenizer_dir:/hf_model,tokenizer_type:auto,triton_max_batch_size:64,preprocessing_instance_count:1' && \
        python3 /workspace/tensorrtllm_backend/tools/fill_template.py \
          --in_place \
          /workspace/model_repo/postprocessing/config.pbtxt \
          'tokenizer_dir:/hf_model,tokenizer_type:auto,triton_max_batch_size:64,postprocessing_instance_count:1' && \
        python3 /workspace/tensorrtllm_backend/tools/fill_template.py \
          --in_place \
          /workspace/model_repo/ensemble/config.pbtxt \
          'triton_max_batch_size:64' && \
        python3 /workspace/tensorrtllm_backend/tools/fill_template.py \
          --in_place \
          /workspace/model_repo/tensorrt_llm_bls/config.pbtxt \
          'triton_max_batch_size:64,decoupled_mode:false,bls_instance_count:1,accumulate_tokens:false'
      "
  - |
    docker run -d --gpus all \
      -p 8000:8000 -p 8001:8001 -p 8002:8002 \
      -v /opt/trtllm/model_repo:/opt/tritonserver/model_repo \
      -v /opt/trtllm/engines:/engines \
      -v /opt/trtllm/hf_model:/hf_model \
      --name triton \
      nvcr.io/nvidia/tritonserver:25.08-trtllm-python-py3 \
      tritonserver --model-repository=/opt/tritonserver/model_repo

What's next