# Inference Frameworks

Choose the right LLM serving stack for your workload. All frameworks listed here expose an OpenAI-compatible `/v1` API unless otherwise noted.
## Framework comparison
| Framework | Interface | Best For | Key Differentiator |
|---|---|---|---|
| vLLM | OpenAI REST | Production API, high throughput | PagedAttention; most widely used |
| SGLang | OpenAI REST | Agentic pipelines, structured output | RadixAttention; constrained decoding |
| TensorRT-LLM + Triton | Triton HTTP/gRPC | Maximum NVIDIA throughput | Engine compilation; NVIDIA-optimized |
| llama.cpp | OpenAI REST | Quantized models, CPU+GPU offload | GGUF format; runs on 8GB cards |
| LMDeploy | OpenAI REST | AWQ-quantized models | TurboMind engine; memory-efficient |
| LocalAI | OpenAI REST | Multi-modal drop-in replacement | Docker; LLMs + Whisper + SD |
| Ollama | OpenAI REST | Interactive local usage | One-command model pull; pairs with Open WebUI |
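Because most frameworks above share the same `/v1` surface, a single client works against all of them; only the base URL changes. A minimal standard-library sketch (the port, model name, and prompt below are placeholders, not defaults of any particular framework):

```python
import json
import urllib.request

def chat_request(base_url: str, model: str, prompt: str) -> urllib.request.Request:
    """Build an OpenAI-style /v1/chat/completions request for any server above."""
    payload = {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
    }
    return urllib.request.Request(
        f"{base_url}/v1/chat/completions",
        data=json.dumps(payload).encode(),
        headers={"Content-Type": "application/json"},
    )

# Point at whichever framework you deployed; the request shape is identical.
req = chat_request("http://localhost:8000", "my-model", "Hello!")
# resp = urllib.request.urlopen(req)  # uncomment once a server is running
```

Swapping frameworks then means changing only the `base_url` (and whatever model identifier the server registered), not the client code.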
## Available guides
### vLLM Inference Server
OpenAI-compatible inference server using vLLM on H100 or A100. Includes systemd service, SSH tunnel access, and performance tuning flags (`--tensor-parallel-size`, `--dtype`, `--max-model-len`).
Best for: Production API workloads; drop-in replacement for the OpenAI API.
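The tuning flags above slot directly into the server launch command. A hedged sketch (the model name, parallelism degree, and context length are placeholders to adjust for your GPUs; see the vLLM docs for the full flag list):

```bash
python -m vllm.entrypoints.openai.api_server \
  --model meta-llama/Llama-3.1-8B-Instruct \
  --tensor-parallel-size 2 \
  --dtype bfloat16 \
  --max-model-len 8192 \
  --port 8000
```

`--tensor-parallel-size` should match the number of GPUs you want to shard the model across; `--max-model-len` caps the context window to bound KV-cache memory.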
### Ollama + Open WebUI
Browser-based chat interface backed by Ollama on an RTX 4090. Docker Compose setup with NVIDIA GPU passthrough; pull any model with a single command.
Best for: Interactive local model usage; exploring models without writing code.
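The Docker Compose setup with GPU passthrough can be sketched roughly as follows (image tags, ports, and the pulled model name are illustrative; check the Ollama and Open WebUI docs for current values):

```yaml
services:
  ollama:
    image: ollama/ollama
    ports: ["11434:11434"]
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              count: all
              capabilities: [gpu]
  open-webui:
    image: ghcr.io/open-webui/open-webui:main
    ports: ["3000:8080"]
    environment:
      - OLLAMA_BASE_URL=http://ollama:11434
```

After `docker compose up -d`, a model pull is one command, e.g. `docker exec ollama ollama pull llama3` (model name is an example).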
### SGLang
Agentic LLM serving with RadixAttention for KV cache reuse across requests, constrained decoding for JSON output, and `torch.compile` support.
Best for: Agentic pipelines, multi-turn workloads, and structured output generation.
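Launching the SGLang OpenAI-compatible server is a single command; a hedged sketch (model path is a placeholder, and flags may vary by SGLang version):

```bash
python -m sglang.launch_server \
  --model-path meta-llama/Llama-3.1-8B-Instruct \
  --port 30000
```

RadixAttention cache reuse is automatic: repeated prompt prefixes across agentic turns hit the shared KV cache without client-side changes.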
### TensorRT-LLM + Triton
NVIDIA-optimized engine compilation via TensorRT-LLM with Triton Inference Server for production-grade serving. Requires an engine-build step before inference.
Best for: Maximum throughput on NVIDIA GPUs; latency-critical production deployments.
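The engine-build step mentioned above roughly looks like this (directory names are placeholders; the exact checkpoint-conversion steps depend on the model architecture and the TensorRT-LLM version):

```bash
# 1. Compile the converted checkpoint into a TensorRT engine
#    (one-time, per model and per GPU type)
trtllm-build --checkpoint_dir ./tllm_checkpoint --output_dir ./engine_dir
# 2. Point Triton's model repository at ./engine_dir and start the server
```

Because engines are compiled for a specific GPU architecture, an engine built on an A100 will not run on an H100; rebuild per target hardware.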
### llama.cpp Server
GGUF model serving with CPU+GPU offload. Supports Q4, Q8, and F16 quantizations. Runs on GPUs with as little as 8 GB of VRAM.
Best for: Running quantized models on consumer GPUs; mixed CPU/GPU inference; minimal dependencies.
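The CPU+GPU split is controlled at launch time. A hedged sketch (the GGUF path and layer count are placeholders; tune `-ngl` until the model fits your VRAM):

```bash
# -ngl: number of layers offloaded to the GPU; the remainder runs on CPU
# -c:   context window size in tokens
./llama-server -m ./models/model-Q4_K_M.gguf -ngl 32 -c 4096 --port 8080
```

Lowering `-ngl` trades speed for VRAM headroom, which is what makes 8 GB cards viable for models that would not otherwise fit.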
### LMDeploy
LMDeploy TurboMind inference toolkit with AWQ quantization support and OpenAI-compatible API.
Best for: Memory-efficient deployment of AWQ-quantized models on A100/H100.
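Serving an AWQ model with the OpenAI-compatible API is a single command; a hedged sketch (the model identifier is a placeholder, and flag names may vary by LMDeploy version):

```bash
lmdeploy serve api_server <awq-quantized-model> \
  --model-format awq \
  --server-port 23333
```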
### LocalAI
OpenAI-compatible drop-in replacement via Docker with support for LLMs, Whisper speech-to-text, and Stable Diffusion image generation.
Best for: Multi-modal local inference with a single OpenAI-compatible endpoint; replacing OpenAI calls without code changes.
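A minimal sketch of the Docker setup (the image tag is illustrative; GPU-enabled variants use different tags, so check the LocalAI docs for the one matching your CUDA version):

```bash
docker run -d --name local-ai -p 8080:8080 localai/localai:latest
```

Existing OpenAI client code then only needs its base URL pointed at `http://localhost:8080/v1`.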
## What's next
- LLM Inference Overview: Model guides and hardware recommendations
- Instance Types: Spot vs Dedicated vs Cluster
- Networking: SSH tunneling and port access
- Cost Optimization: Reducing inference costs with Spot instances