# Inference Frameworks

Choose the right LLM serving stack for your workload. All frameworks listed here expose an OpenAI-compatible `/v1` API unless otherwise noted.
## Framework comparison
| Framework | Interface | Best For | Key Differentiator |
|---|---|---|---|
| vLLM | OpenAI REST | Production API, high throughput | PagedAttention; most widely used |
| SGLang | OpenAI REST | Agentic pipelines, structured output | RadixAttention; constrained decoding |
| TensorRT-LLM + Triton | Triton HTTP/gRPC | Maximum NVIDIA throughput | Engine compilation; NVIDIA-optimized |
| llama.cpp | OpenAI REST | Quantized models, CPU+GPU offload | GGUF format; runs on 8GB cards |
| LMDeploy | OpenAI REST | AWQ-quantized models | TurboMind engine; memory-efficient |
| LocalAI | OpenAI REST | Multi-modal drop-in replacement | Docker; LLMs + Whisper + SD |
| Ollama | OpenAI REST | Interactive local usage | One-command model pull; pairs with Open WebUI |
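Because most frameworks above share the same `/v1` surface, a single client works against all of them; only the base URL changes. A minimal standard-library sketch (the port, model name, and prompt below are placeholders, not defaults of any particular framework):

```python
import json
import urllib.request

def chat_request(base_url: str, model: str, prompt: str) -> urllib.request.Request:
    """Build an OpenAI-style /v1/chat/completions request for any server above."""
    payload = {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
    }
    return urllib.request.Request(
        f"{base_url}/v1/chat/completions",
        data=json.dumps(payload).encode(),
        headers={"Content-Type": "application/json"},
    )

# Point at whichever framework you deployed; the request shape is identical.
req = chat_request("http://localhost:8000", "my-model", "Hello!")
# resp = urllib.request.urlopen(req)  # uncomment once a server is running
```

Swapping frameworks then means changing only the `base_url` (and whatever model identifier the server registered), not the client code.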
## Available guides
### vLLM Inference Server
OpenAI-compatible inference server using vLLM on H100 or A100. Includes systemd service, SSH tunnel access, and performance tuning flags (`--tensor-parallel-size`, `--dtype`, `--max-model-len`).
Best for: Production API workloads; drop-in replacement for the OpenAI API.
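The tuning flags above slot directly into the server launch command. A hedged sketch (the model name, parallelism degree, and context length are placeholders to adjust for your GPUs; see the vLLM docs for the full flag list):

```bash
python -m vllm.entrypoints.openai.api_server \
  --model meta-llama/Llama-3.1-8B-Instruct \
  --tensor-parallel-size 2 \
  --dtype bfloat16 \
  --max-model-len 8192 \
  --port 8000
```

`--tensor-parallel-size` should match the number of GPUs you want to shard the model across; `--max-model-len` caps the context window to bound KV-cache memory.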
### Ollama + Open WebUI
Browser-based chat interface backed by Ollama on an RTX 4090. Docker Compose setup with NVIDIA GPU passthrough; pull any model with a single command.
Best for: Interactive local model usage; exploring models without writing code.
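The Docker Compose setup with GPU passthrough can be sketched roughly as follows (image tags, ports, and the pulled model name are illustrative; check the Ollama and Open WebUI docs for current values):

```yaml
services:
  ollama:
    image: ollama/ollama
    ports: ["11434:11434"]
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              count: all
              capabilities: [gpu]
  open-webui:
    image: ghcr.io/open-webui/open-webui:main
    ports: ["3000:8080"]
    environment:
      - OLLAMA_BASE_URL=http://ollama:11434
```

After `docker compose up -d`, a model pull is one command, e.g. `docker exec ollama ollama pull llama3` (model name is an example).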
### SGLang
Agentic LLM serving with RadixAttention for KV cache reuse across requests, constrained decoding for JSON output, and `torch.compile` support.
Best for: Agentic pipelines, multi-turn workloads, and structured output generation.
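Launching the SGLang OpenAI-compatible server is a single command; a hedged sketch (model path is a placeholder, and flags may vary by SGLang version):

```bash
python -m sglang.launch_server \
  --model-path meta-llama/Llama-3.1-8B-Instruct \
  --port 30000
```

RadixAttention cache reuse is automatic: repeated prompt prefixes across agentic turns hit the shared KV cache without client-side changes.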
### TensorRT-LLM + Triton
NVIDIA-optimized engine compilation via TensorRT-LLM with Triton Inference Server for production-grade serving. Requires an engine-build step before inference.
Best for: Maximum throughput on NVIDIA GPUs; latency-critical production deployments.
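The engine-build step mentioned above roughly looks like this (directory names are placeholders; the exact checkpoint-conversion steps depend on the model architecture and the TensorRT-LLM version):

```bash
# 1. Compile the converted checkpoint into a TensorRT engine
#    (one-time, per model and per GPU type)
trtllm-build --checkpoint_dir ./tllm_checkpoint --output_dir ./engine_dir
# 2. Point Triton's model repository at ./engine_dir and start the server
```

Because engines are compiled for a specific GPU architecture, an engine built on an A100 will not run on an H100; rebuild per target hardware.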
### llama.cpp Server
GGUF model serving with CPU+GPU offload. Supports Q4, Q8, and F16 quantizations. Runs on GPUs with as little as 8 GB of VRAM.
Best for: Running quantized models on consumer GPUs; mixed CPU/GPU inference; minimal dependencies.
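The CPU+GPU split is controlled at launch time. A hedged sketch (the GGUF path and layer count are placeholders; tune `-ngl` until the model fits your VRAM):

```bash
# -ngl: number of layers offloaded to the GPU; the remainder runs on CPU
# -c:   context window size in tokens
./llama-server -m ./models/model-Q4_K_M.gguf -ngl 32 -c 4096 --port 8080
```

Lowering `-ngl` trades speed for VRAM headroom, which is what makes 8 GB cards viable for models that would not otherwise fit.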
### LMDeploy
LMDeploy TurboMind inference toolkit with AWQ quantization support and OpenAI-compatible API.
Best for: Memory-efficient deployment of AWQ-quantized models on A100/H100.
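Serving an AWQ model with the OpenAI-compatible API is a single command; a hedged sketch (the model identifier is a placeholder, and flag names may vary by LMDeploy version):

```bash
lmdeploy serve api_server <awq-quantized-model> \
  --model-format awq \
  --server-port 23333
```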
### LocalAI
OpenAI-compatible drop-in replacement via Docker with support for LLMs, Whisper speech-to-text, and Stable Diffusion image generation.
Best for: Multi-modal local inference with a single OpenAI-compatible endpoint; replacing OpenAI calls without code changes.
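A minimal sketch of the Docker setup (the image tag is illustrative; GPU-enabled variants use different tags, so check the LocalAI docs for the one matching your CUDA version):

```bash
docker run -d --name local-ai -p 8080:8080 localai/localai:latest
```

Existing OpenAI client code then only needs its base URL pointed at `http://localhost:8080/v1`.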
## What's next
- LLM Inference Overview: Model guides and hardware recommendations
- Instance Types: Spot vs Dedicated vs Cluster
- Networking: SSH tunneling and port access
- Cost Optimization: Reducing inference costs with Spot instances