
Inference Frameworks

Choose the right LLM serving stack for your workload. All frameworks listed here expose an OpenAI-compatible /v1 API unless otherwise noted.
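Because every framework below speaks the same `/v1` protocol, one client works against all of them. A minimal sketch using only the standard library; the base URL, port, and model name are placeholders you swap for your deployment:

```python
import json
from urllib import request

def chat_request(base_url, model, prompt):
    """Build a POST request for an OpenAI-compatible /v1/chat/completions
    endpoint. The payload shape is identical for every framework below."""
    payload = {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "temperature": 0.7,
    }
    return request.Request(
        f"{base_url}/v1/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )

# Example (assumes a server is already listening on localhost:8000):
req = chat_request("http://localhost:8000", "my-model", "Hello!")
# body = json.load(request.urlopen(req))  # uncomment with a live server
```

Switching frameworks then means changing only the base URL (and possibly the model name), not the client code.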

Framework comparison

| Framework | Interface | Best For | Key Differentiator |
|---|---|---|---|
| vLLM | OpenAI REST | Production API, high throughput | PagedAttention; most widely used |
| SGLang | OpenAI REST | Agentic pipelines, structured output | RadixAttention; constrained decoding |
| TensorRT-LLM + Triton | Triton HTTP/gRPC | Maximum NVIDIA throughput | Engine compilation; NVIDIA-optimized |
| llama.cpp | OpenAI REST | Quantized models, CPU+GPU offload | GGUF format; runs on 8GB cards |
| LMDeploy | OpenAI REST | AWQ-quantized models | TurboMind engine; memory-efficient |
| LocalAI | OpenAI REST | Multi-modal drop-in replacement | Docker; LLMs + Whisper + SD |
| Ollama | OpenAI REST | Interactive local usage | Browser WebUI; one-command model pull |

Available guides

vLLM Inference Server

OpenAI-compatible inference server using vLLM on H100 or A100. Includes systemd service, SSH tunnel access, and performance tuning flags (--tensor-parallel-size, --dtype, --max-model-len).

Best for: Production API workloads; drop-in replacement for the OpenAI API.
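The tuning flags above compose into a launch command like the following sketch. The model name and flag values are illustrative placeholders; pick `--tensor-parallel-size` to match your GPU count and `--max-model-len` to bound KV-cache memory:

```python
def vllm_launch_cmd(model, tp=2, dtype="bfloat16", max_len=8192):
    """Assemble a vLLM OpenAI-server command with the tuning flags
    covered in the guide. Values here are examples, not recommendations."""
    return [
        "python", "-m", "vllm.entrypoints.openai.api_server",
        "--model", model,
        "--tensor-parallel-size", str(tp),  # shard weights across N GPUs
        "--dtype", dtype,                   # weight/activation precision
        "--max-model-len", str(max_len),    # cap context to bound KV cache
    ]

cmd = vllm_launch_cmd("meta-llama/Llama-3.1-8B-Instruct")
# subprocess.run(cmd)  # or wrap in a systemd unit as the guide describes
```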

Ollama + Open WebUI

Browser-based chat interface backed by Ollama on an RTX 4090. Docker Compose setup with NVIDIA GPU passthrough; pull any model with a single command.

Best for: Interactive local model usage; exploring models without writing code.

SGLang

Agentic LLM serving with RadixAttention for KV cache reuse across requests, constrained decoding for JSON output, and torch.compile support.

Best for: Agentic pipelines, multi-turn workloads, and structured output generation.
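Constrained decoding means the server enforces a schema during generation rather than hoping the model emits valid JSON. A request sketch in the OpenAI `response_format` style, which SGLang's server accepts; the schema and model name are hypothetical, and the exact field names should be verified against the SGLang docs:

```python
import json

# Hypothetical schema for the structured output we want enforced.
invoice_schema = {
    "type": "object",
    "properties": {
        "total": {"type": "number"},
        "currency": {"type": "string"},
    },
    "required": ["total", "currency"],
}

payload = {
    "model": "my-model",  # placeholder
    "messages": [{"role": "user", "content": "Extract the invoice total."}],
    # Structured-output field: the completion is guaranteed to parse
    # against the schema, not merely likely to.
    "response_format": {
        "type": "json_schema",
        "json_schema": {"name": "invoice", "schema": invoice_schema},
    },
}
body = json.dumps(payload)
```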

TensorRT-LLM + Triton

NVIDIA-optimized engine compilation via TensorRT-LLM with Triton Inference Server for production-grade serving. Requires an engine-build step before inference.

Best for: Maximum throughput on NVIDIA GPUs; latency-critical production deployments.

llama.cpp Server

GGUF model serving with CPU+GPU offload. Supports Q4, Q8, and F16 quantizations. Runs on GPUs with as little as 8 GB of VRAM.

Best for: Running quantized models on consumer GPUs; mixed CPU/GPU inference; minimal dependencies.
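The 8 GB claim follows from weight-size arithmetic. A rough sketch; the effective bits-per-weight figures (Q4 quants land near 4.5 bpw, Q8 near 8.5 bpw, including quantization metadata) are approximations, and the estimate ignores KV cache and runtime overhead, so treat it as a floor:

```python
def approx_weight_gb(n_params_billions, bits_per_weight):
    """Weight-only footprint: params x bits / 8, reported in GB.
    Excludes KV cache and activations, so actual usage is higher."""
    return n_params_billions * 1e9 * bits_per_weight / 8 / 1e9

# A 7B model at common GGUF quantization levels:
for name, bits in [("Q4", 4.5), ("Q8", 8.5), ("F16", 16)]:
    print(f"{name}: ~{approx_weight_gb(7, bits):.1f} GB")
```

At ~3.9 GB for Q4 weights, a 7B model fits comfortably on an 8 GB card with room left for the KV cache; F16 at 14 GB does not, which is where CPU offload comes in.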

LMDeploy

Inference toolkit built on LMDeploy's TurboMind engine, with AWQ quantization support and an OpenAI-compatible API.

Best for: Memory-efficient deployment of AWQ-quantized models on A100/H100.

LocalAI

OpenAI-compatible drop-in replacement via Docker with support for LLMs, Whisper speech-to-text, and Stable Diffusion image generation.

Best for: Multi-modal local inference with a single OpenAI-compatible endpoint; replacing OpenAI calls without code changes.
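"Without code changes" in practice means redirecting the client's base URL. A sketch of the idea; port 8080 is assumed here as LocalAI's default and should be checked against your deployment, while the endpoint paths are the standard OpenAI ones that LocalAI mirrors:

```python
import os

# Point existing OpenAI clients at LocalAI by overriding the base URL
# (typically via an environment variable rather than an edit).
OPENAI_BASE = os.environ.get("OPENAI_BASE_URL", "http://localhost:8080/v1")

def endpoint(path):
    """Same paths the hosted OpenAI API uses, served locally."""
    return f"{OPENAI_BASE}/{path}"

chat = endpoint("chat/completions")     # LLM chat
stt = endpoint("audio/transcriptions")  # Whisper speech-to-text
img = endpoint("images/generations")    # Stable Diffusion
```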

What's next