Comparing the Top 6 Inference Runtimes for LLM Serving in 2025

By Josh
November 7, 2025


Large language models are now limited less by training and more by how fast and cheaply we can serve tokens under real traffic. That comes down to three implementation details: how the runtime batches requests, how it overlaps prefill and decode, and how it stores and reuses the KV cache. Different engines make different tradeoffs on these axes, which show up directly as differences in tokens per second, P50/P99 latency, and GPU memory usage.

This article compares six runtimes that show up repeatedly in production stacks:

  • vLLM
  • TensorRT LLM
  • Hugging Face Text Generation Inference (TGI v3)
  • LMDeploy
  • SGLang
  • DeepSpeed Inference / ZeRO Inference

1. vLLM

Design

vLLM is built around PagedAttention. Instead of storing each sequence’s KV cache in a large contiguous buffer, it partitions KV into fixed size blocks and uses an indirection layer so each sequence points to a list of blocks.

This gives:

  • Very low KV fragmentation (reported <4% waste vs 60–80% in naïve allocators)
  • High GPU utilization with continuous batching
  • Native support for prefix sharing and KV reuse at block level

Recent versions add KV quantization (FP8) and integrate FlashAttention style kernels.
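
As a concrete illustration, here is a minimal offline serving sketch against vLLM's Python API, assuming a recent release where FP8 KV quantization and prefix caching are exposed as the constructor flags shown (kv_cache_dtype, enable_prefix_caching); the model name is a placeholder.

```python
# Minimal vLLM sketch: PagedAttention and continuous batching are implicit in
# the engine; the flag names below assume a recent vLLM release.
from vllm import LLM, SamplingParams

llm = LLM(
    model="meta-llama/Llama-3.1-8B-Instruct",  # placeholder HF-format model
    kv_cache_dtype="fp8",         # FP8 KV cache quantization
    enable_prefix_caching=True,   # block-level prefix/KV reuse
    gpu_memory_utilization=0.90,  # fraction of VRAM handed to the paged KV pool
)

params = SamplingParams(temperature=0.7, max_tokens=128)
prompts = [
    "Summarize PagedAttention in two sentences.",
    "Explain continuous batching to a backend engineer.",
]

# The engine batches these requests internally and pages their KV blocks.
for out in llm.generate(prompts, params):
    print(out.outputs[0].text)
```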

Performance

From the vLLM paper's evaluation:

  • vLLM achieves 14–24× higher throughput than Hugging Face Transformers and 2.2–3.5× higher than early TGI for LLaMA models on NVIDIA GPUs.

KV and memory behavior

  • PagedAttention provides a KV layout that is both GPU friendly and fragmentation resistant.
  • FP8 KV quantization reduces KV size and improves decode throughput when compute is not the bottleneck.

Where it fits

  • Default high performance engine when you need a general LLM serving backend with good throughput, good TTFT, and hardware flexibility.

2. TensorRT LLM

Design

TensorRT LLM is a compilation based engine on top of NVIDIA TensorRT. It generates fused kernels per model and shape, and exposes an executor API used by frameworks such as Triton.

Its KV subsystem is explicit and feature rich:

  • Paged KV cache
  • Quantized KV cache (INT8, FP8, with some combinations still evolving)
  • Circular buffer KV cache
  • KV cache reuse, including offloading KV to CPU and reusing it across prompts to reduce TTFT

NVIDIA reports that CPU based KV reuse can reduce time to first token by up to 14× on H100 and even more on GH200 in specific scenarios.

Performance

TensorRT LLM is highly tunable, so results vary. Common patterns from public comparisons and vendor benchmarks:

  • Very low single request latency on NVIDIA GPUs when engines are compiled for the exact model and configuration.
  • At moderate concurrency, it can be tuned either for low TTFT or for high throughput; at very high concurrency, throughput optimized profiles push P99 up due to aggressive batching.

KV and memory behavior

  • Paged KV plus quantized KV gives strong control over memory use and bandwidth.
  • Executor and memory APIs let you design cache aware routing policies at the application layer.
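
That last point is worth a small, engine agnostic illustration. The sketch below is hypothetical (replica URLs, prefix window and hashing scheme are illustrative choices, not TensorRT LLM APIs): it routes requests that share a prompt prefix to the same replica so that the engine's KV reuse actually gets hits.

```python
# Hypothetical cache-aware router: requests sharing a prompt prefix go to the
# same replica, so its reused/offloaded KV blocks for that prefix stay warm.
# Replica URLs and the prefix window are illustrative, not any engine's API.
import hashlib

REPLICAS = ["http://trtllm-0:8000", "http://trtllm-1:8000", "http://trtllm-2:8000"]
PREFIX_CHARS = 2000  # route on the leading characters as a cheap prefix proxy

def pick_replica(prompt: str) -> str:
    prefix = prompt[:PREFIX_CHARS]
    digest = hashlib.sha256(prefix.encode()).digest()
    return REPLICAS[int.from_bytes(digest[:4], "big") % len(REPLICAS)]

# Two requests with the same system prompt land on the same replica.
system = "You are a support bot for ACME. Follow the policy below...\n"
print(pick_replica(system + "User: my order is late"))
print(pick_replica(system + "User: how do I return an item"))
```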

Where it fits

  • Latency critical workloads and NVIDIA only environments, where teams can invest in engine builds and per model tuning.

3. Hugging Face TGI v3

Design

Text Generation Inference (TGI) is a server focused stack with:

  • Rust based HTTP and gRPC server
  • Continuous batching, streaming, safety hooks
  • Backends for PyTorch and TensorRT, plus tight Hugging Face Hub integration

TGI v3 adds a new long context pipeline:

  • Chunked prefill for long inputs
  • Prefix KV caching so long conversation histories are not recomputed on each request
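
To make the second point concrete, here is a hedged client side sketch against a TGI server's /generate endpoint (the URL, container tag and model below are placeholders): each turn resends the full conversation, and the v3 prefix KV cache is what keeps prefill cost from growing with the history.

```python
# Multi-turn client against a TGI server (URL and model are placeholders), e.g.
#   docker run --gpus all -p 8080:80 ghcr.io/huggingface/text-generation-inference:latest \
#       --model-id meta-llama/Llama-3.1-8B-Instruct
# Each request resends the whole history; prefix caching avoids recomputing it.
import requests

TGI_URL = "http://localhost:8080/generate"

history = "System: You are a concise assistant.\n"
for user_turn in ["What is a KV cache?", "Why does prefill dominate long prompts?"]:
    history += f"User: {user_turn}\nAssistant:"
    resp = requests.post(
        TGI_URL,
        json={"inputs": history, "parameters": {"max_new_tokens": 128}},
        timeout=60,
    )
    answer = resp.json()["generated_text"]
    history += answer + "\n"
    print(answer)
```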

Performance

For conventional prompts, recent third party work shows:

  • vLLM often edges out TGI on raw tokens per second at high concurrency due to PagedAttention, but the difference is not huge on many setups.
  • TGI v3 processes around 3× more tokens and is up to 13× faster than vLLM on long prompts, under a setup with very long histories and prefix caching enabled.

Latency profile:

  • P50 for short and mid length prompts is similar to vLLM when both are tuned with continuous batching.
  • For long chat histories, prefill dominates in naive pipelines; TGI v3’s reuse of earlier tokens gives a large win in TTFT and P50.

KV and memory behavior

  • TGI uses KV caching with paged attention style kernels and reduces memory footprint through chunking of prefill and other runtime changes.
  • It integrates quantization through bitsandbytes and GPTQ, and runs across several hardware backends.

Where it fits

  • Production stacks already on Hugging Face, especially for chat style workloads with long histories where prefix caching gives large real world gains.

4. LMDeploy

Design

LMDeploy is a toolkit for compression and deployment from the InternLM ecosystem. It exposes two engines:

  • TurboMind: high performance CUDA kernels for NVIDIA GPUs
  • PyTorch engine: flexible fallback

Key runtime features:

  • Persistent, continuous batching
  • Blocked KV cache with a manager for allocation and reuse
  • Dynamic split and fuse for attention blocks
  • Tensor parallelism
  • Weight only and KV quantization (including AWQ and online INT8 / INT4 KV quant)
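
A short sketch of how these pieces are usually wired together through LMDeploy's pipeline API follows; the parameter names (quant_policy, cache_max_entry_count) assume a recent LMDeploy release and the model name is a placeholder.

```python
# LMDeploy pipeline with the TurboMind backend (parameter names assume a recent
# release). quant_policy=8 requests online INT8 KV quantization; 4 would mean
# INT4. cache_max_entry_count is the share of free VRAM for the blocked KV cache.
from lmdeploy import pipeline, TurbomindEngineConfig, GenerationConfig

engine_cfg = TurbomindEngineConfig(
    quant_policy=8,
    cache_max_entry_count=0.8,
    session_len=8192,
)

pipe = pipeline("internlm/internlm2_5-7b-chat", backend_config=engine_cfg)
gen_cfg = GenerationConfig(max_new_tokens=128, temperature=0.7)

# Persistent/continuous batching handles the request list internally.
for out in pipe(["What is a blocked KV cache?", "Explain AWQ in one paragraph."],
                gen_config=gen_cfg):
    print(out.text)
```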

The LMDeploy team reports up to 1.8× higher request throughput than vLLM, attributing the gain to persistent batching, blocked KV cache, and optimized kernels.

Performance

Evaluations show:

  • For 4 bit Llama style models on A100, LMDeploy can reach higher tokens per second than vLLM under comparable latency constraints, especially at high concurrency.
  • The project also reports that 4 bit inference is about 2.4× faster than FP16 for supported models.

Latency:

  • Single request TTFT is in the same ballpark as other optimized GPU engines when configured without extreme batch limits.
  • Under heavy concurrency, persistent batching plus blocked KV let LMDeploy sustain high throughput without TTFT collapse.

KV and memory behavior

  • Blocked KV cache trades contiguous per sequence buffers for a grid of KV chunks managed by the runtime, similar in spirit to vLLM’s PagedAttention but with a different internal layout.
  • Support for weight and KV quantization targets large models on constrained GPUs.

Where it fits

  • NVIDIA centric deployments that want maximum throughput and are comfortable using TurboMind and LMDeploy specific tooling.

5. SGLang

Design

SGLang is both:

  • A DSL for building structured LLM programs such as agents, RAG workflows and tool pipelines
  • A runtime that implements RadixAttention, a KV reuse mechanism that shares prefixes using a radix tree structure rather than simple block hashes.

RadixAttention:

  • Stores KV for many requests in a prefix tree keyed by tokens
  • Enables high KV hit rates when many calls share prefixes, such as few shot prompts, multi turn chat, or tool chains
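
A brief sketch of the DSL side, assuming a local SGLang server at a placeholder endpoint: every call below shares the same few shot header, which is exactly the pattern RadixAttention serves out of the radix tree instead of recomputing.

```python
# SGLang frontend sketch (endpoint is a placeholder; start the server separately,
# e.g. python -m sglang.launch_server --model-path <model> --port 30000).
# All calls share FEW_SHOT, so its KV is computed once and reused via RadixAttention.
import sglang as sgl

sgl.set_default_backend(sgl.RuntimeEndpoint("http://localhost:30000"))

FEW_SHOT = (
    "Classify the sentiment of each review as positive or negative.\n"
    "Review: 'Great battery life.' -> positive\n"
    "Review: 'Screen died in a week.' -> negative\n"
)

@sgl.function
def classify(s, review):
    s += FEW_SHOT + f"Review: '{review}' -> "
    s += sgl.gen("label", max_tokens=4)

# run_batch issues the calls concurrently; the shared prefix hits the radix tree.
states = classify.run_batch([
    {"review": "Shipping was fast and packaging was solid."},
    {"review": "The app crashes every time I open it."},
])
for st in states:
    print(st["label"])
```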

Performance

Key Insights:

  • SGLang achieves up to 6.4× higher throughput and up to 3.7× lower latency than baseline systems such as vLLM, LMQL and others on structured workloads.
  • Improvements are largest when there is heavy prefix reuse, for example multi turn chat or evaluation workloads with repeated context.

Reported KV cache hit rates range from roughly 50% to 99%, and cache aware schedulers get close to the optimal hit rate on the measured benchmarks.

KV and memory behavior

  • RadixAttention sits on top of paged attention style kernels and focuses on reuse rather than just allocation.
  • SGLang integrates well with hierarchical context caching systems that move KV between GPU and CPU when sequences are long, although those systems are usually implemented as separate projects.

Where it fits

  • Agentic systems, tool pipelines, and heavy RAG applications where many calls share large prompt prefixes and KV reuse matters at the application level.

6. DeepSpeed Inference / ZeRO Inference

Design

DeepSpeed provides two pieces relevant for inference:

  • DeepSpeed Inference: optimized transformer kernels plus tensor and pipeline parallelism
  • ZeRO Inference / ZeRO Offload: techniques that offload model weights, and in some setups KV cache, to CPU or NVMe to run very large models on limited GPU memory

ZeRO Inference focuses on:

  • Keeping little or none of the model weights resident on the GPU
  • Streaming tensors from CPU or NVMe as needed
  • Targeting throughput and model size rather than low latency
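
The sketch below shows what this looks like in code, following the shape of the DeepSpeed plus Hugging Face non-Trainer integration examples; the config fields and model are assumptions to check against your installed versions, and a small model stands in for a 30B class one so the sketch stays cheap.

```python
# ZeRO-Inference style sketch: a ZeRO stage-3 config with parameters offloaded
# to CPU (or NVMe) and streamed to the GPU layer by layer. Field names follow
# DeepSpeed's published examples; treat them as assumptions for your version.
# Run with: deepspeed --num_gpus 1 this_script.py
import deepspeed
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
from transformers.integrations import HfDeepSpeedConfig

MODEL = "facebook/opt-1.3b"  # stand-in; the same config is what fits 30B on 32GB

ds_config = {
    "fp16": {"enabled": True},
    "zero_optimization": {
        "stage": 3,
        "offload_param": {"device": "cpu", "pin_memory": True},
        # NVMe variant: {"device": "nvme", "nvme_path": "/local_nvme"}
    },
    "train_micro_batch_size_per_gpu": 1,  # required field even for inference
}

dschf = HfDeepSpeedConfig(ds_config)  # must exist before from_pretrained (zero.Init)
model = AutoModelForCausalLM.from_pretrained(MODEL, torch_dtype=torch.float16)
tok = AutoTokenizer.from_pretrained(MODEL)

engine = deepspeed.initialize(model=model, config=ds_config)[0]
engine.module.eval()

inputs = tok("Offload-based serving trades latency for model size because",
             return_tensors="pt").to("cuda")
with torch.no_grad():
    out = engine.module.generate(**inputs, max_new_tokens=64)
print(tok.decode(out[0], skip_special_tokens=True))
```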

Performance

In the ZeRO Inference OPT 30B example on a single V100 32GB:

  • Full CPU offload reaches about 43 tokens per second
  • Full NVMe offload reaches about 30 tokens per second
  • Both are 1.3–2.4× faster than partial offload configurations, because full offload enables larger batch sizes

These numbers are small compared to GPU resident LLM runtimes on A100 or H100, but they apply to a model that does not fit natively in 32GB.

A recent I/O characterization of DeepSpeed and FlexGen confirms that offload based systems are dominated by small reads (around 128 KiB) and that I/O behavior becomes the main bottleneck.

KV and memory behavior

  • Model weights and sometimes KV blocks are offloaded to CPU or SSD to fit models beyond GPU capacity.
  • TTFT and P99 are high compared to pure GPU engines, but the tradeoff is the ability to run very large models that otherwise would not fit.

Where it fits

  • Offline or batch inference, or low QPS services where model size matters more than latency and GPU count.

Comparison Tables

This table summarizes the main tradeoffs qualitatively:

| Runtime | Main design idea | Relative strength | KV strategy | Typical use case |
| --- | --- | --- | --- | --- |
| vLLM | PagedAttention, continuous batching | High tokens per second at a given TTFT | Paged KV blocks, FP8 KV support | General purpose GPU serving, multi hardware |
| TensorRT LLM | Compiled kernels on NVIDIA + KV reuse | Very low latency and high throughput on NVIDIA | Paged, quantized KV, reuse and offload | NVIDIA only, latency sensitive |
| TGI v3 | HF serving layer with long prompt path | Strong long prompt performance, integrated stack | Paged KV, chunked prefill, prefix caching | HF centric APIs, long chat histories |
| LMDeploy | TurboMind kernels, blocked KV, quant | Up to 1.8× vLLM throughput in vendor tests | Blocked KV cache, weight and KV quant | NVIDIA deployments focused on raw throughput |
| SGLang | RadixAttention and structured programs | Up to 6.4× throughput and 3.7× lower latency on structured workloads | Radix tree KV reuse over prefixes | Agents, RAG, high prefix reuse |
| DeepSpeed | GPU/CPU/NVMe offload for huge models | Enables large models on small GPUs; throughput oriented | Offloaded weights and sometimes KV | Very large models, offline or low QPS |

Choosing a runtime in practice

For a production system, the choice tends to collapse to a few simple patterns:

  • You want a strong default engine with minimal custom work: You can start with vLLM. It gives you good throughput, reasonable TTFT, and solid KV handling on common hardware.
  • You are committed to NVIDIA and want fine grained control over latency and KV: You can use TensorRT LLM, likely behind Triton or TGI. Plan for model specific engine builds and tuning.
  • Your stack is already on Hugging Face and you care about long chats: You can use TGI v3. Its long prompt pipeline and prefix caching are very effective for conversation style traffic.
  • You want maximum throughput per GPU with quantized models: You can use LMDeploy with TurboMind and blocked KV, especially for 4 bit Llama family models.
  • You are building agents, tool chains or heavy RAG systems: You can use SGLang and design prompts so that KV reuse via RadixAttention is high.
  • You must run very large models on limited GPUs: You can use DeepSpeed Inference / ZeRO Inference, accept higher latency, and treat the GPU as a throughput engine with SSD in the loop.

Overall, all these engines are converging on the same idea: KV cache is the real bottleneck resource. The winners are the runtimes that treat KV as a first class data structure to be paged, quantized, reused and offloaded, not just a big tensor slapped into GPU memory.



