Comparing the Top 6 Inference Runtimes for LLM Serving in 2025

By Josh
November 7, 2025


Large language models are now limited less by training and more by how fast and cheaply we can serve tokens under real traffic. That comes down to three implementation details: how the runtime batches requests, how it overlaps prefill and decode, and how it stores and reuses the KV cache. Different engines make different tradeoffs on these axes, which show up directly as differences in tokens per second, P50/P99 latency, and GPU memory usage.

This article compares six runtimes that show up repeatedly in production stacks:

  • vLLM
  • TensorRT LLM
  • Hugging Face Text Generation Inference (TGI v3)
  • LMDeploy
  • SGLang
  • DeepSpeed Inference / ZeRO Inference

1. vLLM

Design

vLLM is built around PagedAttention. Instead of storing each sequence’s KV cache in a large contiguous buffer, it partitions KV into fixed size blocks and uses an indirection layer so each sequence points to a list of blocks.

This gives:

  • Very low KV fragmentation (reported <4% waste vs 60–80% in naïve allocators)
  • High GPU utilization with continuous batching
  • Native support for prefix sharing and KV reuse at block level

Recent versions add KV quantization (FP8) and integrate FlashAttention style kernels.
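
As a concrete illustration, here is a minimal offline serving sketch against vLLM's Python API, assuming a recent release where FP8 KV quantization and prefix caching are exposed as the constructor flags shown (kv_cache_dtype, enable_prefix_caching); the model name is a placeholder.

```python
# Minimal vLLM sketch: PagedAttention and continuous batching are implicit in
# the engine; the flag names below assume a recent vLLM release.
from vllm import LLM, SamplingParams

llm = LLM(
    model="meta-llama/Llama-3.1-8B-Instruct",  # placeholder HF-format model
    kv_cache_dtype="fp8",         # FP8 KV cache quantization
    enable_prefix_caching=True,   # block-level prefix/KV reuse
    gpu_memory_utilization=0.90,  # fraction of VRAM handed to the paged KV pool
)

params = SamplingParams(temperature=0.7, max_tokens=128)
prompts = [
    "Summarize PagedAttention in two sentences.",
    "Explain continuous batching to a backend engineer.",
]

# The engine batches these requests internally and pages their KV blocks.
for out in llm.generate(prompts, params):
    print(out.outputs[0].text)
```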

Performance

From the vLLM paper's evaluation:

  • vLLM achieves 14–24× higher throughput than Hugging Face Transformers and 2.2–3.5× higher than early TGI for LLaMA models on NVIDIA GPUs.

KV and memory behavior

  • PagedAttention provides a KV layout that is both GPU friendly and fragmentation resistant.
  • FP8 KV quantization reduces KV size and improves decode throughput when compute is not the bottleneck.

Where it fits

  • Default high performance engine when you need a general LLM serving backend with good throughput, good TTFT, and hardware flexibility.

2. TensorRT LLM

Design

TensorRT LLM is a compilation based engine on top of NVIDIA TensorRT. It generates fused kernels per model and shape, and exposes an executor API used by frameworks such as Triton.

Its KV subsystem is explicit and feature rich:

  • Paged KV cache
  • Quantized KV cache (INT8, FP8, with some combinations still evolving)
  • Circular buffer KV cache
  • KV cache reuse, including offloading KV to CPU and reusing it across prompts to reduce TTFT

NVIDIA reports that CPU based KV reuse can reduce time to first token by up to 14× on H100 and even more on GH200 in specific scenarios.

Performance

TensorRT LLM is highly tunable, so results vary. Common patterns from public comparisons and vendor benchmarks:

  • Very low single request latency on NVIDIA GPUs when engines are compiled for the exact model and configuration.
  • At moderate concurrency, it can be tuned either for low TTFT or for high throughput; at very high concurrency, throughput optimized profiles push P99 up due to aggressive batching.

KV and memory behavior

  • Paged KV plus quantized KV gives strong control over memory use and bandwidth.
  • Executor and memory APIs let you design cache aware routing policies at the application layer.
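
That last point is worth a small, engine agnostic illustration. The sketch below is hypothetical (replica URLs, prefix window and hashing scheme are illustrative choices, not TensorRT LLM APIs): it routes requests that share a prompt prefix to the same replica so that the engine's KV reuse actually gets hits.

```python
# Hypothetical cache-aware router: requests sharing a prompt prefix go to the
# same replica, so its reused/offloaded KV blocks for that prefix stay warm.
# Replica URLs and the prefix window are illustrative, not any engine's API.
import hashlib

REPLICAS = ["http://trtllm-0:8000", "http://trtllm-1:8000", "http://trtllm-2:8000"]
PREFIX_CHARS = 2000  # route on the leading characters as a cheap prefix proxy

def pick_replica(prompt: str) -> str:
    prefix = prompt[:PREFIX_CHARS]
    digest = hashlib.sha256(prefix.encode()).digest()
    return REPLICAS[int.from_bytes(digest[:4], "big") % len(REPLICAS)]

# Two requests with the same system prompt land on the same replica.
system = "You are a support bot for ACME. Follow the policy below...\n"
print(pick_replica(system + "User: my order is late"))
print(pick_replica(system + "User: how do I return an item"))
```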

Where it fits

  • Latency critical workloads and NVIDIA only environments, where teams can invest in engine builds and per model tuning.

3. Hugging Face TGI v3

Design

Text Generation Inference (TGI) is a server focused stack with:

  • Rust based HTTP and gRPC server
  • Continuous batching, streaming, safety hooks
  • Backends for PyTorch and TensorRT, plus tight Hugging Face Hub integration

TGI v3 adds a new long context pipeline:

  • Chunked prefill for long inputs
  • Prefix KV caching so long conversation histories are not recomputed on each request
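
To make the second point concrete, here is a hedged client side sketch against a TGI server's /generate endpoint (the URL, container tag and model below are placeholders): each turn resends the full conversation, and the v3 prefix KV cache is what keeps prefill cost from growing with the history.

```python
# Multi-turn client against a TGI server (URL and model are placeholders), e.g.
#   docker run --gpus all -p 8080:80 ghcr.io/huggingface/text-generation-inference:latest \
#       --model-id meta-llama/Llama-3.1-8B-Instruct
# Each request resends the whole history; prefix caching avoids recomputing it.
import requests

TGI_URL = "http://localhost:8080/generate"

history = "System: You are a concise assistant.\n"
for user_turn in ["What is a KV cache?", "Why does prefill dominate long prompts?"]:
    history += f"User: {user_turn}\nAssistant:"
    resp = requests.post(
        TGI_URL,
        json={"inputs": history, "parameters": {"max_new_tokens": 128}},
        timeout=60,
    )
    answer = resp.json()["generated_text"]
    history += answer + "\n"
    print(answer)
```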

Performance

For conventional prompts, recent third party work shows:

  • vLLM often edges out TGI on raw tokens per second at high concurrency due to PagedAttention, but the difference is not huge on many setups.
  • TGI v3 processes around 3× more tokens and is up to 13× faster than vLLM on long prompts, under a setup with very long histories and prefix caching enabled.

Latency profile:

  • P50 for short and mid length prompts is similar to vLLM when both are tuned with continuous batching.
  • For long chat histories, prefill dominates in naive pipelines; TGI v3’s reuse of earlier tokens gives a large win in TTFT and P50.

KV and memory behavior

  • TGI uses KV caching with paged attention style kernels and reduces memory footprint through chunking of prefill and other runtime changes.
  • It integrates quantization through bitsandbytes and GPTQ, and runs across several hardware backends.

Where it fits

  • Production stacks already on Hugging Face, especially for chat style workloads with long histories where prefix caching gives large real world gains.

4. LMDeploy

Design

LMDeploy is a toolkit for compression and deployment from the InternLM ecosystem. It exposes two engines:

  • TurboMind: high performance CUDA kernels for NVIDIA GPUs
  • PyTorch engine: flexible fallback

Key runtime features:

  • Persistent, continuous batching
  • Blocked KV cache with a manager for allocation and reuse
  • Dynamic split and fuse for attention blocks
  • Tensor parallelism
  • Weight only and KV quantization (including AWQ and online INT8 / INT4 KV quant)
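
A short sketch of how these pieces are usually wired together through LMDeploy's pipeline API follows; the parameter names (quant_policy, cache_max_entry_count) assume a recent LMDeploy release and the model name is a placeholder.

```python
# LMDeploy pipeline with the TurboMind backend (parameter names assume a recent
# release). quant_policy=8 requests online INT8 KV quantization; 4 would mean
# INT4. cache_max_entry_count is the share of free VRAM for the blocked KV cache.
from lmdeploy import pipeline, TurbomindEngineConfig, GenerationConfig

engine_cfg = TurbomindEngineConfig(
    quant_policy=8,
    cache_max_entry_count=0.8,
    session_len=8192,
)

pipe = pipeline("internlm/internlm2_5-7b-chat", backend_config=engine_cfg)
gen_cfg = GenerationConfig(max_new_tokens=128, temperature=0.7)

# Persistent/continuous batching handles the request list internally.
for out in pipe(["What is a blocked KV cache?", "Explain AWQ in one paragraph."],
                gen_config=gen_cfg):
    print(out.text)
```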

The LMDeploy team reports up to 1.8× higher request throughput than vLLM, attributing the gain to persistent batching, blocked KV cache, and optimized kernels.

Performance

Evaluations show:

  • For 4 bit Llama style models on A100, LMDeploy can reach higher tokens per second than vLLM under comparable latency constraints, especially at high concurrency.
  • The project also reports that 4 bit inference is about 2.4× faster than FP16 for supported models.

Latency:

  • Single request TTFT is in the same ballpark as other optimized GPU engines when configured without extreme batch limits.
  • Under heavy concurrency, persistent batching plus blocked KV let LMDeploy sustain high throughput without TTFT collapse.

KV and memory behavior

  • Blocked KV cache trades contiguous per sequence buffers for a grid of KV chunks managed by the runtime, similar in spirit to vLLM’s PagedAttention but with a different internal layout.
  • Support for weight and KV quantization targets large models on constrained GPUs.

Where it fits

  • NVIDIA centric deployments that want maximum throughput and are comfortable using TurboMind and LMDeploy specific tooling.

5. SGLang

Design

SGLang is both:

  • A DSL for building structured LLM programs such as agents, RAG workflows and tool pipelines
  • A runtime that implements RadixAttention, a KV reuse mechanism that shares prefixes using a radix tree structure rather than simple block hashes.

RadixAttention:

  • Stores KV for many requests in a prefix tree keyed by tokens
  • Enables high KV hit rates when many calls share prefixes, such as few shot prompts, multi turn chat, or tool chains
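
A brief sketch of the DSL side, assuming a local SGLang server at a placeholder endpoint: every call below shares the same few shot header, which is exactly the pattern RadixAttention serves out of the radix tree instead of recomputing.

```python
# SGLang frontend sketch (endpoint is a placeholder; start the server separately,
# e.g. python -m sglang.launch_server --model-path <model> --port 30000).
# All calls share FEW_SHOT, so its KV is computed once and reused via RadixAttention.
import sglang as sgl

sgl.set_default_backend(sgl.RuntimeEndpoint("http://localhost:30000"))

FEW_SHOT = (
    "Classify the sentiment of each review as positive or negative.\n"
    "Review: 'Great battery life.' -> positive\n"
    "Review: 'Screen died in a week.' -> negative\n"
)

@sgl.function
def classify(s, review):
    s += FEW_SHOT + f"Review: '{review}' -> "
    s += sgl.gen("label", max_tokens=4)

# run_batch issues the calls concurrently; the shared prefix hits the radix tree.
states = classify.run_batch([
    {"review": "Shipping was fast and packaging was solid."},
    {"review": "The app crashes every time I open it."},
])
for st in states:
    print(st["label"])
```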

Performance

Key Insights:

  • SGLang achieves up to 6.4× higher throughput and up to 3.7× lower latency than baseline systems such as vLLM, LMQL and others on structured workloads.
  • Improvements are largest when there is heavy prefix reuse, for example multi turn chat or evaluation workloads with repeated context.

Reported KV cache hit rates range from roughly 50% to 99%, and cache aware schedulers get close to the optimal hit rate on the measured benchmarks.

KV and memory behavior

  • RadixAttention sits on top of paged attention style kernels and focuses on reuse rather than just allocation.
  • SGLang integrates well with hierarchical context caching systems that move KV between GPU and CPU when sequences are long, although those systems are usually implemented as separate projects.

Where it fits

  • Agentic systems, tool pipelines, and heavy RAG applications where many calls share large prompt prefixes and KV reuse matters at the application level.

6. DeepSpeed Inference / ZeRO Inference

Design

DeepSpeed provides two pieces relevant for inference:

  • DeepSpeed Inference: optimized transformer kernels plus tensor and pipeline parallelism
  • ZeRO Inference / ZeRO Offload: techniques that offload model weights, and in some setups KV cache, to CPU or NVMe to run very large models on limited GPU memory

ZeRO Inference focuses on:

  • Keeping little or none of the model weights resident on the GPU
  • Streaming tensors from CPU or NVMe as needed
  • Targeting throughput and model size rather than low latency
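
The sketch below shows what this looks like in code, following the shape of the DeepSpeed plus Hugging Face non-Trainer integration examples; the config fields and model are assumptions to check against your installed versions, and a small model stands in for a 30B class one so the sketch stays cheap.

```python
# ZeRO-Inference style sketch: a ZeRO stage-3 config with parameters offloaded
# to CPU (or NVMe) and streamed to the GPU layer by layer. Field names follow
# DeepSpeed's published examples; treat them as assumptions for your version.
# Run with: deepspeed --num_gpus 1 this_script.py
import deepspeed
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
from transformers.integrations import HfDeepSpeedConfig

MODEL = "facebook/opt-1.3b"  # stand-in; the same config is what fits 30B on 32GB

ds_config = {
    "fp16": {"enabled": True},
    "zero_optimization": {
        "stage": 3,
        "offload_param": {"device": "cpu", "pin_memory": True},
        # NVMe variant: {"device": "nvme", "nvme_path": "/local_nvme"}
    },
    "train_micro_batch_size_per_gpu": 1,  # required field even for inference
}

dschf = HfDeepSpeedConfig(ds_config)  # must exist before from_pretrained (zero.Init)
model = AutoModelForCausalLM.from_pretrained(MODEL, torch_dtype=torch.float16)
tok = AutoTokenizer.from_pretrained(MODEL)

engine = deepspeed.initialize(model=model, config=ds_config)[0]
engine.module.eval()

inputs = tok("Offload-based serving trades latency for model size because",
             return_tensors="pt").to("cuda")
with torch.no_grad():
    out = engine.module.generate(**inputs, max_new_tokens=64)
print(tok.decode(out[0], skip_special_tokens=True))
```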

Performance

In the ZeRO Inference OPT 30B example on a single V100 32GB:

  • Full CPU offload reaches about 43 tokens per second
  • Full NVMe offload reaches about 30 tokens per second
  • Both are 1.3–2.4× faster than partial offload configurations, because full offload enables larger batch sizes

These numbers are small compared to GPU resident LLM runtimes on A100 or H100, but they apply to a model that does not fit natively in 32GB.

A recent I/O characterization of DeepSpeed and FlexGen confirms that offload based systems are dominated by small reads (around 128 KiB) and that I/O behavior becomes the main bottleneck.

KV and memory behavior

  • Model weights and sometimes KV blocks are offloaded to CPU or SSD to fit models beyond GPU capacity.
  • TTFT and P99 are high compared to pure GPU engines, but the tradeoff is the ability to run very large models that otherwise would not fit.

Where it fits

  • Offline or batch inference, or low QPS services where model size matters more than latency and GPU count.

Comparison Tables

This table summarizes the main tradeoffs qualitatively:

| Runtime | Main design idea | Relative strength | KV strategy | Typical use case |
| --- | --- | --- | --- | --- |
| vLLM | PagedAttention, continuous batching | High tokens per second at a given TTFT | Paged KV blocks, FP8 KV support | General purpose GPU serving, multi hardware |
| TensorRT LLM | Compiled kernels on NVIDIA + KV reuse | Very low latency and high throughput on NVIDIA | Paged, quantized KV, reuse and offload | NVIDIA only, latency sensitive |
| TGI v3 | HF serving layer with long prompt path | Strong long prompt performance, integrated stack | Paged KV, chunked prefill, prefix caching | HF centric APIs, long chat histories |
| LMDeploy | TurboMind kernels, blocked KV, quant | Up to 1.8× vLLM throughput in vendor tests | Blocked KV cache, weight and KV quant | NVIDIA deployments focused on raw throughput |
| SGLang | RadixAttention and structured programs | Up to 6.4× throughput and 3.7× lower latency on structured workloads | Radix tree KV reuse over prefixes | Agents, RAG, high prefix reuse |
| DeepSpeed | GPU/CPU/NVMe offload for huge models | Enables large models on small GPUs; throughput oriented | Offloaded weights and sometimes KV | Very large models, offline or low QPS |

Choosing a runtime in practice

For a production system, the choice tends to collapse to a few simple patterns:

  • You want a strong default engine with minimal custom work: You can start with vLLM. It gives you good throughput, reasonable TTFT, and solid KV handling on common hardware.
  • You are committed to NVIDIA and want fine grained control over latency and KV: You can use TensorRT LLM, likely behind Triton or TGI. Plan for model specific engine builds and tuning.
  • Your stack is already on Hugging Face and you care about long chats: You can use TGI v3. Its long prompt pipeline and prefix caching are very effective for conversation style traffic.
  • You want maximum throughput per GPU with quantized models: You can use LMDeploy with TurboMind and blocked KV, especially for 4 bit Llama family models.
  • You are building agents, tool chains or heavy RAG systems: You can use SGLang and design prompts so that KV reuse via RadixAttention is high.
  • You must run very large models on limited GPUs: You can use DeepSpeed Inference / ZeRO Inference, accept higher latency, and treat the GPU as a throughput engine with SSD in the loop.

Overall, all these engines are converging on the same idea: KV cache is the real bottleneck resource. The winners are the runtimes that treat KV as a first class data structure to be paged, quantized, reused and offloaded, not just a big tensor slapped into GPU memory.



