vLLM vs TensorRT-LLM vs HF TGI vs LMDeploy, A Deep Technical Comparison for Production LLM Inference

By Josh · November 20, 2025


Production LLM serving is now a systems problem, not a generate() loop. For real workloads, the choice of inference stack drives your tokens per second, tail latency, and ultimately cost per million tokens on a given GPU fleet.

This comparison focuses on four widely used stacks:

  • vLLM
  • NVIDIA TensorRT-LLM
  • Hugging Face Text Generation Inference (TGI v3)
  • LMDeploy

1. vLLM: PagedAttention as the open baseline

Core idea

vLLM is built around PagedAttention, an attention implementation that treats the KV cache like paged virtual memory rather than a single contiguous buffer per sequence.

Instead of allocating one big KV region per request, vLLM:

  • Divides the KV cache into fixed-size blocks
  • Maintains a block table that maps logical token positions to physical blocks
  • Shares blocks between sequences wherever prefixes overlap

This reduces external fragmentation and lets the scheduler pack many more concurrent sequences into the same VRAM.
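
To make the memory effect concrete, here is a small illustrative calculation (not vLLM internals, just the block arithmetic) comparing block-based KV reservation with a naive per-request contiguous reservation. The block size of 16 tokens matches vLLM's common default; the sequence lengths and context limit are made-up values.

```python
# Illustrative sketch (not vLLM internals): how fixed-size KV blocks bound
# per-sequence waste compared with reserving the full context length up front.
import math

BLOCK_SIZE = 16          # tokens per KV block (a common vLLM default)
MAX_MODEL_LEN = 4096     # hypothetical context limit

def blocks_needed(num_tokens: int) -> int:
    """Physical KV blocks a sequence of num_tokens currently occupies."""
    return math.ceil(num_tokens / BLOCK_SIZE)

seq_lens = [37, 512, 1900]   # current lengths of three live requests
paged = sum(blocks_needed(n) * BLOCK_SIZE for n in seq_lens)
contiguous = len(seq_lens) * MAX_MODEL_LEN   # naive: reserve max length per request

print(f"paged KV slots reserved:      {paged}")        # 2464
print(f"contiguous KV slots reserved: {contiguous}")   # 12288
# Per-sequence waste is bounded by BLOCK_SIZE - 1 slots, which is why the
# scheduler can pack far more concurrent sequences into the same VRAM.
```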

Throughput and latency

The vLLM paper reports 2–4× higher throughput than systems like FasterTransformer and Orca at similar latency, with larger gains for longer sequences.

Key properties for operators:

  • Continuous batching (also called in-flight batching) merges incoming requests into existing GPU batches instead of waiting for fixed batch windows.
  • On typical chat workloads, throughput scales close to linearly with concurrency until KV memory or compute saturates.
  • P50 latency remains low at moderate concurrency, but P99 can degrade once queues grow long or KV memory gets tight, especially for prefill-heavy queries.

vLLM exposes an OpenAI compatible HTTP API and integrates well with Ray Serve and other orchestrators, which is why it is widely used as an open baseline.
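
A minimal client sketch, assuming a vLLM server has already been started locally (the model name and port are example values); because the server speaks the OpenAI chat-completions protocol, the standard openai client works unchanged.

```python
# Hedged sketch: querying a locally running vLLM server through its
# OpenAI-compatible API. Assumes something like the following was started first:
#   vllm serve meta-llama/Llama-3.1-8B-Instruct --port 8000
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

resp = client.chat.completions.create(
    model="meta-llama/Llama-3.1-8B-Instruct",   # example model id
    messages=[{"role": "user", "content": "Explain continuous batching in one paragraph."}],
    max_tokens=200,
)
print(resp.choices[0].message.content)
```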

KV cache and multi-tenancy

  • PagedAttention gives near-zero KV waste and flexible prefix sharing within and across requests.
  • Each vLLM process serves one model; multi-tenant and multi-model setups are usually built with an external router or API gateway that fans out to multiple vLLM instances.

2. TensorRT-LLM: hardware maximum on NVIDIA GPUs

Core idea

TensorRT-LLM is NVIDIA’s optimized inference library for its own GPUs. It provides custom attention kernels, in-flight batching, paged KV caching, quantization down to FP4 and INT4, and speculative decoding.

It is tightly coupled to NVIDIA hardware, including FP8 tensor cores on Hopper and Blackwell.
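
For orientation, a hedged sketch of the high-level LLM API that recent TensorRT-LLM releases expose; exact import paths, option names, and the engine-build workflow vary by version, and the model id is only an example.

```python
# Hedged sketch of TensorRT-LLM's high-level LLM API (recent releases).
# Import paths and options vary across versions; check the docs for your release.
from tensorrt_llm import LLM, SamplingParams

# Engine compilation/optimization for the target GPU happens when the model
# is loaded; the model id below is an example placeholder.
llm = LLM(model="meta-llama/Llama-3.1-8B-Instruct")

params = SamplingParams(max_tokens=128, temperature=0.7)
outputs = llm.generate(["Summarize PagedAttention in two sentences."], params)

for out in outputs:
    print(out.outputs[0].text)
```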

Measured performance

NVIDIA’s H100 vs A100 evaluation is the most concrete public reference:

  • On H100 with FP8, TensorRT-LLM reaches over 10,000 output tokens/s at peak throughput for 64 concurrent requests, with ~100 ms time to first token.
  • H100 FP8 achieves up to 4.6× higher max throughput and 4.4× faster first-token latency than A100 on the same models.

For latency-sensitive modes:

  • TensorRT-LLM on H100 can drive TTFT below 10 ms in batch-1 configurations, at the cost of lower overall throughput.

These numbers are model- and shape-specific, but they give a realistic sense of scale.

Prefill vs decode

TensorRT-LLM optimizes both phases:

  • Prefill benefits from high-throughput FP8 attention kernels and tensor parallelism
  • Decode benefits from CUDA graphs, speculative decoding, quantized weights and KV, and kernel fusion

The result is very high tokens/s across a wide range of input and output lengths, especially when the engine is tuned for that model and batch profile.

KV cache and multi-tenancy

TensorRT-LLM provides:

  • Paged KV cache with configurable layout
  • Support for long sequences, KV reuse, and offloading
  • In-flight batching and priority-aware scheduling primitives

NVIDIA pairs this with Ray-based or Triton-based orchestration patterns for multi-tenant clusters. Multi-model support is handled at the orchestrator level, not inside a single TensorRT-LLM engine instance.

3. Hugging Face TGI v3: long-prompt specialist and multi-backend gateway

Core idea

Text Generation Inference (TGI) is a Rust- and Python-based serving stack that adds:

  • HTTP and gRPC APIs
  • Continuous batching scheduler
  • Observability and autoscaling hooks
  • Pluggable backends, including vLLM-style engines, TensorRT-LLM, and other runtimes

Version 3 focuses on long-prompt processing through chunking and prefix caching.

Long-prompt benchmark vs vLLM

The TGI v3 docs give a clear benchmark:

  • On long prompts with more than 200,000 tokens, a conversation reply that takes 27.5 s in vLLM can be served in about 2 s in TGI v3.
  • This is reported as a 13× speedup on that workload.
  • TGI v3 is able to process about 3× more tokens in the same GPU memory by reducing its memory footprint and exploiting chunking and caching.

The mechanism is:

  • TGI keeps the original conversation context in a prefix cache, so subsequent turns only pay for incremental tokens
  • Cache lookup overhead is on the order of microseconds, negligible relative to prefill compute

This is a targeted optimization for workloads where prompts are extremely long and reused across turns, for example RAG pipelines and analytic summarization.
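
A hedged, client-side sketch of that pattern, assuming a TGI v3 server is already running locally and exposing its OpenAI-compatible chat endpoint (the port, placeholder context, and model="tgi" value are example assumptions). Because every turn resends the same growing prefix, the server's prefix cache lets later turns skip most of the prefill for the shared context.

```python
# Hedged sketch: multi-turn reuse of a long context against a running TGI v3
# server via its OpenAI-compatible chat endpoint (port and model id are examples).
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8080/v1", api_key="-")

messages = [
    {"role": "user", "content": "<very long RAG context>\n\nQuestion: summarize section 2."}
]
first = client.chat.completions.create(model="tgi", messages=messages, max_tokens=300)
messages.append({"role": "assistant", "content": first.choices[0].message.content})

# The second turn resends the shared prefix; prefill for it is mostly a cache hit,
# so only the new tokens pay prefill cost.
messages.append({"role": "user", "content": "Now compare it with section 3."})
second = client.chat.completions.create(model="tgi", messages=messages, max_tokens=300)
print(second.choices[0].message.content)
```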

Architecture and latency behavior

Key components:

  • Chunking: very long prompts are split into manageable segments for KV allocation and scheduling
  • Prefix caching: a data structure that shares long context across turns
  • Continuous batching: incoming requests join batches of already running sequences
  • PagedAttention and fused kernels in the GPU backends

For short, chat-style workloads, throughput and latency are in the same ballpark as vLLM. For long, cacheable contexts, both P50 and P99 latency improve by an order of magnitude because the engine avoids repeated prefill.

Multi-backend and multi-model

TGI is designed around a router-plus-model-server architecture. It can:

  • Route requests across many models and replicas
  • Target different backends, for example TensorRT-LLM on H100s plus CPUs or smaller GPUs for low-priority traffic

This makes it suitable as a central serving tier in multi-tenant environments.

4. LMDeploy: TurboMind with blocked KV and aggressive quantization

Core idea

LMDeploy, from the InternLM ecosystem, is a toolkit for compressing and serving LLMs, centered on the TurboMind engine. It focuses on:

  • High-throughput request serving
  • Blocked KV cache
  • Persistent batching (continuous batching)
  • Quantization of weights and KV cache

Relative throughput vs vLLM

The project states:

  • “LMDeploy delivers up to 1.8× higher request throughput than vLLM,” crediting persistent batching, blocked KV cache, dynamic split-and-fuse, tensor parallelism, and optimized CUDA kernels.

KV, quantization and latency

LMDeploy includes:

  • A blocked KV cache, similar to paged KV, that helps pack many sequences into VRAM
  • Support for KV cache quantization, typically int8 or int4, to cut KV memory and bandwidth
  • Weight-only quantization paths such as 4-bit AWQ
  • A benchmarking harness that reports token throughput, request throughput, and first-token latency

This makes LMDeploy attractive when you want to run larger open models like InternLM or Qwen on mid-range GPUs with aggressive compression while still maintaining good tokens/s.
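
A hedged sketch of that setup with LMDeploy's Python pipeline and the TurboMind backend; option names follow current LMDeploy docs but may differ across versions, and the model id and memory fraction below are example choices.

```python
# Hedged sketch: LMDeploy pipeline with the TurboMind backend and int8 KV cache.
# Option names may differ across LMDeploy versions; the model id is an example.
from lmdeploy import pipeline, TurbomindEngineConfig

engine_cfg = TurbomindEngineConfig(
    quant_policy=8,              # int8 KV cache quantization (4 would select int4)
    cache_max_entry_count=0.6,   # fraction of GPU memory reserved for KV blocks
    tp=1,                        # tensor parallel degree
)

pipe = pipeline("internlm/internlm2_5-7b-chat", backend_config=engine_cfg)
responses = pipe(["What does a blocked KV cache buy you at high concurrency?"])
print(responses[0].text)
```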

Multi-model deployments

LMDeploy provides a proxy server that can handle:

  • Multi-model deployments
  • Multi-machine, multi-GPU setups
  • Routing logic that selects models based on request metadata

Architecturally, then, it sits closer to TGI than to a bare single-engine server.

What to use when?

  • If you want maximum throughput and very low TTFT on NVIDIA GPUs
    • TensorRT-LLM is the primary choice
    • It uses FP8 and lower precisions, custom kernels, and speculative decoding to push tokens/s while keeping TTFT under 100 ms at high concurrency and under 10 ms at low concurrency
  • If you are dominated by long prompts with reuse, such as RAG over large contexts
    • TGI v3 is a strong default
    • Its prefix cache and chunking give up to 3× token capacity and 13× lower latency than vLLM in published long-prompt benchmarks, without extra configuration
  • If you want an open, simple engine with strong baseline performance and an OpenAI style API
    • vLLM remains the standard baseline
    • PagedAttention and continuous batching make it 2–4× faster than older stacks at similar latency, and it integrates cleanly with Ray and K8s
  • If you target open models such as InternLM or Qwen and value aggressive quantization with multi model serving
    • LMDeploy is a good fit
    • Blocked KV cache, persistent batching and int8 or int4 KV quantization give up to 1.8× higher request throughput than vLLM on supported models, with a router layer included

In practice, many teams mix these systems, for example TensorRT-LLM for high-volume proprietary chat, TGI v3 for long-context analytics, and vLLM or LMDeploy for experimental and open-model workloads. The key is to align throughput, latency tails, and KV behavior with the actual token distributions in your traffic, and then compute cost per million tokens from measured tokens/s on your own hardware, as in the quick calculation below.
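
A back-of-the-envelope version of that last step; all numbers are placeholders to be replaced with your own GPU price and measured throughput.

```python
# Rough cost-per-million-output-tokens from measured throughput.
# Both inputs below are placeholders; substitute your own numbers.
gpu_cost_per_hour = 4.00        # USD per GPU-hour, e.g. an on-demand H100 quote
measured_tokens_per_sec = 3500  # sustained output tokens/s from your own benchmark

tokens_per_hour = measured_tokens_per_sec * 3600
cost_per_million_tokens = gpu_cost_per_hour / tokens_per_hour * 1_000_000
print(f"${cost_per_million_tokens:.3f} per million output tokens")  # ≈ $0.317
```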


References

  1. vLLM / PagedAttention
  2. TensorRT-LLM performance and overview
  3. HF Text Generation Inference (TGI v3) long-prompt behavior
  4. LMDeploy / TurboMind

