• About Us
  • Disclaimer
  • Contact Us
  • Privacy Policy
Thursday, May 21, 2026
mGrowTech
No Result
View All Result
  • Technology And Software
    • Account Based Marketing
    • Channel Marketing
    • Marketing Automation
      • Al, Analytics and Automation
      • Ad Management
  • Digital Marketing
    • Social Media Management
    • Google Marketing
  • Direct Marketing
    • Brand Management
    • Marketing Attribution and Consulting
  • Mobile Marketing
  • Event Management
  • PR Solutions
  • Technology And Software
    • Account Based Marketing
    • Channel Marketing
    • Marketing Automation
      • Al, Analytics and Automation
      • Ad Management
  • Digital Marketing
    • Social Media Management
    • Google Marketing
  • Direct Marketing
    • Brand Management
    • Marketing Attribution and Consulting
  • Mobile Marketing
  • Event Management
  • PR Solutions
No Result
View All Result
mGrowTech
No Result
View All Result
Home Google Marketing

Blazing fast on-device GenAI with LiteRT-LM

Josh by Josh
May 21, 2026
in Google Marketing
0
Blazing fast on-device GenAI with LiteRT-LM


may2026_liteRT-LM_v2_2x

When it comes to bringing advanced AI to the edge, Google AI Edge’s LiteRT-LM delivers one of the most powerful and optimized experiences for deploying Gemma 4 across platforms. Leveraging LiteRT (formerly TensorFlow Lite) for inference, LiteRT-LM empowers local AI across a multitude of Google products—including Chrome, ChromeOS, the Pixel Watch, and the recent viral Google AI Edge Gallery app (Android / iOS). From unlocking state-of-the-art agentic capabilities with Gemma 4 to scaling our demanding production use cases, this proven engine is now ready to power your own applications. Read on for a deep dive into the underlying stack and how you can use LiteRT-LM for your own edge LLM deployments.

State-of-the-art performance

To fully unlock Gemma 4 on-device, we leverage the Google AI Edge stack, the most performant way to run Gemma 4 across platforms (for even greater performance, Gemma 4 can be run as system-service via Android AICore). To navigate the competing demands of restricted memory, limited compute, and fragmented hardware, this stack supports advanced quantization schemes alongside a foundation of accelerated XNNPACK and MLDrift kernels. By coupling this efficient footprint with the LiteRT runtime, the stack unlocks seamless model execution and broad portability across CPU, GPU, and NPU backends. Finally, at the orchestration layer, LiteRT-LM utilizes optimized pipelines to avoid costly CPU/GPU data transfers, alongside Multi-Token Prediction (MTP) and advanced session management. Together, this complete integration provides the highest-performing runtime environment for Gemma models.

litert_lm_perf_comp_v2

LiteRT-LM prefill and decode performance running Gemma 4 E2B
(Android: Samsung S26 Ultra, iOS: iPhone 17 Pro, Web: Chrome on a MacBook Pro 2024 with Apple M4 Max).

Built for speed across hardware backends and platforms

LiteRT-LM is engineered to deliver exceptional performance across the entire edge ecosystem, ensuring low-latency inference on Android, iOS, and the open web. To achieve this, the runtime provides the most optimal hardware backend optimizations through LiteRT, seamlessly accelerating workloads via CPU, GPU, and NPU (currently on Android). This approach allows developers to build once and achieve peak performance everywhere:

  • When running Gemma 4 E2B without MTP enabled, LiteRT-LM achieves an impressive 52 tokens/sec decode speed via the GPU backend on Android (OpenCL), and 56 tokens/sec on iOS (Metal).
  • On the web, using WebGPU, developers can expect decode speeds of up to 76 tokens/sec decode on a Macbook Pro, proving that state-of-the-art on-device AI is now a reality regardless of the user’s platform or hardware.

Multi-Token Prediction (MTP) for peak throughput

One of the most significant performance milestones in the LiteRT-LM pipeline is our native support for the Multi-Token Prediction (MTP) drafters recently launched with the Gemma 4 model family. By integrating this specialized speculative decoding architecture, LiteRT-LM bypasses traditional latency bottlenecks to deliver up to a 2.2x speedup.

Standard LLM inference is fundamentally memory-bandwidth bound; processors spend the majority of their time moving billions of parameters from VRAM to compute units just to generate a single token. While speculative decoding mitigates this, naive implementations can introduce new bottlenecks. LiteRT-LM prevents this by optimizing the data interplay between the primary Gemma 4 model and the MTP drafter.

To achieve this, LiteRT-LM enforces memory locality by executing both the lightweight MTP drafter and the primary model on the same hardware IP (e.g., the GPU). Managing the shared KV cache and activations within local memory entirely eliminates the latency penalties of cross-IP synchronization and data transfers. Once the drafter predicts future tokens, the primary model evaluates them using optimized kernels that maximize parallelization during verification. This streamlined architecture accelerates multi-token throughput without losing reasoning quality.

Sorry, your browser doesn’t support playback for this video

new_MTP_speedup

Enabling MTP in the LiteRT-LM pipeline requires only two lines of configuration, instantly unlocking up to 2.2x decoding speedup for low-latency applications. Numbers reported are collected on Samsung S26 Ultra using the GPU backend.

Session management for speed and continuity

Advanced session management in LiteRT-LM fundamentally transforms how mobile applications handle long-context interactions. By supporting native session save and restore capabilities, the engine allows large KV cache states—representing longer context histories—to be serialized and safely preserved across sessions. This architecture guarantees seamless user continuity, allowing conversations or workflows to be resumed seamlessly. Beyond user-experience benefits, this mechanism provides better backend efficiency: preserving context states reduces the need for redundant computations and bypasses heavy prefill phases on returning sessions. This efficiency powers dynamic features like the extended Agent Skills in the Google AI Edge Gallery app, driving down overall compute costs while delivering an incredibly fast, end-to-end on-device experience.

Efficient memory utilization

To ensure seamless on-device deployment of Gemma 4’s native vision and audio capabilities, LiteRT-LM employs advanced memory footprint optimizations that maximize efficiency within strict hardware constraints. The runtime strategically reduces overhead by keeping per-layer embeddings (PLEs) out of memory and by dynamically loading image and audio encoders only when a specific task requires them, ensuring that text-only workloads remain exceptionally lightweight. LiteRT-LM also highly optimizes overall memory consumption for CPU execution, allowing developers to achieve robust performance while maintaining a minimal device footprint—be sure to check out the official model cards (E2B, E4B) for specific memory benchmarks.

The result of these combined techniques is a lean runtime footprint — for instance, LiteRT-LM successfully runs the ~2.58GB Gemma 4 E2B model with a physical memory footprint of just 607MB on Apple mobile CPUs utilizing XNNPACK’s weight caching mechanism. This reduction in active memory overhead ensures robust, enterprise-grade AI performance without compromising your app’s overall stability.

Orchestrating agentic workflows: thinking, formatting, and acting

To ensure the model executes highly complex, multi-step tasks before triggering any external actions, LiteRT-LM natively supports Thinking Mode (available in the Gemma 4 model family). By dedicating a scratchpad for step-by-step reasoning before the model commits to an action, LiteRT-LM can significantly improve the output quality. Developers can choose to stream this raw reasoning process directly to the UI or strip it to save critical KV cache space in multi-turn mobile sessions.

Sorry, your browser doesn’t support playback for this video

Once the model has finished its internal reasoning, keeping its output structured is critical. Coupled with robust constrained decoding (CD), developers can enforce strict JSON schemas or specific output grammar on the final generated tool payload, completely eliminating parser breaking.

quality

Quality improvement from thinking + constrained decoding support, on Samsung S25 Ultra CPU.

With deep thinking and strict boundaries established, the model is ready to act. Moving beyond raw generation, LiteRT-LM supports the native function-calling capabilities introduced in FunctionGemma and perfected in Gemma 4. The runtime seamlessly pauses execution, returns structured tool-call requests to your application layer, and resumes upon receiving the tool’s output.

Expanding the integration surface

LiteRT-LM was built from the ground up to be cross-platform, and we are now expanding beyond Android support (Kotlin/C++) with new interfaces for Apple ecosystems (Swift API) and the open web (JavaScript API) .

Native Development with Swift

Expanding its state-of-the-art performance for Gemma models, LiteRT-LM now unlocks native Apple development with a fully open-source iOS Swift API.

new_ios_2

Performance comparison of LiteRT-LM for iOS Swift Vs. MLX, tested on iPhone 17 Pro.

High-Performance browser inference with WebGPU

We’re also bringing the power of LiteRT-LM to the browser. These production-proven inference pipelines are now fully accessible on the web (WASM) through our JavaScript API. Powered by WebGPU, LiteRT-LM delivers lightning-fast LLM routing and execution client-side, unlocking web applications that are serverless, secure, and completely privacy-preserving. Building upon the foundational success of the MediaPipe LLM Inference engine’s web solution, this native web support in LiteRT-LM represents the next evolution in our on-device AI stack.

Sorry, your browser doesn’t support playback for this video

LiteRT-LM web demo running on an Apple MacBook Pro M3 36GB with 18 GPU cores.

Our web solution offers significant performance gains over other web-based LLM frameworks.

new_web_2

Performance comparison of LiteRT-LM.js Vs. ONNX Runtime Web, tested in Chrome on an MacBook Pro 2024 (Apple M4 Max) 48GB with 40 GPU cores.

Looking ahead

We are just scratching the surface of what is possible when you bring powerful LLM inference and true agentic skills to edge devices. LiteRT-LM eliminates the friction of managing memory, hardware acceleration, and cross-platform idiosyncrasies, letting you build the next generation of privacy-first, zero-latency applications.

We want you to try it. Download the LiteRT-LM CLI for desktop or AI Edge Gallery for mobile, or check out the code and APIs today, and we’re excited to see what you build.

Acknowledgements

We’d like to extend a special thanks to our key contributors for their foundational work on this project: Advait Jain, Alice Zheng, Cormac Brick, Byungchul Kim, Fengwu Yao, Jae Yoo, Jenn Lee, Lu Wang, Marissa Ikonomidis, Matthew Chan, Matthew Soulanille, Matthias Grundmann, Mohammadreza Heydary, Ram Iyengar, Sachin Kotwani, Salil Tambe, Suleman Shahid, Tenghui Zhu, Tyler Mullen, Vinod Mamillapalli, Wai Hon Law, Weiyi Wang, Yi-Chun Kuo, Yu-hui Chen.

Explore this announcement and all Google I/O 2026 updates on io.google.



Source_link

READ ALSO

Google announces new community investments in Missouri

Google demoed exactly why Android XR speakers respond quietly

Related Posts

Google announces new community investments in Missouri
Google Marketing

Google announces new community investments in Missouri

May 21, 2026
Google demoed exactly why Android XR speakers respond quietly
Google Marketing

Google demoed exactly why Android XR speakers respond quietly

May 21, 2026
‘Solve all diseases,’ you say?
Google Marketing

‘Solve all diseases,’ you say?

May 20, 2026
One Year of Innovation: Celebrating 100k Members in the Google Cloud x NVIDIA Developer Community
Google Marketing

One Year of Innovation: Celebrating 100k Members in the Google Cloud x NVIDIA Developer Community

May 20, 2026
Google extends partnership with Singapore government on AI
Google Marketing

Google extends partnership with Singapore government on AI

May 20, 2026
Google announces Wear OS 7 with Live Updates, widgets, more
Google Marketing

Google announces Wear OS 7 with Live Updates, widgets, more

May 20, 2026
Next Post
How ERP Integrates with CRM and E-commerce Platforms in 2026

How ERP Integrates with CRM and E-commerce Platforms in 2026

POPULAR NEWS

Trump ends trade talks with Canada over a digital services tax

Trump ends trade talks with Canada over a digital services tax

June 28, 2025
15 Trending Songs on TikTok in 2025 (+ How to Use Them)

15 Trending Songs on TikTok in 2025 (+ How to Use Them)

June 18, 2025
Communication Effectiveness Skills For Business Leaders

Communication Effectiveness Skills For Business Leaders

June 10, 2025
App Development Cost in Singapore: Pricing Breakdown & Insights

App Development Cost in Singapore: Pricing Breakdown & Insights

June 22, 2025
Comparing the Top 7 Large Language Models LLMs/Systems for Coding in 2025

Comparing the Top 7 Large Language Models LLMs/Systems for Coding in 2025

November 4, 2025

EDITOR'S PICK

Yoshua Bengio is redesigning AI safety at LawZero

Yoshua Bengio is redesigning AI safety at LawZero

June 21, 2025
41 Tips to Optimize Your Website

41 Tips to Optimize Your Website

November 7, 2025
Top Dental AI Annotation Companies 2026

Top Dental AI Annotation Companies 2026

November 21, 2025
10 Steps for Safe AI Deployment

10 Steps for Safe AI Deployment

September 3, 2025

About

We bring you the best Premium WordPress Themes that perfect for news, magazine, personal blog, etc. Check our landing page for details.

Follow us

Categories

  • Account Based Marketing
  • Ad Management
  • Al, Analytics and Automation
  • Brand Management
  • Channel Marketing
  • Digital Marketing
  • Direct Marketing
  • Event Management
  • Google Marketing
  • Marketing Attribution and Consulting
  • Marketing Automation
  • Mobile Marketing
  • PR Solutions
  • Social Media Management
  • Technology And Software
  • Uncategorized

Recent Posts

  • I Reviewed the 6 Best Personalization Software for 2026
  • How ERP Integrates with CRM and E-commerce Platforms in 2026
  • Blazing fast on-device GenAI with LiteRT-LM
  • Google unveils Gemini 3.5 Flash and a redesigned ‘intelligent Search box’
  • About Us
  • Disclaimer
  • Contact Us
  • Privacy Policy
No Result
View All Result
  • Technology And Software
    • Account Based Marketing
    • Channel Marketing
    • Marketing Automation
      • Al, Analytics and Automation
      • Ad Management
  • Digital Marketing
    • Social Media Management
    • Google Marketing
  • Direct Marketing
    • Brand Management
    • Marketing Attribution and Consulting
  • Mobile Marketing
  • Event Management
  • PR Solutions