Blazing fast on-device GenAI with LiteRT-LM

When it comes to bringing advanced AI to the edge, Google AI Edge’s LiteRT-LM delivers one of the most powerful and optimized experiences for deploying Gemma 4 across platforms. Leveraging LiteRT (formerly TensorFlow Lite) for inference, LiteRT-LM empowers local AI across a multitude of Google products—including Chrome, ChromeOS, the Pixel Watch, and the recent viral Google AI Edge Gallery app (Android / iOS). From unlocking state-of-the-art agentic capabilities with Gemma 4 to scaling our demanding production use cases, this proven engine is now ready to power your own applications. Read on for a deep dive into the underlying stack and how you can use LiteRT-LM for your own edge LLM deployments.

State-of-the-art performance

To fully unlock Gemma 4 on-device, we leverage the Google AI Edge stack, the most performant way to run Gemma 4 across platforms (for even greater performance, Gemma 4 can be run as a system service via Android AICore). To navigate the competing demands of restricted memory, limited compute, and fragmented hardware, this stack supports advanced quantization schemes alongside a foundation of accelerated XNNPACK and MLDrift kernels. By coupling this efficient footprint with the LiteRT runtime, the stack unlocks seamless model execution and broad portability across CPU, GPU, and NPU backends. Finally, at the orchestration layer, LiteRT-LM utilizes optimized pipelines to avoid costly CPU/GPU data transfers, alongside Multi-Token Prediction (MTP) and advanced session management. Together, this complete integration provides the highest-performing runtime environment for Gemma models.

LiteRT-LM prefill and decode performance running Gemma 4 E2B
(Android: Samsung S26 Ultra, iOS: iPhone 17 Pro, Web: Chrome on a MacBook Pro 2024 with Apple M4 Max).

Built for speed across hardware backends and platforms

LiteRT-LM is engineered to deliver exceptional performance across the entire edge ecosystem, ensuring low-latency inference on Android, iOS, and the open web. To achieve this, the runtime applies hardware-specific backend optimizations through LiteRT, accelerating workloads on CPU, GPU, and NPU (NPU support is currently Android-only). This approach allows developers to build once and achieve peak performance everywhere:

When running Gemma 4 E2B without MTP enabled, LiteRT-LM achieves a decode speed of 52 tokens/sec via the GPU backend on Android (OpenCL), and 56 tokens/sec on iOS (Metal).
On the web, using WebGPU, developers can expect decode speeds of up to 76 tokens/sec on a MacBook Pro, proving that state-of-the-art on-device AI is now a reality regardless of the user’s platform or hardware.
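
To make these throughput figures more tangible, the short sketch below converts them into per-token latency and into the time needed to decode a reply. The 256-token reply length is an illustrative assumption, not a number from our benchmarks, and prefill time is not included.

```python
# Decode speeds quoted above for Gemma 4 E2B without MTP (tokens per second).
DECODE_TOKENS_PER_SEC = {
    "Android GPU (OpenCL)": 52,
    "iOS GPU (Metal)": 56,
    "Web (WebGPU)": 76,
}

# Assumed reply length, chosen only for illustration.
REPLY_TOKENS = 256


def ms_per_token(tokens_per_sec: float) -> float:
    """Average time between decoded tokens, in milliseconds."""
    return 1000.0 / tokens_per_sec


def decode_seconds(num_tokens: int, tokens_per_sec: float) -> float:
    """Time to decode num_tokens at a steady rate (prefill not included)."""
    return num_tokens / tokens_per_sec


for backend, rate in DECODE_TOKENS_PER_SEC.items():
    print(
        f"{backend}: {ms_per_token(rate):.1f} ms/token, "
        f"{decode_seconds(REPLY_TOKENS, rate):.1f} s for {REPLY_TOKENS} tokens"
    )
```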

Multi-Token Prediction (MTP) for peak throughput

One of the most significant performance milestones in the LiteRT-LM pipeline is our native support for the Multi-Token Prediction (MTP) drafters recently launched with the Gemma 4 model family. By integrating this specialized speculative decoding architecture, LiteRT-LM bypasses traditional latency bottlenecks to deliver up to a 2.2x speedup.

Standard LLM inference is fundamentally memory-bandwidth bound; processors spend the majority of their time moving billions of parameters from VRAM to compute units just to generate a single token. While speculative decoding mitigates this, naive implementations can introduce new bottlenecks. LiteRT-LM prevents this by optimizing the data interplay between the primary Gemma 4 model and the MTP drafter.

To achieve this, LiteRT-LM enforces memory locality by executing both the lightweight MTP drafter and the primary model on the same hardware IP (e.g., the GPU). Managing the shared KV cache and activations within local memory entirely eliminates the latency penalties of cross-IP synchronization and data transfers. Once the drafter predicts future tokens, the primary model evaluates them using optimized kernels that maximize parallelization during verification. This streamlined architecture accelerates multi-token throughput without losing reasoning quality.
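
The draft-then-verify loop described above can be illustrated with a toy greedy example. This is a minimal sketch and not the LiteRT-LM implementation: `target_next` and `drafter_next` are hypothetical stand-ins for the primary model and the MTP drafter, the draft length of 3 is an assumption, and the shared KV cache and batched GPU verification are not modeled.

```python
from typing import List, Tuple


def target_next(prefix: List[int]) -> int:
    """Toy stand-in for the primary model (hypothetical, deterministic)."""
    return (sum(prefix) * 31 + len(prefix)) % 50


def drafter_next(prefix: List[int]) -> int:
    """Toy drafter: matches the target except when len(prefix) is a multiple of 4."""
    guess = target_next(prefix)
    return guess if len(prefix) % 4 else (guess + 1) % 50


def greedy_decode(prompt: List[int], num_tokens: int) -> List[int]:
    """Baseline: one target-model pass per generated token."""
    tokens = list(prompt)
    for _ in range(num_tokens):
        tokens.append(target_next(tokens))
    return tokens[len(prompt):]


def speculative_decode(
    prompt: List[int], num_tokens: int, draft_len: int = 3
) -> Tuple[List[int], int]:
    """Returns the generated tokens and the number of verification passes."""
    tokens = list(prompt)
    target_passes = 0
    while len(tokens) - len(prompt) < num_tokens:
        # 1) The drafter proposes draft_len tokens autoregressively.
        draft: List[int] = []
        for _ in range(draft_len):
            draft.append(drafter_next(tokens + draft))
        # 2) The target checks every drafted position. A real runtime does this
        #    in one parallel pass, so it is counted as a single pass here.
        target_passes += 1
        accepted: List[int] = []
        for i in range(draft_len):
            expected = target_next(tokens + draft[:i])
            if draft[i] == expected:
                accepted.append(draft[i])
            else:
                accepted.append(expected)  # target's correction replaces the bad draft
                break
        else:
            # Every draft was accepted, so the same pass yields one extra token.
            accepted.append(target_next(tokens + draft))
        tokens.extend(accepted)
    return tokens[len(prompt):len(prompt) + num_tokens], target_passes


spec_tokens, passes = speculative_decode([1, 2, 3], num_tokens=20)
print("identical to greedy:", spec_tokens == greedy_decode([1, 2, 3], 20))
print("target passes:", passes, "instead of 20")
```

Because every accepted token is checked against the primary model, the output matches plain greedy decoding. The gain comes only from needing fewer primary-model passes, which is why quality is preserved.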

Enabling MTP in the LiteRT-LM pipeline requires only two lines of configuration, instantly unlocking up to 2.2x decoding speedup for low-latency applications. The reported numbers were collected on a Samsung S26 Ultra using the GPU backend.

Session management for speed and continuity

Advanced session management in LiteRT-LM changes how mobile applications handle long-context interactions. With native session save and restore, the engine can serialize large KV cache states, which represent long context histories, and preserve them safely across sessions. Users can therefore resume conversations or workflows exactly where they left off. Beyond the user-experience benefits, this mechanism improves backend efficiency: preserving context states reduces redundant computation and bypasses heavy prefill phases on returning sessions. This efficiency powers dynamic features like the extended Agent Skills in the Google AI Edge Gallery app, driving down overall compute costs while delivering a fast, end-to-end on-device experience.
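
The prefill savings can be seen in a toy model of save and restore. This is a sketch under stated assumptions rather than the LiteRT-LM API: `ToySession`, its methods, and the JSON serialization are all hypothetical, and each cached integer merely stands in for the per-token key/value tensors of a real KV cache.

```python
import json
from dataclasses import dataclass, field
from typing import List


@dataclass
class ToySession:
    """Hypothetical session holding one cached state per processed token."""

    tokens: List[int] = field(default_factory=list)
    kv_cache: List[int] = field(default_factory=list)
    prefill_steps: int = 0  # tokens this object had to process itself

    def prefill(self, new_tokens: List[int]) -> None:
        """Process new tokens, extending the cache from its current end."""
        for tok in new_tokens:
            prev = self.kv_cache[-1] if self.kv_cache else 0
            # Stand-in for computing attention keys/values for this token.
            self.kv_cache.append((prev * 33 + tok) % 1_000_003)
            self.tokens.append(tok)
            self.prefill_steps += 1

    def save(self) -> bytes:
        """Serialize the context so it can outlive the process."""
        state = {"tokens": self.tokens, "kv_cache": self.kv_cache}
        return json.dumps(state).encode("utf-8")

    @classmethod
    def restore(cls, blob: bytes) -> "ToySession":
        """Rebuild a session from saved bytes without re-running prefill."""
        state = json.loads(blob.decode("utf-8"))
        return cls(tokens=state["tokens"], kv_cache=state["kv_cache"])


history = list(range(1000))  # a long earlier conversation
follow_up = [7, 8, 9]        # the user's next message

first = ToySession()
first.prefill(history)
blob = first.save()

resumed = ToySession.restore(blob)  # returning session
resumed.prefill(follow_up)

cold = ToySession()                 # same context rebuilt from scratch
cold.prefill(history + follow_up)

print("same cache:", resumed.kv_cache == cold.kv_cache)
print("prefill steps, resumed vs cold:", resumed.prefill_steps, cold.prefill_steps)
```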

Efficient memory utilization

To ensure seamless on-device deployment of Gemma 4’s native vision and audio capabilities, LiteRT-LM employs advanced memory footprint optimizations that maximize efficiency within strict hardware constraints. The runtime reduces overhead by keeping per-layer embeddings (PLEs) out of memory and by dynamically loading image and audio encoders only when a specific task requires them, so text-only workloads remain lightweight. LiteRT-LM also optimizes overall memory consumption for CPU execution, allowing developers to achieve robust performance while maintaining a minimal device footprint. See the official model cards (E2B, E4B) for specific memory benchmarks.
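
The on-demand loading idea can be sketched generically as follows. This is only an illustration of the pattern, not the runtime's internals: the `LazyEncoders` class and the loader functions are hypothetical, and the returned strings stand in for real encoder weights.

```python
from typing import Callable, Dict, List


class LazyEncoders:
    """Hypothetical registry that loads a modality encoder on first use."""

    def __init__(self, loaders: Dict[str, Callable[[], object]]) -> None:
        self._loaders = loaders
        self._loaded: Dict[str, object] = {}

    def get(self, modality: str) -> object:
        """Return the encoder for a modality, loading it only once."""
        if modality not in self._loaded:
            self._loaded[modality] = self._loaders[modality]()
        return self._loaded[modality]

    @property
    def resident(self) -> List[str]:
        """Names of the encoders currently held in memory."""
        return sorted(self._loaded)


load_log: List[str] = []


def load_image_encoder() -> str:
    load_log.append("image")
    return "image-encoder-weights"  # placeholder for real weights


def load_audio_encoder() -> str:
    load_log.append("audio")
    return "audio-encoder-weights"  # placeholder for real weights


encoders = LazyEncoders({"image": load_image_encoder, "audio": load_audio_encoder})
resident_before = encoders.resident  # text-only work: nothing loaded yet

encoders.get("image")  # first image request triggers the load
encoders.get("image")  # later requests reuse the loaded encoder

print("resident before:", resident_before)
print("resident after:", encoders.resident)
print("loads performed:", load_log)
```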

The result of these combined techniques is a lean runtime footprint. For instance, LiteRT-LM runs the ~2.58 GB Gemma 4 E2B model with a physical memory footprint of just 607 MB on Apple mobile CPUs by using XNNPACK’s weight caching mechanism. This reduction in active memory overhead ensures robust, enterprise-grade AI performance without compromising your app’s overall stability.
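
As a quick sanity check on those two numbers, the snippet below computes the resident footprint as a fraction of the model size. The text does not state whether GB means 1000 MB or 1024 MB, so both conventions are shown as assumptions.

```python
MODEL_SIZE_GB = 2.58  # approximate model size quoted above
RESIDENT_MB = 607     # physical memory footprint quoted above


def resident_fraction(resident_mb: float, model_gb: float, mb_per_gb: int) -> float:
    """Share of the model size that is resident in physical memory."""
    return resident_mb / (model_gb * mb_per_gb)


for mb_per_gb in (1000, 1024):  # unit convention is an assumption either way
    share = resident_fraction(RESIDENT_MB, MODEL_SIZE_GB, mb_per_gb)
    print(f"1 GB = {mb_per_gb} MB: resident share is {share:.1%}")
```

Under either convention, the resident footprint is a little under a quarter of the model size.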

Orchestrating agentic workflows: thinking, formatting, and acting

To ensure the model reasons through highly complex, multi-step tasks before triggering any external actions, LiteRT-LM natively supports Thinking Mode (available in the Gemma 4 model family). By dedicating a scratchpad to step-by-step reasoning before the model commits to an action, Thinking Mode can significantly improve output quality. Developers can choose to stream this raw reasoning process directly to the UI or strip it to save critical KV cache space in multi-turn mobile sessions.
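
The stream-or-strip choice can be pictured with a small text-processing sketch. The `<think>` and `</think>` delimiters below are placeholders chosen for illustration and are an assumption, not the actual Gemma 4 thinking markers. The helper names are likewise hypothetical.

```python
import re
from typing import List

# Placeholder delimiters; the real markers are model-specific (assumption).
THINK_OPEN, THINK_CLOSE = "<think>", "</think>"
_THINK_RE = re.compile(
    re.escape(THINK_OPEN) + r".*?" + re.escape(THINK_CLOSE), re.DOTALL
)


def strip_reasoning(raw: str) -> str:
    """Drop the scratchpad and keep only the final answer."""
    return _THINK_RE.sub("", raw).strip()


def history_size(turns: List[str]) -> int:
    """Rough proxy for context size: total characters kept in history."""
    return sum(len(turn) for turn in turns)


raw_reply = (
    "<think>The user wants the sum. 17 + 25 = 42. "
    "No tool call is needed.</think>The sum is 42."
)

streamed_history = [raw_reply]                   # reasoning kept in context
stripped_history = [strip_reasoning(raw_reply)]  # reasoning removed after the turn

print("kept for later turns:", stripped_history[0])
print("characters in history:", history_size(streamed_history),
      "vs", history_size(stripped_history))
```

Stripping after each turn keeps later prompts shorter, which is the KV cache saving mentioned above. The reasoning can still be shown to the user while it streams.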

Once the model has finished its internal reasoning, keeping its output structured is critical. With robust constrained decoding (CD), developers can enforce strict JSON schemas or a specific output grammar on the final generated tool payload, so malformed output never reaches the downstream parser.
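
The core idea of constrained decoding is to mask out any next step that would leave the allowed grammar, then pick the best of what remains. The sketch below shows this at the character level, with an explicit list of valid payloads standing in for a schema. It is a toy under stated assumptions: the tool names and the `toy_score` preference function are hypothetical, and a production decoder masks token logits using a compiled grammar instead of enumerating outputs.

```python
import json
from typing import List, Set

# Hypothetical tool names; the "grammar" is simply the set of valid payloads.
TOOL_NAMES = ("get_weather", "set_alarm", "send_message")
VALID_OUTPUTS: List[str] = [json.dumps({"tool": name}) for name in TOOL_NAMES]


def allowed_next_chars(prefix: str, valid_outputs: List[str]) -> Set[str]:
    """Characters that keep the prefix on a path to some valid output."""
    return {
        candidate[len(prefix)]
        for candidate in valid_outputs
        if candidate.startswith(prefix) and len(candidate) > len(prefix)
    }


def toy_score(prefix: str, ch: str) -> int:
    """Hypothetical model preference: higher code points score higher."""
    return ord(ch)


def constrained_decode(valid_outputs: List[str]) -> str:
    """Greedy decoding restricted to characters the grammar allows."""
    out = ""
    while out not in valid_outputs:
        choices = allowed_next_chars(out, valid_outputs)
        # The model still chooses, but only among grammar-legal characters.
        out += max(choices, key=lambda ch: toy_score(out, ch))
    return out


payload = constrained_decode(VALID_OUTPUTS)
parsed = json.loads(payload)  # always parses, by construction
print(payload)
```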

Quality improvement from thinking + constrained decoding support, on Samsung S25 Ultra CPU.

With deep thinking and strict boundaries established, the model is ready to act. Moving beyond raw generation, LiteRT-LM supports the native function-calling capabilities introduced in FunctionGemma and perfected in Gemma 4. The runtime seamlessly pauses execution, returns structured tool-call requests to your application layer, and resumes upon receiving the tool’s output.
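
The pause, return, and resume cycle maps naturally onto a coroutine. The sketch below models it with a Python generator and is only an illustration: `toy_model`, the tool-call dictionary shape, and the `get_weather` tool are hypothetical and do not reflect the LiteRT-LM API.

```python
from typing import Any, Callable, Dict, Generator

ToolCall = Dict[str, Any]


def toy_model(prompt: str) -> Generator[ToolCall, Any, str]:
    """Hypothetical model turn (the prompt is ignored in this toy).

    It pauses by yielding a structured tool call, receives the tool output
    when resumed, and then returns its final answer.
    """
    result = yield {"name": "get_weather", "args": {"city": "Zurich"}}
    return f"It is {result['temp_c']} C in Zurich."


def run_turn(prompt: str, tools: Dict[str, Callable[..., Any]]) -> str:
    """Application-side loop: run tools on request until the model finishes."""
    turn = toy_model(prompt)
    try:
        call = next(turn)  # model pauses with a tool-call request
        while True:
            output = tools[call["name"]](**call["args"])  # app executes the tool
            call = turn.send(output)  # model resumes and may request another tool
    except StopIteration as done:
        return done.value  # final answer once no more tools are needed


def get_weather(city: str) -> Dict[str, int]:
    """Stub tool returning fixed data for the example."""
    return {"temp_c": 21}


answer = run_turn("What is the weather in Zurich?", {"get_weather": get_weather})
print(answer)
```

Keeping tool execution in the application layer means the runtime never needs direct access to device APIs or the network. It only exchanges structured requests and results.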

Expanding the integration surface

LiteRT-LM was built from the ground up to be cross-platform, and we are now expanding beyond Android support (Kotlin/C++) with new interfaces for Apple ecosystems (Swift API) and the open web (JavaScript API).

Native Development with Swift

Expanding its state-of-the-art performance for Gemma models, LiteRT-LM now unlocks native Apple development with a fully open-source iOS Swift API.

Performance comparison of LiteRT-LM for iOS (Swift) vs. MLX, tested on iPhone 17 Pro.

High-Performance browser inference with WebGPU

We’re also bringing the power of LiteRT-LM to the browser. These production-proven inference pipelines are now fully accessible on the web (WASM) through our JavaScript API. Powered by WebGPU, LiteRT-LM delivers lightning-fast LLM routing and execution client-side, unlocking web applications that are serverless, secure, and completely privacy-preserving. Building upon the foundational success of the MediaPipe LLM Inference engine’s web solution, this native web support in LiteRT-LM represents the next evolution in our on-device AI stack.

LiteRT-LM web demo running on an Apple MacBook Pro M3 36GB with 18 GPU cores.

Our web solution offers significant performance gains over other web-based LLM frameworks.

Performance comparison of LiteRT-LM.js vs. ONNX Runtime Web, tested in Chrome on a MacBook Pro 2024 (Apple M4 Max) 48GB with 40 GPU cores.

Looking ahead

We are just scratching the surface of what is possible when you bring powerful LLM inference and true agentic skills to edge devices. LiteRT-LM eliminates the friction of managing memory, hardware acceleration, and cross-platform idiosyncrasies, letting you build the next generation of privacy-first, zero-latency applications.

We want you to try it. Download the LiteRT-LM CLI for desktop or AI Edge Gallery for mobile, or check out the code and APIs today. We’re excited to see what you build.

Acknowledgements

We’d like to extend a special thanks to our key contributors for their foundational work on this project: Advait Jain, Alice Zheng, Cormac Brick, Byungchul Kim, Fengwu Yao, Jae Yoo, Jenn Lee, Lu Wang, Marissa Ikonomidis, Matthew Chan, Matthew Soulanille, Matthias Grundmann, Mohammadreza Heydary, Ram Iyengar, Sachin Kotwani, Salil Tambe, Suleman Shahid, Tenghui Zhu, Tyler Mullen, Vinod Mamillapalli, Wai Hon Law, Weiyi Wang, Yi-Chun Kuo, Yu-hui Chen.

Explore this announcement and all Google I/O 2026 updates on io.google.