Unlocking Peak Performance on Qualcomm NPU with LiteRT

By Josh
November 25, 2025



Modern smartphones feature sophisticated SoCs (systems on a chip) composed of a CPU, GPU, and NPU, which can enable compelling on-device GenAI experiences that are significantly more interactive and real-time than their server-only counterparts. The GPU is the most ubiquitous accelerator for AI tasks, with GPU compute available on roughly 90% of all Android devices. However, relying on it alone can create performance bottlenecks, especially when building complex, interactive GenAI experiences. Consider the following setting: running a compute-intensive text-to-image generation model on-device while simultaneously processing the live camera feed with an ML-based segmentation model. Even the most powerful mobile GPU will struggle under this combined load, resulting in jarring frame drops and a broken user experience.

Video: Performance bottleneck with full GPU inference (left) vs. smooth user experience with NPU/GPU parallel processing (right). Captured on a Samsung Galaxy S25 Ultra powered by the Qualcomm Snapdragon 8 Elite.

This is where the NPU (Neural Processing Unit) comes in. It's a highly specialized processor that offers tens of TOPS (tera operations per second) of dedicated AI compute, far more than a modern mobile GPU can sustain. Crucially, it is significantly more power-efficient per TOP than both CPUs and GPUs, which is essential for battery-powered devices like mobile phones. The NPU is no longer a niche feature; it's a standard component, with over 80% of recent Qualcomm SoCs now including one. The NPU runs in parallel with the GPU and CPU, taking on the heavy AI processing. This concurrency frees the GPU to focus on rendering and the CPU on main-thread logic, unlocking the smooth, responsive, and fast performance that modern AI applications demand.
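
As a rough illustration, the sketch below pins a heavy GenAI model to the NPU while a segmentation model runs on the GPU, using the same CompiledModel API introduced in Step 3 below. The model file names and the coroutine scaffolding are illustrative assumptions, not LiteRT requirements.

import kotlinx.coroutines.coroutineScope
import kotlinx.coroutines.launch

// Sketch only: run two models concurrently on different accelerators.
// "text_to_image.tflite" and "segmentation.tflite" are hypothetical assets;
// CompiledModel and Accelerator are the LiteRT APIs shown in Step 3 below.
suspend fun runBothPipelines(context: android.content.Context) = coroutineScope {
    val genAiModel = CompiledModel.create(
        context.assets,
        "model/text_to_image.tflite",
        CompiledModel.Options(Accelerator.NPU)
    )
    val segmentationModel = CompiledModel.create(
        context.assets,
        "model/segmentation.tflite",
        CompiledModel.Options(Accelerator.GPU)
    )

    // Each inference loop gets its own coroutine, so the NPU crunches the
    // heavy GenAI workload while the GPU keeps headroom for rendering and
    // per-frame segmentation.
    launch { /* genAiModel.run(inputBuffers, outputBuffers) in a loop */ }
    launch { /* segmentationModel.run(frameInputs, frameOutputs) per camera frame */ }
}

Kotlin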

Figure: GPU/NPU parallel processing pipeline.

Introducing LiteRT Qualcomm AI Engine Direct Accelerator

We are thrilled to announce a significant leap forward in bringing this NPU power to LiteRT, Google's high-performance on-device ML framework: the LiteRT Qualcomm AI Engine Direct (QNN) Accelerator, developed in close collaboration with Qualcomm, which replaces the previous TFLite QNN delegate.

This update introduces two major advantages for developers:

  1. A unified and simplified mobile deployment workflow that frees Android app developers from the biggest complexities of NPU acceleration. You no longer need to:
  • Interact with low-level, vendor-specific SDKs: LiteRT integrates with SoC compilers and runtimes and exposes them through a unified, streamlined developer-facing API.
  • Target individual SoC versions: LiteRT abstracts away fragmentation across SoCs, providing a unified workflow to scale the deployment to multiple SoCs at the same time.

You can now deploy your model seamlessly across all supported devices, with either ahead-of-time (AOT) or on-device compilation. This makes integrating pre-trained .tflite models in production from sources like Qualcomm AI Hub easier than ever.

  2. State-of-the-art on-device performance. The accelerator supports an extensive range of LiteRT ops, enabling maximum NPU usage and full model delegation, a critical factor for securing the best performance. Furthermore, it is packed with the specialized kernels and optimizations required for sophisticated LLMs and GenAI models, achieving SOTA performance for models like Gemma and FastVLM.

Superior performance, real-world results

We benchmarked the new LiteRT QNN accelerator across 72 canonical ML models, encompassing vision, audio, and NLP domains. The results show a massive jump in raw performance: the NPU acceleration provides up to a 100x speedup over CPU and a 10x speedup over GPU. Our new accelerator enables this by supporting 90 LiteRT ops, allowing 64 of the 72 models to delegate fully to the NPU.

This speed translates to real interactive performance. On Qualcomm’s latest flagship SoC, the Snapdragon 8 Elite Gen 5, the performance benefit is substantial: over 56 models run in under 5ms with the NPU, while only 13 models achieve that on the CPU. This unlocks a host of live AI experiences that were previously unreachable.

Here is a selection of 20 representative models from the benchmark:

Figure: LiteRT NPU and GPU inference latency relative to CPU, measured on the Snapdragon 8 Elite Gen 5 powering the Xiaomi 17 Pro Max. Values are normalized to the CPU baseline (100%): the GPU reduces latency to roughly 5–70% of the baseline and the NPU to roughly 1–20%. For example, a model that takes 100 ms on the CPU would take about 5–70 ms on the GPU and 1–20 ms on the NPU.

Unlocking the full power of NPU for LLM inference

The LiteRT QNN Accelerator exhibits cutting-edge performance with sophisticated LLMs. To demonstrate this, we benchmarked the FastVLM-0.5B research model, a state-of-the-art vision-language model for on-device AI, using LiteRT for both AOT compilation and on-device NPU inference.

Table: FastVLM-0.5B on-device benchmark results.

The model is optimized with int8 weight quantization and int16 activation quantization, which is the key to unlocking the NPU's most powerful, high-speed int16 kernels. We also went beyond simple delegation and added specialized NPU kernels for performance-critical transformer layers to the LiteRT QNN Accelerator, particularly for the attention mechanism, ensuring these layers run efficiently.
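
For intuition, affine quantization maps each float value onto an integer grid via a scale and zero point, so int16 activations give the NPU 65,536 representable levels versus 256 for int8. The snippet below is a conceptual sketch of that arithmetic only; it is not a LiteRT API, and the actual quantization is performed offline by the conversion tooling.

// Conceptual illustration of affine quantization (not a LiteRT API).
// A float x maps to an integer q such that x ≈ scale * (q - zeroPoint).
fun quantize(x: Float, scale: Float, zeroPoint: Int, bits: Int): Int {
    val qMin = -(1 shl (bits - 1))      // e.g. -32768 for int16
    val qMax = (1 shl (bits - 1)) - 1   // e.g.  32767 for int16
    return (Math.round(x / scale) + zeroPoint).coerceIn(qMin, qMax)
}

fun dequantize(q: Int, scale: Float, zeroPoint: Int): Float =
    scale * (q - zeroPoint)

Kotlin

The same formula with an 8-bit range covers the int8 weights, which keeps weight storage compact while activations retain int16 precision.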

This delivers a level of performance that creates an AI experience rarely seen on mobile devices. Running on the Snapdragon 8 Elite Gen 5 NPU, our FastVLM integration delivers a time-to-first-token (TTFT) of just 0.12 seconds on high-resolution (1024×1024) images. It achieves over 11,000 tokens/sec for prefill and over 100 tokens/sec for decode. This extreme throughput is what makes a smooth, real-time, interactive experience possible. To showcase it, we built a live scene understanding demo that processes and describes the world around you.

Video: Scene understanding using the FastVLM vision modality, running on the Snapdragon 8 Elite Gen 5 in the Xiaomi 17 Pro Max.

Getting started in 3 steps

Here’s how simple it is to deploy a .tflite model on NPU across different Qualcomm SoC versions using the unified workflow with LiteRT. Pre-trained production-quality .tflite models can be downloaded from sources like Qualcomm AI Hub.

Step 1 (optional): AOT compilation for the target SoCs with LiteRT

While pre-compiling your .tflite model offline (AOT) is optional, we highly recommend it for large models where on-device compilation can result in longer initialization times and higher peak memory consumption.

You can compile for all supported SoCs or target specific SoC versions using LiteRT on the host in a few lines of Python code:

from ai_edge_litert.aot import aot_compile as aot_lib
from ai_edge_litert.aot.vendors.qualcomm import target as qnn_target

# --- Compile to all available SoCs ---
compiled_models = aot_lib.aot_compile(tflite_model_path)

# --- Or, compile to specific Qualcomm SoC versions ---
# Example: Targeting Qualcomm Snapdragon 8 Elite Gen5 Mobile Platform (SM8850)
sm8850_target = qnn_target.Target(qnn_target.SocModel.SM8850)
compiled_models = aot_lib.aot_compile(
    tflite_model_path,
    target=[sm8850_target]
)

Python

After compilation, export your compiled models for all target SoCs into a single Google Play AI Pack. You then upload this pack to Google Play, which uses Play for On-device AI (PODAI) to automatically deliver the correct compiled model to each user's device.

from ai_edge_litert.aot.ai_pack import export_lib as ai_pack_export

# --- Export the AI Pack ---
# This bundles model variants and metadata so Google Play can
# deliver the correct compiled model to the right device.
ai_pack_export.export(
    compiled_models,
    ai_pack_dir,
    ai_pack_name,
    litert_model_name
)

Python

See a full example in the LiteRT AOT compilation notebook.

Step 2: Deploy to the target SoCs with Google Play for On-device AI

Add your model to the Android app project. You have two distinct options depending on your chosen workflow:

  • For On-Device compilation: Copy the original .tflite model file directly into your app’s assets/ directory.
  • For AOT compilation: Copy the entire AI Pack from Step 1 into your project's root directory. You must then add this AI Pack to your Gradle configuration, as shown below:
// my_app/settings.gradle.kts
...
include(":ai_pack:my_model")

// my_app/app/build.gradle.kts
android {
  ...
  assetPacks.add(":ai_pack:my_model")
}

Kotlin

Next, run the script to fetch the QNN libraries. This downloads the NPU runtime (needed for both AOT and on-device compilation) and the compiler library (needed only for on-device compilation).

# Download and unpack NPU runtime libraries to the root directory. 
# For AOT compilation, download litert_npu_runtime_libraries.zip.
# For on-device compilation, download litert_npu_runtime_libraries_jit.zip. 
$ ./litert_npu_runtime_libraries/fetch_qualcomm_library.sh

Shell

Add the NPU runtime libraries as feature modules to the Gradle configuration:

// my_app/settings.gradle.kts
include(":litert_npu_runtime_libraries:runtime_strings")
include(":litert_npu_runtime_libraries:qualcomm_runtime_v79")
...

// my_app/app/build.gradle.kts
android {
  dynamicFeatures.add(":litert_npu_runtime_libraries:qualcomm_runtime_v79")
  ...
}

dependencies {
  // Strings for NPU runtime libraries
  implementation(project(":litert_npu_runtime_libraries:runtime_strings"))
  ...
}

Kotlin

For a complete guide on configuring your app for Play for On-device AI, please refer to this tutorial.

Step 3: Inference on NPU using LiteRT Runtime API

LiteRT abstracts away the complexity of developing against specific SoC versions, letting you run your model on the NPU with just a few lines of code. It also provides a robust, built-in fallback mechanism: you can specify CPU, GPU, or both as fallback options, and LiteRT will automatically use them if the NPU is unavailable. Conveniently, AOT compilation also supports fallback through partial delegation: any subgraphs the NPU doesn't support seamlessly run on the CPU or GPU as specified.

// 1. Load model and initialize runtime. 
// If the NPU is unavailable, inference will fall back to the GPU.
val model =
    CompiledModel.create(
        context.assets,
        "model/mymodel.tflite",
        CompiledModel.Options(Accelerator.NPU, Accelerator.GPU)
    )

// 2. Pre-allocate input/output buffers
val inputBuffers = model.createInputBuffers()
val outputBuffers = model.createOutputBuffers()

// 3. Fill the first input
inputBuffers[0].writeFloat(...)

// 4. Invoke
model.run(inputBuffers, outputBuffers)

// 5. Read the output
val outputFloatArray = outputBuffers[0].readFloat()

Kotlin

Check out our image segmentation sample app to see how to use all of these features.

What’s next

The new LiteRT Qualcomm AI Engine Direct (QNN) Accelerator is a major achievement for LiteRT, closing the gap between raw hardware potential and real-world application performance. We’re incredibly excited to see what you build with this power.

We encourage you to explore our LiteRT DevSite and our LiteRT GitHub repository. Happy building!

Acknowledgements

Special thanks to the Google ODML team and the Qualcomm team for their significant contributions to this effort:

Google ODML team: Alice Zheng, Advait Jain, Andrew Zhang, Arian Arfaian, Chintan Parikh, Chunlei Niu, Cormac Brick, Gerardo Carranza, Gregory Karpiak, Jingjiang Li, Jing Jin, Julius Kammerl, Lu Wang, Luke Boyer, Marissa Ikonomidis, Maria Lyubimtseva, Matt Kreileder, Matthias Grundmann, Na Li, Ping Yu, Quentin Khan, Rishika Sinha, Sachin Kotwani, Steven Toribio, Teng-Hui Zhu, Terry (Woncheol) Heoi, Vitalii Dziuba, Weiyi Wang, Yu-Hui Chen, Zichuan We

Qualcomm LiteRT team: Alen Huang, Bastiaan Aarts, Brett Taylor, Chun-Hsueh Lee (Jack), Chun-Po Chang (Jerry), Chun-Ting Lin (Graham), Felix Baum, Jiun-Kai Yang (Kelvin), Krishna Sridhar, Ming-Che Lin (Vincent), William Lin


