Google LiteRT NeuroPilot Stack Turns MediaTek Dimensity NPUs into First Class Targets for on Device LLMs

By Josh
December 11, 2025
in AI, Analytics and Automation

The new LiteRT NeuroPilot Accelerator from Google and MediaTek is a concrete step toward running real generative models on phones, laptops, and IoT hardware without shipping every request to a data center. It takes the existing LiteRT runtime and wires it directly into MediaTek’s NeuroPilot NPU stack, so developers can deploy LLMs and embedding models with a single API surface instead of per chip custom code.

What is LiteRT NeuroPilot Accelerator?

LiteRT is the successor to TensorFlow Lite. It is a high performance runtime that sits on device, runs models in .tflite FlatBuffer format, and can target CPU, GPU and now NPU backends through a unified hardware acceleration layer.

LiteRT NeuroPilot Accelerator is the new NPU path for MediaTek hardware. It replaces the older TFLite NeuroPilot delegate with a direct integration to the NeuroPilot compiler and runtime. Instead of treating the NPU as a thin delegate, LiteRT now uses a Compiled Model API that understands Ahead of Time (AOT) compilation and on device compilation, and exposes both through the same C++ and Kotlin APIs.

On the hardware side, the integration currently targets MediaTek Dimensity 7300, 8300, 9000, 9200, 9300 and 9400 SoCs, which together cover a large part of the Android mid range and flagship device space.

Why Developers Care, A Unified Workflow For Fragmented NPUs

Historically, on device ML stacks were CPU and GPU first. NPU SDKs shipped as vendor specific toolchains that required separate compilation flows per SoC, custom delegates, and manual runtime packaging. The result was a combinatorial explosion of binaries and a lot of device specific debugging.

LiteRT NeuroPilot Accelerator replaces that with a three step workflow that is the same regardless of which MediaTek NPU is present:

  • Convert or load a .tflite model as usual.
  • Optionally use the LiteRT Python tools to run AOT compilation and produce an AI Pack that is tied to one or more target SoCs.
  • Ship the AI Pack through Play for On-device AI (PODAI), then select Accelerator.NPU at runtime. LiteRT handles device targeting, runtime loading, and falls back to GPU or CPU if the NPU is not available.
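The fallback behavior in the last step can be sketched as plain selection logic. The `Accel` enum and `PickAccelerator` helper below are hypothetical illustrations of the NPU-to-GPU-to-CPU preference order, not the actual LiteRT API:

```cpp
#include <cassert>
#include <vector>

// Hypothetical accelerator tags, not the real LiteRT enum.
enum class Accel { NPU, GPU, CPU };

// Pick the first preferred accelerator the device reports as available,
// mirroring the NPU -> GPU -> CPU fallback order described above.
Accel PickAccelerator(const std::vector<Accel>& available) {
  const Accel preference[] = {Accel::NPU, Accel::GPU, Accel::CPU};
  for (Accel want : preference) {
    for (Accel have : available) {
      if (want == have) return want;
    }
  }
  return Accel::CPU;  // CPU is always a safe last resort
}
```

In the real stack this decision is made inside LiteRT when you request Accelerator.NPU, so application code never has to branch on the SoC.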

For you as an engineer, the main change is that device targeting logic moves into a structured configuration file and Play delivery, while the app code mostly interacts with CompiledModel and Accelerator.NPU.

AOT and on device compilation are both supported. AOT compiles for a known SoC ahead of time and is recommended for larger models because it removes the cost of compiling on the user device. On device compilation is better for small models and generic .tflite distribution, at the cost of higher first run latency. The blog shows that for a model such as Gemma-3-270M, pure on device compilation can take more than 1 minute, which makes AOT the realistic option for production LLM use.
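As a toy illustration of that tradeoff, a deployment pipeline might choose the compilation mode against a first-run latency budget. The function name and budget below are assumptions for illustration, not part of LiteRT:

```cpp
#include <cassert>
#include <string>

// Illustrative only: given an estimated on-device compile time and an
// acceptable first-run latency budget, pick a compilation strategy.
// The article's data point: Gemma-3-270M can take over 60 s on device.
std::string RecommendCompileMode(double on_device_compile_s,
                                 double first_run_budget_s) {
  // If compiling on the user's device would blow the first-run budget,
  // compile ahead of time and ship the result as an AI Pack instead.
  return on_device_compile_s > first_run_budget_s ? "AOT" : "on-device";
}
```

With a 60 s compile estimate and, say, a 5 s first-run budget, the helper lands on AOT, which matches the article's recommendation for production LLM use.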

Gemma, Qwen, And Embedding Models On MediaTek NPU

The stack is built around open weight models rather than a single proprietary NLU path. Google and MediaTek list explicit, production oriented support for:

  • Qwen3 0.6B, for text generation in markets such as mainland China.
  • Gemma-3-270M, a compact base model that is easy to fine tune for tasks like sentiment analysis and entity extraction.
  • Gemma-3-1B, a multilingual text only model for summarization and general reasoning.
  • Gemma-3n E2B, a multimodal model that handles text, audio and vision for things like real time translation and visual question answering.
  • EmbeddingGemma 300M, a text embedding model for retrieval augmented generation, semantic search and classification.

On the latest Dimensity 9500, running on a Vivo X300 Pro, the Gemma 3n E2B variant reaches more than 1600 tokens per second in prefill and 28 tokens per second in decode at a 4K context length when executed on the NPU.
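Plugging those figures into simple arithmetic gives a feel for end-to-end latency. This is back-of-envelope math on the quoted numbers, not a benchmark:

```cpp
#include <cassert>
#include <cmath>

// End-to-end generation time from prefill and decode throughput:
// prompt tokens are processed at the prefill rate, output tokens at
// the decode rate. Quoted Dimensity 9500 figures: ~1600 tok/s prefill,
// ~28 tok/s decode for Gemma-3n-E2B.
double EndToEndSeconds(int prompt_tokens, int output_tokens,
                       double prefill_tps, double decode_tps) {
  return prompt_tokens / prefill_tps + output_tokens / decode_tps;
}
```

For a full 4096 token prompt and 256 generated tokens, that works out to roughly 2.6 s of prefill plus 9.1 s of decode, about 11.7 s end to end, with decode dominating as usual for autoregressive generation.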

For text generation use cases, LiteRT-LM sits on top of LiteRT and exposes a stateful engine with a text-in, text-out API. A typical C++ flow is to create ModelAssets, build an Engine with litert::lm::Backend::NPU, then create a Session and call GenerateContent per conversation. For embedding workloads, EmbeddingGemma uses the lower level LiteRT CompiledModel API in a tensor-in, tensor-out configuration, again with the NPU selected through hardware accelerator options.

Developer Experience, C++ Pipeline And Zero Copy Buffers

LiteRT introduces a new C++ API that replaces the older C entry points and is designed around explicit Environment, Model, CompiledModel and TensorBuffer objects.

For MediaTek NPUs, this API integrates tightly with Android’s AHardwareBuffer and GPU buffers. You can construct input TensorBuffer instances directly from OpenGL or OpenCL buffers with TensorBuffer::CreateFromGlBuffer, which lets image processing code feed NPU inputs without an intermediate copy through CPU memory. This is important for real time camera and video processing where multiple copies per frame quickly saturate memory bandwidth.
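To see why zero copy matters, consider the memory traffic that redundant frame copies generate. The resolution, frame rate and copy counts below are illustrative assumptions, not measurements from the article:

```cpp
#include <cassert>
#include <cmath>

// Bytes moved per second for a camera stream, given a number of
// redundant copies per frame. Illustrates how intermediate CPU copies
// eat memory bandwidth before any inference happens.
double CopyBandwidthMBps(int width, int height, int bytes_per_pixel,
                         int fps, int copies_per_frame) {
  double frame_mb =
      double(width) * height * bytes_per_pixel / (1024.0 * 1024.0);
  return frame_mb * fps * copies_per_frame;
}
```

At 1080p RGBA (about 7.9 MB per frame), 30 fps and two extra copies per frame already move roughly 475 MB/s through CPU memory, which is traffic that TensorBuffer::CreateFromGlBuffer avoids entirely.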

A typical high level C++ path on device looks like this, omitting error handling for clarity:

// Create the LiteRT environment that owns accelerator state
auto env = Environment::Create({});

// Load a model compiled for the NPU
auto model = Model::CreateFromFile("model.tflite");

// Request the NPU; LiteRT falls back to GPU or CPU if unavailable
auto options = Options::Create();
options->SetHardwareAccelerators(kLiteRtHwAcceleratorNpu);

// Compile the model for the selected accelerator
auto compiled = CompiledModel::Create(*env, *model, *options);

// Allocate I/O tensor buffers and run inference
auto input_buffers = compiled->CreateInputBuffers();
auto output_buffers = compiled->CreateOutputBuffers();
input_buffers[0].Write<float>(input_span);
compiled->Run(input_buffers, output_buffers);
output_buffers[0].Read<float>(output_span);

The same Compiled Model API is used whether you are targeting CPU, GPU or the MediaTek NPU, which reduces the amount of conditional logic in application code.

Key Takeaways

  1. LiteRT NeuroPilot Accelerator is the new, first class NPU integration between LiteRT and MediaTek NeuroPilot, replacing the old TFLite delegate and exposing a unified Compiled Model API with AOT and on device compilation on supported Dimensity SoCs.
  2. The stack targets concrete open weight models, including Qwen3-0.6B, Gemma-3-270M, Gemma-3-1B, Gemma-3n-E2B and EmbeddingGemma-300M, and runs them through LiteRT and LiteRT-LM on MediaTek NPUs with a single accelerator abstraction.
  3. AOT compilation is strongly recommended for LLMs: Gemma-3-270M, for example, can take more than a minute to compile on device, so production deployments should compile once in the pipeline and ship AI Packs via Play for On-device AI.
  4. On a Dimensity 9500 class NPU, Gemma-3n-E2B can reach more than 1600 tokens per second in prefill and 28 tokens per second in decode at 4K context, with measured throughput up to 12 times CPU and 10 times GPU for LLM workloads.
  5. For developers, the C++ and Kotlin LiteRT APIs provide a common path to select Accelerator.NPU, manage compiled models and use zero copy tensor buffers, so CPU, GPU and MediaTek NPU targets can share one code path and one deployment workflow.



Michal Sutter is a data science professional with a Master of Science in Data Science from the University of Padova. With a solid foundation in statistical analysis, machine learning, and data engineering, Michal excels at transforming complex datasets into actionable insights.
