• About Us
  • Disclaimer
  • Contact Us
  • Privacy Policy
Thursday, June 4, 2026
mGrowTech
No Result
View All Result
  • Technology And Software
    • Account Based Marketing
    • Channel Marketing
    • Marketing Automation
      • Al, Analytics and Automation
      • Ad Management
  • Digital Marketing
    • Social Media Management
    • Google Marketing
  • Direct Marketing
    • Brand Management
    • Marketing Attribution and Consulting
  • Mobile Marketing
  • Event Management
  • PR Solutions
  • Technology And Software
    • Account Based Marketing
    • Channel Marketing
    • Marketing Automation
      • Al, Analytics and Automation
      • Ad Management
  • Digital Marketing
    • Social Media Management
    • Google Marketing
  • Direct Marketing
    • Brand Management
    • Marketing Attribution and Consulting
  • Mobile Marketing
  • Event Management
  • PR Solutions
No Result
View All Result
mGrowTech
No Result
View All Result
Home Google Marketing

Gemma 4 12B: The Developer Guide

Josh by Josh
June 4, 2026
in Google Marketing
0
Gemma 4 12B: The Developer Guide


Following the announcement in our launch blog, we are releasing Gemma 4 12B, a dense multimodal model with a unified, encoder-free architecture.

Gemma 4 12B introduces several milestones for local AI:

  1. A multimodal encoder-free architecture: Bypassing heavy multi-stage vision and audio encoders entirely, multimodal data is fed straight into the LLM backbone, reducing multimodal latency.
  2. Our first medium-sized model with audio input: In the Gemma family, audio inputs were restricted to small, lightweight edge architectures (e.g. E4B). Gemma 4 12B is the first medium-sized model capable of natively ingesting audio.
  3. Developer-friendly size: Small enough to run locally on dedicated GPU laptops with 16GB VRAM or unified memory. To maximize local inference speeds, we are additionally releasing a dedicated multi-token prediction (MTP) model.
  4. New MacOS desktop experience: For the first time, we are releasing downloadable macOS desktop applications, letting developers experience fully local spoken and visual interaction directly on consumer-grade devices.

The Architecture

Traditional multimodal models rely on frozen, separate vision encoders (e.g., Gemma 4 uses a 150M parameter vision model for edge sizes and 550M for medium-sized models) and audio encoders (300M parameters for Gemma 4 E2B and E4B). Processing multimodal inputs with multiple separate encoders before feeding them to the LLM leads to increased latency and fragmented memory footprints.

Gemma 4 12B solves these issues by utilizing a single decoder-only transformer containing the same advanced decoder structure as the Gemma 4 31B Dense model.

  • Vision embedder (35M parameters): Replaces the 27 vision transformer layers of the other medium-sized Gemma 4 models. Raw 48×48 pixel patches are projected to the LLM hidden dimension with a single matmul. A factorized coordinate lookup (X and Y matrices) attaches spatial location information directly to the input.
  • Audio wave projection: Eliminates the separate audio encoder (skipping the 12 conformer layers used in Gemma 4 E2B and E4B). Raw 16 kHz audio signals are sliced into 40ms frames (640 floats each) and projected linearly to the LLM input space.
  • Unified fine-tuning advantage: Because vision, audio, and text inputs share the exact same weights, you no longer have to co-tune separate frozen encoders. Downstream adapter (e.g. LoRA) or full tuning naturally update the entire multimodal token loop in a single pass (via Hugging Face or Unsloth).

For a more in-depth overview of how this encoder-free architecture works, check out A Visual Guide to Gemma 4 12B.

Capabilities

Gemma 4 12B achieves outstanding performance, with capabilities such as automatic speech recognition, agentic reasoning, diarization, video understanding, coding, and more.

See below examples for a demonstration of the model’s agentic and multimodal capabilities:

Example 1: Gemma 4 12B creates a local image processing app that uses Gemma 4 12B

Thanks to its agentic and multimodal understanding capabilities, Gemma 4 12B can be easily used with existing agent harnesses like OpenCode. In this example, we served it locally using llama.cpp using the gemma-skills to code a Gradio app that helped the user process images. This app was powered by the same Gemma 4 12B model that built it!

Example 2: Processing 5 minutes of Video at 1 FPS with audio

We used Gemma 4 12B to analyse a part from the Google IO Keynote moment from May 19, specifically the 5 minutes between 00:15:32 and 00:20:45. To do that, we extracted all the frames of that segment (at 1 FPS), plus the prompt and the audio from the video:

Prompt:

  1. 313 frames (at 1FPS, images resized to visual token budget 70)
  2. "what happens when the man takes a selfie?"
  3. the Video’s audio + the prompt below

When the man takes a “selfie” or is shown holding a smartphone in front of his face in these demonstration videos, it is a clever way to visually represent how the AI models can take existing media—like a personal photo or a video clip—and “reimagine” it. In these specific clips, the model is taking a selfie and using it as a base to generate various scenarios (like a person in a space station, or walking through a forest).

Essentially, the man is not actually taking a selfie; rather, he is **acting out a visual metaphor for the AI’s capability to take one specific input (a “selfie”) and generate a whole world of new content based on it.** This is part of the “Swap” and “Build worlds” demonstrations of the Gemini Omni model, showing its ability to perform complex, multi-modal reasoning and creative generation.

On-Device & Desktop Serving: Powered by LiteRT-LM

In tandem with the Gemma 4 12B launch, we are officially introducing powerful on-device developer integrations powered by LiteRT-LM, bringing zero-latency local AI execution natively to standard desktop environments:

1.Native MacOS Apps: The mobile Google AI Edge Gallery is officially expanding to desktop platforms, running Gemma 4 12B offline, natively on Apple Silicon GPUs. It comes with a secure sandboxed Python execution loop to write, execute, and plot scientific charts inside the chat bubble. In parallel, the Google AI Edge Eloquent app on Mac launches support for Gemma 12B to power Voice Edit conversational inputs.

2. Drop-in Local API Servers (litert-lm serve): Run Gemma 4 12B as a local, OpenAI-compatible API server using the new litert-lm serve CLI command. Seamlessly connect standard integrations (e.g., Continue, Aider, OpenClaw, Hermes or OpenCode), leveraging stateless prefix caching in memory to match context history and instantly bypass prefill latency.

litert-lm import --from-huggingface-repo=litert-community/gemma-4-12B-it-litert-lm  gemma-4-12B-it.litertlm gemma4-12b

# Start the OpenAI-compatible server
litert-lm serve

Shell

Find a deep dive about it on the Google AI Edge Gallery blog.

Getting Started Today

Ready to build local multimodal agents with the first encoder-free architecture of the Gemma family? Here is how you can jump in today

  • Try it yourself: Experiment with a couple of clicks in LM Studio, Ollama, Google AI Edge Gallery App, the Google AI Edge Eloquent app and the LiteRT-LM CLI
  • Download the weights: Download the pre-trained and instruction-tuned checkpoints directly from Hugging Face and Kaggle.
  • Integrate & learn: Review the developer documentation and the quick start notebook.
  • Use your favorite development tools: Implement local inference pipelines with Hugging Face Transformers, llama.cpp, MLX, SGLang, and vLLM, or fine-tune with efficiency using Unsloth.
  • Unlock Agentic Development with Gemma Skills: To support agents to build with the latest Gemma advancements, we are releasing our official Skills Repository. This is a library of skills designed specifically to enable agents to build with Gemma models.
  • Deploy your way: Spin up endpoints in production using Google Cloud. Deploy your way through Gemini Enterprise Agent Platform Model Garden, Cloud Run and GKE.



Source_link

READ ALSO

YouTube gains MRC brand safety approval on Shorts

Google will replenish more water than it uses at data centers

Related Posts

YouTube gains MRC brand safety approval on Shorts
Google Marketing

YouTube gains MRC brand safety approval on Shorts

June 4, 2026
Google will replenish more water than it uses at data centers
Google Marketing

Google will replenish more water than it uses at data centers

June 3, 2026
AI has a water problem — Google thinks it has a fix
Google Marketing

AI has a water problem — Google thinks it has a fix

June 3, 2026
Google announces water stewardship commitments and initiatives
Google Marketing

Google announces water stewardship commitments and initiatives

June 3, 2026
Google and Voltus sign agreement for smart energy capacity
Google Marketing

Google and Voltus sign agreement for smart energy capacity

June 3, 2026
Google’s Phone app will tell you if a scammer is impersonating one of your contacts
Google Marketing

Google’s Phone app will tell you if a scammer is impersonating one of your contacts

June 3, 2026
Next Post
5 Features To Look for in Top Recruitment Marketing Platforms

5 Features To Look for in Top Recruitment Marketing Platforms

POPULAR NEWS

Trump ends trade talks with Canada over a digital services tax

Trump ends trade talks with Canada over a digital services tax

June 28, 2025
15 Trending Songs on TikTok in 2025 (+ How to Use Them)

15 Trending Songs on TikTok in 2025 (+ How to Use Them)

June 18, 2025
Communication Effectiveness Skills For Business Leaders

Communication Effectiveness Skills For Business Leaders

June 10, 2025
App Development Cost in Singapore: Pricing Breakdown & Insights

App Development Cost in Singapore: Pricing Breakdown & Insights

June 22, 2025
Comparing the Top 7 Large Language Models LLMs/Systems for Coding in 2025

Comparing the Top 7 Large Language Models LLMs/Systems for Coding in 2025

November 4, 2025

EDITOR'S PICK

A history of 10 generations of Pixel

A history of 10 generations of Pixel

August 21, 2025
Android XR on Sphere in Las Vegas for CES 2026

Android XR on Sphere in Las Vegas for CES 2026

January 5, 2026
AI Overviews’ Impact on Search in 2025

AI Overviews’ Impact on Search in 2025

December 15, 2025
Top 20+ Fitness Business Ideas in 2026

Top 20+ Fitness Business Ideas in 2026

May 1, 2026

About

We bring you the best Premium WordPress Themes that perfect for news, magazine, personal blog, etc. Check our landing page for details.

Follow us

Categories

  • Account Based Marketing
  • Ad Management
  • Al, Analytics and Automation
  • Brand Management
  • Channel Marketing
  • Digital Marketing
  • Direct Marketing
  • Event Management
  • Google Marketing
  • Marketing Attribution and Consulting
  • Marketing Automation
  • Mobile Marketing
  • PR Solutions
  • Social Media Management
  • Technology And Software
  • Uncategorized

Recent Posts

  • How to prepare for more declines in traditional Google search traffic
  • What are the latest Hootsuite product features? [May 2026]
  • Google's new open source Gemma 4 12B analyzes audio, video — and runs entirely locally on a typical 16GB enterprise laptop
  • Teaching AI agents to ask better questions by playing “Battleship” | MIT News
  • About Us
  • Disclaimer
  • Contact Us
  • Privacy Policy
No Result
View All Result
  • Technology And Software
    • Account Based Marketing
    • Channel Marketing
    • Marketing Automation
      • Al, Analytics and Automation
      • Ad Management
  • Digital Marketing
    • Social Media Management
    • Google Marketing
  • Direct Marketing
    • Brand Management
    • Marketing Attribution and Consulting
  • Mobile Marketing
  • Event Management
  • PR Solutions