• About Us
  • Disclaimer
  • Contact Us
  • Privacy Policy
Monday, June 8, 2026
mGrowTech
No Result
View All Result
  • Technology And Software
    • Account Based Marketing
    • Channel Marketing
    • Marketing Automation
      • Al, Analytics and Automation
      • Ad Management
  • Digital Marketing
    • Social Media Management
    • Google Marketing
  • Direct Marketing
    • Brand Management
    • Marketing Attribution and Consulting
  • Mobile Marketing
  • Event Management
  • PR Solutions
  • Technology And Software
    • Account Based Marketing
    • Channel Marketing
    • Marketing Automation
      • Al, Analytics and Automation
      • Ad Management
  • Digital Marketing
    • Social Media Management
    • Google Marketing
  • Direct Marketing
    • Brand Management
    • Marketing Attribution and Consulting
  • Mobile Marketing
  • Event Management
  • PR Solutions
No Result
View All Result
mGrowTech
No Result
View All Result
Home Al, Analytics and Automation

Apple Released FastVLM: A Novel Hybrid Vision Encoder which is 85x Faster and 3.4x Smaller than Comparable Sized Vision Language Models (VLMs)

Josh by Josh
September 2, 2025
in Al, Analytics and Automation
0
Apple Released FastVLM: A Novel Hybrid Vision Encoder which is 85x Faster and 3.4x Smaller than Comparable Sized Vision Language Models (VLMs)


Introduction

Vision Language Models (VLMs) allow both text inputs and visual understanding. However, image resolution is crucial for VLM performance for processing text and chart-rich data. Increasing image resolution creates significant challenges. First, pretrained vision encoders often struggle with high-resolution images due to inefficient pretraining requirements. Running inference on high-resolution images increases computational costs and latency during visual token generation, whether through single high-resolution processing or multiple lower-resolution tile strategies. Second, high-resolution images produce more tokens, which leads to an increase in LLM prefilling time and time-to-first-token (TTFT), which is the sum of the vision encoder latency and the LLM prefilling time.

Existing VLM Architectures

Large multimodal models such as Frozen and Florence used cross-attention to combine image and text embeddings within the intermediate LLM layers. Auto-regressive architectures like LLaVA, mPLUG-Owl, MiniGPT-4, and Cambrian-1 are effective. For efficient image encoding, CLIP-pretrained vision transformers remain widely adopted, with variants like SigLIP, EVA-CLIP, InternViT, and DFNCLIP. Methods like LLaVA-PruMerge and Matryoshka-based token sampling attempt dynamic token pruning, while hierarchical backbones such as ConvNeXT and FastViT reduce token count through progressive downsampling.  Recently, ConvLLaVA was introduced, which uses a pure-convolutional vision encoder to encode images for a VLM. 

Apple’s FastVLM

Researchers from Apple have proposed FastVLM, a model that achieves an optimized tradeoff between resolution, latency, and accuracy by analyzing how image quality, processing time, number of tokens, and LLM size affect each other. It utilizes FastViTHD, a hybrid vision encoder designed to output fewer tokens and reduce encoding time for high-resolution images. FastVLM achieves an optimal balance between visual token count and image resolution only by scaling the input image. It shows a 3.2 times improvement in TTFT in the LLaVA1.5 setup and achieves superior performance on key benchmarks using the same 0.5B LLM when compared to LLaVA-OneVision at maximum resolution. It delivers 85 times faster TTFT while using a 3.4 times smaller vision encoder.

All FastVLM models are trained on a single node with 8 times NVIDIA H100-80GB GPUs, where stage 1 training of VLM is fast, taking around 30 minutes to train with a Qwen2-7B decoder. Further, FastViTHD enhances the base FastViT architecture by introducing an additional stage with a downsampling layer. This ensures self-attention operates on tensors downsampled by a factor of 32 rather than 16, reducing image encoding latency while generating 4 times fewer tokens for the LLM decoder. The FastViTHD architecture contains five stages: the first three stages utilize RepMixer blocks for efficient processing, while the final two stages employ multi-headed self-attention blocks, creating an optimal balance between computational efficiency and high-resolution image understanding.

Benchmark Comparisons

When compared with ConvLLaVA using the same LLM and similar training data, FastVLM achieves 8.4% better performance on TextVQA and 12.5% improvement on DocVQA while operating 22% faster. The performance advantage increases at higher resolutions, where FastVLM maintains 2× faster processing speeds than ConvLLaVA across various benchmarks. FastVLM matches or surpasses MM1 performance across diverse benchmarks by using intermediate pretraining with 15M samples for resolution scaling, while generating 5 times fewer visual tokens. Moreover, FastVLM not only outperforms Cambrian-1 but also runs 7.9 times faster. With scaled instruction tuning, it delivers better results while using 2.3 times fewer visual tokens.

Conclusion

In conclusion, researchers introduced FastVLM, an advancement in VLM by utilizing the FastViTHD vision backbone for efficient high-resolution image encoding. The hybrid architecture, pretrained on reinforced image-text data, reduces visual token output while maintaining minimal accuracy sacrifice compared to existing approaches. FastVLM achieves competitive performance across VLM benchmarks while delivering notable efficiency improvements in both TTFT and vision backbone parameter count. Rigorous benchmarking on M1 MacBook Pro hardware shows that FastVLM offers a state-of-the-art resolution-latency-accuracy trade-off superior to the current methods.


Check out the Paper and Model on Hugging Face. All credit for this research goes to the researchers of this project. Also, feel free to follow us on Twitter and don’t forget to join our 100k+ ML SubReddit and Subscribe to our Newsletter.


Sajjad Ansari is a final year undergraduate from IIT Kharagpur. As a Tech enthusiast, he delves into the practical applications of AI with a focus on understanding the impact of AI technologies and their real-world implications. He aims to articulate complex AI concepts in a clear and accessible manner.



Source_link

READ ALSO

Microsoft AI Introduces MAI-Transcribe-1.5: 2.4% WER on Artificial Analysis, Best-in-Class FLEURS Accuracy, and Up to 5x Faster Long-Audio Transcription

Building Reflective Prompt Optimization with GEPA: Multi-Component Prompts, Structured Feedback, and Held-Out Validation

Related Posts

Microsoft AI Introduces MAI-Transcribe-1.5: 2.4% WER on Artificial Analysis, Best-in-Class FLEURS Accuracy, and Up to 5x Faster Long-Audio Transcription
Al, Analytics and Automation

Microsoft AI Introduces MAI-Transcribe-1.5: 2.4% WER on Artificial Analysis, Best-in-Class FLEURS Accuracy, and Up to 5x Faster Long-Audio Transcription

June 8, 2026
Building Reflective Prompt Optimization with GEPA: Multi-Component Prompts, Structured Feedback, and Held-Out Validation
Al, Analytics and Automation

Building Reflective Prompt Optimization with GEPA: Multi-Component Prompts, Structured Feedback, and Held-Out Validation

June 7, 2026
Best 21 Low-Code and No-Code AI Tools in 2026
Al, Analytics and Automation

Best 21 Low-Code and No-Code AI Tools in 2026

June 7, 2026
Tod Machover receives George Peabody Medal for contributions to music and technology | MIT News
Al, Analytics and Automation

Tod Machover receives George Peabody Medal for contributions to music and technology | MIT News

June 6, 2026
Moonshot AI Releases Kimi Code CLI: A Terminal AI Coding Agent Built in TypeScript for Next-Gen Agents
Al, Analytics and Automation

Moonshot AI Releases Kimi Code CLI: A Terminal AI Coding Agent Built in TypeScript for Next-Gen Agents

June 6, 2026
The crucial human component in computing and AI | MIT News
Al, Analytics and Automation

The crucial human component in computing and AI | MIT News

June 6, 2026
Next Post

Cutting the fluff, not the impact, in your comms strategy

POPULAR NEWS

Trump ends trade talks with Canada over a digital services tax

Trump ends trade talks with Canada over a digital services tax

June 28, 2025
15 Trending Songs on TikTok in 2025 (+ How to Use Them)

15 Trending Songs on TikTok in 2025 (+ How to Use Them)

June 18, 2025
Communication Effectiveness Skills For Business Leaders

Communication Effectiveness Skills For Business Leaders

June 10, 2025
App Development Cost in Singapore: Pricing Breakdown & Insights

App Development Cost in Singapore: Pricing Breakdown & Insights

June 22, 2025
Comparing the Top 7 Large Language Models LLMs/Systems for Coding in 2025

Comparing the Top 7 Large Language Models LLMs/Systems for Coding in 2025

November 4, 2025

EDITOR'S PICK

Become a Social Media Creator in 2026: Your 9-Step Plan

Become a Social Media Creator in 2026: Your 9-Step Plan

December 17, 2025
AANHPI Heritage Month comes to life across Google

AANHPI Heritage Month comes to life across Google

May 10, 2026

Indonesian Coffees Recognized in French Gourmet Competition

March 23, 2025
Why Text Messaging Must Be Core to Nonprofit Fundraising in 2026

Why Text Messaging Must Be Core to Nonprofit Fundraising in 2026

January 23, 2026

About

We bring you the best Premium WordPress Themes that perfect for news, magazine, personal blog, etc. Check our landing page for details.

Follow us

Categories

  • Account Based Marketing
  • Ad Management
  • Al, Analytics and Automation
  • Brand Management
  • Channel Marketing
  • Digital Marketing
  • Direct Marketing
  • Event Management
  • Google Marketing
  • Marketing Attribution and Consulting
  • Marketing Automation
  • Mobile Marketing
  • PR Solutions
  • Social Media Management
  • Technology And Software
  • Uncategorized

Recent Posts

  • Google will sell you a soccer ball just in time for the World Cup final
  • We Built a Custom Workflow with the Buffer API — and Tripled Our X Impressions
  • Apple Reintroduces The AI-Powered Siri It Announced At WWDC 2024
  • How to Level-up From SEO Tactician to Search Visibility Leader
  • About Us
  • Disclaimer
  • Contact Us
  • Privacy Policy
No Result
View All Result
  • Technology And Software
    • Account Based Marketing
    • Channel Marketing
    • Marketing Automation
      • Al, Analytics and Automation
      • Ad Management
  • Digital Marketing
    • Social Media Management
    • Google Marketing
  • Direct Marketing
    • Brand Management
    • Marketing Attribution and Consulting
  • Mobile Marketing
  • Event Management
  • PR Solutions