mGrowTech

Jina AI Releases Jina-VLM: A 2.4B Multilingual Vision Language Model Focused on Token Efficient Visual QA

By Josh
December 9, 2025
in AI, Analytics and Automation


Jina AI has released Jina-VLM, a 2.4B parameter vision language model that targets multilingual visual question answering and document understanding on constrained hardware. The model couples a SigLIP2 vision encoder with a Qwen3 language backbone and uses an attention pooling connector to reduce visual tokens while preserving spatial structure. Among open 2B scale VLMs, it reaches state of the art results on multilingual benchmarks such as MMMB and Multilingual MMBench.

https://arxiv.org/pdf/2512.04032

Architecture, overlapping tiles with attention pooling connector

Jina-VLM keeps the standard VLM layout, but optimizes the vision side for arbitrary resolution and low token count. The vision encoder is SigLIP2 So400M/14 384, a 27 layer Vision Transformer with about 400M parameters. It processes 378×378 pixel crops into a 27×27 grid of 14×14 patches, so each tile produces 729 patch tokens.

To handle high resolution images, the model does not resize the full input to a single square. Instead, it constructs a grid of up to 12 overlapping tiles along with a global thumbnail. Each tile is a 378×378 crop, adjacent tiles overlap by 112 pixels, and the stride between tile origins is 266 pixels. A 4×3 grid covers an effective resolution of 1176×910 pixels; larger images are downscaled to fit inside the tile budget.
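
The tile arithmetic above can be checked in a few lines. This is a minimal sketch that uses only the numbers quoted in the article (378 pixel tiles, 112 pixel overlap, 14 pixel patches); the function names are illustrative and are not taken from the Jina-VLM code.

```python
TILE = 378     # tile side in pixels
OVERLAP = 112  # overlap between adjacent tiles in pixels
PATCH = 14     # ViT patch side in pixels


def stride(tile: int = TILE, overlap: int = OVERLAP) -> int:
    """Distance between the origins of two adjacent tiles."""
    return tile - overlap


def coverage(n_tiles: int, tile: int = TILE, overlap: int = OVERLAP) -> int:
    """Pixels spanned by n_tiles overlapping tiles along one axis."""
    return tile + (n_tiles - 1) * stride(tile, overlap)


def tile_origins(n_tiles: int, tile: int = TILE, overlap: int = OVERLAP) -> list:
    """Left (or top) pixel coordinate of each tile along one axis."""
    return [i * stride(tile, overlap) for i in range(n_tiles)]


patches_per_side = TILE // PATCH

print(stride())                                # 266
print(coverage(4), coverage(3))                # 1176 910
print(patches_per_side, patches_per_side ** 2) # 27 729
print(tile_origins(4))                         # [0, 266, 532, 798]
```

The 4×3 grid therefore spans 378 + 3 × 266 = 1176 pixels in one direction and 378 + 2 × 266 = 910 pixels in the other, which matches the effective resolution quoted above.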

The core design is the vision language connector. Rather than using the final ViT layer, Jina-VLM concatenates features from two intermediate layers, the third from last and the ninth from last, which correspond to layers 24 and 18 of the 27 layer encoder. This combines high level semantics with mid level spatial detail. The connector then applies attention pooling over 2×2 patch neighborhoods. It computes a mean pooled query for each 2×2 region, attends over the full concatenated feature map, and outputs a single pooled token per neighborhood. This reduces 729 visual tokens per tile to 182 tokens, which is roughly a 4 times compression. A SwiGLU projection maps the pooled features to the Qwen3 embedding dimension.
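
To make the pooling step concrete, here is a minimal pure Python sketch of attention pooling over 2×2 neighborhoods. It is an illustration under stated assumptions, not the model's implementation: it leaves out the learned query, key, value and SwiGLU projections, it restricts keys and values to the four patches of each neighborhood (one reading of the description above), and it requires an even grid. How the real 27×27 grid is handled at its edges to arrive at 182 tokens is not described in the article and is not reproduced here.

```python
import math


def attention_pool_2x2(grid):
    """Pool an H x W grid of feature vectors into (H/2) x (W/2) vectors.

    grid: list of rows, each row a list of feature vectors (lists of floats).
    Assumption: H and W are even, and no learned projections are applied.
    """
    h, w = len(grid), len(grid[0])
    if h % 2 or w % 2:
        raise ValueError("this sketch assumes an even patch grid")
    dim = len(grid[0][0])
    pooled = []
    for i in range(0, h, 2):
        row = []
        for j in range(0, w, 2):
            # The four patch features of this 2x2 neighborhood.
            window = [grid[i + a][j + b] for a in (0, 1) for b in (0, 1)]
            # Mean pooled query for the neighborhood.
            query = [sum(v[k] for v in window) / 4 for k in range(dim)]
            # Scaled dot product scores, then a numerically stable softmax.
            scores = [
                sum(q * x for q, x in zip(query, v)) / math.sqrt(dim)
                for v in window
            ]
            top = max(scores)
            weights = [math.exp(s - top) for s in scores]
            total = sum(weights)
            # Attention weighted sum gives one pooled token per neighborhood.
            row.append([
                sum(wt * v[k] for wt, v in zip(weights, window)) / total
                for k in range(dim)
            ])
        pooled.append(row)
    return pooled


# Toy 4x4 grid of 2 dimensional features, pooled down to 2x2.
toy = [[[float(r), float(c)] for c in range(4)] for r in range(4)]
out = attention_pool_2x2(toy)
print(len(out), len(out[0]), len(out[0][0]))  # 2 2 2
```

In the model, each feature vector would be the concatenation of the two intermediate ViT layers, so the pooled token carries both the semantic and the spatial signal.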

With the default 12 tile configuration plus thumbnail, a naive connector would feed 9,477 visual tokens into the language model. Attention pooling cuts this to 2,366 visual tokens. The ViT compute does not change, but for the language backbone this yields about 3.9 times fewer prefill FLOPs and 4 times smaller KV cache. When including the shared ViT cost, the overall FLOPs drop by about 2.3 times for the default setting.
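
The token counts in this paragraph follow directly from the per tile figures. The sketch below recomputes them from the numbers in the article (12 tiles, 1 thumbnail, 729 raw and 182 pooled tokens per tile); the constant and function names are illustrative. It assumes the thumbnail is tokenized like a regular tile, which is what the quoted totals imply.

```python
TILES = 12
THUMBNAILS = 1
RAW_TOKENS_PER_TILE = 27 * 27   # 729 patch tokens before pooling
POOLED_TOKENS_PER_TILE = 182    # figure quoted in the article


def visual_tokens(tokens_per_tile: int, tiles: int = TILES,
                  thumbnails: int = THUMBNAILS) -> int:
    """Total visual tokens passed to the language model for one image."""
    return (tiles + thumbnails) * tokens_per_tile


raw = visual_tokens(RAW_TOKENS_PER_TILE)
pooled = visual_tokens(POOLED_TOKENS_PER_TILE)

print(raw)                     # 9477
print(pooled)                  # 2366
print(f"{raw / pooled:.2f}")   # 4.01
```

Because the KV cache grows linearly with the number of cached tokens, the visual part of the cache shrinks by the same factor of about 4. The 3.9 times and 2.3 times FLOP figures are taken from the article and are not rederived here.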

The language decoder is Qwen3-1.7B-Base. The model introduces special tokens for images, with <im_start> and <im_end> around the tile sequence and <im_col> to mark rows in the patch grid. Visual tokens from the connector and text embeddings are concatenated and passed to Qwen3 to generate answers.
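
One way to picture the resulting input is as a flat token sequence. The sketch below is only a guess at the layout: the article names `<im_start>`, `<im_end>` and `<im_col>`, but the `<im_patch>` placeholder, the choice to put `<im_col>` at the end of each row, and the function name are assumptions made for illustration.

```python
IM_START, IM_END, IM_COL = "<im_start>", "<im_end>", "<im_col>"
IM_PATCH = "<im_patch>"  # hypothetical placeholder for one pooled visual token


def layout_image(n_tiles: int, rows: int, cols: int) -> list:
    """Build an assumed token layout for n_tiles tiles of rows x cols tokens."""
    tokens = [IM_START]
    for _ in range(n_tiles):
        for _ in range(rows):
            tokens.extend([IM_PATCH] * cols)
            tokens.append(IM_COL)  # assumed: one marker closes each row
    tokens.append(IM_END)
    return tokens


seq = layout_image(n_tiles=1, rows=2, cols=3)
print(" ".join(seq))
print(len(seq))  # 10
```

In the real model the placeholders would be replaced by the connector's output embeddings before the sequence is concatenated with the text embeddings.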

Training pipeline and multilingual data mix

Training proceeds in 2 stages. All components, encoder, connector and decoder, are updated jointly, without freezing. The full corpus contains about 5M multimodal samples and 12B text tokens across more than 30 languages. Roughly half of the text is English, and the rest covers high and mid resource languages such as Chinese, Arabic, German, Spanish, French, Italian, Japanese and Korean.

Stage 1 is alignment training. The goal is cross language visual grounding, not instruction following. The team uses the caption heavy datasets PixmoCap and PangeaIns, which span natural images, documents, diagrams and infographics. They add 15 percent text only data from the PleIAs Common Corpus to limit degradation on pure language tasks. The connector uses a higher learning rate and a shorter warmup than the encoder and decoder, which speeds up adaptation without destabilizing the backbones.

Stage 2 is instruction fine tuning. Here Jina-VLM learns to follow prompts for visual question answering and reasoning. The mix combines LLaVA OneVision, Cauldron, Cambrian, PangeaIns and FineVision, plus Aya style multilingual text only instructions. The Jina research team first trains for 30,000 steps with single source batches, then for another 30,000 steps with mixed source batches. This schedule stabilizes learning in the presence of very heterogeneous supervision.
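
The two phase batching schedule can be sketched as a small sampler. This is an illustration only: the 30,000 step switch point and the source names come from the article, while the uniform sampling, the function name and its parameters are assumptions, since the article does not give the actual mixing weights.

```python
import random

SOURCES = ["LLaVA OneVision", "Cauldron", "Cambrian", "PangeaIns", "FineVision"]
SWITCH_STEP = 30_000  # single source batches before this step, mixed after


def batch_sources(step: int, batch_size: int, sources=SOURCES,
                  switch_step: int = SWITCH_STEP, rng=None) -> list:
    """Return the source dataset of each example in the batch at a given step."""
    rng = rng if rng is not None else random.Random()
    if step < switch_step:
        # Phase 1: every example in the batch comes from one source.
        return [rng.choice(sources)] * batch_size
    # Phase 2: each example draws its source independently (assumed uniform).
    return [rng.choice(sources) for _ in range(batch_size)]


rng = random.Random(0)
print(batch_sources(step=0, batch_size=4, rng=rng))
print(batch_sources(step=30_000, batch_size=4, rng=rng))
```

Single source batches keep the format of the supervision consistent within a step, which is one plausible reason the schedule helps with heterogeneous data before mixing begins.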

Across pretraining and fine tuning, the model sees about 10B tokens in the first stage and 37B tokens in the second stage, with a total of roughly 1,300 GPU hours reported for the main experiments.

Benchmark profile, 2.4B model with multilingual strength

On standard English VQA tasks that include diagrams, charts, documents, OCR and mixed scenes, Jina-VLM reaches an average score of 72.3 across 8 benchmarks: AI2D, ChartQA, TextVQA, DocVQA, InfoVQA, OCRBench, SEED Bench 2 Plus and CharXiv. This is the best average among the 2B scale comparison models in the Jina AI paper.

On multimodal comprehension and real world understanding tasks, the model scores 67.4 on the multimodal group, which includes MME, MMB v1.1 and MMStar. It scores 61.9 on the real world group, which includes RealWorldQA, MME RealWorld and R Bench, and it reaches 68.2 accuracy on RealWorldQA itself, which is the best result among the baselines considered.


Multi image reasoning is a weaker area. On BLINK, MuirBench and MMT, Jina-VLM reaches an average of 47.3. The research team points to limited multi-image training data as the reason. In contrast, hallucination control is strong. On the POPE benchmark, which measures object hallucination, the model scores 90.3, the best score in the comparison table.

For mathematical and structured reasoning, the model uses the same architecture, without thinking mode. It reaches 59.5 on MMMU and an overall math score of 33.3 across MathVista, MathVision, MathVerse, WeMath and LogicVista. Jina-VLM is comparable to InternVL3-2B on this set and clearly ahead of Qwen2-VL-2B, while InternVL3.5-2B remains stronger due to its larger scale and more specialized math training.

On pure text benchmarks, the picture is mixed. The research team reports that Jina-VLM keeps most of the Qwen3-1.7B performance on MMLU, GSM-8K, ARC-C and HellaSwag. However, MMLU-Pro drops from 46.4 for the base model to 30.3 after multimodal tuning. The research team attributes this to instruction tuning that pushes the model toward very short answers, which clashes with the long multi step reasoning required by MMLU-Pro.

The main highlight is multilingual multimodal understanding. On MMMB across Arabic, Chinese, English, Portuguese, Russian and Turkish, Jina-VLM reaches an average of 78.8. On Multilingual MMBench across the same languages, it reaches 74.3. The research team reports these as state of the art averages among open 2B scale VLMs.

Comparison Table

Model | Params | VQA Avg | MMMB | Multi. MMB | DocVQA | OCRBench
Jina-VLM | 2.4B | 72.3 | 78.8 | 74.3 | 90.6 | 778
Qwen2-VL-2B | 2.1B | 66.4 | 71.3 | 69.4 | 89.2 | 809
Qwen3-VL-2B | 2.8B | 71.6 | 75.0 | 72.3 | 92.3 | 858
InternVL3-2B | 2.2B | 69.2 | 73.6 | 71.9 | 87.4 | 835
InternVL3.5-2B | 2.2B | 71.6 | 74.6 | 70.9 | 88.5 | 836
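
The rows above can be read programmatically to see which model leads each column. The sketch below simply copies the table values into a dictionary and picks the maximum per column; the variable and function names are illustrative.

```python
COLUMNS = ("VQA Avg", "MMMB", "Multi. MMB", "DocVQA", "OCRBench")

# Scores copied from the comparison table, in COLUMNS order.
ROWS = {
    "Jina-VLM":       (72.3, 78.8, 74.3, 90.6, 778),
    "Qwen2-VL-2B":    (66.4, 71.3, 69.4, 89.2, 809),
    "Qwen3-VL-2B":    (71.6, 75.0, 72.3, 92.3, 858),
    "InternVL3-2B":   (69.2, 73.6, 71.9, 87.4, 835),
    "InternVL3.5-2B": (71.6, 74.6, 70.9, 88.5, 836),
}


def leaders(rows=ROWS, columns=COLUMNS) -> dict:
    """Map each column name to the model with the highest score in it."""
    return {
        col: max(rows, key=lambda model: rows[model][i])
        for i, col in enumerate(columns)
    }


for column, model in leaders().items():
    print(f"{column}: {model}")
```

On these numbers Jina-VLM leads the VQA average and both multilingual columns, while Qwen3-VL-2B leads on DocVQA and OCRBench.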

Key Takeaways

  1. Jina-VLM is a 2.4B parameter VLM that couples SigLIP2 So400M as vision encoder with Qwen3-1.7B as language backbone through an attention pooling connector that cuts visual tokens by 4 times while keeping spatial structure.
  2. The model uses overlapping 378×378 tiles, 12 tiles plus a global thumbnail, to handle arbitrary resolution images up to roughly 4K, then feeds only pooled visual tokens to the LLM which reduces prefill FLOPs and KV cache size by about 4 times compared to naive patch token usage.
  3. Training uses about 5M multimodal samples and 12B text tokens across more than 30 languages in a 2 stage pipeline, first alignment with caption style data, then instruction fine tuning with LLaVA OneVision, Cauldron, Cambrian, PangeaIns, FineVision and multilingual instruction sets.
  4. On English VQA, Jina-VLM reaches 72.3 average across 8 VQA benchmarks, and on multilingual multimodal benchmarks it leads the open 2B scale class with 78.8 on MMMB and 74.3 on Multilingual MMBench while keeping competitive text only performance.



