TII Releases Falcon Perception: A 0.6B-Parameter Early-Fusion Transformer for Open-Vocabulary Grounding and Segmentation from Natural Language Prompts

In the current landscape of computer vision, the standard operating procedure involves a modular ‘Lego-brick’ approach: a pre-trained vision encoder for feature extraction paired with a separate decoder for task prediction. While effective, this architectural separation complicates scaling and bottlenecks the interaction between language and vision. The Technology Innovation Institute (TII) research team is challenging this paradigm with Falcon Perception, a 600M-parameter unified dense Transformer. By processing image patches and text tokens in a shared parameter space from the very first layer, the TII research team has developed an early-fusion stack that handles both perception and task modeling within a single compact model.

The Architecture: A Single Stack for Every Modality

The core design of Falcon Perception is built on the hypothesis that a single Transformer can simultaneously learn visual representations and perform task-specific generation.
Hybrid Attention and GGROPE
Unlike standard language models that use strict causal masking, Falcon Perception employs a hybrid attention strategy. Image tokens attend to each other bidirectionally to build a global visual context, while text and task tokens attend to all preceding tokens (causal masking) to enable autoregressive prediction.
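The hybrid mask described above can be illustrated with a minimal sketch. The function name `hybrid_attention_mask` and the boolean `is_image` flag are hypothetical, and the sketch assumes the image tokens come first in the sequence; it is not the released implementation.

```python
def hybrid_attention_mask(is_image):
    """Return mask[i][j] = True when query token i may attend to key token j.

    is_image: list of bools, one per token, True for image-patch tokens.
    Assumes the [Image] [Text] ... layout, with image tokens first.
    """
    n = len(is_image)
    mask = [[False] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            if is_image[i] and is_image[j]:
                # image <-> image: bidirectional, regardless of order
                mask[i][j] = True
            elif not is_image[i]:
                # text/task query: causal, sees itself and everything before it
                mask[i][j] = j <= i
    return mask


if __name__ == "__main__":
    # Three image patches followed by two text/task tokens.
    for row in hybrid_attention_mask([True, True, True, False, False]):
        print("".join("x" if allowed else "." for allowed in row))
```

In the printed grid, the top-left image block is fully filled, while the text rows form the usual lower-triangular causal pattern that also covers every image token.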
To maintain 2D spatial relationships in a flattened sequence, the research team uses 3D Rotary Positional Embeddings. This decomposes the head dimension into a sequential component and a spatial component using Golden Gate ROPE (GGROPE). GGROPE allows attention heads to attend to relative positions along arbitrary angles, making the model robust to rotation and aspect ratio variations.
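To make the idea of rotary embeddings "along arbitrary angles" concrete, here is a rough sketch of a 2D rotary scheme in which each feature pair is rotated by a phase proportional to the projection of the patch position onto its own direction. The golden-ratio spacing of directions, the frequency values, and all function names are illustrative assumptions; the article does not give the actual schedule. The sketch covers only the spatial component, not the sequential one.

```python
import math

GOLDEN_RATIO = (1 + math.sqrt(5)) / 2


def direction_angles(num_pairs):
    # Assumed scheme: spread directions over [0, pi) using golden-ratio steps.
    return [math.pi * ((k / GOLDEN_RATIO) % 1.0) for k in range(num_pairs)]


def rotate_2d_rope(vec, pos, freqs, angles):
    """Rotate consecutive (even, odd) pairs of `vec` by a position-dependent phase."""
    x, y = pos
    out = []
    for k, (f, a) in enumerate(zip(freqs, angles)):
        # Phase = frequency * projection of (x, y) onto direction k.
        phase = f * (x * math.cos(a) + y * math.sin(a))
        c, s = math.cos(phase), math.sin(phase)
        v0, v1 = vec[2 * k], vec[2 * k + 1]
        out += [v0 * c - v1 * s, v0 * s + v1 * c]
    return out


def dot(a, b):
    return sum(p * q for p, q in zip(a, b))


if __name__ == "__main__":
    q = [0.3, -1.2, 0.7, 0.5]
    k = [1.0, 0.4, -0.6, 0.9]
    freqs = [1.0, 0.25]
    angles = direction_angles(2)
    # Shifting both positions by the same offset leaves the score unchanged.
    s1 = dot(rotate_2d_rope(q, (2, 3), freqs, angles), rotate_2d_rope(k, (5, 1), freqs, angles))
    s2 = dot(rotate_2d_rope(q, (12, -4), freqs, angles), rotate_2d_rope(k, (15, -6), freqs, angles))
    print(abs(s1 - s2) < 1e-9)
```

The check at the end shows the property that matters for attention: the query-key score depends only on the relative 2D offset between patches, not on their absolute positions.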
Minimalist Sequence Logic
The basic architectural sequence follows a Chain-of-Perception format:
[Image] [Text] <coord> <size> <seg> ... <eos>
This ensures that the model resolves spatial ambiguity (position and size) as a conditioning signal before generating the final segmentation mask.
Engineering for Scale: Muon, FlexAttention, and Raster Ordering
The TII research team introduced several optimizations to stabilize training and maximize GPU utilization for these heterogeneous sequences.

Muon Optimization: The research team reports that employing the Muon optimizer for specialized heads (coordinates, size, and segmentation) led to lower training losses and improved performance on benchmarks compared to standard AdamW.
FlexAttention and Sequence Packing: To process images at native resolutions without wasting compute on padding, the model uses a scatter-and-pack strategy. Valid patches are packed into fixed-length blocks, and FlexAttention is used to restrict self-attention within each image sample’s boundaries.
Raster Ordering: When multiple objects are present, Falcon Perception predicts them in raster order (top-to-bottom, left-to-right). This was found to converge faster and produce lower coordinate loss than random or size-based ordering.
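Two of the ideas in this list are easy to show in a few lines. The sketch below sorts objects in raster order by their center point and builds the kind of same-sample predicate that restricts attention inside a packed block. The dictionary keys `cx` and `cy`, the function names, and the choice to sort on exact center coordinates are assumptions for illustration; the mask is plain Python and does not call FlexAttention itself.

```python
def raster_order(objects):
    """Sort objects top-to-bottom, then left-to-right, by their center point."""
    return sorted(objects, key=lambda o: (o["cy"], o["cx"]))


def same_sample_mask(sample_ids):
    """mask[i][j] is True only when tokens i and j come from the same packed image sample."""
    return [[a == b for b in sample_ids] for a in sample_ids]


if __name__ == "__main__":
    objects = [
        {"name": "a", "cx": 0.8, "cy": 0.2},
        {"name": "b", "cx": 0.1, "cy": 0.9},
        {"name": "c", "cx": 0.3, "cy": 0.2},
    ]
    print([o["name"] for o in raster_order(objects)])

    # Two patches from sample 0 and one patch from sample 1 packed into one block.
    for row in same_sample_mask([0, 0, 1]):
        print(row)
```

A real pipeline would likely quantize coordinates before sorting so that objects on roughly the same row are ordered left-to-right; that detail is left out here.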

The Training Recipe: Distillation to 685GT
The model uses multi-teacher distillation for initialization, distilling knowledge from DINOv3 (ViT-H) for local features and SigLIP2 (So400m) for language-aligned features. Following initialization, the model undergoes a three-stage perception training pipeline totaling approximately 685 Gigatokens (GT):

In-Context Listing (450 GT): Learning to ‘list’ the scene inventory to build global context.
Task Alignment (225 GT): Transitioning to independent-query tasks using Query Masking to ensure the model grounds each query solely on the image.
Long-Context Finetuning (10 GT): Short adaptation for extreme density, increasing the mask limit to 600 per expression.
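As a quick check on the token budget, the three stage sizes listed above add up to the stated total. The short snippet below also derives each stage's share; the variable names are illustrative only.

```python
# Stage sizes in gigatokens (GT), taken from the list above.
stages = {
    "In-Context Listing": 450,
    "Task Alignment": 225,
    "Long-Context Finetuning": 10,
}

total_gt = sum(stages.values())
shares = {name: round(100 * gt / total_gt, 1) for name, gt in stages.items()}

if __name__ == "__main__":
    print(f"total: {total_gt} GT")
    for name, pct in shares.items():
        print(f"{name}: {pct}% of the budget")
```

Roughly two thirds of the budget goes to the listing stage, about one third to task alignment, and only about 1.5% to long-context finetuning.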

During these stages, the task-specific serialization is used:
<image>expr1<present><coord><size><seg> <eoq>expr2<absent> <eoq> <eos>
The <present> and <absent> tokens force the model to commit to a binary decision on an object’s existence before localization.
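The serialization above can be reproduced with a small helper. This is a sketch under assumptions: the function name `serialize_queries` is hypothetical, `<coord>`, `<size>`, and `<seg>` stand in as placeholders for the predicted values, and repeating the triple once per instance is a guess at how multiple instances of one expression are laid out.

```python
def serialize_queries(queries):
    """Build the task sequence for a list of (expression, num_instances) pairs.

    An expression with zero instances is marked <absent>; otherwise it is
    marked <present> and followed by one <coord><size><seg> triple per
    instance (the per-instance repetition is an assumption).
    """
    parts = ["<image>"]
    for expr, num_instances in queries:
        parts.append(expr)
        if num_instances > 0:
            parts.append("<present>")
            parts.extend(["<coord><size><seg>"] * num_instances)
        else:
            parts.append("<absent>")
        parts.append(" <eoq>")
    parts.append(" <eos>")
    return "".join(parts)


if __name__ == "__main__":
    print(serialize_queries([("expr1", 1), ("expr2", 0)]))
```

Because the existence token is emitted before any coordinates, a decoder following this layout has to decide presence first and can stop early for absent expressions.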
PBench: Profiling Capabilities Beyond Saturated Baselines
To measure progress, the TII research team introduced PBench, a benchmark that organizes samples into five levels of semantic complexity to disentangle model failure modes.
Main Results: Falcon Perception vs. SAM 3 (Macro-F1)

Benchmark Split	SAM 3	Falcon Perception (600M)
L0: Simple Objects	64.3	65.1
L1: Attributes	54.4	63.6
L2: OCR-Guided	24.6	38.0
L3: Spatial Understanding	31.6	53.5
L4: Relations	33.3	49.1
Dense Split	58.4	72.6

Falcon Perception significantly outperforms SAM 3 on complex semantic tasks, particularly showing a +21.9 point gain on spatial understanding (Level 3).
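The per-split gains can be recomputed directly from the Macro-F1 scores reported in the results table. The snippet below is only arithmetic on those published numbers; the variable names are illustrative.

```python
# (SAM 3, Falcon Perception) Macro-F1 per PBench split, as reported in the table.
scores = {
    "L0: Simple Objects": (64.3, 65.1),
    "L1: Attributes": (54.4, 63.6),
    "L2: OCR-Guided": (24.6, 38.0),
    "L3: Spatial Understanding": (31.6, 53.5),
    "L4: Relations": (33.3, 49.1),
    "Dense Split": (58.4, 72.6),
}

# Gain in Macro-F1 points, rounded to one decimal to avoid float noise.
gains = {split: round(falcon - sam3, 1) for split, (sam3, falcon) in scores.items()}
largest = max(gains, key=gains.get)

if __name__ == "__main__":
    for split, gain in gains.items():
        print(f"{split}: +{gain}")
    print("largest gain:", largest)
```

The gap is under one point on simple objects and grows with semantic complexity, peaking on the spatial-understanding split.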

https://arxiv.org/pdf/2603.27365

FalconOCR: The 300M Document Specialist
The TII team also extended this early-fusion recipe to FalconOCR, a compact 300M-parameter model initialized from scratch to prioritize fine-grained glyph recognition. FalconOCR is competitive with several larger proprietary and modular OCR systems:

olmOCR: Achieves 80.3% accuracy, narrowly ahead of Gemini 3 Pro (80.2%) and well ahead of GPT 5.2 (69.8%).
OmniDocBench: Reaches an overall score of 88.64, ahead of GPT 5.2 (86.56) and Mistral OCR 3 (85.20), though it trails the top modular pipeline PaddleOCR VL 1.5 (94.37).

Key Takeaways

Unified Early-Fusion Architecture: Falcon Perception replaces modular encoder-decoder pipelines with a single dense Transformer that processes image patches and text tokens in a shared parameter space from the first layer. It uses a hybrid attention mask (bidirectional for visual tokens, causal for task tokens) to act simultaneously as a vision encoder and an autoregressive decoder.
Chain-of-Perception Sequence: The model serializes instance segmentation into a structured sequence (<coord> → <size> → <seg>), which forces it to resolve spatial position and size as a conditioning signal before generating the pixel-level mask.
Specialized Heads and GGROPE: To manage dense spatial data, the model uses Fourier Feature encoders for high-dimensional coordinate mapping and Golden Gate ROPE (GGROPE) to enable isotropic 2D spatial attention. The Muon optimizer is employed for these specialized heads to balance learning rates against the pre-trained backbone.
Semantic Performance Gains: On the new PBench benchmark, which disentangles semantic capabilities (Levels 0-4), the 600M model demonstrates significant gains over SAM 3 in complex categories, including a +13.4 point lead in OCR-guided queries and a +21.9 point lead in spatial understanding.
High-Efficiency OCR Extension: The architecture scales down to FalconOCR, a 300M-parameter model that achieves 80.3% on olmOCR and 88.64 on OmniDocBench. On olmOCR it matches or exceeds the accuracy of much larger systems like Gemini 3 Pro and GPT 5.2 while maintaining high throughput for large-scale document processing.
