Friday, April 3, 2026
mGrowTech

TII Releases Falcon Perception: A 0.6B-Parameter Early-Fusion Transformer for Open-Vocabulary Grounding and Segmentation from Natural Language Prompts

By Josh
April 3, 2026
In AI, Analytics and Automation


In the current landscape of computer vision, the standard operating procedure involves a modular ‘Lego-brick’ approach: a pre-trained vision encoder for feature extraction paired with a separate decoder for task prediction. While effective, this architectural separation complicates scaling and bottlenecks the interaction between language and vision.

The Technology Innovation Institute (TII) research team is challenging this paradigm with Falcon Perception, a 600M-parameter unified dense Transformer. By processing image patches and text tokens in a shared parameter space from the very first layer, the team has developed an early-fusion stack that handles both perception and task modeling with notable efficiency.

https://arxiv.org/pdf/2603.27365

The Architecture: A Single Stack for Every Modality

The core design of Falcon Perception is built on the hypothesis that a single Transformer can simultaneously learn visual representations and perform task-specific generation.

Hybrid Attention and GGROPE

Unlike standard language models that use strict causal masking, Falcon Perception employs a hybrid attention strategy. Image tokens attend to each other bidirectionally to build a global visual context, while text and task tokens attend to all preceding tokens (causal masking) to enable autoregressive prediction.
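The hybrid scheme can be sketched as a boolean mask over the packed sequence. The function below is an illustrative sketch, not TII's implementation; the split into `n_image` and `n_text` token counts is an assumption for clarity.

```python
import numpy as np

def hybrid_attention_mask(n_image: int, n_text: int) -> np.ndarray:
    """Boolean attention mask (True = may attend) for n_image patch
    tokens followed by n_text text/task tokens: bidirectional within
    the image block, causal for the text/task suffix."""
    n = n_image + n_text
    mask = np.zeros((n, n), dtype=bool)
    mask[:n_image, :n_image] = True   # image <-> image, both directions
    for i in range(n_image, n):       # each text/task token sees the
        mask[i, : i + 1] = True       # full prefix up to itself
    return mask
```

Because image tokens precede all text tokens, the causal suffix naturally prevents image tokens from attending to text while still letting every text token condition on the full visual context.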

To maintain 2D spatial relationships in a flattened sequence, the team uses 3D Rotary Positional Embeddings, decomposing the head dimension into a sequential component and a spatial component via Golden Gate ROPE (GGROPE). GGROPE allows attention heads to attend to relative positions along arbitrary angles, making the model robust to rotation and aspect-ratio variation.
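The core idea, attending along an arbitrary direction rather than only rows or columns, can be illustrated by projecting a 2D patch position onto a head-specific unit vector before computing the rotary phase. This is a hypothetical sketch of that idea, not GGROPE's actual parameterization.

```python
import numpy as np

def directional_rope_phase(pos_xy, theta, freqs):
    """Rotary phases for a 2D patch position measured along a
    head-specific direction: project (x, y) onto the unit vector at
    angle theta, then scale by each frequency band. Hypothetical
    sketch of the arbitrary-angle idea behind GGROPE."""
    x, y = pos_xy
    s = x * np.cos(theta) + y * np.sin(theta)  # projected 1D position
    return s * freqs                            # one phase per band
```

Since relative phases subtract, two heads with different angles measure relative displacement along different axes, which is what makes the scheme less sensitive to rotation and aspect ratio.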

Minimalist Sequence Logic

The basic architectural sequence follows a Chain-of-Perception format:

[Image] [Text] <coord> <size> <seg> ... <eos>.

This ensures that the model resolves spatial ambiguity (position and size) as a conditioning signal before generating the final segmentation mask.
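The ordering constraint can be made concrete with a small serializer. The token names follow the paper's sequence; how box values are packed into the tokens is a hypothetical sketch.

```python
def chain_of_perception(expr, box):
    """Serialize one grounded instance in Chain-of-Perception order:
    coordinate and size tokens are emitted before the segmentation
    token, so mask generation is conditioned on resolved position.
    Packing of box values into the tokens is illustrative only."""
    x, y, w, h = box
    return [expr, f"<coord:{x},{y}>", f"<size:{w},{h}>", "<seg>", "<eos>"]
```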

Engineering for Scale: Muon, FlexAttention, and Raster Ordering

The TII research team introduced several optimizations to stabilize training and maximize GPU utilization for these heterogeneous sequences.

  • Muon Optimization: The research team reports that employing the Muon optimizer for the specialized heads (coordinates, size, and segmentation) led to lower training losses and improved benchmark performance compared to standard AdamW.
  • FlexAttention and Sequence Packing: To process images at native resolutions without wasting compute on padding, the model uses a scatter-and-pack strategy. Valid patches are packed into fixed-length blocks, and FlexAttention is used to restrict self-attention within each image sample’s boundaries.
  • Raster Ordering: When multiple objects are present, Falcon Perception predicts them in raster order (top-to-bottom, left-to-right). This was found to converge faster and produce lower coordinate loss than random or size-based ordering.
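Raster ordering reduces to a simple sort by the top-left corner of each predicted object; the `(name, x, y)` object format below is a hypothetical sketch.

```python
def raster_order(objects):
    """Sort detected objects top-to-bottom, left-to-right by their
    top-left corner -- the raster ordering the paper reports converges
    faster than random or size-based orderings. Each object is a
    (name, x, y) tuple, an illustrative format."""
    return sorted(objects, key=lambda o: (o[2], o[1]))  # by y, then x
```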

The Training Recipe: Distillation to 685GT

The model uses multi-teacher distillation for initialization, distilling knowledge from DINOv3 (ViT-H) for local features and SigLIP2 (So400m) for language-aligned features. Following initialization, the model undergoes a three-stage perception training pipeline totaling approximately 685 Gigatokens (GT):

  1. In-Context Listing (450 GT): Learning to ‘list’ the scene inventory to build global context.
  2. Task Alignment (225 GT): Transitioning to independent-query tasks using Query Masking to ensure the model grounds each query solely on the image.
  3. Long-Context Finetuning (10 GT): Short adaptation for extreme density, increasing the mask limit to 600 per expression.

During these stages, the following task-specific serialization is used:

<image> expr1 <present> <coord> <size> <seg> <eoq> expr2 <absent> <eoq> <eos>

The <present> and <absent> tokens force the model to commit to a binary decision on an object’s existence before localization.
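A minimal serializer makes the commit-then-localize structure explicit. Token names follow the format above; the `(expression, found)` result tuples are a hypothetical input format.

```python
def serialize_queries(image_token, results):
    """Serialize multi-query targets: each expression first commits
    to <present>/<absent>, and localization tokens follow only when
    the object exists. Sketch of the paper's format; the result
    tuples are illustrative."""
    parts = [image_token]
    for expr, found in results:
        parts.append(expr)
        if found:
            parts += ["<present>", "<coord>", "<size>", "<seg>"]
        else:
            parts.append("<absent>")   # no localization for absent objects
        parts.append("<eoq>")
    parts.append("<eos>")
    return " ".join(parts)
```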

PBench: Profiling Capabilities Beyond Saturated Baselines

To measure progress, the TII research team introduced PBench, a benchmark that organizes samples into five levels of semantic complexity to disentangle model failure modes.

Main Results: Falcon Perception vs. SAM 3 (Macro-F1)

| Benchmark Split           | SAM 3 | Falcon Perception (600M) |
|---------------------------|-------|--------------------------|
| L0: Simple Objects        | 64.3  | 65.1                     |
| L1: Attributes            | 54.4  | 63.6                     |
| L2: OCR-Guided            | 24.6  | 38.0                     |
| L3: Spatial Understanding | 31.6  | 53.5                     |
| L4: Relations             | 33.3  | 49.1                     |
| Dense Split               | 58.4  | 72.6                     |

Falcon Perception significantly outperforms SAM 3 on complex semantic tasks, particularly showing a +21.9 point gain on spatial understanding (Level 3).
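For reference, macro-F1 (the metric in the results above) is the unweighted mean of per-class F1 scores, so rare classes weigh as much as common ones. The values below are illustrative, not PBench numbers.

```python
def f1(precision, recall):
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)

def macro_f1(pairs):
    """Macro-F1: unweighted average of per-class F1 scores, computed
    from (precision, recall) pairs. Illustrative values only."""
    return sum(f1(p, r) for p, r in pairs) / len(pairs)
```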


FalconOCR: The 300M Document Specialist

The TII team also extended this early-fusion recipe to FalconOCR, a compact 300M-parameter model trained from scratch to prioritize fine-grained glyph recognition. FalconOCR is competitive with several larger proprietary and modular OCR systems:

  • olmOCR: Achieves 80.3% accuracy, matching or exceeding Gemini 3 Pro (80.2%) and GPT 5.2 (69.8%).
  • OmniDocBench: Reaches an overall score of 88.64, ahead of GPT 5.2 (86.56) and Mistral OCR 3 (85.20), though it trails the top modular pipeline PaddleOCR VL 1.5 (94.37).

Key Takeaways

  • Unified Early-Fusion Architecture: Falcon Perception replaces modular encoder-decoder pipelines with a single dense Transformer that processes image patches and text tokens in a shared parameter space from the first layer. It utilizes a hybrid attention mask—bidirectional for visual tokens and causal for task tokens—to act simultaneously as a vision encoder and an autoregressive decoder.
  • Chain-of-Perception Sequence: The model serializes instance segmentation into a structured sequence (⟨coord⟩ → ⟨size⟩ → ⟨seg⟩), which forces it to resolve spatial position and size as a conditioning signal before generating the pixel-level mask.
  • Specialized Heads and GGROPE: To manage dense spatial data, the model uses Fourier Feature encoders for high-dimensional coordinate mapping and Golden Gate ROPE (GGROPE) to enable isotropic 2D spatial attention. The Muon optimizer is employed for these specialized heads to balance learning rates against the pre-trained backbone.
  • Semantic Performance Gains: On the new PBench benchmark, which disentangles semantic capabilities (Levels 0-4), the 600M model demonstrates significant gains over SAM 3 in complex categories, including a +13.4 point lead in OCR-guided queries and a +21.9 point lead in spatial understanding.
  • High-Efficiency OCR Extension: The architecture scales down to FalconOCR, a 300M-parameter model that achieves 80.3% on olmOCR and 88.64 on OmniDocBench. It matches or exceeds the accuracy of much larger systems like Gemini 3 Pro and GPT 5.2 while maintaining high throughput for large-scale document processing.
