
TII’s Falcon H1R 7B can out-reason models up to 7x its size — and it’s (mostly) open

By Josh
January 6, 2026
in Technology And Software



For the last two years, the prevailing logic in generative AI has been one of brute force: if you want better reasoning, you need a bigger model.


While "small" models (under 10 billion parameters) have become capable conversationalists, they have historically crumbled when asked to perform multi-step logical deduction or complex mathematical proofs.

Today, the Technology Innovation Institute (TII) in Abu Dhabi is challenging that scaling law with the release of Falcon H1R 7B.

By abandoning the pure Transformer orthodoxy in favor of a hybrid architecture, TII claims to have built a 7-billion parameter model that not only rivals but outperforms competitors nearly 7X its size — including the 32B and 47B variants of Alibaba's Qwen and Nvidia's Nemotron.

The release marks a significant shift in the open-weight ecosystem, moving the battleground from raw parameter count to architectural efficiency and inference-time scaling.

The model weights are available now on Hugging Face, and the model can be tested in a live demo on Falcon Chat, TII's chatbot interface. TII has also released a comprehensive technical report detailing the approach and training methodology behind Falcon H1R 7B.

Moving Beyond the Transformer, the Foundational LLM Architecture

The defining feature of Falcon H1R 7B is its "hybrid" backbone. Most modern LLMs rely exclusively on the Transformer architecture, which scales predictably but suffers from high memory costs when processing long sequences.

Falcon H1R 7B integrates Mamba, a state-space model (SSM) architecture, alongside standard Transformer attention layers.

Originally developed by researchers Albert Gu and Tri Dao at Carnegie Mellon University and Princeton University, Mamba was first introduced in the paper "Mamba: Linear-Time Sequence Modeling with Selective State Spaces" published on December 1, 2023.

The architecture processes data sequences differently than Transformers: while Transformers compare every piece of data to every other piece (quadratic scaling), Mamba processes tokens sequentially, allowing it to handle vast amounts of information with linear scaling and significantly reduced compute costs.
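The scaling difference is easy to see in a toy sketch (illustrative only, not TII's actual kernels): full attention scores every token pair, while a Mamba-style recurrence carries a fixed-size state through the sequence.

```python
# Toy illustration of the two scaling behaviors -- NOT TII's implementation.
# Full attention compares every token pair (quadratic in sequence length);
# an SSM-style recurrence carries a fixed-size state through the sequence
# (linear in sequence length).

def attention_score_count(n: int) -> int:
    """Pairwise comparisons a full attention layer performs over n tokens."""
    return n * n

def ssm_step_count(n: int) -> int:
    """Sequential state updates a Mamba-style recurrence performs."""
    return n

def ssm_scan(tokens, decay: float = 0.9) -> float:
    """Minimal scan flavor: h_t = decay * h_{t-1} + x_t, constant memory."""
    h = 0.0
    for x in tokens:
        h = decay * h + x
    return h

# Doubling the context quadruples attention work but only doubles SSM work.
print(attention_score_count(2048) // attention_score_count(1024))  # 4
print(ssm_step_count(2048) // ssm_step_count(1024))                # 2
```

The constant-size hidden state is also why the memory footprint of long reasoning chains stays flat for the SSM layers, rather than growing with the KV cache as in pure attention.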

This combination addresses one of the most persistent bottlenecks in deploying reasoning models: the cost of "thinking." Reasoning models require generating long "chains of thought"—step-by-step internal monologues—before arriving at an answer. For standard Transformers, these long contexts explode computational costs.

According to TII’s technical report, the hybrid approach allows Falcon H1R 7B to maintain high throughput even as response lengths grow. At a batch size of 64, the model processes approximately 1,500 tokens per second per GPU—nearly double the speed of the competing Qwen3 8B model.

Benchmark Performance: Punching Up

In the benchmarks released by TII, the disparity between Falcon H1R 7B’s size and its performance is stark. On the AIME 2025 leaderboard—a rigorous test of mathematical reasoning—Falcon H1R 7B scored 83.1%, a result that disrupts the traditional hierarchy of model sizing.

While the 7B model naturally trails massive proprietary frontier models like GPT-5.2 (99.0%) and Gemini 3 Flash (97.0%) on the separate Artificial Analysis index (run by the independent organization of the same name, which has not yet benchmarked Falcon H1R 7B), it has effectively collapsed the gap between "efficient" open weights and mid-tier proprietary systems.

  • Beating Larger "Thinkers": Falcon H1R 7B (83.1%) outperforms the 15-billion parameter Apriel-v1.6-Thinker (82.7%) and the 32-billion parameter OLMo 3 Think (73.7%), validating TII's claim that hybrid architectures can out-reason larger Transformers.

  • Chasing Proprietary Leaders: It sits within striking distance of Claude 4.5 Sonnet (88.0%) and Amazon Nova 2.0 Lite (88.7%), suggesting that for specific math-heavy workflows, this 7B model is a viable, low-latency alternative to expensive commercial APIs.

  • Outperforming Legacy Giants: On this specific reasoning metric, it decisively beats broadly capable but older architectures like Mistral Large 3 (38.0%) and Llama 4 Maverick (19.3%), highlighting how specialized reasoning training ("Deep Think") has become more critical than raw scale for logic tasks.

Other key domain wins include:

  • Coding: The model achieved 68.6% on the LCB v6 benchmark, a score TII claims is the highest among all tested models, including those four times its size.

  • General Reasoning: While it dominates in math and code, its general reasoning score (49.48%) remains competitive, sitting just below the 14B and 15B parameter models but comfortably ahead of comparable 8B models.

Training Techniques

Falcon H1R 7B’s performance is not just architectural; it stems from a rigorous, two-stage training pipeline designed to maximize reasoning density without inflating parameter count, according to TII's technical report on the model.

Stage 1: Cold-Start Supervised Fine-Tuning (SFT). The model underwent "cold-start" SFT on a curated dataset dominated by mathematics (56.8% of tokens) and code (29.8%), with response lengths stretching up to 48,000 tokens.

  • Difficulty-Aware Weighting: TII rejected the standard practice of treating all data equally. Instead, they applied a weighting scheme where "hard" problems were up-weighted by 1.25x to 1.75x, while easy problems were down-weighted or removed entirely to prevent overfitting to trivial tasks.

  • Single-Teacher Consistency: Ablation studies revealed that mixing reasoning traces from multiple "teacher" models actually degraded performance due to conflicting reasoning styles. Consequently, TII opted for a single-teacher approach to maintain coherent internal logic.

  • Balanced Token Normalization: To handle the massive variance in sequence lengths (short instructions vs. massive reasoning chains), the team introduced a Balanced Data-Parallel Token Normalization strategy. This technique equalizes the gradient contribution of each token across GPUs, preventing ranks with shorter sequences from destabilizing the loss—a change that yielded a consistent 4-10% accuracy boost during training.
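The difficulty-aware weighting described above can be sketched as a simple mapping from a problem's estimated solve rate to a loss weight. Only the 1.25x–1.75x up-weighting range and the removal of trivial examples come from the report; the solve-rate thresholds below are assumptions for illustration.

```python
# Hedged sketch of difficulty-aware data weighting. Hard problems are
# up-weighted by 1.25x-1.75x and trivial ones dropped, per the report;
# the solve-rate thresholds below are assumed for illustration.

def example_weight(solve_rate: float) -> float:
    """Map a problem's estimated solve rate to a training-loss weight."""
    if solve_rate < 0.25:   # very hard: strongest up-weight
        return 1.75
    if solve_rate < 0.50:   # moderately hard: mild up-weight
        return 1.25
    if solve_rate > 0.90:   # trivial: removed to avoid overfitting
        return 0.0
    return 1.0              # ordinary examples keep unit weight
```

In a training loop, this weight would simply multiply each example's loss before backpropagation.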
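The intuition behind balanced token normalization can be shown in a simplified, single-process sketch (no real all-reduce; GPU ranks are plain lists here): a naive per-rank mean gives every rank equal gradient weight even when one rank holds far fewer tokens, while dividing each rank's summed loss by the global token count equalizes every token's contribution.

```python
# Simplified, single-process sketch of balanced data-parallel token
# normalization (no real all-reduce; "GPU ranks" are plain lists here).
# Naive per-rank averaging gives every rank equal gradient weight even
# when one rank holds far fewer tokens; balanced normalization divides
# each rank's summed loss by the GLOBAL token count instead.

def naive_rank_losses(rank_token_losses):
    """Each rank averages over its own tokens (unbalanced)."""
    return [sum(toks) / len(toks) for toks in rank_token_losses]

def balanced_rank_losses(rank_token_losses):
    """Each rank divides by the global token count (balanced)."""
    total = sum(len(toks) for toks in rank_token_losses)
    return [sum(toks) / total for toks in rank_token_losses]

# Rank 0 holds a 2-token instruction, rank 1 an 8-token reasoning chain.
ranks = [[1.0, 1.0], [1.0] * 8]
print(naive_rank_losses(ranks))     # [1.0, 1.0] -> short rank over-weighted
print(balanced_rank_losses(ranks))  # [0.2, 0.8] -> proportional to tokens
```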

Stage 2: Reinforcement Learning via Group Relative Policy Optimization (GRPO). Following SFT, the model was refined using GRPO, a reinforcement learning algorithm that rewards correct outcomes without needing a separate value model.

  • The "No-KL" Shift: In a deviation from standard RLHF, TII removed the KL-divergence penalty (beta=0) entirely. This allowed the model to drift significantly from its base SFT policy, encouraging aggressive exploration of novel reasoning paths.

  • Math-Only Curriculum: Surprisingly, TII found that training exclusively on math problems during the RL stage yielded better generalization across all domains—including code and science—than mixed strategies. Ablations showed that "code-only" training improved coding scores but harmed general reasoning, whereas math-focused RL lifted performance globally.
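A minimal sketch of the group-relative advantage at the heart of GRPO, with the KL penalty removed (beta = 0) as the report describes. The group size and the binary correctness reward are illustrative.

```python
# Hedged sketch of GRPO's group-relative advantage with the KL penalty
# removed (beta = 0). Group size and the binary correctness reward are
# illustrative; the no-KL choice is the one the report describes.
from statistics import mean, pstdev

def grpo_advantages(group_rewards, eps: float = 1e-8):
    """Advantage of each sampled completion relative to its own group."""
    mu, sigma = mean(group_rewards), pstdev(group_rewards)
    return [(r - mu) / (sigma + eps) for r in group_rewards]

# Four sampled solutions to one math problem; reward 1.0 = correct answer.
advs = grpo_advantages([1.0, 0.0, 0.0, 1.0])
# With beta = 0 the update is just advantage-weighted log-likelihood:
# nothing pulls the policy back toward the SFT reference model.
```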

TII optimized the model specifically for Test-Time Scaling (TTS), a technique where a model generates multiple reasoning paths in parallel to find the best solution.

The model utilizes Deep Think with Confidence (DeepConf), which leverages the model's internal confidence scores to dynamically prune low-quality reasoning traces.

  • Adaptive Pruning: During generation, the system initiates a "warm-up" phase with 16 traces to establish a confidence baseline. It then aggressively filters subsequent traces, terminating any chain that falls below the 10th percentile of the baseline confidence.

  • Efficiency Gains: This method creates a new Pareto frontier for deployment. In benchmark tests, Falcon H1R 7B achieved 96.7% accuracy on AIME 25 while reducing token usage by 38% compared to the DeepSeek-R1-0528-Qwen3-8B baseline.
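The adaptive-pruning logic above can be sketched as follows. Only the 16-trace warm-up and the 10th-percentile threshold come from the text; the confidence signal itself (in practice derived from token probabilities) is abstracted to a plain float here.

```python
# Sketch of DeepConf-style adaptive pruning: a 16-trace warm-up sets a
# confidence baseline, then later traces falling below the warm-up's
# 10th percentile are terminated. The confidence signal (in practice
# derived from token probabilities) is abstracted to a float here.

WARMUP_TRACES = 16

def nearest_rank_percentile(values, q: float) -> float:
    """Nearest-rank percentile of `values` for q in [0, 100]."""
    ordered = sorted(values)
    idx = int(round(q / 100 * (len(ordered) - 1)))
    return ordered[idx]

def should_prune(trace_confidence: float, seen_confidences) -> bool:
    """Terminate a trace whose confidence is below the warm-up baseline."""
    if len(seen_confidences) < WARMUP_TRACES:
        return False  # still warming up: keep every trace
    threshold = nearest_rank_percentile(seen_confidences[:WARMUP_TRACES], 10)
    return trace_confidence < threshold
```

In a serving loop, `should_prune` would be consulted periodically during generation, so low-confidence chains stop early instead of burning tokens to completion.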

Licensing: Open For Commercial Usage, But With Strings Attached

TII has released Falcon H1R 7B under the custom Falcon LLM License 1.0, which is based on Apache 2.0 but adds notable modifications: chief among them, users may not litigate against TII and must always credit it.

For developers and startups, the license is largely permissive:

  • Royalty-Free: Users can run, modify, and distribute the model commercially without paying TII.

  • Attribution: Any derivative work (including fine-tunes) must prominently state: "[Name of work] is built using Falcon LLM technology from the Technology Innovation Institute".

However, unlike a pure Open Source Initiative (OSI) license, the Falcon license includes a strict Acceptable Use Policy (AUP).

The license terminates automatically if the model is used to create work that conflicts with the AUP or if the user initiates patent litigation against TII.

Specifically, the AUP prohibits using Falcon H1R 7B or its derivatives for:

  • Violating Laws: Any use that violates applicable national, federal, state, local, or international laws or regulations.

  • Harm to Minors or Living Beings: Exploiting, harming, or attempting to exploit or harm minors or any living beings.

  • Disinformation: Generating or disseminating verifiably false information with the purpose of harming others.

  • Harassment: Defaming, disparaging, or otherwise harassing others.

The Hybrid Wave: Nvidia, IBM, AI21, and Mistral

TII is not alone in betting on this hybrid future; the industry is increasingly moving toward architectures that blend the strengths of SSMs and Transformers.

  • Nvidia recently debuted the Nemotron 3 family on December 15, 2025, which utilizes a hybrid mixture-of-experts (MoE) and Mamba-Transformer design to drive efficient agentic AI.

  • IBM launched its Granite 4.0 family on October 2, 2025, using a hybrid Mamba-Transformer architecture to cut memory requirements by over 70% while maintaining high performance on enterprise benchmarks.

  • AI21 has pursued this path with its Jamba (Joint Attention and Mamba) models, releasing the Jamba 1.5 family on August 22, 2024, to boost agentic AI capabilities through a hybrid SSM-Transformer approach.

  • Mistral entered the space early with Codestral Mamba on July 16, 2024, a model specifically optimized for faster, longer code generation.

Falcon H1R 7B represents the latest evolution in this trend, specifically targeting dense reasoning tasks in a compact form factor.


