
Polaris-4B and Polaris-7B: Post-Training Reinforcement Learning for Efficient Math and Logic Reasoning

By Josh
June 27, 2025
in AI, Analytics and Automation


The Rising Need for Scalable Reasoning Models in Machine Intelligence

Advanced reasoning models are at the frontier of machine intelligence, especially in domains like math problem-solving and symbolic reasoning. These models are designed to perform multi-step calculations and logical deductions, often generating solutions that mirror human reasoning processes. Reinforcement learning techniques are used to improve accuracy after pretraining; however, scaling these methods while retaining efficiency remains a complex challenge. As demand increases for smaller, more resource-efficient models that still exhibit high reasoning capability, researchers are now turning to strategies that address data quality, exploration methods, and long-context generalization.

Challenges in Reinforcement Learning for Large Reasoning Architectures

A persistent problem with reinforcement learning for large-scale reasoning models is the mismatch between the model’s capability and the difficulty of the training data. When a model is exposed to tasks that are too simple, its learning curve stagnates. Conversely, overly difficult data can overwhelm the model and yield no learning signal. This difficulty imbalance is especially pronounced when applying recipes that work well for small models to larger ones. Another issue is the lack of methods to efficiently adapt rollout diversity and output length during both training and inference, which further constrains a model’s reasoning abilities on complex benchmarks.
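One way to make this concrete: a capability-aware filter can estimate, for each training problem, how often the current model already solves it, then discard items that are trivially easy or effectively unsolvable. The sketch below only illustrates that idea under assumed helper callables and cutoff values; it is not taken from any released training code.

```python
# Illustrative sketch of capability-aware difficulty filtering.
# `sample_fn(prompt, n, temperature)` and `verify_fn(solution, answer)` are
# assumed callables supplied by the caller; the cutoffs are illustrative.

def estimate_pass_rate(sample_fn, verify_fn, problem, n_rollouts=8, temperature=1.0):
    """Fraction of sampled solutions that reach the reference answer."""
    solutions = sample_fn(problem["prompt"], n_rollouts, temperature)
    return sum(verify_fn(s, problem["answer"]) for s in solutions) / n_rollouts

def filter_by_difficulty(sample_fn, verify_fn, problems,
                         easy_cutoff=0.9, hard_cutoff=0.0):
    """Keep only problems that still give a learning signal: not solved
    almost every time, and not unsolvable for the current model."""
    kept = []
    for p in problems:
        rate = estimate_pass_rate(sample_fn, verify_fn, p)
        if hard_cutoff < rate < easy_cutoff:
            kept.append((p, rate))
    # Sorting by pass rate makes the resulting difficulty profile easy to
    # inspect and re-balance as the model improves during training.
    return sorted(kept, key=lambda item: item[1])
```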

Limitations of Existing Post-Training Approaches on Advanced Models

Earlier approaches, such as DeepScaleR and GRPO, have demonstrated that reinforcement learning can improve the performance of small-scale reasoning models with as few as 1.5 billion parameters. However, applying these same recipes to more capable models, such as Qwen3-4B or DeepSeek-R1-Distill-Qwen-7B, results in only marginal gains or even performance drops. One key limitation is the static nature of the data distribution and the limited diversity of sampling. Most of these approaches do not filter data based on model capability, nor do they adjust sampling temperature or response length over time. As a result, they often fail to scale effectively when used on more advanced architectures.

Introducing Polaris: A Tailored Recipe for Scalable RL in Reasoning Tasks

Researchers from the University of Hong Kong, ByteDance Seed, and Fudan University introduced Polaris, a post-training recipe designed specifically to scale reinforcement learning for advanced reasoning tasks. Polaris includes two preview models: Polaris-4B-Preview and Polaris-7B-Preview. Polaris-4B-Preview is fine-tuned from Qwen3-4B, while Polaris-7B-Preview is based on DeepSeek-R1-Distill-Qwen-7B. The researchers focused on building a model-agnostic framework that modifies data difficulty, encourages diverse exploration through controlled sampling temperatures, and extends inference capabilities through length extrapolation. These strategies were developed using open-source datasets and training pipelines, and both models are optimized to run on consumer-grade graphics processing units (GPUs).

Polaris Innovations: Difficulty Balancing, Controlled Sampling, and Long-Context Inference

Polaris implements multiple innovations. First, the training data is curated by removing problems that are either too easy or unsolvable, creating a mirrored J-shaped distribution of difficulty. This ensures that the training data evolves with the model’s growing capabilities. Second, the researchers dynamically adjust the sampling temperature across training stages—using 1.4, 1.45, and 1.5 for Polaris-4B and 0.7, 1.0, and 1.1 for Polaris-7B—to maintain rollout diversity. Furthermore, the method employs a YaRN-based extrapolation technique to extend the inference context length to 96K tokens without requiring additional training. This addresses the inefficiency of long-sequence training by enabling a “train-short, test-long” approach. The model also employs techniques such as the Rollout Rescue Mechanism and Intra-Batch Informative Substitution to prevent zero-reward batches and ensure that useful training signals are preserved, even when the rollout size is kept small at 8.
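The article names these mechanisms without implementation detail, so the following is only a plausible sketch of the intra-batch substitution idea under an assumed batch layout and replacement rule; the stage-wise temperatures are the values quoted above.

```python
import random

# Stage-wise sampling temperatures quoted in the text (assumed to switch at
# fixed training-stage boundaries):
TEMPERATURE_SCHEDULE = {
    "Polaris-4B": [1.40, 1.45, 1.50],
    "Polaris-7B": [0.70, 1.00, 1.10],
}

def is_informative(rewards):
    """A rollout group carries gradient signal only if its rewards disagree;
    an all-0 (or all-1) group is washed out by group-wise advantage normalization."""
    return len(set(rewards)) > 1

def substitute_uninformative(batch):
    """batch: list of groups, each {"prompt": ..., "rewards": [r_1, ..., r_8]}.
    Assumed reading of intra-batch informative substitution: zero-signal groups
    are replaced by copies of informative groups from the same batch, keeping
    the batch size (and hence the optimizer step) unchanged."""
    informative = [g for g in batch if is_informative(g["rewards"])]
    if not informative:
        return batch  # nothing informative to copy from; leave the batch as-is
    return [g if is_informative(g["rewards"]) else random.choice(informative)
            for g in batch]
```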

Benchmark Results: Polaris Outperforms Larger Commercial Models

Polaris models achieve state-of-the-art results across multiple math benchmarks. Polaris-4B-Preview records 81.2% accuracy on AIME24 and 79.4% on AIME25, outperforming even Qwen3-32B on the same tasks with roughly one-eighth of its parameters. It scores 44.0% on Minerva Math, 69.1% on Olympiad Bench, and 94.8% on AMC23. Polaris-7B-Preview also performs strongly, scoring 72.6% on AIME24 and 52.6% on AIME25. These results demonstrate consistent improvement over models such as Claude-4-Opus and Grok-3-Beta, establishing Polaris as a competitive, lightweight model that bridges the performance gap between small open models and commercial 30B+ models.
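For readers who want to try the preview checkpoints, here is a minimal inference sketch. The Hugging Face repository id is an assumption inferred from the model name, and the rope_scaling override only illustrates the YaRN-style “train-short, test-long” extension described above (Qwen-family configs commonly accept this shape); the released model card is the authority on the actual settings.

```python
# Minimal, hedged inference sketch; repo id and rope_scaling values are assumptions.
from transformers import AutoConfig, AutoModelForCausalLM, AutoTokenizer

model_id = "POLARIS-Project/Polaris-4B-Preview"  # assumed repo id

config = AutoConfig.from_pretrained(model_id)
# YaRN-style context extension (illustrative values; older transformers
# versions use the key "type" instead of "rope_type").
config.rope_scaling = {
    "rope_type": "yarn",
    "factor": 3.0,
    "original_max_position_embeddings": 32768,
}

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id, config=config, torch_dtype="auto", device_map="auto"
)

messages = [{"role": "user", "content": "Find all real x such that x^2 - 5x + 6 = 0."}]
inputs = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to(model.device)

# Temperature 1.4 matches the first-stage value reported for Polaris-4B.
outputs = model.generate(inputs, max_new_tokens=2048, do_sample=True, temperature=1.4)
print(tokenizer.decode(outputs[0][inputs.shape[-1]:], skip_special_tokens=True))
```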

Conclusion: Efficient Reinforcement Learning Through Smart Post-Training Strategies

The researchers demonstrated that the key to scaling reasoning models is not just larger model size but intelligent control over training data difficulty, sampling diversity, and inference length. Polaris offers a reproducible recipe that effectively tunes these elements, allowing smaller models to rival the reasoning ability of massive commercial systems.


Check out the Model and Code. All credit for this research goes to the researchers of this project.


Nikhil is an intern consultant at Marktechpost. He is pursuing an integrated dual degree in Materials at the Indian Institute of Technology, Kharagpur. Nikhil is an AI/ML enthusiast who is always researching applications in fields like biomaterials and biomedical science. With a strong background in Material Science, he is exploring new advancements and creating opportunities to contribute.



