Weibo's new open source AI model VibeThinker-1.5B outperforms DeepSeek-R1 on $7,800 post-training budget

Another day in late 2025, another impressive result from a Chinese company in open source artificial intelligence.

Our favorite 2025 advent calendars from Lego, Pokémon, Funko Pop, Magna-Tiles and more

DHS Kept Chicago Police Records for Months in Violation of Domestic Espionage Rules

Chinese social networking company Weibo's AI division recently released its open source VibeThinker-1.5B—a 1.5 billion parameter large language model (LLM) that is a fine-tuned variant of rival Chinese tech firm Alibaba's Qwen2.5-Math-1.5B.

It's available now for free download and usage by researchers and enterprise developers—even for commercial purposes—under a permissive MIT License on Hugging Face, GitHub and ModelScope, with a technical report on open access science publishing site arxiv.org.

And yet, despite its compact size, VibeThinker-1.5B achieves benchmark-topping reasoning performance on math and code tasks, rivaling or surpassing models hundreds of times its size, even outperforming Chinese rival DeepSeek's famed R1 that went viral at the start of this year—a 671-billion parameter model—on formal reasoning benchmark.

It further eclipses Mistral AI's Magistral Medium and holds its own against Anthropic's Claude Opus 4 and OpenAI's gpt-oss-20B Medium, all while requiring a fraction of the infrastructure and investment.

It also does so having been post-trained on a budget of merely $7800 USD for compute resources (3900 GPU hours on Nvidia H800s) — far less than the tens, or even hundreds, of thousands of dollars typically required to fine-tune models of similar or larger scale.

Recall this is not the total cost of the model's development, however: LLMs are trained in stages. First comes pre-training, when the model learns basic language structure and general knowledge by predicting the next word across enormous amounts of text from the internet, books, and articles. This gives it fluency but not much sense of how to follow instructions or hold a conversation

Post-training comes next, using much smaller, higher-quality datasets—typically collections of example questions, prompts, and expert-written answers—to teach the model how to respond helpfully, reason through problems, and align with human expectations. Still, Weibo's post-training cost effectiveness on VibeThinker-1.5B is noteworthy and should be commended.

The open-source release upends assumptions about parameter scale, compute intensity, and the minimum viable size for high-performance LLMs.

A Different Training Approach: Spectrum-to-Signal

VibeThinker-1.5B owes its performance not to scale, but to the training framework behind it: the Spectrum-to-Signal Principle (SSP).

Instead of optimizing a model purely for single-answer correctness (Pass@1), the SSP framework decouples supervised fine-tuning (SFT) and reinforcement learning (RL) into two distinct phases with different goals:

SFT (“Spectrum Phase”): The model is trained to maximize diversity across potential correct answers, improving its Pass@K score. This builds a wide range of plausible solution paths.
RL (“Signal Phase”): A second-stage reinforcement learning system (called MaxEnt-Guided Policy Optimization, or MGPO) is used to identify and amplify the most correct paths from this diverse solution pool. MGPO prioritizes problems where the model is most uncertain, using entropy-based weighting to focus learning.

The authors argue this separation allows small models to explore reasoning space more effectively—achieving signal amplification without relying on massive parameter counts.

VibeThinker-1.5B makes a compelling case that the industry’s reliance on parameter scaling as the only route to better reasoning performance may be outdated.

By adopting a diversity-first training pipeline, WeiboAI has shown that smaller, more accessible models can match and even outperform billion-dollar systems in logic-heavy tasks.

The low resource footprint is among the most significant aspects of VibeThinker-1.5B. At under $8,000, the post-training cost is 30–60x lower than models like DeepSeek R1 and MiniMax-M1, which cost between $294K and $535K to train.

Performance Across Domains

Despite its small size, VibeThinker-1.5B delivers cross-domain reasoning that outpaces many larger open-source and commercial models:

Model	AIME25	LiveCodeBench v6	GPQA-Diamond
VibeThinker-1.5B	74.4	51.1	46.7
GPT-OSS-20B-Medium	72.1	54.9	66.0
Claude Opus 4	69.2	56.6	79.6
MiniMax M1 (456B)	74.6	62.3	69.2
DeepSeek R1 (671B)	70.0	65.9	71.5
Kimi K2 (1.09T)	49.5	53.7	75.1

VibeThinker was benchmarked against both reasoning-centric models (Magistral, Claude, OpenAI o3-mini) and non-reasoning LLMs (GPT-4.1, Kimi K2, DeepSeek V3). Across structured reasoning benchmarks, the model consistently outperformed non-reasoning models, regardless of size:

On AIME24 (math), it beat Kimi K2 (1.09T) by over 10 points (80.3 vs. 69.6).
On LiveCodeBench v6, it surpassed Claude Opus 4 (51.1 vs. 47.4).
On GPQA, it scored below GPT-4.1 and Claude, but still doubled its base model (from 16.4 to 46.7).

This supports the authors’ claim that size is not the only path to reasoning capability—with proper training design, smaller models can reach or even exceed the performance of far larger systems in targeted tasks.

Notably, it achieves parity with models hundreds of times larger on math and code, though it lags behind in general knowledge reasoning (GPQA), where larger models maintain an edge.

This suggests a potential specialization trade-off: while VibeThinker excels at structured logical tasks, it has less capacity for wide-ranging encyclopedic recall, a known limitation of smaller architectures.

Guidance for Enterprise Adoption

The release includes recommended inference settings (temperature = 0.6, top_p = 0.95, max tokens = 40960).

The model is small enough to be deployed on edge devices, including mobile phones and vehicle-embedded systems, while inference costs are estimated to be 20–70x cheaper than with large models.

This positions VibeThinker-1.5B not just as a research achievement, but as a potential foundation for cost-efficient, locally deployable reasoning systems.

Weibo’s Strategy and Market Position

Weibo, launched by Sina Corporation in 2009, remains a cornerstone of China’s social media ecosystem. Often described as China’s version of X (formerly Twitter), the platform blends microblogging, multimedia content, and trending-topic features with a regulatory environment shaped by tight government oversight.

Despite counting 600 million monthly active users (more than twice that of X), investors are not optimistic about its advertising revenue growth potential in the near term, and Weibo is navigating intensifying competition from video-first platforms like Douyin, which are drawing younger users and increasing time-spent elsewhere.

In response, Weibo has leaned into creator-economy monetization, live-streaming, and vertical video—adding tools for influencer engagement, e-commerce integration, and richer analytics for brands.

The platform’s role as a digital public square also makes it a focus of regulatory scrutiny. Chinese authorities continue to apply pressure on issues ranging from content governance to data security. In September 2025, Weibo was among the platforms cited in official warnings, highlighting its ongoing exposure to policy risks.

Weibo’s push into AI R&D—exemplified by the release of VibeThinker-1.5B—signals a shift in ambition. Beyond being a media platform, Weibo is positioning itself as a player in the next phase of Chinese AI development, using its capital reserves, user behavior data, and in-house research capacity to pursue adjacent technical domains.

What It Means for Enterprise Technical Decision Makers

For engineering leaders and enterprise AI teams, VibeThinker’s release has practical implications for everything from orchestration pipelines to cost modeling.

A 1.5B-parameter model that outperforms 100x larger models on math and programming tasks doesn’t just save compute—it shifts the architectural balance. It enables LLM inference on constrained infrastructure, reduces latency at the edge, and lowers the barrier to entry for applications that otherwise would have required API access to closed, frontier-scale models.

That matters for enterprise ML leads trying to deploy reasoning-capable agents within existing systems, or for platform owners tasked with integrating LLMs into automated workflows.

It also speaks to those running reinforcement learning from human feedback (RLHF) pipelines or managing inference optimization across hybrid cloud environments.

The model’s post-training methodology—particularly its entropy-targeted reinforcement learning approach—offers a roadmap for teams looking to refine smaller checkpoints instead of relying on large-scale pretraining.

VibeThinker’s benchmark transparency and data decontamination steps also address another emerging priority in enterprise AI: auditability. While its performance on general-knowledge tests still trails large frontier models, its task-specific reliability makes it an attractive candidate for controlled environments where correctness matters more than coverage.

In short, VibeThinker-1.5B isn’t just a research milestone—it’s a strong candidate for practical enterprise use, deployment and learnings. It suggests that a new class of compact, reasoning-optimized models is viable for enterprise use cases that were previously the domain of far larger systems. For organizations trying to balance cost, latency, interpretability, and control, it’s a good new option to the long, growing list of Chinese open source offerings.

Source_link