Initial reactions to OpenAI’s landmark open source gpt-oss models are sharply mixed

OpenAI’s long-awaited return to the “open” of its namesake occurred yesterday with the release of two new large language models (LLMs): gpt-oss-120B and gpt-oss-20B.

But despite posting benchmark scores on par with OpenAI’s other powerful proprietary AI model offerings, the broader AI developer and user community’s initial response has so far been all over the map. If this release were a movie graded on Rotten Tomatoes, we’d be looking at a near 50% split, based on my observations.

First some background: OpenAI has released these two new text-only language models (no image generation or analysis) both under the permissive open source Apache 2.0 license — the first time since 2019 (before ChatGPT) that the company has done so with a cutting-edge language model.

The entire ChatGPT era of the last 2.7 years has so far been powered by proprietary or closed-source models, ones that OpenAI controlled and that users had to pay to access (or use a free tier subject to limits), with limited customizability and no way to run them offline or on private computing hardware.

But that all changed yesterday with the release of the pair of gpt-oss models: a larger, more powerful one that runs on a single Nvidia H100 GPU at, say, a small or medium-sized enterprise’s data center or server farm, and a smaller one that runs on a single consumer laptop or desktop PC like the kind in your home office.

Of course, the models being so new, it’s taken several hours for the AI power user community to independently run and test them out on their own individual benchmarks (measurements) and tasks.

And now we’re getting a wave of feedback ranging from optimistic enthusiasm about the potential of these powerful, free, and efficient new models to an undercurrent of dissatisfaction and dismay with what some users see as significant problems and limitations. The comparison that stings most is with the wave of similarly Apache 2.0-licensed, powerful open source, multimodal LLMs from Chinese startups, which can also be taken, customized, and run locally on U.S. hardware for free by U.S. companies, or by companies anywhere else around the world.

High benchmarks, but still behind Chinese open source leaders

Intelligence benchmarks place the gpt-oss models ahead of most American open-source offerings. According to independent third-party AI benchmarking firm Artificial Analysis, gpt-oss-120B is “the most intelligent American open weights model,” though it still falls short of Chinese heavyweights like DeepSeek R1 and Qwen3 235B.

“On reflection, that’s all they did. Mogged on benchmarks,” wrote self-proclaimed DeepSeek “stan” @teortaxesTex. “No good derivative models will be trained… No new usecases created… Barren claim to bragging rights.”

That skepticism is echoed by pseudonymous open source AI researcher Teknium (@Teknium1), co-founder of rival open source AI model provider Nous Research, who called the release “a legitimate nothing burger,” on X, and predicted a Chinese model will soon eclipse it. “Overall very disappointed and I legitimately came open minded to this,” they wrote.

Bench-maxxing on math and coding at the expense of writing?

Other criticism focused on the gpt-oss models’ apparent narrow usefulness.

AI influencer “Lisan al Gaib” (@scaling01) noted that the models excel at math and coding but “completely lack taste and common sense.” He added, “So it’s just a math model?”

In creative writing tests, some users found the model injecting equations into poetic outputs. “This is what happens when you benchmarkmax,” Teknium remarked, sharing a screenshot where the model added an integral formula mid-poem.

And @kalomaze, a researcher at decentralized AI model training company Prime Intellect, wrote that “gpt-oss-120b knows less about the world than what a good 32b does. probably wanted to avoid copyright issues so they likely pretrained on majority synth. pretty devastating stuff”

Former Googler and independent AI developer Kyle Corbitt agreed that the gpt-oss pair of models seemed to have been trained primarily on synthetic data — that is, data generated by an AI model specifically for the purposes of training a new one — making it “extremely spiky.”

It’s “great at the tasks it’s trained on, really bad at everything else,” Corbitt wrote, i.e., great on coding and math problems, and bad at more linguistic tasks like creative writing or report generation.

In other words, the charge is that OpenAI deliberately trained the model on more synthetic data than real world facts and figures to avoid using copyrighted data scraped from websites and other repositories it doesn’t own or have license to use. That is something it and many other leading gen AI companies have been accused of in the past, and over which they face ongoing lawsuits.

Others speculated OpenAI may have trained the model on primarily synthetic data to avoid safety and security issues, resulting in worse quality than if it had been trained on more real world (and presumably copyrighted) data.

Concerning third-party benchmark results

Moreover, evaluations of the models on third-party benchmarks have turned up metrics that some users find concerning.

SpeechMap, which measures how readily LLMs comply with user prompts to generate disallowed, biased, or politically sensitive outputs, showed compliance scores for gpt-oss-120B hovering under 40%, near the bottom of peer open models. That indicates a resistance to following user requests and a tendency to default to guardrails, potentially at the expense of providing accurate information.

In Aider’s Polyglot evaluation, gpt-oss-120B scored just 41.8% on coding tasks spanning multiple programming languages, far below competitors like Kimi-K2 (59.1%) and DeepSeek-R1 (56.9%).

Some users also said their tests indicated the model is oddly resistant to generating criticism of China or Russia, a contrast to its treatment of the US and EU, raising questions about bias and training data filtering.

Other experts have applauded the release and what it signals for U.S. open source AI

To be fair, not all the commentary is negative. Software engineer and close AI watcher Simon Willison called the release “really impressive” on X, elaborating in a blog post on the models’ efficiency and ability to achieve parity with OpenAI’s proprietary o3-mini and o4-mini models.

He praised their strong performance on reasoning and STEM-heavy benchmarks, and hailed the new “Harmony” prompt template format — which offers developers more structured terms for guiding model responses — and support for third-party tool use as meaningful contributions.

In a lengthy X post, Clem Delangue, CEO and co-founder of AI code sharing and open source community Hugging Face, encouraged users not to rush to judgment, pointing out that inference for these models is complex, and early issues could be due to infrastructure instability and insufficient optimization among hosting providers.

“The power of open-source is that there’s no cheating,” Delangue wrote. “We’ll uncover all the strengths and limitations… progressively.”

Even more cautious was Ethan Mollick, a professor at the Wharton School of the University of Pennsylvania, who wrote on X that “The US now likely has the leading open weights models (or close to it),” but questioned whether this is a one-off by OpenAI. “The lead will evaporate quickly as others catch up,” he noted, adding that it’s unclear what incentives OpenAI has to keep the models updated.

Nathan Lambert, a leading AI researcher and commentator at the rival open source lab Allen Institute for AI (Ai2), praised the symbolic significance of the release on his blog Interconnects, calling it “a phenomenal step for the open ecosystem, especially for the West and its allies, that the most known brand in the AI space has returned to openly releasing models.”

But he cautioned on X that gpt-oss is “unlikely to meaningfully slow down [Chinese e-commerce giant Alibaba’s AI team] Qwen,” citing its usability, performance, and variety.

He argued the release marks an important shift in the U.S. toward open models, but that OpenAI still has a “long path back” to catch up in practice.

A split verdict

The verdict, for now, is split.

OpenAI’s gpt-oss models are a landmark in terms of licensing and accessibility.

But while the benchmarks look solid, the real-world “vibes,” as many users describe them, are proving less compelling.

Whether developers can build strong applications and derivatives on top of gpt-oss will determine whether the release is remembered as a breakthrough or a blip.
