How Good Are AI Agents at Real Research? Inside the Deep Research Bench Report

by Josh
June 2, 2025
in AI, Analytics and Automation


As large language models (LLMs) rapidly evolve, so does their promise as powerful research assistants. Increasingly, they’re not just answering simple factual questions—they’re tackling “deep research” tasks, which involve multi-step reasoning, evaluating conflicting information, sourcing data from across the web, and synthesizing it into a coherent output.

This emerging capability is now being marketed under different brand names by major labs—OpenAI calls it “Deep Research”, Anthropic refers to it as “Extended Thinking”, Google’s Gemini offers “Search + Pro” features, and Perplexity labels theirs “Pro Search” or “Deep Research”. But how effective are these offerings in practice? A new report by FutureSearch, titled Deep Research Bench (DRB): Evaluating Web Research Agents, offers the most rigorous evaluation to date—and the results reveal both impressive capabilities and critical shortcomings.

What Is Deep Research Bench?

Created by the FutureSearch team, Deep Research Bench is a meticulously constructed benchmark designed to assess AI agents’ performance on multi-step, web-based research tasks. These aren’t simple questions with straightforward answers—they reflect the messy, open-ended challenges faced by analysts, policymakers, and researchers in real-world settings.

The benchmark includes 89 distinct tasks across 8 categories, such as:

  • Find Number: e.g. “How many FDA Class II medical device recalls occurred?”
  • Validate Claim: e.g. “Is ChatGPT 10x more energy-intensive than Google Search?”
  • Compile Dataset: e.g. “Job trends for US software developers from 2019–2023”

Each task type is carefully structured with human-verified answers and evaluated using a frozen dataset of scraped web pages, known as RetroSearch. This ensures consistency across model evaluations, avoiding the fluctuating state of the live web.
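
To make the setup concrete, here is a minimal sketch of what a DRB-style task record and scorer might look like. The field names and the exact-match scoring are illustrative assumptions on my part; the report does not publish its internal schema, and the real grading is category-specific against the human-verified answers.

    # Hypothetical sketch of a DRB-style task record. Field names are
    # illustrative assumptions, not the benchmark's published schema.
    from dataclasses import dataclass

    @dataclass
    class ResearchTask:
        task_id: str
        category: str      # e.g. "Find Number", "Validate Claim", "Compile Dataset"
        prompt: str        # the open-ended research question
        gold_answer: str   # human-verified reference answer
        snapshot_id: str   # which frozen RetroSearch archive the agent may query

    def score(agent_answer: str, task: ResearchTask) -> float:
        # Toy scorer: exact match only. The real benchmark grades each
        # category differently against its human-verified answer.
        return 1.0 if agent_answer.strip() == task.gold_answer.strip() else 0.0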

The Agent Architecture: ReAct and RetroSearch

At the heart of Deep Research Bench lies the ReAct architecture, short for “Reason + Act.” This method mimics how a human researcher might tackle a problem—by thinking through the task, taking an action like performing a web search, observing the results, and then deciding whether to iterate or conclude.

While earlier models follow this loop explicitly, newer “thinking” models often streamline the process, embedding reasoning more fluidly into their actions. To ensure consistency across evaluations, DRB introduces RetroSearch—a custom-built, static version of the web. Rather than relying on the live internet, which constantly changes, agents tap into a curated archive of web pages scraped using tools like Serper, Playwright, and ScraperAPI. The scale is impressive: for high-complexity tasks such as “Gather Evidence,” RetroSearch can provide access to over 189,000 pages, all frozen in time, ensuring a fair and replicable testing environment.
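
A stripped-down version of that loop is easy to sketch. The code below is a toy illustration, not the actual DRB harness: llm_step() and retro_search() are hypothetical stand-ins, one for a model call and one for a lookup against a frozen page archive rather than the live web.

    # Toy ReAct-style loop. llm_step() and retro_search() are hypothetical
    # stand-ins, not part of the actual DRB harness.
    def retro_search(query: str) -> str:
        # Stand-in for querying a frozen, pre-scraped page archive.
        archive = {"fda class ii recalls": "Archived page text about FDA recalls..."}
        return archive.get(query.lower(), "No archived page matched this query.")

    def llm_step(history: list) -> tuple:
        # Stand-in for a model call; a real agent would prompt an LLM here.
        if not any(line.startswith("Observation:") for line in history):
            return ("I should search the archive", "search", "FDA Class II recalls")
        return ("I have enough to answer", "finish", "Answer drawn from the archived page")

    def react_agent(question: str, max_steps: int = 10) -> str:
        history = [f"Task: {question}"]
        for _ in range(max_steps):
            thought, action, arg = llm_step(history)  # reason about the next move
            history.append(f"Thought: {thought}")
            if action == "search":
                observation = retro_search(arg)       # static snapshot, not the live web
                history.append(f"Observation: {observation}")
            elif action == "finish":
                return arg                            # the agent's final answer
        return "No answer within the step budget"

Freezing the search side is what makes runs repeatable: two agents issuing the same query always see the same pages, so score differences reflect the models rather than a shifting web.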

Which AI Agents Perform Best?

Among all the contenders, OpenAI’s o3 emerged as the top performer, scoring 0.51 out of a possible 1.0 on the Deep Research Bench. While that might sound modest, it’s important to understand the benchmark’s difficulty: due to ambiguity in task definitions and scoring, even a flawless agent would likely top out around 0.8—what researchers call the “noise ceiling.” In other words, even the best models today still fall short of well-informed, methodical human researchers.

Still, the leaderboard offers revealing insights. o3 not only led the pack but did so with speed and consistency, showing strong performance across nearly all task types. Claude 3.7 Sonnet from Anthropic followed closely, demonstrating versatility in both its “thinking” and “non-thinking” modes. Gemini 2.5 Pro, Google’s flagship model, stood out for its ability to handle tasks requiring structured planning and step-by-step reasoning. Meanwhile, the open-weight DeepSeek-R1 delivered a pleasant surprise—keeping pace with GPT-4 Turbo and narrowing the performance gap between open and closed models.

Across the board, a clear pattern emerged: newer, “thinking-enabled” models consistently outperformed their earlier counterparts, and closed-source models maintained a notable edge over open-weight alternatives.

Where Do Agents Struggle?

Reading through the failure patterns highlighted in the Deep Research Bench report felt surprisingly familiar. One of the most frustrating aspects I’ve personally encountered—especially during long research or content creation sessions—is when an AI agent simply forgets what we were doing. As the context window stretches, the model often begins to lose the thread: key details fade, goals get muddled, and suddenly, the responses feel disjointed or aimless. At some point, I’ve learned it’s often better to cut losses and start from scratch, even if it means throwing away everything that’s been generated so far.

That kind of forgetfulness isn’t just anecdotal—it’s the most significant predictor of failure in the Deep Research Bench evaluation. But it’s not the only recurring issue. The report also highlights how some models fall into repetitive tool use, running the same search over and over as if stuck in a loop. Others show poor query crafting, lazily keyword-matching instead of thinking critically about how to search effectively. And far too often, agents fall victim to premature conclusions—delivering a half-formed answer that technically checks the box but falls short of real insight.

Even among the top models, the differences are stark. GPT-4 Turbo, for example, showed a notable tendency to forget prior steps, while DeepSeek-R1 was more likely to hallucinate or invent plausible-sounding—but incorrect—information. Across the board, models frequently failed to cross-check sources or validate findings before finalizing their output. For anyone who’s relied on AI for serious work, these issues will feel all too familiar—and they underscore how far we still have to go in building agents that can truly think and research like humans.
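
None of these failure modes has a single cure, but the repetitive-search loop in particular is often handled with a simple guard. The sketch below is a generic mitigation of my own, offered for illustration; it is not something the DRB report prescribes.

    # Illustrative mitigation (not from the DRB report): refuse to rerun an
    # identical query, nudging the agent to reformulate instead of looping.
    def guarded_search(query: str, seen: set, search_fn) -> str:
        key = query.strip().lower()
        if key in seen:
            return "You already ran this exact search; try a different query."
        seen.add(key)
        return search_fn(query)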

What About Memory-Based Performance?

Interestingly, Deep Research Bench also evaluated what it calls “toolless” agents—language models operating without any access to external tools, such as web search or document retrieval. These agents rely entirely on their internal training data and memory, generating answers based solely on what they’ve previously learned during training. In practice, this means they can’t look anything up or verify information—they’re guessing based on what they “remember.”

Surprisingly, these toolless agents performed almost as well as full research agents on certain tasks. For example, on the Validate Claim task—where the goal is to assess the plausibility of a statement—they scored 0.61, nearly matching the 0.62 average of tool-enabled agents. This suggests that models like o3 and Claude have strong internal priors and can often recognize the truthfulness of common claims without needing to search the web.

But on more demanding tasks—like Derive Number, which requires piecing together multiple values from various sources, or Gather Evidence, which depends on finding and evaluating diverse facts in context—these toolless models completely fell apart. Without fresh information or real-time lookup capabilities, they simply lacked the means to produce accurate or comprehensive answers.

This contrast highlights an important nuance: while today’s LLMs can simulate “knowing” a lot, deep research depends not just on recall, but on reasoning with up-to-date, verifiable information—something only tool-augmented agents can truly deliver.
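
In terms of the toy loop sketched earlier, the toolless condition amounts to refusing every search request, so the model must answer from whatever it already "remembers." Again, this is an illustrative sketch under that assumption, not the benchmark's actual harness.

    # Sketch of the "toolless" condition: the same loop, but every search is
    # refused, so the model can only answer from its internal knowledge.
    def toolless_agent(question: str, max_steps: int = 10) -> str:
        history = [f"Task: {question}", "Note: no tools available; answer from memory."]
        for _ in range(max_steps):
            thought, action, arg = llm_step(history)  # reuses the stand-in from above
            history.append(f"Thought: {thought}")
            if action == "search":
                history.append("Observation: tool use disabled.")  # no lookup happens
            elif action == "finish":
                return arg
        return "No answer within the step budget"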

Final Thoughts

The DRB report makes one thing clear: while today’s best AI agents can outpace average humans on narrowly defined tasks, they still lag behind skilled generalist researchers—especially when it comes to planning strategically, adapting mid-process, and reasoning with nuance.

This gap becomes especially obvious during long or complex sessions—something I’ve experienced firsthand, where an agent gradually loses track of the task’s purpose, leading to a frustrating breakdown in coherence and utility.

What makes Deep Research Bench so valuable is that it doesn’t just test surface-level knowledge—it probes the intersection of tool use, memory, reasoning, and adaptation, offering a closer analog to real-world research than benchmarks like MMLU or GSM8K.

As LLMs continue to integrate into serious knowledge work, benchmarks like FutureSearch’s DRB will be essential for assessing not just what these systems know, but how well they actually work.
