How Good Are AI Agents at Real Research? Inside the Deep Research Bench Report

by Josh
June 2, 2025


As large language models (LLMs) rapidly evolve, so does their promise as powerful research assistants. Increasingly, they’re not just answering simple factual questions—they’re tackling “deep research” tasks, which involve multi-step reasoning, evaluating conflicting information, sourcing data from across the web, and synthesizing it into a coherent output.

This emerging capability is now being marketed under different brand names by major labs—OpenAI calls it “Deep Research”, Anthropic refers to it as “Extended Thinking”, Google’s Gemini offers “Search + Pro” features, and Perplexity labels theirs “Pro Search” or “Deep Research”. But how effective are these offerings in practice? A new report by FutureSearch, titled Deep Research Bench (DRB): Evaluating Web Research Agents, offers the most rigorous evaluation to date—and the results reveal both impressive capabilities and critical shortcomings.

What Is Deep Research Bench?

Created by the FutureSearch team, Deep Research Bench is a meticulously constructed benchmark designed to assess AI agents’ performance on multi-step, web-based research tasks. These aren’t simple questions with straightforward answers—they reflect the messy, open-ended challenges faced by analysts, policymakers, and researchers in real-world settings.

The benchmark includes 89 distinct tasks across 8 categories such as:

  • Find Number: e.g. “How many FDA Class II medical device recalls occurred?”
  • Validate Claim: e.g. “Is ChatGPT 10x more energy-intensive than Google Search?”
  • Compile Dataset: e.g. “Job trends for US software developers from 2019–2023”

Each task type is carefully structured with human-verified answers and evaluated using a frozen dataset of scraped web pages, known as RetroSearch. This ensures consistency across model evaluations, avoiding the fluctuating state of the live web.

The Agent Architecture: ReAct and RetroSearch

At the heart of Deep Research Bench lies the ReAct architecture, short for “Reason + Act.” This method mimics how a human researcher might tackle a problem—by thinking through the task, taking an action like performing a web search, observing the results, and then deciding whether to iterate or conclude.

While earlier models follow this loop explicitly, newer “thinking” models often streamline the process, embedding reasoning more fluidly into their actions. To ensure consistency across evaluations, DRB introduces RetroSearch—a custom-built, static version of the web. Rather than relying on the live internet, which constantly changes, agents tap into a curated archive of web pages scraped using tools like Serper, Playwright, and ScraperAPI. The scale is impressive: for high-complexity tasks such as “Gather Evidence,” RetroSearch can provide access to over 189,000 pages, all frozen in time, ensuring a fair and replicable testing environment.
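The ReAct loop described above can be sketched in a few lines. This is a minimal illustration, not the DRB implementation: `llm()` and `retro_search()` are hypothetical stand-ins for a real model API and the frozen RetroSearch archive, here stubbed so the loop is runnable end to end.

```python
def retro_search(query: str) -> str:
    """Stand-in for a lookup over a static, pre-scraped web archive."""
    archive = {"fda class ii recalls": "Example page text: 1,234 recalls listed."}
    return archive.get(query.lower(), "No page found in archive.")

def llm(prompt: str) -> str:
    """Stand-in for an LLM call; a real agent would query a model API here."""
    if "Observation:" not in prompt:
        return "Thought: I need data.\nAction: search[fda class ii recalls]"
    return "Thought: The page answers the question.\nFinal Answer: 1,234"

def react_agent(task: str, max_steps: int = 5) -> str:
    """Reason + Act: think, search, observe, repeat until a final answer."""
    prompt = f"Task: {task}"
    for _ in range(max_steps):
        step = llm(prompt)
        prompt += "\n" + step
        if "Final Answer:" in step:
            return step.split("Final Answer:")[-1].strip()
        if "Action: search[" in step:
            query = step.split("Action: search[")[-1].rstrip("]")
            prompt += f"\nObservation: {retro_search(query)}"
    return "No answer within step budget."
```

The key design point is visible even in this toy version: because the search backend is a frozen dictionary rather than the live web, every run sees identical pages, which is exactly what makes RetroSearch-based evaluations replicable.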

Which AI Agents Perform Best?

Among all the contenders, OpenAI’s o3 emerged as the top performer, scoring 0.51 out of a possible 1.0 on the Deep Research Bench. While that might sound modest, it’s important to understand the benchmark’s difficulty: due to ambiguity in task definitions and scoring, even a flawless agent would likely top out around 0.8—what researchers call the “noise ceiling.” In other words, even the best models today still fall short of well-informed, methodical human researchers.

Still, the leaderboard offers revealing insights. o3 not only led the pack but did so with speed and consistency, showing strong performance across nearly all task types. Claude 3.7 Sonnet from Anthropic followed closely, demonstrating versatility in both its “thinking” and “non-thinking” modes. Gemini 2.5 Pro, Google’s flagship model, stood out for its ability to handle tasks requiring structured planning and step-by-step reasoning. Meanwhile, the open-weight DeepSeek-R1 delivered a pleasant surprise—keeping pace with GPT-4 Turbo and narrowing the performance gap between open and closed models.

Across the board, a clear pattern emerged: newer, “thinking-enabled” models consistently outperformed their earlier counterparts, and closed-source models maintained a notable edge over open-weight alternatives.

Where Do Agents Struggle?

Reading through the failure patterns highlighted in the Deep Research Bench report felt surprisingly familiar. One of the most frustrating aspects I’ve personally encountered—especially during long research or content creation sessions—is when an AI agent simply forgets what we were doing. As the context window stretches, the model often begins to lose the thread: key details fade, goals get muddled, and suddenly, the responses feel disjointed or aimless. At some point, I’ve learned it’s often better to cut losses and start from scratch, even if it means throwing away everything that’s been generated so far.

That kind of forgetfulness isn’t just anecdotal—it’s the most significant predictor of failure in the Deep Research Bench evaluation. But it’s not the only recurring issue. The report also highlights how some models fall into repetitive tool use, running the same search over and over as if stuck in a loop. Others show poor query crafting, lazily keyword-matching instead of thinking critically about how to search effectively. And far too often, agents fall victim to premature conclusions—delivering a half-formed answer that technically checks the box but falls short of real insight.
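The repetitive-tool-use failure mode described above is mechanically easy to detect from outside the model. Below is a hedged sketch of one possible guard, assuming the harness can intercept each search call; the class and its interface are illustrative, not part of the DRB codebase.

```python
from collections import Counter

class SearchGuard:
    """Flags an agent that re-issues the same query too many times,
    a simple external check for the 'stuck in a loop' failure mode."""

    def __init__(self, max_repeats: int = 2):
        self.counts = Counter()
        self.max_repeats = max_repeats

    def allow(self, query: str) -> bool:
        # Count every attempt; refuse once the repeat budget is exhausted.
        self.counts[query] += 1
        return self.counts[query] <= self.max_repeats
```

A harness could use a refusal from `allow()` to inject a hint ("you already ran this search") or to terminate the run, rather than letting the agent burn its step budget on identical queries.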

Even among the top models, the differences are stark. GPT-4 Turbo, for example, showed a notable tendency to forget prior steps, while DeepSeek-R1 was more likely to hallucinate or invent plausible-sounding—but incorrect—information. Across the board, models frequently failed to cross-check sources or validate findings before finalizing their output. For anyone who’s relied on AI for serious work, these issues will feel all too familiar—and they underscore how far we still have to go in building agents that can truly think and research like humans.

What About Memory-Based Performance?

Interestingly, Deep Research Bench also evaluated what it calls “toolless” agents—language models operating without any access to external tools, such as web search or document retrieval. These agents rely entirely on their internal training data and memory, generating answers based solely on what they’ve previously learned during training. In practice, this means they can’t look anything up or verify information—they’re guessing based on what they “remember.”

Surprisingly, these toolless agents performed almost as well as full research agents on certain tasks. For example, on the Validate Claim task—where the goal is to assess the plausibility of a statement—they scored 0.61, nearly matching the 0.62 average of tool-enabled agents. This suggests that models like o3 and Claude have strong internal priors and can often recognize the truthfulness of common claims without needing to search the web.

But on more demanding tasks—like Derive Number, which requires piecing together multiple values from various sources, or Gather Evidence, which depends on finding and evaluating diverse facts in context—these toolless models completely fell apart. Without fresh information or real-time lookup capabilities, they simply lacked the means to produce accurate or comprehensive answers.

This contrast highlights an important nuance: while today’s LLMs can simulate “knowing” a lot, deep research depends not just on recall, but on reasoning with up-to-date, verifiable information—something only tool-augmented agents can truly deliver.

Final Thoughts

The DRB report makes one thing clear: while today’s best AI agents can outpace average humans on narrowly defined tasks, they still lag behind skilled generalist researchers—especially when it comes to planning strategically, adapting mid-process, and reasoning with nuance.

This gap becomes especially obvious during long or complex sessions—something I’ve experienced firsthand, where an agent gradually loses track of the task’s purpose, leading to a frustrating breakdown in coherence and utility.

What makes Deep Research Bench so valuable is that it doesn’t just test surface-level knowledge—it probes the intersection of tool use, memory, reasoning, and adaptation, offering a closer analog to real-world research than benchmarks like MMLU or GSM8k.

As LLMs continue to integrate into serious knowledge work, benchmarks like FutureSearch’s DRB will be essential for assessing not just what these systems know, but how well they actually work.
