
How Good Are AI Agents at Real Research? Inside the Deep Research Bench Report


As large language models (LLMs) rapidly evolve, so does their promise as powerful research assistants. Increasingly, they’re not just answering simple factual questions—they’re tackling “deep research” tasks, which involve multi-step reasoning, evaluating conflicting information, sourcing data from across the web, and synthesizing it into a coherent output.

This emerging capability is now being marketed under different brand names by major labs—OpenAI calls it “Deep Research”, Anthropic refers to it as “Extended Thinking”, Google’s Gemini offers “Search + Pro” features, and Perplexity labels theirs “Pro Search” or “Deep Research”. But how effective are these offerings in practice? A new report by FutureSearch, titled Deep Research Bench (DRB): Evaluating Web Research Agents, offers the most rigorous evaluation to date—and the results reveal both impressive capabilities and critical shortcomings.

What Is Deep Research Bench?

Created by the FutureSearch team, Deep Research Bench is a meticulously constructed benchmark designed to assess AI agents' performance on multi-step, web-based research tasks. These aren't simple questions with straightforward answers—they reflect the messy, open-ended challenges faced by analysts, policymakers, and researchers in real-world settings.

The benchmark includes 89 distinct tasks across 8 categories such as:

  • Find Number: e.g. “How many FDA Class II medical device recalls occurred?”
  • Validate Claim: e.g. “Is ChatGPT 10x more energy-intensive than Google Search?”
  • Compile Dataset: e.g. “Job trends for US software developers from 2019–2023”

Each task type is carefully structured with human-verified answers and evaluated using a frozen dataset of scraped web pages, known as RetroSearch. This ensures consistency across model evaluations, avoiding the fluctuating state of the live web.
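
To make that setup concrete, here is a minimal sketch (in Python) of how a benchmark task with a human-verified answer might be represented and scored against a frozen snapshot. The field names, the `snapshot_id` concept, and the exact-match scoring rule are illustrative assumptions, not the schema or grading logic FutureSearch actually uses.

```python
from dataclasses import dataclass

@dataclass
class DRBTask:
    """Illustrative stand-in for a Deep Research Bench task record (assumed fields)."""
    category: str          # e.g. "Find Number", "Validate Claim", "Compile Dataset"
    prompt: str            # the open-ended research question given to the agent
    reference_answer: str  # human-verified answer used for scoring
    snapshot_id: str       # which frozen RetroSearch archive the agent may query

def score_answer(task: DRBTask, agent_answer: str) -> float:
    """Toy scoring rule: 1.0 for an exact (case-insensitive) match, else 0.0.

    The real benchmark uses more nuanced, task-specific grading, which is one
    reason even a flawless agent only reaches the "noise ceiling" discussed later.
    """
    match = agent_answer.strip().lower() == task.reference_answer.strip().lower()
    return 1.0 if match else 0.0

# Example usage with a made-up reference answer for one of the sample tasks above.
task = DRBTask(
    category="Validate Claim",
    prompt="Is ChatGPT 10x more energy-intensive than Google Search?",
    reference_answer="unsupported",
    snapshot_id="retrosearch-2024-01",
)
print(score_answer(task, "unsupported"))  # -> 1.0
```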

The Agent Architecture: ReAct and RetroSearch

At the heart of Deep Research Bench lies the ReAct architecture, short for “Reason + Act.” This method mimics how a human researcher might tackle a problem—by thinking through the task, taking an action like performing a web search, observing the results, and then deciding whether to iterate or conclude.

While earlier models follow this loop explicitly, newer “thinking” models often streamline the process, embedding reasoning more fluidly into their actions. To ensure consistency across evaluations, DRB introduces RetroSearch—a custom-built, static version of the web. Rather than relying on the live internet, which constantly changes, agents tap into a curated archive of web pages scraped using tools such as Serper, Playwright, and ScraperAPI. The scale is impressive: for high-complexity tasks such as “Gather Evidence,” RetroSearch can provide access to over 189,000 pages, all frozen in time, ensuring a fair and replicable testing environment.
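
To illustrate the pattern, here is a minimal ReAct-style loop in Python that reasons, searches a frozen archive, observes the results, and repeats until it commits to an answer. The `call_llm` and `retro_search` functions are hypothetical placeholders standing in for a model API and a RetroSearch-like lookup; this is a sketch of the general architecture, not FutureSearch’s implementation.

```python
def call_llm(prompt: str) -> str:
    """Placeholder for a chat-completion call to the model under test."""
    raise NotImplementedError

def retro_search(query: str, snapshot_id: str) -> list[str]:
    """Placeholder: return page excerpts from a frozen archive, never the live web."""
    raise NotImplementedError

def react_agent(question: str, snapshot_id: str, max_steps: int = 10) -> str:
    """Reason + Act loop: think, optionally search the static archive, then decide."""
    transcript = f"Question: {question}\n"
    for _ in range(max_steps):
        # Reason: ask the model for its next move given everything seen so far.
        step = call_llm(
            transcript
            + "Think step by step, then output either 'SEARCH: <query>' or 'ANSWER: <final answer>'."
        )
        transcript += step + "\n"
        if step.startswith("ANSWER:"):
            return step.removeprefix("ANSWER:").strip()
        if step.startswith("SEARCH:"):
            # Act + Observe: query the frozen snapshot and append the results.
            results = retro_search(step.removeprefix("SEARCH:").strip(), snapshot_id)
            transcript += "Observation: " + " | ".join(results[:3]) + "\n"
    return "No answer produced within the step budget."
```

The key design choice in the benchmark itself is that every “Act” step hits the same frozen page set, so two models answering the same task always see the same web.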

Which AI Agents Perform Best?

Among all the contenders, OpenAI’s o3 emerged as the top performer, scoring 0.51 out of a possible 1.0 on the Deep Research Bench. While that might sound modest, it’s important to understand the benchmark’s difficulty: due to ambiguity in task definitions and scoring, even a flawless agent would likely top out around 0.8—what researchers call the “noise ceiling.” In other words, even the best models today still fall short of well-informed, methodical human researchers.

Still, the leaderboard offers revealing insights. o3 not only led the pack but did so with speed and consistency, showing strong performance across nearly all task types. Claude 3.7 Sonnet from Anthropic followed closely, demonstrating versatility in both its “thinking” and “non-thinking” modes. Gemini 2.5 Pro, Google’s flagship model, stood out for its ability to handle tasks requiring structured planning and step-by-step reasoning. Meanwhile, the open-weight DeepSeek-R1 delivered a pleasant surprise—keeping pace with GPT-4 Turbo and narrowing the performance gap between open and closed models.

Across the board, a clear pattern emerged: newer, “thinking-enabled” models consistently outperformed their earlier counterparts, and closed-source models maintained a notable edge over open-weight alternatives.

Where Do Agents Struggle?

Reading through the failure patterns highlighted in the Deep Research Bench report felt surprisingly familiar. One of the most frustrating aspects I’ve personally encountered—especially during long research or content creation sessions—is when an AI agent simply forgets what we were doing. As the context window stretches, the model often begins to lose the thread: key details fade, goals get muddled, and suddenly, the responses feel disjointed or aimless. At some point, I’ve learned it’s often better to cut losses and start from scratch, even if it means throwing away everything that’s been generated so far.

That kind of forgetfulness isn’t just anecdotal—it’s the most significant predictor of failure in the Deep Research Bench evaluation. But it’s not the only recurring issue. The report also highlights how some models fall into repetitive tool use, running the same search over and over as if stuck in a loop. Others show poor query crafting, lazily keyword-matching instead of thinking critically about how to search effectively. And far too often, agents fall victim to premature conclusions—delivering a half-formed answer that technically checks the box but falls short of real insight.

Even among the top models, the differences are stark. GPT-4 Turbo, for example, showed a notable tendency to forget prior steps, while DeepSeek-R1 was more likely to hallucinate or invent plausible-sounding—but incorrect—information. Across the board, models frequently failed to cross-check sources or validate findings before finalizing their output. For anyone who’s relied on AI for serious work, these issues will feel all too familiar—and they underscore how far we still have to go in building agents that can truly think and research like humans.

What About Memory-Based Performance?

Interestingly, Deep Research Bench also evaluated what it calls “toolless” agents—language models operating without any access to external tools, such as web search or document retrieval. These agents rely entirely on their internal training data and memory, generating answers based solely on what they've previously learned during training. In practice, this means they can’t look anything up or verify information—they’re guessing based on what they “remember.”

Surprisingly, these toolless agents performed almost as well as full research agents on certain tasks. For example, on the Validate Claim task—where the goal is to assess the plausibility of a statement—they scored 0.61, nearly matching the 0.62 average of tool-enabled agents. This suggests that models like o3 and Claude have strong internal priors and can often recognize the truthfulness of common claims without needing to search the web.

But on more demanding tasks—like Derive Number, which requires piecing together multiple values from various sources, or Gather Evidence, which depends on finding and evaluating diverse facts in context—these toolless models completely fell apart. Without fresh information or real-time lookup capabilities, they simply lacked the means to produce accurate or comprehensive answers.

This contrast highlights an important nuance: while today’s LLMs can simulate “knowing” a lot, deep research depends not just on recall, but on reasoning with up-to-date, verifiable information—something only tool-augmented agents can truly deliver.

Conclusion

The DRB report makes one thing clear: while today’s best AI agents can outpace average humans on narrowly defined tasks, they still lag behind skilled generalist researchers—especially when it comes to planning strategically, adapting mid-process, and reasoning with nuance.

This gap becomes especially obvious during long or complex sessions—something I’ve experienced firsthand, where an agent gradually loses track of the task’s purpose, leading to a frustrating breakdown in coherence and utility.

What makes Deep Research Bench so valuable is that it doesn’t just test surface-level knowledge—it probes the intersection of tool use, memory, reasoning, and adaptation, offering a closer analog to real-world research than benchmarks like MMLU or GSM8k.

As LLMs continue to integrate into serious knowledge work, tools like FutureSearch’s DRB will be essential for assessing not just what these systems know, but how well they actually work.

Antoine is a visionary leader and founding partner of Unite.AI, driven by an unwavering passion for shaping and promoting the future of AI and robotics. A serial entrepreneur, he believes that AI will be as disruptive to society as electricity, and is often caught raving about the potential of disruptive technologies and AGI.

As a futurist, he is dedicated to exploring how these innovations will shape our world. In addition, he is the founder of Securities.io, a platform focused on investing in cutting-edge technologies that are redefining the future and reshaping entire sectors.