Why AI search agents need better questions, not more searches

DiscoBench tests whether AI search agents can recognize ambiguity, ask useful follow-up questions, and recover during multi-step research. The results show that more searching can hurt when the model never turns uncertainty into a question for the user.


AI search agents are often judged by how well they retrieve information. A new benchmark from a team at Tencent Hunyuan and Tsinghua University points to a different weakness: agents can search cleanly and still fail because they do not ask for clarification when the task is unclear.

The benchmark, called DiscoBench, focuses on what happens inside longer research chains. It shows that ambiguity is not a small edge case. When a model chooses the wrong interpretation early, later searches can look reasonable while moving farther away from the answer the user actually needed.

DiscoBench tests the moment when an agent should stop and ask

DiscoBench was built to evaluate whether language models can notice ambiguity during deep search tasks, ask targeted follow-up questions, and adjust their research path. That makes it different from benchmarks such as GAIA or BrowseComp, which treat user prompts as complete and clear.

The benchmark contains 211 tasks and 463 ambiguous points across eleven knowledge domains. Those domains include video games, sports, music, film, science, and politics. Most of the dataset is written in Chinese to reflect common search behavior on the Chinese-language web.

Each task is divided into checkpoints. At every checkpoint, the agent can take one of three actions: continue searching, ask the user for clarification, or provide an answer. When the agent asks a useful follow-up question, an LLM-based user simulator supplies a predefined clue that narrows the task. The searches run through Tavily, while Gemini 3 Flash acts as the simulator.
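The checkpoint loop described above can be sketched roughly as follows. This is a minimal illustration, not the benchmark's actual harness: the names `Action`, `Checkpoint`, `Step`, `run_checkpoint`, and `is_useful`, the step limit, and the toy agent are all assumptions made for this example. The search function and the usefulness check are stand-ins for Tavily and the LLM-based simulator.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Callable, List


class Action(Enum):
    SEARCH = "search"
    ASK = "ask"
    ANSWER = "answer"


@dataclass
class Checkpoint:
    clue: str      # predefined clue released after a useful question
    expected: str  # reference answer for this checkpoint


@dataclass
class Step:
    action: Action
    text: str      # query, question, or answer, depending on the action


def run_checkpoint(
    checkpoint: Checkpoint,
    agent: Callable[[List[str]], Step],
    search: Callable[[str], str],
    is_useful: Callable[[str], bool],
    max_steps: int = 8,  # hypothetical step budget
) -> bool:
    """Drive one checkpoint until the agent answers or runs out of steps."""
    context: List[str] = []
    for _ in range(max_steps):
        step = agent(context)
        if step.action is Action.SEARCH:
            context.append(search(step.text))
        elif step.action is Action.ASK:
            # The simulator only releases the clue for a useful question.
            useful = is_useful(step.text)
            context.append(checkpoint.clue if useful else "no additional detail")
        else:
            return step.text == checkpoint.expected
    return False


def toy_agent(context: List[str]) -> Step:
    """Search once, ask once, then answer based on the clue."""
    if not context:
        return Step(Action.SEARCH, "best-selling game")
    if len(context) == 1:
        return Step(Action.ASK, "Which platform do you mean?")
    answer = "Game A" if "console" in context[-1] else "Game B"
    return Step(Action.ANSWER, answer)


passed = run_checkpoint(
    Checkpoint(clue="console only", expected="Game A"),
    toy_agent,
    search=lambda query: f"results for {query}",
    is_useful=lambda question: "platform" in question.lower(),
)
print(passed)
```

In this toy run the agent only reaches the expected answer because its question unlocks the clue. An agent that answers immediately has nothing to disambiguate with.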

Ambiguity comes in several forms

The researchers identify four types of ambiguity. A prompt may point to more than one entity. It may depend on a specific time period or version. It may leave the ranking or evaluation standard open. It may also contain a factual error.

These categories matter because they do not create the same kind of signal for a model. Factual errors are easier to catch because they can produce clear contradictions during research. Entity and criteria ambiguities are harder because several plausible answers can exist at the same time without an obvious conflict.

That is why a search agent needs more than retrieval. It must know when its current evidence is not enough to disambiguate the request. It also has to form a question that will actually move the search forward, rather than asking something broad or irrelevant.

Top models still struggle to finish the full task

The team tested eleven models released in the past six months: Claude Opus 4.7, GPT 5.4, Gemini 3.1 Pro Preview, Doubao Seed 2.0 Pro, DeepSeek V4 Pro, Kimi K2.6, GLM 5.1, Qwen3.6 Max, MiniMax M2.7, MiMo v2.5 Pro, and Hunyuan 3.0 Preview.

Without a prompt that explicitly warned about ambiguity, Doubao Seed 2.0 Pro reached the strongest end-to-end result at 43.1 percent. Gemini 3.1 Pro followed at 40.8 percent, and Claude Opus 4.7 reached 39.8 percent. MiniMax M2.7 and Qwen3.6 Max were much lower, at 16.1 and 12.3 percent.

Performance on individual checkpoints overstates full-task success. Claude Opus 4.7 solves 57 percent of checkpoints correctly, but its end-to-end accuracy is only 39.8 percent. In a multi-step search process, one unresolved ambiguity can undermine the entire chain.
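A small sketch shows how the two numbers can diverge. It assumes, for illustration only, that a task counts as solved end to end only when every one of its checkpoints is passed; the article does not spell out the exact scoring rule, and the toy data below is invented.

```python
def checkpoint_accuracy(tasks):
    """Share of all checkpoints passed, pooled across tasks."""
    outcomes = [ok for task in tasks for ok in task]
    return sum(outcomes) / len(outcomes)


def end_to_end_accuracy(tasks):
    """Share of tasks in which every checkpoint was passed (assumed rule)."""
    return sum(all(task) for task in tasks) / len(tasks)


# Invented example: each inner list holds the checkpoint results of one task.
tasks = [
    [True, True],
    [True, False],
    [True, True, False],
    [True],
]

print(checkpoint_accuracy(tasks))  # 0.75
print(end_to_end_accuracy(tasks))  # 0.5
```

Six of eight checkpoints pass, yet only two of four tasks survive intact, because a single failed checkpoint sinks its whole task.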

The researchers also tested a “Guided” mode, where the system prompt tells the agent to watch for ambiguity and ask a follow-up question when uncertain. Across ten models, end-to-end accuracy increased from 28.6 to 33.7 percent. Detection F1 rose more sharply, from 45.3 to 64.9 percent.
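For readers unfamiliar with the metric, F1 is the harmonic mean of precision and recall. The sketch below assumes one plausible reading of "detection": asking at an ambiguous checkpoint is a true positive, asking at a clear one is a false positive, and staying silent at an ambiguous one is a false negative. The paper's exact definition may differ, and the records are invented.

```python
def detection_f1(records):
    """F1 over (is_ambiguous, agent_asked) pairs, one pair per checkpoint."""
    tp = sum(1 for ambiguous, asked in records if ambiguous and asked)
    fp = sum(1 for ambiguous, asked in records if not ambiguous and asked)
    fn = sum(1 for ambiguous, asked in records if ambiguous and not asked)
    denominator = 2 * tp + fp + fn
    # Equivalent to 2 * precision * recall / (precision + recall).
    return 2 * tp / denominator if denominator else 0.0


# Invented example: two correct detections, one miss, one needless question.
records = [
    (True, True),
    (True, True),
    (True, False),
    (False, True),
    (False, False),
]

print(round(detection_f1(records), 3))  # 0.667
```

Under this reading, a model can raise its F1 simply by asking more often at ambiguous checkpoints, which does not by itself guarantee that the answers it then gives are correct.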

That result suggests the warning helped models notice unclear prompts, but did not reliably help them complete the research. For Claude Opus 4.7, end-to-end accuracy even fell slightly under the guided prompt, despite a higher checkpoint pass rate.

More searches can make the answer worse

The behavioral analysis shows why tool use alone is not enough. Agents that searched first and then asked a follow-up question, labeled “SearchThenAsk,” averaged a 93.4 percent success rate at ambiguous checkpoints. Agents that guessed without asking, labeled “DirectGuess,” fell to 56.5 percent.

The weakest pattern was repeated searching followed by a guess. This “SearchHeavyGuess” behavior averaged 51.9 percent. According to the authors, repeated search can indicate that the model has noticed uncertainty, but has not converted that uncertainty into a user interaction.
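The three behavior labels can be illustrated with a simple rule over an agent's action trace. This is a sketch under stated assumptions: the article does not give the authors' classification rules, so the threshold of three searches and the extra "AskFirst" label are hypothetical choices made for this example.

```python
SEARCH_HEAVY_THRESHOLD = 3  # hypothetical cutoff, not taken from the paper


def classify_behavior(actions):
    """Label a checkpoint trace given as a list of 'search', 'ask', 'answer'."""
    if "ask" in actions:
        first_ask = actions.index("ask")
        if "search" in actions[:first_ask]:
            return "SearchThenAsk"
        return "AskFirst"  # hypothetical label: asked before any search
    if actions.count("search") >= SEARCH_HEAVY_THRESHOLD:
        return "SearchHeavyGuess"
    return "DirectGuess"


print(classify_behavior(["search", "ask", "answer"]))
print(classify_behavior(["search", "search", "search", "answer"]))
print(classify_behavior(["search", "answer"]))
```

The point of the distinction is that the second trace spends the most tool calls yet never involves the user, which matches the pattern the authors found weakest.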

This helps explain why more tool calls do not automatically lead to better research. Claude Opus 4.7 searches more often than most other tested models, yet trails Gemini 3.1 Pro and Doubao Seed 2.0 Pro in accuracy. The missing step is not another query; it is the right clarifying question.

Future search agents need interaction, not just retrieval

The benchmark also shows that stored model knowledge is not enough. Without search tools, Doubao Seed 2.0 Pro drops from 43.1 to 2.4 percent, while Gemini 3.1 Pro drops from 40.8 to 19.9 percent. DiscoBench depends on active search, but search alone still does not solve the ambiguity problem.

When ambiguity is removed from the questions, model accuracy rises by 26.8 to 40.2 percentage points, depending on the model. That makes the practical lesson straightforward: the next generation of AI search agents needs mechanisms that turn uncertainty into interaction with the user.

Other recent work points in the same direction. On LiveBrowseComp, where facts sit beyond the knowledge cutoff, all systems dropped by 25 to 40 percentage points. Halluhard showed that Claude Opus 4.5 with web search hallucinates in about 30 percent of cases, mainly while checking cited source content.

Different labs are already trying related approaches. Claude Opus 4.8 is reported to flag uncertainties more often and to leave bugs in its own code unflagged about a quarter as often as its predecessor. Perplexity is testing Search as Code, which lets models write search workflows as Python programs instead of using a prebuilt API.

DiscoBench makes the central issue plain: a capable research agent must know when the available evidence is insufficient. The most useful system is not the one that searches the most, but the one that recognizes when a human answer is needed before the next search begins.