Ars Technica AI February 5, 2025

How a 24-hour build pushed open source AI research agents forward

Hugging Face released Open Deep Research, an open source AI research agent built as a 24-hour challenge after OpenAI launched Deep Research. The project does not yet match OpenAI’s benchmark performance, but it shows how much an agent framework can improve an AI model’s ability to gather and synthesize information.

Hugging Face has turned the race to build AI research agents into a public experiment. After OpenAI launched its Deep Research feature, an in-house Hugging Face team created and released an open source alternative called Open Deep Research, aiming to show how much of the capability comes from the surrounding agent framework rather than the model alone.

What Hugging Face Built

Open Deep Research is an AI research agent designed to browse the web, collect information, and assemble research reports through a multi-step process. Hugging Face created it as a 24-hour challenge after the launch of OpenAI’s Deep Research feature.

The goal was not only to compete with OpenAI’s product, but also to make the underlying approach available to developers. Hugging Face wrote on its announcement page: “While powerful LLMs are now freely available in open-source, OpenAI didn’t disclose much about the agentic framework underlying Deep Research,” adding, “So we decided to embark on a 24-hour mission to reproduce their results and open-source the needed framework along the way!”

The project sits in the same broad category as OpenAI’s Deep Research and Google’s Gemini-based Deep Research, which was introduced in December 2024, before OpenAI’s version. In each case, the key idea is to place an agent layer around an AI model so it can plan, search, inspect sources, and build an answer over several steps.

Why The Benchmark Result Matters

Hugging Face’s early result was notable because the system reached 55.15 percent accuracy on the General AI Assistants (GAIA) benchmark after only a day’s work. GAIA tests whether an AI model can gather and synthesize information from multiple sources.

OpenAI’s Deep Research scored 67.36 percent accuracy on the same benchmark with a single-pass response. OpenAI’s score rose to 72.57 percent when 64 responses were combined using a consensus mechanism.
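A consensus mechanism of this kind can be pictured as majority voting over independently sampled answers. The sketch below illustrates that general technique only; OpenAI has not published how its consensus step works, and the `normalize` and `majority_vote` helpers and the sample answers are hypothetical.

```python
from collections import Counter


def normalize(answer: str) -> str:
    """Lowercase and collapse whitespace so trivially different strings match."""
    return " ".join(answer.lower().split())


def majority_vote(answers: list[str]) -> str:
    """Return the most common normalized answer; ties go to the earliest seen."""
    if not answers:
        raise ValueError("need at least one answer")
    counts = Counter(normalize(a) for a in answers)
    # most_common keeps first-seen order among answers with equal counts.
    return counts.most_common(1)[0][0]


# Hypothetical sampled responses to the same question.
samples = ["Paris", "paris ", "Lyon", "Paris", "Nice"]
print(majority_vote(samples))
```

Voting helps only when wrong answers are scattered and the right answer recurs, which is why it tends to add a few points rather than transform a weak system.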

Those numbers show that Open Deep Research is not yet ahead of OpenAI’s implementation. But they also show that an open source research agent can move close to commercial performance quickly when it has the right structure around the model.

The benchmark also matters because GAIA questions are not simple search prompts. Hugging Face highlighted one example that asks which fruits shown in the 2008 painting “Embroidery from Uzbekistan” were served on the October 1949 breakfast menu of the ocean liner later used as a floating prop in the film “The Last Voyage.” To answer correctly, an AI agent has to connect information across different sources and return a precise result.

That kind of task is difficult because it requires more than language fluency. The system must decide what to look up, keep track of partial findings, compare sources, and produce a coherent final answer.
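The loop described above (decide what to look up, record partial findings, then answer) can be sketched in a few lines. This is a toy illustration under stated assumptions, not Hugging Face’s implementation: `fake_planner` stands in for the language model, `fake_search` stands in for a web tool, and the corpus entries are invented.

```python
from dataclasses import dataclass, field
from typing import Callable


@dataclass
class AgentState:
    question: str
    notes: list[str] = field(default_factory=list)  # partial findings so far


def fake_search(query: str) -> str:
    """Hypothetical tool; a real agent would query a search engine or browser."""
    corpus = {
        "ship in film": "The film used the liner SS Example as a prop.",
        "SS Example menu": "The 1949 menu listed tea and toast.",
    }
    return corpus.get(query, "no result")


def fake_planner(state: AgentState) -> tuple[str, str]:
    """Stand-in for the model: pick the next action from the notes gathered."""
    if not state.notes:
        return ("search", "ship in film")
    if len(state.notes) == 1:
        return ("search", "SS Example menu")
    return ("answer", state.notes[-1])


def run_agent(
    question: str,
    planner: Callable[[AgentState], tuple[str, str]],
    tools: dict[str, Callable[[str], str]],
    max_steps: int = 5,
) -> str:
    state = AgentState(question)
    for _ in range(max_steps):
        action, arg = planner(state)
        if action == "answer":
            return arg
        # Run the chosen tool and keep its output as a partial finding.
        state.notes.append(tools[action](arg))
    return "gave up"


print(run_agent("What was on the menu?", fake_planner, {"search": fake_search}))
```

The step budget (`max_steps`) is a design choice worth noting: without it, a planner that never decides to answer would loop indefinitely.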

The Model Is Only Part Of The System

Open Deep Research currently calls OpenAI’s models through an API, using either large language models such as GPT-4o or simulated reasoning models such as o1 and o3-mini. Hugging Face says it can also be adapted to open-weights AI models.

Aymeric Roucher, who leads the Open Deep Research project, told Ars Technica: “It’s not ‘open weights’ since we used a closed weights model just because it worked well, but we explain all the development process and show the code,” adding, “It can be switched to any other model, so [it] supports a fully open pipeline.”

Roucher also said: “I tried a bunch of LLMs including [Deepseek] R1 and o3-mini,” and added, “And for this use case o1 worked best. But with the open-R1 initiative that we’ve launched, we might supplant o1 with a better open model.”
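Swapping models is straightforward when the agent depends only on a narrow interface rather than on one vendor’s client. The sketch below shows that pattern in general terms; the `ChatModel` protocol and both placeholder classes are hypothetical and do not reflect the project’s actual code.

```python
from typing import Protocol


class ChatModel(Protocol):
    """The only thing the agent needs from a model: text in, text out."""

    def complete(self, prompt: str) -> str: ...


class ClosedApiModel:
    """Placeholder for a hosted, closed-weights model behind an API."""

    def complete(self, prompt: str) -> str:
        return f"[closed] {prompt}"


class OpenWeightsModel:
    """Placeholder for a locally served open-weights model."""

    def complete(self, prompt: str) -> str:
        return f"[open] {prompt}"


def research(question: str, model: ChatModel) -> str:
    # The agent logic is identical regardless of which backend is passed in.
    return model.complete(f"Research: {question}")


print(research("GAIA", ClosedApiModel()))
print(research("GAIA", OpenWeightsModel()))
```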

The larger lesson is that the agentic structure can strongly affect results. Ars Technica reports that OpenAI’s GPT-4o alone, without an agentic framework, scores 29 percent on average on GAIA, compared with about 67 percent for OpenAI’s Deep Research.
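The size of these gaps is easier to judge as arithmetic. The short calculation below uses only the scores quoted in this article; the variable names are illustrative, and the figures compare systems built on different models, so the ratios are a rough guide rather than a controlled measurement.

```python
# GAIA scores quoted in the article, in percent.
gpt4o_alone = 29.0          # GPT-4o without an agentic framework
open_deep_research = 55.15  # Hugging Face, after 24 hours
openai_single = 67.36       # OpenAI Deep Research, single pass
openai_consensus = 72.57    # OpenAI Deep Research, 64-response consensus

# Gaps in percentage points.
framework_uplift = round(openai_single - gpt4o_alone, 2)
open_vs_openai = round(openai_single - open_deep_research, 2)
consensus_gain = round(openai_consensus - openai_single, 2)

# Share of OpenAI's single-pass score that the open project reaches.
share_reached = round(100 * open_deep_research / openai_single, 1)

print(framework_uplift, open_vs_openai, consensus_gain, share_reached)
```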

Hugging Face used its open source smolagents library as a foundation. That library uses “code agents” rather than JSON-based agents. These code agents write actions in programming code, which reportedly makes them 30 percent more efficient at completing tasks and helps the system manage complex action sequences more concisely.
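The difference between the two agent styles can be shown with a toy comparison. This sketch does not use the smolagents API; the tools, the hand-written “model turns,” and both runner functions are hypothetical. It only illustrates why an action written as code can chain and loop over tool calls in one turn, where a JSON-style action names one call per turn.

```python
import json


# Hypothetical tools shared by both agent styles.
def search(query: str) -> list[str]:
    return [f"{query} result {i}" for i in range(3)]


def word_count(text: str) -> int:
    return len(text.split())


TOOLS = {"search": search, "word_count": word_count}

# JSON style: each model turn names exactly one tool and its arguments.
json_turns = [
    '{"tool": "search", "args": {"query": "gaia"}}',
    '{"tool": "word_count", "args": {"text": "gaia result 0"}}',
    '{"tool": "word_count", "args": {"text": "gaia result 1"}}',
    '{"tool": "word_count", "args": {"text": "gaia result 2"}}',
]


def run_json_agent(turns: list[str]) -> tuple[list, int]:
    results = []
    for raw in turns:
        call = json.loads(raw)
        results.append(TOOLS[call["tool"]](**call["args"]))
    return results, len(turns)


# Code style: a single model turn can loop over results and chain tools.
code_turn = """
hits = search("gaia")
result = [word_count(h) for h in hits]
"""


def run_code_agent(snippet: str) -> tuple[list, int]:
    namespace = dict(TOOLS)
    # Toy only: executing model-written code needs a sandbox in a real system.
    exec(snippet, namespace)
    return namespace["result"], 1


print(run_json_agent(json_turns)[1], "turns vs", run_code_agent(code_turn)[1])
```

Fewer turns means fewer model calls for the same work, which is one plausible reading of the efficiency claim, though this toy example does not measure the reported 30 percent figure.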

Open Source Speed And Its Limits

Open Deep Research also shows how quickly open source AI projects can evolve when teams build on existing work. Hugging Face used web browsing and text inspection tools borrowed from Microsoft Research’s Magentic-One agent project from late 2024.

Outside contributors are already part of the process. Hugging Face has posted its code publicly on GitHub and opened positions for engineers to help expand the project’s capabilities.

Still, the project has limits. Roucher told Ars Technica: “I think [the benchmarks are] quite indicative for difficult questions,” and added, “But in terms of speed and UX, our solution is far from being as optimized as theirs.”

Future improvements may include support for more file formats and vision-based web browsing capabilities. Hugging Face is also working on cloning OpenAI’s Operator, which can perform tasks such as viewing computer screens and controlling mouse and keyboard inputs within a web browser environment.

For developers, the significance is practical. Open Deep Research gives them a way to study and modify an AI research agent rather than only use a closed commercial product. It also suggests that the next phase of AI capability may depend as much on agent design, tools, and workflow orchestration as on the model at the center.