TechCrunch AI January 19, 2025 TERMINATOR

Delayed OpenAI disclosure puts FrontierMath trust in focus

Epoch AI revealed on December 20 that OpenAI supported the creation of FrontierMath, a math benchmark later used in OpenAI’s o3 demo. The timing drew criticism because not all contributors were told about OpenAI’s involvement or access before the disclosure.

WTF Index TERMINATOR

The story mildly leans Terminator because hidden benchmark access around a powerful model raises concerns about capability evaluation and transparency.

FrontierMath was built to test advanced mathematical ability in AI systems. But the benchmark is now at the center of a transparency dispute after Epoch AI disclosed only recently that OpenAI helped support its creation.

The issue is not simply that OpenAI funded work on a benchmark. The criticism centers on timing, contributor awareness and OpenAI’s access to many of the benchmark’s problems and solutions before the arrangement was made public.

What Epoch AI disclosed

Epoch AI, a nonprofit primarily funded by Open Philanthropy, revealed on December 20 that OpenAI had supported the creation of FrontierMath. FrontierMath is described as a test with expert-level problems designed to measure an AI’s mathematical skills.

OpenAI used FrontierMath as one of the benchmarks in a demo of its upcoming flagship AI, o3. That connection made the disclosure especially sensitive, because benchmark credibility depends on confidence that test results are not shaped by special access or hidden arrangements.

According to the source article, OpenAI had visibility into many of the problems and solutions in FrontierMath. Epoch AI did not disclose that fact before December 20, when o3 was announced.

Why contributors objected

A contractor for Epoch AI using the username “Meemi” wrote on LessWrong that many contributors to FrontierMath were not told about OpenAI’s involvement until it became public. Meemi criticized the way the information was handled.

“The communication about this has been non-transparent,” Meemi wrote. “In my view Epoch AI should have disclosed OpenAI funding, and contractors should have transparent information about the potential of their work being used for capabilities, when choosing whether to work on a benchmark.”

That concern goes beyond routine disclosure. Contributors to a benchmark may care not only about whether their work is used for evaluation, but also about who can access it and whether it could help improve AI capabilities.

Stanford PhD mathematics student Carina Hong also raised concerns in a post on X. She alleged that OpenAI has privileged access to FrontierMath through its arrangement with Epoch AI and said some contributors were uncomfortable with that access.

“Six mathematicians who significantly contributed to the FrontierMath benchmark confirmed [to me] … that they are unaware that OpenAI will have exclusive access to this benchmark (and others won’t),” Hong said. “Most express they are not sure they would have contributed had they known.”

Epoch AI’s response

Tamay Besiroglu, associate director of Epoch AI and one of the organization’s co-founders, replied to Meemi’s post and said the integrity of FrontierMath had not been compromised. At the same time, he acknowledged that Epoch AI “made a mistake” by not being more transparent.

“We were restricted from disclosing the partnership until around the time o3 launched, and in hindsight we should have negotiated harder for the ability to be transparent to the benchmark contributors as soon as possible,” Besiroglu wrote.

Besiroglu also wrote that the mathematicians working on the benchmark should have been told who might access their work. He said Epoch AI should have made transparency with contributors a non-negotiable part of its agreement with OpenAI, even though the organization was contractually limited in what it could say.

Epoch AI also pointed to safeguards. Besiroglu said OpenAI has access to FrontierMath but has a “verbal agreement” with Epoch AI not to use FrontierMath’s problem set to train its AI. He also said Epoch AI maintains a “separate holdout set” for independent verification of FrontierMath benchmark results.

“OpenAI has … been fully supportive of our decision to maintain a separate, unseen holdout set,” Besiroglu wrote.

The unresolved verification issue

The controversy is not fully settled by those safeguards. Epoch AI lead mathematician Elliot Glazer wrote on Reddit that Epoch AI has not yet been able to independently verify OpenAI’s FrontierMath o3 results.

“My personal opinion is that [OpenAI’s] score is legit (i.e., they didn’t train on the dataset), and that they have no incentive to lie about internal benchmarking performances,” Glazer said. “However, we can’t vouch for them until our independent evaluation is complete.”

That leaves the benchmark in a difficult position. Epoch AI says the benchmark’s integrity has not been compromised, but contributors and observers are questioning whether the process around disclosure gave them enough information at the right time.

What the dispute says about AI benchmarks

The FrontierMath debate highlights a broader challenge in AI evaluation: benchmarks need resources to be built, but funding and access arrangements can create the perception of conflicts of interest.

For a benchmark to function as a trusted measure, its process matters as much as its problem set. Contributors need to understand who may use their work. The public needs clarity about whether a company being evaluated had special access. Benchmark developers need enough support to create serious tests without undermining confidence in the results.

In this case, the central facts are narrow but important:

Epoch AI disclosed on December 20 that OpenAI supported FrontierMath.
OpenAI used FrontierMath in a demo of o3.
Some contributors say they were not told about OpenAI’s involvement before the public disclosure.
OpenAI had visibility into many FrontierMath problems and solutions.
Epoch AI says it has a “separate holdout set” for independent verification.
Epoch AI has not yet been able to independently verify OpenAI’s FrontierMath o3 results.

The result is a credibility test for the people building the test. FrontierMath was designed to measure AI’s mathematical skills, but the dispute around it now measures something else as well: whether benchmark organizations can balance funding, access and transparency without weakening trust in their own work.