OpenAI has introduced MLE-bench, a benchmark designed to measure whether AI agents can do more than answer questions about machine learning. The goal is to test how well they can develop working machine learning solutions on competitive tasks with measurable outcomes.
The benchmark uses 75 Kaggle competitions and compares AI agent results with human performance. Early results put o1-preview with the AIDE framework ahead of the tested alternatives, while also showing that more attempts and more processing time matter more than simply adding GPU power.
What MLE-bench Is Built to Measure
MLE-bench focuses on autonomous AI systems in ML engineering. That means the benchmark is not just asking whether a model understands a concept, but whether an agent can move through the practical work of building a solution.
OpenAI designed MLE-bench around two main priorities. First, it selects challenging tasks that reflect current machine learning development. Second, it compares AI performance against human performance, using Kaggle competitions as the testing ground.
The 75 competitions span several areas of machine learning, including natural language processing, computer vision, and signal processing. The tasks are not abstract puzzles. Some connect to real-world applications, including predicting COVID-19 mRNA vaccine degradation and decoding ancient scrolls.
That variety matters because ML engineering usually involves many different kinds of judgment. An agent may need to prepare data, choose an approach, train models, inspect poor results, and try again. MLE-bench is meant to capture some of those core skills in a repeatable way.
Why Kaggle Competitions Are Useful for This Test
Kaggle competitions give MLE-bench a structured environment. They provide clear problems, datasets, and evaluation metrics, which makes it possible to score AI agents in a consistent way.
That structure also makes comparison easier. If an AI agent submits a solution to a known competition, its result can be judged against the scores human competitors achieved on the same task. For a benchmark, that is a practical way to connect model behavior to an external standard.
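As a rough illustration of that comparison, the sketch below places an agent's score on a competition's human leaderboard. The function name, the leaderboard values, and the "share of humans beaten" measure are all invented for illustration; this is not MLE-bench's actual grading code.

```python
def fraction_of_humans_beaten(agent_score, human_scores, higher_is_better=True):
    """Return the share of human leaderboard entries the agent strictly beats."""
    if not human_scores:
        raise ValueError("human_scores must not be empty")
    if higher_is_better:
        beaten = sum(1 for s in human_scores if s < agent_score)
    else:
        # For error-style metrics (e.g. RMSE), a lower score is the better one.
        beaten = sum(1 for s in human_scores if s > agent_score)
    return beaten / len(human_scores)


# Hypothetical leaderboard for an accuracy-style metric (higher is better).
leaderboard = [0.91, 0.88, 0.86, 0.80, 0.75, 0.70, 0.66, 0.60, 0.55, 0.50]
print(fraction_of_humans_beaten(0.87, leaderboard))  # 0.8
```

The `higher_is_better` flag matters because competitions use different metrics: some reward a higher score, others a lower error.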
But the same structure is also a limitation. OpenAI acknowledges that MLE-bench does not cover every part of AI research and development. The benchmark focuses on tasks where the problem is clearly defined, the data is clean, and the evaluation is straightforward.
Real-world machine learning work is often less tidy. Problems may be unclear, datasets may be incomplete, and success may not be captured by a simple metric. MLE-bench therefore measures important ML engineering abilities, but it does not claim to represent the full complexity of AI research.
What the Early Results Show
OpenAI tested several AI models and agent frameworks on MLE-bench. The best result came from o1-preview using the AIDE framework, which achieved at least a bronze medal in 16.9% of competitions.
That surpassed Anthropic's Claude 3.5 Sonnet, which was also paired with the AIDE framework. The result gives o1-preview the strongest showing in the initial tests described by OpenAI.
The benchmark also highlights how agent performance changes when the system is allowed to keep trying. More attempts per competition substantially improved outcomes: with 8 attempts, o1-preview's medal rate roughly doubled, from 16.9% to 34.1%.
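A minimal sketch of how a multi-attempt medal rate could be computed is shown below. It assumes a competition counts as medaled once any of its first k attempts earns at least a bronze; the article does not spell out the exact scoring rule, and the competition names and outcomes are invented.

```python
def medal_rate(attempts_by_competition, k):
    """Share of competitions where any of the first k attempts earned a medal."""
    if k < 1:
        raise ValueError("k must be at least 1")
    if not attempts_by_competition:
        raise ValueError("no competitions given")
    medaled = sum(
        1 for attempts in attempts_by_competition.values() if any(attempts[:k])
    )
    return medaled / len(attempts_by_competition)


# Invented outcomes: True means that attempt earned at least a bronze medal.
results = {
    "comp-a": [False, False, True, False],
    "comp-b": [True, False, False, False],
    "comp-c": [False, False, False, False],
    "comp-d": [False, True, False, False],
}
print(medal_rate(results, k=1))  # 0.25
print(medal_rate(results, k=4))  # 0.75
```

Under this rule the rate can only stay flat or rise as k grows, which is why extra attempts help an agent whose individual runs vary in quality.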
Processing time also affected results. GPT-4o increased its medal rate from 8.7% to 11.8% when processing time was extended from 24 to 100 hours. This suggests that longer-running agent workflows can produce better submissions, even when the underlying task remains the same.
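To put the reported figures side by side, the short calculation below works out the absolute and relative gains. The percentages are the ones quoted above; the helper function is only illustrative. The two gains come from different models (o1-preview and GPT-4o), so they are not a controlled comparison.

```python
def gain(before, after):
    """Return (gain in percentage points, ratio of after to before)."""
    return after - before, after / before


# Medal rates quoted in the article, in percent.
attempts_points, attempts_ratio = gain(16.9, 34.1)  # o1-preview, 1 -> 8 attempts
time_points, time_ratio = gain(8.7, 11.8)           # GPT-4o, 24 -> 100 hours

print(f"More attempts: +{attempts_points:.1f} points, {attempts_ratio:.2f}x")
print(f"More time:     +{time_points:.1f} points, {time_ratio:.2f}x")
# More attempts: +17.2 points, 2.02x
# More time:     +3.1 points, 1.36x
```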
More GPU power did not show the same effect. In the reported results, additional GPU power had little impact on performance. In these tests, repeated attempts and longer processing time were more effective ways to scale than adding GPU hardware.
The Contamination Problem
One challenge for MLE-bench is that Kaggle competitions are publicly available. Because of that, OpenAI had to consider whether AI agents might benefit from prior exposure to top solutions or related material.
OpenAI addressed this risk in two ways. It used a plagiarism detector to compare agent submissions with top Kaggle solutions, and it ran experiments to check for contamination effects.
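The article does not say which plagiarism detector OpenAI used or how it works. As a stand-in, the sketch below uses Python's standard `difflib.SequenceMatcher` to flag submissions that closely match reference solutions. The function names, the sample code strings, and the 0.6 threshold are all assumptions made for illustration.

```python
from difflib import SequenceMatcher


def similarity(code_a, code_b):
    """Return a 0..1 similarity ratio, ignoring blank lines and indentation."""
    def normalize(text):
        return [line.strip() for line in text.splitlines() if line.strip()]
    return SequenceMatcher(None, normalize(code_a), normalize(code_b)).ratio()


def flag_similar(submission, reference_solutions, threshold=0.6):
    """Return names of reference solutions at or above the similarity threshold.

    The default threshold is arbitrary and would need tuning in practice.
    """
    return [
        name
        for name, ref in reference_solutions.items()
        if similarity(submission, ref) >= threshold
    ]


# Invented example: one near-copy and one unrelated solution.
submission = "import pandas as pd\ndf = pd.read_csv('train.csv')\nmodel.fit(X, y)\n"
references = {
    "top-1": "import pandas as pd\n\ndf = pd.read_csv('train.csv')\n  model.fit(X, y)\n",
    "top-2": "import torch\nnet = Net()\nnet.train()\n",
}
print(flag_similar(submission, references))  # ['top-1']
```

A line-level comparison like this catches only near-verbatim copies. A solution rewritten with different variable names would pass, which is one reason a similarity check alone cannot rule out contamination.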
This is an important part of the benchmark design. If an agent succeeds because it reproduces known solutions, that would not show the same engineering ability as solving the task through its own process. The benchmark needs to separate genuine problem-solving from memorized or copied material as much as possible.
MLE-bench is not presented as a perfect answer to this issue. Contamination is described as a challenge OpenAI faced while creating the benchmark. The steps taken make the results more meaningful, but they do not remove every limitation of using public competitions.
What MLE-bench Can and Cannot Prove
MLE-bench is useful because it targets practical machine learning work. OpenAI sees it as a way to assess core ML engineering skills, including preparing large multimodal datasets, managing long-term training procedures, and debugging underperforming models.
Those skills are central to building working machine learning systems. A model that can reason about an algorithm is not automatically able to manage the full process of producing a competitive solution. MLE-bench tries to evaluate that broader engineering workflow.
At the same time, the benchmark has clear boundaries. It is strongest where tasks are well specified and outcomes are measurable. It says less about open-ended research problems, ambiguous product goals, or situations where evaluation cannot be reduced to a simple score.
The early results point in both directions at once. AI agents are making measurable progress on ML engineering tasks, especially when given multiple attempts and longer processing time. But the medal rates also show that these systems remain far from reliably matching strong human performance across the full set of competitions.
MLE-bench is available on GitHub, and it is part of a broader effort to evaluate autonomous AI systems with clearer, more practical tests. For now, its main value is not that it settles the question of AI engineering ability, but that it gives researchers a more concrete way to measure where that ability is improving and where the limits still show.