GDPval tests how close AI models are to expert work

OpenAI's GDPval benchmark evaluates AI models on real-world knowledge work across 44 professions and nine major industries. Early results show GPT-5 and Claude Opus 4.1 approaching expert performance on many tasks, but the scores vary sharply by file format and the benchmark still does not simulate full jobs.

OpenAI's GDPval benchmark is designed to answer a practical question: how well can today's AI models handle the kinds of deliverables professionals actually produce at work?

The first results suggest that leading systems are moving into more demanding territory. GPT-5 and Claude Opus 4.1 can match or beat human reference work on many tasks, but the details matter. File format, task design, and the limits of one-shot testing all shape what the benchmark really shows.

What GDPval Measures

GDPval covers 1,320 tasks across 44 professions. Those roles come from nine major industries, each representing more than 5 percent of US GDP.

OpenAI selected high-paying jobs in those sectors and filtered them through the O*NET database, a resource developed by the US Department of Labor that catalogs detailed occupational information. The benchmark includes only roles where at least 60 percent of the work is non-physical. According to OpenAI, the list is based on Bureau of Labor Statistics figures from May 2024.

The task set spans areas such as technology, nursing, law, software development, and journalism. Each task was created by professionals averaging 14 years of experience. The assignments are based on real-world work products, including legal briefs, care plans, and technical presentations.

Why The Tasks Are Different

GDPval is not built around simple text prompts. The benchmark asks models to work with additional materials and produce deliverables in more complex formats.

One example in the source describes a mechanical engineer assignment: design a test bench for a cable winding system, deliver a 3D model, and prepare a PowerPoint presentation based on technical specs. That kind of task is closer to professional output than a short answer in a chat window.

Every result is reviewed by industry experts in blind tests. The reviewers compare each AI output with a human reference solution and rate it as "better than," "as good as," or "worse than" that reference.
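The way such ratings turn into headline percentages can be sketched in a few lines of Python. This is a minimal illustration, not OpenAI's scoring code: the function name, the label strings, and the sample ratings are all assumptions made for the example, and the sample is not GDPval data.

```python
from collections import Counter


def win_and_tie_rate(ratings):
    """Return (win_rate, win_or_tie_rate) for a list of blind-review labels.

    Labels are assumed to be "better", "as good as", or "worse than",
    each describing the AI output relative to the human reference.
    """
    counts = Counter(ratings)
    total = sum(counts.values())
    if total == 0:
        raise ValueError("no ratings supplied")
    wins = counts["better"]
    ties = counts["as good as"]
    return wins / total, (wins + ties) / total


# Hypothetical labels for illustration only; not GDPval data.
sample = [
    "better", "as good as", "worse than", "worse than",
    "better", "as good as", "worse than", "worse than",
]

win_rate, win_or_tie = win_and_tie_rate(sample)
print(f"win rate: {win_rate:.0%}, win-or-tie rate: {win_or_tie:.0%}")
```

The distinction matters when reading the results: a figure described as "as good as or better than" human work counts ties, while a strict win rate does not.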

OpenAI also built an experimental AI-based review assistant to simulate human scoring. According to the paper, each task was reviewed about five times through peer checks, additional expert reviews, and model-based validation.

Where GPT-5 And Claude Opus 4.1 Stand

Early GDPval results show top models nearing expert-level output on parts of the benchmark. In about half of the 220 gold-standard tasks published so far, experts rated AI work as equal to or better than the human benchmark.

GPT-5 shows major gains over GPT-4o, which launched in spring 2024. Depending on the metric, GPT-5's scores have doubled or even tripled.

Claude Opus 4.1 goes further in the overall comparison, with results rated as good as or better than human output in nearly half of all tasks. OpenAI claims Claude tends to stand out in aesthetics and formatting, while GPT-5 leads in expertise and accuracy.

OpenAI also points to efficiency. The models completed tasks about 100 times faster and 100 times cheaper than human experts, counting only inference time and API costs. The company expects AI first drafts could save time and money, while still requiring human review, iteration, and integration in real workflows.

File Format Changes The Score

A later update to the GDPval results adds an important caveat: model performance depends heavily on the file format used for the submission.

On plain text tasks, the models have their lowest win rates. Claude Opus 4.1 reaches 14 percent, while GPT-5 reaches 22 percent.

The results look much stronger in other formats:

  • For PDFs, Claude Opus 4.1 reaches a 46 percent win rate.
  • On Excel files (xlsx), Claude Opus 4.1 reaches 45 percent.
  • For PowerPoint presentations (pptx), Claude Opus 4.1 reaches 48 percent.
  • In the "other" category, which covers the remaining formats, Claude Opus 4.1 reaches 50 percent, putting it level with human professionals.
  • GPT-5 shows 36 percent on xlsx and 49 percent in "other."

OpenAI does not give a reason for these differences. The source notes that layout and visual design may influence human reviewers. A model can benefit from neat formatting, clear structure, and strong visuals in presentations, spreadsheets, or PDFs, even when the underlying content is not necessarily better.

Plain text removes much of that advantage. In that setting, the model has to compete more directly on writing and reasoning.

This matters because PDF, Excel, and "other" formats together account for over 80 percent of all deliverables. That helps explain why models can post strong overall results while still trailing humans by a wide margin on pure text work.
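The arithmetic behind that point can be sketched as a weighted average. In the snippet below, the per-format rates for Claude Opus 4.1 are the figures quoted in this article, but the format shares are hypothetical placeholders chosen only so that PDF, Excel, and "other" add up to a little over 80 percent. The article does not report the real per-format shares, so the resulting number is illustrative and should not be read as a GDPval result.

```python
import math


def overall_win_rate(rate_by_format, share_by_format):
    """Weighted average of per-format rates, weighted by each format's share."""
    total_share = sum(share_by_format.values())
    if not math.isclose(total_share, 1.0, abs_tol=1e-9):
        raise ValueError("format shares must sum to 1")
    return sum(rate_by_format[fmt] * share for fmt, share in share_by_format.items())


# Per-format rates for Claude Opus 4.1 as quoted in the article.
claude_rates = {"text": 0.14, "pdf": 0.46, "xlsx": 0.45, "pptx": 0.48, "other": 0.50}

# HYPOTHETICAL shares of deliverables per format (assumption, not from OpenAI).
# PDF + xlsx + other = 0.82, consistent with "over 80 percent".
assumed_shares = {"text": 0.09, "pdf": 0.30, "xlsx": 0.26, "pptx": 0.09, "other": 0.26}

overall = overall_win_rate(claude_rates, assumed_shares)
print(f"overall rate under assumed shares: {overall:.1%}")
```

Under any mix where plain text is a small slice of the deliverables, the low plain-text rate barely moves the overall figure, which is the effect the paragraph above describes.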

What GDPval Still Does Not Show

GDPval remains a task benchmark, not a full workplace simulation. The current version uses "one-shot" tasks, meaning models get one attempt at each assignment. They do not receive feedback, build context over time, or revise through back-and-forth interaction.

The tasks also leave out much of the ambiguity that appears in real professional settings. Actual work often involves unclear requirements, discussions with colleagues or clients, and changing expectations. GDPval focuses instead on isolated, computer-based steps.

OpenAI is careful to say that current AI models are not replacing entire jobs. The company says they are best at automating repetitive, clearly structured tasks. The test set is also limited, with only about 30 tasks for each of the 44 professions.

Future versions of GDPval are expected to move closer to realistic work conditions. OpenAI says those versions will include more interactive tasks, built-in ambiguity, and feedback loops. The long-term aim is to track AI's economic impact and understand how it is changing the labor market.