OpenAI pushes Codex toward day-long engineering work

OpenAI has released GPT-5.1-Codex-Max as the new default model across Codex interfaces. The model is designed for long-running engineering tasks, faster real-world work, and extended coding sessions, but OpenAI still stresses human review.

A more autonomous long-running coding agent mildly increases power and review-control concerns, though OpenAI still emphasizes human oversight.

OpenAI is moving Codex further into the territory of sustained software engineering work with GPT-5.1-Codex-Max, a new agentic coding model built for complex tasks that can run for hours. The model replaces GPT-5.1-Codex as the standard across all Codex interfaces and is aimed at work that requires large context, repeated implementation, and careful follow-through.

The release also puts more attention on a practical tension around AI coding tools: longer sessions can produce more useful work, but they can also make review harder. OpenAI is presenting Codex as an additional reviewer and engineering assistant, not a substitute for human judgment.

What changed in Codex

GPT-5.1-Codex-Max is described by OpenAI as its latest agentic coding model. The company says it is built for "long-running, detailed work," which signals a focus on engineering assignments that go beyond short code snippets or quick edits.

Across Codex interfaces, the Max version now replaces the older GPT-5.1-Codex model as the default. The previous model is being retired after just a few days, according to the source article.

The model is available now to ChatGPT Plus, Pro, Team, Edu, and Enterprise users. API access is expected soon, but OpenAI has not yet released pricing for the new model.

The only known pricing applies to the short-lived predecessor, GPT-5.1-Codex: $1.25 per million input tokens and $10 per million output tokens. Until OpenAI publishes rates for the new model, developers and teams evaluating Codex are left with an important cost variable unresolved.
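As a rough illustration of how those predecessor rates translate into a bill, the sketch below computes the cost of a single session. The token counts are hypothetical, the function name is ours, and the rates are the ones quoted for GPT-5.1-Codex, not for GPT-5.1-Codex-Max, whose pricing is still unknown.

```python
# Rates quoted for the predecessor, GPT-5.1-Codex (USD per million tokens).
# These are NOT confirmed prices for GPT-5.1-Codex-Max.
INPUT_PRICE_PER_MILLION = 1.25
OUTPUT_PRICE_PER_MILLION = 10.00


def estimate_cost(input_tokens: int, output_tokens: int) -> float:
    """Return the estimated USD cost of a session at the predecessor's rates."""
    input_cost = (input_tokens / 1_000_000) * INPUT_PRICE_PER_MILLION
    output_cost = (output_tokens / 1_000_000) * OUTPUT_PRICE_PER_MILLION
    return input_cost + output_cost


# Hypothetical long session: 2 million input tokens, 200,000 output tokens.
session_cost = estimate_cost(2_000_000, 200_000)
print(f"${session_cost:.2f}")  # prints "$4.50"
```

Output tokens cost eight times as much as input tokens at these rates, so a session that generates a lot of code or reasoning can be dominated by the output side even when the input context is much larger.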

Benchmark gains and speed claims

OpenAI says GPT-5.1-Codex-Max is projected to reach a top score of 77.9 percent on SWE-Bench Verified. According to the source article, that places it ahead of Anthropic's models and Google's recently released Gemini 3.

The model also improved in OpenAI's internal "SWE-Lancer IC SWE" benchmark, moving from 66.3 percent to 79.9 percent, according to the blog post referenced in the source.

OpenAI is also making an efficiency claim. The company says the model uses 30 percent fewer "thinking tokens" than its predecessor while maintaining the same quality, and that it runs 27 to 42 percent faster on real-world tasks.

For cases where speed is less important, OpenAI is adding an Extra High reasoning mode. That mode gives the system more time to think, which is meant for work where latency does not matter as much as careful execution.

Why long sessions matter

The most notable claim is about duration. OpenAI says that in internal tests GPT-5.1-Codex-Max stayed focused on a single assignment for "more than 24 hours." The examples given include fixing test failures and iterating on implementations.

OpenAI did not share details about those workloads. That matters because the usefulness of a long-running coding agent depends heavily on what kind of task it is doing, how much context it receives, and how its output is checked.

The technical approach behind the longer sessions is called "compaction." When the model fills its context window, the system automatically compresses the session history. It keeps relevant information, removes details judged less important, and preserves the core task and key steps over millions of tokens.

GPT-5.1-Codex-Max is described as the first model natively trained to work this way across multiple context windows. In plain terms, the goal is to let Codex continue a task without losing the thread when the conversation or work history becomes very large.
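OpenAI has not published how compaction is implemented, so the following is only a toy sketch of the general idea described above: when the history outgrows a budget, keep the task and the key steps, and drop the less important detail. Every name here (Entry, key_step, compact, the word-count "tokenizer") is a hypothetical stand-in, and the real system presumably relies on the model itself to judge relevance rather than on a hand-set flag.

```python
from dataclasses import dataclass


@dataclass
class Entry:
    text: str
    key_step: bool = False  # hypothetical flag marking steps worth preserving


def token_count(text: str) -> int:
    # Crude stand-in for a real tokenizer: count whitespace-separated words.
    return len(text.split())


def compact(task: str, history: list[Entry], budget: int,
            keep_recent: int = 2) -> list[Entry]:
    """Return history unchanged if task plus history fits the budget.

    Otherwise keep the most recent entries and any older key steps,
    and replace the dropped older entries with a short marker.
    """
    total = token_count(task) + sum(token_count(e.text) for e in history)
    if total <= budget:
        return history
    cutoff = max(len(history) - keep_recent, 0)
    older, recent = history[:cutoff], history[cutoff:]
    kept = [e for e in older if e.key_step]
    dropped = len(older) - len(kept)
    marker = [Entry(f"[{dropped} earlier entries compacted]")] if dropped else []
    return marker + kept + recent


task = "fix failing tests"
history = [
    Entry("ran test suite, 3 failures in parser module", key_step=True),
    Entry("verbose stack trace line one two three four five six"),
    Entry("verbose stack trace again seven eight nine ten"),
    Entry("patched parser tokenizer bug", key_step=True),
    Entry("reran tests, 1 failure left"),
]

compacted = compact(task, history, budget=20)
for entry in compacted:
    print(entry.text)
```

In this example the two stack-trace entries are replaced by a single marker, while the flagged key step and the two most recent entries survive. The sketch makes one pass and does not guarantee the result fits the budget; a real system would also need a way to summarize dropped material rather than simply discard it.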

Windows, security, and review

OpenAI says GPT-5.1-Codex-Max is also the first model specifically trained to work effectively in Windows environments. That is meant to improve how it handles command-line tasks.

The company also says 95 percent of its engineers use Codex weekly. Since the tool's introduction, OpenAI says it has seen a 70 percent increase in pull requests.

On security, OpenAI calls this its most capable cybersecurity model to date, while also saying it remains below the internal "High Capability" threshold. The company plans to support defenders with tools like Aardvark, but it warns that developers should double-check the agent's work before deployment.

That warning becomes more important as Codex handles longer assignments. OpenAI says reviewing the agent's work becomes "increasingly important" because these systems still make mistakes. A larger volume of generated code is also harder to verify, understand, and debug later.

Codex provides terminal logs that cite its tool calls and test results to help with review. Still, OpenAI emphasizes that Codex acts as an additional reviewer, not a replacement for human eyes.

Usage limits for subscribers

The source article also lists usage limits for ChatGPT Plus and Pro users. Plus users have limits set at 45 to 225 local messages and 10 to 60 cloud tasks every five hours.

Pro users receive more capacity. Their limits range from 300 to 1,500 local messages and 50 to 400 cloud tasks in the same period.

Those limits matter because the value of a long-running coding model depends not only on capability, but also on how often users can run it and how much work they can delegate. For now, GPT-5.1-Codex-Max expands what Codex is meant to handle, while keeping human review central to the workflow.