Managing AI agents becomes the new pitch from OpenAI and Anthropic

Anthropic and OpenAI are pushing AI agents as systems users supervise rather than chat with one at a time. Claude Opus 4.6, agent teams, Frontier, and GPT-5.3-Codex all point to that shift, but reliability and independent proof remain open questions.

Anthropic and OpenAI are moving the center of AI work away from the familiar chatbot window. Their latest releases point toward a different model: people assigning work to multiple AI agents, watching those agents run in parallel, and stepping in when the work needs correction.

The idea is ambitious, but the evidence is still developing. The source article makes clear that current AI agents still need substantial human oversight, and that no independent evaluation has confirmed that multi-agent tools reliably beat a single developer working alone.

From chatbot to supervised AI workforce

On Thursday, Anthropic and OpenAI each released products built around a similar assumption: one assistant answering one prompt is no longer the whole story. The companies are now pitching systems where users manage groups of AI agents that split up work and move at the same time.

Anthropic’s release centers on Claude Opus 4.6, a new version of its most capable model, and a Claude Code feature called “agent teams.” The feature lets developers start multiple AI agents, divide a task into separate pieces, and let those agents coordinate and run concurrently.

In use, agent teams are described as a split-screen terminal environment. A developer can move among subagents with Shift+Up/Down, take direct control of any one agent, and let the others continue working. Anthropic describes the feature as best suited for “tasks that split into independent, read-heavy work like codebase reviews.” It is available as a research preview.
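The general pattern of splitting independent, read-heavy work across concurrent workers can be sketched in plain Python. This is not Claude Code's implementation or API. The function `review_file` is a hypothetical stand-in for a single agent, and here it only counts TODO markers so the example stays self-contained.

```python
from concurrent.futures import ThreadPoolExecutor


def review_file(name: str, text: str) -> tuple[str, int]:
    """Hypothetical stand-in for one agent: count TODO markers in a file."""
    return name, text.count("TODO")


def review_in_parallel(files: dict[str, str], max_workers: int = 4) -> dict[str, int]:
    """Hand each file to its own worker, then merge the findings for a human reviewer."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # pool.map preserves input order; results are consumed before the pool closes.
        results = pool.map(lambda item: review_file(*item), files.items())
        return dict(results)


if __name__ == "__main__":
    sample = {
        "auth.py": "# TODO: rotate keys\nprint('ok')\n",
        "db.py": "# TODO: add index\n# TODO: close cursor\n",
        "ui.py": "print('no notes')\n",
    }
    for name, count in review_in_parallel(sample).items():
        print(name, count)
```

The pieces here are independent, so no worker waits on another. Work that needs heavy coordination between pieces would not fit this simple shape.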

OpenAI’s comparable move is Frontier, an enterprise platform positioned around the idea of AI co-workers. Frontier gives each AI agent an identity, permissions, and memory, and connects those agents to existing business systems such as CRMs, ticketing tools, and data warehouses.

“What we’re fundamentally doing is basically transitioning agents into true AI co-workers,” Barret Zoph, OpenAI’s general manager of business-to-business, told CNBC.

The manager role becomes the product

The shared thread across Claude Opus 4.6, Claude Code agent teams, Frontier, and Codex is a change in what the human user is expected to do. Instead of writing one prompt and waiting for one answer, the user becomes a supervisor: assigning tasks, checking progress, correcting direction, and reviewing the final work.

That framing matters because it changes the value proposition of AI software. The user is not simply asking a model for help. The user is being asked to manage a workflow in which multiple AI agents act at once.

OpenAI’s Codex work reinforces the same direction. Three days before Frontier, OpenAI released a new macOS desktop app for Codex, its AI coding tool. OpenAI executives described it as a “command center for agents.” The app lets developers run multiple agent threads in parallel, with each thread working on an isolated copy of a codebase through Git worktrees.
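To show what "an isolated copy of a codebase through Git worktrees" can look like in practice, here is a minimal Python sketch that builds one `git worktree add` command per agent thread. It is an illustration only, not how the Codex app is implemented. The function name `worktree_commands`, the thread names, and the sibling-directory naming scheme are all hypothetical.

```python
from pathlib import Path


def worktree_commands(repo_dir: str, thread_names: list[str]) -> list[list[str]]:
    """Build one `git worktree add` command per agent thread.

    Each command creates a new branch and checks it out in its own
    directory next to the main repository, so threads do not overwrite
    each other's files. Directory and branch naming are hypothetical.
    """
    repo = Path(repo_dir)
    commands = []
    for name in thread_names:
        worktree_path = repo.parent / f"{repo.name}-{name}"
        commands.append(
            ["git", "-C", str(repo), "worktree", "add", "-b", name, str(worktree_path)]
        )
    return commands


if __name__ == "__main__":
    # Printing only; pass each list to subprocess.run(cmd, check=True) to execute it.
    for cmd in worktree_commands("/tmp/project", ["agent-1", "agent-2"]):
        print(" ".join(cmd))
```

Because every worktree has its own working directory and branch, changes from one thread can be reviewed and merged, or discarded, without touching the others.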

OpenAI also released GPT-5.3-Codex on Thursday. The model powers the Codex app, and OpenAI claims that early versions helped the Codex team debug the model’s own training run, manage deployment, and diagnose test results. The company wrote, “Our team was blown away by how much Codex was able to accelerate its own development.”

The important practical point is less about branding and more about supervision. These tools may produce useful drafts quickly, but the source article notes that they still work best as tools that amplify existing skills, not as fully autonomous co-workers. They require frequent human course-correction.

What Claude Opus 4.6 adds

Claude Opus 4.6 follows Claude Opus 4.5, which Anthropic released in November. One major change is context length: for the first time in the Opus model family, Opus 4.6 supports a context window of up to 1 million tokens in beta. That means it can handle much larger bodies of text or code in one session.

Anthropic says Opus 4.6 beats OpenAI’s GPT-5.2 and Google’s Gemini 3 Pro across several evaluations, including Terminal-Bench 2.0, Humanity’s Last Exam, and BrowseComp. At the same time, GPT-5.3-Codex, released the same day, appears to have taken the lead on Terminal-Bench 2.0.

The source article gives several benchmark figures:

  • On Terminal-Bench 2.0, GPT-5.3-Codex scored 77.3 percent, about 12 percentage points above Opus 4.6.
  • On ARC AGI 2, Opus 4.6 scored 68.8 percent, compared with 37.6 percent for Opus 4.5, 54.2 percent for GPT-5.2, and 45.1 percent for Gemini 3 Pro.
  • On MRCR v2, Opus 4.6 scored 76 percent on the 1-million-token variant, compared with 18.5 percent for Sonnet 4.5.
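The gaps implied by the figures listed above are simple subtraction, and a short Python sketch makes them explicit. It uses only the scores quoted in the list. The Terminal-Bench 2.0 comparison is left out because the exact Opus 4.6 score is not given here, and the helper name `point_gaps` is an illustrative choice.

```python
# Scores in percent, copied from the figures listed above.
ARC_AGI_2 = {
    "Opus 4.6": 68.8,
    "GPT-5.2": 54.2,
    "Gemini 3 Pro": 45.1,
    "Opus 4.5": 37.6,
}
MRCR_V2_1M = {"Opus 4.6": 76.0, "Sonnet 4.5": 18.5}


def point_gaps(scores: dict[str, float], leader: str) -> dict[str, float]:
    """Return the leader's lead over each other model, in percentage points."""
    top = scores[leader]
    return {
        model: round(top - score, 1)
        for model, score in scores.items()
        if model != leader
    }


if __name__ == "__main__":
    print(point_gaps(ARC_AGI_2, "Opus 4.6"))
    print(point_gaps(MRCR_V2_1M, "Opus 4.6"))
```

These are differences in percentage points, not relative improvements, and they inherit every caveat that applies to the underlying benchmarks.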

Those numbers help explain why long-context ability matters for AI agents. If agents are expected to work across large codebases or large bodies of documents, they need to track information over long sessions without losing the relevant details. Still, the article cautions that AI benchmarks should be treated carefully because measuring model capability remains relatively new and unsettled.

Anthropic kept API pricing the same as Opus 4.5: $5 per million input tokens and $25 per million output tokens. For prompts above 200,000 tokens, a premium rate applies: $10 per million input tokens and $37.50 per million output tokens. Opus 4.6 is available on claude.ai, the Claude API, and all major cloud platforms.
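As a rough illustration of how those rates add up, the following Python sketch estimates the cost of a single request. Only the dollar figures and the 200,000-token threshold come from the pricing above. The function name `estimate_cost` is hypothetical, and the sketch assumes, without confirmation from the source, that the premium rate applies to every token in a request once the prompt exceeds the threshold.

```python
# Rates in US dollars per million tokens, taken from the figures quoted above.
STANDARD_RATES = (5.00, 25.00)   # (input, output)
PREMIUM_RATES = (10.00, 37.50)   # (input, output) for prompts above 200,000 tokens
PREMIUM_THRESHOLD = 200_000


def estimate_cost(input_tokens: int, output_tokens: int) -> float:
    """Estimate the dollar cost of one request.

    Assumption (not stated in the source): once the prompt exceeds the
    threshold, the premium rate applies to every token in the request.
    """
    if input_tokens < 0 or output_tokens < 0:
        raise ValueError("token counts must be non-negative")
    if input_tokens > PREMIUM_THRESHOLD:
        input_rate, output_rate = PREMIUM_RATES
    else:
        input_rate, output_rate = STANDARD_RATES
    return (input_tokens * input_rate + output_tokens * output_rate) / 1_000_000


if __name__ == "__main__":
    print(f"${estimate_cost(100_000, 10_000):.2f}")
    print(f"${estimate_cost(500_000, 10_000):.2f}")
```

Under that assumption, a long-context session costs at least twice as much per input token as a short one, which matters when several agents each load a large codebase.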

Why software markets are paying attention

The releases arrived during a volatile week for software stocks. On January 30, Anthropic released 11 open-source plugins for Cowork, its agentic productivity tool that launched on January 12. Cowork gives Claude access to local folders for work tasks, while the plugins extend it into areas such as legal contract review, non-disclosure agreement triage, compliance workflows, financial analysis, sales, and marketing.

By Tuesday, investors had reportedly erased roughly $285 billion in market value across software, financial services, and asset management stocks. A Goldman Sachs basket of US software stocks fell 6 percent that day. Thomson Reuters led the decline with an 18 percent drop, and the selling spread to European and Asian markets.

The concern is straightforward: AI model companies are beginning to package full workflows, not just general assistants. That could put pressure on software-as-a-service vendors if customers begin to see AI agents as a way to perform work that previously required specialized software. The source article is careful to note that it remains unsettled whether these tools can actually deliver on those workflows.

Frontier may sharpen that concern because it is designed for AI agents that can log in to applications, perform tasks, and manage work with limited human involvement. Fortune described it as a bid to become “the operating system of the enterprise.” Fidji Simo, OpenAI’s CEO of Applications, rejected the idea that Frontier is meant to replace existing software, telling reporters, “Frontier is really a recognition that we’re not going to build everything ourselves.”

The unresolved question

The direction is clear: AI companies want users to manage AI agents rather than only chat with bots. The unresolved issue is whether that model consistently improves real work.

For now, the safest reading is practical. AI agents can divide tasks, run in parallel, connect to workplace systems, and produce work quickly. But they still need people to notice errors, judge quality, and decide when to intervene. The future being sold is less about removing the human from the loop and more about changing the human’s job inside it.