WIRED AI July 17, 2024 TERMINATOR

Why OpenAI's AI safety push is still facing scrutiny

OpenAI has presented new AI safety research aimed at making a stronger model explain its reasoning more clearly through a back-and-forth with another model. The work is being released publicly, but critics say technical progress on transparency does not answer broader questions about oversight, governance, and corporate incentives.

The story centers on risks from increasingly capable AI systems and whether transparency and governance can keep them controllable.

OpenAI is trying to show that it is taking AI safety seriously at a moment when its approach is under pressure. The company has showcased research that it says could help humans better inspect advanced AI systems as those systems become more useful and more capable.

The work focuses on a problem that sits at the center of current AI safety debates: whether researchers can understand what a model is doing when it gives an answer. OpenAI’s latest technique uses two AI models in a structured exchange, with the goal of making the more powerful model’s reasoning more transparent, or more “legible,” to human observers.

A new route to clearer AI reasoning

The technique OpenAI described is built around a conversation between two models. One model is the stronger problem solver. The other is trained to judge whether answers are correct. When the two models went back and forth, researchers found that the problem-solving model became more direct and open about its reasoning.

So far, the method has been tested on an AI model built to solve simple math problems. OpenAI researchers asked that model to explain how it reached answers while solving problems or responding to questions. The second model then checked whether those answers were right.

The important claim is not that this solves AI transparency. It is that a model can be pushed, through interaction with another model, toward explanations that are easier for humans to examine. In a field where systems can produce plausible answers without making their internal process obvious, that is a meaningful research direction.

Yining Chen, an OpenAI researcher involved in the work, framed the project as part of the company’s broader safety mission. “This is core to the mission of building an [artificial general intelligence] that is both safe and beneficial,” Chen tells WIRED.

Why transparency matters for AI safety

Transparency and explainability are major concerns for researchers studying large language models, including the kinds of systems behind programs like ChatGPT. A model may produce a reasonable-sounding explanation for an answer, but that does not automatically mean the explanation is complete, reliable, or faithful to the model’s actual process.

The concern grows sharper as models become more capable. Some researchers fear that future systems could become harder to interpret, or even deceptive in the explanations they provide. In that scenario, a model might appear to explain itself while pursuing an undesirable goal.

OpenAI’s new work sits within a broader effort to understand how large language models operate. The company and others are also exploring more mechanistic ways of looking inside these systems. The shared aim is to make powerful AI models easier to scrutinize, and therefore safer.

The research is being released publicly in a paper. Jan Hendrik Kirchner, another OpenAI researcher involved in the project, describes it as one piece of a longer program. “It’s part of the long-term safety research plan,” Kirchner says. “We hope that other researchers can follow up, and maybe try other algorithms as well.”

The timing adds pressure

The announcement arrives after OpenAI has faced criticism from people who argue that the company may be moving too quickly in developing more powerful artificial intelligence. In recent weeks, it has revealed several ideas related to AI safety, a pattern that appears aimed at demonstrating that safety research remains central to its agenda.

That context matters because OpenAI was founded with a promise to make AI safer and more open to scrutiny. But the success of ChatGPT and competition from well-backed rivals have intensified questions about whether the company is prioritizing major advances and market position over safety.

In May, WIRED learned that a team focused on long-term AI risk had been disbanded. That happened shortly after the departure of cofounder and key technical leader Ilya Sutskever, who had been one of the board members who briefly ousted CEO Sam Altman last November.

Those events have shaped how the new research is being received. For supporters of technical safety work, a method that encourages clearer reasoning is useful. For critics, it does not settle the larger issue of whether the organizations building advanced systems are accountable enough.

Critics want more than research papers

Daniel Kokotajlo, a researcher who left OpenAI and signed an open letter criticizing the company’s approach to AI safety, says the new work matters but does not go far enough. His objection is not that the technique is useless. It is that technical improvements do not replace oversight of the companies racing to build the technology.

“The situation we are in remains unchanged,” he says. “Opaque, unaccountable, unregulated corporations racing each other to build artificial superintelligence, with basically no plan for how to control it.”

Another source with knowledge of OpenAI's inner workings, who asked not to be named because they were not authorized to speak publicly, made a similar point, arguing that the existence of some safety research matters less than the company's governance.

“The question is whether they’re serious about the kinds of processes and governance mechanisms you need to prioritize societal benefit over profit,” the source says. “Not whether they let any of their researchers do some safety stuff.”

That criticism captures the split around OpenAI’s announcement. The research may help make AI reasoning more inspectable, and public release may allow other researchers to test related ideas. But the debate around AI safety is no longer only about whether a promising technique exists. It is also about who decides how powerful systems are developed, how their risks are judged, and what kinds of accountability surround the companies building them.