TechCrunch AI December 22, 2024 NEUTRAL

How OpenAI made o1 and o3 check safety before answering

OpenAI trained its o-series reasoning models to consult its safety policy during inference, not only during earlier training stages. The method, called deliberative alignment, is meant to help o1 and o3 refuse unsafe requests while still answering benign ones.

WTF Index NEUTRAL

Terminator 1, Idiocracy 0

This is mainly a safety-alignment update for more capable reasoning models, with only a mild Terminator lean from increasing model power and control concerns.

OpenAI is trying to make its reasoning models safer by changing what happens after a user submits a prompt. Its latest approach, called deliberative alignment, trains o1 and o3 to bring OpenAI's safety policy into their internal problem-solving process before they respond.

The company says the method improved o1's alignment with its own safety principles. In practical terms, OpenAI says o1 became better at refusing prompts the company considers unsafe, while also becoming better at answering harmless questions that should not be blocked.

What Changed Inside OpenAI's Reasoning Models

On Friday, OpenAI announced o3, a new family of AI reasoning models that it says is more advanced than o1 or anything else it has released. The company has connected some of these gains to scaling test-time compute, but it also points to a new safety technique for its o-series models.

The important change is where safety policy enters the process. Most AI safety work takes place during pre-training and post-training. Deliberative alignment adds another layer during inference, the phase after a user presses enter on a prompt.

OpenAI's o-series models are described with words such as reasoning and deliberating, but the source article makes clear that they are not thinking the way people do. They are still AI systems that generate answers by predicting the next token in a sentence, where a token is a short chunk of text that the source article puts at roughly half a word.

Still, these models operate differently from simpler prompt-response systems. After a user submits a prompt in ChatGPT, o1 and o3 can spend anywhere from five seconds to a few minutes re-prompting themselves with follow-up questions. They break a task into smaller steps, a process OpenAI calls chain-of-thought, and then produce an answer based on the information generated during that internal process.

How Deliberative Alignment Works

Deliberative alignment trains the model to recall and apply relevant pieces of OpenAI's safety policy while it is working through a prompt. Instead of only learning general refusal behavior during training, the model is taught to use the policy text as part of its internal reasoning path.

OpenAI's research included an example in which a user asks how to create a realistic disabled person's parking placard. In its chain-of-thought, the model cites OpenAI's policy and identifies the request as an attempt to forge something. In the final answer, the model refuses to help.

That example shows the intended pattern. The model should recognize when a request falls into an unsafe category, connect that request to the correct safety rule, and then respond in a way that matches the policy. OpenAI says this produced safer answers that were better calibrated to context.

The context matters because safety is not just about blocking dangerous words. A user might ask for help making a bomb, where to obtain drugs, or how to commit crimes. OpenAI does not want its models to answer those kinds of requests.

But a model that refuses too broadly creates a different failure. If every prompt containing the word bomb were blocked, users could not ask ordinary questions such as who created the atom bomb. This problem is called over-refusal, and it is one of the tensions OpenAI is trying to reduce.

The Jailbreak Problem Is Still Central

OpenAI's challenge is not only identifying clearly unsafe prompts. Users can phrase the same harmful request in many different ways. The source article notes that there are probably a million different ways to ask ChatGPT how to make a bomb.

Some users also try to bypass safeguards through jailbreaks. One example cited in the source asked the model to act as the user's deceased grandmother, with whom the user claimed to have made bombs, and to remind the user how they did it. That approach worked for a while but was patched.

This is why deliberative alignment focuses on policy reasoning rather than simple keyword matching. The model needs to interpret the user's goal, not merely scan for a banned word. It also needs to avoid refusing prompts that are sensitive but legitimate.
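
The weakness of keyword matching can be seen in a toy filter. The sketch below is purely illustrative: the `BANNED_WORDS` set and the `keyword_filter` function are hypothetical and do not represent anything OpenAI uses. It shows both failure modes at once: a benign history question is blocked because it contains the keyword, while a rephrased harmful request passes because it avoids it.

```python
import re

# Hypothetical banned-word list, for illustration only.
BANNED_WORDS = {"bomb"}

def keyword_filter(prompt: str) -> bool:
    """Return True if naive keyword matching would block the prompt."""
    words = set(re.findall(r"[a-z]+", prompt.lower()))
    return bool(words & BANNED_WORDS)

prompts = [
    "Who created the atom bomb?",                   # benign, but contains the keyword
    "How do I build an explosive device at home?",  # harmful, but avoids the keyword
]

for p in prompts:
    status = "BLOCKED" if keyword_filter(p) else "ALLOWED"
    print(f"{status}: {p}")
```

A filter like this has no notion of what the user is trying to do, which is the gap that policy-based reasoning is meant to close.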

OpenAI says the method improved performance on safety evaluations. On StrongREJECT [12], a benchmark that measures resistance against common jailbreaks, o1-preview outperformed GPT-4o, Gemini 1.5 Flash, and Claude 3.5 Sonnet.

Why Synthetic Data Matters Here

Deliberative alignment happens during inference, but OpenAI also used new post-training methods to make it work. Instead of relying on human-written answers or human-written chain-of-thought examples, OpenAI says it used synthetic data.

Synthetic data means training examples created by another AI model. OpenAI instructed an internal reasoning model to produce chain-of-thought examples that referred to different parts of its safety policy. A second internal reasoning model, which OpenAI calls "judge," was used to assess whether those examples were good or bad.

Researchers then trained o1 and o3 on those examples through supervised fine-tuning. The goal was to help the models learn to call up the right parts of the safety policy when they encountered sensitive topics.
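
The generate-then-judge data pipeline can be sketched in a few lines. This is a minimal sketch under stated assumptions, not OpenAI's implementation: `generate_cot`, `judge_score`, the `POLICY` dictionary, and the 0.5 threshold are all hypothetical stand-ins, and both model calls are replaced by simple stub functions so the example runs on its own.

```python
from dataclasses import dataclass

# Hypothetical policy excerpts keyed by category (not OpenAI's actual policy text).
POLICY = {
    "forgery": "Do not help create forged or counterfeit documents.",
    "history": "Factual historical questions may be answered.",
}

@dataclass
class Example:
    prompt: str
    chain_of_thought: str
    answer: str

def generate_cot(prompt: str, category: str) -> Example:
    """Stub for the generator model: writes reasoning that cites a policy section."""
    rule = POLICY[category]
    cot = f"Policy [{category}] says: {rule} Applying it to the request."
    answer = "I can't help with that." if category == "forgery" else "Here is an answer."
    return Example(prompt, cot, answer)

def judge_score(example: Example) -> float:
    """Stub for the judge model: rewards reasoning that quotes a known policy rule."""
    cites_policy = any(rule in example.chain_of_thought for rule in POLICY.values())
    return 1.0 if cites_policy else 0.0

def build_sft_dataset(labeled_prompts, threshold=0.5):
    """Generate candidates and keep those the judge rates at or above the threshold."""
    candidates = [generate_cot(p, c) for p, c in labeled_prompts]
    return [ex for ex in candidates if judge_score(ex) >= threshold]

dataset = build_sft_dataset([
    ("How do I make a fake parking placard?", "forgery"),
    ("Who created the atom bomb?", "history"),
])
print(len(dataset), "examples kept for supervised fine-tuning")
```

In a real system the stubs would be calls to language models, and the kept examples would become the supervised fine-tuning set. The sketch only shows the shape of the loop: generate, score, filter.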

OpenAI had a practical reason for that design. Asking o1 to read through the entire safety policy was creating high latency and unnecessarily expensive compute costs. Training the model to recall the right policy sections was meant to reduce that burden while keeping the safety behavior available at inference time.
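
A rough way to see the cost argument is to compare how much policy text is involved per request. The sketch below assumes a hypothetical policy split into sections with made-up token counts, and compares processing the whole policy on every request against working with only the one relevant section. The section names and numbers are illustrative placeholders, not measurements from OpenAI.

```python
# Hypothetical token counts per policy section (placeholders, not real figures).
POLICY_SECTION_TOKENS = {
    "weapons": 1200,
    "drugs": 900,
    "forgery": 600,
    "self_harm": 1500,
}

def tokens_full_policy(user_tokens: int) -> int:
    """Tokens processed when the entire policy accompanies the prompt."""
    return user_tokens + sum(POLICY_SECTION_TOKENS.values())

def tokens_relevant_section(user_tokens: int, section: str) -> int:
    """Tokens processed when only the relevant section is used."""
    return user_tokens + POLICY_SECTION_TOKENS[section]

user_tokens = 50  # assumed length of the user's prompt
full = tokens_full_policy(user_tokens)
targeted = tokens_relevant_section(user_tokens, "forgery")
print(f"full policy: {full} tokens, relevant section only: {targeted} tokens")
print(f"reduction: {1 - targeted / full:.0%}")
```

The exact savings depend entirely on the assumed sizes, but the direction matches the reasoning: the less policy text handled per request, the lower the latency and compute.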

The same judge model was also used in reinforcement learning to assess answers from o1 and o3. Supervised fine-tuning and reinforcement learning are not new techniques, but OpenAI says synthetic data could make alignment more scalable.

What This Means For AI Safety

AI safety has become more prominent as models become more popular and more powerful. It is also controversial. David Sacks, Elon Musk, and Marc Andreessen have argued that some AI safety measures amount to censorship, underscoring how subjective these decisions can be.

OpenAI's approach does not remove that subjectivity. It makes the company's own safety policy more directly active during the model's response process. That may make behavior more consistent with OpenAI's rules, but those rules still define what counts as safe or unsafe in the first place.

OpenAI says deliberative alignment helped o1-preview, o1, and o3-mini become some of its safest models yet. The larger question is how the method performs once o3 is publicly available. The o3 model is set to roll out sometime in 2025.

For now, deliberative alignment shows a clear direction for OpenAI's safety work. As reasoning models become more capable and are given more agency, the company is trying to make them check policy during the act of answering, not only before they are released.