The Decoder May 9, 2025 NEUTRAL

New fine-tuning choices push o4-mini toward specialist work

OpenAI is expanding fine-tuning for o4-mini with Reinforcement Fine-Tuning, a method that uses programmable graders to score model answers. It is also adding supervised fine-tuning for GPT-4.1 nano, giving organizations another way to adapt models for fixed input-response tasks.

WTF Index NEUTRAL

◄ Terminator 1 Idiocracy 0 ►

This is mostly a routine model customization update, with only a mild lean toward more capable specialist AI systems.

New fine-tuning choices push o4-mini toward specialist work

OpenAI is widening the ways organizations can adapt its models for specialized work. The company is expanding its fine-tuning program for o4-mini with Reinforcement Fine-Tuning, or RFT, while also making supervised fine-tuning available for GPT-4.1 nano.

The update matters because it separates two common needs. Some teams want a model to learn from scored behavior across complex tasks. Others need a smaller, faster adjustment based on known examples. OpenAI is now offering both routes across different models.

What RFT adds for o4-mini

RFT is aimed at organizations that need language models to perform well in highly specific domains, including law, finance, or security. Instead of training only around fixed answers, the method relies on a programmable "grader" that evaluates each response against custom criteria.

Those criteria can include style, accuracy, or security. OpenAI says multiple graders can be combined, which lets an organization define more nuanced goals than a single pass-or-fail rule would allow.

In practical terms, the model is not simply shown a desired answer and asked to imitate it. It generates possible responses, receives scores from the grader, and learns to favor the kinds of answers that score better. The source article describes this as building on reinforcement learning, the same core technique behind OpenAI's reasoning models like o3.

RFT is available to verified organizations starting today. That makes the feature a more formal option for teams that want to tune o4-mini around internal standards, domain-specific judgment, or structured outputs.

How the training loop works

The RFT process is organized around five main steps. First, a grader is created to define what strong answers should look like. Then training and validation data are uploaded, and the fine-tuning job begins.

During training, the model produces several possible answers for each prompt. Each answer is evaluated by the grader. A policy gradient algorithm then updates the model so it gives priority to responses that receive higher scores.

This setup gives organizations a way to encode evaluation rules into the training process itself. The important shift is that the grading logic becomes part of how the model is improved, rather than something applied only after deployment.

OpenAI also tracks metrics during training, including average reward on both training and validation sets. High-performing checkpoints can be tested individually or resumed as needed. RFT is also integrated with OpenAI's evaluation tools.

A security example shows the intended use

OpenAI demonstrates the approach with a security-focused example. In that case, the model is trained to answer questions about a company's internal security policies.

The expected output is a JSON object with fields for "compliant" and "explanation." The "compliant" field can contain yes, no, or "needs review." Both the compliance decision and the quality of the explanation are graded.

This example shows why RFT can be useful for specialized domains. A correct response may depend on more than matching a single known answer. The model may need to follow a policy, explain its reasoning in an acceptable way, and return information in a structure that another system can read.

Training data must be in JSONL format and include the expected structured outputs. That requirement makes the process more disciplined, but it also means organizations need to prepare their data in a form that reflects the outputs they want the model to produce.

Why rare data matters

OpenAI first introduced RFT as an experimental technique in a research program in December 2024. The early results showed promise in specialized domains, according to the source article.

OpenAI researcher Rohan Pandey says RFT could be especially valuable for vertical startups that train specialized agents on rare data. That point is central to the appeal of the method. When a task depends on uncommon examples or domain-specific rules, ordinary general-purpose behavior may not be enough.

RFT gives those teams a way to turn custom evaluation into a training signal. The grader can reward answers that match an organization's preferred behavior, while checkpoints and validation metrics give teams a way to compare progress during training.

GPT-4.1 nano gets supervised fine-tuning

Alongside the RFT expansion for o4-mini, OpenAI is also offering supervised fine-tuning for GPT-4.1 nano. The source describes GPT-4.1 nano as the fastest and most cost-effective GPT-4 variant.

Supervised fine-tuning is the more traditional option. It uses fixed input-response pairs, making it suitable when an organization already knows the answer format or behavior it wants the model to learn.

Organizations that share their training data with OpenAI receive a 50% discount. Results are available through the standard API, so fine-tuned models can be integrated directly into existing applications.

Taken together, the changes give organizations a broader fine-tuning menu. RFT for o4-mini is built around graders, rewards, checkpoints, and structured evaluation. Supervised fine-tuning for GPT-4.1 nano is built around example pairs and direct API use. The choice depends on whether the task needs scored behavior across complex outputs or a more conventional adjustment from known examples.