The Decoder, September 25, 2024

Can SCoRe make large language models correct themselves?

Google DeepMind researchers have developed SCoRe, a reinforcement learning method that trains a single large language model to improve its own answers using self-generated data. Tests with Gemini 1.0 Pro and 1.5 Flash showed gains on MATH and HumanEval, though the method currently trains for only one round of self-correction.

Google DeepMind researchers have introduced SCoRe, a method designed to help large language models recognize and repair some of their own mistakes. The idea is simple to describe but difficult to train: a model should not only produce an answer, but also make a better second attempt when its first response is flawed.

SCoRe stands for "Self-Correction via Reinforcement Learning." According to the source, it trains a single model with reinforcement learning and relies only on self-generated data, rather than multiple models or external checks.

Why self-correction matters

Current large language models can struggle when asked to correct themselves. They may revise an answer, but that does not automatically mean the revision is better. In many cases, outside verification or more than one model is needed to check whether an attempted correction is actually an improvement.

SCoRe focuses on what the researchers call meaningful positive intrinsic self-correction. In practical terms, the goal is for the same model to produce a first answer, attempt a correction, and improve without external feedback.

That distinction is important. A system that depends on outside checks may still be useful, but it is not the same as a model learning an internal strategy for finding and fixing its own errors. SCoRe is aimed at that internal capability.

How SCoRe trains a model to revise its answers

The method works in two phases. The first phase optimizes the model initialization so that the model can generate corrections on the second try while keeping its first responses close to the behavior of the base model. The source says this phase uses a special loss function that considers both goals.
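
To make the two goals of the first phase concrete, here is a minimal sketch of what a combined objective of this kind could look like. It is an illustration under stated assumptions, not the researchers' actual loss: the function names, the weight `beta`, and the choice of a KL divergence to measure how far first answers drift from the base model are all hypothetical.

```python
import math

def kl_divergence(p, q):
    """KL(p || q) for two discrete distributions given as lists of
    probabilities. Assumes every entry of q is strictly positive."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def stage_one_objective(second_attempt_reward, first_attempt_probs,
                        base_probs, beta=1.0):
    """Hypothetical combined objective (higher is better).

    Goal 1: reward a good second attempt.
    Goal 2: penalize first-attempt behavior that drifts from the base model.
    beta is an assumed weight that trades the two goals off.
    """
    drift = kl_divergence(first_attempt_probs, base_probs)
    return second_attempt_reward - beta * drift

# Toy distributions over three candidate first answers (illustrative only).
base = [0.7, 0.2, 0.1]
unchanged = stage_one_objective(1.0, base, base)           # no drift penalty
shifted = stage_one_objective(1.0, [0.1, 0.2, 0.7], base)  # penalized drift
```

With the same second-attempt reward, the version whose first answers stay at the base distribution scores higher than the one that has drifted, which is the trade-off the phase is described as balancing.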

This setup matters because the researchers are not only trying to make the second answer different. They are trying to preserve the model's normal first-answer behavior while making the follow-up answer more useful when a correction is needed.

The second phase uses multi-stage reinforcement learning. During this stage, the model learns to improve both its first and second answers. A reward function gives more weight to improvements between attempts, which pushes the model toward genuine self-correction rather than superficial changes.
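
A reward that weights improvement between attempts can be sketched in a few lines. The exact form below, including the bonus weight `alpha` and the 0/1 correctness scores, is an assumption chosen for illustration; the source only says that improvements between attempts receive more weight.

```python
def shaped_reward(first_correct, second_correct, alpha=2.0):
    """Hypothetical reward for one two-attempt episode.

    The base reward is the correctness of the second attempt (0 or 1).
    A bonus scaled by alpha is added for the change between attempts,
    so fixing a wrong answer earns more than repeating a right one,
    and breaking a right answer is penalized.
    """
    r1 = 1.0 if first_correct else 0.0
    r2 = 1.0 if second_correct else 0.0
    return r2 + alpha * (r2 - r1)

# The four possible outcomes of an episode.
fixed = shaped_reward(False, True)       # wrong -> right
kept = shaped_reward(True, True)         # right -> right
stuck = shaped_reward(False, False)      # wrong -> wrong
broken = shaped_reward(True, False)      # right -> wrong
```

Under these assumed values, a genuine fix is rewarded most and a regression is rewarded least, so a model cannot score well by simply restating or cosmetically editing its first answer.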

SCoRe also differs from approaches that require external verification. The model creates its own examples by solving problems and then trying to improve its solutions. That self-generated training loop is central to the method described by Google DeepMind researchers.
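
The self-generated loop described above might be organized roughly as follows. This is a sketch only: `generate` is a hypothetical stand-in for the model being trained, and the wording of the correction prompt is invented for the example.

```python
# Assumed wording; the source does not give the actual instruction.
CORRECTION_PROMPT = "There may be an error in the solution above. Please revise it."

def collect_traces(problems, generate):
    """Build two-attempt traces using only the model's own outputs.

    `generate` is a hypothetical callable mapping a prompt string to a
    response string; it stands in for the single model being trained.
    """
    traces = []
    for problem in problems:
        first = generate(problem)
        # The second prompt contains the model's own first attempt, with
        # no hint about whether that attempt was right or wrong.
        second_prompt = f"{problem}\n{first}\n{CORRECTION_PROMPT}"
        second = generate(second_prompt)
        traces.append({"problem": problem, "first": first, "second": second})
    return traces

def toy_model(prompt):
    """Toy stand-in that answers differently once asked to revise."""
    return "revised answer" if CORRECTION_PROMPT in prompt else "draft answer"

traces = collect_traces(["What is 2 + 2?", "Reverse a string."], toy_model)
```

No second model and no external checker appears in the loop: both attempts in each trace come from the same callable.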

What the tests showed

The researchers tested SCoRe with Google's Gemini 1.0 Pro and 1.5 Flash models. The reported gains were measured on two benchmarks: MATH for mathematical reasoning and HumanEval for code generation.

On MATH, self-correction improved by 15.6 percentage points.
On HumanEval, self-correction rose by 9.1 percentage points.
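
One common way to read figures like these is as the change in accuracy between the first and second attempt, expressed in percentage points. The sketch below assumes that definition, which the source does not spell out, and uses made-up toy outcomes rather than any data from the study.

```python
def self_correction_gain(first_correct, second_correct):
    """Return (first-attempt accuracy, second-attempt accuracy, gain),
    all in percentage points, for paired per-problem outcomes."""
    n = len(first_correct)
    acc1 = 100.0 * sum(first_correct) / n
    acc2 = 100.0 * sum(second_correct) / n
    return acc1, acc2, acc2 - acc1

# Toy outcomes for four problems (illustrative only, not reported results).
first = [True, False, False, True]
second = [True, True, False, True]
acc1, acc2, gain = self_correction_gain(first, second)
```

In this toy case one wrong answer is fixed and nothing is broken, so accuracy moves from 50 to 75 and the gain is 25 percentage points. A negative gain would mean the second attempt made things worse on balance.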

Those results suggest that SCoRe can help a model make better use of a second attempt in both mathematical reasoning and code generation. The key point is not merely that the model answers again, but that the later answer is more often an improvement.

The researchers describe SCoRe as the first approach to achieve meaningful positive intrinsic self-correction. As the source presents it, that means a model trained on its own attempts and revisions can improve an answer on the second try without any outside feedback.

The limit of one correction round

SCoRe is not presented as a complete solution to self-correction. The source says it currently trains for one round of self-correction. That means the method is focused on improving a second answer after an initial response, not on an extended chain of repeated revisions.

Future work could explore multiple correction steps. That direction follows naturally from the current limitation: if a model can learn to improve from one attempt to the next, researchers may want to understand whether similar training can support longer correction processes.

Still, the one-round limit is important for interpreting the result. SCoRe shows progress on a defined version of the problem, not proof that large language models can reliably audit themselves across open-ended tasks or unlimited revision cycles.

What SCoRe suggests about future AI training

The researchers conclude that teaching metastrategies such as self-correction requires going beyond standard LLM training. In this case, the added ingredient is multi-stage reinforcement learning, with rewards shaped around improvement between attempts.

That framing points to a broader lesson from the source: better AI behavior may require training models not only to answer, but to practice the strategy of revising. For SCoRe, the target strategy is self-correction, and the training process is built around the model's own generated answers and attempted improvements.

For users, the practical promise is easy to understand. A model that can catch and fix some of its own mistakes without outside help could be more useful in tasks where first answers are often imperfect, such as mathematical reasoning or code generation. The source does not claim that SCoRe solves every reliability problem, but it does show a route for making self-correction a learned behavior rather than an afterthought.