AI models learn to fine-tune themselves with ICM

Researchers working with Anthropic have developed Internal Coherence Maximization, a method that fine-tunes language models using only their own outputs. In several tests, ICM matched or beat training based on human labels, but it only works when the model already has the relevant concept inside its existing knowledge.

Researchers working with AI company Anthropic have described a way for language models to improve themselves without relying on human-written answers or human feedback. The method is called Internal Coherence Maximization, or ICM, and it trains models by asking them to make their own answers fit together more consistently.

The idea matters because the usual path for fine-tuning large language models depends heavily on supervision from people. As models become larger and their tasks become harder to judge, the researchers argue that human oversight can become less dependable. ICM is an attempt to reduce that bottleneck by using the model's own internal structure as the training signal.

How ICM changes fine-tuning

Traditional fine-tuning gives a model external guidance. That guidance can take the form of example answers, feedback, or labels that tell the system which response is preferred. ICM removes that external label source and instead asks the model to examine whether its answers make sense together.

The researchers involved come from Anthropic, Schmidt Sciences, Independent, Constellation, New York University, and George Washington University. Their approach starts from a small set of randomly labeled examples, then lets the model repeatedly evaluate new answers, look for conflicts, and revise its judgments.

In plain terms, ICM treats the model as a system that may already contain useful knowledge but needs a better mechanism for drawing that knowledge out. Rather than asking people to decide every answer, the method pushes the model toward a set of responses that support one another.
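The loop described above can be illustrated with a toy sketch. Everything in it is an assumption made for illustration: the names `TOY_BELIEFS`, `toy_score`, and `coherence_search` are hypothetical, the lookup table stands in for knowledge a real language model would supply, and the greedy "keep a flip only if the score rises" rule is a simplification, not the researchers' actual search procedure.

```python
import random

# Hypothetical stand-in for the model's latent knowledge. A real system would
# query a language model here; this table only makes the sketch runnable.
TOY_BELIEFS = {"2+2=4": True, "2+2=5": False, "3*3=9": True, "3*3=6": False}
# Pairs of claims that cannot both be labeled True.
TOY_CONTRADICTIONS = [("2+2=4", "2+2=5"), ("3*3=9", "3*3=6")]


def toy_score(labels):
    """Toy coherence score: agreement with the stand-in beliefs minus a conflict penalty."""
    agreement = sum(labels[claim] == belief for claim, belief in TOY_BELIEFS.items())
    conflicts = sum(labels[a] and labels[b] for a, b in TOY_CONTRADICTIONS)
    return agreement - 2 * conflicts


def coherence_search(items, score, steps=200, seed=0):
    """Start from random labels and keep single-label flips that raise the score."""
    rng = random.Random(seed)
    labels = {item: rng.choice([True, False]) for item in items}
    best = score(labels)
    for _ in range(steps):
        item = rng.choice(items)
        candidate = dict(labels)
        candidate[item] = not candidate[item]  # revise one judgment
        candidate_score = score(candidate)
        if candidate_score > best:  # accept only improvements (assumed rule)
            labels, best = candidate, candidate_score
    return labels, best


final_labels, final_score = coherence_search(list(TOY_BELIEFS), toy_score)
```

In this toy setup the random starting labels are gradually revised until no pair of contradictory claims is marked true and each label matches what the stand-in "model" already believes.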

The two checks inside the method

ICM rests on two main criteria: mutual predictability and logical consistency. Together, they give the model a way to compare its own responses without needing a separate answer key.

Mutual predictability

Mutual predictability asks whether an answer to a new question can be inferred from answers to similar earlier questions. If the model has handled related cases in a coherent way, it should be able to use that pattern when judging the new case.

This does not mean the model is discovering facts from nowhere. It means the model is trying to organize what it already knows into a consistent pattern. When the pattern holds across multiple examples, each answer becomes easier to justify from the others.
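One way to picture this criterion is as a leave-one-out score: hide each label in turn and ask how likely it looks given all the others. The sketch below is only an illustration under stated assumptions. `mutual_predictability` and `toy_predictor` are hypothetical names, and the predictor is a crude stand-in (a smoothed share of "true" labels on the same topic) for what would really be a language model's probability estimate.

```python
import math


def mutual_predictability(labels, predict_true_prob):
    """Sum of log-probabilities of each label given all the other labeled examples."""
    total = 0.0
    for item, label in labels.items():
        others = {k: v for k, v in labels.items() if k != item}
        p_true = predict_true_prob(item, others)
        p_label = p_true if label else 1.0 - p_true
        total += math.log(max(p_label, 1e-9))  # guard against log(0)
    return total


def toy_predictor(item, others):
    """Hypothetical stand-in for a language model.

    Returns the smoothed share of True labels among other examples on the same topic.
    """
    topic = item[0]
    same_topic = [label for (other_topic, _), label in others.items() if other_topic == topic]
    return (sum(same_topic) + 1) / (len(same_topic) + 2)


# Each example is a (topic, claim) pair; the topic field is an assumption of this sketch.
claims = [("parity", "4 is even"), ("parity", "10 is even"), ("parity", "22 is even")]
coherent = dict.fromkeys(claims, True)
mixed = {claims[0]: True, claims[1]: True, claims[2]: False}

coherent_score = mutual_predictability(coherent, toy_predictor)
mixed_score = mutual_predictability(mixed, toy_predictor)
```

With this toy predictor, the labeling that treats all three related claims the same way scores higher than the one that breaks the pattern, which is the behavior the criterion is meant to reward.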

Logical consistency

Logical consistency focuses on contradictions. If a model says two different solutions to the same math problem are both "correct" even though the results are different, the method treats that as a conflict to avoid.

That second check is important because a model can sound confident while still producing answers that cannot all be true at once. ICM makes those internal clashes part of the training process, so the model is rewarded for converging on a more coherent set of judgments.
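The math-problem case can be written down as a simple pairwise check. This is a minimal sketch, assuming each judgment records a problem ID, a final result, and whether the model labeled the solution correct; the `Judgment` type and `find_conflicts` function are hypothetical names, not part of the published method.

```python
from itertools import combinations
from typing import NamedTuple


class Judgment(NamedTuple):
    problem_id: str
    final_answer: str
    labeled_correct: bool


def find_conflicts(judgments):
    """Return pairs of solutions to the same problem that are both labeled
    correct even though their final results differ."""
    conflicts = []
    for a, b in combinations(judgments, 2):
        same_problem = a.problem_id == b.problem_id
        different_result = a.final_answer != b.final_answer
        both_correct = a.labeled_correct and b.labeled_correct
        if same_problem and different_result and both_correct:
            conflicts.append((a, b))
    return conflicts


judgments = [
    Judgment("q1", "12", True),
    Judgment("q1", "15", True),   # clashes with both "12" solutions
    Judgment("q1", "12", True),
    Judgment("q2", "7", True),
]
conflicts = find_conflicts(judgments)
print(len(conflicts))  # prints 2
```

Two solutions that reach the same result never conflict here, and solutions to different problems are never compared, so only the "15" answer produces clashes.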

Where ICM performed well

The researchers tested ICM on three established benchmarks: TruthfulQA for truthfulness, GSM8K for math accuracy, and Alpaca for helpfulness. Across those tests, ICM performed at least as well as traditional training that used "gold" labels or human supervision.

The Alpaca result stands out because the criteria include subjective ideas such as helpfulness and harmlessness. On that benchmark, ICM outperformed training with human-annotated data. According to the researchers, this suggests that language models may already contain some grasp of these concepts and need a method that activates it effectively.

Another experiment looked at whether a model could determine an author's gender from a text. Humans identified the correct gender 60% of the time, while ICM reached 80% accuracy. The model was not trained specifically for gender detection; it relied on language knowledge it already had.

The reported results point to a narrower but significant claim: for some tasks, internal consistency can provide a training signal that competes with human labels. That does not make human judgment irrelevant, but it does suggest that human labels are not the only route to better model behavior.

From reward model to chatbot

The team also used ICM to train a reward model without human labels. That reward model was then used for reinforcement learning to train the Claude 3.5 Haiku chatbot.

In head-to-head comparisons, the ICM-trained chatbot won 60% of the time against a version trained with human supervision. The study's authors describe this as evidence that ICM can move beyond a research-only setting and work in production settings.

That production angle is central to the broader importance of the work. If a model can help train itself by checking for internal coherence, then fine-tuning may become less dependent on large volumes of human judgment. This could matter most for complex tasks where people disagree, miss subtle issues, or give inconsistent labels.

The limits are just as important

ICM is not a general solution for teaching a model anything. The method only works for concepts the model already knows. If the relevant preference or rule is not already represented inside the model, internal consistency cannot reliably create it.

The researchers showed this with a test involving a personal preference for poems mentioning "sun". ICM failed there, with performance no better than random. That result draws a clear boundary around the method: it can activate existing knowledge, but it cannot replace all forms of external instruction.

The method also has a practical constraint around long inputs. Many examples need to fit inside the model's context window, and that makes longer material harder to handle.
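A rough sketch shows why length matters. It assumes a fixed token budget and approximates token counts by splitting on whitespace, which real tokenizers do not do; `examples_that_fit` is a hypothetical helper, and the budget of 20 is arbitrary.

```python
def examples_that_fit(examples, context_budget, count_tokens=lambda text: len(text.split())):
    """Take examples in order until the next one would exceed the token budget.

    count_tokens defaults to a whitespace word count, a crude assumption
    standing in for a real tokenizer.
    """
    chosen, used = [], 0
    for text in examples:
        cost = count_tokens(text)
        if used + cost > context_budget:
            break
        chosen.append(text)
        used += cost
    return chosen


short_examples = ["two plus two equals four"] * 10                   # 5 words each
long_examples = ["word " * 12 for _ in range(10)]                     # 12 words each

short_fit = examples_that_fit(short_examples, context_budget=20)
long_fit = examples_that_fit(long_examples, context_budget=20)
```

Under the same budget, four of the short examples fit but only one of the long ones does, so the model has far fewer labeled cases to compare against each other.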

Even with those limits, the researchers see ICM as a possible path toward better alignment with human values. Their argument is that models may be trained to follow values more consistently without inheriting human flaws such as bias or inconsistency, especially in areas where human labelers struggle to provide reliable guidance.

One coauthor is safety researcher Jan Leike, who left OpenAI's Superalignment team shortly before its breakup and publicly criticized the company's direction. His involvement gives the work an added connection to ongoing debates about how advanced language models should be supervised, aligned, and evaluated.