Microsoft Research Asia has introduced rStar-Math, a training method designed to make small language models much stronger at math. The core idea is straightforward: instead of relying only on final answers, the system explores multiple ways to solve a problem, checks the reasoning with executable code, and learns from the paths that work.
The result is a notable shift in what smaller AI systems can do. After training with rStar-Math, the 7-billion-parameter Qwen2.5-Math-7B model achieved 90% accuracy on the MATH benchmark, improving by about 30 percentage points from its starting point and coming in 4.5 percentage points higher than OpenAI's o1-preview. A smaller 1.5-billion-parameter model reached 88.6% accuracy.
How rStar-Math searches for better answers
At the center of rStar-Math is Monte Carlo Tree Search (MCTS), a method also used in Google DeepMind's AlphaZero and similar systems. MCTS lets a system explore different solution paths instead of committing immediately to one answer.
For math, that matters because a model can make a plausible early step and still end up wrong. rStar-Math tries to reduce that risk by looking across alternatives and learning which paths tend to produce stronger solutions.
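To make the idea of weighing alternatives concrete, here is a minimal sketch of the UCT selection rule that many MCTS implementations use. The article does not give rStar-Math's exact selection formula, so the function names, the `children` layout, and the exploration constant `c` are assumptions for illustration only.

```python
import math

def uct_score(total_reward, visits, parent_visits, c=1.4):
    """UCT score: average reward so far plus an exploration bonus.

    Illustrative only; `c` is an assumed constant, not a value from the paper.
    """
    if visits == 0:
        return math.inf  # candidate steps that were never tried go first
    exploit = total_reward / visits
    explore = c * math.sqrt(math.log(parent_visits) / visits)
    return exploit + explore

def select_step(children):
    """Pick the candidate next step with the highest UCT score.

    `children` maps a step label to a (total_reward, visits) tuple.
    """
    parent_visits = max(sum(v for _, v in children.values()), 1)
    return max(children, key=lambda s: uct_score(*children[s], parent_visits))

if __name__ == "__main__":
    # Step "b" has a lower average reward but was tried only once,
    # so the exploration bonus can still favor it.
    print(select_step({"a": (5.0, 10), "b": (0.4, 1)}))
```

The exploration term is what keeps the search from locking onto a plausible early step: rarely visited branches keep getting a chance until their statistics say otherwise.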
What makes the method distinctive is the way it pairs plain English explanations with Python code. For each step, the model has to explain its reasoning and also write working code that can validate the approach. The researchers call this a "code-augmented chain-of-thought" approach.
The Python code is not just decorative. It carries the explanation in its comments, and if the code fails to run properly, the solution is rejected. That creates an automatic verification loop, which is especially useful for math word problems, where a solution can be checked clearly.
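The run-or-reject check can be sketched in a few lines. This is a simplified illustration, not the researchers' implementation: `step_runs` and `filter_candidates` are hypothetical helpers, and a real system would execute model-written code in a sandbox rather than with a bare `exec`.

```python
def step_runs(code, env):
    """Return True if a step's Python code executes without raising.

    `env` carries variables defined by earlier steps.
    NOTE: bare exec is unsafe for untrusted code; a sandbox is assumed in practice.
    """
    try:
        exec(code, env)
        return True
    except Exception:
        return False

def filter_candidates(previous_steps, candidates):
    """Keep only candidate steps whose code runs after the earlier steps."""
    kept = []
    for code in candidates:
        env = {}  # fresh namespace so candidates cannot affect each other
        if all(step_runs(s, env) for s in previous_steps) and step_runs(code, env):
            kept.append(code)
    return kept

if __name__ == "__main__":
    earlier = ["total = 3 * 4  # Step 1: three boxes of four apples"]
    options = [
        "left = total - 5  # Step 2: five apples are eaten",
        "left = total / 0  # Step 2: faulty arithmetic",
        "left = apples - 5  # Step 2: uses an undefined name",
    ]
    print(filter_candidates(earlier, options))
```

Note that this only checks that the code runs. Whether a step that runs also leads to the right answer is a separate question that the search and the evaluation model have to settle.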
Why self-assessment is central to the method
rStar-Math also uses a special evaluation model called the Process Preference Model (PPM). Rather than making simple yes-or-no judgments, the PPM compares alternative solutions and learns which approaches are more effective.
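One common way to train a model on comparisons rather than yes-or-no labels is a pairwise (Bradley-Terry style) loss, sketched below. The article does not spell out the PPM's training objective, so treat this as an assumed stand-in: `pairwise_preference_loss` and its two score arguments are hypothetical names.

```python
import math

def pairwise_preference_loss(score_preferred, score_rejected):
    """Bradley-Terry style loss: -log(sigmoid(score_preferred - score_rejected)).

    The loss is small when the preferred step is scored well above the
    rejected one, and grows when the ordering is wrong.
    """
    margin = score_preferred - score_rejected
    # Two algebraically equal forms, chosen to avoid overflow in exp().
    if margin >= 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))

if __name__ == "__main__":
    print(pairwise_preference_loss(2.0, 0.0))  # correct ordering, low loss
    print(pairwise_preference_loss(0.0, 2.0))  # wrong ordering, higher loss
```

The design point is that the model never needs an absolute score for a reasoning step. It only has to learn to rank a better step above a worse one.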
The training process happens in four rounds, beginning with 747,000 math problems. In each round, both the main model and the evaluation model improve. The system creates verified solutions, and those solutions help train the next generation of models.
This is one of the most important parts of the work. rStar-Math does not depend on copying answers from larger language models. It learns from its own successful attempts, using verified solutions as training data for later rounds.
As the rounds continue, the system tackles more complex problems and produces stronger solutions. That points to a broader idea: small language models may be able to improve substantially when they can generate, test, and reuse high-quality reasoning traces.
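The round-based loop described above can be sketched with toy stand-ins. Everything here is hypothetical: `generate`, `is_verified`, and `train` are placeholder callables, and the "model" is just an integer skill level, chosen so the example runs without any ML dependencies.

```python
def self_improve(problems, generate, is_verified, train, model, rounds=4):
    """Generate solutions, keep the verified ones, retrain, and repeat."""
    for _ in range(rounds):
        verified = []
        for problem in problems:
            for solution in generate(model, problem):
                if is_verified(problem, solution):
                    verified.append((problem, solution))
        # Only the model's own verified solutions feed the next round.
        model = train(model, verified)
    return model

# Toy stand-ins: a problem is its difficulty (an int), and a model with
# skill k can solve problems up to difficulty k + 1.
def toy_generate(skill, problem):
    return [problem] if problem <= skill + 1 else []

def toy_is_verified(problem, solution):
    return solution == problem

def toy_train(skill, verified):
    return max([skill] + [problem for problem, _ in verified])

if __name__ == "__main__":
    print(self_improve([1, 2, 3], toy_generate, toy_is_verified, toy_train, model=0))
```

In the toy version, each round unlocks slightly harder problems because the previous round's verified solutions raised the skill level, which mirrors the progression the article describes.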
Where the results stand out
The benchmark results are the clearest evidence of the method's impact. The 7-billion-parameter Qwen2.5-Math-7B model reached 90% accuracy on MATH after rStar-Math training. The 1.5-billion-parameter model reached 88.6% accuracy.
The system also performed well on the 2024 American Invitational Mathematics Examination (AIME), a qualifying exam for the US Mathematical Olympiad. It solved 8 out of 15 problems on average (about 53%), matching the performance of the top 20% of student participants.
These results matter because they show that size is not the only path to better math performance. rStar-Math uses training design, search, verification, and repeated self-improvement to push smaller models toward stronger outcomes.
- Monte Carlo Tree Search helps explore multiple solution paths.
- Python verification filters out solutions that do not run correctly.
- Process Preference Model compares steps and learns which reasoning paths work better.
- Four training rounds let the system improve using its own verified solutions.
The trade-off is more computation
rStar-Math improves accuracy by spending more computation during inference. Like OpenAI's o-series models, it makes multiple solution attempts before settling on an answer. The researchers tested how this test-time compute scales for rStar-Math.
With just four solution attempts, rStar-Math outperforms o1-preview and comes close to o1-mini. Performance continues to improve as the system makes more attempts, up to 64 per problem.
The gains are not identical across every type of math task. For MATH, AIME, and Math Olympiad problems, improvements level off around 64 attempts. College math problems continue to improve beyond that point.
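The scaling behavior follows from a simple pattern: sample several attempts and keep the one the scorer rates highest. The sketch below is a generic best-of-N selection under assumed names (`sample_solution` and `score` are hypothetical callables), not the paper's actual inference code.

```python
import random

def best_of_n(sample_solution, score, n, rng):
    """Draw n candidate solutions and return the highest-scoring one."""
    attempts = [sample_solution(rng) for _ in range(n)]
    return max(attempts, key=score)

if __name__ == "__main__":
    # Toy setup: a "solution" is a random quality value in [0, 1),
    # and the scorer sees that quality directly.
    def draw(rng):
        return rng.random()

    def identity(quality):
        return quality

    for n in (1, 4, 16, 64):
        print(n, best_of_n(draw, identity, n, random.Random(0)))
```

More attempts can only help as long as the scorer picks well, but each extra attempt costs a full generation plus an evaluation, and the gain from each additional sample shrinks as the best candidate so far gets better.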
This creates a practical limitation. Running and evaluating dozens of attempts for each problem can be time-consuming and computationally expensive, a cost that also shows up with OpenAI's expensive o3 model.
What rStar-Math cannot do yet
The same verification system that makes rStar-Math strong also narrows where it can be applied. It works well when there is a clear right or wrong answer and when code can check the reasoning. That is harder to transfer to tasks such as text comprehension, where correctness is less cleanly defined.
The system also cannot yet handle geometry problems because it currently lacks the ability to process visual information. That limitation matters for math domains where diagrams are part of the problem.
Even so, the researchers see potential beyond math word problems. They identify programming tasks and common-sense reasoning as areas where similar verification mechanisms may work well.
rStar-Math also fits into Microsoft's broader focus on smaller, more efficient AI models that can reduce development and operating expenses. The company recently released its 14-billion-parameter Phi-4 model as open source under the MIT license. The rStar-Math team plans to share its code and data with the research community. Project lead Li Lyna Zhang notes on Hugging Face that a GitHub repository exists but will remain private until the internal approval process is complete.