The Decoder, September 24, 2024

Why OpenAI's o1 still struggles with reliable planning

Arizona State University researchers tested OpenAI's o1 on PlanBench and found a major jump over conventional large language models. But the same results show sharp limits: accuracy falls on longer tasks, unsolvable problems remain difficult, and testing was costly.

The story shows improved AI planning capability, but mainly emphasizes current reliability limits rather than immediate danger.

OpenAI's o1 appears to mark a real step forward in AI planning, but a new independent test also shows why benchmark success is not the same as dependable reasoning. Researchers at Arizona State University evaluated the model with PlanBench, a benchmark built to test whether AI systems can form correct plans rather than simply produce plausible text.

The findings are unusually clear in both directions. o1 performed far better than conventional large language models on several planning tasks. Yet it still failed in ways that matter for reliability, especially when problems became longer, more heavily obfuscated, or impossible to solve.

What PlanBench Tests

PlanBench was developed in 2022 to measure planning ability in AI systems. The benchmark includes 600 tasks from the Blocksworld domain, a controlled setting where blocks must be stacked in a specified order.

Blocksworld is simple on the surface, but it is useful because a system has to produce a sequence of steps that actually reaches a goal. A wrong move can make a plan invalid, and a confident answer is not enough. The model must get the sequence right.
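To make "a wrong move can make a plan invalid" concrete, here is a minimal sketch of a plan checker. It assumes a simplified Blocksworld in which each action moves one uncovered block onto the table or onto another uncovered block. The state representation and the names `validate_plan` and `is_clear` are hypothetical and chosen for illustration; the benchmark's own task encoding may differ.

```python
from typing import Dict, List, Tuple

State = Dict[str, str]   # block -> what it rests on ("table" or another block)
Move = Tuple[str, str]   # (block to move, destination: "table" or another block)


def is_clear(state: State, block: str) -> bool:
    """A block is clear when no other block rests on it."""
    return all(support != block for support in state.values())


def validate_plan(start: State, goal: State, plan: List[Move]) -> bool:
    """Return True only if every move is legal and the final state meets the goal."""
    state = dict(start)  # copy so the caller's state is left untouched
    for block, dest in plan:
        if block not in state or block == dest:
            return False  # unknown block, or a block moved onto itself
        if not is_clear(state, block):
            return False  # cannot move a block that has something on top
        if dest != "table" and (dest not in state or not is_clear(state, dest)):
            return False  # destination must be an existing, uncovered block
        state[block] = dest
    # The goal may be partial: only the listed blocks are checked.
    return all(state.get(block) == support for block, support in goal.items())


if __name__ == "__main__":
    # Assumed example: C sits on A; A and B are on the table.
    start = {"A": "table", "B": "table", "C": "A"}
    goal = {"A": "B", "B": "C"}
    print(validate_plan(start, goal, [("C", "table"), ("B", "C"), ("A", "B")]))
    print(validate_plan(start, goal, [("A", "B")]))  # A is still covered by C
```

A checker like this accepts a plan only when every step is legal and the end state satisfies the goal, which is why a fluent but slightly wrong answer scores zero.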

In the Arizona State University study, o1 is described as a Large Reasoning Model, or LRM, a label that sets it apart from conventional large language models. The comparison was not limited to smaller or weaker systems: it included LLaMA 3.1 405B, which had been the previous best language model on the benchmark.

Where o1 Made Real Progress

On standard Blocksworld tasks, o1 reached 97.8% accuracy. That result was far above LLaMA 3.1 405B, which solved 62.6% of the same type of task.

The study also looked at Mystery Blocksworld, an obfuscated version of the task designed to be harder. In that setting, o1 reached 52.8% accuracy, while conventional models almost entirely failed. This result is important because it suggests o1 was not only benefiting from the familiar surface form of the original benchmark.

The researchers then tested a new randomized variant. The point was to reduce the chance that o1's result came from benchmark data appearing in its training set. On this randomized test, o1 dropped to 37.3% accuracy. Even so, that score was still far above older models, which scored near zero.
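One way to picture a randomized variant is as a consistent renaming: the structure of the problem stays the same, but familiar words are replaced with meaningless tokens that are unlikely to appear in any training set. The sketch below illustrates that general idea only. The article does not describe the study's actual scheme, so the helper names, the token length, and the line-based problem format are all assumptions.

```python
import random
import string
from typing import Dict, List


def random_token(rng: random.Random, length: int = 8) -> str:
    """Generate a meaningless lowercase token."""
    return "".join(rng.choice(string.ascii_lowercase) for _ in range(length))


def build_mapping(names: List[str], rng: random.Random) -> Dict[str, str]:
    """Assign each original name a distinct random token."""
    mapping: Dict[str, str] = {}
    used = set()
    for name in names:
        token = random_token(rng)
        while token in used:  # very unlikely, but keeps the mapping one-to-one
            token = random_token(rng)
        used.add(token)
        mapping[name] = token
    return mapping


def obfuscate(problem: List[str], mapping: Dict[str, str]) -> List[str]:
    """Rename every mapped word in each statement; other words pass through."""
    return [
        " ".join(mapping.get(word, word) for word in line.split())
        for line in problem
    ]


if __name__ == "__main__":
    rng = random.Random(0)  # fixed seed so the renaming is repeatable
    # Hypothetical vocabulary and problem statements, for illustration only.
    mapping = build_mapping(["stack", "unstack", "on", "clear"], rng)
    for line in obfuscate(["on a b", "clear a", "stack c a"], mapping):
        print(line)
```

Because the mapping is one-to-one, any plan for the renamed problem translates back to a plan for the original. A solver that truly reasons over the structure should, in principle, be unaffected by the renaming.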

Taken together, the numbers show a model that can do more than previous language models in this planning setup. The improvement is not marginal. It appears across the original task, the obfuscated version, and the randomized variant, even though performance varies sharply by condition.

The Weakness Shows Up as Tasks Get Longer

The strongest results came from simpler planning cases. When problems required more steps, o1's performance fell sharply.

On problems requiring 20 to 40 planning steps, o1's accuracy fell to 23.63%, down from the 97.8% it reached on the standard Blocksworld set. That decline is the central caution in the study. A model can look strong on short examples and still be unreliable when the chain of actions becomes longer.

This matters because planning is not only about recognizing what should happen next. It is about maintaining a correct sequence from start to finish. As the number of steps grows, small errors have more chances to appear, and the final answer can still look complete even when it cannot work.
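A rough way to see why length hurts is to assume that each step is correct with some fixed probability and that errors are independent. Under that assumption, the chance that the whole plan is correct is the per-step probability raised to the number of steps. This is an illustrative model, not something the study measured, and the 0.95 figure below is a made-up value.

```python
def plan_success_probability(per_step_accuracy: float, steps: int) -> float:
    """Chance that all `steps` moves are correct, assuming independent errors."""
    if not 0.0 <= per_step_accuracy <= 1.0:
        raise ValueError("per_step_accuracy must be between 0 and 1")
    if steps < 0:
        raise ValueError("steps must be non-negative")
    return per_step_accuracy ** steps


if __name__ == "__main__":
    # Hypothetical per-step accuracy, chosen only for illustration.
    for steps in (5, 20, 40):
        print(steps, round(plan_success_probability(0.95, steps), 3))
```

Even a high per-step accuracy shrinks quickly when multiplied over dozens of steps. Real models do not make independent errors, so this should be read as intuition rather than as a prediction.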

The study also found that o1 struggled with unsolvable tasks. It correctly recognized them only 27% of the time. In 54% of cases, it produced complete plans that were impossible.

That failure mode is especially important for practical use. A system that cannot reliably say when no valid plan exists may produce an answer that looks useful while hiding the fact that the task cannot be completed as requested.

Why Accuracy Alone Is Not Enough

The researchers called o1's benchmark performance a "quantum improvement," but they also emphasized that the model does not provide guarantees that its solutions are correct. That is a major dividing line between a reasoning model and a classic planning algorithm.

The study compared o1 with classic planning algorithms such as Fast Downward. Those algorithms achieved perfect accuracy with much shorter computation times. That contrast does not erase o1's progress, but it frames it more precisely: o1 is much better than conventional large language models in this benchmark, while still not matching specialized planning methods on correctness and efficiency.
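The guarantee that classic planners offer can be illustrated with a plain breadth-first search. This is not Fast Downward, which works on formal planning descriptions and uses far more sophisticated heuristic search. It is a minimal sketch, under the assumption of a simplified Blocksworld where each action moves one uncovered block, and it is only practical for a handful of blocks. The function names and the state encoding are hypothetical.

```python
from collections import deque
from typing import Dict, FrozenSet, List, Optional, Tuple

State = FrozenSet[Tuple[str, str]]  # set of (block, support) pairs
Move = Tuple[str, str]              # (block to move, destination)


def successors(state: State) -> List[Tuple[Move, State]]:
    """Enumerate every legal single-block move from a state."""
    on = dict(state)
    clear = [b for b in on if b not in on.values()]  # blocks with nothing on top
    result = []
    for block in clear:
        for dest in clear + ["table"]:
            if dest == block or on[block] == dest:
                continue  # skip self-moves and moves that change nothing
            nxt = dict(on)
            nxt[block] = dest
            result.append(((block, dest), frozenset(nxt.items())))
    return result


def plan(start: Dict[str, str], goal: Dict[str, str]) -> Optional[List[Move]]:
    """Breadth-first search: a shortest plan, or None if the goal is unreachable."""
    first = frozenset(start.items())
    queue = deque([(first, [])])
    seen = {first}
    while queue:
        state, path = queue.popleft()
        on = dict(state)
        if all(on.get(block) == support for block, support in goal.items()):
            return path
        for move, nxt in successors(state):
            if nxt not in seen:
                seen.add(nxt)
                queue.append((nxt, path + [move]))
    return None  # every reachable state was checked, so no valid plan exists


if __name__ == "__main__":
    start = {"A": "table", "B": "table", "C": "A"}
    print(plan(start, {"A": "B", "B": "C"}))  # a solvable goal
    print(plan(start, {"A": "B", "B": "A"}))  # contradictory goal: no plan
```

Because the search visits every reachable state before giving up, a `None` result is a proof that no plan exists, and any plan it returns is valid by construction. That is the kind of guarantee a text-generating model does not provide on its own.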

Cost was another part of the evaluation. Running the tests cost nearly $1,900. By comparison, the source article notes that classic algorithms can run on standard computers at virtually no cost.

The researchers therefore argue that fair comparisons of AI systems should look beyond headline accuracy. The relevant factors include:

accuracy on the benchmark task;
efficiency and computation time;
cost to run the evaluation;
reliability when problems become longer or impossible.

What the Results Mean for AI Planning

The study's conclusion is not that o1 fails at planning. It is that o1 shows meaningful progress without yet delivering robust planning ability.

That distinction is useful. Compared with conventional LLMs, o1 performs much better on standard Blocksworld and shows progress on obfuscated versions. But the gains do not generalize evenly across harder conditions. Longer problems and unsolvable instances expose weaknesses that simple benchmark scores can hide.

For anyone evaluating AI reasoning, the lesson is direct: a strong benchmark result should be read together with stress tests. The same model can be impressive on one version of a task and much less dependable on another.

OpenAI's o1, in this study, sits in that middle ground. It is a notable advance over earlier language models on PlanBench, yet it remains far from a planning system that can be trusted to always know when a plan is correct, efficient, or even possible.