Snowflake’s hands-on coding benchmark puts GLM-5.2 in a position that matters for the AI market: not clearly superior to Opus 4.7, but close enough to make price impossible to ignore.
The result is especially relevant because the test focused on programming work, one of the flagship use cases where major AI companies are trying to prove value. In Snowflake’s comparison, Opus 4.7 remained the stronger performer on consistency and efficiency. GLM-5.2, however, came close on overall task completion when it was allowed multiple attempts.
A near tie after three attempts
The benchmark covered 103 tasks. Each task was run three times, and the models had to produce code that worked on both DuckDB and Snowflake.
When given three attempts per task, the two models ended up almost even. GLM-5.2 solved 66% of tasks, while Opus 4.7 solved 67%.
That narrow gap is the headline result. It suggests that, at least in this Snowflake benchmark, GLM-5.2 can deliver competitive coding outcomes when retries are part of the workflow.
Retries matter because many real development workflows already involve checking, revising, and running code more than once. A model that misses on the first try but converges on a later one may still be useful, depending on the cost and time involved.
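To make the difference between first-try and multi-attempt scoring concrete, here is a minimal sketch of how the two rates could be computed from per-task results. The task names, outcomes, and the `solved_within` helper are hypothetical illustrations, not Snowflake's harness or its data.

```python
from typing import Dict, List


def solved_within(results: Dict[str, List[bool]], k: int) -> float:
    """Fraction of tasks with at least one passing attempt among the first k."""
    solved = sum(any(attempts[:k]) for attempts in results.values())
    return solved / len(results)


# Hypothetical outcomes: one list of pass/fail flags per task, in attempt order.
example = {
    "task_a": [True, True, True],     # passes immediately
    "task_b": [False, True, False],   # recovers on the second attempt
    "task_c": [False, False, True],   # recovers on the third attempt
    "task_d": [False, False, False],  # never solved
}

first_try = solved_within(example, 1)
within_three = solved_within(example, 3)
print(f"first attempt: {first_try:.0%}, within three attempts: {within_three:.0%}")
```

In this toy data, one of four tasks passes on the first attempt and three of four pass within three, which is why a model can look much closer to a rival once retries are counted.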
Opus 4.7 still looks more dependable
The closer result after three attempts does not mean the models behaved the same way. Opus 4.7 had a clear edge on first-attempt accuracy, reaching 53.7% compared with GLM-5.2 at 47.6%.
That difference points to a practical advantage. A model that solves more tasks on the first try can reduce review time, repeated tool use, and operational friction.
Snowflake also found that GLM-5.2 required more work to reach its results. The model averaged 99 runs per task, compared with 80 for Opus 4.7. GLM-5.2 also used 860 million tokens, nearly double Opus 4.7’s 439 million.
Those numbers matter because token use is not just a technical metric. It affects speed, infrastructure demand, and final cost. A cheaper model can lose part of its price advantage if it needs many more tokens to finish the same work.
Where GLM-5.2 helped, and where it struggled
According to Snowflake CEO Sridhar Ramaswamy, GLM-5.2 showed strength in validating code across DuckDB and Snowflake at the same time. That ability helps explain why some tasks were solved only by GLM-5.2.
But the same benchmark also exposed weaknesses. GLM-5.2 sometimes stopped too soon, and in other cases spent too much effort checking the wrong details.
One task made that tradeoff clear. GLM-5.2 made 411 tool calls in 24 minutes while checking row counts, distributions, null values, and column types, yet failed all three attempts. Opus 4.7 solved that same task with 49 calls in 9 minutes.
Ramaswamy also said the claim that GLM produces cleaner code did not hold up in the test. The broader lesson is straightforward: more checks do not automatically create more correct answers.
Even with those caveats, Snowflake’s team is excited about GLM-5.2 and wants to make it available to customers.
The pricing gap changes the calculation
The benchmark becomes more important when pricing enters the picture. GLM-5.2 costs $1.40 per million input tokens and $4.40 per million output tokens, according to Zhipu’s official price sheet.
By comparison, Claude Opus 4.7 costs $5 per million input tokens and $25 per million output tokens. GPT-5.5 costs $5 and $30, respectively.
That is a major difference for coding workloads that can generate large volumes of tokens. GLM-5.2’s higher token usage eats into the gap somewhat, but the source comparison still frames it as serious pricing pressure for Anthropic and OpenAI.
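A rough sketch of that arithmetic, using the list prices and benchmark token totals quoted in this article. The article does not give the split between input and output tokens, so `output_share` is a hypothetical parameter. The sketch also assumes the same share for both models, that the token totals cover input plus output, and that no caching or volume discounts apply.

```python
# List prices in USD per million tokens (input, output), as quoted in the article.
PRICES = {"GLM-5.2": (1.40, 4.40), "Opus 4.7": (5.00, 25.00)}

# Total benchmark tokens in millions, as quoted in the article.
TOKENS_M = {"GLM-5.2": 860, "Opus 4.7": 439}


def estimated_cost(model: str, output_share: float) -> float:
    """Estimated USD cost if `output_share` of the tokens are output tokens.

    `output_share` is an assumption; the real split is not published here.
    """
    in_price, out_price = PRICES[model]
    blended = (1 - output_share) * in_price + output_share * out_price
    return TOKENS_M[model] * blended


for share in (0.0, 0.2, 1.0):
    glm = estimated_cost("GLM-5.2", share)
    opus = estimated_cost("Opus 4.7", share)
    print(f"output share {share:.0%}: GLM ${glm:,.0f}, Opus ${opus:,.0f}, "
          f"ratio {glm / opus:.2f}")
```

Under these assumptions, GLM-5.2's estimated bill lands between roughly 34% and 55% of Opus 4.7's, depending on the output share, even though it used nearly twice as many tokens.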
The pressure is sharper because coding is central to the business case for advanced AI models. If a lower-cost model can get close enough on real programming tasks, buyers may start asking harder questions about when premium models are necessary and when cheaper alternatives are sufficient.
Why this matters beyond one benchmark
Snowflake’s test does not erase the advantages of Opus 4.7. First-attempt accuracy, fewer runs, fewer tokens, and faster completion on difficult tasks are meaningful strengths.
But the GLM-5.2 result shows how the competitive picture can shift when performance is viewed alongside price. A model can be less efficient and still attractive if its unit cost is low enough and its results are close enough for the task.
That is why the benchmark has implications beyond Snowflake’s internal comparison. The source article frames low pricing from Chinese model makers as a challenge to the Western AI market, especially to companies whose valuations depend on fast revenue growth.
If lower-cost models put pressure on revenue growth, the AI market faces a stress test. OpenAI’s and Anthropic’s valuations rest on the assumption that revenue keeps climbing fast, and those valuations are tied to billions in bets on AI infrastructure buildout, from data centers to chip orders.
For now, the cleanest reading is this: Opus 4.7 remains better in Snowflake’s benchmark, but GLM-5.2 is close enough, and cheap enough, to change the pricing conversation.