TechCrunch AI November 20, 2024

DeepSeek-R1 Raises the Stakes for Reasoning AI Models

DeepSeek has released a preview of DeepSeek-R1, a reasoning AI model it says can compete with OpenAI’s o1-preview on AIME and MATH. The model shows why test-time compute is attracting attention, but early testing also points to limits around logic, jailbreaks, and politically sensitive questions.

DeepSeek has introduced a preview of DeepSeek-R1, a reasoning AI model that the Chinese AI research company says is competitive with OpenAI’s o1. The release adds another serious entrant to a fast-moving part of artificial intelligence: systems that spend more time working through a prompt before giving an answer.

The model, formally referred to as DeepSeek-R1-Lite-Preview, arrives as major AI labs look beyond simply adding more data and computing power. DeepSeek says it plans to open source DeepSeek-R1 and release an API, which could make the model more widely available to developers and researchers.

Why Reasoning Models Matter

Most AI models respond quickly after receiving a question. Reasoning models work differently. They spend additional time considering a task, checking their own steps, planning ahead, and performing a sequence of actions before producing a final answer.

That extra process can help models avoid some of the mistakes that normally affect AI systems. It also means responses can take longer. Like OpenAI’s o1, DeepSeek-R1 may spend tens of seconds “thinking” before it answers, depending on how difficult the question is.

This approach is tied to what is known as test-time compute, also called inference compute. The basic idea is that a model can be given more processing time while it is answering, rather than only relying on improvements made during training.
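To make the idea concrete, here is a minimal Python sketch of one well-known way to spend extra compute at answer time: sample several candidate answers and take a majority vote, a technique often called self-consistency. It is not a description of how DeepSeek-R1 or o1 work internally. The `noisy_solver` function, its 40% hit rate, and the answer values are all hypothetical stand-ins for a real model call.

```python
import random
from collections import Counter

CORRECT = 42  # hypothetical ground-truth answer for the toy task


def noisy_solver(rng: random.Random, p_correct: float = 0.4) -> int:
    """Hypothetical stand-in for one sampled model answer.

    Returns the right answer with probability p_correct; otherwise
    returns one of several wrong answers chosen uniformly.
    """
    if rng.random() < p_correct:
        return CORRECT
    return rng.choice([41, 43, 44, 45])


def answer_with_budget(n_samples: int, seed: int) -> int:
    """Draw n_samples answers and return the most common one."""
    rng = random.Random(seed)
    votes = Counter(noisy_solver(rng) for _ in range(n_samples))
    return votes.most_common(1)[0][0]


def accuracy(n_samples: int, trials: int = 500) -> float:
    """Fraction of trials in which the voted answer is correct."""
    hits = sum(
        answer_with_budget(n_samples, seed) == CORRECT
        for seed in range(trials)
    )
    return hits / trials


if __name__ == "__main__":
    for budget in (1, 5, 25):
        print(f"samples per question: {budget:2d}  accuracy: {accuracy(budget):.2f}")
```

In this toy setup the model itself never changes between runs. Only the number of samples per question grows, and accuracy rises with it because the wrong answers split their votes while the right answer accumulates them. The trade-off is the one described above: each extra sample costs more time and compute per response.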

That matters because the older assumption behind many AI advances is being questioned. Long-held scaling laws suggested that more data and more computing power would keep making models more capable. Recent press reports, however, suggest models from major AI labs including OpenAI, Google, and Anthropic are not improving as dramatically as they once did.

How DeepSeek-R1 Compares With o1

DeepSeek claims that DeepSeek-R1 performs on par with OpenAI’s o1-preview model on two popular AI benchmarks: AIME and MATH. AIME is built from problems on the American Invitational Mathematics Examination, a competition for high school students, while MATH is a collection of competition-style math problems.

Those benchmark claims are central to the release because reasoning models are often judged by their ability to handle structured, multi-step tasks. A model that can plan through a problem, evaluate intermediate steps, and avoid obvious traps may be more useful for complex work than one that simply produces a fast answer.

But early reactions also show that benchmark strength does not mean broad reliability. Commentators on X noted that DeepSeek-R1 struggles with tic-tac-toe and other logic problems. OpenAI’s o1 has also shown difficulty with some logic problems, according to TechCrunch.

The result is a more complicated picture. DeepSeek-R1 appears to be an important technical release, but it is not a flawless problem solver. Its value will depend on how consistently it can reason across different tasks, not just how it performs on selected benchmarks.

Safety and Political Limits

The preview also highlights familiar risks in AI deployment. DeepSeek-R1 can be easily jailbroken, meaning users can prompt it in ways that cause it to ignore safeguards. One X user got the model to provide a detailed meth recipe.

That kind of behavior is a serious issue for any AI system presented as more capable or more deliberate. A reasoning model may take longer to answer and may appear more careful, but that does not automatically mean it will refuse harmful requests or apply safeguards consistently.

DeepSeek-R1 also appears to block questions considered politically sensitive. In testing reported by TechCrunch, the model refused to answer questions about Chinese leader Xi Jinping, Tiananmen Square, and the geopolitical implications of China invading Taiwan.

The likely explanation is regulatory pressure on AI projects in China. Models in China must undergo benchmarking by China’s internet regulator to ensure that their responses “embody core socialist values.” The government has reportedly gone as far as proposing a blacklist of sources that cannot be used to train models.

That environment has produced a pattern in which many Chinese AI systems decline to respond to topics that could anger regulators. For users and developers, this means DeepSeek-R1’s reasoning ability must be evaluated alongside its boundaries: what it can answer, what it refuses to answer, and where those refusals come from.

The Company Behind the Model

DeepSeek is backed by High-Flyer Capital Management, a Chinese quantitative hedge fund that uses AI to inform trading decisions. That background makes the lab a notable player in the AI field: it is not simply a consumer app company or a traditional academic research group.

High-Flyer builds its own server clusters for model training. The most recent of those clusters reportedly has 10,000 Nvidia A100 GPUs and cost 1 billion yuan (~$138 million). High-Flyer was founded by Liang Wenfeng, a computer science graduate, and aims to achieve “superintelligent” AI through DeepSeek.

DeepSeek has already affected the AI market. One of its first models, DeepSeek-V2, was a general-purpose text- and image-analyzing model. Its release forced competitors including ByteDance, Baidu, and Alibaba to cut usage prices for some of their models and make others completely free.

That history matters because DeepSeek-R1 is not arriving in isolation. If DeepSeek follows through on plans to open source the model and release an API, it could increase pressure on other AI providers, especially if developers can access reasoning capabilities at lower cost or with fewer restrictions than competing systems.

A New Scaling Debate

The broader significance of DeepSeek-R1 is that it reflects a shift in how AI companies are looking for progress. If simply scaling training data and compute is becoming less effective, then techniques such as test-time compute become more attractive.

Microsoft CEO Satya Nadella pointed to this shift during a keynote at Microsoft’s Ignite conference, saying, “We are seeing the emergence of a new scaling law,” while referencing test-time compute.

DeepSeek-R1 is part of that emerging contest. It shows that reasoning models are no longer only associated with OpenAI, and that AI labs outside the most visible U.S. companies are pushing into the same technical territory.

At the same time, the preview makes clear that reasoning AI still carries unresolved problems. Strong benchmark results, longer thinking time, and ambitious claims do not remove weaknesses around logic failures, jailbreaks, or politically restricted answers. The next phase of competition will be about whether models like DeepSeek-R1 can turn extra computation into dependable performance across real-world tasks.