How DeepSeek’s image tokens could change AI memory

DeepSeek’s new OCR model explores a different way for AI systems to store context: packing written information into images instead of relying only on text tokens. Researchers say the approach could help models remember more efficiently, though it remains an early exploration.

DeepSeek has released an optical character recognition model that is drawing attention for something beyond OCR itself. The model points to a different way of handling AI memory: storing written information as images so a system can keep more context while using fewer tokens.

The idea matters because today’s large language models can struggle as conversations get longer. Text becomes costly to store and process, and that can lead to muddled recall, forgotten details, and what some researchers call “context rot.”

Why AI memory is expensive

Most large language models work by breaking text into small units called tokens. Those tokens let a model process language, but they also create a storage and computing burden as the amount of context grows.

That burden becomes more visible in long interactions. When a user keeps a conversation going for a long time, the system has to manage more prior information. If it cannot do that well, it may lose track of earlier instructions or mix information together.
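To make the growth concrete, here is a minimal sketch of how the context a model must carry adds up across turns. It assumes a crude whitespace word count in place of a real tokenizer; the function names and sample turns are hypothetical and are not drawn from DeepSeek's work.

```python
from itertools import accumulate


def count_tokens(text: str) -> int:
    """Crude stand-in for a real tokenizer: one token per whitespace-separated word."""
    return len(text.split())


def context_size_per_turn(turns: list[str]) -> list[int]:
    """Running total of tokens that must stay in context after each turn."""
    return list(accumulate(count_tokens(turn) for turn in turns))


# Hypothetical conversation, used only for illustration.
turns = [
    "Please summarize this report in three bullet points.",
    "Here is a summary of the report you shared.",
    "Now translate the summary into French and keep the bullets.",
]
sizes = context_size_per_turn(turns)
print(sizes)
```

Each entry is larger than the last because nothing is dropped: every new turn is processed alongside everything that came before it. Real tokenizers split text differently, so actual counts would differ, but the running total behaves the same way.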

The issue is not only about convenience. Improving the way AI models remember information could reduce the computing power needed to run them, which in turn bears on AI’s large and growing carbon footprint.

DeepSeek’s approach is notable because it does not start from the usual assumption that text tokens should be the default form of stored context. Instead, it asks whether written information can be packed more densely by treating it visually.

What DeepSeek’s OCR model does differently

OCR is not new. It is the technology behind scanner apps, text translation in photos, and accessibility tools that turn text in images into machine-readable words. The field is already mature, with many strong systems.

DeepSeek’s model, according to the paper and early reviews, performs on par with top models on key benchmarks. But the more interesting part is how it processes information.

Rather than storing words only as text tokens, the system packs written content into image form, much like taking a picture of pages from a book. The researchers found that this lets the model retain nearly the same information while using far fewer tokens.
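The saving can be expressed as a simple ratio between the text tokens a page would normally need and the visual tokens used to represent its image. The sketch below is only arithmetic under assumed numbers: the page sizes are hypothetical placeholders, not figures from DeepSeek's paper.

```python
def compression_ratio(text_tokens: int, vision_tokens: int) -> float:
    """How many text tokens each visual token stands in for."""
    if vision_tokens <= 0:
        raise ValueError("vision_tokens must be positive")
    return text_tokens / vision_tokens


def tokens_saved(text_tokens: int, vision_tokens: int) -> int:
    """Tokens freed up in the context window by storing the page as an image."""
    return text_tokens - vision_tokens


# Hypothetical page: these counts are illustrative assumptions only.
page_text_tokens = 1000
page_vision_tokens = 100

ratio = compression_ratio(page_text_tokens, page_vision_tokens)
saved = tokens_saved(page_text_tokens, page_vision_tokens)
print(f"{ratio:.1f}x compression, {saved} tokens saved")
```

A higher ratio means more text is squeezed into each visual token. The open question is how high the ratio can go before the model can no longer read the page back accurately.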

In that sense, the OCR model acts as a test bed. It is not just about reading text from images. It is a way to explore whether visual tokens can help AI systems carry more information more efficiently.

A blurrier memory that still stays useful

The model also uses a kind of tiered compression. Older or less critical content can be stored in a slightly blurrier form, saving space while keeping the information available in the background.
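One way to picture tiered compression is as a token budget that shrinks with age. The sketch below is an assumption-laden illustration, not DeepSeek's implementation: the tier boundaries, the per-page token budgets, and the names `Tier`, `TIERS`, and `budget_for_age` are all hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Tier:
    max_age: int        # pages at most this many turns old fall in this tier
    vision_tokens: int  # visual-token budget per page (lower means blurrier)


# Hypothetical tiers: recent pages stay sharp, older pages get fewer tokens.
TIERS = (
    Tier(max_age=5, vision_tokens=256),
    Tier(max_age=20, vision_tokens=100),
    Tier(max_age=10**9, vision_tokens=64),
)


def budget_for_age(age: int) -> int:
    """Return the per-page token budget for a page that is `age` turns old."""
    if age < 0:
        raise ValueError("age must be non-negative")
    for tier in TIERS:
        if age <= tier.max_age:
            return tier.vision_tokens
    return TIERS[-1].vision_tokens


def total_budget(ages: list[int]) -> int:
    """Total tokens needed to keep every page in context at its tier's budget."""
    return sum(budget_for_age(age) for age in ages)


page_ages = [0, 3, 10, 50]  # hypothetical ages, in turns
tiered = total_budget(page_ages)
uniform = len(page_ages) * TIERS[0].vision_tokens
print(tiered, uniform)
```

In this toy setup, the four pages cost fewer tokens under tiering than they would at full resolution, and none of them is dropped entirely. Note that the budget here depends only on age, so the sketch makes no attempt to judge which content is less critical.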

That design has an intuitive appeal because human memory does not preserve every detail with equal clarity. Some memories remain vivid, while less important information fades. DeepSeek’s paper argues that compressed content can still be accessible while maintaining high system efficiency.

The comparison has limits. Manling Li, an assistant professor of computer science at Northwestern University, says current AI systems still tend to forget and remember in a very linear way. They are better at recalling what is recent than at deciding what is important.

Li says future work should explore memory that fades more dynamically. In her example, people may remember a life-changing moment from years ago while forgetting what they ate for lunch last week. DeepSeek’s method does not fully solve that problem, but it gives researchers a new framework for thinking about it.

Why researchers are paying attention

Using visual tokens for context storage is unconventional. Text tokens have long been the standard building block for AI systems, so DeepSeek’s image-based approach has quickly attracted interest.

Andrej Karpathy, the former Tesla AI chief and a founding member of OpenAI, praised the paper on X. He wrote that images may ultimately be better than text as inputs for LLMs, and called text tokens “wasteful and just terrible at the input.”

Li says the basic idea of image-based tokens for context storage is not entirely new. But she also says this is the first study she has seen that takes the idea this far and shows it might actually work.

Zihan Wang, a PhD candidate at Northwestern University, sees possible value for AI agents. Since conversations with AI are continuous, he believes this kind of memory approach could help models remember more and assist users more effectively.

The approach could also matter for training data. Model developers are facing a severe shortage of quality text for training systems. DeepSeek’s paper says the company’s OCR system can generate over 200,000 pages of training data a day on a single GPU.
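As a rough sense of scale, the paper's figure can be converted into a per-second rate and a time estimate. This is back-of-the-envelope arithmetic only: it treats 200,000 pages a day as a flat rate (the paper says "over"), and the linear scaling across GPUs in `days_to_process` is an assumption, not a claim from the paper.

```python
SECONDS_PER_DAY = 24 * 60 * 60
PAGES_PER_DAY_PER_GPU = 200_000  # figure quoted from DeepSeek's paper ("over 200,000")


def pages_per_second(pages_per_day: int = PAGES_PER_DAY_PER_GPU) -> float:
    """Average sustained throughput implied by a daily page count."""
    return pages_per_day / SECONDS_PER_DAY


def days_to_process(total_pages: int, gpus: int = 1) -> float:
    """Days needed for a corpus, assuming throughput scales linearly with GPUs."""
    if gpus <= 0:
        raise ValueError("gpus must be positive")
    return total_pages / (PAGES_PER_DAY_PER_GPU * gpus)


rate = pages_per_second()
print(f"{rate:.2f} pages per second on one GPU")
print(days_to_process(1_000_000), "days for a hypothetical million-page corpus")
```

The quoted rate works out to a little over two pages every second on a single GPU. The million-page corpus is a hypothetical used only to show the calculation.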

What comes next for visual tokens

For now, the model and paper remain an early exploration of using image tokens rather than text tokens for AI memory. The research does not mean text tokens are disappearing, and it does not show that visual tokens can handle every part of AI memory or reasoning.

Li says she hopes to see visual tokens applied not only to memory storage but also to reasoning. That would move the idea from a more efficient way to store context toward a broader method for how AI systems process and use information.

DeepSeek’s broader reputation gives the work extra attention. Based in Hangzhou, China, the company has tried to keep a low profile while still pushing the frontier in AI research. In January 2025, it shocked the industry with DeepSeek-R1, an open-source reasoning model that rivaled leading Western systems in performance while using far fewer computing resources.

The new OCR work fits that pattern: a technical shift aimed at efficiency rather than scale alone. If visual compression proves useful beyond this early test, it could become one path toward AI systems that remember more, cost less to run, and handle long-running interactions with fewer failures.