

DeepSeek is preparing to revolutionize AI learning with a new open-source OCR compression model. Thanks to its optical compression approach, DeepSeek-OCR can process more than 200,000 document pages per day on a single Nvidia A100 GPU.
With the proliferation of AI data centers and associated processing costs, the onus is now on algorithm efficiency, and no language model seems to do it better than DeepSeek. Its models are open source, and training them comes at a much lower cost than those of OpenAI’s ChatGPT or Google’s Gemini.
The newly announced DeepSeek-OCR model is a prime example of learning efficiency. By using optical mapping, it can compress extremely long documents by converting them to images, retaining 97% recognition precision at compression ratios below 10x.
By using an advanced encoder-decoder pair, more than nine text tokens can be packed into a single vision token, greatly diminishing the computing resources needed to process the content. Even at a 20x compression ratio, the new DeepSeek-OCR system still achieves around 60% optical recognition accuracy, a rather unprecedented feat.
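The token savings those ratios imply are easy to quantify. A back-of-the-envelope sketch in Python (the ratios are taken from the figures above; the function name is ours, purely for illustration):

```python
import math

def vision_tokens_needed(text_tokens: int, compression_ratio: float) -> int:
    """Estimate vision tokens left after optical compression (illustrative only)."""
    return math.ceil(text_tokens / compression_ratio)

# A 10,000-token document at the ~10x ratio cited for ~97% precision:
print(vision_tokens_needed(10_000, 10))  # -> 1000

# The same document at the aggressive 20x ratio (~60% accuracy):
print(vision_tokens_needed(10_000, 20))  # -> 500
```

Since transformer attention cost grows quadratically with sequence length, cutting 10,000 tokens to 1,000 reduces far more than 90% of the attention compute.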
Thanks to the new AI compression algorithms, DeepSeek-OCR can churn through scientific or historical text at a rate of 200,000 pages per day on a single Nvidia A100 data center GPU. A 20-node cluster with eight A100s per node can thus process some 33 million document pages daily, a paradigm shift in text-heavy LLM learning. According to the OmniDocBench ranking, DeepSeek-OCR beats other popular solutions like GOT-OCR2.0 or MinerU2.0 by a wide margin, using far fewer vision tokens per page.
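The cluster figure follows directly from the per-GPU number, assuming the standard eight-A100 node configuration; a quick sanity check (the constants are from the article, the node size is our assumption):

```python
PAGES_PER_GPU_PER_DAY = 200_000  # single-A100 throughput cited above
GPUS_PER_NODE = 8                # assumption: standard 8-GPU A100 nodes
NODES = 20

# 200,000 pages x 8 GPUs x 20 nodes per day
cluster_pages_per_day = PAGES_PER_GPU_PER_DAY * GPUS_PER_NODE * NODES
print(f"{cluster_pages_per_day:,} pages/day")  # -> 32,000,000 pages/day
```

That lands at 32 million pages per day, in line with the roughly 33 million figure, once real-world scheduling overhead and batching variation are rounded in.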
The new DeepEncoder algorithms can handle a range of document sizes and resolutions without sacrificing speed or accuracy, while the DeepSeek3B-MoE-A570M decoder relies on a so-called mixture-of-experts architecture that activates only the specialized expert sub-networks needed for each OCR task. As a result, DeepSeek-OCR can process complex documents with graphs, scientific formulas, diagrams, or images, even when they are written in several languages.
To achieve such scale and accuracy, DeepSeek went through 30 million pages in Portable Document Format (PDF) written in nearly 100 languages, spanning every document category out there, from newspapers and scientific handwriting to textbooks and PhD dissertations. Still, while the speed and efficiency of the visual tokenization achieved with the new DeepSeek-OCR system are undeniable, it remains to be seen whether this will actually improve language-model reasoning compared with the current text-based token paradigm.

Daniel Zlatev – Senior Tech Writer – 1931 articles published on Notebookcheck since 2021
Daniel Zlatev, 2025-10-22 (Update: 2025-10-22)