Baidu Unlimited-OCR: One-Shot Long-Horizon Parsing
Baidu Unlimited-OCR: One-Shot Long-Horizon Parsing
Baidu has introduced Unlimited-OCR, a model designed for "one-shot long-horizon parsing." The system enables the transcription of extensive documents, such as multi-page PDFs, in a single pass without the memory crashes typically associated with processing long sequences in Vision Language Models (VLMs).
Solving the VRAM Bottleneck with Reference Sliding Window Attention
Unlimited-OCR addresses the primary technical hurdle in long-document OCR: the linear growth of the KV (Key-Value) cache. In standard transformer architectures, as a model transcribes a 100-page PDF, it attempts to remember every previously generated word, causing memory usage to grow $O(N)$ until the system runs out of VRAM.
To solve this, Unlimited-OCR implements Reference Sliding Window Attention (R-SWA). This architectural approach splits the model's focus into two distinct paths:
- Global Reference: The model maintains a full, uncompromised view of the original document images, ensuring it never loses the visual context of the source material.
- Local Generation: The model restricts its memory of its own generated text to a tight, moving window (e.g., the last 128 words). It safely "forgets" older generated text, preventing the KV cache from expanding indefinitely.
This mechanism allows the model to process virtually unlimited document lengths while maintaining the visual grounding necessary for accurate transcription.
Implementation and Inference Options
Unlimited-OCR is built upon the foundations of DeepSeek-OCR and PaddleOCR. It provides two primary methods for deployment and inference:
Transformers-based Inference
For users with NVIDIA GPUs, the model can be run using the Hugging Face transformers library. The environment requires Python 3.12.3 and CUDA 12.9. Key dependencies include torch==2.10.0 and transformers==4.57.1.
The model supports two configuration modes for single images:
- Gundam Mode: Optimized for specific layouts (
base_size=1024,image_size=640,crop_mode=True). - Base Mode: Standard parsing (
base_size=1024,image_size=1024,crop_mode=False).
SGLang Server Deployment
For high-throughput production environments, Unlimited-OCR supports SGLang. This allows the model to be served via an OpenAI-compatible API. The server can be launched with a custom logit processor (DeepseekOCRNoRepeatNGramLogitProcessor) to manage repetition and improve output quality.
Technical Requirements and Capabilities
- Input Formats: The model supports single images, multi-image sequences, and PDFs (which are converted to images via PyMuPDF before processing).
- Context Length: The system is configured for a context length of 32,768 tokens.
- Batch Processing: The
infer.pyscript enables concurrent requests for image directories or PDF files, allowing for scalable batch inference.
Community Insights and Perspectives
While the technical approach to memory management is praised, the community has raised several critical points regarding the practical application of AI-driven OCR:
"My attempts at using AI to do OCR have always resulted in invented artifacts, which is not production feasible... words that are supposed to be in other languages being automatically translated to English, which ruins the effect."
Other users noted that while text OCR is advancing, other specialized domains remain underserved. One user highlighted that Optical Music Recognition (OMR) is still in a "greenfield" state, as current AI models struggle with the complex symbolic nature of music notation and the lack of standardized digital formats like MusicXML that are sufficiently rich for training corpora.
Additionally, some users questioned the necessity of reinventing the OCR engine itself, arguing that traditional vision models are already stable and reliable, suggesting that the real value of these new models lies in post-processing and data extraction rather than the raw transcription process.