MiniCPM-V 4.6 release notes / what's new
MiniCPM-V 4.6 release notes / what's new
MiniCPM-V 4.6 is a 1.3 billion parameter vision model designed to enable local AI agents to process visual data—such as screenshots, PDFs, and videos—without the VRAM overhead of larger multimodal models or the latency of hosted APIs. It prioritizes edge deployability and token efficiency, making it suitable for integration into agent loops where frequent tool calls and visual inputs otherwise exhaust context budgets.
Model Architecture and Specifications
MiniCPM-V 4.6 utilizes a combination of a SigLIP 2-400 vision encoder and a Qwen 3.5 0.8B language model. Key technical specifications include:
- Parameter Count: 1.3 billion total parameters.
- License: Apache 2.0 (fully open weights).
- Context Window: Up to 262K tokens, supporting single images, multiple images, and video inputs.
- Deployment Support: Compatible with vLLM, SGLang, Llama CPP, and Ollama, with quantized variants available in standard formats (including GGUF).
- Mobile Support: Includes example apps and on-device adaptation code for iOS, Android, and Harmony OS.
Performance and Intelligence Benchmarks
On the Artificial Analysis Intelligence Index, MiniCPM-V 4.6 scores 13, which is approximately one-quarter of GPT-4o's intelligence. Despite its size, it outperforms models more than twice its size, such as the Ministral 3B model, and the Qwen 3.5 0.8B model.
In visual reasoning, the model performs exceptionally well on the MMU Pro benchmark, scoring higher than any other open-weights model under 2 billion parameters. While not intended to replace frontier models like Gemini for high-accuracy production environments or complex browser-use tasks, it provides a highly efficient alternative for sub-agent tasks.
Token Efficiency and Visual Compression
Token efficiency is a primary advantage of MiniCPM-V 4.6, particularly for agentic workflows where every visual input consumes the context budget.
- Reduced Token Consumption: The model uses approximately 5.4 million output tokens on the Artificial Analysis Intelligence Index suite, which is roughly 19 times fewer tokens than the non-reasoning Qwen 3.5 0.8B and 43 times fewer than the reasoning version of that model.
- Flexible Compression Modes: Users can switch between two visual token compression modes at inference time:
- 16x Compression: Optimized for video processing and maximum efficiency.
- 4x Compression: Optimized for fine-grained image details and OCR tasks.
Functional Capabilities and Testing
MiniCPM-V 4.6 demonstrates strong capabilities across several visual tasks, though performance varies by configuration:
Visual Question Answering (VQA) and OCR
- Document Analysis: The model can extract data from invoices and order receipts, such as identifying specific items (e.g., "Coke Zero") and their associated costs.
- Handwriting Recognition: The model successfully extracts drug names and dosages (milligrams) from handwritten medical receipts, a task traditionally difficult for small vision models.
- Detail Resolution: Using the 4x downsampling mode significantly improves results for OCR and detailed image analysis compared to the 16x mode.
Video Understanding
- The model can describe general actions in videos, such as football matches, identifying team names and ball movement. However, it may struggle with highly specific details or accurate scoring in some instances.
Thinking vs. Non-Thinking Modes
- Non-Thinking: Faster, basic responses.
- Thinking (Chain-of-Thought): Produces more detailed explanations and better mathematical reasoning (e.g., itemizing costs on a receipt before summing them). Thinking mode also improves the accuracy of descriptions for video understanding tasks.
Summary of Use Cases for Agents
MiniCPM-V 4.6 is best utilized as a specialized vision component within a larger agentic system. Instead of using a massive multimodal model for all text and vision tasks, developers can use a lightweight text model for general reasoning and trigger MiniCPM-V 4.6 specifically when an image or video needs to be processed. This approach preserves VRAM and reduces latency in local deployments.