Chinese-CLIP: a large-scale Chinese vision-language model for cross-modal retrieval and zero-shot image classification
What it solves
Chinese-CLIP provides a Chinese-language version of the CLIP (Contrastive Language-Image Pre-training) model. It addresses the lack of high-performance, large-scale vision-language models specifically optimized for the Chinese language, enabling tasks like cross-modal retrieval, zero-shot image classification, and image-text similarity calculations.
How it works
Built on the open_clip project, Chinese-CLIP was trained on approximately 200 million Chinese image-text pairs. It uses a dual-encoder architecture, with one encoder for images and one for text, that maps both modalities into a shared embedding space. The project offers several model scales (from RN50 to ViT-H-14) and supports training optimizations such as FlashAttention, gradient accumulation, and the FLIP training strategy to improve efficiency and performance.
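To make the shared embedding space concrete, here is a minimal sketch of how zero-shot classification can be scored once an image and a set of candidate Chinese labels have been embedded. It does not call the Chinese-CLIP library: the three-dimensional vectors are made-up stand-ins for encoder outputs, the function names are hypothetical, and the `logit_scale` of 100 is an assumed temperature, not a value taken from the project.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit length so dot products become cosine similarities."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]


def cosine_similarity(a, b):
    """Cosine similarity between two embedding vectors."""
    a, b = l2_normalize(a), l2_normalize(b)
    return sum(x * y for x, y in zip(a, b))


def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def zero_shot_classify(image_embedding, label_embeddings, logit_scale=100.0):
    """Return a {label: probability} dict for one image.

    logit_scale is an assumed temperature used to sharpen the distribution.
    """
    labels = list(label_embeddings)
    logits = [
        logit_scale * cosine_similarity(image_embedding, label_embeddings[label])
        for label in labels
    ]
    return dict(zip(labels, softmax(logits)))


# Toy stand-ins for encoder outputs (hypothetical values, not real model output).
image_embedding = [0.9, 0.1, 0.2]
label_embeddings = {
    "猫": [1.0, 0.0, 0.1],    # "cat"
    "狗": [0.0, 1.0, 0.1],    # "dog"
    "汽车": [0.1, 0.1, 1.0],  # "car"
}

probs = zero_shot_classify(image_embedding, label_embeddings)
best_label = max(probs, key=probs.get)
print(best_label, probs)
```

In a real pipeline the toy vectors would be replaced by the outputs of the vision and text encoders; the scoring step stays the same, and the same cosine similarity can rank candidates for cross-modal retrieval.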
Who it’s for
This project is designed for developers and researchers working with Chinese multimodal AI, specifically those needing to implement image-text search, automated image tagging, or zero-shot classification in Chinese.
Highlights
- Large Scale Training: Trained on ~200 million Chinese image-text pairs.
- Multiple Model Sizes: Offers five different scales, including ResNet50 and various Vision Transformer (ViT) configurations.
- Deployment Ready: Includes support for ONNX, TensorRT, and CoreML for faster inference and deployment.
- Flexible Training: Supports knowledge distillation fine-tuning, distributed training, and gradient checkpointing for memory efficiency.
Sources
- OFA-Sys/Chinese-CLIP