data-juicer: a cloud-scale data processing system for curating AI-ready multimodal datasets using composable operators
What it solves
Data-Juicer addresses the challenge of turning raw, messy data into high-quality, AI-ready datasets. It removes the need for custom "glue code" when cleaning, synthesizing, and analyzing the massive datasets that foundation models, agent systems, and RAG indices require.
How it works
Data-Juicer is a composable data processing system built on a modular library of over 200 operators. Users define reproducible pipelines as YAML recipes or in Python code, chaining operators together in order. The system is designed for cloud-native scalability: it uses Ray for distributed execution across thousands of nodes and applies optimizations such as automatic operator fusion and CUDA acceleration to handle petabyte-scale datasets.
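To make the idea of chaining operators from a recipe concrete, here is a minimal, self-contained sketch in plain Python. It does not use Data-Juicer's actual API: the operator names (`whitespace_mapper`, `min_length_filter`), the `REGISTRY` dict, and the recipe layout are all hypothetical, chosen only to illustrate how a declarative list of steps could be turned into an ordered pipeline of mappers and filters.

```python
from typing import Callable, Dict, List, Optional

Sample = Dict[str, str]
Operator = Callable[[Sample], Optional[Sample]]

# Hypothetical operators for illustration only; these are NOT Data-Juicer's
# real operator names or signatures.
def make_whitespace_mapper() -> Operator:
    """Mapper: collapse runs of whitespace in the 'text' field."""
    def op(sample: Sample) -> Optional[Sample]:
        return {**sample, "text": " ".join(sample["text"].split())}
    return op

def make_min_length_filter(min_len: int = 1) -> Operator:
    """Filter: drop samples whose text is shorter than min_len characters."""
    def op(sample: Sample) -> Optional[Sample]:
        return sample if len(sample["text"]) >= min_len else None
    return op

# Hypothetical registry mapping recipe step names to operator factories.
REGISTRY: Dict[str, Callable[..., Operator]] = {
    "whitespace_mapper": make_whitespace_mapper,
    "min_length_filter": make_min_length_filter,
}

def build_pipeline(recipe: List[dict]) -> List[Operator]:
    """Turn a recipe (a list of {name: params} steps) into operator callables."""
    ops = []
    for step in recipe:
        (name, params), = step.items()  # each step has exactly one key
        ops.append(REGISTRY[name](**(params or {})))
    return ops

def run_pipeline(samples: List[Sample], ops: List[Operator]) -> List[Sample]:
    """Apply operators in order; a None result means the sample was filtered out."""
    kept = []
    for sample in samples:
        current: Optional[Sample] = sample
        for op in ops:
            current = op(current)
            if current is None:
                break
        if current is not None:
            kept.append(current)
    return kept

# A recipe written as a Python structure; in a YAML file it would be the
# same list of single-key mappings.
recipe = [
    {"whitespace_mapper": None},
    {"min_length_filter": {"min_len": 5}},
]
data = [{"text": "  hello   world "}, {"text": " hi "}]
result = run_pipeline(data, build_pipeline(recipe))
print(result)
```

Because the recipe is plain data, it can be versioned alongside the dataset, and the same list of steps could in principle be handed to a distributed executor instead of the single-process loop used in this sketch.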
Who it’s for
This tool is designed for AI researchers and engineers who need to curate pre-training corpora, prepare fine-tuning data, clean agent interaction traces, or build domain-specific RAG indices at scale.
Highlights
- Massive Operator Library: Over 200 operators covering text, image, audio, video, and multimodal data.
- Cloud-Scale Performance: Capable of processing 70B samples in 2 hours on 50 Ray nodes.
- Recipe-First Workflow: Uses versionable YAML pipelines for reproducible data curation.
- Broad AI Lifecycle Support: Specialized tools for foundation model pre-training, agent quality gating, and embodied AI (VLA) processing.
Sources
- datajuicer/data-juicer