chonkie: a lightweight ingestion library for fast and efficient RAG text chunking
What it solves
Chonkie is a lightweight ingestion library designed to simplify and accelerate the process of splitting text into chunks for Retrieval-Augmented Generation (RAG) pipelines. It eliminates the need to build custom chunkers from scratch and reduces the overhead associated with large, bloated libraries.
How it works
Chonkie provides a variety of chunking strategies and a pipeline system to chain these operations together. It supports multiple methods of splitting text, including:
- Fixed-size/Token-based: Using `TokenChunker` or the SIMD-accelerated `FastChunker`.
- Structural/Hierarchical: Using `RecursiveChunker` (with customizable rules) or `CodeChunker` for programming languages.
- Semantic: Using `SemanticChunker` (based on similarity) or `SlumberChunker` (using an LLM to find meaningful breaks).
- Specialized: `TableChunker` for markdown tables and `NeuralChunker`, which uses a neural model to find split points.
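To make the fixed-size strategy concrete, here is a minimal standard-library sketch of token-window chunking with optional overlap. It is not Chonkie's implementation or API: the function name `fixed_size_chunks`, its parameters, and the use of whitespace-separated words as "tokens" are all assumptions made for illustration.

```python
from typing import List


def fixed_size_chunks(text: str, chunk_size: int, overlap: int = 0) -> List[str]:
    """Split text into windows of at most `chunk_size` whitespace tokens.

    Hypothetical helper for illustration only; a real tokenizer would
    replace `str.split` here.
    """
    if chunk_size <= 0 or not 0 <= overlap < chunk_size:
        raise ValueError("need chunk_size > 0 and 0 <= overlap < chunk_size")

    tokens = text.split()
    step = chunk_size - overlap  # how far the window advances each time
    chunks: List[str] = []
    for start in range(0, len(tokens), step):
        chunks.append(" ".join(tokens[start:start + chunk_size]))
        # Stop once the current window has reached the end of the text,
        # so the tail is not emitted again as a redundant overlap-only chunk.
        if start + chunk_size >= len(tokens):
            break
    return chunks


if __name__ == "__main__":
    for chunk in fixed_size_chunks("a b c d e", chunk_size=3, overlap=1):
        print(repr(chunk))
```

Consecutive chunks share `overlap` tokens, which is a common way to keep some context across chunk boundaries at the cost of storing a little duplicated text.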
Users can build a Pipeline to fetch data, chunk it, refine it (e.g., adding embeddings or merging overlaps), and ship it directly to a vector database via "handshakes."
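The chaining idea can be sketched in plain Python as follows. This is a toy stand-in, not Chonkie's `Pipeline` class: `MiniPipeline`, `Chunk`, the method names, and the helper stages are hypothetical, and a plain list plays the role of the vector database.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Optional


@dataclass
class Chunk:
    text: str
    metadata: Dict[str, object] = field(default_factory=dict)


class MiniPipeline:
    """Hypothetical pipeline: one chunk step, any number of refine steps, one sink."""

    def __init__(self) -> None:
        self._chunker: Optional[Callable[[str], List[Chunk]]] = None
        self._refiners: List[Callable[[List[Chunk]], List[Chunk]]] = []
        self._sink: Optional[Callable[[List[Chunk]], None]] = None

    def chunk_with(self, chunker: Callable[[str], List[Chunk]]) -> "MiniPipeline":
        self._chunker = chunker
        return self  # returning self allows method chaining

    def refine_with(self, refiner: Callable[[List[Chunk]], List[Chunk]]) -> "MiniPipeline":
        self._refiners.append(refiner)
        return self

    def ship_to(self, sink: Callable[[List[Chunk]], None]) -> "MiniPipeline":
        self._sink = sink
        return self

    def run(self, text: str) -> List[Chunk]:
        if self._chunker is None:
            raise ValueError("no chunker configured")
        chunks = self._chunker(text)
        for refine in self._refiners:  # refiners run in the order they were added
            chunks = refine(chunks)
        if self._sink is not None:
            self._sink(chunks)
        return chunks


def split_paragraphs(text: str) -> List[Chunk]:
    """Toy chunker: one chunk per blank-line-separated paragraph."""
    return [Chunk(p.strip()) for p in text.split("\n\n") if p.strip()]


def add_char_count(chunks: List[Chunk]) -> List[Chunk]:
    """Toy refiner standing in for a step such as attaching embeddings."""
    for chunk in chunks:
        chunk.metadata["char_count"] = len(chunk.text)
    return chunks


if __name__ == "__main__":
    store: List[Chunk] = []  # stands in for a vector database
    pipe = MiniPipeline().chunk_with(split_paragraphs).refine_with(add_char_count).ship_to(store.extend)
    pipe.run("first para\n\nsecond para")
    print(store)
```

Keeping each stage a plain callable means a different chunker, refiner, or destination can be swapped in without touching the other stages.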
Who it’s for
Developers building RAG applications who need a fast, efficient, and multilingual (supporting 56 languages) text splitting tool that integrates easily with existing vector stores, embedding providers, and LLMs.
Highlights
- Extensive Integrations: Over 45 integrations including 10+ vector databases (ChromaDB, Pinecone, Qdrant, etc.), 16+ embedding providers, and 5+ LLM providers.
- High Performance: The project's benchmarks report it as significantly faster, and smaller in package size, than competing alternatives.
- REST API Server: Can be run as a self-hosted API for easy integration into any application.
- Flexible Tokenization: Supports various tokenizers including tiktoken, Hugging Face, and custom token counting functions.
- Agent-Ready: Provides official skills and plugins for AI coding agents like Claude Code and Cursor.
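As an illustration of why a pluggable token counter matters (the Flexible Tokenization highlight above), the sketch below packs sentences into chunks under a token budget using whatever counting function the caller supplies. It uses only the standard library and does not reflect Chonkie's actual interface: `pack_sentences`, `count_words`, and the regex-based sentence split are assumptions for this example.

```python
import re
from typing import Callable, List


def count_words(text: str) -> int:
    """Default counter: whitespace-separated words stand in for tokens."""
    return len(text.split())


def pack_sentences(
    text: str,
    max_tokens: int,
    count_tokens: Callable[[str], int] = count_words,
) -> List[str]:
    """Greedily group sentences so each chunk stays within `max_tokens`.

    Costs are summed per sentence, which is exact for word counts and only
    an approximation for subword tokenizers. A single sentence that exceeds
    the budget is kept whole as its own chunk.
    """
    sentences = [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]
    chunks: List[str] = []
    current: List[str] = []
    used = 0
    for sentence in sentences:
        cost = count_tokens(sentence)
        if current and used + cost > max_tokens:
            chunks.append(" ".join(current))  # budget exceeded: close the chunk
            current, used = [], 0
        current.append(sentence)
        used += cost
    if current:
        chunks.append(" ".join(current))
    return chunks


if __name__ == "__main__":
    sample = "One two. Three four five. Six."
    print(pack_sentences(sample, max_tokens=4))                    # word budget
    print(pack_sentences(sample, max_tokens=12, count_tokens=len))  # character budget
```

Because the counter is just a callable from `str` to `int`, the same packing logic works with a word count, a character count, or a real tokenizer's length function.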
Sources
- feyninc/chonkie