chonkie: a lightweight ingestion library for fast and efficient RAG text chunking

What it solves

Chonkie is a lightweight ingestion library that simplifies and speeds up splitting text into chunks for Retrieval-Augmented Generation (RAG) pipelines. It removes the need to write custom chunkers from scratch and avoids the dependency and install-size overhead of larger libraries.

How it works

Chonkie provides several chunking strategies and a pipeline system for chaining operations together. The strategies fall into four groups:

  • Fixed-size/Token-based: Using TokenChunker or the SIMD-accelerated FastChunker.
  • Structural/Hierarchical: Using RecursiveChunker (with customizable rules) or CodeChunker for programming languages.
  • Semantic: Using SemanticChunker (based on similarity) or SlumberChunker (using an LLM to find meaningful breaks).
  • Specialized: TableChunker for markdown tables and NeuralChunker, which uses a neural model to choose split points.
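
To make the first two strategies concrete, here is a minimal pure-Python sketch of fixed-size chunking with overlap and of recursive splitting on a separator hierarchy. It is an illustration of the general techniques only, not Chonkie's implementation or API: the function names (`whitespace_tokens`, `token_chunks`, `recursive_chunks`), the whitespace tokenizer, and the default separators are all assumptions made for this example.

```python
from typing import List, Sequence


def whitespace_tokens(text: str) -> List[str]:
    """Hypothetical tokenizer: splits on whitespace (real tokenizers differ)."""
    return text.split()


def token_chunks(text: str, chunk_size: int = 8, overlap: int = 2) -> List[str]:
    """Fixed-size chunking: slide a window of chunk_size tokens,
    advancing by chunk_size - overlap tokens each step."""
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must satisfy 0 <= overlap < chunk_size")
    tokens = whitespace_tokens(text)
    step = chunk_size - overlap
    chunks: List[str] = []
    for start in range(0, len(tokens), step):
        chunks.append(" ".join(tokens[start:start + chunk_size]))
        # Stop once the current window has reached the end of the text.
        if start + chunk_size >= len(tokens):
            break
    return chunks


def recursive_chunks(
    text: str,
    max_tokens: int = 8,
    separators: Sequence[str] = ("\n\n", ". ", " "),
) -> List[str]:
    """Structural chunking: split on the coarsest separator first and
    recurse with finer separators into any piece that is still too large.
    Small neighbouring pieces are not merged back together in this sketch."""
    if len(whitespace_tokens(text)) <= max_tokens or not separators:
        return [text] if text.strip() else []
    head, rest = separators[0], separators[1:]
    out: List[str] = []
    for piece in text.split(head):
        out.extend(recursive_chunks(piece, max_tokens, rest))
    return out


fixed = token_chunks("a b c d e f g h i j", chunk_size=4, overlap=1)
nested = recursive_chunks("one two three. four five six\n\nseven eight", max_tokens=3)
```

With these inputs, `fixed` holds three overlapping four-token windows and `nested` holds three pieces that each fit the three-token budget. The overlap keeps context that straddles a boundary visible in both neighbouring chunks, while the recursive variant prefers paragraph and sentence boundaries over arbitrary cut points.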

Users can build a Pipeline to fetch data, chunk it, refine it (e.g., adding embeddings or merging overlaps), and ship it directly to a vector database via "handshakes."
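
The chunk, refine, and ship flow described above can be sketched as a small chain of callables. This is a hypothetical stand-in written for illustration, not Chonkie's `Pipeline` class: `SketchPipeline`, `add_chunker`, `add_refiner`, `add_sink`, and the helper functions are invented names, the refiner attaches a character count where a real one might attach embeddings, and an in-memory list plays the role of the vector database.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Optional


@dataclass
class Chunk:
    text: str
    metadata: Dict[str, object] = field(default_factory=dict)


Chunker = Callable[[str], List[Chunk]]
Refiner = Callable[[List[Chunk]], List[Chunk]]
Sink = Callable[[List[Chunk]], None]


class SketchPipeline:
    """Hypothetical pipeline: one chunker, any number of refiners, one sink."""

    def __init__(self) -> None:
        self._chunker: Optional[Chunker] = None
        self._refiners: List[Refiner] = []
        self._sink: Optional[Sink] = None

    def add_chunker(self, chunker: Chunker) -> "SketchPipeline":
        self._chunker = chunker
        return self

    def add_refiner(self, refiner: Refiner) -> "SketchPipeline":
        self._refiners.append(refiner)
        return self

    def add_sink(self, sink: Sink) -> "SketchPipeline":
        self._sink = sink
        return self

    def run(self, texts: List[str]) -> List[Chunk]:
        if self._chunker is None:
            raise ValueError("a chunker is required before run()")
        # Chunk every input text, then apply refiners in registration order.
        chunks = [c for t in texts for c in self._chunker(t)]
        for refine in self._refiners:
            chunks = refine(chunks)
        # Ship the final chunks to the sink, if one was configured.
        if self._sink is not None:
            self._sink(chunks)
        return chunks


def sentence_chunker(text: str) -> List[Chunk]:
    # Naive split on ". "; real chunkers handle far more cases.
    return [Chunk(s.strip()) for s in text.split(". ") if s.strip()]


def add_length(chunks: List[Chunk]) -> List[Chunk]:
    # Stand-in for an embedding step: records the character length only.
    for c in chunks:
        c.metadata["n_chars"] = len(c.text)
    return chunks


store: List[Chunk] = []  # stand-in for a vector database

result = (
    SketchPipeline()
    .add_chunker(sentence_chunker)
    .add_refiner(add_length)
    .add_sink(store.extend)
    .run(["First sentence. Second sentence."])
)
```

Returning `self` from each builder method is what allows the steps to be chained in reading order, and keeping each stage a plain callable means any stage can be swapped without touching the others.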

Who it’s for

Chonkie is aimed at developers building RAG applications who need a fast, efficient, multilingual text-splitting tool (56 languages supported) that integrates easily with existing vector stores, embedding providers, and LLMs.

Highlights

  • Extensive Integrations: Over 45 integrations including 10+ vector databases (ChromaDB, Pinecone, Qdrant, etc.), 16+ embedding providers, and 5+ LLM providers.
  • High Performance: The project's benchmarks report faster chunking and a smaller package size than competing libraries.
  • REST API Server: Can be run as a self-hosted API for easy integration into any application.
  • Flexible Tokenization: Supports various tokenizers including tiktoken, Hugging Face, and custom token counting functions.
  • Agent-Ready: Provides official skills and plugins for AI coding agents like Claude Code and Cursor.
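
To illustrate what a custom token counting function means in practice, the sketch below packs sentences into chunks under a token budget, where the counter is any callable that maps a string to an integer. This is an assumption-laden illustration and not Chonkie's API: `pack_sentences`, `word_count`, and `char_estimate` are hypothetical names, and the four-characters-per-token heuristic is only a rough stand-in for a real tokenizer.

```python
import re
from typing import Callable, List


def word_count(text: str) -> int:
    """Counts whitespace-separated words as tokens."""
    return len(text.split())


def char_estimate(text: str) -> int:
    """Rough heuristic of about four characters per token (an assumption)."""
    return max(1, len(text) // 4) if text else 0


def pack_sentences(
    text: str,
    max_tokens: int,
    count_tokens: Callable[[str], int] = word_count,
) -> List[str]:
    """Greedily append sentences to the current chunk until adding the next
    one would exceed max_tokens according to count_tokens."""
    sentences = [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]
    chunks: List[str] = []
    current = ""
    for sentence in sentences:
        candidate = f"{current} {sentence}".strip()
        if current and count_tokens(candidate) > max_tokens:
            chunks.append(current)
            current = sentence
        else:
            # A single sentence over budget is kept whole rather than cut.
            current = candidate
    if current:
        chunks.append(current)
    return chunks


sample = "One two three. Four five. Six seven eight nine."
by_words = pack_sentences(sample, max_tokens=5)
by_chars = pack_sentences(sample, max_tokens=5, count_tokens=char_estimate)
```

The same text yields two chunks under the word counter and three under the character estimate, which shows why the choice of counter matters: chunk boundaries follow whatever notion of "token" the downstream embedding model or LLM actually uses.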
