lance: an open lakehouse format for multimodal AI with high-performance vector search and random access
What it solves
Lance is a high-performance open lakehouse format designed specifically for multimodal AI. It addresses the limitations of traditional SQL-centric lakehouse formats such as Parquet and Iceberg, which struggle with the random access, vector search, and multimodal data storage that modern ML training and feature engineering require.
How it works
Lance defines a file format, a table format, and a catalog specification, all of which can be built on top of object storage. It stores embeddings, images, videos, audio, and text in a single unified format. It supports hybrid search, which combines vector similarity, full-text search (BM25), and SQL analytics, and it provides fast random access for sampling and exploration.
Who it’s for
It is built for AI engineers and data scientists who need to manage large-scale multimodal datasets, build search engines or feature stores with hybrid search, and sustain high-performance I/O for large-scale ML training.
Highlights
- Hybrid Search: Combines vector similarity, BM25 full-text search, and SQL analytics on a single dataset.
- Fast Random Access: Up to 100x faster random access than Parquet or Iceberg.
- Multimodal Support: Native storage and lazy loading for images, videos, audio, and text.
- Data Evolution: Allows adding columns with backfilled values without requiring full table rewrites.
- Zero-copy Versioning: Includes ACID transactions, time travel, tags, and branches without extra infrastructure.
- Broad Integration: Compatible with Apache Arrow, Pandas, Polars, DuckDB, Ray, and Apache Spark.
Sources
- lance-format/lance