turbovec: what it is, what problem it solves & why it's gaining traction

What it solves

turbovec is a high-performance vector index designed to reduce the massive RAM requirements of large-scale vector search. It allows users to fit millions of documents in a fraction of the memory (e.g., 10 million documents in 4 GB instead of 31 GB) while maintaining high search speed and recall, making it ideal for air-gapped or memory-constrained RAG stacks.
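
The memory figures above can be sanity-checked with simple arithmetic. The sketch below assumes 768-dimensional FP32 embeddings and one 4-byte scalar per quantized vector; neither number comes from the source, and `index_bytes` is a hypothetical helper, not part of turbovec.

```python
# Back-of-the-envelope index size.
# Assumptions (not from the source): DIM = 768, one 4-byte scalar per quantized vector.
NUM_VECTORS = 10_000_000
DIM = 768


def index_bytes(num_vectors: int, dim: int, bits_per_coord: int,
                extra_bytes_per_vector: int = 0) -> int:
    """Bytes needed to store num_vectors vectors at the given coordinate width."""
    payload_bits = num_vectors * dim * bits_per_coord
    return payload_bits // 8 + num_vectors * extra_bytes_per_vector


fp32 = index_bytes(NUM_VECTORS, DIM, 32)
q4 = index_bytes(NUM_VECTORS, DIM, 4, extra_bytes_per_vector=4)
q2 = index_bytes(NUM_VECTORS, DIM, 2, extra_bytes_per_vector=4)

for name, size in [("fp32", fp32), ("4-bit", q4), ("2-bit", q2)]:
    print(f"{name:>5}: {size / 1e9:6.2f} GB  ({fp32 / size:4.1f}x smaller than fp32)")
```

Under these assumptions the FP32 baseline comes to about 30.7 GB and the 4-bit index to about 3.9 GB, in line with the 31 GB and 4 GB quoted above.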

How it works

Built on Google Research's TurboQuant algorithm, the project uses a data-oblivious quantizer that requires no separate training phase. The process involves:

  1. Normalization and Rotation: Vectors are normalized to unit directions and multiplied by a random orthogonal matrix to make their coordinate distributions predictable.
  2. Calibration (TQ+): A shift and scale are fitted to each coordinate during the first ingestion to map empirical data to a canonical Beta distribution.
  3. Lloyd-Max Quantization: Coordinates are bucketed into 2-bit or 4-bit integers using precomputed optimal boundaries.
  4. Length-Renormalization: A scalar is stored per vector to correct the systematic underestimation of inner products caused by quantization, ensuring unbiased scoring.
  5. SIMD Search: Search is performed using hand-written NEON (ARM) and AVX-512BW (x86) kernels that score directly against codebook values without full decompression.
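
The steps above can be sketched in plain Python. This is a simplified illustration, not turbovec's implementation. It skips the calibration step and replaces the SIMD kernels with a scalar loop. It uses the classic Lloyd-Max table for a unit-variance Gaussian as a stand-in for the precomputed Beta-derived boundaries. It also assumes the per-vector scalar simply restores the original vector length. All function names are hypothetical.

```python
import math
import random

# Lloyd-Max levels and decision boundaries for a 2-bit quantizer of a
# unit-variance Gaussian. Assumption: a stand-in for the Beta-derived tables.
LEVELS = (-1.510, -0.4528, 0.4528, 1.510)
BOUNDS = (-0.9816, 0.0, 0.9816)


def random_rotation(dim, rng):
    """Random orthogonal matrix built by Gram-Schmidt on Gaussian rows."""
    rows = []
    for _ in range(dim):
        v = [rng.gauss(0.0, 1.0) for _ in range(dim)]
        for u in rows:  # remove the components along earlier rows
            dot = sum(a * b for a, b in zip(v, u))
            v = [a - dot * b for a, b in zip(v, u)]
        norm = math.sqrt(sum(a * a for a in v))
        rows.append([a / norm for a in v])
    return rows


def rotate(matrix, vec):
    return [sum(m * x for m, x in zip(row, vec)) for row in matrix]


def encode(vec, rotation):
    """Return (codes, scale): 2-bit codes plus one length-correction scalar."""
    dim = len(vec)
    norm = math.sqrt(sum(x * x for x in vec))
    unit = [x / norm for x in vec]            # step 1: normalize
    rotated = rotate(rotation, unit)          # step 1: rotate
    sigma = 1.0 / math.sqrt(dim)              # rotated coordinates are roughly N(0, 1/dim)
    codes = [sum(x / sigma > b for b in BOUNDS) for x in rotated]  # step 3: bucket
    decoded_norm = math.sqrt(sum((LEVELS[c] * sigma) ** 2 for c in codes))
    scale = norm / decoded_norm               # step 4: restore the original length
    return codes, scale


def score(codes, scale, query, rotation):
    """Approximate <vec, query> by looking up codebook values per coordinate."""
    sigma = 1.0 / math.sqrt(len(codes))
    q = rotate(rotation, query)
    return scale * sigma * sum(LEVELS[c] * qi for c, qi in zip(codes, q))


rng = random.Random(0)
DIM = 64
R = random_rotation(DIM, rng)
doc = [rng.gauss(0.0, 1.0) for _ in range(DIM)]
query = [d + 0.5 * rng.gauss(0.0, 1.0) for d in doc]  # a query close to doc

codes, scale = encode(doc, R)
exact = sum(a * b for a, b in zip(doc, query))
approx = score(codes, scale, query, R)
print(f"exact={exact:.3f} approx={approx:.3f}")
```

Because the rotation is fixed and the codebook is known in advance, each vector can be encoded independently as it arrives, which is what removes the need for a training pass.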

Who it’s for

Developers building Retrieval-Augmented Generation (RAG) applications where privacy, low latency, and memory efficiency are critical, particularly those running in local or air-gapped environments.

Highlights

  • Online Ingest: No training step, parameter tuning, or index rebuilds are required as the corpus grows.
  • Extreme Compression: Up to 16x compression (e.g., FP32 to 2-bit) with minimal recall loss.
  • High Performance: Outperforms FAISS IndexPQFastScan by 10–19% on ARM and remains competitive on x86.
  • Filtered Search: Supports search-time filtering via an allowlist, which is integrated directly into the SIMD kernel to avoid unnecessary computation.
  • Framework Integrations: Drop-in replacements for in-memory vector stores in LangChain, LlamaIndex, Haystack, and Agno.
  • Pure Local: No managed services; data remains on the local machine or VPC.
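
The allowlist-based filtered search mentioned above can be illustrated with a scalar sketch. It assumes the allowlist is packed into a bitmask and checked before a vector is scored, so excluded vectors cost no scoring work. The real kernel is SIMD and its internals are not described in the source; `build_bitmask`, `filtered_top_k`, and `score_fn` are hypothetical names.

```python
import heapq


def build_bitmask(allowlist, num_vectors):
    """Pack the allowed ids into one bit per vector."""
    mask = bytearray((num_vectors + 7) // 8)
    for i in allowlist:
        mask[i >> 3] |= 1 << (i & 7)
    return mask


def filtered_top_k(score_fn, num_vectors, allowlist, k):
    """Return the k best (score, id) pairs among allowed ids, and how many were scored."""
    mask = build_bitmask(allowlist, num_vectors)
    scored = 0
    heap = []  # min-heap holding the best k candidates seen so far
    for i in range(num_vectors):
        if not (mask[i >> 3] >> (i & 7)) & 1:
            continue  # filtered out before any scoring happens
        scored += 1
        item = (score_fn(i), i)
        if len(heap) < k:
            heapq.heappush(heap, item)
        elif item > heap[0]:
            heapq.heapreplace(heap, item)
    return sorted(heap, reverse=True), scored


# Toy scores standing in for quantized inner products.
toy_scores = [0.9, 0.1, 0.8, 0.7, 0.95, 0.3]
top, scored = filtered_top_k(lambda i: toy_scores[i], len(toy_scores), {1, 2, 3, 5}, k=2)
print(top, scored)
```

Ids 0 and 4 have the highest toy scores but are not on the allowlist, so they are never scored and the result contains ids 2 and 3.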
