LLMLingua: a prompt compression toolkit that reduces token usage by up to 20x to lower costs and accelerate LLM inference
What it solves
LLMLingua addresses two practical limits of Large Language Models (LLMs): bounded prompt length and high API costs. It also targets the "lost in the middle" problem, where LLMs struggle to use information placed in the middle of a long context, and helps users fit more information into a prompt with minimal performance loss.
How it works
The project provides a series of prompt compression methods that identify and remove non-essential tokens from prompts:
- LLMLingua: Uses a compact, well-trained language model (such as GPT2-small or LLaMA-7B) to identify and remove redundant tokens, achieving up to 20x compression.
- LongLLMLingua: Specifically designed for long-context scenarios to mitigate the "lost in the middle" issue and improve RAG performance.
- LLMLingua-2: A faster, task-agnostic compressor built on a BERT-level encoder and trained via data distillation from GPT-4.
- SecurityLingua: A safety guardrail that uses security-aware compression to reveal malicious intentions in jailbreak attacks.
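The core idea shared by these methods, scoring tokens and dropping the least informative ones, can be illustrated with a toy sketch. This is not the library's implementation: it swaps the small language model for an add-one-smoothed unigram frequency model, and the names `token_scores`, `compress`, `reference_tokens`, and `target_tokens` are hypothetical, chosen only for illustration.

```python
import math
from collections import Counter


def token_scores(tokens, reference_tokens):
    """Score each token by its self-information, -log p(token).

    ASSUMPTION: a unigram model estimated from `reference_tokens` stands in
    for the small language model that the real methods use.
    """
    counts = Counter(reference_tokens)
    total = len(reference_tokens)
    vocab = len(counts) + 1  # one extra slot for unseen tokens (add-one smoothing)
    return [-math.log((counts[t.lower()] + 1) / (total + vocab)) for t in tokens]


def compress(prompt, reference_tokens, target_tokens):
    """Keep the `target_tokens` most informative tokens, in original order."""
    tokens = prompt.split()
    if not tokens:
        return ""
    keep = max(1, min(target_tokens, len(tokens)))
    scores = token_scores(tokens, reference_tokens)
    # Rank positions by score (highest first); ties keep their original order.
    ranked = sorted(range(len(tokens)), key=lambda i: scores[i], reverse=True)
    kept = sorted(ranked[:keep])  # restore reading order
    return " ".join(tokens[i] for i in kept)


# Hypothetical reference corpus in which filler words are frequent.
reference = (
    "the the the the a a a of of of is is is to to "
    "please please could could you you"
).split()

prompt = "could you please summarize the quarterly revenue of the Berlin office"
print(compress(prompt, reference, target_tokens=5))
# prints: summarize quarterly revenue Berlin office
```

Frequent filler words receive low scores and are dropped first, while rare content words survive. A real compressor would score tokens with a language model conditioned on the preceding context rather than with isolated word frequencies.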
Who it’s for
It is designed for developers and researchers building LLM-based applications, particularly those using Retrieval-Augmented Generation (RAG), processing long documents, or looking to reduce API costs and inference latency.
Highlights
- Significant Compression: Reduces prompt length by up to 20x with minimal performance loss.
- Cost and Speed: Lowers API costs and accelerates inference by reducing token counts and KV-cache size.
- RAG Enhancement: LongLLMLingua improves RAG performance by up to 21.4% while using only 1/4 of the tokens.
- Integration: Integrated into popular frameworks like LangChain, LlamaIndex, and Prompt flow.
- Task-Agnostic: LLMLingua-2 runs 3x-6x faster than LLMLingua and handles out-of-domain data effectively.
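To make the cost claim concrete, here is a back-of-the-envelope sketch of how a compression ratio translates into input-token spend. The function name `input_cost` and the price of 0.01 per 1,000 tokens are hypothetical placeholders, not figures from the project. The calculation ignores output tokens and the cost of running the compressor itself.

```python
import math


def input_cost(prompt_tokens, compression_ratio, price_per_1k_tokens):
    """Return (compressed_tokens, original_cost, compressed_cost).

    ASSUMPTION: cost scales linearly with input tokens; output tokens and
    the compressor's own runtime are not counted.
    """
    compressed_tokens = math.ceil(prompt_tokens / compression_ratio)
    original_cost = prompt_tokens / 1000 * price_per_1k_tokens
    compressed_cost = compressed_tokens / 1000 * price_per_1k_tokens
    return compressed_tokens, original_cost, compressed_cost


# A 20,000-token prompt at the best-case 20x ratio, with a placeholder price.
tokens, before, after = input_cost(20_000, 20, price_per_1k_tokens=0.01)
print(f"{tokens} tokens, cost {before:.2f} -> {after:.2f}")
```

At the best-case 20x ratio, the input portion of the bill shrinks by the same factor. Actual savings depend on the ratio achieved for a given task.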
Sources
- microsoft/LLMLingua