heretic: what it is, what problem it solves & why it's gaining traction
What it solves
Heretic is designed to remove "safety alignment" (censorship) from transformer-based language models. It aims to do this without the need for expensive post-training or deep expertise in transformer internals, allowing models to answer prompts that they would otherwise refuse.
How it works
Heretic uses a technique called directional ablation (or "abliteration"). It identifies "refusal directions" in the model's residual stream by comparing the hidden states produced for harmful and harmless prompts. It then orthogonalizes the weight matrices that write into the residual stream (specifically the attention out-projection and the MLP down-projection) against these directions, so those components can no longer push activations along them, which suppresses refusals.
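The two steps above can be sketched on toy data. This is a minimal illustration, not Heretic's code: it assumes the refusal direction is taken as the normalized difference of mean residuals, uses plain Python lists so it runs without dependencies, and the helper names and the `scale` parameter are hypothetical.

```python
import math

def mean_vector(rows):
    """Column-wise mean of a list of equal-length vectors."""
    n = len(rows)
    return [sum(col) / n for col in zip(*rows)]

def refusal_direction(harmful, harmless):
    """Unit vector along (mean harmful residual - mean harmless residual)."""
    diff = [a - b for a, b in zip(mean_vector(harmful), mean_vector(harmless))]
    norm = math.sqrt(sum(x * x for x in diff))
    return [x / norm for x in diff]

def orthogonalize(weight, direction, scale=1.0):
    """Return W - scale * r (r^T W).

    `weight` has shape (d_model, d_in): each column is a vector that the
    layer can add to the residual stream. With scale=1.0 every column
    loses its component along `direction`. `scale` is a hypothetical knob.
    """
    d_model, d_in = len(weight), len(weight[0])
    # proj[j] is the component of column j along the refusal direction.
    proj = [sum(direction[i] * weight[i][j] for i in range(d_model))
            for j in range(d_in)]
    return [[weight[i][j] - scale * direction[i] * proj[j]
             for j in range(d_in)]
            for i in range(d_model)]

# Toy residuals (d_model = 3): harmful prompts differ only along axis 0.
harmful = [[2.0, 0.0, 1.0], [4.0, 0.0, 1.0]]
harmless = [[0.0, 0.0, 1.0], [0.0, 0.0, 1.0]]
direction = refusal_direction(harmful, harmless)

# Toy out-projection weight with d_model = 3 rows and d_in = 2 columns.
weight = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
ablated = orthogonalize(weight, direction)
```

In this toy case the direction is the first axis, so the first row of `ablated` becomes zero while the other rows are unchanged: whatever input the layer receives, its output no longer has a component along the refusal direction.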
To make the process fully automatic, Heretic searches for ablation parameters with a TPE-based optimizer (via Optuna). The search balances two objectives: minimizing the number of refusals and minimizing the KL divergence from the original model, which serves as a proxy for how much of the original model's intelligence is preserved.
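The two quantities being balanced can be sketched as a scoring function that an optimizer would try to drive down. This is an illustrative sketch under stated assumptions, not Heretic's implementation: refusals are detected here by a hypothetical substring list, the KL divergence is computed as KL(original || ablated) over toy next-token distributions, and all names are made up for the example.

```python
import math

# Hypothetical marker list; the real detection rule is an assumption here.
REFUSAL_MARKERS = ("i can't", "i cannot", "i'm sorry")

def count_refusals(responses):
    """Number of responses containing at least one refusal marker."""
    return sum(any(m in r.lower() for m in REFUSAL_MARKERS) for r in responses)

def kl_divergence(p, q):
    """KL(p || q) for two discrete distributions over the same vocabulary."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def score(responses, original_probs, ablated_probs):
    """Return (refusal count, mean KL divergence); lower is better for both."""
    refusals = count_refusals(responses)
    mean_kl = sum(kl_divergence(p, q)
                  for p, q in zip(original_probs, ablated_probs)) / len(original_probs)
    return refusals, mean_kl

# Toy responses of an ablated model to two harmful prompts.
responses = ["I'm sorry, but I can't help with that.",
             "Sure, here is an overview."]

# Toy next-token distributions on two harmless prompts (vocabulary of 2).
original_probs = [[0.5, 0.5], [0.5, 0.5]]
ablated_probs = [[0.5, 0.5], [0.25, 0.75]]

refusals, mean_kl = score(responses, original_probs, ablated_probs)
```

A parameter search would evaluate a score like this for each candidate set of ablation parameters and prefer candidates that lower both numbers: fewer refusals, with output distributions that stay close to the original model's.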
Who it’s for
- LLM Users: People who want uncensored versions of existing models without needing to perform fine-tuning.
- AI Researchers: Those studying model internals and interpretability, as Heretic provides tools to plot residual vectors and analyze residual geometry.
Highlights
- Fully Automatic: No manual configuration or transformer expertise required to decensor a model.
- Broad Support: Works with most dense models, multimodal models, and various MoE architectures.
- Intelligence Preservation: Uses optimization to ensure the model retains as much of its original capability as possible.
- Interpretability Tools: Includes research features to generate PaCMAP projections and residual geometry tables to visualize how residuals transform between layers.
Sources
- p-e-w/heretic