Slopo: Detecting Non-Exact Code Duplication with Embedding Models
Slopo is a lightweight CLI tool for detecting code duplication that is not an exact copy-paste. Using embedding models, it identifies snippets of code that are written similarly but may sit far apart in a codebase: in different modules, or separated within large files. This targets the "hardest to detect" duplication, code that is semantically similar but not identical, which is often the most harmful to maintainability.
How the embedding-based detection works
Slopo departs from traditional duplication detection by calculating an embedding for every code unit. It then identifies pairs of code units whose embeddings are mathematically close (using cosine similarity), marking them as potential duplicates.
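As a minimal sketch of the idea (not Slopo's actual implementation), the comparison can be written as a cosine similarity over every pair of code-unit embeddings. The function names, the unit names, and the toy 3-dimensional vectors are assumptions made for illustration. A real tool would likely avoid the all-pairs loop on large codebases.

```python
import math
from itertools import combinations


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors, in [-1, 1]."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # convention chosen here for zero vectors
    return dot / (norm_a * norm_b)


def similar_pairs(embeddings, threshold):
    """Return (name_a, name_b, score) for every pair at or above threshold."""
    pairs = []
    for (name_a, vec_a), (name_b, vec_b) in combinations(embeddings.items(), 2):
        score = cosine_similarity(vec_a, vec_b)
        if score >= threshold:
            pairs.append((name_a, name_b, score))
    # Most similar pairs first.
    return sorted(pairs, key=lambda pair: pair[2], reverse=True)


# Hypothetical code units with toy vectors; real embeddings are much larger.
embeddings = {
    "billing.parse_invoice": [1.0, 0.0, 0.0],
    "reports.read_invoice": [0.9, 0.1, 0.0],
    "auth.hash_password": [0.0, 0.0, 1.0],
}

for name_a, name_b, score in similar_pairs(embeddings, threshold=0.9):
    print(f"{name_a} ~ {name_b}: {score:.3f}")
```

Only the two invoice-related units clear the 0.9 threshold here; the unrelated unit is orthogonal to both and scores 0.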
The detection pipeline
Similar code units are filtered through a two-pass process to reduce noise:
- Similarity Threshold: The tool first discards pairs whose embeddings fall below a minimum cosine similarity. Cosine similarity itself ranges from -1 to 1.
- Reranking and Boosting: The remaining pairs are grouped into clusters, and the clusters are reranked. A "boost" is applied based on how far apart the code units are in the codebase:
  - Cross-file: The boost depends on the number of directory hops required to reach the other file (up to 15%).
  - Same-file: The boost depends on the distance in lines of code (up to 10%).
This ranking system ensures that similar code located far apart in the codebase is prioritized, as these are the least obvious duplicates for human developers.
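The distance boost can be sketched as follows. Only the two caps (15% cross-file, 10% same-file) come from the description above. The linear scaling, the saturation points, the multiplicative form of the boost, and all names are assumptions made for illustration, not Slopo's actual formula.

```python
from pathlib import PurePosixPath

MAX_CROSS_FILE_BOOST = 0.15  # "up to 15%" from the description
MAX_SAME_FILE_BOOST = 0.10   # "up to 10%" from the description
HOPS_FOR_FULL_BOOST = 6      # hypothetical: hops at which the boost saturates
LINES_FOR_FULL_BOOST = 500   # hypothetical: line distance at which it saturates


def directory_hops(path_a, path_b):
    """Directory steps up from path_a's folder and down to path_b's folder."""
    dirs_a = PurePosixPath(path_a).parent.parts
    dirs_b = PurePosixPath(path_b).parent.parts
    common = 0
    for part_a, part_b in zip(dirs_a, dirs_b):
        if part_a != part_b:
            break
        common += 1
    return (len(dirs_a) - common) + (len(dirs_b) - common)


def boosted_score(similarity, path_a, line_a, path_b, line_b):
    """Scale a similarity score up by a capped, distance-based boost."""
    if path_a == path_b:
        fraction = min(abs(line_a - line_b) / LINES_FOR_FULL_BOOST, 1.0)
        boost = MAX_SAME_FILE_BOOST * fraction
    else:
        hops = directory_hops(path_a, path_b)
        fraction = min(hops / HOPS_FOR_FULL_BOOST, 1.0)
        boost = MAX_CROSS_FILE_BOOST * fraction
    return similarity * (1.0 + boost)


near = boosted_score(0.80, "src/billing/invoice.py", 10, "src/billing/invoice.py", 40)
far = boosted_score(0.80, "src/billing/invoice.py", 10, "tools/legacy/export/csv.py", 200)
print(f"near: {near:.3f}, far: {far:.3f}")
```

With equal raw similarity, the pair spread across distant directories ends up ranked above the pair sitting a few lines apart in one file, which matches the prioritization described above.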
Handling exact copies
While Slopo focuses on non-exact duplication, it also detects exact copies. To keep reports clean, identical code is shown once with a list of all paths where it appears, rather than repeating the same snippet multiple times.
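One way to collapse exact copies into a single report entry is to group code units by a hash of their source text. This is a sketch under the assumption that "identical" means byte-for-byte equal text; how Slopo actually normalizes or compares code is not stated. The function name and sample paths are hypothetical.

```python
import hashlib
from collections import defaultdict


def group_exact_copies(units):
    """Group (path, source) pairs whose source text is identical.

    Returns a list of (source, [paths]) entries, one per snippet that
    appears in more than one place.
    """
    paths_by_digest = defaultdict(list)
    source_by_digest = {}
    for path, source in units:
        digest = hashlib.sha256(source.encode("utf-8")).hexdigest()
        paths_by_digest[digest].append(path)
        source_by_digest.setdefault(digest, source)
    return [
        (source_by_digest[digest], paths)
        for digest, paths in paths_by_digest.items()
        if len(paths) > 1
    ]


units = [
    ("src/a.py", "def area(w, h):\n    return w * h\n"),
    ("src/b.py", "def area(w, h):\n    return w * h\n"),
    ("src/c.py", "def perimeter(w, h):\n    return 2 * (w + h)\n"),
]

for source, paths in group_exact_copies(units):
    print(source.rstrip())
    print("  found in:", ", ".join(paths))
```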
Supported languages and technical requirements
Slopo supports a wide range of popular programming languages, including:
- Python
- TypeScript
- JavaScript
- Java
- Kotlin
- C#
- Go
- Rust
Embedding model configuration
Embeddings are generated via external providers compatible with LiteLLM. For optimal results, the author recommends models specifically dedicated to code, such as those from Voyage AI. The tool allows for flexible configuration of embedding dimensions and batch sizes to optimize performance.
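To illustrate why batch size matters, here is a minimal batching sketch. The `embed_batch` callable is a hypothetical stand-in for whatever provider request is configured, and `fake_embed_batch` exists only so the example runs offline; neither reflects Slopo's internals.

```python
from typing import Callable, List, Sequence

Vector = List[float]


def embed_in_batches(
    texts: Sequence[str],
    embed_batch: Callable[[List[str]], List[Vector]],
    batch_size: int = 64,
) -> List[Vector]:
    """Embed texts in fixed-size batches, preserving input order."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    vectors: List[Vector] = []
    for start in range(0, len(texts), batch_size):
        batch = list(texts[start:start + batch_size])
        result = embed_batch(batch)
        if len(result) != len(batch):
            raise RuntimeError("provider returned a different number of vectors")
        vectors.extend(result)
    return vectors


def fake_embed_batch(batch: List[str]) -> List[Vector]:
    """Offline stand-in for a provider call: one toy 2-d vector per text."""
    return [[float(len(text)), float(text.count("\n"))] for text in batch]


snippets = ["def a(): pass", "def b():\n    return 1", "x = 1", "y = 2\nz = 3", "pass"]
vectors = embed_in_batches(snippets, fake_embed_batch, batch_size=2)
print(len(vectors), "vectors from", len(snippets), "snippets")
```

Larger batches mean fewer requests to the provider, while smaller batches keep each request under provider size limits; the right value depends on the provider and model in use.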
Integration into development workflows
Slopo is designed to be used as part of a larger refactoring workflow, often in conjunction with AI coding agents.
The recommended workflow
- Initial Analysis: Run slopo index, slopo embed, and slopo analyze to generate an initial report.
- Filtering: Use an AI coding agent to review the clusters of similar code and determine if they are true duplicates that require refactoring.
- Ignoring: Discarded clusters are added to slopo.ignore.txt, which can be committed to a Git repository to ensure the team shares the same reviewed results.
- Refactoring: The remaining verified duplicates are used as the basis for refactoring.
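The initial analysis step can be scripted. The sketch below runs the three subcommands named above in order and stops at the first failure. It assumes the commands need no extra arguments and are run from the repository root, which the description does not confirm; `run_pipeline` and its `runner` parameter are hypothetical names.

```python
import subprocess

# Subcommands named in the workflow above, in the order they are listed.
COMMANDS = [
    ["slopo", "index"],
    ["slopo", "embed"],
    ["slopo", "analyze"],
]


def run_pipeline(runner=subprocess.run):
    """Run each step in order; a failing step aborts the rest.

    With the default runner, check=True makes subprocess.run raise
    CalledProcessError when a command exits with a non-zero status.
    """
    for command in COMMANDS:
        runner(command, check=True)
    return len(COMMANDS)
```

Calling `run_pipeline()` from the repository root would execute the three steps. The `runner` parameter exists so the sequence can be exercised in tests without Slopo installed.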
Key configuration parameters
Users can tune the tool's sensitivity using several parameters:
- similarity_threshold: Adjusts the minimum cosine similarity for the first pass.
- rerank_threshold: Adjusts the minimum similarity after the codebase distance boost is applied.
- body_node_count_threshold: Sets the minimum number of AST nodes in a code unit body. This ensures that only code units of a certain complexity are analyzed, preventing the report from being cluttered with trivial, small snippets.
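The effect of a node-count threshold can be illustrated with Python's standard `ast` module. This is only a sketch: which parser Slopo uses and exactly which nodes it counts are not stated, so the counting rule here (every node reachable from the function body, including context nodes) is an assumption, as are the function names.

```python
import ast

SAMPLE = '''
def getter(self):
    return self.x

def total(items):
    result = 0
    for item in items:
        if item.active:
            result += item.price * item.qty
    return result
'''


def body_node_count(func):
    """Count every AST node reachable from the statements in a function body."""
    return sum(1 for statement in func.body for _ in ast.walk(statement))


def units_above_threshold(source, threshold):
    """Names of functions whose body has at least `threshold` AST nodes."""
    tree = ast.parse(source)
    return [
        node.name
        for node in ast.walk(tree)
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef))
        and body_node_count(node) >= threshold
    ]


print(units_above_threshold(SAMPLE, threshold=10))
```

The one-line getter falls below the threshold and is skipped, while the function with a loop and a condition is kept for analysis.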
Community insights and use cases
Discussion among developers suggests several high-value applications for Slopo's approach:
- Pre-push hooks: Integrating the tool into a pre-push hook to maintain codebase cleanliness after an initial cleanup.
- Semantic duplication: Identifying duplication that is semantic rather than a direct copy-paste, which is particularly useful before major refactors.
- Code review integration: Using similar snippet detection during code review to alert developers to existing constructs in the repo that might be doing the same thing.
"I built Slopo to solve one specific problem: finding similar code that is hardest to detect by other tools, coding AI agents, and humans... sometimes most of the detected duplicates are false positives, but the remaining ones are strong candidates to refactor or even bugs."
— rkochanowski, creator of Slopo