You Can't Unit Test for Taste: Building a Points of Interest Pipeline
Building a feature that requires "taste", such as deciding which landmarks are actually interesting to a user, cannot be verified with traditional unit tests because subjective judgments have no objective ground truth. Large Language Models (LLMs) can provide a subjective rating based on their training data, but they are prone to hallucinations and biases, so they work best as a supporting signal rather than as a primary source of truth.
The Technical Stack for Geospatial Data Processing
To build a points of interest (POI) pipeline for the In the Long Run app, a combination of Python, Apache Parquet, and DuckDB was used to handle large-scale geospatial datasets.
- Data Source: GeoNames provided the raw location and category data under a Creative Commons license.
- Storage and Querying: Processed data was stored in Apache Parquet files for efficiency, with DuckDB serving as the query layer for SQL-based analysis.
- Geo-Calculations: The pipeline used Shapely and Pyproj to compute bounding boxes and the distance of each POI from a specific running route (with a default cutoff of 50 km).
- AI Integration: Claude (Anthropic) was used as a coding agent to help design the project plan and implement the pipeline steps.
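To make the geo-calculation step concrete, here is a minimal sketch of a bounding-box pre-filter followed by a distance cutoff. The article names Shapely and Pyproj for this step; the sketch deliberately swaps in standard-library haversine math so it runs without dependencies. The function names, sample coordinates, and the simplification of measuring distance to route vertices (rather than to route segments) are all assumptions, and the antimeridian is not handled.

```python
import math

EARTH_RADIUS_KM = 6371.0088
KM_PER_DEGREE_LAT = 111.32  # approximate


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance between two points in kilometres."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlmb = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))


def bounding_box(route, pad_km):
    """Return (south, west, north, east) around the route, padded by pad_km."""
    lats = [lat for lat, _ in route]
    lons = [lon for _, lon in route]
    pad_lat = pad_km / KM_PER_DEGREE_LAT
    # Longitude degrees shrink toward the poles, so pad using the widest latitude.
    widest_lat = min(max(abs(min(lats)), abs(max(lats))) + pad_lat, 89.0)
    pad_lon = pad_km / (KM_PER_DEGREE_LAT * math.cos(math.radians(widest_lat)))
    return (min(lats) - pad_lat, min(lons) - pad_lon, max(lats) + pad_lat, max(lons) + pad_lon)


def pois_near_route(pois, route, max_km=50.0):
    """Keep POIs within max_km of any route vertex (a simplification)."""
    south, west, north, east = bounding_box(route, max_km)
    nearby = []
    for name, lat, lon in pois:
        # Cheap rejection first: skip anything outside the padded box.
        if not (south <= lat <= north and west <= lon <= east):
            continue
        dist = min(haversine_km(lat, lon, rlat, rlon) for rlat, rlon in route)
        if dist <= max_km:
            nearby.append((name, round(dist, 1)))
    return nearby


# Made-up coordinates, for illustration only.
SAMPLE_ROUTE = [(64.0, -22.0), (64.0, -21.0), (64.0, -20.0)]
SAMPLE_POIS = [
    ("near", 64.1, -21.0),     # about 11 km from the route
    ("corner", 64.44, -23.0),  # inside the padded box, but beyond 50 km
    ("far", 66.0, -21.0),      # rejected by the bounding box
]
print(pois_near_route(SAMPLE_POIS, SAMPLE_ROUTE))
```

The box check is only an optimization: it discards most rows cheaply before the more expensive distance calculation runs. A production version would likely project coordinates and measure distance to the route line itself, which is what libraries such as Shapely and Pyproj are suited for.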
Filtering for Notability and Overcoming Bias
Raw geospatial data is often too noisy for a curated user experience. A multi-stage filtering process is required to move from millions of rows to a manageable set of notable landmarks.
Initial Filtering and Notability Signals
Filtering began by excluding administrative divisions (countries, states) and selecting specific feature codes such as parks, historic sites, castles, and monuments. To identify "notable" sites, the pipeline used Wikipedia links found in the GeoNames alternateNames.txt dataset as its primary notability signal.
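The initial filter can be sketched roughly as follows. This is an illustration under assumptions, not the pipeline's actual code: the set of feature codes, the function names, and the sample rows (with invented IDs and URLs) are hypothetical. It relies on two GeoNames conventions: feature class "A" covers administrative divisions, and alternateNames rows whose language field is "link" carry URLs, mostly to Wikipedia.

```python
import csv
import io

# GeoNames feature codes for parks, historical sites, castles, and monuments.
# The exact selection used by the pipeline is not given; this set is an assumption.
KEEP_FEATURE_CODES = {"PRK", "HSTS", "CSTL", "MNMT"}


def wikipedia_links(alternate_names_tsv):
    """Map geonameid -> list of Wikipedia URLs from alternateNames-style rows.

    Columns used: [1] geonameid, [2] isolanguage, [3] alternate name.
    """
    links = {}
    reader = csv.reader(io.StringIO(alternate_names_tsv), delimiter="\t", quoting=csv.QUOTE_NONE)
    for row in reader:
        if len(row) < 4:
            continue
        geoname_id, iso_language, value = row[1], row[2], row[3]
        if iso_language == "link" and "wikipedia.org" in value:
            links.setdefault(geoname_id, []).append(value)
    return links


def notable_candidates(places, links):
    """Drop administrative divisions, keep selected codes that have a Wikipedia link."""
    kept = []
    for place in places:
        if place["feature_class"] == "A":
            continue
        if place["feature_code"] not in KEEP_FEATURE_CODES:
            continue
        if place["geonameid"] not in links:
            continue
        kept.append(place["name"])
    return kept


# Invented sample rows, for illustration only.
SAMPLE_ALTERNATE_NAMES = (
    "1\t100\tlink\thttps://en.wikipedia.org/wiki/Example_Castle\n"
    "2\t100\tde\tBeispielburg\n"
    "3\t200\tlink\thttps://en.wikipedia.org/wiki/Example_State\n"
)
SAMPLE_PLACES = [
    {"geonameid": "100", "name": "Example Castle", "feature_class": "S", "feature_code": "CSTL"},
    {"geonameid": "200", "name": "Example State", "feature_class": "A", "feature_code": "ADM1"},
    {"geonameid": "300", "name": "Unlinked Park", "feature_class": "L", "feature_code": "PRK"},
]

links = wikipedia_links(SAMPLE_ALTERNATE_NAMES)
print(notable_candidates(SAMPLE_PLACES, links))
```

Only the castle survives: the state is an administrative division, and the park has no Wikipedia link to act as a notability signal.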
The "Anglophone Bias"
An early realization in the pipeline was that relying on English Wikipedia links created a geographic bias. For example, Route 66 (3,787 km) yielded 14,181 POIs, while the Iceland ring road (1,321 km) yielded only 511. This indicated that the data was reflecting where English speakers live and edit Wikipedia rather than the actual density of interesting sites.
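Normalizing those counts by route length shows that the gap is not just a matter of one route being longer. The short calculation below uses only the figures quoted above; the variable names are illustrative.

```python
# Figures quoted in the text above.
routes = {
    "Route 66": {"length_km": 3787, "poi_count": 14181},
    "Iceland ring road": {"length_km": 1321, "poi_count": 511},
}

# POIs per kilometre of route.
density = {name: r["poi_count"] / r["length_km"] for name, r in routes.items()}
ratio = density["Route 66"] / density["Iceland ring road"]

for name, value in density.items():
    print(f"{name}: {value:.2f} POIs per km")
print(f"Density ratio: {ratio:.1f}x")
```

Route 66 is a little under three times as long as the ring road, yet it comes out at about 3.74 POIs per km against about 0.39, a density gap of nearly ten to one.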
The Role of LLMs: Subjective Taste vs. Factual Accuracy
LLMs were integrated into the pipeline to provide a "subjective" rating for POIs, but they proved unreliable for generating factual summaries.
Hallucinations in Enrichment
Attempts to use Anthropic's Haiku model to generate summaries led to significant hallucinations. The model occasionally misidentified locations (e.g., treating a Central Park in Illinois as the one in Manhattan) or fabricated statistics such as populations and mountain heights. Consequently, the project reverted to the original Wikipedia summaries, prioritizing correctness over readability.
LLMs as a Rating Tool
Although it was poor at factual writing, the LLM did succeed in providing a subjective significance score. This score helped lift "interesting" points of interest above those that simply had Wikipedia pages in many languages, often automatically translated, which would otherwise skew the results toward generic populated places.
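One way such a score could be combined with link counts is a weighted blend. The article does not give the actual formula, so the sketch below is an assumption throughout: the function name, the log scaling, the 0-to-1 score range, the default weight, and the sample values are all hypothetical.

```python
import math


def blended_score(wiki_link_count, llm_score, llm_weight=0.5, max_links=300):
    """Blend a subjective LLM score (assumed to be 0..1) with a link-count signal.

    The link count is log-scaled so that a page translated into hundreds of
    languages does not dominate, then capped at 1.0. max_links is an assumed cap.
    """
    link_signal = min(math.log1p(wiki_link_count) / math.log1p(max_links), 1.0)
    return llm_weight * llm_score + (1 - llm_weight) * link_signal


# Invented values, for illustration only.
candidates = [
    {"name": "Generic village", "wiki_links": 120, "llm_score": 0.2},
    {"name": "Notable landmark", "wiki_links": 25, "llm_score": 0.9},
]

by_links = sorted(candidates, key=lambda c: c["wiki_links"], reverse=True)
by_blend = sorted(
    candidates,
    key=lambda c: blended_score(c["wiki_links"], c["llm_score"]),
    reverse=True,
)

print("By link count:", [c["name"] for c in by_links])
print("By blended score:", [c["name"] for c in by_blend])
```

Ranked by link count alone, the village comes first. With the blended score, the landmark overtakes it, which is the kind of lift described above.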
The Challenge of Verifying "Taste"
Unlike functional requirements, "taste"—the quality of what makes a POI feel right for a specific route—cannot be validated with red/green unit tests.
Per-Route Variance
Data requirements vary wildly by geography. A route through a densely populated area can quickly become a "population map" of every small village if not tuned. To solve this, per-route parameters were introduced, including:
- Custom population filters.
- A higher weight for the subjective LLM score relative to the objective wiki link count.
- Geographic radius filters to ensure an even spread of points between urban clusters and rural paths.
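The three kinds of per-route parameters listed above might be grouped into a small configuration object. The sketch below is hypothetical: the field names, the defaults, the grid-cell approach to spreading points, and the sample POIs are assumptions rather than the pipeline's actual design.

```python
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class RouteParams:
    """Hypothetical per-route tuning knobs."""
    min_population: int = 0   # populated places below this are dropped
    llm_weight: float = 0.5   # share of the score taken from the LLM rating
    cell_deg: float = 0.5     # grid cell size in degrees, used to spread points
    max_per_cell: int = 1     # cap on selected POIs per grid cell


def select_pois(pois, params):
    """Filter by population, rank by blended score, and cap POIs per grid cell."""
    def score(p):
        return params.llm_weight * p["llm_score"] + (1 - params.llm_weight) * p["link_signal"]

    # Non-settlements (population None) always pass the population filter.
    eligible = [
        p for p in pois
        if p["population"] is None or p["population"] >= params.min_population
    ]

    per_cell = {}
    chosen = []
    for p in sorted(eligible, key=score, reverse=True):
        cell = (math.floor(p["lat"] / params.cell_deg), math.floor(p["lon"] / params.cell_deg))
        if per_cell.get(cell, 0) >= params.max_per_cell:
            continue  # this cell is already represented by a higher-scoring POI
        per_cell[cell] = per_cell.get(cell, 0) + 1
        chosen.append(p["name"])
    return chosen


# Invented sample data; both scores are assumed to be on a 0..1 scale.
SAMPLE_POIS = [
    {"name": "Small village", "population": 300, "llm_score": 0.1, "link_signal": 0.8, "lat": 50.1, "lon": 8.1},
    {"name": "Town", "population": 40000, "llm_score": 0.4, "link_signal": 0.7, "lat": 50.2, "lon": 8.2},
    {"name": "Castle", "population": None, "llm_score": 0.9, "link_signal": 0.5, "lat": 50.3, "lon": 8.3},
    {"name": "Remote waterfall", "population": None, "llm_score": 0.8, "link_signal": 0.3, "lat": 51.6, "lon": 9.7},
]

dense_route = RouteParams(min_population=5000, llm_weight=0.7)
print(select_pois(SAMPLE_POIS, dense_route))
```

With these settings the small village is removed by the population filter, and the town loses its grid cell to the higher-scoring castle, leaving the castle and the waterfall. Tuning still means changing the numbers per route and looking at the result, since nothing in the code says whether a given selection feels right.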
The Limits of Automation
Because there is no ground truth for what constitutes an "interesting" sight, the evaluation of success remains manual and iterative. As noted in the community discussion, taste is often "the part of the spec you forgot to write down, plus the part you couldn't write down even if you tried."
"Verification becomes hard to reason about because there is no ground truth for points of interest, there are no red/green unit tests for taste."
Community Insights on Taste and AI
Discussion among developers suggests that while AI can augment the process, the human element of "selection" remains the critical path.
- Externalizing Taste: Some argue that taste can only be unit-tested if it can be fully externalized into a specification, which is often impossible because humans are not "hashmaps."
- Governance over Generation: The value of AI agents is shifting from generation to governance—where the human's role is to cut the 200 lines of AI-generated output down to the 80 that actually fit the desired aesthetic or functional goal.
- Alternative Signals: For those seeking more objective notability signals, tools like QRank (which aggregates page views across Wikimedia projects) provide a more data-driven alternative to simple link counts.