HackerRank Hiring Agent: The Risks of Non-Deterministic AI Resume Scoring

AI-Driven Resume Scoring is Non-Deterministic

Using large language models (LLMs) to assign numerical scores to resumes produces high variance: the same candidate can receive very different scores across repeated evaluations. In a test of HackerRank's open-source hiring-agent, a single resume scored between 66 and 99 out of 100 across 100 runs using the default gemma3:4b model at a low temperature (0.1). Switching to a more capable model, gemini-3.1-flash-lite, did not remove the inconsistency; scores still varied, clustering between 48 and 64.
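The spread described above can be measured with a simple repeat-and-summarize harness. The following is a minimal sketch, not the harness used in the original test: score_resume is a hypothetical callable standing in for whatever function sends the resume to the model and returns a numeric score.

```python
import statistics
from typing import Callable, Dict


def score_spread(
    score_resume: Callable[[str], float],  # hypothetical: wraps the model call
    resume_text: str,
    runs: int = 100,
) -> Dict[str, float]:
    """Score the same resume `runs` times and summarize the spread."""
    scores = [score_resume(resume_text) for _ in range(runs)]
    return {
        "min": min(scores),
        "max": max(scores),
        "range": max(scores) - min(scores),
        "mean": statistics.mean(scores),
        "stdev": statistics.pstdev(scores),  # population standard deviation
    }
```

A deterministic scorer would report a range and standard deviation of zero for any resume, so any nonzero range on identical input is variance introduced by the model rather than by the candidate.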

This non-determinism is a fundamental design flaw rather than a tunable bug. Because LLM output is sampled rather than computed deterministically, the model struggles to make consistent judgment calls on subjective criteria. For example, it might describe a project as "lacking architectural complexity" in one run and "demonstrating real-world deployment" in another, so the result is closer to a "vibe check" than a standardized evaluation.

Critical Flaws in the Scoring Rubric

The hiring-agent employs a scoring system that heavily weights proxies for talent over actual professional experience, creating a bias toward specific candidate profiles.

Over-Weighting of Side Projects

Of the 100 base points (a further 20 bonus points are available), 65 come from open-source contributions (35 points) and personal projects (30 points), while professional work experience is capped at 25 points. This weighting penalizes experienced engineers who do not maintain public GitHub repositories or personal projects in their spare time, potentially filtering out highly qualified senior talent in favor of junior candidates with active open-source profiles.
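The weighting described above is simple arithmetic, and a short sketch makes the imbalance explicit. The point values are the ones reported in this article; the remaining 10 base points are not broken down here, so the sketch groups them under an assumed "other" label.

```python
# Point allocations as reported in the article. The last 10 base points are
# not itemized there, so they are grouped under an assumed "other" label.
BASE_TOTAL = 100
BONUS_POINTS = 20
CATEGORY_POINTS = {
    "open_source": 35,
    "personal_projects": 30,
    "work_experience": 25,
}
CATEGORY_POINTS["other"] = BASE_TOTAL - sum(CATEGORY_POINTS.values())


def share_of_base(*categories: str) -> float:
    """Fraction of the 100 base points controlled by the given categories."""
    return sum(CATEGORY_POINTS[c] for c in categories) / BASE_TOTAL


side_project_share = share_of_base("open_source", "personal_projects")
work_share = share_of_base("work_experience")
print(f"side projects: {side_project_share:.0%}, work experience: {work_share:.0%}")
# prints "side projects: 65%, work experience: 25%"
```

Under these numbers, side projects carry 2.6 times the weight of professional experience.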

Lack of Evaluation Anchors

The prompts used to evaluate experience are underspecified, lacking rubrics or examples to differentiate between score tiers. Testing revealed that both a junior engineer with a single internship and a principal engineer with a decade of experience received a perfect 25/25 for work experience. Without anchors to define what constitutes a 15 versus a 25, the score becomes a meaningless metric.
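One way to picture what an anchored rubric could look like is the sketch below. The tiers and their descriptions are invented purely for illustration and do not come from the hiring-agent repository; the point is only that each score band gets a concrete description that can be embedded in the prompt and checked for gaps.

```python
from typing import List, Tuple

# HYPOTHETICAL anchors for a 25-point work-experience score. These tiers are
# invented for illustration and are not taken from the hiring-agent project.
WORK_EXPERIENCE_ANCHORS: List[Tuple[int, int, str]] = [
    (0, 5, "No professional experience listed."),
    (6, 12, "Internships or under one year of professional work."),
    (13, 19, "Several years of professional work with clear individual ownership."),
    (20, 25, "Long track record that includes technical leadership of larger systems."),
]


def render_anchors(anchors: List[Tuple[int, int, str]]) -> str:
    """Format score tiers as plain-text lines to embed in an evaluation prompt."""
    return "\n".join(f"{low}-{high}: {text}" for low, high, text in anchors)


def tiers_cover_range(anchors: List[Tuple[int, int, str]], maximum: int) -> bool:
    """Check that tiers run from 0 to `maximum` with no gaps or overlaps."""
    expected_low = 0
    for low, high, _ in anchors:
        if low != expected_low or high < low:
            return False
        expected_low = high + 1
    return expected_low == maximum + 1
```

Anchors like these do not make a model deterministic, but they give it, and any human auditing the output, a stated reason why an intern and a principal engineer should not both land on 25.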

Technical Implementation Red Flags

Technical analysis of the hiring-agent repository reveals several implementation errors that undermine the reliability of the AI's judgments:

  • Monolithic Prompting: The system attempts to perform all evaluation steps in a single call rather than breaking the task into sub-components (e.g., separate prompts for open-source assessment and experience assessment).
  • Subjective Adjectives: The prompts rely on vague terms like "significant contribution" or "substantial community involvement," which the LLM must interpret without clear definitions.
  • Hallucinations: Users have reported the system awarding bonus points for credentials the candidate never claimed, such as participation in Google Summer of Code (GSoC).
  • Ineffective Bias Mitigation: The system instructs the LLM to ignore demographic information. Experts note that since LLMs are statistical distribution generators, the input (including names) still affects the output; the only reliable way to prevent bias is to strip the data before it reaches the model.
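The last point, stripping data before it reaches the model, can be sketched as a preprocessing step. This is a minimal illustration under stated assumptions, not code from the repository: it assumes the resume has already been parsed into named text fields, and the field names are hypothetical.

```python
import re
from typing import Dict

# Assumption: the resume is already parsed into named text fields. These field
# names are hypothetical and are not taken from the hiring-agent code.
DEMOGRAPHIC_FIELDS = {"name", "email", "phone", "address", "date_of_birth", "photo_url"}

EMAIL_PATTERN = re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+")


def strip_demographics(resume: Dict[str, str]) -> Dict[str, str]:
    """Return a copy of the resume without demographic fields, for model input."""
    candidate_name = resume.get("name", "").strip()
    cleaned = {}
    for field, value in resume.items():
        if field in DEMOGRAPHIC_FIELDS:
            continue  # dropped entirely, so the model never sees it
        value = EMAIL_PATTERN.sub("[REDACTED]", value)
        if candidate_name:
            # Also mask the name where it reappears inside free text.
            value = re.sub(re.escape(candidate_name), "[REDACTED]", value, flags=re.IGNORECASE)
        cleaned[field] = value
    return cleaned
```

A filter like this only removes the obvious identifiers. Usernames in repository URLs, school names, and similar proxies would still reach the model unless they are handled separately.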

Industry Perspective and Counterpoints

The Volume Problem

Some hiring managers argue that despite the inaccuracy, AI screening is a necessary evil due to the sheer volume of applications. One commenter noted that even a 35% success rate in identifying viable candidates is preferable to an exhausted human recruiter missing qualified talent in a pool of hundreds of applicants per hour.

The Developer's Intent

The HackerRank CTO clarified that the hiring-agent was not intended as a full Applicant Tracking System (ATS) or a tool for final rejection. It was designed as a ranking tool for intern applications (50,000–60,000 per year) to help humans decide which resumes to read first. He noted that the production version uses a top-tier Gemini model and a very low cutoff score to ensure the vast majority of candidates still reach human review.

Legal and Ethical Risks

There are significant concerns regarding the legality of such tools in the EU, where anti-discrimination laws may prohibit the use of non-transparent, biased, or stochastic systems for employment decisions. Because the "randomness" in AI scoring is not independent of the resume content, it may be viewed as systematic bias rather than random filtering.
