GPT-5.5 Codex: Reasoning-Token Clustering and Performance Degradation
GPT-5.5 Codex exhibits reasoning-token clustering
GPT-5.5 Codex shows a performance regression in which reasoning output token counts cluster at fixed values roughly 518 tokens apart. Runs that land on these values are strongly correlated with incorrect results on complex reasoning tasks: the model appears to "short-circuit" its thinking at these thresholds instead of continuing to reason until it reaches a solution.
According to the original report by user @maille, this behavior is specific to GPT-5.5; it is significantly less prevalent in GPT-5.4 and almost entirely absent in versions 5.2 and 5.3.
Technical analysis of token clustering
Reported cluster values include 516, 1034, and 1552 reasoning tokens, each 518 tokens above the previous one. Community analysis suggests these numbers may result from server-side throughput optimizations, specifically batching reasoning inference in multiples of 512 tokens.
One theory, proposed by @tyingq, is that the 516 mark corresponds to an initial 512-byte buffer plus a 4-byte header (512 + 4 = 516), and that each subsequent step of 518 corresponds to an additional buffer plus metadata (such as linked-list references). The theory maps token counts onto byte sizes and remains unconfirmed.
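The arithmetic behind the reported values can be sketched in a few lines of Python. This is only an illustration: the constants come from the numbers quoted in the discussion, the function names are hypothetical, and nothing here reflects confirmed server behavior.

```python
# Values quoted in the community discussion; not confirmed by OpenAI.
FIRST_CLUSTER = 516  # first reported cluster value
SPACING = 518        # reported gap between consecutive cluster values
BUFFER = 512         # buffer size assumed by the @tyingq theory


def cluster_values(count):
    """Return the first `count` cluster values implied by the report."""
    return [FIRST_CLUSTER + k * SPACING for k in range(count)]


def nearest_cluster(tokens):
    """Return (cluster_value, distance) for the cluster closest to `tokens`."""
    k = max(0, round((tokens - FIRST_CLUSTER) / SPACING))
    value = FIRST_CLUSTER + k * SPACING
    return value, abs(tokens - value)


def is_on_cluster(tokens, tolerance=0):
    """True if `tokens` lies within `tolerance` of a cluster value."""
    return nearest_cluster(tokens)[1] <= tolerance


# Overheads implied by the theory: 4 on the first buffer, 6 on each later one.
first_overhead = FIRST_CLUSTER - BUFFER
step_overhead = SPACING - BUFFER
```

Calling `cluster_values(3)` reproduces the three values quoted above, and `is_on_cluster` gives a simple way to flag a run whose reasoning token count sits exactly on one of them.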
Impact on reasoning quality and reliability
Users report that when the model stops at one of these clustering thresholds, it frequently returns incorrect answers to complex puzzles or coding tasks. In contrast, when the model uses a larger number of reasoning tokens (e.g., 6,000 to 8,000), it typically arrives at the correct result.
Observed failure patterns
- Short-circuiting: In one test case involving a probability puzzle, a user reported that 5 out of 10 runs resulted in exactly 516 reasoning tokens and an incorrect answer, while runs with higher token counts were successful.
- Intermittent quality drops: Multiple users report "step jumps" in quality, where the model intermittently produces "incredibly stupid implementations", leading some to migrate to alternative models like Claude.
- Version comparison: Some users noted that while GPT-5.5 is generally more capable, it consumes significantly more tokens than GPT-5.3, which they regard as the most balanced version for code quality and token efficiency.
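The short-circuiting pattern above can be checked on one's own runs by splitting them into those that land on a cluster value and those that do not, then comparing accuracy. The following is a minimal sketch under stated assumptions: `accuracy_by_cluster` is a hypothetical helper, the cluster constants are the values quoted in the report, and the caller supplies the per-run token counts and correctness labels.

```python
def accuracy_by_cluster(runs, first=516, spacing=518, tolerance=0):
    """Compare accuracy of runs on and off the reported cluster values.

    runs: iterable of (reasoning_tokens, is_correct) pairs.
    Returns a dict mapping "on_cluster" and "off_cluster" to the fraction
    of correct runs, or None for a bucket with no runs.
    """
    tallies = {"on_cluster": [0, 0], "off_cluster": [0, 0]}  # [correct, total]
    for tokens, is_correct in runs:
        offset = tokens - first
        # Distance to the nearest value of the form first + k * spacing.
        distance = min(offset % spacing, -offset % spacing)
        on_cluster = offset >= -tolerance and distance <= tolerance
        bucket = tallies["on_cluster" if on_cluster else "off_cluster"]
        bucket[0] += int(bool(is_correct))
        bucket[1] += 1
    return {
        name: (correct / total if total else None)
        for name, (correct, total) in tallies.items()
    }
```

A large gap between the two fractions on the same prompt would be consistent with the reports, although a handful of runs is too few to draw firm conclusions.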
Community evidence and verification
Users have developed methods to check for this degradation using the Codex CLI. One user provided a Python script that builds a histogram of reasoning_output_tokens from past sessions; the resulting histogram showed a visible spike at the 516-token mark.
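The user's script is not reproduced here. A minimal sketch of the same idea follows, assuming that session logs are JSON Lines and that each record stores the count under a `reasoning_output_tokens` key at some nesting depth. Only the field name comes from the discussion; the log layout and all function names are assumptions.

```python
import json
from collections import Counter


def iter_reasoning_tokens(node):
    """Yield every integer stored under 'reasoning_output_tokens' in nested JSON."""
    if isinstance(node, dict):
        for key, value in node.items():
            is_count = isinstance(value, int) and not isinstance(value, bool)
            if key == "reasoning_output_tokens" and is_count:
                yield value
            else:
                yield from iter_reasoning_tokens(value)
    elif isinstance(node, list):
        for item in node:
            yield from iter_reasoning_tokens(item)


def token_histogram(jsonl_lines, bin_width=1):
    """Count reasoning token values, grouped into bins of `bin_width`."""
    counts = Counter()
    for line in jsonl_lines:
        line = line.strip()
        if not line:
            continue
        try:
            record = json.loads(line)
        except json.JSONDecodeError:
            continue  # skip lines that are not valid JSON
        for tokens in iter_reasoning_tokens(record):
            counts[(tokens // bin_width) * bin_width] += 1
    return counts


def render(counts, width=40):
    """Render the histogram as text, one row per bin, scaled to `width`."""
    if not counts:
        return ""
    peak = max(counts.values())
    rows = []
    for value in sorted(counts):
        bar = "#" * max(1, round(counts[value] * width / peak))
        rows.append(f"{value:>7} {bar} {counts[value]}")
    return "\n".join(rows)
```

`token_histogram` accepts any iterable of text lines, so an open file handle for a session log can be passed directly. The location and exact schema of Codex CLI session files are not stated in the discussion and would need to be checked locally.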
"I’ve definitely experienced step jumps down in quality on an almost daily basis... The experience of relying on codex’s outstandingly thorough coding earlier in the year has evaporated for me." — @zenapollo
"This explains so much why gpt 5.5 has been so bad lately... it was really puzzling why it struggled so much where when it first came out it was one shotting stuff totally amazing." — @zuzululu
Comparison with other models
Some users compared the current state of GPT-5.5 Codex to previous regressions seen in other frontier models, such as Claude Code in April. Others noted that the "black box" nature of GPT's encrypted reasoning contents makes it harder to debug compared to models like DeepSeek or GLM, where reasoning is more transparent.