Using Claude Code and Opus 4.8 for MRI Analysis: A Case Study in AI Second Opinions
AI-Driven MRI Analysis Reveals Conflict Between Human and Machine Diagnosis
Using Claude Code with the Opus 4.8 (xhigh) model to analyze a shoulder MRI resulted in a diagnosis that directly contradicted a human orthopedist. While the human doctor diagnosed a Grade III partial-thickness tear of the subscapularis tendon, the AI analysis concluded the tendon was intact, reporting only mild insertional tendinosis. This discrepancy highlights the current limitations of Large Language Models (LLMs) in medical imaging and the psychological tension created when AI challenges professional medical advice.
Technical Implementation: Analyzing DICOM Data with Claude Code
To perform the analysis, the user ran Claude Code rather than the standard Claude.ai chat interface. The distinction matters because Claude Code lets the model execute code, install software packages, and work iteratively on local files, all of which are essential for processing complex medical data.
The Workflow
- Data Input: The input consisted of a standard DICOM export containing several hundred extensionless files totaling approximately 266 MB.
- Environment Setup: The model was instructed to identify and install all required Python packages for DICOM image processing and analysis.
- Iterative Planning: The user provided minimal context ("right shoulder pain for 2–3 weeks") and tasked the model with creating a detailed execution plan before analyzing the images.
- Arbitration Process: After the initial conflict, the user implemented an "arbitration" phase. This involved providing the AI with the human report and a separate discussion from GPT 5.5 Pro regarding physical movements and symptoms. The AI used multiple sub-agents to conduct unbiased analyses, eventually siding with its own initial finding that no discrete tear existed.
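Because the export consisted of extensionless files, a first step in a workflow like this is usually to work out which files are actually DICOM. The source does not show the code the model wrote, so the following is only a minimal sketch of one plausible approach: it relies on the fact that DICOM Part 10 files carry the marker "DICM" immediately after a 128-byte preamble. The function names is_dicom_file and find_dicom_files are hypothetical, chosen for illustration.

```python
from pathlib import Path

DICOM_PREAMBLE_LEN = 128  # Part 10 files begin with a 128-byte preamble
DICOM_MAGIC = b"DICM"     # ...followed by this four-byte marker


def is_dicom_file(path: Path) -> bool:
    """Return True if the file carries the DICOM Part 10 'DICM' marker."""
    try:
        with path.open("rb") as fh:
            header = fh.read(DICOM_PREAMBLE_LEN + len(DICOM_MAGIC))
    except OSError:
        # Unreadable files are treated as non-DICOM rather than aborting the scan.
        return False
    return header[DICOM_PREAMBLE_LEN:] == DICOM_MAGIC


def find_dicom_files(export_dir: Path) -> list[Path]:
    """Recursively collect files that look like DICOM, regardless of extension.

    Hypothetical helper: the directory layout of the real export is not
    described in the source.
    """
    return sorted(
        p for p in export_dir.rglob("*") if p.is_file() and is_dicom_file(p)
    )
```

The files found this way could then be handed to a DICOM library such as pydicom (for example via pydicom.dcmread) to read pixel data and series metadata. Whether the model in this case used pydicom or another package is not stated in the source.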
Expert Critique: Why LLMs Struggle with Medical Imaging
Medical professionals in the community have cautioned against trusting LLMs for primary image interpretation, citing fundamental architectural and data gaps.
Training Data Deficits
Radiologists note that the volume of public training data for medical images is minuscule compared to the thousands of scans a human radiologist reviews during residency.
"These models are generally terrible at reading medical images... There’s obviously a ton of medical images in general but very few, and even fewer along with a report are available on the internet publicly for download."
Spatial Recognition and Tokenization
Technical critiques suggest that the way LLMs perceive images (through tokenization) is ill-suited to the precise spatial recognition that radiology requires. According to these critiques, unless images are converted into a natively tokenized format (such as JSON) that accurately represents anatomical structures, the risk of hallucination remains high.
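To make the spatial-resolution concern concrete, here is a rough sketch of the arithmetic, assuming a ViT-style scheme in which an image is cut into fixed-size square patches and each patch becomes one token. This is an assumption for illustration only: the source does not say how any particular model tokenizes images, and the 16-pixel patch size and the function names patch_grid and patches_covering are hypothetical.

```python
def patch_grid(width: int, height: int, patch: int) -> tuple[int, int]:
    """Return (columns, rows) of patches needed to cover the image."""
    cols = -(-width // patch)   # ceiling division
    rows = -(-height // patch)
    return cols, rows


def patches_covering(x0: int, y0: int, x1: int, y1: int, patch: int) -> int:
    """Count the patches overlapped by an inclusive pixel box (x0, y0)-(x1, y1)."""
    cols = x1 // patch - x0 // patch + 1
    rows = y1 // patch - y0 // patch + 1
    return cols * rows


# Assumed values: a 512 x 512 slice and 16-pixel patches.
cols, rows = patch_grid(512, 512, 16)
print(cols * rows)  # 1024 patches for the whole slice

# A hypothetical 6 x 4 pixel finding located at (100, 100)-(105, 103)
print(patches_covering(100, 100, 105, 103, 16))  # 1: it falls inside a single patch
```

Under these assumptions, a finding only a few pixels across is summarized by one token among roughly a thousand, which is one way to read the critique that patch-level tokenization can blur fine anatomical detail.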
Clinical Context and the "Second Opinion" Dilemma
The case study underscores a broader tension in modern healthcare: the gap between the efficiency of AI and the accessibility of human experts.
The Human Factor
Users report that human doctors often have limited time (10–15 minutes per appointment), which leads patients to turn to AI for a "sympathetic" and exhaustive exploration of their symptoms. However, human diagnosis is not deterministic; it is a compound output of experience, equipment, and up-to-date medical knowledge.
The Utility of AI as a Synthesis Tool
While image interpretation is contested, several users found AI highly effective for:
- Translating Jargon: Converting complex medical reports into human-readable summaries.
- Researching Rare Conditions: Surfacing niche NIH studies that human practitioners may overlook.
- Creating Support Plans: Generating gym or mobility plans based on a diagnosis, which can then be validated by a physiotherapist.
Conclusion: The State of AI in Diagnostics
Currently, AI serves better as a "brainstorming tool" or a synthesis engine for text-based medical reports than as a replacement for radiological expertise. The risk of misdiagnosis remains high on both sides: humans can err due to fatigue or lack of specialization, and AI can hallucinate due to a lack of curated medical training data. The consensus among experts is that the most reliable path remains obtaining multiple opinions from certified human specialists.