Medical Student University of Pennsylvania Perelman School of Medicine Philadelphia, PA, US
Introduction: Rapid interpretation of CT head scans for intracranial hemorrhage (ICH) is important in prioritizing patients for neurosurgical intervention. AI-driven image analysis has rapidly developed in recent years, and most recently the large language model (LLM) Grok-2 was introduced on the social media platform X with claims of high accuracy in image interpretation. Here, we test the diagnostic accuracy of Grok-2 in supporting clinical evaluation of ICH.
Methods: Fifty patients were randomly sampled from the RSNA 2019 database (Flanders, 2020), comprising 25 normal and 25 hemorrhage cases. From each scan, a representative non-contrast, axial CT head image was selected to standardize inputs. Each slice was presented to Grok-2 with a zero-shot prompt asking whether ICH was present and, if so, what type of ICH was present. Sensitivity, specificity, positive predictive value (PPV), and negative predictive value (NPV) were then calculated.
Results: Grok-2 correctly identified ICH in 18 of the 25 hemorrhage scans and correctly ruled out ICH in 18 of the 25 normal scans (sensitivity = 0.72, specificity = 0.72, PPV = 0.72, NPV = 0.72). Among the subtypes of ICH, the greatest sensitivity was achieved for intraparenchymal hemorrhage (n = 14, sensitivity = 0.71, specificity = 0.83, PPV = 0.63, NPV = 0.88). Grok-2 was least successful at identifying intraventricular hemorrhage (n = 8), with no cases correctly detected.
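As a check on the overall figures, the four metrics can be recomputed from the 2x2 counts implied by the Results: 18 true positives and 7 false negatives among the 25 hemorrhage scans, and 18 true negatives and 7 false positives among the 25 normal scans. The following is a minimal sketch in Python; the function name `diagnostic_metrics` is hypothetical and is not part of the study's analysis code, which the abstract does not describe.

```python
def diagnostic_metrics(tp, fn, tn, fp):
    """Return standard diagnostic accuracy metrics from 2x2 counts."""
    return {
        "sensitivity": tp / (tp + fn),  # true positives / all hemorrhage scans
        "specificity": tn / (tn + fp),  # true negatives / all normal scans
        "ppv": tp / (tp + fp),          # true positives / all positive calls
        "npv": tn / (tn + fn),          # true negatives / all negative calls
    }


# Counts implied by the Results: 18/25 hemorrhage scans detected,
# 18/25 normal scans correctly ruled out.
overall = diagnostic_metrics(tp=18, fn=7, tn=18, fp=7)

for name, value in overall.items():
    print(f"{name}: {value:.2f}")
```

Because the sample is balanced and the numbers of false positives and false negatives are equal, all four values coincide at 0.72. In a population with a lower prevalence of ICH, PPV and NPV would differ from these values even if sensitivity and specificity were unchanged.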
Conclusion: Grok-2 demonstrates moderate capacity to identify the presence or absence of ICH. However, it cannot consistently discriminate between subtypes of ICH and misses a substantial share of ICH cases. Our results indicate early promise for LLM-based models in ICH detection, where they may be better suited to flagging possible ICH for review than to ruling it out.