Identifying Radiographic and Clinical Features of Untreated Vestibular Schwannoma: Comparative Analysis of Large Language Model Performance in Patient Records
Introduction: In vestibular schwannoma (VS), the decision to transition from the wait-and-scan (W&S) approach to active treatment relies on comprehensive longitudinal knowledge of tumor radiology and symptoms. Large language models (LLMs) may help mitigate this high information-processing burden, yet their individual and comparative performance is not well understood. We compared the ability of three LLMs (ChatGPT-4o, Gemini 1.5-pro, and Meta-Llama) to identify clinically useful features from the health records of patients with untreated VS.
Methods: A retrospective review of electronic health records from a single neurosurgical center identified adult patients diagnosed with VS and managed via the W&S approach. To establish ground truth, 200 randomly selected MRI reports and clinical notes were manually annotated by two independent adjudicators. Features extracted from MRI reports included maximum tumor dimension, location, brainstem involvement, enhancement, homogeneity, cystic texture, and presence of associated edema/hydrocephalus. Features derived from clinical notes included hearing impairment, tinnitus, facial/trigeminal nerve dysfunction, vestibular/cerebellar symptoms, symptom duration, and audiology results. An instructive prompt was developed and iteratively refined based on observed errors to optimize the performance of each LLM.
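To illustrate the extraction workflow, the minimal sketch below shows how a single MRI report might be submitted to one of the evaluated models (here ChatGPT-4o via the OpenAI Python SDK) with an instructive prompt requesting structured output. The study's actual prompt wording, output schema, and pipeline are not specified in this abstract; the field names and prompt text here are illustrative assumptions only.

```python
# Hypothetical sketch of per-report feature extraction; not the study's actual
# prompt or schema. Assumes OPENAI_API_KEY is set in the environment.
import json
from openai import OpenAI

client = OpenAI()

# Illustrative instructive prompt; the study's refined prompt is not published here.
EXTRACTION_PROMPT = """You are extracting radiographic features of a vestibular
schwannoma from an MRI report. Return JSON with these keys:
max_dimension_mm (number or null), location (string),
brainstem_compression (true/false), enhancement (true/false),
heterogeneous (true/false), cystic (true/false),
edema_or_hydrocephalus (true/false).

Report:
{report_text}"""

def extract_features(report_text: str) -> dict:
    """Send one MRI report to the model and parse the JSON reply."""
    response = client.chat.completions.create(
        model="gpt-4o",
        temperature=0,
        response_format={"type": "json_object"},
        messages=[{"role": "user",
                   "content": EXTRACTION_PROMPT.format(report_text=report_text)}],
    )
    return json.loads(response.choices[0].message.content)

# Example call with placeholder text standing in for a real report:
# features = extract_features("Right cerebellopontine angle mass measuring 18 mm ...")
```

Analogous calls to the Gemini and Llama APIs would follow the same pattern, with the extracted JSON compared against the manually annotated ground truth to compute accuracy, sensitivity, and specificity.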
Results: In total, 306 patients with 1181 MRI reports and 1939 clinical notes were identified. Mean age was 59.9 years (standard deviation [SD] = 14.7), a majority of patients were women (N = 164, 53.6%), and median follow-up duration was 43.2 months. Mean maximum tumor dimension was 16.7 mm (SD = 9.3); 44.8% of tumors involved the cerebellopontine angle, 22.9% caused brainstem compression, 21.9% were heterogeneous, 77.1% were enhancing, and 4.2% had associated edema. Hearing loss was present in 88% of patients, tinnitus in 51%, facial symptoms/signs in 20%, and vestibular/cerebellar dysfunction in 45%. All three LLMs achieved high (>0.85) accuracy, sensitivity, and specificity in extracting each radiographic and clinical feature. ChatGPT-4o displayed the best overall performance, exceeding 0.9 across all performance metrics for every extracted variable.
Conclusion: The LLMs tested extracted clinically useful information about VS from health records with high accuracy. LLMs may be integrated into clinical workflows for VS, serving as valuable adjuncts that collate longitudinal radiographic and clinical features to inform management decisions.