Evaluating the Baseline Capability of Large Language Models (LLMs) in Interpreting Imaging for Neurosurgical Applications

Presenting Author(s)

David M. Doyle, B.S.

Medical Student
Central Michigan University

Introduction: Large language models (LLMs) such as ChatGPT, first released in late 2022, have shown promise across medical applications, including board exam preparation, educational content creation, simulated clinical interactions, and decision support. While their utility in medical contexts is increasingly recognized, their potential for interpreting medical imaging remains underexplored. LLMs do not directly analyze images but apply pattern recognition and learned associations from extensive textual training to describe images, including CT and MRI scans. This study evaluates the baseline capability of LLMs in interpreting neurological imaging within neurosurgical contexts, with implications for educational applications and for assessing AI-driven accuracy in clinical workflows.

Methods: A curated database of 28 central nervous system pathology images with “diagnosis certain” status from Radiopaedia was created. Each image was uploaded to ChatGPT 4.0 and Claude 3.0, accompanied by standardized prompts. To evaluate response consistency, each model processed each image three times. The study used two questioning formats: individual question prompts and a sequential, conversational approach. ChatGPT’s memory was disabled to ensure independent responses for each prompt.
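
The trial design described above can be sketched in a few lines of Python. This is a minimal illustration, not the study's actual tooling: it assumes the three repetitions were run in each of the two questioning formats (the abstract does not state this explicitly), and the function names, record fields, and example labels are hypothetical.

```python
from collections import Counter
from itertools import product

# Values taken from the Methods; the crossing of formats with repeats is an assumption.
MODELS = ["ChatGPT 4.0", "Claude 3.0"]
FORMATS = ["individual", "conversational"]
N_IMAGES = 28
N_REPEATS = 3


def build_trials(n_images=N_IMAGES, models=MODELS, formats=FORMATS, n_repeats=N_REPEATS):
    """Enumerate one record per (image, model, format, repetition)."""
    return [
        {"image_id": image_id, "model": model, "format": fmt, "repeat": repeat}
        for image_id, model, fmt, repeat in product(
            range(1, n_images + 1), models, formats, range(1, n_repeats + 1)
        )
    ]


def consistency(answers):
    """Fraction of repeated runs that agree with the most common answer."""
    if not answers:
        raise ValueError("at least one answer is required")
    _, top_count = Counter(answers).most_common(1)[0]
    return top_count / len(answers)


if __name__ == "__main__":
    trials = build_trials()
    print(len(trials))  # 28 images x 2 models x 2 formats x 3 repeats
    # Hypothetical labels for one image/model/format across three runs.
    print(consistency(["meningioma", "meningioma", "glioma"]))
```

Under the stated assumption this yields 336 trial records; if the three repetitions were instead split across formats, the count would differ. The modal-agreement score is only one possible way to summarize response consistency across repeated runs.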

Results: ChatGPT 4.0 and Claude 3.0 demonstrated moderate accuracy in identifying common neurological pathologies, though complex cases posed challenges. Performance improved notably in the conversational format, in which each model could refine its interpretations based on its previous responses. However, both LLMs showed limited specificity and clinical contextualization. Overall, LLMs showed a foundational ability to recognize anatomical features and interpret pathology-related information.

Conclusion: This study underscores the potential of LLMs as educational tools in neurosurgical imaging interpretation, contributing to a baseline for future research on AI integration into medical practice. While limited by indirect image analysis, LLMs provide a cost-effective, accessible option for medical training and could ultimately support clinical workflows, particularly in resource-constrained settings. These findings highlight the need for ongoing improvements in AI accuracy and contextual understanding, laying groundwork for future multimodal AI applications that may serve as adjuncts to human expertise.