Medical Student, University of Rochester School of Medicine and Dentistry
Introduction: Large language models (LLMs) have demonstrated the ability to answer medical board examination questions effectively. However, their ability to answer image-based questions has not been examined. This study sought to evaluate the performance of two LLMs, GPT-4o and Google Gemini, on an image-based question bank designed for neurosurgery board examination preparation.
Methods: LLM accuracy was tested on 302 image-based questions from The Comprehensive Neurosurgery Board Preparation Book: Illustrated Questions and Answers. Each LLM was asked to answer every question independently and to provide an explanation for its chosen answer. The problem-solving order of each question (first- vs. second-order) and the quality of each LLM response were evaluated by senior neurological surgery residents who had passed the American Board of Neurological Surgery (ABNS) primary examination. Chi-squared tests and independent-samples t-tests were used to measure performance differences between the LLMs.
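For illustration, the between-model accuracy comparison reduces to a chi-squared test on a 2×2 table of correct/incorrect counts per model. The sketch below is a minimal reconstruction, not the authors' analysis code; the per-model correct counts (roughly 159/302 for GPT-4o and 113/302 for Gemini) are assumptions inferred from the accuracies reported in the Results.

```python
# Minimal sketch of the chi-squared comparison described in Methods.
# The counts below are assumptions inferred from the reported accuracies,
# not the study's raw data: ~159/302 correct (GPT-4o), ~113/302 (Gemini).
from scipy.stats import chi2_contingency

N = 302
gpt4o_correct, gemini_correct = 159, 113

# 2x2 contingency table: rows = model, columns = (correct, incorrect)
table = [
    [gpt4o_correct, N - gpt4o_correct],
    [gemini_correct, N - gemini_correct],
]

chi2, p, dof, expected = chi2_contingency(table)
print(f"chi2 = {chi2:.2f}, dof = {dof}, P = {p:.5f}")
# With these assumed counts, P lands near the reported overall value (P=.00023).
```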
Results: On the image-based question bank, GPT-4o and Gemini achieved accuracies of 52.64% (95% CI: 47.02-58.21%) and 37.42% (95% CI: 32.15-43.00%), respectively. GPT-4o significantly outperformed Gemini overall (P=.00023), particularly in pathology/histology (P=.034) and radiology (P=.0053). GPT-4o also performed better on second-order questions (58.02% vs. 40.12%, P=.0019) and received a higher average response quality rating (2.77 vs. 2.24, P=.0000006).
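The reported confidence intervals appear consistent with Wilson score intervals on the implied correct counts; the brief sketch below, again assuming counts of 159/302 and 113/302, closely reproduces them.

```python
# Sketch: 95% Wilson score intervals for the reported accuracies.
# The counts (159, 113) are assumptions inferred from the stated percentages.
from statsmodels.stats.proportion import proportion_confint

N = 302
for model, correct in [("GPT-4o", 159), ("Gemini", 113)]:
    lo, hi = proportion_confint(correct, N, alpha=0.05, method="wilson")
    print(f"{model}: {correct / N:.2%} (95% CI: {lo:.2%}-{hi:.2%})")
# Closely matches the abstract: 52.6% (47.02-58.21%) and 37.42% (32.15-43.00%).
```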
Conclusion: On a question bank of 302 image-based questions designed for neurosurgery board preparation, GPT-4o scored 52.64% and outperformed Gemini in both accuracy and response quality. Compared with previous studies of LLM performance on board-style questions, performance on image-based questions was lower, suggesting that LLMs may struggle with machine vision and medical image interpretation tasks.