Medical Student, Loma Linda University School of Medicine, Loma Linda, CA, US
Introduction: The American Society for Stereotactic and Functional Neurosurgery and the Congress of Neurological Surgeons (CNS) have published evidence-based guidelines to support clinical decision-making. This study aimed to evaluate the concordance and readability of generative AI model responses to questions and recommendations derived from the 2018 CNS guidelines on subthalamic nucleus and globus pallidus internus deep brain stimulation for the treatment of patients with Parkinson's disease.
Methods: Seven questions were generated from the CNS guidelines and posed to Perplexity, ChatGPT 4o, Microsoft Copilot, and Google Gemini. Answers were considered "concordant" if their overall conclusion/summary captured the major points of the CNS guidelines; otherwise, answers were considered "non-concordant." Non-concordant answers were further sub-categorized as either "insufficient" or "over-conclusive." AI responses were evaluated for readability using the Flesch-Kincaid Grade Level, Gunning Fog Index, SMOG (Simple Measure of Gobbledygook) Index, and Flesch Reading Ease test.
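For context, the readability indices listed above are all computed from sentence length, word length, and syllable counts. The Python sketch below illustrates the standard published formulas for these four indices; the function names and the vowel-group syllable heuristic are illustrative assumptions, and the study itself may have used dedicated readability software rather than this code.

```python
import re
from math import sqrt

def count_syllables(word: str) -> int:
    # Rough heuristic: count groups of consecutive vowels; drop a silent trailing 'e'.
    word = word.lower()
    n = len(re.findall(r"[aeiouy]+", word))
    if word.endswith("e") and n > 1:
        n -= 1
    return max(n, 1)

def readability_scores(text: str) -> dict:
    sentences = max(len(re.findall(r"[.!?]+", text)), 1)
    words = re.findall(r"[A-Za-z']+", text)
    n_words = max(len(words), 1)
    syllables = sum(count_syllables(w) for w in words)
    complex_words = sum(1 for w in words if count_syllables(w) >= 3)  # 3+ syllables

    wps = n_words / sentences   # words per sentence
    spw = syllables / n_words   # syllables per word

    return {
        # Standard published formulas for each index.
        "flesch_reading_ease": 206.835 - 1.015 * wps - 84.6 * spw,
        "flesch_kincaid_grade": 0.39 * wps + 11.8 * spw - 15.59,
        "gunning_fog": 0.4 * (wps + 100 * complex_words / n_words),
        "smog": 1.0430 * sqrt(complex_words * (30 / sentences)) + 3.1291,
    }

# Example: score one AI-generated answer stored as a plain string.
print(readability_scores("Deep brain stimulation of the subthalamic nucleus "
                         "improves motor fluctuations in appropriately selected patients."))
```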
Results: ChatGPT 4o displayed the highest concordance rate at 69.2%, with non-concordant responses further classified as 14.3% insufficient and 42.8% over-conclusive. Perplexity had a concordance rate of 28.6%, with 14.3% of responses insufficient and 57.1% over-conclusive. Copilot showed a 28.6% concordance rate, with 28.6% of responses insufficient and 42.8% over-conclusive. Gemini likewise demonstrated a 28.6% concordance rate, with 28.6% of responses insufficient and 42.8% over-conclusive. Flesch-Kincaid Grade Level scores ranged from 14.44 (Gemini) to 18.94 (Copilot), Gunning Fog Index scores from 17.9 (Gemini) to 22.06 (Copilot), and SMOG Index scores from 16.54 (Gemini) to 19.67 (Copilot). Flesch Reading Ease scores were low across all models, with Copilot having the lowest mean score (4.4) and Gemini the highest (30.91).
Conclusion: ChatGPT 4o emerged as the best-performing model in concordance, while the other models displayed similar concordance levels to one another. Copilot and Gemini produced the highest proportions of insufficient responses. All AI responses were highly complex and difficult to read. These findings suggest that AI can be a valuable adjunct in clinical decision-making but should not replace clinician judgment.