Medical Student, Loma Linda University School of Medicine, Loma Linda, CA, US
Introduction: The Congress of Neurological Surgeons (CNS) publishes evidence-based guidelines to support clinical decision-making. This study investigated the responses of generative AI models to questions and recommendations from the 2023 CNS guidelines for Chiari 1 malformation, as well as the readability of those responses.
Methods: Thirteen questions derived from the CNS guidelines were posed to Perplexity, ChatGPT 4o, Microsoft Copilot, and Google Gemini. Answers were considered “concordant” if the overall conclusion/summary highlighted the major points of the CNS guidelines; otherwise, answers were considered “non-concordant.” Non-concordant answers were further sub-categorized as either “insufficient” or “over-conclusive.” Additionally, AI responses were evaluated for readability using the Flesch-Kincaid Grade Level, Gunning Fog Index, SMOG (Simple Measure of Gobbledygook) Index, and Flesch Reading Ease test.
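For context, the four readability indices used here are computed from sentence length, syllable counts, and the share of polysyllabic words. The following is a minimal Python sketch of their published formulas; the vowel-group syllable counter is a simplifying assumption (the scoring tools actually used in the study rely on more sophisticated syllable estimation), so scores will differ slightly from reference implementations:

```python
import math
import re

def count_syllables(word: str) -> int:
    # Naive heuristic: count contiguous vowel groups.
    # Dictionary-based counters are more accurate.
    return max(1, len(re.findall(r"[aeiouy]+", word.lower())))

def readability(text: str) -> dict:
    sentences = max(1, len(re.findall(r"[.!?]+", text)))
    words = re.findall(r"[A-Za-z']+", text)
    n_words = max(1, len(words))
    syllable_counts = [count_syllables(w) for w in words]
    asl = n_words / sentences                      # average sentence length
    asw = sum(syllable_counts) / n_words           # avg. syllables per word
    complex_words = sum(1 for s in syllable_counts if s >= 3)
    return {
        "flesch_kincaid_grade": 0.39 * asl + 11.8 * asw - 15.59,
        "flesch_reading_ease": 206.835 - 1.015 * asl - 84.6 * asw,
        "gunning_fog": 0.4 * (asl + 100 * complex_words / n_words),
        "smog": 1.0430 * math.sqrt(complex_words * 30 / sentences) + 3.1291,
    }
```

Higher grade-level, Fog, and SMOG scores indicate harder text, while Flesch Reading Ease runs in the opposite direction (lower means harder), which is why the models' low Reading Ease scores and high grade-level scores in the Results point to the same conclusion.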
Results: Perplexity displayed the highest concordance rate at 69.2%, with its non-concordant responses classified as 0% insufficient and 30.8% over-conclusive. ChatGPT 4o had the lowest concordance rate at 23.1%, with 0% insufficient and 76.9% over-conclusive responses. Copilot showed a 61.5% concordance rate, with 7.7% insufficient and 30.8% over-conclusive responses. Gemini demonstrated a 30.8% concordance rate, with 7.7% insufficient and 61.5% over-conclusive responses. Flesch-Kincaid Grade Level scores ranged from 14.48 (Gemini) to 16.48 (Copilot), Gunning Fog Index scores varied between 16.18 (Gemini) and 18.8 (Copilot), SMOG Index scores ranged from 16 (Gemini) to 17.54 (Copilot), and Flesch Reading Ease scores were low across all models, with Gemini showing the highest mean score of 21.3.
Conclusion: Perplexity and Copilot emerged as the best-performing models in concordance, while ChatGPT and Gemini displayed the highest over-conclusive rates. Responses from all generative AI models were highly complex and difficult to read. These findings suggest that AI can be a valuable adjunct in decision-making but should not replace clinician judgment.