Medical Student, Hackensack Meridian School of Medicine
Introduction: Artificial intelligence (AI) chat models are increasingly used to address patient questions in healthcare, yet their reliability in medical contexts remains under-researched. This study evaluates the accuracy and consistency of responses from five popular AI models to frequently asked questions (FAQs) about scoliosis.
Methods: We selected five popular AI models (ChatGPT 4, ChatGPT 4o, ChatGPT o1, Perplexity, and Gemini) and 20 common scoliosis FAQs. Each question was submitted to each model in three independent sessions to avoid influence from prior queries, with each researcher submitting questions to a single model. Responses were blinded and graded by researchers who had not interacted with that model, using a 4-point scale: 4 (excellent), 3 (satisfactory, minimal clarification needed), 2 (satisfactory, moderate clarification needed), and 1 (unsatisfactory, substantial clarification or major errors). Using RStudio, a Friedman test compared average response scores across models, followed by post-hoc analysis to identify significant pairwise differences between models.
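The Friedman comparison described above can be sketched with a short Python function using only the standard library. The ratings and function names below are hypothetical illustrations, not the study's data or code (the study's analysis was run in RStudio, presumably via R's friedman.test):

```python
# Illustrative sketch of the Friedman test statistic for comparing k models
# rated on the same n questions. Hypothetical data; no tie correction applied.

def rank_with_ties(row):
    """Average ranks (1 = lowest score) for one question; ties share the mean rank."""
    order = sorted(range(len(row)), key=lambda i: row[i])
    ranks = [0.0] * len(row)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and row[order[j + 1]] == row[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1  # mean of 1-based positions i..j
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks

def friedman_statistic(blocks):
    """blocks: n rows (questions) x k columns (models) of ratings."""
    n, k = len(blocks), len(blocks[0])
    rank_sums = [0.0] * k
    for row in blocks:
        for j, r in enumerate(rank_with_ties(row)):
            rank_sums[j] += r
    # Chi-squared approximation with k - 1 degrees of freedom
    return 12.0 / (n * k * (k + 1)) * sum(R * R for R in rank_sums) - 3 * n * (k + 1)

# Hypothetical ratings: 4 questions x 3 models
scores = [[4, 3, 3], [4, 2, 3], [3, 3, 4], [4, 3, 2]]
print(round(friedman_statistic(scores), 3))  # → 2.625
```

The statistic is referred to a chi-squared distribution with k - 1 degrees of freedom; a significant result, as in this study, warrants pairwise post-hoc comparisons.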
Results: The average rating for each model was 3.42 (ChatGPT 4), 3.95 (ChatGPT 4o), 3.37 (ChatGPT o1), 3.35 (Perplexity), and 3.62 (Gemini). The Friedman test revealed significant differences in response quality among the five AI models (p < 0.0001). Post-hoc analysis showed that ChatGPT 4o consistently outperformed the other models: ChatGPT 4 (p = 0.01), Gemini (p = 0.04), ChatGPT o1 (p = 0.003), and Perplexity (p = 0.004).
Conclusion: This study highlights significant variability in the reliability of AI chat models for patient education on scoliosis. The superior performance of ChatGPT 4o, demonstrated in both overall ratings and pairwise post-hoc comparisons, suggests it is the most reliable option for generating accurate and comprehensive answers to scoliosis-related FAQs. These findings illustrate the potential for select AI models to contribute effectively to clinical patient education, while also highlighting the need for careful model selection and evaluation in healthcare.