Postdoc Researcher, Beth Israel Deaconess Medical Center
Introduction: Type II odontoid fractures pose a significant challenge due to their high risk of nonunion, which affects outcomes and treatment strategies. Predictive modeling with AI offers the potential to enhance clinical decision-making. This study evaluates the predictive performance of large language models (LLMs), which offer a more transparent decision-making process, alongside machine learning (ML), in the largest retrospective analysis of Type II odontoid fractures to predict fracture union.
Methods: A retrospective analysis (2015–2023) utilized clinical and radiological data from a U.S. academic center to train ML models (XGBoost, logistic regression) for predicting fracture union. Hyperparameter optimization employed nested cross-validation. Logistic regression coefficients informed a scoring system based on patient factors. GPT-4 was evaluated under two prompting strategies, ZSP and DSP, with DSP based on the logistic regression-derived coefficients; both were compared against the ML predictions using accuracy, sensitivity, specificity, and AUC-ROC.
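The nested cross-validation workflow described above can be sketched as follows. This is an illustrative example only: the feature set, hyperparameter grid, and fold counts are hypothetical stand-ins, since the study's actual variables and search space are not specified in the abstract.

```python
# Sketch of nested cross-validation for hyperparameter optimization:
# an inner grid search tunes the model, an outer loop estimates its
# generalization performance without optimistic bias.
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, cross_val_score

# Synthetic stand-in for the clinical/radiological dataset (n = 253)
X, y = make_classification(n_samples=253, n_features=8, random_state=0)

# Inner loop: grid search over a hypothetical regularization grid
inner = GridSearchCV(
    LogisticRegression(max_iter=1000),
    param_grid={"C": [0.01, 0.1, 1.0, 10.0]},
    cv=3,
    scoring="roc_auc",
)

# Outer loop: each fold refits the tuned model on unseen data
outer_scores = cross_val_score(inner, X, y, cv=5, scoring="roc_auc")
print(f"Nested CV AUC-ROC: {outer_scores.mean():.2f}")
```

The same pattern applies unchanged to XGBoost by swapping the estimator and grid; the outer scores, not the inner grid-search scores, are what should be reported.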
Results: A total of 253 patients were included, with a median age of 82.6 years; 52.4% were female. Radiological assessments were conducted at a median of 3.3 months. Predictive scoring identified significant predictors of union: surgical fixation (+3 points), advanced age and severe frailty (-3 points each), and moderate frailty, smoking, osteoporosis, and severe angulation (-2 points each). Scores >0 predicted union; scores ≤0 indicated nonunion. XGBoost achieved the best performance (precision: 84%, recall: 96%, AUC-ROC: 0.97). Logistic regression showed high recall (97%) and accuracy (87%) but lower precision (76%). GPT-4's DSP improved sensitivity (89%) and accuracy (84.3%) over ZSP, though specificity was moderate (54%).
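The reported scoring system can be expressed directly in code from the point values given above. The factor names below are descriptive labels chosen for illustration, not the study's exact variable definitions.

```python
# Point values taken verbatim from the Results: surgical fixation +3;
# advanced age and severe frailty -3 each; moderate frailty, smoking,
# osteoporosis, and severe angulation -2 each.
POINTS = {
    "surgical_fixation": +3,
    "advanced_age": -3,
    "severe_frailty": -3,
    "moderate_frailty": -2,
    "smoking": -2,
    "osteoporosis": -2,
    "severe_angulation": -2,
}

def predict_union(factors: dict) -> str:
    """Sum points for factors present; score > 0 predicts union, else nonunion."""
    score = sum(POINTS[f] for f, present in factors.items() if present)
    return "union" if score > 0 else "nonunion"

# Example: surgical fixation (+3) with moderate frailty (-2) gives +1
print(predict_union({"surgical_fixation": True, "moderate_frailty": True}))  # union
```

A non-operatively managed patient with any listed risk factor scores at or below zero, which matches the abstract's decision threshold of scores ≤0 indicating nonunion.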
Conclusion: ML continues to excel in predictive tasks, with XGBoost demonstrating superior performance due to its optimal balance of precision, recall, and high AUC-ROC. Although ZSP showed limited performance, DSP informed by ML-derived coefficients significantly improved performance metrics. LLMs hold promise for integrating unstructured data but require carefully tailored prompts to achieve higher performance. Combining ML with LLM capabilities could enhance transparency, creating more interpretable predictive tools for clinical applications compared with standalone ML models. Future research should focus on integrating ML-driven scoring systems into LLM frameworks.