Spine

Breaking Down Outcomes: Machine Learning vs Large Language Models for Predicting Discharge Outcomes in Type II Odontoid Fractures

Presenting Author(s)

Forough Yazdanian, M.D.

Postdoc Researcher
Beth Israel Deaconess Medical Center
Cambridge, MA, US

Introduction: Type II odontoid fractures pose management challenges due to patient variability and fracture complexity, which affect outcomes such as home discharge, a key indicator of recovery. While machine learning (ML) has been extensively used in predictive modeling, the role of large language models (LLMs) in clinical decision-making remains underexplored. This study compares the performance of ML models and an LLM in predicting home discharge.

Methods: We conducted a retrospective analysis of electronic health records (EHRs) spanning January 2015 to December 2023. Demographic, clinical, and radiological data were analyzed using ML models, including Logistic Regression and XGBoost, with hyperparameters optimized through nested cross-validation. The predictive performance of the LLM GPT-4 was evaluated using zero-shot prompting and domain-specific prompting, the latter informed by XGBoost-derived statistics.
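As an illustration of the nested cross-validation mentioned in the Methods, the following is a minimal sketch and not the authors' pipeline. It assumes scikit-learn, uses synthetic data in place of the EHR cohort, and uses Logistic Regression with a hypothetical grid over the regularization strength `C`; the fold counts and grid values are assumptions chosen only for the example.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, StratifiedKFold, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the cohort (assumption: 253 rows, 10 numeric features).
X, y = make_classification(
    n_samples=253, n_features=10, n_informative=5, random_state=0
)

# Inner folds select hyperparameters; outer folds estimate performance.
inner_cv = StratifiedKFold(n_splits=3, shuffle=True, random_state=0)
outer_cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=1)

# Scaling lives inside the pipeline so it is refit on each training fold.
pipeline = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))

# Hypothetical search grid; the abstract does not state which values were tried.
param_grid = {"logisticregression__C": [0.01, 0.1, 1.0, 10.0]}

# GridSearchCV tunes C on the inner folds of each outer training split.
search = GridSearchCV(pipeline, param_grid, cv=inner_cv, scoring="roc_auc")

# cross_val_score refits the whole search per outer fold, so the outer
# test data never influences hyperparameter selection.
outer_scores = cross_val_score(search, X, y, cv=outer_cv, scoring="roc_auc")

print(f"Outer-fold AUC-ROC: mean={outer_scores.mean():.3f}, "
      f"std={outer_scores.std():.3f}")
```

The same structure would apply to a gradient-boosted model: only the estimator and the parameter grid would change.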

Results: Among 253 patients (median age 82.6 years, 52.4% female), 92.5% were managed conservatively, while 7.5% underwent surgical intervention. Radiological assessments were conducted at a median of 3.3 months. For home discharge prediction, the XGBoost model demonstrated the best performance among the ML models, achieving an accuracy of 77%, a precision of 64%, and an AUC-ROC of 0.82. Logistic Regression showed comparable results, with an accuracy of 76% and a precision of 62%. The LLM achieved higher accuracy with domain-specific prompting based on XGBoost-derived statistics (accuracy: 68%, precision: 66%) than with zero-shot prompting (accuracy: 49.8%, precision: 67%), while precision was essentially unchanged.
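To make the reported metrics concrete, the sketch below shows how accuracy and precision follow from confusion-matrix counts, and why precision can stay high while accuracy is close to 50%. The counts are hypothetical and chosen only for illustration; they are not taken from the study.

```python
def accuracy(tp: int, fp: int, tn: int, fn: int) -> float:
    """Fraction of all predictions that are correct."""
    total = tp + fp + tn + fn
    if total == 0:
        raise ValueError("no predictions to score")
    return (tp + tn) / total


def precision(tp: int, fp: int) -> float:
    """Fraction of predicted positives (home discharge) that are correct."""
    if tp + fp == 0:
        raise ValueError("no positive predictions to score")
    return tp / (tp + fp)


# Hypothetical counts (NOT study data): the model rarely predicts
# "home discharge", so it misses many true cases (high fn), yet most
# of the positives it does predict are right.
tp, fp, tn, fn = 20, 10, 30, 40

acc = accuracy(tp, fp, tn, fn)
prec = precision(tp, fp)

print(f"accuracy={acc:.2f}")   # accuracy=0.50
print(f"precision={prec:.2f}")  # precision=0.67
```

Accuracy counts errors of both kinds, while precision ignores false negatives, which is why the two metrics can diverge.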

Conclusion: These findings suggest that ML, particularly XGBoost, demonstrates superior performance in predicting home discharge. Domain-specific prompting (DSP) nevertheless improves LLM performance and shows promise for integration into clinical workflows, as LLMs offer a more transparent and interpretable process than the black-box nature of ML. Future research should explore integrating ML-derived statistics into LLM frameworks to further enhance predictive power and clinical utility.