Postdoctoral Researcher, Beth Israel Deaconess Medical Center, Cambridge, MA, US
Introduction: Type II odontoid fractures represent a significant clinical challenge, with the modified Rankin Scale (mRS) serving as a critical metric of functional recovery. This study leverages the largest retrospective dataset on Type II odontoid fractures to explore predictive modeling approaches. Recognizing the inherent limitations of the black-box nature of traditional machine learning (ML) models, we investigate the potential of Large Language Models (LLMs) for outcome prediction and compare their predictive performance with that of ML models.
Methods: A retrospective analysis of electronic health records (2015–2023) evaluated ML models, including XGBoost with hyperparameters optimized through nested cross-validation, alongside GPT-4-based LLMs for predicting dichotomized mRS outcomes (mRS ≤2 vs. >2). LLMs were tested with Zero-Shot Prompting (ZSP) and Many-Shot Prompting (MSP), the latter using 72 examples per category. Performance metrics, including AUC-ROC, were used to compare the predictive efficacy of the ML and LLM approaches.
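For readers interested in the modeling setup, the following is a minimal sketch of nested cross-validation for XGBoost hyperparameter tuning. It assumes a preprocessed feature matrix X and binary labels y (mRS ≤2 vs. >2); the parameter grid and fold counts are illustrative placeholders, not the study's actual search space.

```python
# Minimal sketch of nested cross-validation for XGBoost hyperparameter tuning.
# X and y are assumed to be a preprocessed feature matrix and binary labels;
# the grid below is illustrative, not the study's actual search space.
from sklearn.model_selection import GridSearchCV, StratifiedKFold, cross_val_score
from xgboost import XGBClassifier

def nested_cv_auc(X, y, random_state=42):
    # Inner loop: grid search selects hyperparameters on each training fold.
    param_grid = {
        "max_depth": [3, 5, 7],
        "learning_rate": [0.01, 0.1],
        "n_estimators": [100, 300],
    }
    inner_cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=random_state)
    outer_cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=random_state)

    tuned_model = GridSearchCV(
        estimator=XGBClassifier(eval_metric="logloss"),
        param_grid=param_grid,
        scoring="roc_auc",
        cv=inner_cv,
    )
    # Outer loop: unbiased estimate of AUC-ROC for the tuned model.
    scores = cross_val_score(tuned_model, X, y, scoring="roc_auc", cv=outer_cv)
    return scores.mean(), scores.std()
```

Similarly, a many-shot prompt can be assembled by prepending labeled example cases of each outcome class before the query case. The sketch below is a generic construction; the field names and prompt wording are hypothetical and do not reflect the study's actual template.

```python
# Minimal sketch of building a many-shot prompt from labeled cases.
# The "summary" field and the prompt wording are hypothetical placeholders.
def build_msp_prompt(good_examples, poor_examples, new_case, n_per_class=72):
    """Prepend n_per_class labeled examples of each outcome before the query case."""
    lines = [
        "You are predicting functional outcome (mRS <=2 = good, mRS >2 = poor) "
        "after a Type II odontoid fracture."
    ]
    for label, cases in (("good", good_examples), ("poor", poor_examples)):
        for case in cases[:n_per_class]:
            lines.append(f"Case: {case['summary']} -> Outcome: {label}")
    lines.append(f"Case: {new_case['summary']} -> Outcome:")
    return "\n".join(lines)
```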
Results: XGBoost outperformed the other models, achieving an AUC-ROC of 0.81, accuracy of 83.3%, sensitivity of 76%, and specificity of 66%, reflecting strong, balanced performance. In comparison, MSP was competitive, with an accuracy of 76.3%, an AUC-ROC of 0.76, and superior sensitivity of 87%, though its specificity remained moderate at 66%. ZSP underperformed, with an AUC-ROC of 0.48.
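For reference, the reported metrics can be derived from held-out predictions as in the sketch below; y_true, y_pred, and y_prob are assumed arrays of true labels, predicted labels, and predicted probabilities, not the study's data.

```python
# Minimal sketch of computing accuracy, sensitivity, specificity, and AUC-ROC
# from held-out predictions; inputs are placeholder arrays, not the study's data.
from sklearn.metrics import accuracy_score, confusion_matrix, roc_auc_score

def summarize(y_true, y_pred, y_prob):
    tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
    return {
        "accuracy": accuracy_score(y_true, y_pred),
        "sensitivity": tp / (tp + fn),  # recall for the positive class
        "specificity": tn / (tn + fp),
        "auc_roc": roc_auc_score(y_true, y_prob),
    }
```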
Conclusion: XGBoost demonstrated balanced performance, outperforming both LLM-based approaches. MSP showed promise with high sensitivity and clearly outperformed ZSP, but its moderate specificity leaves room for improvement and it fell short of ML overall. Nonetheless, LLMs with carefully engineered prompts show meaningful predictive potential, can integrate unstructured data, and offer a distinct advantage in clinical applications by providing interpretable reasoning behind their predictions.