Battle of AI: Comparative Analysis of Large Language Models and Machine Learning for Predicting DBS Motor Outcomes from Preoperative Cognitive Profiles
Postdoc Researcher, Beth Israel Deaconess Medical Center, Cambridge, MA, US
Introduction: Motor outcomes following Deep Brain Stimulation (DBS) in Parkinson's Disease (PD) vary widely, and preoperative neurocognitive predictors remain underexplored, which makes individualized response prediction challenging. This study compares Large Language Models (LLMs) and Machine Learning (ML) for predicting DBS motor response.
Methods: A retrospective analysis (1990–2023) of electronic records from a U.S. academic center incorporated clinical and cognitive data across multiple domains to train ML models, including XGBoost and multivariate Logistic Regression. Hyperparameter tuning was conducted using leave-one-out cross-validation nested within repeated 10-fold cross-validation. The Mistral LLM was evaluated using ZSP and DSP, with the DSP prompts tailored using logistic regression coefficients. Models predicted a strong response, defined as ≥40% motor improvement in UPDRS Part III scores within one year post-DBS. ML models and LLM prompting strategies were compared on accuracy, sensitivity, specificity, and AUC-ROC.
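The outcome definition and three of the reported metrics can be illustrated with a minimal Python sketch. It assumes that percent improvement is computed as the relative reduction from the baseline UPDRS Part III score, which the abstract does not state explicitly. The function names are hypothetical, and AUC-ROC is omitted because it requires predicted probabilities rather than class labels.

```python
from typing import Dict, Sequence

# Threshold taken from the abstract: >=40% motor improvement counts as a strong response.
STRONG_RESPONSE_THRESHOLD = 40.0


def percent_improvement(baseline: float, follow_up: float) -> float:
    """Relative reduction in UPDRS Part III score, in percent (assumed formula)."""
    if baseline <= 0:
        raise ValueError("baseline score must be positive")
    return 100.0 * (baseline - follow_up) / baseline


def is_strong_responder(baseline: float, follow_up: float) -> bool:
    """Binary outcome label used as the prediction target."""
    return percent_improvement(baseline, follow_up) >= STRONG_RESPONSE_THRESHOLD


def classification_metrics(y_true: Sequence[int], y_pred: Sequence[int]) -> Dict[str, float]:
    """Accuracy, sensitivity, and specificity from binary labels (1 = strong responder)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    total = tp + tn + fp + fn
    return {
        "accuracy": (tp + tn) / total if total else float("nan"),
        "sensitivity": tp / (tp + fn) if (tp + fn) else float("nan"),
        "specificity": tn / (tn + fp) if (tn + fp) else float("nan"),
    }
```

Under this assumed formula, a patient whose score falls from 50 to 30 improves by exactly 40% and is labeled a strong responder.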
Results: Among 119 patients (mean age: 64.3 years, 70% male), 44% were strong responders. UPDRS III scores (off-medication) improved from 31.6 ± 14.55 to 28.84 ± 9.63. Feature importance analysis identified key predictors of motor improvement, including younger age at DBS surgery, shorter PD duration, and lower baseline UPDRS III scores. Conversely, poorer performance in categorical fluency and delayed recall, along with lower Oral Trail A scores, was associated with weaker outcomes (all p < 0.01).
XGBoost achieved the highest performance (AUC-ROC: 87%, accuracy: 89%, sensitivity: 84%, specificity: 82%), followed by Logistic Regression (AUC-ROC: 79%, accuracy: 84%, sensitivity: 81%, specificity: 78%). DSP, based on Logistic Regression coefficients, achieved 77% accuracy, outperforming ZSP (65%) across all metrics, including sensitivity (73% vs. 51%) and specificity (72.5% vs. 65%).
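One way a prompt might be tailored from logistic regression coefficients is sketched below. This is an illustration under stated assumptions, not the study's actual prompt: the function name, feature names, coefficient values, and patient values are all hypothetical placeholders. Only the coefficient signs loosely follow the directions reported above.

```python
from typing import Mapping

# Hypothetical placeholder coefficients, NOT values from the study.
EXAMPLE_COEFFICIENTS = {
    "age_at_surgery_years": -0.40,
    "pd_duration_years": -0.30,
    "baseline_updrs_iii": -0.20,
    "delayed_recall_score": 0.50,
}

# Hypothetical patient profile, for illustration only.
EXAMPLE_PATIENT = {
    "age_at_surgery_years": 61,
    "pd_duration_years": 8,
    "baseline_updrs_iii": 30,
    "delayed_recall_score": 9,
}


def build_tailored_prompt(coefficients: Mapping[str, float],
                          patient: Mapping[str, float]) -> str:
    """Turn model coefficients into plain-language guidance for an LLM prompt."""
    # List the most influential predictors first (largest absolute coefficient).
    ranked = sorted(coefficients.items(), key=lambda item: abs(item[1]), reverse=True)
    guidance = []
    for name, coef in ranked:
        effect = "raise" if coef > 0 else "lower"
        guidance.append(
            f"- {name}: higher values {effect} the odds of a strong response "
            f"(coefficient {coef:+.2f})"
        )
    profile = [f"- {name}: {patient[name]}" for name, _ in ranked]
    return "\n".join(
        ["Predictor guidance derived from a logistic regression model:"]
        + guidance
        + ["", "Patient profile:"]
        + profile
        + ["", "Will this patient show >=40% UPDRS III improvement within one year "
               "of DBS? Answer 'strong' or 'not strong'."]
    )


if __name__ == "__main__":
    print(build_tailored_prompt(EXAMPLE_COEFFICIENTS, EXAMPLE_PATIENT))
```

Ranking predictors by absolute coefficient is a design choice made for this sketch. The abstract does not describe how the coefficients were encoded in the prompt.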
Conclusion: Preoperative cognitive performance in delayed recall, attention, and verbal fluency predicts motor outcomes. ML continues to outperform LLMs on structured data, with balanced predictive performance. Tailored LLM prompts based on ML-derived coefficients improved substantially over ZSP (77% vs. 65% accuracy), highlighting their potential in leveraging unstructured data. Additionally, LLMs offer interpretable reasoning, an advantage over black-box ML that enhances clinical applicability.