Large Language Models Can Accurately Reason to Predict Disposition Status Following Anterior Lumbar Interbody Fusion Surgery

Presenting Author(s)

Dhiraj Pangal, MD

Resident
Department of Neurosurgery, Stanford Medicine
Los Angeles, California, United States

Introduction: Large Language Models (LLMs) have opened new avenues for predictive analysis and workflow streamlining in healthcare. At large academic centers, finding a safe disposition for patients with many comorbidities who undergo major spine surgery is a significant source of discharge delay. Our study explores the use of LLMs to simulate case management staff and reason through a patient’s likely discharge disposition using text from the electronic medical record (EMR).

Methods: The Stanford Research Repository (STARR) was queried to extract clinical notes (operative reports, case management assessments, and patient history) and outcome variables (disposition, length of stay (LOS), 90-day complications, reoperation, and risk of extended hospitalization) for patients undergoing anterior lumbar interbody fusion (ALIF) procedures. Three evaluation settings were compared: 1) unspecialized “out-of-the-box” LLMs, such as LLaMa3 and Stanford’s secure GPT-4o, tasked with classifying and predicting outcomes, 2) the same models provided with rubric-based prompts developed by neurosurgeons, and 3) fine-tuned models trained on ALIF patients’ notes and outcomes. In addition, SHapley Additive exPlanations (SHAP) were used to identify the key concepts and phrases that contributed to each outcome prediction, providing insight into the model’s “thought process” for decision-making.
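
As an illustration of how the first two evaluation settings differ, the sketch below builds an unguided prompt and a rubric-guided prompt for the same note, then maps a model response to one of the three disposition labels. It is a minimal sketch under stated assumptions: the function names, the prompt wording, and the rubric items are hypothetical placeholders and are not the prompts or rubric used in the study, and no model is actually called.

```python
# Illustrative sketch only. The prompt wording and rubric items are
# hypothetical placeholders, not the study's actual prompts or rubric.
from typing import Optional, Sequence

# Disposition labels named in the abstract.
DISPOSITIONS = ("home", "skilled nursing", "acute rehab")

# Hypothetical rubric items, invented purely for illustration.
EXAMPLE_RUBRIC = (
    "Preoperative mobility and use of assistive devices",
    "Support available at home after discharge",
    "Comorbidities likely to prolong recovery",
)


def build_prompt(note: str, rubric: Optional[Sequence[str]] = None) -> str:
    """Return an unguided prompt, or a rubric-guided one if a rubric is given."""
    lines = [
        "You are assisting with discharge planning after ALIF surgery.",
        "Predict the discharge disposition as one of: "
        + ", ".join(DISPOSITIONS) + ".",
    ]
    if rubric:
        lines.append("Weigh each of the following criteria before answering:")
        lines.extend(f"{i}. {item}" for i, item in enumerate(rubric, start=1))
    lines.append("Clinical note:")
    lines.append(note)
    lines.append("Answer with the disposition on the final line.")
    return "\n".join(lines)


def parse_disposition(response: str) -> Optional[str]:
    """Map the last non-empty line of a response to a known label, if any."""
    non_empty = [ln.strip().lower() for ln in response.splitlines() if ln.strip()]
    if not non_empty:
        return None
    final = non_empty[-1]
    for label in DISPOSITIONS:
        if label in final:
            return label
    return None  # Unparseable responses are left unlabeled, not guessed.
```

Calling `build_prompt(note)` corresponds to the unguided setting and `build_prompt(note, EXAMPLE_RUBRIC)` to the rubric-guided setting. Keeping the note and label set identical across both means any difference in accuracy can be attributed to the rubric text alone.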

Results: In preliminary evaluations across twenty-five patients, out-of-the-box models performed near chance on classification tasks such as predicting discharge location (home, skilled nursing, or acute rehab): disposition was predicted correctly 28% (LLaMa3) and 34% (GPT-4o) of the time. When provided with the rubric-based, parameterized prompts, both models improved markedly (+30% for LLaMa3, +10% for GPT-4o) and produced reasoning that reflected an understanding of the provided decision-making parameters and context-specific features.
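
To make the comparison against chance concrete, the following sketch computes classification accuracy alongside two simple reference baselines for a three-class task: uniform guessing (1/3, roughly 33%) and always predicting the most frequent class. The labels are synthetic toy data, not study data, and the function names are hypothetical; whether the study used a uniform or a majority-class notion of chance is not stated in the abstract.

```python
# Illustrative sketch only. The labels below are synthetic toy data,
# not patient data or results from the study.
from collections import Counter
from typing import Sequence


def accuracy(predicted: Sequence[str], actual: Sequence[str]) -> float:
    """Fraction of predictions that exactly match the true labels."""
    if len(predicted) != len(actual) or not actual:
        raise ValueError("inputs must be non-empty and of equal length")
    correct = sum(p == a for p, a in zip(predicted, actual))
    return correct / len(actual)


def uniform_chance(n_classes: int) -> float:
    """Expected accuracy of guessing uniformly at random among n classes."""
    return 1.0 / n_classes


def majority_baseline(actual: Sequence[str]) -> float:
    """Accuracy of always predicting the most frequent true label."""
    return Counter(actual).most_common(1)[0][1] / len(actual)


# Synthetic example with the three disposition classes from the abstract.
TOY_ACTUAL = ["home", "home", "skilled nursing", "acute rehab", "home", "skilled nursing"]
TOY_PREDICTED = ["home", "acute rehab", "skilled nursing", "home", "home", "home"]

if __name__ == "__main__":
    print(f"accuracy:          {accuracy(TOY_PREDICTED, TOY_ACTUAL):.2f}")
    print(f"uniform chance:    {uniform_chance(3):.2f}")
    print(f"majority baseline: {majority_baseline(TOY_ACTUAL):.2f}")
```

Reporting the majority-class baseline next to uniform chance is useful when dispositions are imbalanced, since a model can exceed 33% simply by favoring the most common discharge location.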

Conclusion: This study explores methods for improving LLM prediction of critical postoperative outcomes in spine surgery. With domain-specific guidance, including fine-tuning and contextual prompting, our approach shows marked improvements in preliminary evaluations, which may support more precise and trustworthy AI-assisted medical tools.