Medical Student, Northwestern University, Chicago, IL, US
Introduction: One area of interest for AI applications has been the automation of surgical CPT coding. Several supervised NLP algorithms have been developed for this task, though they require substantial training data and computational expertise to optimize. The rise of large language models (LLMs) like OpenAI’s GPT has prompted exploration of many use cases. A developing area of focus in the literature is how well LLMs perform CPT coding tasks out of the box. The objective of this study was to assess how well GPT can generate CPT billing codes from operative report text across prompting paradigms.
Methods: We extracted operative reports from lumbar fusions and decompressions performed at our institution in 2022 for a preoperative diagnosis of lumbar spondylolisthesis. We also extracted the CPT codes attached to each procedure from the billing database. We deployed GPT-4 on the reports using (1) a basic prompt and (2) an explanation prompt that included a list of all possible CPT codes and their descriptions from the AAPC Codify database. We assessed overall coding accuracy on a report-wise and code-wise basis for both prompts, and stratified performance across clinically relevant code subgroups (procedure, additional levels, hardware, revision). Standard statistical measures were used.
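The report-wise and code-wise metrics described above can be sketched as follows. This is an illustrative Python sketch, not the study's actual code; the function name, data layout, and CPT values are hypothetical.

```python
# Illustrative sketch of the evaluation metrics: code-wise accuracy
# (fraction of billed CPT codes the model produced), report-wise
# exact-match rate, and mean extraneous codes per report.
# `predicted` and `billed` are per-report lists of CPT code strings.

def score_reports(predicted, billed):
    """Return (code-wise accuracy, report-wise exact-match rate,
    mean extraneous codes per report)."""
    total_billed = sum(len(b) for b in billed)
    # Codes the model produced that match the billed codes, per report
    matched = sum(len(set(p) & set(b)) for p, b in zip(predicted, billed))
    # Reports where the predicted code set exactly equals the billed set
    exact = sum(set(p) == set(b) for p, b in zip(predicted, billed))
    # Codes the model produced that were not billed (extraneous)
    extraneous = sum(len(set(p) - set(b)) for p, b in zip(predicted, billed))
    n = len(billed)
    return matched / total_billed, exact / n, extraneous / n

# Two hypothetical reports: the first has one extraneous code.
pred = [["22612", "22614", "20936"], ["63047"]]
true = [["22612", "22614"], ["63047"]]
codewise, reportwise, extra = score_reports(pred, true)
# codewise = 3/3 = 1.0, reportwise = 1/2 = 0.5, extra = 1/2 = 0.5
```

Under this scoring, a model can achieve high code-wise accuracy while still failing report-wise, since a single missing or extra code breaks the exact match, which mirrors the gap between the two metrics reported in the Results.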
Results: We included 88 operative reports with a total of 381 CPT codes. With a basic prompt, GPT produced 56.96% of CPT codes, though it produced an exact CPT match for only 26% of reports. By subgroup, GPT performed best on revision codes (77.8% of CPT codes, 75.0% of reports) and worst on additional level codes (45.3% of codes, 31.4% of reports). For the average report, GPT produced 2.74 extraneous codes. Prompt engineering significantly improved code-wise performance (62.2%, p = 0.013) but did not significantly change report-wise performance (24%, p = 0.683). Prompt engineering reduced extraneous coding (mean 1.53 per report, p < 0.001).
Conclusion: GPT failed to generate accurate coding lists from operative reports, even with attempts at prompt engineering. Supervised NLP algorithms may currently offer stronger, state-of-the-art performance. Further work and exploration are needed to optimize LLMs for this task.