Medical Student, Columbia University Irving Medical Center
Introduction: Systematic literature reviews are time-intensive and resource-consuming. While Large Language Models (LLMs) show promise for automating text analysis, their ability to perform the complex screening and extraction required for systematic reviews in neurosurgery remains unexplored. In this study, we investigate an LLM's performance in selecting relevant articles and extracting data for a systematic review.
Methods: NeuroGPT, built on GPT-4o with chain-of-thought prompting and a multi-agent structure, was compared to three human reviewers on article inclusion/exclusion decisions. The NeuroGPT review involved two steps. First, it assessed 324 articles against the provided inclusion and exclusion criteria, classifying each as “Include,” “Exclude,” or “Needs full text”; 81 articles fell into the “Include” or “Needs full text” categories. In the second step, NeuroGPT evaluated the full texts of these 81 articles, categorizing each as “Include” or “Exclude.” NeuroGPT was also asked to extract qualitative data from the full texts (e.g., study design, main outcomes). To validate NeuroGPT, a base GPT-4o model was given the same task, and the two models' performance was compared with a two-tailed z-test for proportions.
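The abstract does not describe NeuroGPT's prompts or agent orchestration. The sketch below is a minimal, hypothetical illustration of what such a two-step chain-of-thought screening pipeline could look like using the OpenAI Python SDK; the criteria text, prompt wording, and function names are assumptions for illustration, not the authors' actual implementation.

```python
# Hypothetical sketch of a two-step LLM screening pipeline. Prompts,
# function names, and orchestration are illustrative assumptions only,
# not the authors' NeuroGPT implementation.
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

CRITERIA = """Inclusion criteria: ... (study-specific)
Exclusion criteria: ... (study-specific)"""

def screen_abstract(title: str, abstract: str) -> str:
    """Step 1: classify a title/abstract as Include, Exclude, or Needs full text."""
    response = client.chat.completions.create(
        model="gpt-4o",
        temperature=0,
        messages=[
            {"role": "system", "content": (
                "You are a systematic-review screener. Reason step by step "
                "against the criteria, then end with exactly one label: "
                "Include, Exclude, or Needs full text."
            )},
            {"role": "user", "content": f"{CRITERIA}\n\nTitle: {title}\nAbstract: {abstract}"},
        ],
    )
    return response.choices[0].message.content.strip()

def screen_full_text(full_text: str) -> str:
    """Step 2: make a final Include/Exclude decision on the full text."""
    response = client.chat.completions.create(
        model="gpt-4o",
        temperature=0,
        messages=[
            {"role": "system", "content": (
                "You are a second-stage reviewer. Reason step by step against "
                "the criteria, then end with exactly one label: Include or Exclude."
            )},
            {"role": "user", "content": f"{CRITERIA}\n\nFull text:\n{full_text}"},
        ],
    )
    return response.choices[0].message.content.strip()
```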
Results: NeuroGPT achieved 98.5% accuracy (95% CI: 96.5%-99.3%) in article classification relative to the human reviewers, with 0% false negatives and 1.5% false positives. The model also extracted the requested data with 100% accuracy (95% CI: 98.6%-100%). Base GPT-4o achieved 82% accuracy, with 11% false positives and 7% false negatives. Comparing the base model to NeuroGPT yielded z = 7.2737, P < .00001.
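For readers who wish to check the comparison, a two-tailed two-proportion z-test can be reproduced in a few lines. The per-group sample size (assumed here to be the 324 screened articles) and the pooled standard error are assumptions; the abstract's exact z = 7.2737 will differ slightly depending on the underlying counts and standard-error formulation, which the abstract does not specify.

```python
# Two-tailed two-proportion z-test comparing classification accuracy.
# n = 324 per group and the pooled standard error are assumptions; the
# abstract does not state the exact counts behind z = 7.2737.
from math import erf, sqrt

n1 = n2 = 324                  # assumed: all screened articles per model
p1, p2 = 0.985, 0.82           # NeuroGPT vs. base GPT-4o accuracy

p_pool = (p1 * n1 + p2 * n2) / (n1 + n2)              # pooled proportion
se = sqrt(p_pool * (1 - p_pool) * (1 / n1 + 1 / n2))  # pooled standard error
z = (p1 - p2) / se
p_value = 2 * (1 - 0.5 * (1 + erf(abs(z) / sqrt(2))))  # two-tailed P via normal CDF

print(f"z = {z:.2f}, P = {p_value:.1e}")  # z is approximately 7.1 under these assumptions
```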
Conclusion: NeuroGPT accurately replicated human reviewers in article selection and data extraction for a neurosurgery systematic review while significantly outperforming a base GPT-4o model. This performance suggests that LLMs, and NeuroGPT specifically, can save time and resources by streamlining the systematic review process. Research on additional subtopics and specialties is necessary to understand the generalizability and limitations of this tool. Future directions for the model include identifying discrepancies between papers and drawing meaningful insights from the extracted data.