Data Scientist NYU Neurosurgery New York, United States
Introduction: Gliomas exhibit heterogeneous pathology that influences prognosis and treatment strategies. Molecular markers, particularly IDH mutation and ATRX protein status, are essential for classifying glioma subtypes. While extracting these markers traditionally requires time-consuming manual review, Natural Language Processing (NLP) could automate this process. We evaluated regular expressions (regex) and term frequency-inverse document frequency (tf-idf) methods for extracting molecular markers from digital pathology reports. This automation has the potential to significantly improve clinical workflow efficiency and reduce human error in molecular classification.
Methods: We queried surgical pathology reports from NYU Langone’s Epic EHR system to build a dataset of 350 glioma patients. We supplemented this dataset with 100 additional glioma patients from UMichigan for external validation. The regex approach searched the report text directly for phrases indicating molecular status, such as “retained/loss” and “positive/negative.” The tf-idf approach converted each report into a context-free tf-idf vector, which logistic regression models then used as input for binary classification.
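The regex approach can be illustrated with a minimal sketch. The patterns, the 60-character window, and the function name `extract_status` below are hypothetical choices made for illustration; they are not the expressions used in the study, and real reports would likely need additional phrasing variants.

```python
import re
from typing import Optional

# Hypothetical patterns (assumptions, not the study's actual expressions):
# the marker name, then up to 60 characters within the same sentence,
# then one of the status words named in the Methods.
PATTERNS = {
    "IDH": re.compile(
        r"\bIDH-?[12]?\b[^.]{0,60}?\b(positive|negative)\b", re.IGNORECASE
    ),
    "ATRX": re.compile(
        r"\bATRX\b[^.]{0,60}?\b(retained|loss)\b", re.IGNORECASE
    ),
}


def extract_status(report: str, marker: str) -> Optional[str]:
    """Return the first status word found near the marker, or None."""
    match = PATTERNS[marker].search(report)
    return match.group(1).lower() if match else None


report = (
    "Immunohistochemistry: IDH1 (R132H) is negative. "
    "ATRX shows retained nuclear expression."
)
print(extract_status(report, "IDH"))
print(extract_status(report, "ATRX"))
```

Restricting the match to a single sentence (no period between marker and status word) is one simple way to avoid pairing a marker with a status word that belongs to a different stain; it is a design assumption of this sketch rather than a documented feature of the original method.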
Results: Both methods effectively extracted IDH and ATRX status. For IDH extraction, tf-idf achieved 97.88% accuracy (F1-score: 0.975) on the NYU dataset and 97.87% accuracy (F1-score: 0.986) on the UMichigan dataset. Regex outperformed tf-idf, with 100% accuracy on the NYU dataset and 98.08% accuracy (F1-score: 0.981) on the UMichigan dataset. For ATRX extraction, tf-idf achieved 96.43% accuracy (F1-score: 0.923) on the NYU dataset and 98.52% accuracy (F1-score: 0.982) on the UMichigan dataset. Regex again performed better, with 100% accuracy on the NYU dataset and 98.9% accuracy (F1-score: 0.989) on the UMichigan dataset. Regex processed 350 NYU reports in seconds, while tf-idf processed 170 reports (70 NYU, 100 UMichigan) in about one minute; both were substantially faster than manual review.
Conclusion: NLP techniques effectively automate molecular marker extraction from glioma pathology reports with high accuracy. While regex achieved near-perfect accuracy, tf-idf may be more generalizable across linguistically diverse datasets. Both methods could streamline molecular classification in neuro-oncology, reducing review time and errors. Future research will focus on refining these methods, integrating them into EHR systems, and validating them across larger, multi-institutional datasets.