Development and External Validation of a Machine Learning Model for Pulmonary Embolism Prediction in Intensive Care

Sampath Rapuri

Mentored by Robert D. Stevens

Pulmonary embolism (PE) is a frequent and life-threatening complication in hospitalized patients. While statistical risk scores have been proposed, there is an unmet need for more accurate methods that can forecast the likelihood of developing PE.

Current PE scoring systems (e.g., Modified Wells Scoring System, Revised Geneva Scoring System, and the Pulmonary Embolism Rule Out Criteria) allow clinicians to estimate risk based on a discrete number of predictive factors used as inputs in multivariable logistic regression equations [1]. However, these models have limited predictive accuracy, in part because they may not consider the physiological complexity of acutely ill patients, and because the statistical modeling techniques used may not be optimized for dynamic prediction tasks [2,3].

I used eICU, a large multicenter database containing >200,000 ICU admissions from 208 hospitals across the US, to identify and analyze 2,799 unique ICU stays in which a diagnosis of PE was recorded. Because the exact time of PE onset could not be reliably determined from the data, I created three separate datasets using different observation windows (12, 24, and 48 hours) of time-dependent features (e.g., lab values and vital signs). Figure 1 details the inclusion and exclusion criteria for the ICU stays used in this study. On each observation window, I evaluated multiple machine learning (ML) models: decision tree, random forest (RF), gradient boosting (XGBoost, CatBoost, and GBoost), generalized linear models (GLM), support vector machine (SVM), and artificial neural network (ANN). After hyperparameter tuning, I compared these models with current PE risk scoring models (Wells and Geneva). Figure 2 shows the AUROC scores of the top-performing model (logistic regression) and the current risk scoring models for all three observation windows.
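As a minimal illustration of the two ideas above (summarizing time-dependent features within an observation window, and comparing predictors by AUROC), the following Python sketch may help. It is not the study's code: the function names `window_features` and `auroc`, the `(hours_since_admission, value)` input layout, the choice of min/max/mean summaries, and all toy numbers are assumptions made for illustration only.

```python
from statistics import mean


def window_features(measurements, window_hours):
    """Summarize one variable over the first `window_hours` of an ICU stay.

    `measurements` is a list of (hours_since_admission, value) pairs
    (a hypothetical layout). Returns None if the window holds no values.
    """
    in_window = [v for t, v in measurements if 0 <= t <= window_hours]
    if not in_window:
        return None
    return {"min": min(in_window), "max": max(in_window), "mean": mean(in_window)}


def auroc(labels, scores):
    """AUROC as the probability that a positive case outscores a negative one.

    Uses the pairwise (Mann-Whitney) formulation; tied scores count as 0.5.
    """
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("AUROC needs at least one positive and one negative label")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


# Toy data (invented): heart-rate readings for a single stay.
heart_rate = [(1.0, 88), (10.0, 104), (30.0, 121)]
for hours in (12, 24, 48):
    print(hours, window_features(heart_rate, hours))

# Toy comparison (invented): model probabilities vs. a point-based risk score.
labels = [0, 0, 1, 1]
model_probs = [0.1, 0.4, 0.35, 0.8]
risk_points = [1.5, 4.5, 3.0, 4.5]
print("model AUROC:", auroc(labels, model_probs))
print("score AUROC:", auroc(labels, risk_points))
```

Because AUROC depends only on how cases are ranked, the same function can score both probabilistic model outputs and integer-valued clinical risk scores, which is what makes a side-by-side comparison across the 12-, 24-, and 48-hour windows possible.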

I am currently in the process of external validation with the Precision Medicine Analytics Platform (PMAP) dataset from Johns Hopkins. Additionally, I am revisiting the hyperparameter tuning process to address issues of overfitting.

After external validation, I aim to publish these results in a peer-reviewed journal.

This project taught me a great deal, both technically and professionally. I learned about the entire scientific process, from crafting a hypothesis to sharing my results with the broader scientific community.

References:

1. Doherty S. Pulmonary embolism: An update. Aust Fam Physician. 2017 Nov;46(11):816-820. PMID: 29101916

2. van Doorn, W. P., Stassen, P. M., Borggreve, H. F., Schalkwijk, M. J., Stoffers, J., Bekers, O., & Meex, S. J. (2021). A comparison of machine learning models versus clinical evaluation for mortality prediction in patients with sepsis. PLOS ONE, 16(1). https://doi.org/10.1371/journal.pone.0245157

3. Kamran Boka, M. D. (2022, June 29). Pulmonary embolism clinical scoring systems. Overview, Modified Wells Scoring System, Revised Geneva Scoring System. Retrieved January 16, 2023, from https://emedicine.medscape.com/article/1918940-overview

Sampath Rapuri

Sampath Rapuri is a second-year student pursuing a dual degree in Biomedical Engineering and Computer Science. He is interested in developing computational pipelines that can improve the precision, efficacy, and outcomes of care delivered to critically ill patients.
