A Comparative Analysis of Data-Driven Modelling Techniques for 30-Day Heart Failure Readmission Prediction

Ryan Missel1, Jesdin Raphael1, Linwei Wang1, Christopher Haggerty2, Dustin Hartzel2, Jeffery Ruhl2, Brandon Fornwalt2
1Rochester Institute of Technology, 2Geisinger


Abstract

Background: Numerous studies have examined machine learning approaches to predicting 30-day heart failure readmission, though most consider classical methods on relatively small cohorts (avg. 3,600 patients) and lack ablations on generalization to subpopulations or across multiple hospitals. The potential of deep learning at scale for readmission prediction remains inconclusive.

Objective: To compare data-driven techniques on a large-scale cohort of electronic health records (EHR) under varying data availability and data distributions.

Method: A cohort of 13,000 patients with 30,513 emergency admissions related to heart failure-based ICD-9 codes was collected from the Geisinger Health System. Predictive features, comprising both static and temporal measurements, were derived from EHR data and used to build a deep learning model (specifically an LSTM network) and a suite of 10 machine learning algorithms (notably AdaBoost and Random Forest). The features were drawn from diverse sources, including demographics, medications, recorded vitals, and external lab results.

Results: Models were evaluated on an all-hospital set with 5-fold cross-validation (80% data) and a holdout set (20%). On cross-validation sets, the LSTM model achieved 0.864±0.009 AUC and 0.210±0.470 Recall while the best regression model (AdaBoost) achieved 0.785±0.012 AUC and 0.265±0.014 Recall. On per-hospital experiments, the LSTM saw a noticeable drop in average performance (0.724±0.048 AUC/0.010±0.242 Recall) compared to the best regression model (0.804±0.021 AUC/0.298±0.046 Recall). The cross-hospital LSTM model failed to generalize across hospitals at all (0.547±0.034 AUC/0.014±0.180 Recall).
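
The evaluation protocol described above (a 20% holdout, 5-fold cross-validation on the remaining 80%, and metrics reported as mean and standard deviation over folds) can be sketched in plain Python. This is a minimal illustration under stated assumptions, not the study's pipeline: the function names, the random seed, the synthetic labels, and the placeholder scores are all hypothetical, and the split here is by admission index, since the abstract does not say whether splits were grouped by patient.

```python
import random
import statistics


def auc_score(y_true, y_score):
    """ROC AUC via the Mann-Whitney U statistic; tied pairs count as 0.5."""
    pos = [s for y, s in zip(y_true, y_score) if y == 1]
    neg = [s for y, s in zip(y_true, y_score) if y == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one positive and one negative label")
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def recall_score(y_true, y_pred):
    """Fraction of true positives (readmissions) that were predicted positive."""
    tp = sum(1 for y, p in zip(y_true, y_pred) if y == 1 and p == 1)
    fn = sum(1 for y, p in zip(y_true, y_pred) if y == 1 and p == 0)
    return tp / (tp + fn) if (tp + fn) else 0.0


def holdout_and_folds(n_samples, holdout_frac=0.2, n_folds=5, seed=0):
    """Shuffle indices, reserve a holdout set, and split the rest into folds."""
    idx = list(range(n_samples))
    random.Random(seed).shuffle(idx)
    n_holdout = int(round(n_samples * holdout_frac))
    holdout, dev = idx[:n_holdout], idx[n_holdout:]
    folds = [dev[i::n_folds] for i in range(n_folds)]  # near-equal, disjoint folds
    return dev, holdout, folds


def summarize(values):
    """Mean and sample standard deviation, as in 'mean±std' reporting."""
    return statistics.mean(values), statistics.stdev(values)


if __name__ == "__main__":
    rng = random.Random(1)
    # Synthetic stand-ins: ~20% positive labels and noisy placeholder scores.
    labels = [1 if rng.random() < 0.2 else 0 for _ in range(1000)]
    scores = [0.3 * y + rng.random() for y in labels]
    dev, holdout, folds = holdout_and_folds(len(labels))
    # A real pipeline would fit a model on the other four folds before scoring
    # each validation fold; here the placeholder scores are evaluated directly.
    fold_aucs = [
        auc_score([labels[i] for i in f], [scores[i] for i in f]) for f in folds
    ]
    mean_auc, std_auc = summarize(fold_aucs)
    print(f"CV AUC: {mean_auc:.3f}±{std_auc:.3f}")
```

Reporting recall alongside AUC matters for an imbalanced outcome such as readmission: AUC measures ranking quality across all thresholds, while recall depends on the chosen decision threshold, so a model can have a high AUC and still flag few true readmissions.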

Conclusion: Although the temporal model performs strongly on the large-scale all-hospital dataset, its performance degrades on smaller data sources and in cross-distribution settings, where traditional machine learning techniques generalize better.