Quantifying Uncertainty of a Deep Learning Model for Atrial Fibrillation Detection from ECG Signals

Md Moklesur Rahman1, Massimo W Rivolta1, Fabio Badilini2, Roberto Sassi1
1Dipartimento di Informatica, Università degli Studi di Milano, 2AMPS LLC


Abstract

Recently, deep learning (DL) has proved capable of identifying atrial fibrillation (AF) from electrocardiograms (ECGs) with high performance. Nevertheless, these models may be overconfident in their predictions, showing poor calibration of their output probabilities. As a consequence, the output probabilities are biased with respect to the accuracy the model actually achieves at each probability level. In addition, such models cannot quantify the uncertainty of their predictions, a fundamental property in clinical practice. In this study, we compared two DL models with the same architecture, except that the second had its first and last layers trained using a Bayesian approach, i.e., variational inference (VI), which allows the uncertainty of the predictions to be estimated. We focused on quantifying the calibration error of both models and the uncertainty of the second. The calibration error was determined using a well-known metric, the Expected Calibration Error (ECE), and uncertainty was quantified with a Monte Carlo approach, by running the Bayesian model 100 times for each test sample. The models were developed using the MIT-BIH Atrial Fibrillation dataset, with an 80:20 split of patients for training and validation. Our experiments showed that the first model performed very well on the AF detection task (sensitivity: 97.62%; specificity: 96.92%). However, despite the high test-set accuracy, the model proved poorly calibrated, with an ECE of 0.36. In contrast, the second model was better calibrated (ECE = 0.042) while retaining similar performance (sensitivity: 98.83%; specificity: 98.52%). The quantification of uncertainty for the Bayesian model showed that the proportion of misclassified samples tested against random guess was 89%, suggesting that the majority of mistakes could be identified at testing time, thanks to the confidence interval provided by the Bayesian approach.
Our study demonstrated the importance of quantifying both calibration error and uncertainty of DL models for AF detection from ECG signals.
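
The two quantities discussed above can be illustrated with a minimal Python sketch. It is not the authors' implementation: the function names, the 10 equal-width confidence bins, the 0.5 decision threshold, the percentile-based interval, and the rule "interval contains 0.5 means compatible with random guess" are all assumptions made for illustration. Only the use of ECE and of 100 Monte Carlo runs per test sample comes from the abstract.

```python
import numpy as np


def expected_calibration_error(probs, labels, n_bins=10):
    """ECE of a binary classifier (assumed: equal-width bins, 0.5 threshold).

    probs  : predicted probability of the positive class (AF), shape (n,)
    labels : ground-truth labels in {0, 1}, shape (n,)
    """
    probs = np.asarray(probs, dtype=float)
    labels = np.asarray(labels, dtype=int)

    preds = (probs >= 0.5).astype(int)
    # Confidence is the probability assigned to the predicted class.
    conf = np.where(preds == 1, probs, 1.0 - probs)
    correct = (preds == labels).astype(float)

    # Assign each sample to one of n_bins equal-width confidence bins.
    inner_edges = np.linspace(0.0, 1.0, n_bins + 1)[1:-1]
    bin_idx = np.digitize(conf, inner_edges, right=True)

    ece = 0.0
    for b in range(n_bins):
        mask = bin_idx == b
        if mask.any():
            gap = abs(correct[mask].mean() - conf[mask].mean())
            ece += mask.mean() * gap  # weight by the fraction of samples in the bin
    return ece


def mc_uncertainty(sample_fn, x, n_runs=100, alpha=0.05):
    """Monte Carlo summary of a stochastic (e.g. Bayesian) model's output.

    sample_fn : hypothetical callable returning one probability per call,
                each call drawing new weights from the learned posterior
    Returns the mean probability, a (1 - alpha) percentile interval, and
    whether that interval contains 0.5 (assumed proxy for "random guess").
    """
    draws = np.array([sample_fn(x) for _ in range(n_runs)], dtype=float)
    lo, hi = np.quantile(draws, [alpha / 2.0, 1.0 - alpha / 2.0])
    return draws.mean(), (lo, hi), bool(lo <= 0.5 <= hi)
```

Under these assumptions, a test sample whose interval contains 0.5 would be flagged as uncertain, which is one way a misclassification might be caught at testing time.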