Background: Chagas disease remains a largely overlooked parasitic disease with serious consequences for cardiovascular health, particularly in regions with limited diagnostic infrastructure. As a result, early cases can go undetected, leaving patients' lives at risk. We propose an ECG classification pipeline that identifies Chagas-positive individuals from waveform abnormalities and features extracted from 12-lead ECG data.
Methods: We explored differences between positive and negative cases (i.e., predicted age, abnormal vs. normal ECG) in two large ECG datasets, CODE-15 (N=233,770 patients) and SamiTrop (N=1,631 patients). Both datasets were processed in the WFDB waveform format. Waveform comparisons revealed minute differences across ECG leads between positive and negative cases. We used the "tsfresh" Python library to extract time-series features from each ECG lead. After feature extraction, we identified the statistically significant features using p-values from Welch's t-test, computed with the scipy.stats module. The statistically significant features include permutation entropy, root mean square (RMS) power, Benford correlation, peak count, kurtosis, count below mean, standard deviation, longest strike below mean, absolute maximum, and more. For modeling, we trained a Random Forest classifier using scikit-learn on both datasets. To address the substantial class imbalance (~12,000 positives vs. ~300,000 negatives), we adjusted the class weights to favor the underrepresented positive cases. The model hyperparameters were tuned via GridSearchCV with 5-fold cross-validation.
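The selection and modeling steps described in Methods can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' code: the feature matrix is synthetic and stands in for the tsfresh output, and the significance threshold (0.05), the hyperparameter grid, the `roc_auc` scoring choice, and the 80/20 split are all assumed for the example.

```python
import numpy as np
from scipy import stats
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, train_test_split

rng = np.random.default_rng(0)

# Synthetic stand-in for a tsfresh feature matrix (rows: ECG records,
# columns: extracted features). Sizes and imbalance are illustrative.
n_neg, n_pos, n_feat = 950, 50, 20
X = rng.normal(size=(n_neg + n_pos, n_feat))
y = np.r_[np.zeros(n_neg, dtype=int), np.ones(n_pos, dtype=int)]
X[y == 1, :3] += 1.5  # make the first three features informative

# Welch's t-test per feature: equal_var=False drops the equal-variance
# assumption of Student's t-test.
_, p_values = stats.ttest_ind(X[y == 1], X[y == 0], axis=0, equal_var=False)
alpha = 0.05  # assumed threshold; the abstract does not state one
selected = np.flatnonzero(p_values < alpha)

X_train, X_test, y_train, y_test = train_test_split(
    X[:, selected], y, test_size=0.2, stratify=y, random_state=0
)

# class_weight="balanced" reweights classes inversely to their frequency,
# one way to up-weight the rare positive class.
param_grid = {"n_estimators": [50, 100], "max_depth": [3, None]}  # assumed grid
search = GridSearchCV(
    RandomForestClassifier(class_weight="balanced", random_state=0),
    param_grid,
    cv=5,               # 5-fold cross-validation, as in Methods
    scoring="roc_auc",  # assumed metric for model selection
)
search.fit(X_train, y_train)

# Predicted probability of the positive class on the test split.
proba = search.predict_proba(X_test)[:, 1]
print("selected features:", selected)
print("best params:", search.best_params_)
```

With real data, `X` would be replaced by the per-lead tsfresh features and `y` by the Chagas labels. Note that selecting features on the full dataset before splitting, as done here for brevity, can leak information into the test split; a stricter variant would run the t-test on the training portion only.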
Results: Trained on the CODE-15 and SamiTrop datasets and tested on 20% of the CODE-15 dataset, our model achieved a cross-validation score of 0.537. The model trained on both datasets achieved a PhysioNet Challenge score of 0.545.
Conclusion: These results suggest several directions for future work. We plan to investigate SMOTE for oversampling positive cases and to extend the analysis to deep-learning models such as convolutional neural networks (CNNs) to better recognize hidden patterns in ECG leads.