Accurate ECG analysis can help prioritize limited serological testing capacity for Chagas disease, potentially improving early detection and treatment outcomes. We present a robust algorithm for identifying Chagas cases from 12-lead ECGs.
Training data comprised the official Challenge dataset, which includes the heterogeneous CODE-15%, SaMi-Trop, and PTB-XL ECG datasets. Signal preprocessing was minimal: missing ages were imputed with the mean (50 years), and no explicit filtering, resampling, or normalization was applied. We extracted seven demographic and ECG-derived features: age, sex (one-hot encoded into three features), signal mean, signal standard deviation, and the Standard Deviation of Normal-to-Normal intervals (SDNN), computed at each recording's native sampling frequency (100 Hz, 400 Hz, or 500 Hz) so that it remains comparable across datasets. A gradient-boosting classifier (XGBoost) was chosen for its robustness to overfitting and imbalanced data. Optuna optimized the hyperparameters (e.g., n_estimators, learning_rate) by maximizing the mean Area Under the Receiver Operating Characteristic Curve (AUROC) over 10-fold stratified cross-validation (CV); scale_pos_weight up-weighted the minority positive class to mitigate majority-class bias. The final decision threshold (0.60) was determined via 5-fold stratified CV by optimizing the F1-score, with the aim of tailoring predictions to the task-specific TPR@Top5% evaluation metric.
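The feature extraction described above can be sketched as follows. This is a minimal illustration, not the authors' implementation: the function names, the threshold-based R-peak detector, the use of the first lead for SDNN, and the sex categories ("Female", "Male", other) are all assumptions made for the example. The point it shows is that converting peak-to-peak distances from samples to milliseconds with the recording's own sampling frequency makes SDNN comparable across 100 Hz, 400 Hz, and 500 Hz recordings without resampling.

```python
import numpy as np


def sdnn_ms(lead, fs, refractory_s=0.25):
    """SDNN in milliseconds for one lead sampled at fs Hz.

    Assumption: R-peaks are found with a simple amplitude threshold and a
    refractory period; the source does not state which detector was used.
    """
    x = np.asarray(lead, dtype=float)
    x = x - np.median(x)
    threshold = 0.6 * np.max(x)
    # Local maxima that exceed the amplitude threshold.
    is_peak = (x[1:-1] > x[:-2]) & (x[1:-1] >= x[2:]) & (x[1:-1] > threshold)
    candidates = np.flatnonzero(is_peak) + 1
    min_gap = int(refractory_s * fs)  # refractory period in samples
    peaks = []
    for idx in candidates:
        if not peaks or idx - peaks[-1] >= min_gap:
            peaks.append(idx)
        elif x[idx] > x[peaks[-1]]:
            peaks[-1] = idx  # keep the taller peak inside the refractory window
    if len(peaks) < 3:
        return 0.0
    # Dividing by fs puts the intervals in time units, independent of fs.
    nn_ms = np.diff(peaks) / fs * 1000.0
    return float(np.std(nn_ms, ddof=1))


def extract_features(signal, fs, age, sex):
    """Return the 7 features: age, 3 one-hot sex flags, mean, std, SDNN.

    signal: array of shape (n_samples, n_leads). Category names are assumed.
    """
    signal = np.asarray(signal, dtype=float)
    age = 50.0 if age is None or np.isnan(age) else float(age)  # mean imputation
    sex_onehot = [
        float(sex == "Female"),
        float(sex == "Male"),
        float(sex not in ("Female", "Male")),
    ]
    return np.array(
        [age, *sex_onehot, signal.mean(), signal.std(), sdnn_ms(signal[:, 0], fs)]
    )


def make_impulse_train(beat_times_s, fs, duration_s):
    """Synthetic test signal: unit impulses at the given beat times."""
    x = np.zeros(int(duration_s * fs))
    idx = np.round(np.asarray(beat_times_s) * fs).astype(int)
    x[idx] = 1.0
    return x


beat_times = [0.5, 1.3, 2.2, 3.0, 3.9, 4.7]  # intervals alternate 0.8 s / 0.9 s
sdnn_by_fs = {
    fs: sdnn_ms(make_impulse_train(beat_times, fs, 6.0), fs)
    for fs in (100, 400, 500)
}
```

On this synthetic impulse train the three entries of `sdnn_by_fs` agree, because the beat intervals are expressed in milliseconds before the standard deviation is taken. Real ECGs would need a more careful R-peak detector than the one assumed here.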
Hyperparameter optimization yielded a mean AUROC of 0.756 ± 0.02. The cross-validated true positive rate among the 5% of records with the highest predicted probabilities (TPR@Top5%) was 0.412, albeit with high variance (± 0.52, threshold 0.60). The model achieved a leaderboard score of 0.461 (Team: Deakin_ML_Team) in the unofficial phase of the Challenge.
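For readers unfamiliar with the metric, TPR@Top5% can be computed roughly as below. This is a sketch under stated assumptions: the function name is hypothetical, and the official Challenge scoring code may round the cutoff or break ties differently. It ranks records by predicted probability, keeps the top 5%, and reports the fraction of all true positives that fall in that group.

```python
import numpy as np


def tpr_at_top_fraction(labels, scores, fraction=0.05):
    """Fraction of all positives found among the top-scoring `fraction` of records."""
    labels = np.asarray(labels, dtype=bool)
    scores = np.asarray(scores, dtype=float)
    n_pos = labels.sum()
    if n_pos == 0:
        return float("nan")  # undefined without any positive record
    # Assumption: the cutoff is floor(fraction * n), with at least one record.
    k = max(1, int(np.floor(fraction * len(scores))))
    top = np.argsort(-scores, kind="stable")[:k]
    return float(labels[top].sum() / n_pos)


# Toy example: 100 records with 4 positives, 2 of which rank in the top 5.
scores = np.arange(100) / 100.0
labels = np.zeros(100, dtype=int)
labels[[99, 98, 50, 10]] = 1
example_tpr = tpr_at_top_fraction(labels, scores)
```

In the toy example the top five records are those with indices 95 to 99, which contain two of the four positives. The metric depends only on the ranking of the scores.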
Our systematically tuned gradient-boosting model, using seven basic features, performs robustly despite minimal preprocessing. This suggests that optimized machine learning is a promising approach for guiding Chagas testing in resource-limited settings using ECG data. Work for the official phase includes developing Transformer models that address weak labels and exploring advanced feature extraction to advance Chagas screening.