Effective Chagas Serological Test Prioritization via Optimized Feature-Based Gradient Boosting

Vinayaka Vivekananda Malgi and Sunil Aryal
Deakin University


Abstract

Accurate ECG analysis can prioritize limited serological testing capacity for Chagas disease, potentially improving early detection and treatment outcomes. We present a robust algorithm for identifying Chagas cases from 12-lead ECGs.

Training data comprised the official Challenge dataset, including the heterogeneous CODE-15%, SaMi-Trop, and PTB-XL ECG datasets. Signal preprocessing was minimal: missing ages were imputed with the mean (50), and no explicit filtering, resampling, or normalization was applied. We extracted seven demographic and ECG-derived features: age, sex (one-hot encoded into three features), signal mean, signal standard deviation, and the standard deviation of normal-to-normal intervals (SDNN), computed at each recording's native sampling frequency (100 Hz, 400 Hz, or 500 Hz) so that the feature does not depend on the sampling rate. A gradient-boosting classifier (XGBoost) was chosen for its resistance to overfitting and its handling of imbalanced data. Optuna tuned the hyperparameters (e.g., n_estimators, learning_rate) by maximizing the mean Area Under the Receiver Operating Characteristic Curve (AUROC) over 10-fold stratified cross-validation (CV), with scale_pos_weight up-weighting the minority positive class to mitigate majority-class bias. The final decision threshold (0.60) was selected via 5-fold stratified CV by maximizing the F1-score, with the aim of tailoring predictions to the task-specific TPR@Top5% evaluation metric.
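
As an illustration of the sampling-rate-independent SDNN feature described above, the following is a minimal sketch rather than the authors' implementation. It assumes a naive threshold-based R-peak detector (the abstract does not name a detector), converts the refractory period from seconds to samples using the native sampling frequency, and expresses RR intervals in milliseconds so that recordings at 100 Hz, 400 Hz, and 500 Hz yield comparable values. The function names, the 0.25 s refractory period, the 0.6 relative threshold, and the synthetic spike train are all hypothetical.

```python
import statistics


def detect_r_peaks(signal, fs, refractory_s=0.25, rel_threshold=0.6):
    """Naive R-peak detector (assumed for illustration only).

    A sample counts as a peak if it is a local maximum, exceeds a fraction of
    the largest baseline-corrected amplitude, and lies at least
    `refractory_s` seconds after the previous accepted peak.
    """
    if len(signal) < 3:
        return []
    baseline = statistics.fmean(signal)
    centered = [x - baseline for x in signal]
    threshold = rel_threshold * max(centered)
    min_gap = int(refractory_s * fs)  # refractory period in samples
    peaks = []
    for i in range(1, len(centered) - 1):
        is_local_max = centered[i] > centered[i - 1] and centered[i] >= centered[i + 1]
        if is_local_max and centered[i] >= threshold:
            if not peaks or i - peaks[-1] >= min_gap:
                peaks.append(i)
    return peaks


def sdnn_ms(signal, fs):
    """Sample standard deviation of RR intervals, in milliseconds."""
    peaks = detect_r_peaks(signal, fs)
    if len(peaks) < 3:
        return 0.0  # fewer than two RR intervals: no meaningful spread
    # Dividing by fs converts sample counts to time, which removes the
    # dependence on the recording's sampling rate.
    rr_ms = [(b - a) * 1000.0 / fs for a, b in zip(peaks, peaks[1:])]
    return statistics.stdev(rr_ms)


def synthetic_spike_train(beat_times_s, fs, duration_s):
    """Hypothetical test signal: unit spikes at the given beat times."""
    signal = [0.0] * int(duration_s * fs)
    for t in beat_times_s:
        signal[round(t * fs)] = 1.0
    return signal


if __name__ == "__main__":
    beats = [0.5, 1.3, 2.3, 3.1, 4.1, 4.9]  # RR intervals alternate 800/1000 ms
    for fs in (100, 400, 500):
        print(fs, round(sdnn_ms(synthetic_spike_train(beats, fs, 6.0), fs), 3))
```

Because the intervals are converted to milliseconds before the spread is computed, the same beat timing gives the same SDNN at every sampling rate in this sketch. A real pipeline would need a more careful beat detector for noisy ECGs.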

Hyperparameter optimization yielded a mean cross-validated AUROC of 0.756 ± 0.02. The cross-validated true positive rate among the 5% of recordings with the highest predicted probabilities (TPR@Top5%) was 0.412, albeit with high variance (± 0.52, at threshold 0.60). The model achieved a leaderboard score of 0.461 (Team: Deakin_ML_Team) in the unofficial phase of the Challenge.
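
To make the TPR@Top5% figure concrete, the sketch below computes the fraction of all true positives that fall among the 5% of recordings with the highest predicted probabilities. It reflects our reading of the metric and is not the official scoring code; the function name, the use of the ceiling when sizing the top group, and the handling of ties (first-come order after sorting) are assumptions.

```python
import math


def tpr_at_top_fraction(labels, probabilities, fraction=0.05):
    """Share of all positives captured in the top `fraction` of ranked records.

    labels: iterable of 0/1 ground-truth values.
    probabilities: predicted probability of the positive class per record.
    """
    labels = list(labels)
    probabilities = list(probabilities)
    if len(labels) != len(probabilities):
        raise ValueError("labels and probabilities must have the same length")
    n_pos = sum(1 for y in labels if y == 1)
    if n_pos == 0:
        return float("nan")  # metric is undefined without positives
    # Assumption: round the group size up and keep at least one record.
    k = max(1, math.ceil(fraction * len(labels)))
    ranked = sorted(range(len(labels)), key=lambda i: probabilities[i], reverse=True)
    hits = sum(1 for i in ranked[:k] if labels[i] == 1)
    return hits / n_pos
```

Under this definition only the ranking of the probabilities matters, so a fixed decision threshold such as 0.60 changes the binary predictions but not the value returned here.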

Our systematically tuned gradient-boosting model, using seven basic features, demonstrates robust performance on heterogeneous ECG data despite minimal preprocessing. This suggests that optimized machine learning is a promising approach for guiding Chagas testing in resource-limited settings using ECG data. Work for the official phase includes developing Transformer models that address weak labels and exploring advanced feature extraction to improve Chagas screening.