Background and Aim. Obstructive sleep apnea (OSA) is a respiratory disorder strongly associated with severe cardiovascular diseases. Although polysomnography is still considered the gold standard for OSA detection, its cost and complexity have raised the need for alternative methods. In this regard, heart rate variability (HRV) and machine learning (ML) have gained popularity in recent years. Consequently, several works have employed publicly available databases to train their models in a reproducible way. However, most of them have also based their validation on cross-validation within a single database. This observation motivated the aim of the present work: to assess how these models would perform in more realistic scenarios, such as on an external database, or even in clinical practice.
Methods. The Apnea-ECG, MIT-BIH Polysomnographic, and University College Dublin databases were re-annotated under the same labeling criteria. The corresponding ECG recordings were segmented into one-minute epochs to extract the HRV signal, together with its most representative features according to the state of the art. Then, several well-known ML classifiers were trained on different combinations of balanced subsets, and each model was evaluated with 10-fold cross-validation. Finally, these models were also tested on the remaining datasets, which were external to the original training sets.
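The evaluation protocol described above can be sketched as follows. This is a minimal, hypothetical illustration with synthetic per-epoch HRV features and a scikit-learn classifier; the feature set, classifier choice, and the simulated domain shift are assumptions for illustration only, not the actual databases or models of this work.

```python
# Hypothetical sketch of the internal vs. external validation protocol.
# Data are synthetic stand-ins for per-epoch HRV features, NOT the
# Apnea-ECG / MIT-BIH / UCD databases.
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import StratifiedKFold, cross_val_score

rng = np.random.default_rng(0)

def make_hrv_dataset(n_epochs, shift=0.0):
    """Simulate one-minute epochs with 4 HRV-like features
    (e.g., mean RR, SDNN, RMSSD, LF/HF ratio -- placeholder names)."""
    y = rng.integers(0, 2, n_epochs)            # 0 = normal, 1 = apneic epoch
    X = rng.normal(0.0, 1.0, (n_epochs, 4)) + 0.8 * y[:, None] + shift
    return X, y

# "Home" database and an external ("alien") database with a domain shift
X_home, y_home = make_hrv_dataset(600)
X_alien, y_alien = make_hrv_dataset(600, shift=1.5)

clf = RandomForestClassifier(n_estimators=100, random_state=0)

# Internal validation: 10-fold cross-validation on a single database
cv = StratifiedKFold(n_splits=10, shuffle=True, random_state=0)
cv_acc = cross_val_score(clf, X_home, y_home, cv=cv).mean()

# External validation: train on one database, test on the alien one
clf.fit(X_home, y_home)
ext_acc = clf.score(X_alien, y_alien)

print(f"10-fold CV accuracy: {cv_acc:.2f}")
print(f"External accuracy:   {ext_acc:.2f}")
```

With the simulated distribution shift between databases, the external accuracy falls below the internal cross-validation estimate, mirroring the gap reported in the Results.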
Results. External validation showed 10--40\% lower performance than 10-fold cross-validation, regardless of the selected model.
Conclusions. The obtained results suggest the need for larger datasets to generalize properly in the apnea detection problem, especially for those ML models trained and tested with cross-validation on a single database, which are likely over-fitted.