An Embedding Approach for Biomarker Identification in Hypertrophic Cardiomyopathy

Arash Kazemi-Díaz1, Luis Bote-Curiel1, María Sabater-Molina2, Juan Gimeno-Blanes3, Salvador Sala-Pla4, Francisco Gimeno-Blanes4, Sergio Muñoz-Romero1, Jose Luis Rojo-Alvarez1
1Universidad Rey Juan Carlos, 2Instituto Murciano de Investigación Biosanitaria, 3Hospital Clínico Universitario Virgen de la Arrixaca, 4Universidad Miguel Hernández


Abstract

Introduction: Hypertrophic cardiomyopathy (HCM) is a thickening of the cardiac muscle that causes fatigue, changes in the cardiac electrical system, arrhythmias, and even sudden death. Variants in the MYBPC3 gene are a well-known cause of this disease. Our objective was to find variants in other genes that may cause this pathology.

Experiments and results: To that end, genetic data from a group of patients (affected and unaffected) were analyzed using machine learning techniques. More precisely, we propose embedding methods that yield a lower-dimensional representation, which is helpful for visualization, diagnosis, and therapy personalization. Applying different methods to the genetic data (Principal Component Analysis (PCA), t-distributed Stochastic Neighbor Embedding (t-SNE), Uniform Manifold Approximation and Projection (UMAP), Orthonormalized Partial Least Squares (OPLS), and Supervised Autoencoders) showed very good separability in the embedded space, allowing us to identify 10 variants in 9 different genes that drive that separability. Once the causes of the separability were identified, we applied the same methods again to check the new data distribution. The separability in the new space was measured by fitting a machine learning classifier (Support Vector Machines) and checking how well it fitted the data. The predictions were poor in all the new embedded spaces, meaning that the separability was low.
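
The workflow described above can be illustrated with a minimal sketch. It is not the authors' implementation and rests on several assumptions: the data are synthetic binary variant indicators (no patient data); only PCA is shown, computed by power iteration; the SVM is a simple linear one trained by subgradient descent; candidate variants are ranked by their absolute loading on the first component; and the "new data" are taken to be the data with those variants removed. All function and variable names are hypothetical.

```python
import random

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def pca_embed(X, k=2, iters=100):
    """Project rows of X onto k leading principal directions
    (power iteration with deflation; later components are approximate)."""
    n, d = len(X), len(X[0])
    means = [sum(row[j] for row in X) / n for j in range(d)]
    Xc = [[row[j] - means[j] for j in range(d)] for row in X]
    R = [row[:] for row in Xc]  # residual data, deflated after each component
    comps = []
    for _ in range(k):
        v = [1.0 / d ** 0.5] * d
        for _ in range(iters):
            scores = [dot(r, v) for r in R]
            w = [sum(s * r[j] for s, r in zip(scores, R)) for j in range(d)]  # R^T R v
            norm = dot(w, w) ** 0.5
            v = [x / norm for x in w]
        comps.append(v)
        deflated = []
        for r in R:
            s = dot(r, v)
            deflated.append([rj - s * vj for rj, vj in zip(r, v)])
        R = deflated
    Z = [[dot(row, c) for c in comps] for row in Xc]
    return Z, comps

def svm_fit_accuracy(Z, y, lam=0.01, lr=0.1, epochs=500):
    """Fit a linear SVM (hinge loss + L2 penalty) by full-batch subgradient
    descent and return its accuracy on the same data it was fitted to."""
    n, d = len(Z), len(Z[0])
    w, b = [0.0] * d, 0.0
    for _ in range(epochs):
        gw, gb = [lam * wj for wj in w], 0.0
        for z, yi in zip(Z, y):
            if yi * (dot(w, z) + b) < 1:  # sample violates the margin
                gw = [g - yi * zj / n for g, zj in zip(gw, z)]
                gb -= yi / n
        w = [wj - lr * g for wj, g in zip(w, gw)]
        b -= lr * gb
    correct = sum(1 for z, yi in zip(Z, y) if yi * (dot(w, z) + b) > 0)
    return correct / n

# Synthetic cohort (assumption): 50 cases (+1) and 50 controls (-1), 12 variants,
# of which variants 0-3 are far more frequent in cases than in controls.
rng = random.Random(0)
n_per_class, n_variants, informative = 50, 12, [0, 1, 2, 3]
X, y = [], []
for label in (1, -1):
    for _ in range(n_per_class):
        p_inf = 0.9 if label == 1 else 0.1
        X.append([1.0 if rng.random() < (p_inf if j in informative else 0.3) else 0.0
                  for j in range(n_variants)])
        y.append(label)

# Step 1: embed and measure separability in the embedded space.
Z, comps = pca_embed(X)
acc_before = svm_fit_accuracy(Z, y)

# Step 2: rank variants by absolute loading on the first component (assumption).
top_variants = sorted(range(n_variants), key=lambda j: -abs(comps[0][j]))[:4]

# Step 3: drop those variants, embed again, and re-measure separability.
X_reduced = [[v for j, v in enumerate(row) if j not in top_variants] for row in X]
Z_reduced, _ = pca_embed(X_reduced)
acc_after = svm_fit_accuracy(Z_reduced, y)

print("candidate variants:", sorted(top_variants))
print("SVM fit accuracy before removal:", acc_before)
print("SVM fit accuracy after removal:", acc_after)
```

In this toy setting, a drop in accuracy after removing the top-ranked variants suggests that those variants were responsible for the separation in the embedded space. Accuracy is computed on the fitting data, mirroring the abstract's "how well it fitted the data"; a held-out estimate would be preferable in practice.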

Conclusion: This study explored the differences between controls and HCM patients by embedding the original data into lower-dimensional latent spaces. This allowed us to identify 10 variants that are potential causes of the disease. Although this information is not conclusive enough to determine whether those mutations cause HCM, it may help clinicians identify new HCM cases.