Depression is one of the most prevalent psychiatric disorders, and early detection and intervention improve treatment outcomes. Depressive symptoms typically manifest in ECG-derived emotional signals and in the voiceprint; however, the heterogeneity of these modalities makes it challenging to extract effective representations. To address this issue, we propose an LSTM-based Sinc network (Sinc-LSTM) to extract time-frequency domain features from both ECG and speech. We then employ a cross-modal attention mechanism to fuse these features based on contrastive learning. Finally, we collect an ECG and speech dataset for depression detection (ESDD). Comparative experiments were conducted on an ECG emotion dataset (WESAD) and on ESDD. Emotion classification accuracy on WESAD is 0.95, with an F1 score of 0.98; depression classification accuracy on ESDD is 0.81, with an F1 score of 0.75. The results show that the cross-attention mechanism along the temporal dimension effectively aggregates relevant features from different time periods, enabling accurate and comprehensive depression assessment.
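As a rough illustration of the cross-modal fusion step (a minimal sketch, not the authors' implementation; the function name and toy shapes are assumptions), scaled dot-product attention can let one modality's time steps act as queries over the other modality's time steps, aggregating relevant features across time periods:

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax over the given axis.
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def cross_modal_attention(ecg_feats, speech_feats):
    """Attend from ECG time steps (queries) to speech time steps
    (keys/values). Shapes: (T_ecg, d) and (T_speech, d).
    Returns a fused representation of shape (T_ecg, d)."""
    d = ecg_feats.shape[-1]
    # Similarity of every ECG step to every speech step, scaled by sqrt(d).
    scores = ecg_feats @ speech_feats.T / np.sqrt(d)  # (T_ecg, T_speech)
    weights = softmax(scores, axis=-1)  # rows sum to 1
    return weights @ speech_feats  # weighted sum of speech features

# Toy example: 4 ECG steps, 6 speech steps, 8-dim features.
rng = np.random.default_rng(0)
fused = cross_modal_attention(rng.standard_normal((4, 8)),
                              rng.standard_normal((6, 8)))
print(fused.shape)
```

In a full model this fusion would typically be applied symmetrically in both directions and trained jointly with the contrastive objective described above.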