Cross-Modal Attention Fusion of Electrocardiogram Emotion and Voiceprint for Depression Detection

Minghui Zhao1, Lulu Zhao1, Hongxiang Gao1, Keming Cao1, Zhijun Xiao2, Feifei Chen3, Zhaoyang Cong4, Jianqing Li1, Chengyu Liu1
1Southeast University, 2State Key Laboratory of Bioelectronics, School of Instrument Science and Engineering, Southeast University, 3Southeast University, 4State Key Laboratory of Digital Medical Engineering, School of Instrument Science and Engineering, Southeast University


Abstract

Depression is one of the most prevalent psychiatric disorders, and early detection and intervention substantially improve treatment outcomes. Depression symptoms typically manifest in ECG-derived emotional states and in the voiceprint. However, the heterogeneity of these modalities makes it challenging to extract effective representations from each of them. To address this issue, we propose an LSTM-based Sinc network (Sinc-LSTM) to extract time-frequency domain features from both ECG and speech. We then employ a cross-modal attention mechanism, trained with a contrastive learning objective, to fuse these features. Finally, we collect an ECG and speech dataset for depression detection (ESDD). Comparative experiments were conducted on an ECG emotion dataset (WESAD) and on ESDD. Emotion classification accuracy on the WESAD dataset is 0.95, with an F1 score of 0.98; depression classification accuracy on the ESDD dataset is 0.81, with an F1 score of 0.75. The results show that the cross-attention mechanism along the temporal dimension effectively aggregates relevant features from different time periods, thereby enabling an accurate and comprehensive depression assessment.
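To make the two-stage pipeline concrete, the following is a minimal PyTorch sketch of per-modality feature extraction followed by temporal cross-modal attention fusion. All module names, layer sizes, and the plain Conv1d front end standing in for the Sinc filterbank are illustrative assumptions, not the authors' implementation; the contrastive-learning objective is omitted.

```python
import torch
import torch.nn as nn

class SincLSTMEncoder(nn.Module):
    """Hypothetical per-modality encoder: a 1-D convolutional front end
    (stand-in for a Sinc filterbank) followed by a bidirectional LSTM that
    summarizes time-frequency features over time."""
    def __init__(self, in_channels=1, conv_channels=32, hidden=64):
        super().__init__()
        # Plain Conv1d used here in place of learnable Sinc band-pass filters.
        self.frontend = nn.Conv1d(in_channels, conv_channels,
                                  kernel_size=101, stride=4, padding=50)
        self.lstm = nn.LSTM(conv_channels, hidden,
                            batch_first=True, bidirectional=True)

    def forward(self, x):                     # x: (batch, channels, time)
        h = torch.relu(self.frontend(x))      # (batch, conv_channels, time')
        h = h.transpose(1, 2)                 # (batch, time', conv_channels)
        out, _ = self.lstm(h)                 # (batch, time', 2 * hidden)
        return out

class CrossModalAttentionFusion(nn.Module):
    """Cross-attention along the temporal dimension: ECG features attend to
    speech features and vice versa; pooled context vectors are concatenated
    and classified."""
    def __init__(self, dim=128, heads=4, n_classes=2):
        super().__init__()
        self.ecg_to_speech = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.speech_to_ecg = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.classifier = nn.Linear(2 * dim, n_classes)

    def forward(self, ecg_feat, speech_feat):  # (batch, T_e, dim), (batch, T_s, dim)
        ecg_ctx, _ = self.ecg_to_speech(ecg_feat, speech_feat, speech_feat)
        sp_ctx, _ = self.speech_to_ecg(speech_feat, ecg_feat, ecg_feat)
        fused = torch.cat([ecg_ctx.mean(dim=1), sp_ctx.mean(dim=1)], dim=-1)
        return self.classifier(fused)

# Toy usage with assumed input shapes.
ecg_enc, speech_enc = SincLSTMEncoder(), SincLSTMEncoder()
fusion = CrossModalAttentionFusion(dim=128)
ecg = torch.randn(2, 1, 4000)       # e.g. 2 recordings, 1 ECG lead
speech = torch.randn(2, 1, 16000)   # e.g. 2 speech segments
logits = fusion(ecg_enc(ecg), speech_enc(speech))
print(logits.shape)                 # torch.Size([2, 2])
```

The bidirectional LSTM output dimension (2 x 64 = 128) matches the attention dimension, so each modality's full temporal sequence is preserved until fusion, letting the cross-attention aggregate relevant features from different time periods as described in the abstract.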