Depression is one of the most prevalent psychiatric disorders, and early detection and intervention improve treatment outcomes. Depressive symptoms typically manifest in ECG-derived emotional signals and in the voiceprint; however, the heterogeneity of these modalities makes it challenging to extract effective representations. To address this issue, we propose an LSTM-based Sinc network (Sinc-LSTM) to extract time-frequency domain features from both ECG and speech. We then employ a cross-modal attention mechanism to fuse these features based on contrastive learning. Finally, we collect an ECG and speech dataset for depression detection (ESDD). Comparative experiments were conducted on an ECG emotion dataset (WESAD) and on ESDD. Emotion classification accuracy on WESAD is 0.95, with an F1 score of 0.98; depression classification accuracy on ESDD is 0.81, with an F1 score of 0.75. The results show that the cross-attention mechanism along the temporal dimension effectively aggregates relevant features from different time periods, enabling accurate and comprehensive depression assessment.
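As a rough illustration of the cross-modal fusion step (a minimal sketch, not the authors' implementation; the function name and toy shapes are assumptions), scaled dot-product attention can let one modality's time steps act as queries over the other modality's time steps, aggregating relevant features across time periods:

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax over the given axis.
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def cross_modal_attention(ecg_feats, speech_feats):
    """Attend from ECG time steps (queries) to speech time steps
    (keys/values). Shapes: (T_ecg, d) and (T_speech, d).
    Returns a fused representation of shape (T_ecg, d)."""
    d = ecg_feats.shape[-1]
    # Similarity of every ECG step to every speech step, scaled by sqrt(d).
    scores = ecg_feats @ speech_feats.T / np.sqrt(d)  # (T_ecg, T_speech)
    weights = softmax(scores, axis=-1)  # rows sum to 1
    return weights @ speech_feats  # weighted sum of speech features

# Toy example: 4 ECG steps, 6 speech steps, 8-dim features.
rng = np.random.default_rng(0)
fused = cross_modal_attention(rng.standard_normal((4, 8)),
                              rng.standard_normal((6, 8)))
print(fused.shape)
```

In a full model this fusion would typically be applied symmetrically in both directions and trained jointly with the contrastive objective described above.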