Adapting Audio Foundation Models for Heart Sound Analysis

Carla Biermann, Jing Han, Cecilia Mascolo
University of Cambridge


Abstract

Foundation models (large pretrained neural networks) have shown potential for heart sound classification tasks. However, how best to adapt a general audio foundation model to these tasks remains an open question. This work systematically studies three domain adaptation techniques on two audio foundation models across four public heart sound databases: freezing the foundation model and training a linear layer on top (linear probing, LP); fine-tuning (FT); and continued pretraining (CP). Our findings demonstrate that LP alone is insufficient for heart sound analysis tasks. While FT improves performance over LP, it yields models that generalise poorly to unseen datasets. To overcome this limitation, we introduce CP as a novel method for heart sounds. We find that further pretraining a model on all datasets together produces a heart sound-specific yet task-agnostic foundation model, which boosts LP and FT performance by up to 13%. Furthermore, we study two CP variants and find that using only the downstream dataset for CP improves the learned representations and boosts LP and FT performance the most. These findings underscore that choosing the right adaptation strategy is critical for heart sound analysis tasks.
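
To make the two supervised adaptation regimes concrete, the sketch below contrasts LP and FT in PyTorch. It is illustrative only, not the authors' implementation: the `encoder` stand-in, embedding dimension, and learning rates are assumptions, and CP (not shown) would additionally continue the foundation model's own self-supervised pretraining objective on heart sound audio before either step.

```python
# Minimal LP vs. FT sketch; hypothetical names and hyperparameters throughout.
import torch
import torch.nn as nn

EMBED_DIM, NUM_CLASSES = 768, 2  # e.g. normal vs. abnormal heart sounds

# Stand-in for a pretrained audio foundation model that maps input
# features to fixed-size embeddings (a real model would be loaded here).
encoder = nn.Sequential(
    nn.Linear(128, EMBED_DIM), nn.ReLU(), nn.Linear(EMBED_DIM, EMBED_DIM)
)
classifier = nn.Linear(EMBED_DIM, NUM_CLASSES)

def build_optimizer(mode: str) -> torch.optim.Optimizer:
    if mode == "LP":
        # Linear probing: freeze the foundation model, train only the
        # linear classification head on top of its embeddings.
        for p in encoder.parameters():
            p.requires_grad = False
        return torch.optim.Adam(classifier.parameters(), lr=1e-3)
    if mode == "FT":
        # Fine-tuning: update encoder and head end-to-end, typically
        # with a smaller learning rate for the pretrained encoder.
        return torch.optim.Adam([
            {"params": encoder.parameters(), "lr": 1e-5},
            {"params": classifier.parameters(), "lr": 1e-3},
        ])
    raise ValueError(f"unknown adaptation mode: {mode}")
```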