Aims: This study evaluates the segmentation performance of pretrained general-purpose vision foundation models (FMs) for delineating the left atrium (LA) from cardiac MRI, a task that supports the diagnosis of atrial fibrillation and guides ablation therapy by enabling identification and quantification of scar tissue in the LA wall. It explores whether the robust features learned by general-purpose vision FMs can be transferred to a task-specific segmentation problem, namely LA segmentation from cardiac MRI.
Methods: We evaluated three FM families as frozen encoders paired with UNet decoders: DINOv2 (base, large, and giant), SAM (huge), and MedSAM (base). The models were evaluated on the LAScarQS 2022 Task 2 LA dataset. All methods were trained with binary cross-entropy loss using the Adam optimizer, a batch size of 32, and 50 epochs. A 10% validation split held out from the training set was used to prevent overfitting and optimize performance. Segmentation performance was evaluated using the Dice Similarity Coefficient (DSC) and Intersection over Union (IoU) on a test set.
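As a concrete illustration of this frozen-encoder setup, the following PyTorch sketch pairs a frozen DINOv2-large backbone with a lightweight convolutional decoder head trained with binary cross-entropy and Adam. The decoder architecture, learning rate, and 224x224 input size are illustrative assumptions and stand in for the full UNet decoder used in the study.

```python
# Minimal sketch (not the exact pipeline): frozen DINOv2-large encoder plus a
# small convolutional decoder head, trained with binary cross-entropy and Adam.
import torch
import torch.nn as nn

encoder = torch.hub.load("facebookresearch/dinov2", "dinov2_vitl14")
encoder.eval()
for p in encoder.parameters():            # freeze the foundation-model encoder
    p.requires_grad = False

class SegHead(nn.Module):
    """Maps 16x16 patch tokens (dim 1024) to a full-resolution binary mask."""
    def __init__(self, dim=1024):
        super().__init__()
        self.decode = nn.Sequential(
            nn.Conv2d(dim, 256, 3, padding=1), nn.ReLU(),
            nn.Upsample(scale_factor=14, mode="bilinear", align_corners=False),
            nn.Conv2d(256, 1, 1),
        )

    def forward(self, tokens):             # tokens: (B, 256, 1024)
        b, n, c = tokens.shape
        h = w = int(n ** 0.5)               # 16x16 patch grid for 224x224 input
        feat = tokens.permute(0, 2, 1).reshape(b, c, h, w)
        return self.decode(feat)

head = SegHead()
optimizer = torch.optim.Adam(head.parameters(), lr=1e-4)  # lr is an assumption
criterion = nn.BCEWithLogitsLoss()          # binary cross-entropy on logits

def train_step(images, masks):
    """images: (B, 3, 224, 224); masks: (B, 1, 224, 224) with values in {0, 1}."""
    with torch.no_grad():                   # encoder stays frozen
        tokens = encoder.forward_features(images)["x_norm_patchtokens"]
    logits = head(tokens)
    loss = criterion(logits, masks)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```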
Results: In our experiments, DINOv2-large with the UNet decoder achieved the highest performance, with a DSC of 91.5% ± 3.9% and an IoU of 84.5% ± 6.1%, outperforming the state-of-the-art baseline TransUNet (DSC 86.4% ± 3.5%, IoU 77.1% ± 5.3%). The other two FMs, SAM (86.5% ± 6.4%, 78.1% ± 9.1%) and MedSAM (86.3% ± 18.4%, 77.2% ± 17.8%), also showed improved performance with the UNet decoder, though they remained slightly below DINOv2.
Conclusion: Our findings highlight the strong generalizability of DINOv2 for medical image segmentation, even without domain-specific pre-training. The combination of frozen foundation model encoders and hierarchical decoders enables accurate LA segmentation. In future work, we will explore parameter-efficient fine-tuning (PEFT) techniques, such as Low-Rank Adaptation (LoRA) and layer-wise freezing, with the UNet decoder to further optimize segmentation performance.