The Transformer architecture has shown excellent results on natural language processing tasks, and various attempts have been made to adapt it to other domains, including computer vision and signal processing. In computer vision, attention mechanisms are often combined with convolutional neural networks (CNNs) to improve model performance. Recently, ‘Vision Transformer’ approaches, image classification models built solely from transformers without CNNs, have achieved better performance than traditional CNN-mixed architectures. Inspired by this idea, we explore the application of the Vision Transformer architecture to the classification of phonocardiogram (PCG) recordings containing heart murmur patterns. Our method uses two-dimensional spectrogram images converted from the one-dimensional sequences that represent the sound signals. These images visualize how the frequency spectrum of a signal varies with time. Using the images, we perform additional training starting from a model pretrained on the imagenet2k dataset, with the intent of allowing the model to learn various features quickly even from a small amount of data. This approach allows us to distinguish heart murmur patterns recorded in PCGs with good accuracy. In addition, we can visualize the attention of the model on a spectrogram image, which could support visual inspection by humans.
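As an illustration of the conversion from a one-dimensional sound signal to a two-dimensional spectrogram image, the following is a minimal sketch using only NumPy. It is not the pipeline used in this work: the function names `log_spectrogram` and `to_uint8_image`, the Hann window, the frame length of 256 samples, the hop of 64 samples, and the 2 kHz sampling rate in the usage example are all assumptions chosen for illustration.

```python
import numpy as np


def log_spectrogram(signal, sample_rate, frame_len=256, hop=64):
    """Short-time Fourier transform of a 1-D signal, returned in decibels.

    Hypothetical helper: frame_len and hop are illustrative defaults,
    not values taken from the paper.
    Returns (spec, freqs, times) with spec of shape (n_freqs, n_frames).
    """
    signal = np.asarray(signal, dtype=np.float64)
    if signal.size < frame_len:
        # Zero-pad very short inputs so that at least one frame exists.
        signal = np.pad(signal, (0, frame_len - signal.size))

    n_frames = 1 + (signal.size - frame_len) // hop
    window = np.hanning(frame_len)

    # Index matrix of shape (n_frames, frame_len): one row per overlapping frame.
    idx = np.arange(frame_len)[None, :] + hop * np.arange(n_frames)[:, None]
    frames = signal[idx] * window

    # Power spectrum of each frame; the small constant avoids log(0).
    power = np.abs(np.fft.rfft(frames, axis=1)) ** 2
    log_power = 10.0 * np.log10(power + 1e-10)

    freqs = np.fft.rfftfreq(frame_len, d=1.0 / sample_rate)
    times = (np.arange(n_frames) * hop + frame_len / 2) / sample_rate
    return log_power.T, freqs, times


def to_uint8_image(spec):
    """Rescale a spectrogram to an 8-bit grayscale image.

    The rows are flipped so that low frequencies sit at the bottom of the image.
    """
    lo, hi = spec.min(), spec.max()
    scaled = (spec - lo) / (hi - lo) if hi > lo else np.zeros_like(spec)
    return np.flipud((255 * scaled).astype(np.uint8))


if __name__ == "__main__":
    # Synthetic 1-second, 100 Hz tone standing in for a PCG recording (assumption).
    sr = 2000
    t = np.arange(sr) / sr
    spec, freqs, times = log_spectrogram(np.sin(2 * np.pi * 100 * t), sr)
    image = to_uint8_image(spec)
    print(spec.shape, image.dtype)
```

The resulting array has one row per frequency bin and one column per time frame. Resizing it to the input resolution expected by a pretrained model, and replicating it across color channels if required, would be a separate step that is not shown here.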