Automatic digitization of paper-based electrocardiogram (ECG) records, that is, extracting 1-D ECG time series from 2-D scanned images, requires a model that is robust to the distortions present in scanned images, including rotation, cropping, creases, and text artifacts. As part of the George B. Moody PhysioNet Challenge 2024, our team (mins-eth) proposes a convolutional neural network (CNN) architecture that is trainable end to end to perform the digitization task automatically. Specifically, we employ a spatial transformer, a convolutional module designed to make the network invariant to spatial deformations of its input features, to crop out single leads from the input image. Each single-lead crop is then processed by a denoising U-Net module to produce a binarized 2-D signal, from which the 1-D time series is read out. Because the whole network is trained end to end, we can leverage the provided training set together with data augmentation. Currently, our model achieves a reconstruction signal-to-noise ratio (SNR) of 8.57 in cross-validation on the training set. We did not receive a score during the unofficial phase because our entry timed out during evaluation.
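The final readout step and the reconstruction SNR mentioned above can be illustrated with a minimal sketch. The abstract does not specify how the 1-D series is read from the binarized image, so the column-wise centroid rule below is an assumption, as are the names `read_out_signal`, `snr_db`, `baseline_row`, and `units_per_pixel`. The SNR function uses the common decibel definition (signal power over error power); the Challenge metric may differ in details such as alignment.

```python
import math


def read_out_signal(mask, baseline_row, units_per_pixel):
    """Convert a binarized 2-D trace (rows x columns of 0/1) to a 1-D series.

    Assumed rule: for each column, take the mean row index of the foreground
    pixels (the centroid) and map it to an amplitude relative to baseline_row.
    Image rows grow downward, so the sign is flipped. Columns without any
    foreground pixel yield NaN.
    """
    n_rows = len(mask)
    n_cols = len(mask[0])
    signal = []
    for c in range(n_cols):
        rows = [r for r in range(n_rows) if mask[r][c]]
        if not rows:
            signal.append(float("nan"))
            continue
        centroid = sum(rows) / len(rows)
        signal.append((baseline_row - centroid) * units_per_pixel)
    return signal


def snr_db(reference, estimate):
    """SNR in decibels: 10 * log10(signal power / reconstruction error power)."""
    signal_power = sum(x * x for x in reference)
    noise_power = sum((x - y) ** 2 for x, y in zip(reference, estimate))
    if noise_power == 0:
        return float("inf")
    return 10.0 * math.log10(signal_power / noise_power)


# Toy 5x4 mask (hypothetical data): one trace pixel per column, except
# column 1 (two pixels) and column 3 (empty).
mask = [
    [0, 0, 0, 0],
    [0, 1, 0, 0],
    [1, 1, 0, 0],
    [0, 0, 1, 0],
    [0, 0, 0, 0],
]
trace = read_out_signal(mask, baseline_row=2, units_per_pixel=0.5)
```

In practice the pixel-to-amplitude scale would have to come from the grid calibration of the ECG paper, and empty columns would need interpolation before an SNR could be computed against a reference signal.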