Leveraging unlabeled data to improve guitar tablature transcription

Andrew Franklin Wiggins

doi:10.17918/00010494

Guitar tablature is the primary music notation that guitarists use when learning a new piece of music, since it indicates the string-fret combinations to play rather than just the notes. As transcribing guitar tablature is typically time-consuming and requires expertise, there is interest in automating this task. Previous systems for automatic tablature transcription are trained on the limited labeled guitar performance data that is available, which leaves them with poor generalizability to unseen guitar types and recording environments. Because labeling guitar tablature is a time-consuming and difficult process, the ability to leverage unlabeled guitar audio would be advantageous, as it would expose the model to a greater variety of real-world guitar performances. Recent works have explored music transcription tasks with unsupervised learning approaches, in which ground truth labels are not required, so training data is not limited to existing music datasets that have been pre-transcribed. Unsupervised transcription can be implemented with an analysis-synthesis framework: an analysis module takes in audio and predicts a transcription, while a synthesis module uses the prediction to resynthesize the original audio. The analysis module is trained by minimizing the distance between the original and reconstructed audio. As an alternative approach to leveraging unlabeled training data, an unsupervised pretraining phase with an autoencoder structure has historically shown success in regularizing neural network models across domains. Despite the success of these unsupervised training approaches for related tasks, they have not been applied to the task of automatic guitar transcription. To address this, in this thesis I present an unlabeled dataset of over 27 hours of solo guitar playing acquired from YouTube and propose two semi-supervised training approaches for improving guitar tablature transcription systems.
First, I introduce an analysis-synthesis framework in which predicted tablature is fed into a novel differentiable digital signal processing (DDSP) guitar synthesizer, and the transcription model is trained via a reconstruction loss between the original and resynthesized audio. (The synthesizer developed for this task also shows promise for expressive guitar signal synthesis in its own right.) Second, I explore the use of a convolutional autoencoder that reconstructs the audio's constant-Q transform. To evaluate the proposed approaches, I take a baseline guitar tablature transcription model trained on the GuitarSet dataset and observe how its cross-dataset performance changes after an unsupervised training phase. I find that the DDSP approach offers promise in its tablature-structured latent representation, but the autoencoder approach provides more consistent improvements for transcribing guitars of unseen timbres. Ultimately, I find that the sequential incorporation of both unsupervised training phases provides the greatest overall performance improvement on guitar datasets of unseen timbre, indicating improved model generalizability.
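
The analysis-synthesis idea described above can be illustrated with a deliberately small toy, which is not the thesis's DDSP synthesizer or transcription model. It assumes a fixed bank of sinusoids as a stand-in "synthesizer", a single linear map as the "analysis module", and a mean-squared reconstruction loss. All names here (`make_basis`, `analyze`, `synthesize`, `train`) and all parameter values are hypothetical choices made for illustration.

```python
import numpy as np


def make_basis(cycles, n=512):
    """Toy 'synthesizer' voices: one sinusoid per entry in `cycles`.

    Each voice completes a whole number of cycles in the n-sample window,
    so the voices are mutually orthogonal (an assumption made to keep the
    toy well conditioned, not a property of real guitar audio).
    """
    t = np.arange(n) / n
    return np.stack([np.sin(2.0 * np.pi * c * t) for c in cycles])  # (K, n)


def synthesize(acts, basis):
    """Synthesis module: per-voice activations (B, K) to audio (B, n)."""
    return acts @ basis


def analyze(audio, W):
    """Analysis module: audio (B, n) to predicted activations (B, K)."""
    return audio @ W


def train(audio, basis, steps=500, lr=0.003, seed=0):
    """Fit the analysis weights W using only unlabeled audio.

    The loss is the mean squared error between the input audio and its
    resynthesis; no ground-truth activations are used during training.
    """
    rng = np.random.default_rng(seed)
    n_voices, n_samples = basis.shape
    W = rng.normal(scale=1e-3, size=(n_samples, n_voices))
    losses = []
    for _ in range(steps):
        recon = synthesize(analyze(audio, W), basis)
        err = recon - audio
        losses.append(float(np.mean(err ** 2)))
        # Gradient of mean(err**2) with respect to W, derived by hand.
        grad = 2.0 * audio.T @ (err @ basis.T) / err.size
        W -= lr * grad
    return W, losses
```

In this sketch the synthesizer is fixed and linear, so the reconstruction loss is enough to pull the analysis output toward the activations that generated the audio. The framework in the abstract replaces both pieces with far richer components (a tablature-predicting network and a differentiable guitar synthesizer), but the training signal has the same shape: compare the original audio with its resynthesis and update only the analysis side.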
