Speech representations: Adventures in pre-training and fine-tuning transformers for speech technology tasks
Keywords:
speech representation learning, self-supervised learning, multi-task learning, speech recognition, speaker recognition, transfer learning

Synopsis
This thesis explores self-supervised speech representation learning, focusing specifically on the wav2vec 2.0 framework.
The research demonstrates that wav2vec 2.0, originally designed for automatic speech recognition, can be successfully fine-tuned for speaker recognition, even with limited labeled data. However, attempts to build a unified multi-task model for both speech and speaker recognition revealed performance trade-offs, as the two tasks rely on largely orthogonal information.
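As a concrete illustration of such a fine-tuning setup, below is a minimal sketch using the HuggingFace transformers library; the checkpoint name, mean pooling, and single linear classification head are illustrative assumptions, not the thesis's exact configuration.

    # Minimal sketch: wav2vec 2.0 encoder with a speaker-classification head.
    # Checkpoint, pooling, and head are assumptions for illustration only.
    import torch
    import torch.nn as nn
    from transformers import Wav2Vec2Model

    class SpeakerClassifier(nn.Module):
        def __init__(self, num_speakers: int):
            super().__init__()
            # Self-supervised pre-trained wav2vec 2.0 encoder.
            self.encoder = Wav2Vec2Model.from_pretrained("facebook/wav2vec2-base")
            # Linear head on a mean-pooled utterance embedding.
            self.head = nn.Linear(self.encoder.config.hidden_size, num_speakers)

        def forward(self, waveform: torch.Tensor) -> torch.Tensor:
            # waveform: (batch, samples) at 16 kHz
            frames = self.encoder(waveform).last_hidden_state  # (batch, time, dim)
            return self.head(frames.mean(dim=1))               # speaker logits

    model = SpeakerClassifier(num_speakers=100)
    logits = model(torch.randn(2, 16000))  # two one-second dummy clips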
A comprehensive analysis of pre-training batch sizes shows that downstream performance primarily depends on the total amount of data observed during self-supervision. The thesis also examines data quality requirements for self-supervised learning, finding that clean, prepared speech is essential: in particular, vocal music must be excluded from the pre-training data, as it causes training to diverge.
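The batch-size finding can be restated as simple arithmetic: the total audio observed is the batch size (in seconds of audio) multiplied by the number of update steps, so a larger batch trained for proportionally fewer steps sees the same data. A back-of-the-envelope sketch in Python, with illustrative numbers rather than the thesis's settings:

    # Total data observed during pre-training; all numbers are illustrative.
    def hours_observed(batch_seconds: float, steps: int) -> float:
        """Total audio seen during self-supervision, in hours."""
        return batch_seconds * steps / 3600

    # Both configurations observe ~44,444 hours of audio and, per the
    # thesis's finding, should reach similar downstream performance.
    print(hours_observed(batch_seconds=1600, steps=100_000))
    print(hours_observed(batch_seconds=400, steps=400_000))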
Finally, the research presents the creation of a 55,000-hour Dutch speech dataset from television broadcasts, demonstrating that monolingual pre-training can outperform multilingual pre-training for Dutch speech recognition.

License

This work is licensed under a Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License.