Speech representations: Adventures in pre-training and fine-tuning transformers for speech technology tasks

Authors

Nik Vaessen

Keywords:

speech representation learning, self-supervised learning, multi-task learning, speech recognition, speaker recognition, transfer learning

Synopsis

This thesis explores self-supervised speech representation learning, focusing specifically on the wav2vec 2.0 framework.

The research demonstrates that wav2vec 2.0, originally designed for automatic speech recognition, can be successfully fine-tuned for speaker recognition, even with limited labeled data. However, attempts to create a unified multi-task model for both speech recognition and speaker recognition revealed performance trade-offs, as the two tasks require orthogonal information.
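To make the fine-tuning idea concrete, the sketch below shows one common way to adapt a pre-trained wav2vec 2.0 encoder to speaker classification: pool the frame-level outputs into an utterance-level embedding and attach a linear head. The checkpoint name and the mean-pooling head are illustrative assumptions, not necessarily the exact configuration used in the thesis.

```python
# Minimal sketch: speaker classification on top of a pre-trained wav2vec 2.0
# encoder. Assumes the HuggingFace `transformers` library and an illustrative
# checkpoint; the pooling and head design are simplified for clarity.
import torch
import torch.nn as nn
from transformers import Wav2Vec2Model


class SpeakerClassifier(nn.Module):
    def __init__(self, num_speakers: int, checkpoint: str = "facebook/wav2vec2-base"):
        super().__init__()
        self.encoder = Wav2Vec2Model.from_pretrained(checkpoint)
        self.head = nn.Linear(self.encoder.config.hidden_size, num_speakers)

    def forward(self, waveform: torch.Tensor) -> torch.Tensor:
        # waveform: (batch, samples) of 16 kHz audio
        frames = self.encoder(waveform).last_hidden_state  # (batch, frames, hidden)
        embedding = frames.mean(dim=1)                      # utterance-level embedding
        return self.head(embedding)                         # speaker logits


if __name__ == "__main__":
    model = SpeakerClassifier(num_speakers=100)
    logits = model(torch.randn(2, 16000))  # two 1-second dummy utterances
    print(logits.shape)                    # torch.Size([2, 100])
```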

A comprehensive analysis of pre-training batch sizes shows that downstream performance primarily depends on the total amount of data observed during self-supervision. The thesis also addresses data quality requirements for self-supervised learning, finding that clean, well-prepared speech data is essential; in particular, vocal music must be avoided because it causes training divergence.
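The batch-size finding can be read as a simple relationship: the total audio observed during pre-training is the audio per batch multiplied by the number of update steps. The figures in the sketch below are made up for illustration and are not taken from the thesis.

```python
# Illustrative arithmetic only: two hypothetical pre-training schedules that
# trade batch size against number of steps while observing the same total
# amount of audio, which is the quantity the analysis finds to matter most.
def total_audio_hours(batch_seconds: float, num_steps: int) -> float:
    """Total audio observed during self-supervised pre-training, in hours."""
    return batch_seconds * num_steps / 3600


large_batch = total_audio_hours(batch_seconds=1.6 * 3600, num_steps=100_000)
small_batch = total_audio_hours(batch_seconds=0.4 * 3600, num_steps=400_000)
print(large_batch, small_batch)  # 160000.0 160000.0 hours in both cases
```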

Finally, the research presents the creation of a 55,000-hour Dutch speech dataset from television broadcasts and demonstrates that monolingual pre-training can outperform multilingual pre-training for Dutch speech recognition.

Published

September 11, 2025

Details about the available publication format: PDF

ISBN-13

9789465151090