Speech representations: Adventures in pre-training and fine-tuning transformers for speech technology tasks
Keywords:
speech representation learning, self-supervised learning, multi-task learning, speech recognition, speaker recognition, transfer learning

Synopsis
This thesis explores self-supervised speech representation learning, focusing specifically on the wav2vec 2.0 framework.
The research demonstrates that wav2vec 2.0, originally designed for automatic speech recognition, can be successfully fine-tuned for speaker recognition, even with limited labeled data. However, attempts to build a unified multi-task model for both speech and speaker recognition revealed performance trade-offs, as the two tasks rely on largely orthogonal information.
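As a concrete illustration of such a fine-tuning setup, below is a minimal sketch using the HuggingFace transformers library; the checkpoint name, mean pooling, and single linear classification head are illustrative assumptions, not the thesis's exact configuration.

    # Minimal sketch: wav2vec 2.0 encoder with a speaker-classification head.
    # Checkpoint, pooling, and head are assumptions for illustration only.
    import torch
    import torch.nn as nn
    from transformers import Wav2Vec2Model

    class SpeakerClassifier(nn.Module):
        def __init__(self, num_speakers: int):
            super().__init__()
            # Self-supervised pre-trained wav2vec 2.0 encoder.
            self.encoder = Wav2Vec2Model.from_pretrained("facebook/wav2vec2-base")
            # Linear head on a mean-pooled utterance embedding.
            self.head = nn.Linear(self.encoder.config.hidden_size, num_speakers)

        def forward(self, waveform: torch.Tensor) -> torch.Tensor:
            # waveform: (batch, samples) at 16 kHz
            frames = self.encoder(waveform).last_hidden_state  # (batch, time, dim)
            return self.head(frames.mean(dim=1))               # speaker logits

    model = SpeakerClassifier(num_speakers=100)
    logits = model(torch.randn(2, 16000))  # two one-second dummy clips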
A comprehensive analysis of pre-training batch sizes shows that downstream performance primarily depends on the total amount of data observed during self-supervision. The thesis also examines data quality requirements for self-supervised learning, finding that clean, prepared speech is essential: in particular, vocal music must be excluded from the pre-training data, as it causes training to diverge.
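The batch-size finding can be restated as simple arithmetic: the total audio observed is the batch size (in seconds of audio) multiplied by the number of update steps, so a larger batch trained for proportionally fewer steps sees the same data. A back-of-the-envelope sketch in Python, with illustrative numbers rather than the thesis's settings:

    # Total data observed during pre-training; all numbers are illustrative.
    def hours_observed(batch_seconds: float, steps: int) -> float:
        """Total audio seen during self-supervision, in hours."""
        return batch_seconds * steps / 3600

    # Both configurations observe ~44,444 hours of audio and, per the
    # thesis's finding, should reach similar downstream performance.
    print(hours_observed(batch_seconds=1600, steps=100_000))
    print(hours_observed(batch_seconds=400, steps=400_000))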
Finally, the research presents the creation of a 55,000-hour Dutch speech dataset from television broadcasts, demonstrating that monolingual pre-training can outperform multilingual pre-training for Dutch speech recognition.

License

This work is licensed under a Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License.