These frustrations are more common than we realize. If they're happening with smart devices, with Alexa and Siri, and in vehicles, they're happening in contact centers, in media, and in everyday scenarios.
Going beyond voices, inclusivity in ASR also involves the surroundings and background noise present in our daily lives. Ideally, these factors should not hinder the accuracy of speech recognition, but in many cases they do. Frustration arises when speech recognition tools fail to understand what is said because of background noise. Imagine a parent juggling young children, trying to set a timer for dinner by voice, only for the noise around them to get in the way. Inclusivity begins with the software behind speech recognition.
So we ask ourselves: what does inclusivity mean to us?
An inclusive ASR product understands every voice, everywhere. This means accurate transcription independent of the language you speak, your accent or dialect, and your surroundings. Transcription software is robust to background noise and can understand you whether you are in a noisy urban environment or a home full of laughter and play. Driven by this mission, at Speechmatics, we've been investing in self-supervised learning to promote the accessibility of our product.
Self-supervised learning is a machine learning technique that allows us to train speech recognition models on unlabeled audio data. The model learns by performing tasks that exploit the patterns and structure in the audio itself, without transcript labels. For example, the model may try to predict the next part of an audio clip based on what it has heard so far, or it may be asked to identify which parts of the clip have been masked or swapped. By solving these self-supervised tasks, the model learns rich representations of speech. This foundation model can then be fine-tuned on a limited amount of labeled data to efficiently learn how to map acoustic features to written text, much as children learn to speak before they learn to read. The labeled data used for fine-tuning can also be enriched after annotation to further improve model performance and make the data more meaningful to analyze.
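To make the masked-prediction idea concrete, here is a minimal sketch in PyTorch. The toy encoder, feature dimensions, and masking scheme are illustrative assumptions, not Speechmatics' actual architecture; the key point is that the training signal comes entirely from the audio itself, with no transcripts involved.

```python
# Minimal sketch of masked-prediction self-supervised pre-training on audio features.
# Everything here (encoder size, mask rate, feature dims) is an illustrative assumption.
import torch
import torch.nn as nn

class TinySpeechEncoder(nn.Module):
    """Toy encoder: maps a sequence of acoustic feature frames to representations."""
    def __init__(self, feat_dim=80, hidden_dim=256):
        super().__init__()
        self.proj = nn.Linear(feat_dim, hidden_dim)
        self.encoder = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(d_model=hidden_dim, nhead=4, batch_first=True),
            num_layers=2,
        )
        self.reconstruct = nn.Linear(hidden_dim, feat_dim)  # predicts the masked frames

    def forward(self, feats):
        return self.reconstruct(self.encoder(self.proj(feats)))

def pretrain_step(model, feats, mask_prob=0.15):
    """One self-supervised step: hide random frames, train the model to predict them."""
    mask = torch.rand(feats.shape[:2]) < mask_prob   # (batch, time) boolean mask
    corrupted = feats.clone()
    corrupted[mask] = 0.0                            # zero out the masked frames
    pred = model(corrupted)
    # Loss is computed only at the masked positions -- no transcript labels needed.
    return nn.functional.mse_loss(pred[mask], feats[mask])

model = TinySpeechEncoder()
opt = torch.optim.Adam(model.parameters(), lr=1e-4)
unlabeled_batch = torch.randn(8, 200, 80)            # stand-in for real unlabeled audio features
loss = pretrain_step(model, unlabeled_batch)
loss.backward()
opt.step()
```

In a real system the reconstruction target and encoder are far more sophisticated, but the shape of the loop is the same: corrupt the audio, predict what was hidden, repeat over vast amounts of unlabeled speech.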
One of the advantages of self-supervised learning is that it reduces dependency on labeled data. Traditional supervised speech models require thousands of hours of transcribed audio to reach good quality. The problem is that labeled data is often scarce and lacks diversity, and so favors high-resource languages and demographics. In contrast, unlabeled speech data is abundant and covers a much broader range of voices and languages. Self-supervised models remove this barrier by pre-training on raw, unlabeled speech to learn representations.
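As an illustration of the pre-train-then-fine-tune pattern, the sketch below fine-tunes an openly available self-supervised speech model (wav2vec 2.0, via the Hugging Face transformers library) on a single labeled example. Speechmatics' own models and training pipeline are proprietary; the checkpoint name, audio, and transcript here are stand-ins used only to show the general workflow.

```python
# Illustrative only: fine-tuning a publicly available pre-trained model with a tiny
# amount of labeled data. This is not Speechmatics' pipeline.
import torch
from transformers import Wav2Vec2Processor, Wav2Vec2ForCTC

processor = Wav2Vec2Processor.from_pretrained("facebook/wav2vec2-base-960h")
model = Wav2Vec2ForCTC.from_pretrained("facebook/wav2vec2-base-960h")

# A stand-in "labeled example": one second of 16 kHz audio plus its transcript.
audio = torch.zeros(16000)                    # in practice, real recorded speech
transcript = "SET A TIMER FOR DINNER"

inputs = processor(audio.numpy(), sampling_rate=16000, return_tensors="pt")
labels = processor.tokenizer(transcript, return_tensors="pt").input_ids

# The labeled pass: CTC loss maps the pre-trained representations to characters.
outputs = model(input_values=inputs["input_values"], labels=labels)
outputs.loss.backward()                       # fine-tune on the small labeled set
```

Because the heavy lifting was done during unsupervised pre-training, the fine-tuning stage can work with hours rather than thousands of hours of transcribed speech.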
Most importantly, self-supervised learning is language-agnostic. By leveraging a wide variety of unlabeled data, these models learn shared representations of speech across languages, capturing universal properties of speech. This results in up to a 70% reduction in transcription errors across languages compared to traditional speech recognition systems that don't rely on unlabeled data, offering a speech recognition product that understands a diverse global population.
Inclusivity with Speechmatics
Speechmatics' Ursa generation of models, released in March 2023, was driven by inclusivity. Ursa is the most accurate transcription engine across age groups and is 25% more accurate than Google for speakers aged 60 to 81, the age group that speech recognition systems typically struggle with the most.
Ursa also consistently exhibits 32% higher accuracy than Google across skin tones. Training on a dataset with this kind of representation is what other speech-to-text systems lack.
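Comparisons like these are typically quantified with word error rate (WER), computed separately for each speaker group. The sketch below uses the open-source jiwer package with made-up transcripts and hypothetical group labels, purely to show the mechanics of a per-group comparison; the figures above come from Speechmatics' own evaluations, not from this code.

```python
# Hypothetical per-group accuracy check using word error rate (WER) via `jiwer`.
# The groups and transcripts are invented for illustration.
import jiwer

samples = {
    "age_60_81": {
        "reference":  ["set a timer for twenty minutes", "call my daughter please"],
        "hypothesis": ["set a timer for twenty minutes", "call my daughter please"],
    },
    "age_20_40": {
        "reference":  ["play the news briefing", "turn the lights off"],
        "hypothesis": ["play the news briefing", "turn the light off"],
    },
}

# A lower WER means more accurate transcription; comparing groups side by side
# reveals whether the system serves some speakers better than others.
for group, data in samples.items():
    wer = jiwer.wer(data["reference"], data["hypothesis"])
    print(f"{group}: WER = {wer:.2%}")
```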