These frustrations are more common than we realize. If they're happening with smart devices, with Alexa and Siri, and in vehicles, they're happening in contact centers, in media, and in everyday scenarios.
Going beyond voices, inclusivity in ASR also involves the surroundings and background noise present in our daily lives. Ideally, these factors should not hinder the accuracy of speech recognition, but in many cases they do. Frustration arises when speech recognition tools fail to understand what is said because of background noise. Imagine a parent juggling young children, trying to set a timer for dinner by voice, only for the noise around them to get in the way. Inclusivity begins with the software behind speech recognition.
So we ask ourselves: what does inclusivity mean to us?
An inclusive ASR product understands every voice, everywhere. This means accurate transcription independent of the language you speak, your accent or dialect, and your surroundings. Transcription software is robust to background noise and can understand you whether you are in a noisy urban environment or a home full of laughter and play. Driven by this mission, at Speechmatics, we've been investing in self-supervised learning to promote the accessibility of our product.
Self-supervised learning is a machine learning technique that allows us to train speech recognition models on unlabeled audio data. The model learns by performing tasks that exploit the patterns and structure in the audio itself, without transcript labels. For example, the model may try to predict the next part of an audio clip based on what it has heard so far, or it may be asked to identify which parts of the clip have been masked or swapped. By solving these self-supervised tasks, the model learns rich representations of speech. This foundation model can then be fine-tuned on a limited amount of labeled data to efficiently learn how to map acoustic features to written text, much as children learn to speak before they learn to read. The labeled data used for fine-tuning can also be enriched after annotation to further improve model performance and make the data more meaningful to analyze.
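To make the masked-prediction idea concrete, here is a minimal sketch in PyTorch. The toy encoder, feature dimensions, and masking scheme are illustrative assumptions, not Speechmatics' actual architecture; the key point is that the training signal comes entirely from the audio itself, with no transcripts involved.

```python
# Minimal sketch of masked-prediction self-supervised pre-training on audio features.
# Everything here (encoder size, mask rate, feature dims) is an illustrative assumption.
import torch
import torch.nn as nn

class TinySpeechEncoder(nn.Module):
    """Toy encoder: maps a sequence of acoustic feature frames to representations."""
    def __init__(self, feat_dim=80, hidden_dim=256):
        super().__init__()
        self.proj = nn.Linear(feat_dim, hidden_dim)
        self.encoder = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(d_model=hidden_dim, nhead=4, batch_first=True),
            num_layers=2,
        )
        self.reconstruct = nn.Linear(hidden_dim, feat_dim)  # predicts the masked frames

    def forward(self, feats):
        return self.reconstruct(self.encoder(self.proj(feats)))

def pretrain_step(model, feats, mask_prob=0.15):
    """One self-supervised step: hide random frames, train the model to predict them."""
    mask = torch.rand(feats.shape[:2]) < mask_prob   # (batch, time) boolean mask
    corrupted = feats.clone()
    corrupted[mask] = 0.0                            # zero out the masked frames
    pred = model(corrupted)
    # Loss is computed only at the masked positions -- no transcript labels needed.
    return nn.functional.mse_loss(pred[mask], feats[mask])

model = TinySpeechEncoder()
opt = torch.optim.Adam(model.parameters(), lr=1e-4)
unlabeled_batch = torch.randn(8, 200, 80)            # stand-in for real unlabeled audio features
loss = pretrain_step(model, unlabeled_batch)
loss.backward()
opt.step()
```

In a real system the reconstruction target and encoder are far more sophisticated, but the shape of the loop is the same: corrupt the audio, predict what was hidden, repeat over vast amounts of unlabeled speech.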
One of the advantages of self-supervised learning is that it reduces dependency on labeled data. Traditional supervised speech models require thousands of hours of transcribed audio to reach good quality. The problem is that labeled data is often scarce and lacks diversity, and so favors high-resource languages and demographics. In contrast, unlabeled speech data is abundant and covers a much broader range of voices and languages. Self-supervised models remove this barrier by pre-training on raw, unlabeled speech to learn representations.
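As an illustration of the pre-train-then-fine-tune pattern, the sketch below fine-tunes an openly available self-supervised speech model (wav2vec 2.0, via the Hugging Face transformers library) on a single labeled example. Speechmatics' own models and training pipeline are proprietary; the checkpoint name, audio, and transcript here are stand-ins used only to show the general workflow.

```python
# Illustrative only: fine-tuning a publicly available pre-trained model with a tiny
# amount of labeled data. This is not Speechmatics' pipeline.
import torch
from transformers import Wav2Vec2Processor, Wav2Vec2ForCTC

processor = Wav2Vec2Processor.from_pretrained("facebook/wav2vec2-base-960h")
model = Wav2Vec2ForCTC.from_pretrained("facebook/wav2vec2-base-960h")

# A stand-in "labeled example": one second of 16 kHz audio plus its transcript.
audio = torch.zeros(16000)                    # in practice, real recorded speech
transcript = "SET A TIMER FOR DINNER"

inputs = processor(audio.numpy(), sampling_rate=16000, return_tensors="pt")
labels = processor.tokenizer(transcript, return_tensors="pt").input_ids

# The labeled pass: CTC loss maps the pre-trained representations to characters.
outputs = model(input_values=inputs["input_values"], labels=labels)
outputs.loss.backward()                       # fine-tune on the small labeled set
```

Because the heavy lifting was done during unsupervised pre-training, the fine-tuning stage can work with hours rather than thousands of hours of transcribed speech.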
Most importantly, self-supervised learning is language-agnostic. By leveraging a wide variety of unlabeled data, these models learn shared representations of speech across languages, capturing universal properties of speech. This results in up to a 70% reduction in transcription errors across languages compared to traditional speech recognition systems that don't rely on unlabeled data, offering a speech recognition product that understands a diverse global population.
Inclusivity with Speechmatics
Speechmatics' Ursa generation of models, released in March 2023, was driven by inclusivity. Ursa is the most accurate transcription engine across age groups and is 25% more accurate than Google for speakers aged 60 to 81, the age group that speech recognition systems typically struggle with the most.
Ursa also consistently exhibits 32% higher accuracy than Google across skin tones. Training on a dataset with this kind of representation is what other speech-to-text systems lack.
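Comparisons like these are typically quantified with word error rate (WER), computed separately for each speaker group. The sketch below uses the open-source jiwer package with made-up transcripts and hypothetical group labels, purely to show the mechanics of a per-group comparison; the figures above come from Speechmatics' own evaluations, not from this code.

```python
# Hypothetical per-group accuracy check using word error rate (WER) via `jiwer`.
# The groups and transcripts are invented for illustration.
import jiwer

samples = {
    "age_60_81": {
        "reference":  ["set a timer for twenty minutes", "call my daughter please"],
        "hypothesis": ["set a timer for twenty minutes", "call my daughter please"],
    },
    "age_20_40": {
        "reference":  ["play the news briefing", "turn the lights off"],
        "hypothesis": ["play the news briefing", "turn the light off"],
    },
}

# A lower WER means more accurate transcription; comparing groups side by side
# reveals whether the system serves some speakers better than others.
for group, data in samples.items():
    wer = jiwer.wer(data["reference"], data["hypothesis"])
    print(f"{group}: WER = {wer:.2%}")
```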