Jan 12, 2022 | Read time 4 min

What Makes Up Your Voice: Understanding the Best Speech-to-Text

Read about how important it is that speech recognition understands every voice in every situation and how Speechmatics is ensuring more voices are better represented.
Benedetta Cevoli, Senior Machine Learning Engineer

What gives your voice its individuality is as complex and varied as everything that makes you the person you are. The best transcription tools understand that the make-up of your voice will be influenced by everything from your gender at birth, to your state of emotion, to the levels of pollution in your area. It may be guided by your parents, your siblings, your friends, and your education. But there is no single contributing factor that makes your voice yours.

Opinions are formed about us when we speak. We can be judged on how we sound and there are times when how we speak can either be misconstrued or completely ignored. These opinions often drive behavior and decision-making. For example, in a recent article from American Scientist, researchers discovered that a political candidate’s pitch – a simple element that makes up our voice – can have a major influence on how voters perceive them.

The different factors that make up a voice can also have exclusionary consequences. When it comes to being able to transcribe speech to text, we believe this shouldn’t be the case. When the technology is at its optimum, every voice should be treated the same. Every voice should be transcribed as equally and as accurately as possible.

Inaccuracy Means Exclusion

At Speechmatics, we specialize in automatic voice-to-text transcription. We turn what someone has said into the written word for assistance, reference, and analysis. If our results are inaccurate, it means someone isn’t being heard and we’re no closer to our mission of understanding every voice. We constantly have to consider all the factors that make a voice a voice and make sure they don’t negatively influence our technology and lead to inaccurate results.

The primary factor that differentiates one voice from another is the size of the vocal cords. Most males have larger vocal cords than females, which is why most men have deeper voices. The same is, of course, true of adults compared with children. When it comes to voice recognition, children are still not represented as well as adults, mostly because voice recognition models have been trained primarily on adult voices.

Emotion and Voice Recognition

Our emotions play a huge role in how we’re heard too. The way we speak is greatly influenced by what we feel in each moment. Our voices can sound quite different when we are sad, happy, worried, or excited, for example. The best speech-to-text technology must accurately transcribe every voice no matter its emotional charge. The crossover of emotion and speech recognition is also largely unexplored, with use cases from Health to Finance and Customer Service ready to benefit from future technologies that recognize if a caller is feeling nervous, excited, or anxious.

Then there are those in society with completely unique voice patterns, such as people with Down syndrome. When they use voice recognition designed primarily for able-bodied speakers, they’re often let down by technology which should be beneficial for everyone. Ventures such as Project Understood recognize that, without enough data to train on, voice recognition will keep giving these voices poor results.

The same obstacles to receiving accurate results are faced by people who have suffered strokes, who have received injuries to their vocal cords, or who suffer from dementia. The more speech recognition engines use self-supervised learning and unlabeled data, as Speechmatics does, the quicker we can get to systems that work for everyone. Before our machine learning experts unlocked self-supervised learning, we were training on around 30,000 hours of audio. Now it’s over 1,000,000 hours.

Better Representation in Voice

In a world where digital assistants are everywhere, it’s particularly important that speech recognition understands every voice in every situation. The recent work we’ve done at Speechmatics has been exceptional at making sure a variety of voices are better represented. One example of this is the incredible performance of our recently launched Autonomous Speech Recognition on children’s voices. Speechmatics shows the best speech-to-text accuracy for adults as well as children, and we’ve seen the smallest gap in accuracy between younger and older voices when compared to competitors.
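
To make the idea of a "gap in accuracy" between younger and older voices concrete, here is a minimal sketch of how such a gap could be measured. It assumes word error rate (WER), a common transcription metric, pooled per speaker group. This is not a description of how Speechmatics evaluates its models: the function names `edit_distance` and `wer_by_group`, the group labels, and the toy transcripts are all hypothetical and invented for illustration.

```python
from typing import Dict, List, Tuple


def edit_distance(ref: List[str], hyp: List[str]) -> int:
    """Minimum number of word substitutions, insertions, and deletions
    needed to turn the reference word list into the hypothesis."""
    prev = list(range(len(hyp) + 1))  # distances from an empty reference
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]  # cost of deleting all i reference words seen so far
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(prev[j] + 1,           # deletion
                            curr[j - 1] + 1,       # insertion
                            prev[j - 1] + cost))   # match or substitution
        prev = curr
    return prev[-1]


def wer_by_group(samples: List[Tuple[str, str, str]]) -> Dict[str, float]:
    """Pooled WER per group: total word errors / total reference words.

    Each sample is (group, reference transcript, system transcript).
    Groups with no reference words are skipped.
    """
    errors: Dict[str, int] = {}
    words: Dict[str, int] = {}
    for group, reference, hypothesis in samples:
        ref = reference.lower().split()
        hyp = hypothesis.lower().split()
        errors[group] = errors.get(group, 0) + edit_distance(ref, hyp)
        words[group] = words.get(group, 0) + len(ref)
    return {g: errors[g] / words[g] for g in errors if words[g] > 0}


# Toy data, invented for illustration only.
samples = [
    ("adult", "turn on the kitchen light", "turn on the kitchen light"),
    ("adult", "what is the weather today", "what is the weather to day"),
    ("child", "can you play my favourite song", "can you play my favorite son"),
    ("child", "tell me a story", "tell me story"),
]

rates = wer_by_group(samples)
gap = rates["child"] - rates["adult"]  # positive means children fare worse
print(rates, gap)
```

A smaller gap means the system treats the two groups more equally. A real evaluation would also need far more data per group and consistent text normalization before comparing transcripts.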

But, for us, this is just the beginning. Every day we’re setting our sights on understanding every voice, in every situation.

Benedetta Cevoli, Data Science Engineer, Speechmatics