Sep 1, 2022

Speech Recognition: The Journey to Continuously Updating Software

To ensure a continually accurate transcription service, our ASR must have systems to continuously update its software. Try our ASR for free today.
Benedetta Cevoli, Senior Machine Learning Engineer

As speech-to-text innovators, we are building systems and models that adapt to a broader range of voices to reduce the need for expensive, bias-creating, error-prone human intervention and labeled data. Speech recognition software cannot keep up with the times if it, like the languages it processes, does not evolve.

To put this to the test, we ran our award-winning Autonomous Speech Recognition (ASR) engine on 24 different YouTube videos and measured how it performed against the video-streaming site's automated captioning system. As discussed in our Continuous Content whitepaper, the results were conclusive. Across the 24 videos, Speechmatics topped the accuracy rankings and was the only vendor to achieve 90% or higher.
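This post does not say how the accuracy figures were calculated. A common convention in speech recognition is to report accuracy as one minus the word error rate (WER), where WER is the word-level edit distance between a reference transcript and the engine's output, divided by the number of reference words. The sketch below assumes that convention; the function names and the example sentences are illustrative only and are not taken from the whitepaper.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level edit distance divided by the number of reference words."""
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words seen
    # so far (excluding the current one) and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (ref_word != hyp_word)
            deletion = prev[j] + 1      # reference word missing from hypothesis
            insertion = curr[j - 1] + 1  # extra word in hypothesis
            curr.append(min(substitution, deletion, insertion))
        prev = curr
    return prev[-1] / len(ref)


def accuracy(reference: str, hypothesis: str) -> float:
    """Accuracy as 1 - WER, floored at zero (an assumed convention)."""
    return max(0.0, 1.0 - word_error_rate(reference, hypothesis))


# Illustrative sentences only: one substituted word out of eight.
reference = "the virus forced millions to work from home"
hypothesis = "the virus forced millions to walk from home"
print(f"WER: {word_error_rate(reference, hypothesis):.3f}")
print(f"Accuracy: {accuracy(reference, hypothesis):.3f}")
```

Under this convention, "90% or higher" would correspond to a WER of 10% or lower, averaged however the study chose to average across videos.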

The way we use subtitles has recently shifted from commodity to necessity. More content online means a broader range of accents and dialects – all of which should have accurate captions, especially for educational purposes.

To continue providing up-to-date, accurate software, speech-to-text engines must have adaptive capabilities. They need continuous updating to keep up with evolving content.

COVID-19 and the Spontaneity of Language

Towards the end of 2019, news outlets reported with increasing frequency on a new virus spreading worldwide. COVID-19 barged into all our conversations without so much as a hello, imprinting itself in our minds and dictionaries. Of course, the virus forced millions to work from home. This made accurate speech-to-text software an increasing necessity, as it improves the experience for e-learning, online doctor consultations, and remote contact centers while also making the increased number of online meetings more straightforward and accessible to employees.

To maintain accuracy, software like our ASR must continually update and adapt to whatever new words enter everyday use. This is happening more and more – perhaps most notably with 'Brexit', which etched itself into linguistic folklore before the UK had even voted to leave the European Union.

The bottom line is this: speech-to-text engines that continuously update with the times offer vital accessibility to an increasingly plugged-in population. It means broadcasters can educate people on deadly viruses even when those people may not understand the original language, saving lives in the process.

It All Starts with Education

In any trying time, be it war or a pandemic, technology is forced to innovate and adapt to societal changes. Over the past few years, education has begun to rely on technology, and on speech-to-text in particular.

Children are increasingly plugged in, with easier access to technology. In lessons, speech-to-text turns a potentially tricky situation like COVID-19 into a more accessible one. With accurate transcription, teachers can allow their students to catch up at their own pace, giving less advantaged pupils a level playing field – essential when working from home.

To keep students in the loop about their education, ASR must have the capabilities to update itself with unfamiliar words and features. Looking elsewhere, new terms such as Brexit and COVID-19 create new jobs. Those jobs need technology that knows what it's transcribing.

How Speechmatics' ASR Makes it Work

For starters, our ASR has a feature that allows you to add specific words to its library. That way, users can avoid having to meticulously correct mistakes by hand. Perhaps the most distinctive way our software updates continuously, though, is through self-supervised learning.
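As a rough illustration of the custom-word feature, the sketch below builds a job configuration that carries a list of extra words. The field names (`transcription_config`, `additional_vocab`, `content`, `sounds_like`) are modelled on Speechmatics' publicly documented custom dictionary option, but treat them as assumptions and check the current API reference before relying on them. The helper `build_transcription_config` and the example pronunciations are hypothetical.

```python
import json


def build_transcription_config(language, custom_words):
    """Build a job config dict that includes user-supplied vocabulary.

    custom_words maps each new word to a (possibly empty) list of
    alternative spellings that hint at how the word is pronounced.
    Field names are assumed, not confirmed by this post.
    """
    additional_vocab = []
    for word, sounds_like in custom_words.items():
        entry = {"content": word}
        if sounds_like:
            # Only include pronunciation hints when some were given.
            entry["sounds_like"] = list(sounds_like)
        additional_vocab.append(entry)

    return {
        "type": "transcription",
        "transcription_config": {
            "language": language,
            "additional_vocab": additional_vocab,
        },
    }


# Hypothetical example: two terms from this post, one with a pronunciation hint.
config = build_transcription_config(
    "en",
    {"COVID-19": ["covid nineteen"], "Brexit": []},
)
print(json.dumps(config, indent=2))
```

The sketch only assembles the configuration; submitting it to a transcription service is left out because the request details are not covered here.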

Essentially, self-supervised learning allows us to learn about speech from unlabeled data without human intervention – saving countless hours and preventing mistakes caused by human error. However, the greatest advantage of self-supervised learning is the breadth of voices our engine is exposed to.

By predicting what sound comes next, self-supervised learning builds rich representations of speech from a wide breadth of voices and brings far-reaching accuracy. Our ASR has gone from 30,000 hours of training data to 1.1 million hours through this method. The more of this we can do, the more continuously updated our software will be.
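To make "predicting what sound comes next" concrete, here is a deliberately tiny sketch. It is not our production method: real self-supervised speech models are neural networks trained on raw audio, whereas this toy simply counts which symbol follows which. What it does share with the real thing is the key idea that the training targets come from the data itself, so no human labels are needed. The sound-unit symbols and function names are made up for illustration.

```python
from collections import Counter, defaultdict


def train_next_unit_predictor(sequences):
    """Count, for every sound unit, which unit follows it in the data.

    Each sequence supplies its own training targets: the "label" for
    position i is simply the unit at position i + 1.
    """
    counts = defaultdict(Counter)
    for seq in sequences:
        for current, following in zip(seq, seq[1:]):
            counts[current][following] += 1
    return counts


def predict_next(counts, unit):
    """Return the most frequently observed follower, or None if unseen."""
    if unit not in counts:
        return None
    return counts[unit].most_common(1)[0][0]


# Hypothetical unlabeled "audio", already reduced to made-up sound units.
unlabeled = [
    ["k", "oh", "v", "ih", "d"],
    ["k", "oh", "v", "ih", "d"],
    ["k", "oh", "l", "d"],
]
model = train_next_unit_predictor(unlabeled)
print(predict_next(model, "oh"))
```

Adding more unlabeled sequences from more voices only requires calling the training function on more data, which is why this style of learning scales to very large audio collections.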

There's More Online Content Than Ever Before

As the world's population heads towards the 8 billion mark, technology like ours will become increasingly prevalent. Increased globalization means new words and phrases are more likely to spread like wildfire, crossing borders and blurring the lines between languages. As a result, accurate, continuously updating software must be able to keep up.

The implications are potentially huge. Speech-to-text has a massive variety of use cases, all of which benefit from saving time and preventing mistakes while also proving pivotal to how people adapt to trying times.

Benedetta Cevoli, Data Science Engineer, Speechmatics