Speech technology already plays a huge part in our everyday lives, from common applications on our phones and computers to unseen uses in customer service and advertising. As artificial intelligence continues to make huge leaps in everything from structuring materials to content creation, it won’t be long before voice technology plays an even bigger role, including in areas like law and healthcare. And while it might not be a matter of life and death if Alexa plays you the wrong song, in a health setting, it very well could be.
Speech Technology Today
Recent advancements in speech technology have been hugely impressive. AI-led speech-to-text is now the only real choice for transcription at scale, with human-led transcription being prohibitively expensive and time-consuming. But are we really at the stage where we can put our hands up and say, “The problem is solved”?
Speech recognition is an incredible technology that we’ll rely on more and more in the future. Yet it can also be a barrier. As a non-native speaker, it’s a barrier I’m all too aware of. I’m originally from Italy but have been living in the UK for several years. A few years ago, my partner and I bought our first smart speaker. We usually speak Italian at home, so we excitedly started to interact with it in Italian, our native language. We quickly switched to English. We didn’t switch because we were more comfortable with it; we switched because it didn’t work for us in Italian. With our accents, it didn’t work well for us in English either. But it was the better of the two options.
For the Few, Not the Many?
In the past few years, research has shown that language, accent, race, gender and age are the main factors that influence the accuracy of speech recognition. Researchers at Stanford have found that speech-to-text systematically misunderstands Black speakers twice as often as White speakers. Another study reported robust differences in accuracy across both gender and dialect, with lower accuracy for women and speakers from Scotland.
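To make comparisons like these concrete, accuracy gaps are commonly expressed as a difference in word error rate (WER): the number of word-level substitutions, insertions and deletions needed to turn the system's transcript into the reference transcript, divided by the number of words in the reference. The sketch below is a minimal illustration of computing WER per speaker group, not the method of any study mentioned here. The function names, the group labels and the sample sentences are all invented for the example.

```python
from collections import defaultdict


def word_errors(reference, hypothesis):
    """Return (word-level edit distance, number of reference words)."""
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    # Levenshtein distance over words, keeping only the previous row.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(
                prev[j] + 1,         # deletion
                curr[j - 1] + 1,     # insertion
                prev[j - 1] + cost,  # substitution or match
            ))
        prev = curr
    return prev[-1], len(ref)


def wer_by_group(samples):
    """Aggregate WER per group from (group, reference, hypothesis) triples."""
    errors = defaultdict(int)
    words = defaultdict(int)
    for group, reference, hypothesis in samples:
        e, n = word_errors(reference, hypothesis)
        errors[group] += e
        words[group] += n
    # Skip groups with no reference words to avoid dividing by zero.
    return {g: errors[g] / words[g] for g in words if words[g] > 0}


# Invented sample data, for illustration only.
samples = [
    ("group_a", "play my favourite song", "play my favourite song"),
    ("group_b", "play my favourite song", "play my favour song"),
]
rates = wer_by_group(samples)
```

Errors and word counts are summed per group before dividing, so longer recordings weigh more than short ones. Averaging per-utterance WER instead is an equally valid design choice and can give different numbers.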
It’s worth noting at this stage that results for accuracy are complicated. After all, our voices are extremely rich and unique; no one’s voice is like any other. But any sort of barrier, any digital divide with unequal access to digital technologies, deserves dissection. As Halcyon Lawrence, an assistant professor of technical communication and information design at Towson University, told Claudia Lopez Lloreda in a piece for Scientific American: “I don’t get to negotiate with these devices unless I adapt my identity”.
This is simply not inclusive. Why should some people have to adapt their own voices and others not? Why should some get inferior results and others not?
A Deprivation of Data
It’s an issue that reaches beyond the speech recognition world. English and a handful of other languages are generally the focus of today’s language technologies. Despite there being over 6,500 languages in the world today, only a small fraction are systematically represented in academia and industry. The issue is that the near-human results on language translation and understanding usually only apply to a few languages. The vast majority of languages fall far below such standards.
Modern deep learning systems are data-hungry: they rely on enormous amounts of data for accuracy. This is problematic for languages for which only a limited amount of data is available. Without the data to drive accuracy, some languages will continue to improve while others won’t. The gap between them will widen.
Exclusive vs Inclusive
There’s a vast difference between speech-to-text working for some people, most of the time, and for all people, all of the time. At Speechmatics, we’re battling hard to make the latter a reality. We strongly believe speech technology must help us interact with the digital world fairly. Until it works for everyone all the time, true fairness is a target, not an accomplishment.
We currently support 50 languages, covering over half of the world’s population with leading, consistent accuracy that’s not dependent on language. But we’re not stopping here. As we continue to move forward, expand our coverage, and improve our technology, we’ll keep pushing the limits of what inclusivity means for commercially ready speech recognition.
Benedetta Cevoli, Data Science Engineer, Speechmatics