When speech-to-text works perfectly, there’s a seamlessness to it. What’s spoken by a human is perfectly transcribed by a machine. Word for word. Today, we’re going to discuss some of the key areas where speech-to-text can be improved, starting from a broad global perspective on how to treat different languages, down to how we manage our time and resources to optimise our output.
Finding the Differences Between Languages
Within speech-to-text there’s generally a bias towards approaching language-related problems with an “English-first” mindset. As a UK-based company developing a product in which most of the research is conducted in English, and with English being the language in highest demand from our customers, it makes sense that we’re biased towards English rather than, say, Sentinelese.
When we build our language models for English (and almost all of our other languages), we train a model to predict words, which in text appear as sequences of letters separated by spaces. These words are a convenient size to serve as the tokens a model is trained on to predict word sequences. However, not all languages package their words in similar ways.
Taking English and Turkish as our examples, we see major differences between the two. English is an analytic language (with few linguistic units per word), whereas Turkish is a highly synthetic language (with many linguistic units per word). Compare the Turkish “Avrupalılaştıramadıklarımızdan mısınız?” to its English translation “are you one of those whom we failed to make Europeans?” To avoid Out Of Vocabulary (OOV) issues in Turkish, we can use subword language models, whereby the language model is trained on bits of words, bits that pop up far more often than the long words they form part of.
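To give a flavour of how subword vocabularies can be built, here’s a minimal sketch of byte-pair encoding (BPE), one common way of learning subword units. The toy word list and merge count are invented for illustration, and this isn’t a description of our actual tokeniser.

```python
from collections import Counter

def learn_bpe(words, num_merges):
    """Learn byte-pair-encoding merges from a word -> count mapping.

    Each word is split into characters plus an end-of-word marker so that
    merges never cross word boundaries.
    """
    vocab = {tuple(word) + ("</w>",): count for word, count in words.items()}
    merges = []
    for _ in range(num_merges):
        # Count how often each adjacent symbol pair occurs, weighted by word frequency.
        pairs = Counter()
        for symbols, count in vocab.items():
            for a, b in zip(symbols, symbols[1:]):
                pairs[(a, b)] += count
        if not pairs:
            break
        best = max(pairs, key=pairs.get)
        merges.append(best)
        # Apply the winning merge everywhere it occurs.
        new_vocab = {}
        for symbols, count in vocab.items():
            merged, i = [], 0
            while i < len(symbols):
                if i < len(symbols) - 1 and (symbols[i], symbols[i + 1]) == best:
                    merged.append(symbols[i] + symbols[i + 1])
                    i += 2
                else:
                    merged.append(symbols[i])
                    i += 1
            new_vocab[tuple(merged)] = count
        vocab = new_vocab
    return merges, vocab

# Toy Turkish-flavoured word list: shared stems and suffixes recur across words.
corpus = Counter(["evlerimizden", "evlerimiz", "evler", "arabalarımızdan", "arabalar"])
merges, segmented = learn_bpe(corpus, num_merges=10)
for symbols in segmented:
    print(" ".join(symbols))
```

Because frequent character sequences get merged first, stems and suffixes that many words share tend to end up as reusable tokens, keeping the vocabulary compact without losing coverage of long words.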
These major differences mean we must keep in mind what assumptions we’re making about language. Just because an approach works well for one language doesn’t mean there aren’t better methods out there for others.
Tackling Differences Within Language
Languages are constantly evolving, whether in terms of grammar, pronunciation or lexicon. If a language were fixed, we could reasonably expect to achieve full coverage of it eventually; however, even in a short span of time, new words enter the common consciousness of its speakers.
Words such as ‘stonks’, ‘covid’, ‘Brexit’ and ‘yeet’ are just a few of the new examples from the last 8 years. If we’re not keeping our fingers on the pulse of the diachronic variation of language (how it varies across time), the gap between our language models and the languages we are trying to model will grow. Read Benedetta Cevoli’s blog post on Continuous Content, which covers the work we’ve been doing to stay on the culture curve.
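As a toy illustration of how that gap shows up in practice, the sketch below tracks the out-of-vocabulary (OOV) rate of a fixed word list against newer transcripts; both the vocabulary and the transcripts are invented.

```python
from collections import defaultdict

# A fixed vocabulary standing in for a language model trained some years ago.
vocab = {"i", "lost", "twenty", "pounds", "the", "meeting", "is", "at", "noon",
         "tested", "positive", "for", "only", "go", "up", "apparently"}

# (year, transcript) pairs standing in for a real transcript archive.
transcripts = [
    (2019, "the meeting is at noon"),
    (2021, "i tested positive for covid"),
    (2022, "stonks only go up apparently"),
]

oov_by_year = defaultdict(lambda: [0, 0])  # year -> [oov tokens, total tokens]
for year, text in transcripts:
    for token in text.split():
        counts = oov_by_year[year]
        counts[1] += 1
        if token not in vocab:
            counts[0] += 1

for year in sorted(oov_by_year):
    oov, total = oov_by_year[year]
    print(f"{year}: OOV rate {oov / total:.0%}")
```

Left unattended, that OOV rate only drifts upwards, which is exactly the gap the Continuous Content work is designed to close.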
Navigating Differences in Speech
We know humans can be biased by dialectal variation. Machines are too, but for different, though not unrelated, reasons. A machine won’t have ingrained sociocultural prejudices against people from certain cities or regions, but people with non-standard accents will tend to have less data representing how they talk. This paucity of data compared to the standard variety leads to poorer recognition by a speech-to-text model.
A further consideration around variation in speech data is that it’s not all about what the speaker says: the audio itself is also affected by the sampling rate of the file, the presence of background noise, whether multiple speakers are talking at once, and so on. In fact, once you have enough high-fidelity audio to get a solid baseline, it’s this ‘messier’, noisier data that really helps elevate how your speech-to-text engine works in the real world.
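One common way to get more of that ‘messier’ data is to simulate it, for example by mixing background noise into clean recordings at a chosen signal-to-noise ratio. The sketch below does this with plain NumPy on synthetic arrays; a real pipeline would of course use actual recordings, and this isn’t a description of our training setup.

```python
import numpy as np

def mix_at_snr(speech: np.ndarray, noise: np.ndarray, snr_db: float) -> np.ndarray:
    """Scale `noise` so the speech-to-noise power ratio equals `snr_db`, then add it."""
    noise = noise[: len(speech)]
    speech_power = np.mean(speech ** 2)
    noise_power = np.mean(noise ** 2) + 1e-12
    # Gain needed on the noise to hit the target SNR (in linear power terms).
    gain = np.sqrt(speech_power / (noise_power * 10 ** (snr_db / 10)))
    return speech + gain * noise

rng = np.random.default_rng(0)
clean = np.sin(2 * np.pi * 440 * np.arange(16000) / 16000)  # stand-in for a 1 s utterance
babble = rng.normal(size=16000)                              # stand-in for background noise
noisy = mix_at_snr(clean, babble, snr_db=10.0)               # augmented training example
```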
Representation is key here. By presenting a model with a more varied range of dialects, we’ll have a far better chance of faithfully transcribing all speakers of a language.
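A simple way to check whether that representation is paying off is to break word error rate (WER) down by accent or dialect label rather than reporting a single headline number. The sketch below does this with a small self-contained WER function; the accent labels and transcript pairs are invented.

```python
def wer(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the reference length."""
    ref, hyp = reference.split(), hypothesis.split()
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i
    for j in range(len(hyp) + 1):
        d[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1, d[i][j - 1] + 1, d[i - 1][j - 1] + cost)
    return d[-1][-1] / max(len(ref), 1)

# (accent label, reference transcript, model output) triples for a toy test set.
test_set = [
    ("scottish", "i am going to the shops", "i am gone to the shops"),
    ("rp",       "i am going to the shops", "i am going to the shops"),
]

by_accent = {}
for accent, ref, hyp in test_set:
    by_accent.setdefault(accent, []).append(wer(ref, hyp))
for accent, scores in by_accent.items():
    print(accent, sum(scores) / len(scores))
```

A large gap between groups in a breakdown like this is usually a sign that one of them is under-represented in the training data.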
The Importance of Context
Lexical ambiguity can be a problem for ASR, especially for our Entity Formatting, or inverse text normalisation (ITN). If I said, “I lost twenty pounds”, this could refer to me shedding some weight after solid, consistent effort in the gym (target: “I lost 20 lbs”) or to me misplacing a note from my wallet (target: “I lost £20”). In isolation, even a human couldn’t tell which, but add in some context (“I’ve been working really hard with my personal trainer...” or “My wallet is feeling emptier than usual...”) and it becomes obvious what is meant.
Current ITN approaches use a deterministic method, whereby for a particular string of text input, you get one fixed output. But in this case, there are two reasonably probable outputs for ITN. To solve this, we would need to be able to take context into account. On this front, we've been making great improvements to our systems thanks to neural networks. We’re excited to share this work soon.
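As a toy contrast, the sketch below pits a deterministic rule, which always produces the same written form, against a context-aware version that leans on surrounding words; the keyword cues are invented and this isn’t how our entity formatting actually works.

```python
import re

def deterministic_itn(text: str) -> str:
    # One fixed output per input: "twenty pounds" is always formatted as currency.
    return re.sub(r"\btwenty pounds\b", "£20", text)

def contextual_itn(text: str, context: str) -> str:
    # Toy context cues; a real system would use a learned model over the context.
    weight_cues = {"gym", "trainer", "diet", "weight"}
    money_cues = {"wallet", "note", "cash", "paid"}
    ctx = set(context.lower().split())
    written = "20 lbs" if len(ctx & weight_cues) > len(ctx & money_cues) else "£20"
    return re.sub(r"\btwenty pounds\b", written, text)

print(deterministic_itn("i lost twenty pounds"))                                      # £20, always
print(contextual_itn("i lost twenty pounds", "i've been working hard with my personal trainer"))  # 20 lbs
print(contextual_itn("i lost twenty pounds", "my wallet is feeling emptier than usual"))           # £20
```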
The Danger of Data Bloat
Generally, more is considered more, but every now and again, it would appear that less is more (sorry Yngwie). Imagine we have 10,000,000,000 (ten billion) hours of data for training a model. Great, it’s probably going to learn absolutely loads and smash WER scores out of the park! But training a model on 10,000,000,000 hours is going to take a vast amount of time and resources. If training on those 10,000,000,000 hours only gave you, say, a 1% relative improvement over using 10,000,000 (ten million) hours, then that 1% gain is probably not worth the cost of employees’ time and the energy and money taken to train the model on 1,000 times the amount of data.
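To make that trade-off concrete, here’s a back-of-the-envelope calculation; the WER figures and cost per hour below are purely hypothetical.

```python
# Hypothetical results for two training runs on different amounts of data.
baseline = {"hours": 10_000_000, "wer": 10.0}
scaled   = {"hours": 10_000_000_000, "wer": 9.9}   # a 1% relative improvement

relative_gain = (baseline["wer"] - scaled["wer"]) / baseline["wer"]
cost_per_training_hour = 0.01   # hypothetical £ per hour of training data processed
extra_cost = (scaled["hours"] - baseline["hours"]) * cost_per_training_hour

print(f"relative WER gain: {relative_gain:.1%}")
print(f"extra cost of the larger run: £{extra_cost:,.0f}")
```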
But how do we decide which 9,990,000,000 hours to axe from our training data? It’s a difficult question. First, we need to consider what counts as bloat in our data; then we need to find it. Ultimately, we want to get rid of data that is so similar to what we already have that we aren’t seeing significant returns from keeping more of it, whilst also making sure our data is maximally diverse.
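One crude way to attack this is greedy near-duplicate filtering: keep an utterance only if it isn’t too similar to anything already kept. The sketch below uses bag-of-words vectors and cosine similarity as a stand-in for the richer audio or text embeddings a real pipeline would use; the utterances and threshold are invented.

```python
import numpy as np

def bow_vector(text: str, vocab: dict) -> np.ndarray:
    """Crude bag-of-words feature vector for a transcript."""
    vec = np.zeros(len(vocab))
    for token in text.split():
        if token in vocab:
            vec[vocab[token]] += 1
    return vec

utterances = [
    "turn the lights on",
    "turn the lights on please",
    "what is the weather tomorrow",
    "turn on the lights",
]
vocab = {tok: i for i, tok in enumerate(sorted({t for u in utterances for t in u.split()}))}
threshold = 0.85  # drop anything this similar to something already kept

kept, kept_vecs = [], []
for utt in utterances:
    v = bow_vector(utt, vocab)
    norm = np.linalg.norm(v) + 1e-12
    if all(np.dot(v, k) / (norm * (np.linalg.norm(k) + 1e-12)) < threshold for k in kept_vecs):
        kept.append(utt)
        kept_vecs.append(v)

print(kept)  # near-duplicates of already-kept utterances are dropped
```

The interesting engineering questions are in the details this sketch glosses over: what similarity actually means for speech data, and how to set the threshold so we trim bloat without trimming diversity.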
Conclusion
The key to improving speech-to-text is being open-minded when thinking about languages the world over. Being proactive in making sure our data is sufficiently diverse is what will let us really push these boundaries, especially if we want to deliver on our mission to Understand Every Voice.
George Lodge, Computational Linguist at Speechmatics