Speech recognition has been making a lot of noise in the last few years, both academically and commercially.
Let’s begin with commercial.
Many big companies have begun launching their own voice assistants, such as Apple Siri, Microsoft Cortana, Amazon Alexa and Google Assistant.
Speech is the most widely used form of human communication, but because of its complexity it is one of the hardest challenges for machines to overcome. Speech as an interface is made possible by the ever-increasing accuracy and speed of speech recognition systems. Speech interfaces are especially important in places with low literacy rates, such as developing countries where speech may be the only practical way to interact with technology, or in China, where typing in pinyin is not convenient or accessible for many people, especially the older generation.
Speech recognition systems are now available as cloud-based APIs from the likes of IBM, Microsoft, Google and us, making speech as an interface more accessible.
From an academic perspective, Microsoft has recently claimed human parity on conversational speech recognition, IBM has reported even better results, and text-to-speech systems are also making rapid progress.
But what are the differences?
Datasets
In academia, the training and test datasets are fixed to allow for fair comparison. However, collecting data is expensive, so the datasets in common use are often quite old, which introduces a selection bias towards models that happen to perform well on those specific datasets.
The Switchboard dataset used in Microsoft's and IBM's papers contains about 300 hours of training data, whereas other academic datasets go up to around 2,000 hours.
For commercial use, the only limitations are the cost of collecting the data and training the algorithm; Baidu, for example, has reported using 40,000 hours. Major vendors typically train their commercial systems on thousands of hours of data, but the data is not disclosed because it is too valuable to share. The test sets they use internally are also kept confidential, as they are tailored to the needs of their customers.
For these reasons, companies do not publish the accuracy of their commercial systems on an open test set, so the only way to discover which system is best for a particular application is to test them all.
Vocabulary size
In the Switchboard task, Microsoft used a vocabulary of 30k words and IBM used 85k. Today, large-vocabulary continuous speech recognition (LVCSR) systems routinely use vocabularies of a few hundred thousand words.
However, vocabulary size is often restricted for specific use cases. For example, Google used a 64k-word vocabulary for an embedded speech recognition system, and a vocabulary of about 800k words on a popular natural language processing dataset.
New words appear every year, and commercial systems need to keep their vocabularies up to date in order to recognise them.
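To make the effect of a fixed vocabulary concrete, here is a minimal sketch of the out-of-vocabulary (OOV) rate: any word outside the recognition vocabulary simply cannot be produced by a closed-vocabulary system, so it is guaranteed to become an error. The vocabulary and query below are invented toy examples, not taken from any real system.

```python
def oov_rate(vocabulary, words):
    """Fraction of running words that fall outside a fixed recognition vocabulary.
    A closed-vocabulary recogniser cannot output these words at all."""
    vocab = {w.lower() for w in vocabulary}
    missing = [w for w in words if w.lower() not in vocab]
    return len(missing) / max(len(words), 1), missing

# Toy example (vocabulary and query are illustrative only).
vocabulary = {"what", "is", "the", "weather", "in", "london", "today"}
query = "what is the weather in shoreditch today".split()

rate, missing = oov_rate(vocabulary, query)
print(f"OOV rate: {rate:.1%}, out-of-vocabulary words: {missing}")
# OOV rate: 12.5%, out-of-vocabulary words: ['shoreditch']
```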
Types of errors
In papers, the word error rate (WER) is the standard metric. However, we can see in Microsoft's and IBM's papers that most of the mistakes are made on short function words such as “and”, “is” and “the”, and these errors are given the same weight as errors on keywords.
In a commercial setting, verbatim transcripts are not always necessary. For example, voice search and voice assistants, as well as content extraction and indexing applications, often ignore short function words when processing a transcript.
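As an illustration, the sketch below computes standard WER from a word-level edit distance, alongside a crude keyword-oriented variant that simply drops short function words before scoring. The function-word list is invented for this example, and commercial systems may weight or filter errors quite differently; the point is only to show why the two views of accuracy can diverge.

```python
def wer(reference, hypothesis):
    """Word error rate: word-level Levenshtein distance divided by reference length."""
    ref, hyp = reference.split(), hypothesis.split()
    # d[i][j] = edit distance between the first i reference words and first j hypothesis words
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i
    for j in range(len(hyp) + 1):
        d[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,          # deletion
                          d[i][j - 1] + 1,          # insertion
                          d[i - 1][j - 1] + cost)   # substitution or match
    return d[len(ref)][len(hyp)] / max(len(ref), 1)

# Illustrative function-word list (not from any real system).
FUNCTION_WORDS = {"a", "an", "and", "is", "the", "of", "to", "in"}

def keyword_wer(reference, hypothesis):
    """WER after stripping short function words, so only content words are scored."""
    strip = lambda s: " ".join(w for w in s.split() if w not in FUNCTION_WORDS)
    return wer(strip(reference), strip(hypothesis))

ref = "the cat is sitting on the mat"
hyp = "a cat sitting on the mat"
print(f"WER:         {wer(ref, hyp):.2f}")          # 0.29: penalises "the"/"is"
print(f"Keyword WER: {keyword_wer(ref, hyp):.2f}")  # 0.00: content words all correct
```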
Languages
Although there are open datasets in many languages, academics tend to focus on English, and open datasets in other languages contain far less data than the English ones. Other languages also present issues that are not found in English, such as tone, agglutination and non-Latin scripts.
A provider of speech recognition systems needs to cover the most widely spoken languages in the world. The top 10 languages only cover about 46% of the world's population, and even the top 100 languages still only cover about 85%.
Speed
Academics rarely report the real-time factor (RTF) – the time taken to transcribe the audio divided by the duration of the audio – of their models. The best systems proposed in academic papers are combinations of multiple large models, which makes them unusable commercially because the compute cost is too high.
Users often expect a fast turnaround time, which has made real-time systems increasingly important and desirable, especially when embedded on a device.
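For reference, the real-time factor is just the ratio described above. A minimal sketch follows; the `transcribe` function and `audio` object are placeholders for whatever system is being measured.

```python
import time

def real_time_factor(transcribe, audio, audio_duration_s):
    """RTF = processing time / audio duration.
    RTF < 1 means the system runs faster than real time; a live (streaming)
    application needs RTF comfortably below 1 on its target hardware."""
    start = time.perf_counter()
    transcribe(audio)  # placeholder: any transcription call
    elapsed = time.perf_counter() - start
    return elapsed / audio_duration_s

# For example, a 60-second clip transcribed in 90 seconds gives RTF = 1.5
# (too slow for live use), while transcribing it in 30 seconds gives RTF = 0.5.
```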
Robustness
Popular academic datasets tend to be too clean, whereas the audio in real-life applications is often very noisy. On noisier datasets, WERs can be around 40%. It would be good if more real-world data were available to academics, so that published results were more commercially viable and applicable.
Additional functionalities
Real-world systems need more than good ASR accuracy. They also need diarisation (who spoke when), punctuation and capitalisation, all of which a human transcriptionist is expected to provide and which make transcripts much easier and nicer to use commercially. Audio segmentation, knowing what kind of noise or music occurred and when, is also valuable.
So, what does this mean for the future of speech recognition systems?
There is a gap between academic and commercial systems. However, academic research is important because it shows that speech recognition can still be improved. The question then becomes how to make that process more efficient.
The academic method relies on dividing the problem into parts that can be improved in isolation. This has resulted in good progress on the somewhat artificial tasks we have set ourselves. In contrast, commercial speech recognition needs to take a more holistic approach: it is the performance of the overall system that matters. Companies are continuing to build both types of ASR technology, helping to close the gap between academic and commercial systems.
Rémi Francis, Speechmatics