Second, we must ask why the word error rate (WER) can go so low in-distribution yet fail to generalize. Driving WER on LibriSpeech's test sets ever lower has clearly been useful in propelling the research community forward, but why does that progress transfer so poorly to other out-of-distribution test sets?
We speculate it's a similar phenomenon to the vision community's experience with ImageNet: it's eminently possible to overfit to spurious correlations in a given dataset, quirks which happen to correlate with correct classification but which don't hold once the data distribution shifts. For speech, this could be the neural network building a deep reliance on certain frequencies appearing in the input when a certain phone is spoken, frequencies which are simply absent under distribution shift.
For every percentage-point decrease in in-distribution WER, a perfectly robust system would see a corresponding percentage-point decrease across all out-of-distribution test sets. Figure 2 shows that the Whisper system makes great strides in this direction by leveraging large quantities of broadly distributed labeled data, but a gap still remains.
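To make the robustness idea concrete, here is a toy check; the WER numbers below are invented purely for illustration.

```python
# Toy illustration of effective robustness: for a perfectly robust system, a
# 1-point in-distribution WER gain should transfer as ~1 point on every
# out-of-distribution test set. All numbers here are made up.
baseline = {"librispeech": 5.0, "ood_testset_a": 20.0, "ood_testset_b": 30.0}
improved = {"librispeech": 4.0, "ood_testset_a": 19.8, "ood_testset_b": 29.9}

in_dist_gain = baseline["librispeech"] - improved["librispeech"]  # 1.0 point
for name in ("ood_testset_a", "ood_testset_b"):
    ood_gain = baseline[name] - improved[name]
    # A ratio of 1.0 would indicate perfect robustness; lower means the
    # in-distribution gain failed to transfer.
    print(f"{name}: transferred {ood_gain / in_dist_gain:.2f} of the in-distribution gain")
```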
Key takeaways:
We must train and evaluate broadly and out-of-distribution if we truly want a handle on the human-level accuracy question
It’s far too easy to overfit in-distribution; robust systems are non-trivial to create, and we need a mechanism to enable out-of-distribution generalisation
Supervised learning is amazing at scale
By keeping the architectural setup very simple, Whisper can tell us a lot about the impact of dataset scaling. The architecture is a vanilla encoder-decoder transformer performing a simple token-prediction task. Although this presents challenges for building a real-time system, adding custom vocabulary, and performing reliable long-form recognition, it's a perfect sandbox for analysing the impact of training on internet-scale datasets.
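For readers who want a concrete picture of what "vanilla encoder-decoder transformer performing token prediction" means, here is a minimal PyTorch sketch. It is not Whisper's actual code: the dimensions and vocabulary size are illustrative, and positional encodings are omitted for brevity.

```python
# A minimal encoder-decoder ASR sketch: audio features in, text tokens out,
# trained with plain next-token cross-entropy. Illustrative only.
import torch
import torch.nn as nn

class EncoderDecoderASR(nn.Module):
    def __init__(self, n_mels=80, d_model=512, vocab_size=50_000):
        super().__init__()
        self.audio_proj = nn.Linear(n_mels, d_model)       # project mel frames to model width
        self.token_embed = nn.Embedding(vocab_size, d_model)
        self.transformer = nn.Transformer(
            d_model=d_model, nhead=8,
            num_encoder_layers=6, num_decoder_layers=6,
            batch_first=True,
        )
        self.lm_head = nn.Linear(d_model, vocab_size)      # logits over the text vocabulary

    def forward(self, mels, tokens):
        # mels: (batch, frames, n_mels); tokens: (batch, seq) of previous text tokens.
        # Positional encodings are omitted here to keep the sketch short.
        src = self.audio_proj(mels)
        tgt = self.token_embed(tokens)
        causal = self.transformer.generate_square_subsequent_mask(tokens.size(1))
        out = self.transformer(src, tgt, tgt_mask=causal)
        return self.lm_head(out)                           # (batch, seq, vocab)

# One training step: predict token t+1 from tokens up to t plus the audio.
model = EncoderDecoderASR()
mels = torch.randn(2, 300, 80)
tokens = torch.randint(0, 50_000, (2, 20))
logits = model(mels, tokens[:, :-1])
loss = nn.functional.cross_entropy(
    logits.reshape(-1, logits.size(-1)), tokens[:, 1:].reshape(-1)
)
```

The point of the sketch is how deliberately boring the modelling recipe is: all the heavy lifting is left to the dataset.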
Their results show that models can be scaled to around the billion-parameter mark and continue to exploit weakly labeled datasets. This is perhaps the key idea in the whole paper: it’s possible to both obtain and exploit 680,000 hours of mostly correct English transcriptions. At these data and model sizes, architectural and loss innovations become secondary and wash out.
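The "obtain" half of that claim leans heavily on data cleaning. As we read the paper, the authors filter out likely machine-generated transcripts using signals such as missing punctuation or uniform casing; the sketch below illustrates that style of heuristic filter, with rules and thresholds that are our own guesses rather than the paper's.

```python
# Illustrative transcript filters in the spirit of the paper's data-cleaning
# heuristics. The specific rules here are our own simplifications.
import string

def looks_machine_generated(transcript: str) -> bool:
    """Flag transcripts showing tell-tale signs of ASR output rather than
    human transcription: no punctuation, or uniformly upper/lower case."""
    stripped = transcript.strip()
    if not stripped:
        return True
    has_punctuation = any(ch in string.punctuation for ch in stripped)
    uniform_case = stripped.isupper() or stripped.islower()
    return (not has_punctuation) or uniform_case

def filter_pairs(pairs):
    """Keep only (audio_path, transcript) pairs whose transcript passes the check."""
    return [(audio, text) for audio, text in pairs if not looks_machine_generated(text)]

pairs = [
    ("clip1.wav", "Hello there, how are you today?"),  # kept
    ("clip2.wav", "hello there how are you today"),    # dropped: no punctuation, all lower case
]
print(filter_pairs(pairs))
```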
By following this simple recipe, the Whisper model approaches human-level performance on the Kincaid46 dataset, which is a great achievement.
Key takeaways:
It pays to collect the largest weakly labeled dataset possible, with the bar for inclusion set high
Heuristics to auto-clean or reject data are as important as architectural or training-loss innovations, particularly if your largest models are data-bottlenecked
Internet-scale supervised learning is plateauing for English ASR
At Speechmatics, we believe that scale is a key ingredient for almost all the AI systems of the future. For both supervised and unsupervised systems, scaling labeled data along with parameter count in a transformer model is now a proven recipe. However, what happens when the labeled data runs out? What's the play when we need to collect or pay for exponentially more data to realise the next step-change in accuracy? This is the challenge in ASR today.
The Whisper setup tells us what happens when model and data scaling saturate. If the authors are correct, the Whisper model has converged to the implicit error level in the training data. To hit the next step-change we'd then need an order of magnitude more perfect-level (not human-level) transcribed audio. That would be ~10M hours of perfectly labeled data, which is out of reach even for companies with the deepest pockets.
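As a back-of-the-envelope sanity check on that figure (our own arithmetic, not the paper's):

```python
# Rough arithmetic behind the "~10M hours" claim above.
whisper_hours = 680_000        # Whisper's reported training-set size
target_hours = 10_000_000      # the order-of-magnitude step up discussed above
hours_per_year = 24 * 365      # 8,760 hours of audio per calendar year

print(f"Whisper's data: ~{whisper_hours / hours_per_year:.0f} years of continuous audio")
print(f"Next step:      ~{target_hours / hours_per_year:.0f} years of perfectly transcribed audio")
```

That is roughly 78 years of continuous audio already, and over a thousand years of perfectly transcribed audio for the next step.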
As the authors highlight: “Performance improves rapidly on English speech recognition from 3,000 to 13,000 hours and then slows down noticeably between 13,000 and 54,000 hours. Using the full dataset, which corresponds to another 12.5× increase in size, results in only a further 1 point drop in WER. This mirrors the diminishing returns observed with model size scaling for English speech recognition and could similarly be explained by saturation effects when approaching human-level performance.”
It's therefore clear that supervised learning is powerful, scales well, and is something we need, but it can plateau even on internet-scale datasets when the number of edge cases is large and out-of-distribution generalisation is tough. And, as is the case here, you can be left with a far-from-perfect model.
Here at Speechmatics we are investing in a longer game, one we believe is both the faster and the more economically feasible route to perfect-level ASR. In particular, self-supervised learning allows us to reach a similar word error rate with 100x less labeled data. This kind of data-efficient learning not only catalyses ASR training but also gives us a route to combat the saturation and exponential data requirements of the classic supervised regime. On this view, labeled data is necessary but not sufficient for training the next generation of ASR systems.
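To give a flavour of what a self-supervised objective looks like, here is a deliberately simplified masked-prediction sketch in PyTorch. It is a generic illustration of this family of methods, not Speechmatics' recipe: the encoder, the masking scheme, and the reconstruction loss are all placeholders.

```python
# Generic masked-prediction pretraining sketch: mask some audio frames and ask
# the model to reconstruct them. No transcripts appear anywhere in this loop.
import torch
import torch.nn as nn

class FrameEncoder(nn.Module):
    """Toy stand-in for a speech encoder: mel frames in, contextual features out."""
    def __init__(self, n_mels=80, d_model=256):
        super().__init__()
        self.proj = nn.Linear(n_mels, d_model)
        layer = nn.TransformerEncoderLayer(d_model, nhead=4, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=4)
        self.predict = nn.Linear(d_model, n_mels)   # reconstruct the masked frames

    def forward(self, mels):
        return self.predict(self.encoder(self.proj(mels)))

def masked_prediction_loss(model, mels, mask_prob=0.15):
    """Zero out a random subset of frames, then score reconstruction only there."""
    mask = torch.rand(mels.shape[:2]) < mask_prob            # (batch, frames)
    corrupted = mels.clone()
    corrupted[mask] = 0.0                                     # hide the masked frames
    recon = model(corrupted)
    return nn.functional.mse_loss(recon[mask], mels[mask])

model = FrameEncoder()
unlabeled_mels = torch.randn(4, 200, 80)   # unlabeled audio features: no transcripts needed
loss = masked_prediction_loss(model, unlabeled_mels)
loss.backward()
```

Labeled data then only enters in a comparatively small fine-tuning stage, which is where the claimed 100x data efficiency comes from.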
Most strikingly, young children learn to understand speech after only a few years of play and with minimal supervision, whereas the supervised ASR systems of today require a lifetime of constant, second-by-second supervision. As such, we believe a key component of both AGI and the perfect-level ASR systems of the future will be data-efficient representation learning. Moreover, we'll know we are doing well when we find we need less and less labeled data, because we'll have learnt all the important things about speech from pretraining. Crucially, this approach aligns with the core of what we believe it means for a system to be increasingly intelligent: we observe greater generalisation from smaller amounts of supervision.
Key takeaways:
Supervised learning scales but is prone to brittleness in out-of-distribution scenarios and can plateau before your problem is solved
Data-efficient pre-training such as self-supervised learning offers an exciting alternative trajectory to combat saturation effects seen in models trained on internet-scale ASR datasets
The Whisper results indicate that internet-scale supervised learning has plateaued for English ASR
Summary
This blog was written to provide some insight into OpenAI Whisper’s approach to speech-to-text, and its implications for speech research.
However, deploying ASR systems in production is hard. Good WERs are not enough – for a system to be useful, there is an array of extra factors to be considered. In our next post, we'll look at some of these factors and dig into how our latest systems are scaling and performing.
Will Williams, VP Machine Learning and Lawrence Atkins, Machine Learning Engineer
[1]: Vaswani, Ashish, et al. "Attention is all you need." Advances in neural information processing systems 30 (2017).
[2]: Kaplan, Jared, et al. "Scaling laws for neural language models." arXiv preprint arXiv:2001.08361 (2020).
[3]: Xiong, Wayne, et al. "Achieving human parity in conversational speech recognition." arXiv preprint arXiv:1610.05256 (2016).
[4]: Radford, Alec, et al. "Robust Speech Recognition via Large-Scale Weak Supervision." Technical report, OpenAI (2022). https://cdn.openai.com/papers/whisper.pdf