Sample Efficiency
Good sample efficiency is the ability of a model to reach a given level of performance with fewer labeled samples; the model makes better use of the labels it is given. It also means that, with more labeled data, the model should achieve even better performance. Traditional speech features have been shown to reach strong performance in ASR[2][3]; however, it usually takes very large amounts of labeled data to get there, because these input feature representations are low-dimensional and limited in the information they contain.
Labeled data is often scarce for ASR; speech is expensive and difficult to label, and the labeled data that does exist generally comes from a few domains and doesn't capture the real variety of spoken language. This further limits how efficiently supervised ASR methods can learn, since they are exposed to only a narrow variety of speech.
On the other hand, a model trained with SSL features is much more efficient at learning the task of ASR. Take this example: the SSL model has seen many examples of the sound /b/, which share similar acoustic properties. It learns to represent these segments of audio in a similar way, despite having no understanding that they correspond to the sound /b/. Then some labeled data comes along. The supervised model sees that one of the representations in that cluster maps to a /b/ in the output space, so it is likely that the other features in the cluster are also a /b/. Thus, the model learns a mapping not just from one feature to /b/, but from a whole cluster of varied features. A toy sketch of this intuition follows below.
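To make the clustering intuition concrete, here is a minimal sketch in Python. The synthetic features, the explicit k-means clustering step, and the single labeled frame are all invented for this illustration; a real ASR model learns the equivalent mapping implicitly during supervised fine-tuning rather than via hard label propagation.

```python
# Toy sketch of the intuition above: cluster SSL frame representations, then
# propagate a single phoneme label to the other members of its cluster.
# Everything here is synthetic; it is an illustration, not a training recipe.
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)

# Pretend SSL features: 200 frames of 32-dim representations.
# Frames of the same underlying sound are drawn from the same cluster.
centres = rng.normal(size=(5, 32))            # 5 "sounds", e.g. /b/, /a/, ...
frames = np.vstack([c + 0.1 * rng.normal(size=(40, 32)) for c in centres])

# Unsupervised step: group acoustically similar frames together.
cluster_ids = KMeans(n_clusters=5, n_init=10, random_state=0).fit_predict(frames)

# Supervised step: a single labeled frame tells us its cluster sounds like /b/.
labeled_frame_idx = 3                          # one labeled example
labeled_phoneme = "/b/"
b_cluster = cluster_ids[labeled_frame_idx]

# Every other frame in that cluster is now a likely /b/ as well, so one label
# effectively supervises a whole cluster of varied features.
likely_b_frames = np.flatnonzero(cluster_ids == b_cluster)
print(f"{len(likely_b_frames)} frames inferred as {labeled_phoneme} from 1 label")
```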
By training on unlabeled data, we are also able to capture many more domains in the training data, meaning that the model learns features of more diverse speech. Through pre-training, we prime the model with a huge amount of speech information so that the labeled data has a much greater impact. All of this contributes to an ASR system that is more data-efficient and better at understanding every voice.
Results
With Ursa we scaled our self-supervised learning in terms of both data and model size. This has led to richer representations and greater sample efficiency, as the representations reflect more of the diversity within the input data.
Generally, deep learning benefits from more data and more training. In production, you will normally train on as much data as is available for as long as possible. However, not all languages have many thousands of hours of training data, and the point of sample efficiency is that we shouldn't need tens of thousands of hours to get excellent performance. Scaling SSL should lead to better sample efficiency; therefore, we decided to test how accuracy changes as we reduce the amount of English ASR training data (a sketch of this ablation is shown below).
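The sketch below shows one way such a data ablation could be set up: train the same downstream model on progressively smaller random subsets of the labeled data and record accuracy at each point. The `train_fn` and `eval_fn` callables and the subset fractions are placeholders for illustration, not our actual training pipeline.

```python
import random

def sample_efficiency_sweep(labeled_utterances, train_fn, eval_fn,
                            subset_fractions=(1.0, 0.5, 0.1, 0.01), seed=0):
    """Train on random subsets of the labeled data and report accuracy per subset."""
    rng = random.Random(seed)
    shuffled = list(labeled_utterances)
    rng.shuffle(shuffled)                     # random ordering, no data selection
    results = {}
    for frac in sorted(subset_fractions):
        subset = shuffled[: max(1, int(len(shuffled) * frac))]
        model = train_fn(subset)              # caller-supplied supervised ASR training
        results[frac] = eval_fn(model)        # caller-supplied WER evaluation
    return results
```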
In these experiments, we used two different SSL models to generate the input features for our downstream supervised ASR training. One model has 2B parameters; the second has 500M parameters and was pre-trained on less data. We hypothesized that we should see greater sample efficiency from the larger model. We used no data selection methods for ASR training, so the ordering of the labeled data was completely random. The models were tested on a variety of publicly available test sets (explained in the Ursa blog), and we report weighted word error rates in Figure 2.
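For reference, a weighted word error rate is commonly computed by pooling edit errors and reference word counts across test sets, so that larger sets contribute proportionally more. The exact weighting behind Figure 2 is described in the Ursa blog; the sketch below shows the standard pooled formulation with hypothetical numbers.

```python
# One common way to combine WER across several test sets: pool errors and
# reference word counts rather than averaging per-set WERs.
def weighted_wer(per_set_results):
    """per_set_results: list of (substitutions + deletions + insertions, reference_words)."""
    total_errors = sum(errors for errors, _ in per_set_results)
    total_words = sum(words for _, words in per_set_results)
    return 100.0 * total_errors / total_words

# Hypothetical example: three test sets of different sizes.
print(weighted_wer([(1200, 20000), (450, 9000), (3100, 52000)]))  # ~5.86
```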