Sample efficiency is a hot topic in the speech recognition industry. As technology continues to advance at a rapid pace, the quest for more accurate and efficient automatic speech recognition (ASR) systems has become increasingly important. In this article, I delve into the concept of sample efficiency, its implications, and how it is reshaping the way we approach speech recognition.
The increased interest of the wider public in artificial intelligence (AI) and machine learning (ML) has, in my opinion, been driven by progress in two subfields: supervised learning and reinforcement learning.
From the perspective of industry, it’s really the improvement in supervised learning that’s drawing attention. You can be fairly sure that when a product “uses AI” it’s going to involve supervised learning (if it uses any machine learning at all).
In this blog, I’m going to discuss some shortcomings of supervised learning and research addressing these shortcomings. Many have written similar articles and blogs before, so what I want to explore is the implications of this research on industry.
Supervised learning
Consider training a classifier to distinguish between pictures of cats and pictures of dogs. The classifier takes an input image and maps it to either the label "cat" or the label "dog". Supervised learning is all about learning mappings from an input space to an output space. In this example, the input space is the collection of input images and the output space contains the labels "cat" and "dog".
Training this mapping requires lots of labeled data. For the example above, you need not only lots of pictures of cats and dogs but also access to the corresponding labels. This labeling is normally done by hand. Someone (or more commonly many people) must painstakingly go through every image you have of cats and dogs and hand label the image with either "cat" or "dog". This is the current paradigm of supervised learning; training requires lots of labeled data.
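To make this concrete, here is a minimal sketch of what that mapping looks like in code. It uses PyTorch with randomly generated stand-in tensors in place of a real hand-labeled cat/dog dataset; the tiny network and hyperparameters are placeholders for illustration only, not a description of any production system.

```python
# Minimal supervised-learning sketch. The images and labels below are random
# stand-ins for a real hand-labeled cat/dog dataset (0 = "cat", 1 = "dog").
import torch
import torch.nn as nn

images = torch.randn(64, 3, 32, 32)      # 64 RGB images, 32x32 pixels
labels = torch.randint(0, 2, (64,))      # one hand-assigned label per image

# A small classifier mapping the input space (images) to the output space (labels).
model = nn.Sequential(
    nn.Flatten(),
    nn.Linear(3 * 32 * 32, 128),
    nn.ReLU(),
    nn.Linear(128, 2),
)

optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
loss_fn = nn.CrossEntropyLoss()

for step in range(100):
    optimizer.zero_grad()
    logits = model(images)               # predicted scores for "cat" and "dog"
    loss = loss_fn(logits, labels)       # compare against the hand-made labels
    loss.backward()
    optimizer.step()
```

The key point is the dependence on `labels`: every gradient step compares the model's prediction against a label that a human had to produce by hand.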
How much labeled data an algorithm needs to reach a given level of performance determines its sample efficiency: the less data required, the more sample efficient the algorithm. As supervised learning systems need lots of labeled data, they are very sample inefficient. This poor sample efficiency leads to the following issues (a rough sketch of how sample efficiency can be measured follows the list):
It's slow: Hand labeling data takes a lot of time. At Speechmatics, we use thousands of hours of hand-transcribed audio as training data. I try not to think about the collective time it took to type that out...
It's expensive: No single person can hand label data at the scale needed to train strong supervised learning systems. The workaround is paying lots of people to do this labeling at the same time, which has associated costs.
Your labels can be wrong: If someone's job is hand labeling data for 16 hours a day then they are going to make mistakes. These mistakes can be disastrous when training; as the saying goes: "garbage in, garbage out".
There are more sample efficient algorithms: A human can recognize a new object after seeing only one or two examples. Whatever the human brain is doing, it is incredibly sample efficient. This is good news because it shows that far greater sample efficiency is achievable: the brain sets a very high lower bound on what learning algorithms could attain.
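As promised above, here is a rough sketch of how sample efficiency might be measured in practice: train the same model on labeled subsets of increasing size and record the accuracy each subset buys you. The data here is random stand-in data, so the numbers it prints are meaningless; with a real dataset, the curve of accuracy against number of labeled examples is what reveals how sample efficient an algorithm is.

```python
# Rough sketch of measuring sample efficiency: train the same model on
# labeled subsets of increasing size and record the test accuracy reached.
# A more sample-efficient algorithm climbs to high accuracy with fewer labels.
import torch
import torch.nn as nn

def train_and_evaluate(train_x, train_y, test_x, test_y, steps=200):
    """Train a small classifier on a labeled subset and return test accuracy."""
    model = nn.Sequential(nn.Flatten(), nn.Linear(3 * 32 * 32, 2))
    optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
    loss_fn = nn.CrossEntropyLoss()
    for _ in range(steps):
        optimizer.zero_grad()
        loss_fn(model(train_x), train_y).backward()
        optimizer.step()
    with torch.no_grad():
        predictions = model(test_x).argmax(dim=1)
    return (predictions == test_y).float().mean().item()

# Random stand-in data; in practice these would be real labeled examples.
x, y = torch.randn(1000, 3, 32, 32), torch.randint(0, 2, (1000,))
test_x, test_y = torch.randn(200, 3, 32, 32), torch.randint(0, 2, (200,))

for n in (10, 100, 1000):
    accuracy = train_and_evaluate(x[:n], y[:n], test_x, test_y)
    print(f"{n} labeled examples -> accuracy {accuracy:.2f}")
```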
Self-supervised learning
The reasons outlined above show the need for algorithms that are far more sample efficient than supervised learning. Self-supervised learning is a broad class of techniques that can deliver this improvement in sample efficiency.
With supervised learning, a model takes some data and predicts a label. With self-supervised learning, the model takes some data and predicts an attribute of the same data. This definition is quite cryptic and is best understood through example. Some examples of self-supervised tasks include:
Looking at the first half of a sentence and predicting the next word
Turning an image black and white and trying to predict the original colors
Randomly rotating an image and predicting what the rotation was
Watching 5 seconds of video and predicting the next frame
Although these examples are disparate, they all follow a common pattern: take some data, potentially augment the data, predict a property of the original data. Note that no labels are needed at any stage. The resulting representations can be used to train classifiers with much higher sample efficiency than standard supervised learning.
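To illustrate the pattern, here is a minimal sketch of the rotation pretext task from the list above, again in PyTorch with random stand-in images. The "labels" (the applied rotations) are generated from the data itself, so no hand labeling is involved; the encoder and hyperparameters are arbitrary placeholders.

```python
# Minimal sketch of the rotation pretext task: rotate each image by a random
# multiple of 90 degrees and train the model to predict which rotation was
# applied. The "label" is generated from the data itself; no hand labeling.
import torch
import torch.nn as nn

images = torch.randn(64, 3, 32, 32)          # unlabeled images (random stand-ins)

rotations = torch.randint(0, 4, (64,))       # 0, 1, 2, 3 -> 0/90/180/270 degrees
rotated = torch.stack(
    [torch.rot90(img, k=int(k), dims=(1, 2)) for img, k in zip(images, rotations)]
)

# The encoder learns a representation; a small head predicts the rotation.
encoder = nn.Sequential(nn.Flatten(), nn.Linear(3 * 32 * 32, 128), nn.ReLU())
head = nn.Linear(128, 4)

optimizer = torch.optim.Adam(
    list(encoder.parameters()) + list(head.parameters()), lr=1e-3
)
loss_fn = nn.CrossEntropyLoss()

for step in range(100):
    optimizer.zero_grad()
    loss = loss_fn(head(encoder(rotated)), rotations)   # predict the applied rotation
    loss.backward()
    optimizer.step()

# After this pretraining, `encoder` can be reused and fine-tuned on a small
# labeled set, which is where the improved sample efficiency comes from.
```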
In the preceding section, I listed four shortcomings of supervised learning. By severely reducing the dependence on labeled data, self-supervised learning addresses the first three shortcomings that I listed.
It also moves us closer to the lower bound on potential sample efficiency mentioned in the fourth point. It is widely thought that most of the learning infants do is through self-supervision. In that sense, the current self-supervised methods being explored bring us slightly closer to recreating the algorithm used by the human neocortex.
It is worth stressing that self-supervised learning is still in its infancy and is just one set of techniques for improving sample efficiency. Some of the research we have been doing at Speechmatics investigates how meta-learning can improve sample efficiency. Below you can watch a video summarizing a paper we presented at the MetaLearn workshop at NeurIPS last year.
Although I have very little confidence about whether it will be achieved using self-supervision, meta-learning or some other technique, I am confident that machine learning will only become more sample efficient as the years pass.
Implications for industry
From my admittedly narrow experience of how machine learning is being used in wider industry, it appears that the "ML ecosystem" is developing to compensate for poor sample efficiency. For instance:
Many "data labeling" companies can label large amounts of data ever cheaper and faster, a valuable service when industry is dominated by supervised learning
Lots of services "clean" data, ensuring the labels for a dataset are both consistent and correct
Emerging marketplaces are selling datasets after they have been labeled and cleaned
Some are arguing that companies should even include their datasets on their balance sheets
All of these are valuable and logical phenomena in the current ML paradigm. However, I don't think it is unreasonable to expect this to significantly change if very sample efficient algorithms are developed.
Demand-side of industry
On the demand side, there will be less need to have large amounts of data labeled, because far less labeled data will be required. "Cleaning" a dataset will also look very different: you don't need labels to be consistent and correct if there are no labels in the first place.
Supply-side of industry
On the supply side, there is far, far more unlabeled data available for free (think the entire Internet) than there is labeled data in existence. There is less of a business case for selling labeled data when algorithms can make use of an abundance of unlabeled data that is freely available.
It's likely that the most dramatic changes will be due to second-order effects which I am absolutely not going to speculate about.
That all said, what is far more important than any of these individual effects is realizing that machine learning algorithms are going to change, hopefully significantly, over the next 10 years. The current shortcomings will be ameliorated, and new problems will arise. This will be a threat to some and an opportunity for others.
For businesses building tooling around ML, there should at least be an awareness that deep learning is only one paradigm of machine learning and a different paradigm may dominate in the future. When shooting a moving target, those who aim directly at the target will inevitably miss. The best marksmen aim at where the target will be.