Jul 8, 2023

YouTube’s Captions Represent the Dire Need for Speech-to-Text Innovation

YouTube’s automated captioning service is notoriously unreliable and represents the dire need for innovation within the speech-to-text industry. Find out what we’re doing about it.
Benedetta Cevoli, Senior Machine Learning Engineer

The Problem with YouTube's Captions

When you think of speech-to-text, you likely think of captioning. It's the use of speech-to-text that most people will recognize, and with good reason. Research has shown that video captions bring huge benefits beyond supporting viewers with hearing loss. Captions play a key role in social media engagement and are quickly becoming a must-have for any content creator.

YouTube is one of the most popular sites on the internet, ranking only behind Google (which owns YouTube anyway). The service provides auto-captioning: AI that converts speech to text as quickly and as accurately as possible. Accuracy, however, isn't guaranteed.

Because of that, some creators disable auto-captioning completely and supply their own captions to make sure their content is accurately transcribed. This is particularly true for official channels and large channels with big audiences. These channels often have the budget to take care of their own captioning, so caption quality varies from channel to channel. For the most part, however, YouTube's auto-captioning is notoriously unreliable.

That unreliability demonstrates the AI industry's constant need for innovation. According to 3Play Media, 80% of viewers use captions for reasons other than hearing loss, highlighting how captioning has grown beyond the need for greater accessibility.

Captions are a necessity – it's time we treat them that way. 

YouTube Is More Than Entertainment Now

Since its inception in 2005, YouTube has grown enormously, amassing two billion users in 2022. It's clear, then, that YouTube has evolved beyond 'Charlie bit my finger.' Now, it's where you can learn anything and everything in an easy-to-consume, digestible way. Educational and potentially life-saving videos aren't available in every language, so millions of users rely on the captioning service.

According to a study, YouTube's automated captions are 60-70% accurate, which means roughly one word in three is wrong. The accuracy rate depends greatly on audio quality, but the clear need for accurate captioning means the AI must be able to cope with background noise, accents, and jargon.
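To make that arithmetic concrete, here is a quick sketch in Python. The function name `words_per_error` is ours, purely for illustration, and it assumes "accuracy" means the share of words transcribed correctly.

```python
def words_per_error(accuracy_percent: float) -> float:
    """Return N such that, on average, one word in every N is wrong.

    Assumes accuracy is the percentage of words transcribed correctly.
    """
    if not 0 <= accuracy_percent < 100:
        raise ValueError("accuracy_percent must be in the range [0, 100)")
    error_rate = (100 - accuracy_percent) / 100
    return 1 / error_rate


# The 60-70% range quoted above, plus a 90% figure for comparison.
for accuracy in (60, 70, 90):
    print(f"{accuracy}% accurate: about 1 word in {words_per_error(accuracy):.1f} is wrong")
```

At 60% accuracy that works out to one wrong word in every 2.5, and at 70% to roughly one in 3.3, which is where the "1 in 3" shorthand comes from.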

Of course, YouTube is a great platform for many reasons, but good-quality, accessible captions are a must. You'll likely still understand most of the text, but the margin for error is too large in this day and age.

Addressing the Problem

At Speechmatics, we know how much the accuracy of our speech-to-text engine matters. Here, we're using YouTube's automated captioning as an example of the dire need for innovation across the speech-to-text board. Good automatic speech recognition (ASR) saves users time and money, resources they can use to create enjoyable and helpful content. With that in mind, we compared our ASR to our main competitors using 24 YouTube videos, ranging from 'Every Outfit Winnie Harlow Wears in a Week: 7 Days, 7 Looks' to 'Diving World Cup 2021: Men's 10m Final.' We found that our ASR achieved accuracy above 90% on content with multiple speakers and accents, background noise, and challenging vernacular.
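Accuracy figures like these are commonly derived from word error rate (WER): the number of word substitutions, deletions, and insertions needed to turn the ASR output into a human reference transcript, divided by the number of words in the reference. The sketch below is a minimal illustration of that idea, not the scoring pipeline used in our comparison. The function name and the lowercase-and-split normalization are simplifying assumptions; real evaluations usually normalize punctuation and numbers as well.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level edit distance between hypothesis and reference,
    divided by the number of reference words.

    Normalization here (lowercase, split on whitespace) is a
    simplifying assumption for illustration only.
    """
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words seen
    # so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (ref_word != hyp_word)
            deletion = prev[j] + 1      # reference word missing from hypothesis
            insertion = curr[j - 1] + 1  # extra word in hypothesis
            curr.append(min(substitution, deletion, insertion))
        prev = curr
    return prev[-1] / len(ref)


reference = "charlie bit my finger again"
hypothesis = "charlie hit my finger"
wer = word_error_rate(reference, hypothesis)
print(f"WER: {wer:.0%}, accuracy: {1 - wer:.0%}")
```

In this made-up example there is one substituted word and one dropped word out of five, giving a WER of 40% and an accuracy of 60%. Note that WER can exceed 100% when the hypothesis contains many extra words, so "accuracy" computed as 1 minus WER is a convention rather than a true percentage of correct words.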

This is only possible thanks to self-supervised learning. Before adopting it, we trained our ASR on approximately 30,000 hours of labeled audio. Labeled data of this kind is very costly and comes with big accessibility issues: some voices are simply left out. Since then, we've taken that number closer to 1,100,000 hours by drawing on a wealth of unlabeled data to improve our engine. Self-supervised learning is helping us bridge the gap between well-curated, labeled speech representing only a selection of speakers and varied, everyday speech that covers a breadth of voice cohorts.

It's pretty straightforward – you get better results when you put accuracy first. 

Captioning's Prominence Shows No Signs of Slowing Down

We will keep training our ASR on as many voices as possible, in pursuit of our aim to understand every voice and create genuinely accessible ASR.

It's a good thing, too, as captioning spreads across the internet. For example, TechTimes reported that Twitter is working on bringing closed captions to the site. The AI is thought to be coming from the company itself, so we can't comment on its potential accuracy. We do know one thing, however: the shift in perspective on captioning is a welcome one. As captioning moves from a mere add-on to a necessity, the market will keep innovating to produce the most accurate and accessible engine.

For us, that has always been the aim of the game. When you prioritize accuracy, viewers spend less time noticing the ASR's mistakes and more time absorbing the content.

Isn't that what every streaming service or content producer out there wants?

Benedetta Cevoli, Data Science Engineer, Speechmatics