Jul 8, 2023 | Read time 4 min

YouTube’s Captions Represent the Dire Need for Speech-to-Text Innovation

YouTube’s automated captioning service is notoriously unreliable and represents the dire need for innovation within the speech-to-text industry. Find out what we’re doing about it.
Benedetta Cevoli, Senior Machine Learning Engineer

The Problem with YouTube's Captions

When you think of speech-to-text, you likely think of captioning. It's the application most people recognize, and with good reason. Research has shown huge benefits of video captions beyond aiding those with hearing loss. Captions play a key role in social media engagement and are quickly becoming a must-have for any content creator.

YouTube is one of the most popular sites on the internet, ranking only behind Google (which owns YouTube anyway). The service provides auto-captioning – AI that transcribes speech to text as quickly and as accurately as possible. Accuracy, however, isn't guaranteed.

Because of that, some creators disable auto-captioning completely and supply their own captions to make sure their content is accurately transcribed. This is particularly true for official channels or large channels with big audiences, which often have the budget to handle their own captioning, so results on YouTube can vary. For the most part, however, YouTube's auto-captioning is notoriously unreliable.

This unreliability demonstrates the AI industry's constant need for innovation. According to 3PlayMedia, 80% of viewers use captions for reasons other than hearing loss, highlighting how captioning has grown beyond the need for greater accessibility.

Captions are a necessity – it's time we treat them that way. 

YouTube Is More Than Entertainment Now

Since its inception in 2005, YouTube has grown exponentially, amassing two billion users in 2022. It's clear, then, that YouTube has evolved beyond 'Charlie bit my finger.' Now, it's where you can learn anything and everything in an easy-to-digest way. Educational and potentially life-saving videos aren't available in every language, so millions of users rely on the captioning service.

According to a study, YouTube's automated captions are 60–70% accurate – equivalent to roughly 1 in 3 words being incorrect. Of course, the accuracy rate depends greatly on audio quality, but the clear need for accurate captioning means that the AI must be able to cope with background noise, accents, and jargon.
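To make figures like these concrete: transcription accuracy is conventionally reported via word error rate (WER) – the number of word substitutions, deletions, and insertions needed to turn the transcript into the reference, divided by the reference length. Accuracy is then 1 − WER, so 65% accuracy means roughly one word in three is wrong. The helper below is an illustrative sketch, not YouTube's or Speechmatics' actual evaluation code:

```python
def wer(reference: str, hypothesis: str) -> float:
    """Word error rate: (substitutions + deletions + insertions) / reference length."""
    ref, hyp = reference.split(), hypothesis.split()
    # Levenshtein distance over words, via dynamic programming.
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i  # deleting all remaining reference words
    for j in range(len(hyp) + 1):
        d[0][j] = j  # inserting all remaining hypothesis words
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(
                d[i - 1][j] + 1,         # deletion
                d[i][j - 1] + 1,         # insertion
                d[i - 1][j - 1] + cost,  # substitution or match
            )
    return d[len(ref)][len(hyp)] / len(ref)

# One wrong word out of four gives a WER of 0.25, i.e. 75% accuracy.
print(wer("the quick brown fox", "the quick brown box"))  # → 0.25
```

Production scoring pipelines add normalization (casing, punctuation, number formats) before computing WER, which is why published accuracy figures can differ for the same transcript.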

YouTube is a great platform for many reasons, but good-quality, accessible captions are a must. You'll likely still understand most of the text, but the margin for error is too large in this day and age.

Addressing the Problem

At Speechmatics, we know how much the accuracy of our speech-to-text engine matters. Here, we're using YouTube's automated captioning as an example of the dire need for innovation across the speech-to-text board. Good automatic speech recognition (ASR) allows users to save time and money – resources they can use to create enjoyable and helpful content. To that end, we compared our ASR to our main competitors using 24 YouTube videos, ranging from 'Every Outfit Winnie Harlow Wears in a Week: 7 Days, 7 Looks' to 'Diving World Cup 2021: Men's 10m Final.' We found that our ASR achieved accuracy above 90% on content with multiple speakers and accents, background noise, and challenging vernacular.

This is only possible due to the introduction of self-supervised learning. Before using it, we trained our ASR on approximately 30,000 hours of labeled audio content. This type of data is very costly and comes with big accessibility issues: some voices are simply left out. Since then, we've taken that number closer to 1,100,000 hours by drawing on a wealth of unlabeled data. Self-supervised learning is helping us bridge the gap between well-curated, labeled speech representing only a selection of speakers and varied, everyday speech that covers a breadth of voice cohorts.

It's pretty straightforward – you get better results when you put accuracy first. 

Captioning's Prominence Shows No Signs of Slowing Down

We will continually train our ASR with as many voices as possible, and we will keep working toward our aim of understanding every voice and creating genuinely accessible ASR.

It's a good thing, too, as captioning spreads across the internet. For example, TechTimes reported that Twitter is working on adding closed captions to the platform. The AI is thought to be built in-house, so we can't comment on its potential accuracy. We do know one thing, however: the shift in perspective on captioning is a welcome one. Moving from a mere add-on to a necessity means the market will keep innovating to produce the most accurate and accessible engine.

For us, that has always been the aim of the game. When you prioritize accuracy, you spend less time acknowledging the ASR's mistakes and more time absorbing the content. 

Isn't that what every streaming service or content producer out there wants?

Benedetta Cevoli, Data Science Engineer, Speechmatics
