Jan 19, 2022 | Read time 4 min

Understanding Children’s Voices: How Voice-to-Text Assists eLearning

Read about how the pandemic has had a profound effect on education, how voice-to-text technology misunderstands young voices, and whether we can rely on it to help educate the next generation.

Benedetta Cevoli, Senior Machine Learning Engineer

Understanding Children’s Voices: How Voice-to-Text Assists eLearning

The COVID-19 pandemic has had an enormous impact on the education of children the world over. With school closures and teacher absenteeism, the once-regimented structure of education was disrupted like never before. As we start to piece the process of learning back together, could speech-to-text technology not only prevent children from slipping further behind, but also help them catch up to pre-pandemic levels?

The pandemic-driven acceleration of eLearning tools has been remarkable. China, for example, sent 250 million children home for online classes. Learning, however, still had to continue. The World Economic Forum reports that Zhejiang University uploaded more than 5,000 courses online a mere two weeks after the transition began, using a system called DingTalk ZJU.

This home learning was replicated all over the world, with at least 210 countries and billions of students affected by the pandemic and stay-at-home orders. Teachers and students alike were forced to get to grips with scanning documents, hosting online sessions and other new technologies – transcription software included.

Seen But Not Heard

Far too often, once children reach around the age of two, we bracket them into "those who speak" and "those who don't", taking for granted that children continue learning to speak well beyond those early years. While adults have heard most everyday words time and time again – and used them just as much – most children discover new ones practically every day. When they repeat them, they do so in different ways, getting used to the sounds these words make. This process of trial and error is one of many reasons why current voice recognition isn't as accurate with children as it is with adults.

But children's voices differ from adults' in a variety of ways, too. It's not just the obvious difference in pitch but the speech patterns themselves that often trip up voice recognition. As reported by TechCrunch, children can stress different parts of words than adults do, over-enunciate, punctuate differently and use fewer common cadences. Historically, all of this has led to children's voices being disproportionately failed by a technology trained primarily on adult speech.

Engaging Young Voices

Arguments have been made for some time now that subtitles can play a huge role in developing literacy skills. After all, the more children see words in action, the easier it is for them to understand and repeat them. It stands to reason, then, that live captioning in classrooms would have similar benefits. But when young voices are still so often misunderstood by speech-to-text technology, can we really rely on it to help teach our children? The answer is obvious: make the technology more accurate.

And that's exactly what we've done at Speechmatics. Our accuracy has improved hugely, and the gap between adult and child speech has shrunk dramatically, thanks in large part to the introduction of self-supervised learning (SSL) into our training.

Before SSL, data to train on was frustratingly scarce – especially when it came to young voices. We were limited to around 30,000 hours of audio, most of it adult speech. With SSL, we can now train on 1.1 million hours of audio, vastly increasing the number of children's voices represented.
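The reason SSL unlocks so much more data is that it needs no transcripts: the audio itself provides the training signal, typically by hiding part of the input and asking the model to predict it from context. The toy sketch below illustrates that masked-prediction idea on synthetic "feature frames" – it is a deliberately simplified illustration of the general technique, not Speechmatics' actual model or training pipeline; all names and shapes here are made up for the example.

```python
import numpy as np

rng = np.random.default_rng(0)

def make_features(n_frames=200, dim=8):
    # Unlabeled feature frames standing in for audio representations.
    # No transcript is needed anywhere below - that's the point of SSL.
    t = np.linspace(0, 8 * np.pi, n_frames)
    return np.stack([np.sin(t * (i + 1) / 4) for i in range(dim)], axis=1)

def pretrain(features, mask_rate=0.3, epochs=200, lr=0.05):
    """Mask random interior frames and learn to reconstruct each masked
    frame from its two neighbours (a tiny linear stand-in for the large
    networks used in real self-supervised speech models)."""
    n, d = features.shape
    W = np.zeros((2 * d, d))                 # neighbours -> masked frame
    losses = []
    for _ in range(epochs):
        masked = rng.random(n - 2) < mask_rate
        idx = np.where(masked)[0] + 1        # interior frames only
        context = np.hstack([features[idx - 1], features[idx + 1]])
        err = context @ W - features[idx]    # reconstruction error
        losses.append(float((err ** 2).mean()))
        W -= lr * context.T @ err / max(len(idx), 1)
    return W, losses

feats = make_features()
W, losses = pretrain(feats)
print(f"reconstruction loss: {losses[0]:.3f} -> {losses[-1]:.3f}")
```

Because the objective is derived from the audio alone, any recording – including the child speech that is so rarely transcribed – becomes usable training data; labeled examples are then only needed for a much smaller fine-tuning stage.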

Now that the majority of children have returned to the classroom, the technology adopted at the height of the pandemic has become part of the everyday. When that "everyday" is a classroom full of noise, speech-to-text software faces yet another obstacle on its path to accuracy: background noise. Here again, Speechmatics has seen incredible success with our Autonomous Speech Recognition. Whatever the challenge – pitch, reverb or volume – our latest round of testing shows we're far ahead of our competitors in accuracy.

Benedetta Cevoli, Data Science Engineer, Speechmatics
