Jan 19, 2022

Understanding Children’s Voices: How Voice-to-Text Assists eLearning

Read about how the pandemic has had a profound effect on education, how voice-to-text technology misunderstands young voices, and whether we can rely on it to help educate the next generation.
Benedetta Cevoli, Senior Machine Learning Engineer

The COVID-19 pandemic has had an enormous impact on the education of children the world over. With school closures and teacher absenteeism, the once-regimented structure of education was disrupted like never before. As we start to piece back together the process of learning, could speech-to-text recognition not only help prevent children slipping further behind, but also help them catch up to pre-pandemic levels?

The adoption of eLearning tools has accelerated remarkably because of the pandemic. China, for example, sent 250 million children home for online classes. In-depth learning still had to continue, however. The World Economic Forum reports that Zhejiang University uploaded more than 5,000 courses online a mere two weeks after the transition began, using a system called DingTalk ZJU.

This home learning was replicated all over the world, with at least 210 countries and billions of students affected by the pandemic and stay-at-home orders. This forced teachers and students alike to become familiar with scanning documents, hosting online sessions, and other unfamiliar technologies – transcription software included.

Seen But Not Heard

Far too often, we bracket children at around the age of 2 into “those who speak” and “those who don’t”. We forget that children keep learning to speak well beyond those early years. While adults have heard most everyday words time and time again, and used them just as much, most children discover new ones practically every day. When they repeat them, they do so in different ways, getting used to the sounds these words make. This process of trial and error is one of many reasons why current voice recognition isn’t as accurate with children as it is with adults.

But children’s voices differ from adults’ in a variety of other ways too. It’s not just the obvious difference in pitch; the speech patterns themselves often trip up voice recognition. As reported by TechCrunch, children can stress different parts of words to adults, and they can over-enunciate, punctuate differently and follow fewer of the common cadences. Historically, all of this has led to children’s voices being disproportionately failed by a technology that focused primarily on adults.

Engaging Young Voices

It has been argued for some time now that subtitles can play a huge role in helping develop literacy skills. After all, the more children get to see words in action, the easier it is for them to understand and repeat them. It stands to reason, then, that live captioning in classrooms would have similar benefits. But when young voices are still often misunderstood by speech-to-text technology, can we really rely on it to help teach our children? The answer is obvious: make the technology more accurate.

And that’s exactly what we’ve done at Speechmatics. We’ve seen our accuracy improve hugely and the accuracy gap between adults and children shrink dramatically, thanks in large part to the introduction of self-supervised learning (SSL) into our training.

Before SSL, there was a frustrating shortage of data to train on, especially when it came to young voices. In fact, we were left to train on 30,000 hours of audio, most of which was adult speech. Since SSL, we can train on 1.1 million hours of audio, more than 36 times as much, which greatly increases the number of children’s voices in the training data.
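To make the idea of an accuracy gap concrete, here is a minimal sketch of word error rate (WER), a metric commonly used to score speech-to-text output. This post does not say how Speechmatics measures accuracy, so treat this as an illustration rather than our evaluation method. The function name and the sample transcripts are invented for the example.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level edit distance divided by the number of reference words."""
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words seen
    # so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1       # reference word missing from hypothesis
            insertion = curr[j - 1] + 1  # extra word in hypothesis
            curr.append(min(substitution, deletion, insertion))
        prev = curr
    return prev[-1] / len(ref)


# Hypothetical transcripts, invented purely for illustration.
reference = "please open the book to page ten"
adult_output = "please open the book to page ten"
child_output = "peas open the book to page tin"

adult_wer = word_error_rate(reference, adult_output)
child_wer = word_error_rate(reference, child_output)
print(f"adult WER: {adult_wer:.2f}, child WER: {child_wer:.2f}")
print(f"gap: {child_wer - adult_wer:.2f}")
```

A lower WER means a more accurate transcript, so a shrinking difference between the two scores is one way a narrowing gap between adult and child speech could show up in testing.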

Now that the majority of children have returned to the classroom, the technology adopted at the height of the pandemic has become part of the everyday. When that “everyday” is a classroom full of noise, speech-to-text software has yet another obstacle in its path to accuracy: background noise. Here again, Speechmatics has seen incredible success with our Autonomous Speech Recognition. Whether the audio varies in pitch, reverb or volume, the results of our latest round of testing show we’re far ahead of our competitors in terms of accuracy.

Benedetta Cevoli, Data Science Engineer, Speechmatics