Feb 23, 2022 | Read time 4 min

Emotion and Voice: The Next Goal for Speech-to-Text

Learn more about the importance of emotion in speech-to-text communication and the advancements Speechmatics are making to help understand every voice.
Benedetta Cevoli, Senior Machine Learning Engineer

Your voice is an incredible indicator of how you're feeling. How often have we "sensed" that someone is not doing well, even when they tell us they are? In many ways, your voice is as expressive as body language or facial expression. It communicates more than we realize we're letting on and, critically, emotion in our voice is challenging to mask.

As we interact more with machines, especially voice assistants, how important is knowing how someone is feeling – as opposed to what someone is saying – when it comes to speech-to-text? While we can mask emotion or misread cues in visual communication, if we're able to unlock how someone is feeling through their voice, is there more scope for voice technology to help us understand each other?

Emotional Intelligence

Speaking to the Inside Voice podcast, Rana Gujral, the CEO of Behavioural Signals (specialists in emotional conversations with AI), discussed where we are now with voice and emotion. "We're talking to machines, but it's a very one-sided interaction where we're giving commands," Rana explained. "We're not really having a dialogue; we're not really having a conversation. And that was the promise of these virtual assistants."

If a machine cannot relate to our emotions, the two-way street of dialogue breaks down. Without empathy or sympathy, a considerable barrier can be created. So how can we make machines emotionally intelligent? What will the benefits be once we can?

One clear use case for improved emotion recognition in voice technology is the contact center industry. On the agent's side, machines have guided workers for years, for example by flagging that they are talking too slowly or that their client sounds tired. But to repeat Rana's point, if the other side of the conversation doesn't understand emotion, is it even a conversation? Emotion recognition is essential for an empathetic and affective dialogue between humans and machines.

Accuracy and Emotion

When it comes to understanding every voice, emotion recognition is too crucial a factor not to consider. Speech recognition accuracy varies significantly with someone's emotional state. If automatic speech recognition (ASR) works 'only' with neutral speech, any real-life application becomes problematic. Emotions are part of every natural human interaction and play a considerable role in speech production and comprehension.

At Speechmatics, research is at the heart of everything we do, and we plan on delving into emotion more this year. In 2020, we conducted initial research into using self-supervised learning for emotion recognition. With the introduction of self-supervised models into our latest engine, we can now look to leverage this research and explore the "rich underlying structure of audio" that is otherwise missed when relying solely on human-labeled data. Moreover, we can effectively explore different domains such as TV broadcasts and phone calls with less human-labeled data.

To begin to understand the external impacts of human emotion, we wanted to see how our technology would handle poor quality – or noisy – data. We tested 6 hours of audio taken from meetings, earnings calls, online videos, and a host of other real-world examples that included ambient sounds (phones, machine noise, different conversations, etc.).

To make it even more of a challenge, we randomly changed the pitch, reverb, and volume levels, too – the sort of aspects varied emotions would also affect.
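The article doesn't detail the evaluation setup, but the kinds of perturbations described (random volume, pitch, and reverb changes) can be illustrated with a minimal, assumed sketch. This is not Speechmatics' actual pipeline – just a self-contained numpy example of how such random audio augmentations are commonly applied:

```python
import numpy as np

rng = np.random.default_rng(0)

def augment(audio: np.ndarray, sr: int = 16000) -> np.ndarray:
    """Apply random volume, pitch/speed, and reverb perturbations to a mono signal."""
    # Random gain between -6 dB and +6 dB (volume perturbation)
    gain_db = rng.uniform(-6.0, 6.0)
    audio = audio * 10 ** (gain_db / 20)

    # Naive pitch/speed shift: resample by a random factor via linear interpolation
    factor = rng.uniform(0.9, 1.1)
    idx = np.arange(0, len(audio), factor)
    audio = np.interp(idx, np.arange(len(audio)), audio)

    # Simple synthetic reverb: convolve with an exponentially decaying impulse response
    ir = np.exp(-np.linspace(0, 8, int(0.2 * sr)))  # 200 ms decaying tail
    ir /= ir.sum()
    audio = np.convolve(audio, ir)[: len(audio)]
    return audio

# Example: augment one second of a 440 Hz tone at 16 kHz
t = np.linspace(0, 1, 16000, endpoint=False)
tone = np.sin(2 * np.pi * 440 * t)
noisy = augment(tone)
```

In practice, production systems typically use dedicated audio libraries (e.g. resampling-based pitch shifters and recorded room impulse responses) rather than these toy operations, but the principle – randomizing gain, rate, and reverberation during evaluation or training – is the same.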

These real-world factors are an accessible, relatable start to uncovering how emotion impacts speech-to-text technology, as these different scenarios bring diverse levels of emotion to a person's voice. As we look to increase the understanding of every voice in our technology, analyzing situations like these is vital: the way we talk and the emotion in our voice are shaped not only by how we feel but also by our surroundings.

Emotion recognition is a key aspect of Speechmatics' aim of understanding every voice.

Benedetta Cevoli - Data Science Engineer, Speechmatics
