Your voice is an incredible indicator of how you're feeling. How often have we “sensed” that someone is not doing well, even when they tell us they are? In many ways, your voice carries as much information as body language or facial expression. It communicates more than we realize we're letting on and, critically, the emotion in our voice is hard to mask.
As we interact more with machines, especially voice assistants, how important is knowing how someone is feeling – as opposed to what someone is saying – when it comes to speech-to-text? We can mask emotion or misread cues in visual communication. But if we can unlock how someone is feeling through their voice, is there more scope for voice technology to help us understand each other?
Emotional Intelligence
Speaking to the Inside Voice podcast, Rana Gujral, the CEO of Behavioural Signals (specialists in emotional conversations with AI), discussed where we are now, with voice and emotion. "We're talking to machines, but it's a very one-sided interaction where we're giving commands," Rana explained. "We're not really having a dialogue; we're not really having a conversation. And that was the promise of these virtual assistants.”
If a machine cannot relate to our emotions, the two-way street of dialogue breaks down. Without empathy or sympathy, a considerable barrier remains. So how can we make machines emotionally intelligent, and what will the benefits be once we can?
One clear use case for improved emotion recognition in voice technology is the contact center industry. On the agent side, machines have guided workers for years, flagging, for example, that they are talking too slowly or that their client sounds tired. But to repeat Rana's point, if the other side of the conversation doesn't understand emotion, is it even a conversation? Emotion recognition is essential for an empathetic and affective dialogue between humans and machines.
Accuracy and Emotion
When it comes to understanding every voice, emotion is too crucial a factor not to consider. Speech recognition accuracy varies significantly with a speaker's emotional state. If automatic speech recognition (ASR) works ‘only’ with neutral speech, any real-life application becomes problematic: emotions are part of every natural human interaction and play a considerable role in both speech production and comprehension.
At Speechmatics, research is at the heart of everything we do, and we plan to delve deeper into emotion this year. In 2020, we conducted initial research into using self-supervised learning for emotion recognition. With the introduction of self-supervised models into our latest engine, we can now look to leverage this research and explore the “rich underlying structure of audio” otherwise missed when relying on human-labeled data alone. Moreover, we can effectively explore different domains – such as TV broadcasts and phone calls – with less human-labeled data.
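To make the idea concrete, here is a minimal sketch of how a self-supervised speech encoder can be reused for emotion recognition: a pretrained representation is kept frozen and a small classification head is trained on top of pooled features. This is an illustration only, not Speechmatics' engine or research code – the wav2vec 2.0 checkpoint, the frozen-encoder setup, mean pooling, and the four-emotion label set are all assumptions made for the example.

```python
# Illustrative sketch: frozen self-supervised encoder + small emotion head.
# Model name and label set are assumptions, not Speechmatics' implementation.
import torch
import torch.nn as nn
from transformers import Wav2Vec2FeatureExtractor, Wav2Vec2Model

EMOTIONS = ["neutral", "happy", "sad", "angry"]  # assumed label set


class EmotionClassifier(nn.Module):
    def __init__(self, encoder_name: str = "facebook/wav2vec2-base"):
        super().__init__()
        self.encoder = Wav2Vec2Model.from_pretrained(encoder_name)
        self.encoder.requires_grad_(False)           # keep the SSL encoder frozen
        hidden = self.encoder.config.hidden_size     # 768 for the base model
        self.head = nn.Linear(hidden, len(EMOTIONS))

    def forward(self, input_values: torch.Tensor) -> torch.Tensor:
        # (batch, samples) -> (batch, frames, hidden)
        frames = self.encoder(input_values).last_hidden_state
        pooled = frames.mean(dim=1)                  # mean-pool over time
        return self.head(pooled)                     # emotion logits


if __name__ == "__main__":
    extractor = Wav2Vec2FeatureExtractor.from_pretrained("facebook/wav2vec2-base")
    waveform = torch.randn(16000)                    # 1 second of dummy 16 kHz audio
    inputs = extractor(waveform.numpy(), sampling_rate=16000, return_tensors="pt")
    model = EmotionClassifier()
    print(model(inputs.input_values).shape)          # torch.Size([1, 4])
```

The appeal of this setup is that only the small head needs labeled emotion data, while the encoder has already learned the underlying structure of speech from unlabeled audio.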
To begin to understand the external factors that shape human emotion, we wanted to see how our technology would handle poor quality – or noisy – data. We tested 6 hours of audio taken from meetings, earnings calls, online videos, and a host of other real-world examples that included ambient sounds (phones, machine noise, background conversations, etc.).
To make it even more of a challenge, we randomly changed the pitch, reverb, and volume levels, too – the sort of characteristics that varied emotions would also affect.
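For readers who want to recreate this kind of perturbation, the sketch below applies random pitch, reverb, and volume changes to a clip using librosa and scipy. It is an illustrative approximation, not the pipeline used in our testing – the shift ranges, the synthetic impulse response, and the wet/dry mix are all assumptions.

```python
# Illustrative recreation of random pitch, reverb, and volume perturbations.
# Ranges and reverb model are assumptions, not the exact test pipeline.
import numpy as np
import librosa
from scipy.signal import fftconvolve

rng = np.random.default_rng(0)


def random_pitch(y: np.ndarray, sr: int) -> np.ndarray:
    # Shift pitch by up to +/- 3 semitones (range is an assumption)
    return librosa.effects.pitch_shift(y, sr=sr, n_steps=rng.uniform(-3, 3))


def random_reverb(y: np.ndarray, sr: int) -> np.ndarray:
    # Convolve with a synthetic exponentially decaying impulse response
    length = int(sr * rng.uniform(0.1, 0.5))          # 100-500 ms tail
    ir = rng.standard_normal(length) * np.exp(-np.linspace(0, 8, length))
    wet = fftconvolve(y, ir)[: len(y)]
    return 0.7 * y + 0.3 * wet / (np.max(np.abs(wet)) + 1e-9)


def random_volume(y: np.ndarray) -> np.ndarray:
    # Scale amplitude anywhere from -12 dB to +6 dB
    gain_db = rng.uniform(-12, 6)
    return y * (10 ** (gain_db / 20))


def perturb(path: str):
    y, sr = librosa.load(path, sr=16000, mono=True)
    y = random_pitch(y, sr)
    y = random_reverb(y, sr)
    y = random_volume(y)
    return np.clip(y, -1.0, 1.0), sr
```

Running transcription over audio perturbed in this way gives a rough sense of how robust a model is to the acoustic variation that emotion and environment introduce.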
These real-world factors are an accessible, relatable starting point for uncovering how emotion impacts speech-to-text technology, as these different scenarios bring out diverse levels of emotion in a person’s voice. As we look to increase the understanding of every voice in our technology, analyzing situations like these is vital: the way we talk, and the emotion we express, is shaped not only by our feelings but also by our surroundings.
Emotion recognition is a key aspect of Speechmatics' aim of understanding every voice.
Benedetta Cevoli - Data Science Engineer, Speechmatics