Mar 28, 2025 | Read time 3 min

The future of voice AI: 3 experts weigh in on what's next

Three experts explore the biggest challenges—and breakthroughs—shaping the next generation of truly natural, intelligent voice AI.
Ricardo Herreros-Symons, Chief Strategy Officer

Where voice AI is heading—and why it still struggles to sound human

Tech futurist Robert Scoble, AI analyst Irena Cronin, and I (Ricardo Herreros-Symons) dive into the future of voice AI, covering emotion, latency, speaker recognition, and more.

👇 Catch the full X Space below, or read on for standout moments and expert insights:

When voice assistants first appeared, they could barely handle basic commands. Now, they’re everywhere — from our homes to hospitals to call centers. But are they actually getting better?

Despite huge leaps in tech, most systems still struggle with natural conversation, overlapping speech, and real-world noise. To unpack what’s working (and what still isn’t), three leading voices in AI sat down for a live discussion on where voice tech is heading next.

The human conversation challenge

"At Thanksgiving dinner, I had 15 people in my house all in one room," Scoble explained, setting up the fundamental challenge. "A human can talk to five people all at one time... and be able to interact with each person in real time. We're not quite there yet with Chat GPT advanced voice or Google Gemini's voice".

This multi-speaker recognition capability – known as speaker diarization – is one of the biggest hurdles for voice AI today. It's a problem we humans solve effortlessly at parties, but it's been a major stumbling block for machines.

"In a public space, our system at Speechmatics can say, as soon as the speaker starts speaking, just focus on their voice. Don't be interrupted by anyone else," Ricardo explained. "For me, that's been the most useful feature".

Real-world applications are extensive. Cars can distinguish between drivers and passengers, ensuring only authorized users can change routes. Hospitals can accurately transcribe multi-party consultations. And at home, your smart speaker might finally stop responding to random snippets of TV dialogue.

The awkward pause problem

"How hard is it to build this system to get the latency down?" Scoble asked, diving into the next critical challenge.

Current voice systems still have all the conversational timing of a nervous teenager on a first date. The unnatural delays between what you say and the AI's response break the flow of natural conversation.

"The more latency you give in a transcription, the more context the engine has," Ricardo pointed out. "But what you really want is anything much quicker than a second, because it almost gets into that difficult barging or cross-talking environment".

The sweet spot appears to be around 0.6 seconds—fast enough to feel responsive but not so quick that it interrupts. But achieving this requires teaching machines to understand conversational nuances like pauses versus full stops.
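One way to picture the pause-versus-full-stop problem is as an endpointing rule: a short silence is the speaker thinking, a longer one means the turn is over. The sketch below is a simplified heuristic with assumed values (the 0.6-second budget, the 20 ms frames, and the is_speech hook are illustrative, not any particular engine's settings):

```python
# Simplified end-of-turn heuristic: a short silence is treated as a pause
# inside an utterance, a longer one as a full stop. The 0.6 s budget and
# the is_speech() hook are assumptions for illustration, not a real API.

END_OF_TURN_SILENCE = 0.6   # seconds of silence before the assistant replies
FRAME_SECONDS = 0.02        # audio arrives in 20 ms frames

def detect_end_of_turn(frames, is_speech):
    """Return the time (in seconds) at which the user's turn ended, or None if still talking."""
    silence = 0.0
    elapsed = 0.0
    heard_speech = False
    for frame in frames:
        elapsed += FRAME_SECONDS
        if is_speech(frame):
            heard_speech = True
            silence = 0.0                      # speech resets the silence counter
        else:
            silence += FRAME_SECONDS
            if heard_speech and silence >= END_OF_TURN_SILENCE:
                return elapsed                 # long enough to count as a full stop: respond now
    return None                                # still mid-utterance (or nothing said yet)
```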

It's not what you say, but how you say it

Beyond simply transcribing words accurately and quickly, voice AI is increasingly picking up on how we express ourselves – our tone, emotion, and emphasis.

"You keep mentioning context," Scoble put to Ricardo "Other NLPs are starting to pass along the emotion of the person speaking to it. So it understands extra context and is able to have the LLM respond more in a human way".

Speaking of Speechmatics' approach, Ricardo elaborated: "We probably term this as paralinguistic models. That's an umbrella term for a lot of different things. It can be the emotion, the tone, the pitch. And obviously, as human beings, we understand that when we have a conversation".

These emotional cues are being captured as "paralinguistic tokens" that can be passed through language models and into text-to-speech systems. The result is voice AI that might actually respond appropriately when you're frustrated – instead of cheerfully misunderstanding you for the fifth time.
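As a rough illustration (the tag format and field names below are invented for the example, not Speechmatics' actual token scheme), paralinguistic information can be pictured as annotations that travel with each transcript segment into the language model's prompt:

```python
# Invented illustration of "paralinguistic tokens": the recognizer emits
# tone/emotion tags alongside the words, and those tags ride along into the
# prompt so the language model can respond to how something was said.

segment = {
    "text": "I've already reset the router twice.",
    "paralinguistics": {"emotion": "frustrated", "pitch": "raised", "pace": "fast"},
}

def to_tagged_text(seg):
    tags = " ".join(f"<{key}:{value}>" for key, value in seg["paralinguistics"].items())
    return f"{tags} {seg['text']}"

prompt = (
    "You are a support assistant. The user's words are preceded by "
    "paralinguistic tags describing how they were spoken.\n"
    f"User: {to_tagged_text(segment)}\n"
    "Assistant:"
)
print(prompt)
# The <emotion:frustrated> tag lets the model acknowledge the frustration
# instead of cheerfully repeating the same troubleshooting step.
```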

Moving the brains from cloud to device

As AI systems grow more sophisticated, a significant shift is happening in where the processing takes place. Instead of sending everything to distant servers, more voice AI is running directly on your devices.

"50% of this is going to be the hybrid model for sure. But being able to run a lot of this natively is going to be really important, because the costs will just escalate otherwise," Ricardo predicted.

This shift comes down to three factors: reducing delays, lowering costs, and addressing privacy concerns. For applications like in-car assistants or home devices, maintaining functionality when your internet drops out is crucial.

"I don't want to pay the same amount as I'd have to do to take human time to be able to do that," Ricardo noted, highlighting the economic imperative of making voice AI accessible and cost-effective.

Cultural and linguistic complexity

Voice AI is also navigating the fascinating quirks of different languages and speech patterns. When the conversation turned to real-world applications across different markets, Ricardo shared insights on how language structure affects communication styles.

"In Finnish or Suomi, the pronouns, since the person doing the action quite often appears at the end of the sentence, which means when somebody is talking, you don't actually know who's doing what until the very last word," Ricardo explained. This language structure might explain why Finnish speakers often come across as patient and economical with words.

By contrast, "Spanish, we use an awful lot of words to say not very much, which means we also shout a lot and it sounds like we're always very angry at each other. We're not, we're just being expressive on that front".

These cultural differences mean voice AI systems need to adapt not just to different vocabulary and grammar, but entirely different conversational styles.

Making AI listen to everyone

Perhaps most importantly, advances in self-supervised learning are making voice AI more accessible to diverse speakers. By training on "millions of hours of unlabeled data" from speakers around the world, newer systems can better understand different accents, speech patterns, and even speech affected by disabilities.

When Scoble asked about accessibility for people with speech difficulties, including his son, who has special needs and whose speech is very hard to understand, Ricardo explained Speechmatics’ approach: "We are going to listen to millions and millions of hours of conversation from all across the world. Different voices, different acoustic environments, and we're going to train our self-supervised models to recognize the different representations."

This technological approach means systems require far less training data to achieve high accuracy across diverse populations.

"We can then take that and that's why we do very well on voices which are less well represented," noted Ricardo, highlighting Speechmatics' inclusive approach to voice recognition.

The near future

As the discussion with X Space attendees concluded, the experts were optimistic about rapid progress in the coming months. Voice AI's ability to process multiple speakers, respond with human-like timing, convey appropriate emotion, and function efficiently on local devices is improving at an unprecedented pace.

"Even by late this year, I think there'll be some interactions where you'll be able to say, wow, that really did feel natural in the way it laughed and the way it coughed or shouted at me," predicted Ricardo.

"What does this look like in 10 years?" Scoble asked. "What are the challenges to getting to the space of 10 years from now where we have robots in our homes and augmented reality glasses on our face, virtual beings that are walking around our house?".

Ricardo suggested that even five years is difficult to predict given the rapid pace of change. What seems clear is that truly natural AI conversation will require multimodality – combining voice with facial expressions, gestures, and other physical cues.

"If you actually want to get to the next level of natural conversation, we're going to need to be bringing in the physical cues," Ricardo explained. "What are the lips doing? What's the face doing? What is the gesture?".

The consensus among Scoble, Cronin, and Ricardo was clear: we're approaching a point where voice AI will finally begin to match the natural, contextual understanding that humans take for granted in everyday conversation.

For users, this means more intuitive interfaces; for developers, it opens new possibilities for applications we've only begun to imagine.

Listen to the full conversation here on X and sign up to their newsletter here.

The best ears in AI

Customer relationships are built on listening. Speechmatics ensures your AI understands every word spoken.