Why Google and OpenAI’s latest announcements don’t solve all the challenges of AI Assistants
Understanding every voice is necessary for creating great AI assistants, but it is also a huge challenge.
Trevor Back, Chief Product Officer
Will Williams, Chief Technology Officer
A renewed focus on Conversational AI
Back in February, we shared our belief that “Her”-like AI assistants lie in our future. The movie is a great North Star for our vision because of its focus on natural, truly seamless interaction.
We also claimed that speech will be a core part of any future AGI stack.
Last week’s demos from OpenAI and Google have shown that Big Tech also recognize this vision to be true.
In the space of just a week, both companies held events to showcase their latest updates, and both showed a renewed focus on audio-centric interaction layers with their LLM stacks, and on how this enables a more conversational AI device. These demos reignited media enthusiasm for assistants after some less well-received product reviews of Humane AI and Rabbit.
Some of the most impressive demos focused on a ‘multimodal’ application, able to use text, images and audio inputs together. A particularly exciting moment for many observers was the ability to use this multimodal input to provide a personalized tutor to help a student learn calculus and trigonometry.
Our main takeaway, however, was that the demos all used audio input, our voices, as the method for interaction. No keyboards, no touch screens, no mice, no eye tracking (not even a brain interface). But our most natural, most seamless, most human way of interacting with our world: our voice.
The shift to audio-driven interaction
A simple way to think about these new assistant models is as three core components:
Multimodal input, including Audio-in for user interaction (using speech-to-text)
Intelligence stack (often using LLMs)
Audio-out (using speech synthesis)
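To make that decomposition concrete, here is a deliberately minimal sketch in Python. The function names (transcribe, generate_reply, synthesize) and the stubbed return values are hypothetical placeholders for illustration only, not any vendor’s actual API.

```python
# A minimal sketch of the three-component assistant loop described above.
# Every function body here is a stub standing in for a real model.

def transcribe(audio_chunk: bytes) -> str:
    """Audio-in: speech-to-text turns raw audio into a transcript."""
    return "tell me a bedtime story about robots and love"  # stubbed result

def generate_reply(transcript: str) -> str:
    """Intelligence stack: an LLM (or any policy) maps the transcript to a reply."""
    return "A bedtime story it is: once upon a time..."  # stubbed result

def synthesize(reply: str) -> bytes:
    """Audio-out: speech synthesis turns the reply text back into audio."""
    return reply.encode("utf-8")  # stand-in for generated audio samples

def assistant_turn(audio_chunk: bytes) -> bytes:
    transcript = transcribe(audio_chunk)   # 1. multimodal / audio input
    reply = generate_reply(transcript)     # 2. intelligence
    return synthesize(reply)               # 3. audio-out

if __name__ == "__main__":
    print(assistant_turn(b"\x00\x01"))
```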
Google I/O showcased several demos which utilize Gemini’s impressive multimodal capabilities.
OpenAI’s demo showcased their impressive progress in speech synthesis: how to get an application to talk in an emotive, human-like way. Over the last year, speech synthesis has seen incredible leaps forward from what we’re used to with Siri, Alexa or Google Assistant.
A few commentators (The Guardian and Bloomberg, to name two) were critical of just how much OpenAI’s assistant sounded like the AI from “Her” (voiced by Scarlett Johansson), but the range of emotion was nonetheless impressive to experience.
One of the major benefits of this progress in speech synthesis is that it can give the impression of deeper understanding in the other components of the tech stack, even when that understanding may not be present.
For example, in the OpenAI demo, we heard the following interaction:
“I want you to tell my friend a bedtime story about Robots and Love”
“Ooooooo, a bedtime story about Robots and Love? I got you covered…”
By responding quickly to utterances with human-like (but ultimately filler) phrases, the perception of responsiveness undoubtedly increases, but this does not demonstrate that the actual speech understanding has also improved. Lots of these filler responses were demonstrated – often repeating the question back in the same order, with some personality thrown in for good measure.
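One plausible way to create that impression, sketched below purely as an illustration of the general technique rather than of how OpenAI actually implements it, is to emit a canned acknowledgement the instant the user stops speaking, while the slower reply generation runs in the background.

```python
import asyncio
import random

# Sketch of the "instant filler, slower substance" pattern. The perceived
# latency is the time to the filler phrase, not the time to the real answer.

FILLERS = [
    "Ooooh, great question. Let me think...",
    "I got you covered...",
]

async def generate_full_reply(prompt: str) -> str:
    await asyncio.sleep(1.5)  # stand-in for LLM + speech-synthesis latency
    return f"Here is my actual answer to: {prompt!r}"

async def respond(prompt: str) -> None:
    print(random.choice(FILLERS))              # instant, content-free acknowledgement
    print(await generate_full_reply(prompt))   # the substantive reply arrives later

asyncio.run(respond("tell my friend a bedtime story about robots and love"))
```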
Speechmatics' commitment to shaping the future of AI assistants
At Speechmatics, we believe that deeply understanding the audio-in is required to build the seamless AI assistant of the future. This is why we focus so much of our time and energy on the first part of the component stack: the audio-in.
We’ve been focused on this for over a decade, and here are some reasons we think it truly can make a difference for the AI assistants of the (near) future:
GPT-4o: likely using a “mini-batch” methodology, in blocks of question and response.
Speechmatics: real-time first.
GPT-4o likely uses a “mini-batch” methodology, where the end of the user’s speech is detected and a full response is generated immediately. This can lead to moments where the system responds unnaturally fast, interrupts the user, or fails to recognize that it has been interrupted.
We use streaming ASR to recognize each word as it is spoken. Our future systems will be able to reformulate responses mid-response, interrupt gracefully, handle crosstalk and incorporate background events without triggering awkward transitions.
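The contrast between the two approaches can be sketched as follows; the recognize function is a stand-in for a real acoustic and language model, not a specific product API.

```python
from typing import Iterator, List

def recognize(audio: bytes) -> str:
    # Stand-in for a real acoustic + language model.
    return f"<transcript of {len(audio)} bytes>"

def minibatch_transcribe(utterance_chunks: List[bytes]) -> str:
    """Wait for end-of-speech, then transcribe the whole utterance in one shot."""
    audio = b"".join(utterance_chunks)
    return recognize(audio)

def streaming_transcribe(chunk_stream: Iterator[bytes]) -> Iterator[str]:
    """Emit (and revise) a partial transcript after every chunk, so downstream
    logic can react mid-utterance: reformulate, yield to an interruption, etc."""
    context = b""
    for chunk in chunk_stream:
        context += chunk
        yield recognize(context)

if __name__ == "__main__":
    chunks = [b"\x00" * 3200, b"\x01" * 3200, b"\x02" * 3200]
    print("mini-batch:", minibatch_transcribe(chunks))
    for partial in streaming_transcribe(iter(chunks)):
        print("streaming:", partial)
```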
In the OpenAI demo, there were moments when the assistant failed to recognize it was speaking to multiple people. Understanding multiple speakers is critical to many use cases.
Speechmatics offers best-in-class speaker diarization. This is speaker-based (rather than channel-based) diarization: we can recognize multiple different speakers within the same audio channel.
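As an illustration of what that output looks like, the sketch below attaches a speaker label to each recognized word and collapses them into readable turns. The words, timings and labels are invented for the example; they are not output from any real API.

```python
from dataclasses import dataclass
from itertools import groupby
from typing import List

@dataclass
class Word:
    text: str
    start: float   # seconds from the start of the audio
    speaker: str   # label inferred from the audio itself, e.g. "S1", "S2"

# Illustrative word-level output for a single audio channel with two speakers.
words = [
    Word("tell", 0.10, "S1"), Word("us", 0.30, "S1"), Word("a", 0.40, "S1"),
    Word("story", 0.55, "S1"),
    Word("make", 1.20, "S2"), Word("it", 1.35, "S2"), Word("about", 1.50, "S2"),
    Word("robots", 1.70, "S2"),
]

def speaker_turns(words: List[Word]) -> List[str]:
    """Collapse consecutive words from the same speaker into readable turns."""
    return [
        f"{speaker}: " + " ".join(w.text for w in group)
        for speaker, group in groupby(words, key=lambda w: w.speaker)
    ]

for turn in speaker_turns(words):
    print(turn)
```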
On a few occasions, GPT-4o reacted to audience audio rather than the presenter. For an assistant to be useful in the real world, it needs to be robust to background noise (like traffic, a crowd, or an airport) and to background speakers (such as when you’re sat on a train).
Our ASR is specifically trained to be robust to background noise (just take this example from a referee in a basketball game, with the audience audible in the background).
GPT-4o: American English and simple Spanish showcased; a full evaluation is pending.
Speechmatics: 50+ languages, including strong accent and dialect coverage.
The OpenAI employees showcasing the demo had clear American-English voices. ASR is known to work extremely well for these voices. But how well does it work for understanding every voice? Every accent? Every language?
Speechmatics is independently verified as the most accurate ASR for a wide range of languages, accents and dialects. We want our technology to enable the understanding of every voice, not just the most prominently represented. The devil is in the detail when it comes to accents.
While OpenAI’s assistant responded with very low latency most of the time, there were still occasions when the wait for a response was of unpredictable length. This variability in response time can be frustrating for a user. We provide premium products for our premium enterprise customers. This means robust, consistent, and reliable products that ‘just work’.
Overcoming challenges to enhance AI accessibility for all
Understanding every voice is both necessary for creating great AI assistants and a huge challenge.
Whilst the announcements were truly impressive and represent a further evolution of AI technology, we’re 100% focused on developing the speech understanding required to make these AI assistants accessible to everyone.
What is clear is that others share our vision for the critical role of speech in how we interact with technology. Speech will be a core component of any future AGI stack.
We’re in an enviable position to capitalize on our decades of expertise in audio. Of course, we’ll be making a few of our own announcements over the next few months… stay tuned 👀.