A key aspect of intelligence is the rate of skill acquisition – the reason humans don't face this same long-tail problem is that we can generalize from a small number of hours of real-life interaction. Any worthwhile ASR (or indeed AI) system of the future must evolve in the same direction.
In today's landscape of Large Language Models (LLMs), automation, and generative AI, the significance of this accuracy has only grown. Any downstream application or value extracted from speech data is directly proportional to the precision of the original transcript. It's a classic case of "garbage in, garbage out".
Given the above, whoever achieves superb accuracy across every language will always be the preferred choice for any 'conversational AI stack'. If you plan to use speech, you need the best at capturing speech. We are, and we will continue to pursue this.
Our place in generative AI
Many might argue that the advancements in speech-to-text technology only impact those already embedded in the voice-driven ecosystem — CCaaS providers, captioning companies, podcasters and the like. Yet, the scope is broader. Most AI interactions today revolve around rigid scenarios such as asking Alexa to set a timer or getting an automated set of meeting notes. We must think bigger.
AI assistants today are just the beginning. The goal is for people to talk to most tech tools, just as they have been communicating for thousands of years – with their voices. Speaking is often both more natural and more impactful than typing. The tech of the future should not just hear but truly understand our spoken words. Instead of fumbling with clunky LLM prompts and handoffs between multiple systems, we should be able to simply talk, and the AI should respond in kind. No latency and no misheard words. The vision is a seamless voice interface powered by a fully speech-to-speech neural network.
In the future, people in both their work environments and in their personal lives will be able to interact with technology in this way. Come and join us to help build that future.
Sure, speech technology isn't a panacea here. It won't be the case that we only use our voices to interact with tech – but it's a string we simply don't yet have in our collective bow. Current-generation systems are good, but we need to get much closer to 'perfect' before this kind of future comes into view.
What's next?
Short term, we're looking to build our next-generation self-supervised models to strengthen our ASR foundations. As always, we will be adding key languages to our offering. On the capabilities front, we have some new APIs that we're excited to pair with our high-accuracy ASR – more on that soon. We will also be expanding our Speech Intelligence stack and increasing its impact by providing solutions tailored to specific customer needs.
Long term, as an AI company we remain committed to investing in paradigm changes that make these seamless voice interfaces of the future a reality. We're excited about our direction and the future development of Speech Intelligence – our success in powering speech technology will always be built on the inclusive and consistently accurate foundations of automatic speech recognition.