Actions, not words. Making speech technology useful.

The Evolution of Speech and the Unfulfilled Dreams of Sci-Fi Technology

Since the dawn of language, we have harnessed the power of speech to share ideas, tell stories, build relationships, teach new generations, bridge gaps, and reveal our inner thoughts to others. Speech is not merely a means of communication, but a cornerstone of human civilization. Our brains have evolved to make speech an innate and universal human ability. Speech is a high-bandwidth way to communicate, conveying information faster than written text with a greater depth of nuance and emotional content.

Speechmatics has been at the forefront of automated speech recognition for close to a decade. During this time, Speechmatics has been focused on using AI to give machines this most basic of human traits, the ability to comprehend speech. By building multiple breakthroughs into our product, we have transformed the quality and performance available and raised the expectations that customers have in this market. And make no mistake, speech is a challenging domain. Audio data is inherently temporal: the signals are of variable length, with complex features and very high dimensionality. Alongside that, there is limited labeled data, and labeling is an expensive exercise to undertake.

Yet we're still a long way from what science fiction parades as the north star of speech technology… C-3PO was fluent in over six million forms of communication, not only understanding what is said but able to translate and comprehend the intent in real-time. The Star Trek Delta badge recognizes, communicates, and actions commands from its user. And Samantha in "Her" can console Theodore through conversational AI, immediately responding and making suggestions and recommendations.

Why have we not yet reached these pinnacles of speech recognition products?

This is because ASR has been challenging enough in its own right, even in just a single language. It is also because we have not had access to the underlying capabilities that would make these products possible. But this has changed.

Combining the incredible leaps Speechmatics has made in ASR (especially real-time ASR) with the powerful progress of large language models (LLMs) brings many of these science fiction dreams within reach. Speech can be seamlessly understood and comprehended in real-time, translated into hundreds of other languages, summarized, analyzed, actioned, integrated, and much more. Speech technology is no longer simply a means of converting audio into text; it is the gateway into a new era of artificial intelligence. This is what we call Speech Intelligence.

Harnessing the innate power of speech, beyond just recording it

Given today's rate of progress in technological advancements, the concept of Speech Intelligence can rapidly reshape how we interact with the world around us and revolutionize the way we harness the power of spoken language.

So what is Speech Intelligence? And what is it not? Speech recognition uses artificial intelligence to transform audio into text data. Text powers the current incredible capabilities of LLMs, and Speechmatics uses these breakthroughs to enable more to be done with speech and audio data.

Beyond seamless real-time speech recognition in 50 languages (and counting), we also enable customers to make live conversations immediately accessible to a wide range of additional capabilities.

Beyond mere speech recognition, we enable the understanding of audio alongside spoken language. Speech Intelligence deciphers and processes the rich spectrum of audio data, encompassing elements like music, environmental sounds, and more. This holistic understanding of audio and speech opens up many new possibilities for applications.

Beyond foundational ASR capabilities, Speech Intelligence also extends its reach into the realms of summarization, sentiment, topics, and more, adding layers of sophistication to the understanding of spoken content.

But perhaps one of the most exciting developments in Speech Intelligence is the shift from passively recording speech to actively acting on it.

Speech recognition is a passive restructuring of audio data into text. Speech Intelligence is active, and actionable.

Speech recognition is a feature. Speech Intelligence is a seamless, empowering, problem-focused product.

Speech recognition is transcribing call center conversations for regulatory requirements. Speech Intelligence is empowering agents with useful information and key conversation playbooks. All in real-time.

Speech recognition is transcribing a podcast so that it can be indexed and searchable on a platform. Speech Intelligence is real-time monitoring of live streams to identify key topics.

Speech recognition is captioning a lecture because it is required for hosting online. Speech Intelligence is comprehending the topics being discussed, restructuring and personalizing the experience for each student.

In this paradigm, spoken words become a catalyst for tangible outcomes. Voice-activated technology, virtual assistants, and smart devices respond to our verbal commands, allowing us to control our environment, access information, and perform tasks with remarkable ease.

In the same way that artificial intelligence is more than just analyzing data, Speech Intelligence is more than a couple of additional features connected with speech recognition.

Speech Intelligence is the culmination of decades of research into speech recognition and Generative AI. It makes Generative AI accessible via a more natural speech interface. It makes speech recognition active, actionable and useful.

The building blocks of Speech Intelligence

Speech Intelligence requires a seamless integration of multiple building blocks. Some of these exist already today, and some we are still imagining and bringing to life through research and product design at Speechmatics. We have already released some of these building blocks alongside our speech recognition product.

But we're just getting started.

These are some of the clear and obvious areas to improve the usefulness of transcripts. We're particularly excited about how the range of capabilities can be quickly expanded. We're also excited about how the combination of capabilities creates more value than the sum of their parts, by solving specific customer use cases.

Building capabilities to create more than the sum of its parts...

While each capability, such as summarization, is a valuable feature in and of itself, it becomes more powerful when viewed as an "ingredient" of a larger "recipe" of capabilities that bundled together provide a solution to a specific use case. Being able to boil down a long customer complaint into its salient points, assign topic detection labels to triage the problem, and measure the customer's sentiment in real-time while the complaint is dealt with will not only improve the customer experience but also standardize the experience provided and reduce call handling times for providers.
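To make the "ingredient" and "recipe" idea concrete, here is a minimal sketch in Python. Everything in it is an assumption made for illustration: the `Recipe` class and the `summarize`, `detect_topics`, and `sentiment` functions are hypothetical toy placeholders, not Speechmatics APIs, and a real system would call trained models rather than keyword rules. The only point is the shape: each capability takes a transcript, and a recipe bundles several of them for one use case.

```python
from dataclasses import dataclass
from typing import Callable, Dict

# All names below are hypothetical placeholders for illustration only.
# They are NOT Speechmatics APIs; real capabilities would be model-backed.
Capability = Callable[[str], object]


def summarize(transcript: str) -> str:
    """Toy summary: keep only the first sentence of the transcript."""
    return transcript.split(".")[0].strip() + "."


def detect_topics(transcript: str) -> list:
    """Toy topic detection: match the text against a small keyword table."""
    keywords = {
        "billing": ["invoice", "charged", "refund"],
        "delivery": ["late", "parcel", "courier"],
    }
    text = transcript.lower()
    return sorted(
        topic for topic, words in keywords.items()
        if any(word in text for word in words)
    )


def sentiment(transcript: str) -> str:
    """Toy sentiment: count words from tiny positive and negative lexicons."""
    negative = {"angry", "late", "unacceptable", "wrong"}
    positive = {"thanks", "great", "happy"}
    words = [w.strip(".,!?").lower() for w in transcript.split()]
    score = sum(w in positive for w in words) - sum(w in negative for w in words)
    if score < 0:
        return "negative"
    return "positive" if score > 0 else "neutral"


@dataclass
class Recipe:
    """A named bundle of capabilities aimed at one use case."""
    name: str
    ingredients: Dict[str, Capability]

    def run(self, transcript: str) -> dict:
        # Apply every ingredient to the same transcript and collect results.
        return {label: fn(transcript) for label, fn in self.ingredients.items()}


complaint_triage = Recipe(
    name="complaint_triage",
    ingredients={
        "summary": summarize,
        "topics": detect_topics,
        "sentiment": sentiment,
    },
)
```

Calling `complaint_triage.run(...)` on a complaint transcript returns one dictionary holding the summary, topic labels, and sentiment together, which is the bundled view an agent would act on. Swapping an ingredient in or out changes the recipe without touching the others.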

Being able to recognize key words or entities in real-time media such as the radio or TV is useful to record when a brand, product, or personality has been mentioned. But being able to summarize what was said, what the sentiment and context was, and whether it was a natural conversation, a product placement, or a direct ad helps provide a solution to the media monitoring use case.
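The first step of that media monitoring example, spotting watched terms as transcript segments arrive, could be sketched roughly as follows. This is an illustrative assumption rather than a description of how Speechmatics implements it: `Segment`, `Mention`, and `monitor` are hypothetical names, the segments stand in for the output of a real-time transcription stream, and plain substring matching stands in for proper entity recognition.

```python
from typing import Iterable, Iterator, NamedTuple


class Segment(NamedTuple):
    """One chunk of transcribed speech (hypothetical shape)."""
    start_seconds: float
    text: str


class Mention(NamedTuple):
    """A watched term found in a segment, with the segment as context."""
    start_seconds: float
    term: str
    context: str


def monitor(segments: Iterable[Segment], watchlist: set) -> Iterator[Mention]:
    """Yield a Mention each time a watched term appears in a segment.

    Works lazily, so it can consume a live stream segment by segment.
    Matching is a case-insensitive substring check, a deliberate
    simplification of real entity recognition.
    """
    for segment in segments:
        lowered = segment.text.lower()
        for term in sorted(watchlist):
            if term.lower() in lowered:
                yield Mention(segment.start_seconds, term, segment.text)
```

Each `Mention` carries the surrounding text, so later steps such as summarization or sentiment can be applied to the context rather than to the bare keyword hit.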

We don't assume we know all the different combinations that could be created using our suite of features. In fact, we're excited to hear about other use cases! We hope to inspire and invite ideas by releasing some of the examples we've already built internally, such as Agent Assist and Media Monitoring described above.

What does the future hold?

We are entering a new era of artificial intelligence. The breakthroughs are occurring at an ever-increasing pace. We believe that we can empower speech recognition with the agency of LLMs, and provide LLMs with speech as a natural interface.

We were promised Samantha from "Her" and instead we got home assistants that can set a timer or tell us the weather. We are nowhere near the goal of enabling interactions with computers to be as seamless as those we expect with other people. This is what drives us at Speechmatics to continue our mission to understand every voice.

This mission is more than a north star to drive our product roadmap. We believe it is essential to the way we imagine our future as a society, as a species, and as a community. We want to ensure that no voice is left behind. We want the rising tide caused by AI breakthroughs to be more broadly accessible. We want to ensure seamless inclusivity in the interaction with our product, and the way our product interacts with the world. This is why we're so focused on ensuring accuracy improvements across all languages, not simply by providing a multi-lingual model that is biased towards the most prominently used languages, but by building an architecture that ensures all languages can reach the same accuracy we already expect for English.

Speechmatics, at its heart, is an AI company. The organizational DNA is built on innovating with the latest artificial intelligence research. We are in an enviable position to take advantage of this new era and be a leader in integrating Speech Intelligence as a core part of the future technology stack for AGI. We know how to translate research into products.

We're excited for this new journey towards Speech Intelligence.

Transform your audio and media data into one of the biggest value-driving assets you have

Book a meeting with one of our specialists to learn how you can unlock the value within speech.

Oct 17, 2023 | Read time 9 min
