Introducing Flow – passing the Turing test with speech

A new era in Conversational AI

Today marks an exciting new chapter for Speechmatics.

We’re bringing all the expertise that has allowed us to lead the ASR market for a decade into the world of Conversational AI and AI assistants.

Today we are introducing the world to Flow.

Flow enables any company to build incredible speech interaction into their products.

It has a huge vocabulary, and is consistently accurate even at incredibly low latency (so no waiting for responses). It also includes highly accurate speaker identification, which enables game-changing interactions that involve multiple people or take place in noisy environments.

The range of use cases for this is vast. It includes generalist AI assistants, but can also be used for any AI agent or product that would benefit from people being able to speak to it.

Why are Speechmatics doing this?

Does it change our mission to understand every voice?

For us, the answer is simple – our mission remains, but the scope of our product just got slightly bigger.

The intuitive need for speech interactions

Why do we need speech interactions?

There’s a simple, and strong, reason to want to add speech interactions to our current roster of ways to use technology – it is deeply intuitive.

We’ve been using our voices for 100,000 years – far longer than any tool, let alone a keyboard and mouse.

It is the default way that we, as humans, communicate with each other.

In a world where technology is becoming increasingly powerful, technology that can ‘meet us where we are’ will ultimately be far more valuable than technology that requires a degree in computer science to use.

Technology we can work alongside and collaborate with to perform tasks and solve problems, all without lifting a finger (literally), will transform the way in which we work and play.

Addressing accessibility challenges

For many, our current methods of using technology will seem straightforward: a (touch) screen displaying information and images, and a mouse and keyboard to complete tasks.

Simple enough.

Except, for millions, it is not.

Worldwide, there are millions of people for whom this option presents significant barriers:

A World Economic Forum study in 2017 revealed that almost a quarter of respondents did not know how to operate a computer.
3 in 10 Americans struggle to use the internet.
10% of the world’s population has dyslexia (that’s 780 million people).
40 million people are blind, with a further 250 million having a visual impairment.
7% of working adults have dexterity issues and will struggle to use a keyboard and mouse.

As well as the exciting opportunity to give people a new way to interact with technology, there is also the motivation of creating truly inclusive technology that all but eradicates the traditional barriers to adoption.

The current landscape

There may be some thinking that all the key AI players are implementing speech as the core interface to their intelligence stack.

There may be many who are die-hard users of Siri or Google Assistant, or who have tried ChatGPT since the introduction of GPT-4o.

There may be those thinking that speech is solved.

They would be wrong.

An easy way to show this is to imagine the Turing test.

First proposed in 1950, it is a test of a machine's ability to exhibit intelligent behavior equivalent to, or indistinguishable from, that of a human.

Another way of thinking about this is that if a machine can ‘fool’ someone into thinking it is another person, the test has been passed.

It can be argued that we reached this point with text interactions a few years ago. When chatting with an LLM, it is very easy to forget that there is no person on the other side.

But imagine now trying to pass this test using speech. Are we at the stage where someone would be able to conduct a conversation (say on a phone) with an AI, and come away believing they had been speaking to a person? We believe the answer to this is no.

Building future-ready technology

Live human conversation is messy.

Picture yourself sat in a bustling coffee shop, texting a group of friends.

The grammar of the interaction is well understood. Someone else in the group types a sentence or two and sends it as a message; you read it and decide to respond. You write a few words, send them, and wait. Each sent message usually represents a complete thought, with the person deciding when to send it. The conversation flows slowly and steadily.

Now imagine sitting in that same coffee shop with the same friends, having an animated conversation.

It is far faster, more dynamic, and full of subtle cues to make the interaction function.

Face-to-face conversations are full of emphasis, intonation, accents, slang, dialects, interruptions, cross talk, pauses, stammers, stutters, noises, laughter, sarcasm, and more.

This doesn’t even include the fact that everyone there recognizes each other’s voices (in theory, you could all close your eyes and continue the conversation).

It also doesn’t include the fact that those at the table can distinguish between those who are part of the conversation and the background chatter from the rest of the coffee shop.

These characteristics, while deeply intuitive and understood by people after a lifetime of social interactions, are really tough challenges for technology to understand and overcome.

Imagine opening Siri, Google Assistant, or ChatGPT in that coffee shop and making it a part of the conversation.

Do you think it would blend right in and be one of the group?

Or do you think it would fall over, its basic function broken by the number of curveballs being thrown its way?

We still have a way to go.

What’s also important to note is that the ‘speech-in’ part of any interaction with AI is the most important link in the chain.

Even with the world’s greatest intelligence and voice synthesis (speech-out), the entire interaction would be fundamentally undermined if it could not understand what was being said to it.

Solving this therefore represents the biggest challenge to widespread adoption.

Only when we can pass the Turing test using speech will it become a new and ubiquitous way to interact with technology.

Responsive, inclusive and enterprise-ready

So how should we be building this technology?

Flow incorporates all the things that make Speechmatics a best-in-class choice for ASR, and the things that make us great at ASR will map onto the world of Conversational AI.

This means high accuracy, real-time abilities that surpass the competition, and a range of deployment options.

But in order to reflect the nuances of human conversation, we think these three pillars best represent the guiding values for building Flow:

1) Responsive

We use this in the same way you might describe a person (rather than a website). This means the technology actively listens, understands and replies to everyone else who is part of the conversation and does so without long pauses. The emphasis here is on being ‘low latency’, but also includes responding appropriately to the way in which things are said.

2) Inclusive

For speech technology to move beyond a fun gimmick, it must be usable by all. Unlike previous generations of interface (the keyboard, mouse, etc.), this is not a question of training or skills. This is about the technology being able to understand what you are saying. Given the rich tapestry of human voices, this means it needs to understand every language, accent, dialect, intonation, sentiment and emphasis. Speech technology cannot be built only with the US user in mind.

3) Enterprise-ready

In order for speech to become widely used, a large number of companies have to integrate it into their products (or build new products on speech). We believe that in order for this to happen, there cannot be a one-size-fits-all approach to the speech engine that underpins it. Businesses have a broad range of privacy and security needs, as well as the requirement to integrate with various systems (for example, an educational assistant must be able to access specific course materials and pupil information without sharing these with the world). Without this level of flexibility, only a handful of companies will build speech-powered products.

Showcasing Flow's capabilities

Though Flow remains very new to the world (and us), it already holds up against the market, and has a number of unique strengths.

To showcase Flow, we’ve built a number of AI assistants and created the following demonstrations so you can see for yourself where Flow excels:

In the case that it is not self-evident from the above, there are several ways Flow stands out against the available options today.

Flow was built using our real-time engine, which means that it is powered by something best-in-class. In the world of AI assistants this means that it responds quickly to questions, and handles interruptions and cross talk extremely well. There is no ‘mini-batch’ sleight of hand going on here – Flow really does hear and understand every word being said, as it is being said.

It also has powerful speaker identification and diarization, which means that, depending on the situation, it can address multiple people by name, ignore speakers until called upon, or ignore background voices even when they are clearly heard. Outside of a lab environment, this is absolutely essential to ensure that people’s experiences don’t break down with the slightest bit of background noise, or in a group setting.

Flow also harnesses the market-leading accuracy of our enhanced engine, which means it can understand a huge vocabulary of words, uttered in any accent or dialect. This is fundamental to widespread adoption, where no ‘average voice’ exists.

Flow may not yet be able to pass the Turing test with speech, but our starting point is strong, and we at Speechmatics have more than our share of plans and ideas on how we will get there.

Onwards towards Turing

We want to become the first company in the world to pass the Turing test using speech.

This will require significant breakthroughs and more improvements, but we’re already off to a very strong start. And now we’re giving you the chance to get your hands on it.

This goal is consistent with our desire to understand every voice. Our ASR customers should be reassured, since any breakthrough in this space will benefit our ASR (and vice versa). The goals are not incompatible; they complement each other, given that the underlying technology is the same.

Be amongst the first to start using Flow to power incredible voice interactions.

Come on this journey with us and join the waitlist.

We can’t wait to see what you build with it.

Go with the Flow

Build responsive, seamless and inclusive speech interactions, using the ultimate voice API.

Jul 30, 2024 | Read time 7 min