When we look at our fastest-growing customers, they span diverse sectors and regions - from multinational contact centers and regional healthcare providers to global media captioning companies and local emergency services.
But analyzing their success reveals one striking commonality: they’ve all been early adopters of real-time voice AI, and have moved on from simply seeing speech transcription as a compliance requirement or an afterthought.
This trend is accelerating to the point where I’m confident it's no longer a question of if companies will upgrade their services with real-time, but when.
Let me unpack this transformation and why it matters.
Batch transcription has many incredible advantages – scalability, simplicity, and cost efficiency. It remains highly valuable for specific use cases and will continue to serve its purpose.
High-profile, well-funded companies are entering the speech technology market, usually with the loud proclamation that they have 'the most accurate' speech recognition in town.
Aside from opaqueness around how exactly they measure this, almost all share one thing in common – their headline accuracy percentage (97% accurate! Cue applause) always refers to batch (or file) transcription: sending off a pre-recorded audio file and letting the model return a transcription based on the entirety of that audio as context.
We can understand this - a company's highest accuracy scores are usually its batch English results, and companies always want to put their best foot forward.
It also makes sense technically - there's lots of training data, and having the whole file means it's more straightforward for machine learning models to make a good guess at what is being said.
But two things to keep in mind here:
These results are like shooting a free throw in basketball, in a quiet space, on your own. It's a very different story during a live game with just a few seconds to go. For us, real-world results matter more than anything.
The overwhelming and growing demand is not for file transcription, but for live, instant, real-time transcription.
Why? Let's dive in.
When we analyze the trajectory of voice AI adoption, everything points to one conclusion: immediacy is no longer a luxury – it's an expectation.
The market shift is undeniable, with real-time applications driving innovation across sectors. In a world where consumers demand instant responses and immediate solutions, batch processing can't keep pace.
What's driving this evolution? In contact centers, managers need to monitor dozens of concurrent calls; real-time transcription helps them spot escalating situations instantly rather than waiting for post-call analysis.
Emergency services use real-time transcription to surface critical keywords or emotions immediately, helping prioritize urgent responses that could save lives. Meanwhile, in drive-thrus and retail environments, the technology ensures accurate orders despite noisy conditions.
Financial services and healthcare organizations are discovering another crucial advantage: proactive compliance. Rather than catching sensitive information breaches after the fact, real-time transcription flags potential issues as they happen, transforming how businesses operate.
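To make that "flag issues as they happen" idea concrete, here is a minimal sketch of how a downstream system might scan live transcript segments for sensitive content the moment they arrive. The patterns, handler names, and the demo stream are all hypothetical and purely illustrative - in practice the segments would come from a real-time transcription client and the alerts would feed a supervisor dashboard or compliance tooling.

```python
import re

# Hypothetical patterns a compliance team might want flagged in a live transcript.
SENSITIVE_PATTERNS = {
    "card_number": re.compile(r"\b(?:\d[ -]?){13,16}\b"),
    "escalation": re.compile(r"\b(cancel my account|speak to a manager)\b", re.IGNORECASE),
}


def flag_segment(segment: str) -> list[str]:
    """Return the names of any sensitive patterns found in one transcript segment."""
    return [name for name, pattern in SENSITIVE_PATTERNS.items() if pattern.search(segment)]


def monitor(transcript_stream):
    """Consume transcript segments as they arrive and alert on matches immediately."""
    for segment in transcript_stream:
        for hit in flag_segment(segment):
            # In a real deployment this might notify a supervisor or open a case.
            print(f"ALERT [{hit}]: {segment}")


if __name__ == "__main__":
    # Stand-in for a live feed of real-time transcript segments.
    demo_stream = [
        "sure, my card number is 4111 1111 1111 1111",
        "thanks, that's everything I needed",
    ]
    monitor(demo_stream)
```

The point is the timing: because the text arrives while the call is still in progress, the alert fires while there is still time to act on it.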
Below is a comparison demonstrating how fast and how accurate a range of speech-to-text providers are for real-time transcription of diverse English speech:
Models evaluated: Amazon default, Assembly AI default, Deepgram Nova-3, Google Latest Long, Microsoft Azure default, Whisper Turbo (Large V3)
From the graph you can see that even at the lowest word latency (bottom of the graph), Speechmatics produces the fewest errors (left side of the graph).
This means that Speechmatics can transcribe what was spoken in just hundreds of milliseconds and get it right most often.
We try to be transparent with our evaluations by sharing which dataset we used. In this evaluation, the Kincaid46 dataset is used to evaluate final Word Error Rate (WER) against how fast each word is returned. Kincaid is a well-known public benchmark for evaluating transcription performance in diverse real-world scenarios, and is out-of-domain for Speechmatics.
The Kincaid46 dataset consists of audio drawn from a variety of sources, including telephone and VoIP calls, meetings, and scripted and unscripted broadcasts. Each category incorporates a range of speakers, accents, and topics, making Kincaid46 a robust test for assessing transcription systems’ ability to handle variability in both content and conditions. We chose it because it is one of the hardest datasets that is also publicly available.
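For readers who want to see what that accuracy metric means in practice, here is a small, self-contained sketch of how Word Error Rate is typically computed: the word-level edit distance (substitutions, deletions, insertions) between the reference and the hypothesis, divided by the number of reference words. The example strings are made up; this is an illustration of the metric, not the evaluation harness behind the graph above.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Compute WER as word-level edit distance divided by reference length."""
    ref, hyp = reference.split(), hypothesis.split()
    # Dynamic-programming table for Levenshtein distance over words.
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i  # deleting all remaining reference words
    for j in range(len(hyp) + 1):
        d[0][j] = j  # inserting all remaining hypothesis words
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            substitution_cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(
                d[i - 1][j] + 1,                      # deletion
                d[i][j - 1] + 1,                      # insertion
                d[i - 1][j - 1] + substitution_cost,  # substitution or match
            )
    return d[len(ref)][len(hyp)] / len(ref)


# Toy example: one deletion and one substitution across six reference words -> WER of 2/6.
print(word_error_rate("please send the report by friday", "please send report by thursday"))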
For those who aren’t data scientists - if you’re looking for accurate, lightning-fast transcription, no one touches us. For real-time transcription that doesn’t compromise on accuracy, we set the benchmark.
Human conversations are messy and unpredictable. People talk over each other, switch topics mid-sentence, or get interrupted by background noise. In a normal conversation, our brains naturally filter out these complexities – we focus on the person we’re speaking with, tuning out surrounding distractions.
Traditional Automatic Speech Recognition (ASR) struggles with this. It doesn’t have the same selective hearing capabilities that humans do, often misinterpreting speech when multiple people are talking or when background noise is present. That’s why so many existing voice AI solutions fail in real-world scenarios – they get thrown off by overlapping speech, background chatter, or ambient noise.
Take a drive-thru scenario: a customer places an order while children chatter in the backseat. Unlike conventional ASR systems that might misinterpret overlapping voices, Speechmatics’ advanced AI locks onto the primary speaker and filters out background noise, ensuring only the intended speech is processed.
Similarly, in contact centers, where interruptions are constant, real-time transcription captures every word accurately – even when customers and agents speak at the same time. This precision is critical in compliance-heavy industries, ensuring that legally required affirmations, such as a customer agreeing to terms and conditions, aren’t lost in the noise.
It’s important to note that real-time also takes speech-to-text beyond the existing worlds of telephony, media, contact centers, education and a handful of other sectors already extremely familiar with it.
World-class, accurate real-time understanding can transform every single interaction with technology. It may spell the end of online forms - you can simply talk to an agent. It may transform inclusivity, allowing every single person with a voice to build using technology.
The applications aren't just vast - they are uncountable. But only if they work accurately, in real-time, across a broad range of use cases.
To ensure we understand medical terms as well as we do drive-thru orders, we have an ace up our sleeve...
Another major challenge in transcription is handling industry-specific jargon, product names, and acronyms. Many industries – especially healthcare, finance, and technology – use specialized terms that standard speech recognition models may not recognize.
In the medical field, drug names and procedural terms must be transcribed accurately. In retail, new product names must appear correctly in documentation and marketing materials.
Out of the box, Speechmatics supports a vocabulary of over 2 million words for English alone. But language is constantly evolving with new product names and terms being created. This is where our custom dictionary comes in.
For years, our enterprise customers have told us how valuable our Custom Dictionary is - the ability to instantly and effortlessly add complex words, acronyms and anything else relevant to their use case to our models, and have those words always transcribed accurately.
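As an illustration, the sketch below shows the general shape of a custom-dictionary entry inside a transcription configuration. The field names used here (`additional_vocab`, `content`, `sounds_like`) follow the pattern our documentation describes, but treat this as an indicative sketch - the product names are invented, and the current API reference is the authoritative source for the exact schema.

```python
import json

# Illustrative transcription configuration with custom-dictionary entries.
# Field names are indicative of the general pattern; check the API reference
# for the exact, up-to-date schema before using this in production.
transcription_config = {
    "language": "en",
    "additional_vocab": [
        {"content": "Speechmatics"},
        # "sounds_like" helps map a spoken form onto an unusual spelling.
        {"content": "Zyrtec", "sounds_like": ["zir tek"]},
        {"content": "SKU-4471", "sounds_like": ["skew forty four seventy one"]},
    ],
}

print(json.dumps({"transcription_config": transcription_config}, indent=2))
```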
This reflects the uniqueness of every single business’s customers, world and language.
While other providers are only now realizing the importance of this feature, we've been refining it to offer seamless integration, keeping businesses ahead of the curve.
Many businesses assume transitioning to real-time transcription is complicated or disruptive, but that's not the case. Organizations that combine batch and real-time transcription get the best of both worlds – instant insights for immediate action and post-call analytics for long-term strategy.
Companies that currently transcribe files after calls or events can continue this approach while gaining additional value. Since our real-time accuracy matches what's available in our batch transcription engine, you get both post-call analytics and immediate insights, bringing more value to your operation and potentially reducing costs.
The reality is clear: batch processing isn't disappearing overnight, but the future belongs to real-time. Forward-thinking companies are already implementing hybrid approaches, using real-time for immediate needs and batch for deeper analysis.
The path to implementation is gradual and low-risk: many businesses start by layering real-time transcription into just one part of their workflow, such as live agent assistance in a contact center, and then expand the use of transcribed conversations into other areas as they see the value. This approach helps companies stay ahead of evolving customer expectations while minimizing disruption.
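To make the hybrid pattern concrete, here is a simplified sketch of one way to structure it: audio chunks are streamed for live transcripts while the same audio is archived for later batch re-processing. The streaming and callback functions here are hypothetical placeholders rather than a specific SDK - the shape of the pipeline is the point, not the names.

```python
from pathlib import Path
from typing import Callable, Iterable


def hybrid_pipeline(
    audio_chunks: Iterable[bytes],
    send_realtime: Callable[[bytes], str],     # hypothetical: returns live transcript text
    on_live_transcript: Callable[[str], None], # hypothetical: e.g. agent-assist UI callback
    archive_path: Path,
) -> Path:
    """Stream audio for instant transcripts while archiving it for post-call batch analysis."""
    with archive_path.open("wb") as archive:
        for chunk in audio_chunks:
            archive.write(chunk)          # keep the raw audio for the batch run later
            text = send_realtime(chunk)   # instant transcript for in-the-moment use
            if text:
                on_live_transcript(text)
    # The archived file can then be submitted to a batch transcription job
    # for the deeper, whole-conversation analysis.
    return archive_path
```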
The move to real-time transcription isn't just a trend to keep track of. Those who integrate it early gain a competitive advantage, while those who wait risk falling behind.
If you’re thinking about trying out our real-time transcription – what are you waiting for?
Want to learn more? Stay tuned for our next article, where we'll break down the myths about integrating real-time transcription and why it's more accessible than you might think.