Dec 5, 2024 | Read time 6 min

Building inclusive speech tech that empowers every voice

How do we create technology that truly understands every voice? Dive into our journey of creating speech technology that’s as dynamic and diverse as the voices it serves.
Benedetta Cevoli, Senior Machine Learning Engineer

In our recent blog post, When AI Bias Shapes the Future Workplace, our Director of Engineering, Teri Drummond, explored how bias in AI systems can create barriers instead of breaking them.

Today, I want to dive deeper into what building inclusive technology means to us at Speechmatics and share how we’re working to Understand Every Voice – an approach that is both technical and deeply human.

At Speechmatics we’re guided by one mission: Understand Every Voice. Before looking into how we approach the problem, let’s look at what our voices look like and why understanding speech is hard.

Building inclusivity from within

Creating inclusive technology isn’t just about the models we build—it’s also about the culture of the people building the technology. At Speechmatics, we’re intentional about fostering an environment that values diverse perspectives and empowers our team.

One example of this is our mentorship program, which has been particularly impactful for women in engineering, a group underrepresented in tech. The program is simple but powerful: once a month, women in engineering come together for a peer mentoring session where we openly discuss challenges and learnings, and give each other feedback and support.

The safe, open space we create through mentorship allows individuals to raise ideas and concerns, which has led to real improvements in our products and work environment. 

Empowering through knowledge sharing: Journal Clubs

We’ve also extended our commitment to inclusivity through initiatives like company-wide journal clubs, where we come together to discuss scientific papers and industry findings on topics like inclusion and bias in tech. Colleagues and I take turns leading presentations, including topics covered in Teri’s When AI Bias Shapes the Future Workplace.

These sessions have proven to be an incredibly effective way to foster meaningful conversations about inclusivity, with my peers and leadership deeply invested in learning and reflecting on these important issues. The feedback from these sessions has been overwhelmingly positive, motivating us to continue these discussions and apply what we’ve learned across the company.

These journal clubs have inspired everyone, from new hires to senior leaders, to think critically about inclusivity and how it should be embedded in our technology and processes. As a result, we’ve even established a dedicated team to test our systems for persona-assigned LLM bias, driving actionable change and reinforcing our commitment to building equitable technologies.

Embedding inclusivity in hiring practices

Inclusivity is woven into every aspect of our company, including our hiring process. One of the most impactful ways we demonstrate our commitment to inclusivity is through our Culture Panel interviews. These panels serve as a two-way conversation, allowing us to ensure candidates align with our cultural values, while also giving them an opportunity to see how we live those values in practice.

We use the Culture Panel as a chance to showcase how much we care about fostering an inclusive, empowering environment. It’s also a great opportunity for candidates to see first-hand how our leadership and team members work together to uphold our company values.

Furthermore, we ensure that diverse voices are represented on interview panels. When we interview women, we commit to having at least one woman on every panel throughout the interview process. This simple but powerful step allows underrepresented candidates to see themselves represented, making them feel more comfortable and supported in their potential new roles.

These practices don’t just help us find candidates who fit our mission; they also provide candidates with confidence that inclusivity is not just a buzzword at Speechmatics. It’s a core part of who we are, and it permeates everything we do.

However, these initiatives can only succeed when they are championed by strong, diverse leadership. At Speechmatics, we are proud to have women in senior leadership roles, including Katy, our CEO, who is a passionate advocate for diversity, equity, and inclusion (DEI). Katy continuously drives our commitment to inclusivity and ensures that it runs through every level of our organization by supporting various initiatives that bring diverse perspectives together.

Katy is also deeply involved in cross-functional (XFN) strategy meetings, where diverse voices and perspectives from across our teams are brought together to drive innovation. These sessions are integral to our approach, as they foster collaboration across disciplines, encourage unique perspectives, and ensure that inclusivity is woven into the fabric of our strategic decision-making.

By championing initiatives like these, Katy exemplifies how leadership can create space for meaningful contributions from every corner of the organization.

Building inclusive tech for every unique voice

Have you ever picked up a call from a familiar number—maybe your mum’s—and instantly realized it was your sister on the other end, just from the way she said “Hello”? That’s because voices are as unique as fingerprints.

Each of us has a voice that acts as a personal identifier, revealing who we are the moment we speak. This holds true across languages. In Italy – where I grew up – it’s common to simply say “It’s me” when asked, “Who is it?” on the intercom, trusting the voice alone to convey the speaker’s identity. 

While we can group speakers based on characteristics like language, accent, and other features, no two voices are exactly alike. This is because our voices carry the imprint of our life experiences. The places we’ve lived, the people we’ve spoken with, and the languages or dialects we’ve been exposed to all shape the way we sound.

Even subtle details—like a particular word choice, an accent picked up from childhood friends, or the rhythm of speech influenced by a second language—create a vocal fingerprint that’s entirely our own. In fact, our voices are never static. Just as no two voices are alike, no voice ever stays the same. Depending on the context, what I want to say, and whom I’m speaking to, my voice changes completely.

For example, I don’t speak to my dog the same way I would during a presentation or when I’m out with friends at a pub. Our voices are also highly dynamic, continuously changing and developing over our lifetimes as we grow older and expand our experiences with people and the world around us.

The challenge of understanding speech

To truly understand every voice, we need to capture the vast variability both across different voices and within an individual’s voice over time. This is hard enough on its own, but it becomes even more challenging when the datasets we work with sound clean and controlled, far from the messiness of real-world speech.

In reality, the speech we aim to understand is messier, noisier, and full of variability.

[Audio sample: training audio]

[Audio sample: real-world audio]

When speech models fail to capture the actual variability in speech, bias in speech tech occurs.

If a model is trained on US accents only, it will struggle to understand anything else. As a non-native speaker with a combination of accents derived from my unique life experiences (early studies in Ireland, hours of Hollywood movies and TV series, 7+ years in the UK, and many interactions with English speakers from all over the world), I have experienced first-hand the challenges that speech technology faces with the rich, messy, and incredibly dynamic nature of real-world speech.

This is why speech is a hard problem to solve, and I feel deeply connected to Speechmatics’ mission of understanding every voice, regardless of accent or background, in any situation.

For speech tech to be actually useful and reliable in the real world, it has to work well in any real-life situation, from busy, noisy environments to people talking over each other.
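One common way to make this kind of bias visible is to measure word error rate (WER) separately for each speaker group instead of reporting a single average. The sketch below is a minimal illustration, not Speechmatics' evaluation code: the function names, the group labels (`accent_a`, `accent_b`), and the sample sentences are all made up for the example, and real evaluations normalise text (casing, punctuation, numerals) before scoring.

```python
from collections import defaultdict


def word_edit_distance(reference, hypothesis):
    """Return (word-level edit distance, number of reference words)."""
    ref, hyp = reference.split(), hypothesis.split()
    # prev[j] holds the distance between the reference words seen so far
    # and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = 0 if ref_word == hyp_word else 1
            curr.append(min(
                prev[j] + 1,                 # deletion
                curr[j - 1] + 1,             # insertion
                prev[j - 1] + substitution,  # match or substitution
            ))
        prev = curr
    return prev[-1], len(ref)


def wer_by_group(samples):
    """Aggregate WER per group from (group, reference, hypothesis) triples."""
    errors = defaultdict(int)
    words = defaultdict(int)
    for group, reference, hypothesis in samples:
        distance, n_words = word_edit_distance(reference, hypothesis)
        errors[group] += distance
        words[group] += n_words
    return {g: errors[g] / words[g] for g in errors if words[g] > 0}


# Hypothetical data: the same sentence spoken by two illustrative groups.
samples = [
    ("accent_a", "turn the lights on", "turn the lights on"),
    ("accent_b", "turn the lights on", "turn the light on"),
]
print(wer_by_group(samples))
```

A large gap between groups on comparable material is a signal that the model has not captured the variability of the under-served group, even if the overall average looks healthy.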

Inspired by how humans learn to speak

Our approach to understanding speech draws inspiration from how children learn to speak. When children are first immersed in the world, they’re exposed to a rich spoken environment. They begin to recognize patterns and sounds before moving on to formal learning, like reading and writing.

Similarly, we train large models on millions of hours of speech data, exposing them to a wide variety of voices, accents, and languages. This enables our models to learn the universal characteristics of speech. Once this foundation is in place, we train another model to map sounds into text, using audio-text pairings.

Pretraining on audio alone allows the model to learn and represent sounds well, enabling us to learn the sound-to-text mappings in a very sample-efficient way. This process gives us the ability to understand voices in noisy, real-world environments, including those that are often underrepresented in datasets.
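To make the two-stage idea concrete, here is a deliberately tiny sketch. It is not the Speechmatics training pipeline: real systems use large self-supervised neural networks, whereas this toy uses 1-D k-means clustering as a stand-in for pretraining and a majority vote as a stand-in for the sound-to-text model. The feature values, labels, and function names are all invented for illustration.

```python
from collections import Counter, defaultdict


def nearest(centroids, x):
    """Index of the centroid closest to feature value x."""
    return min(range(len(centroids)), key=lambda i: abs(centroids[i] - x))


def pretrain_centroids(unlabeled, k=2, iterations=10):
    """Stage 1 (toy): learn sound clusters from unlabeled features only.

    Assumes k >= 2. Centroids start evenly spaced between min and max,
    so the result is deterministic.
    """
    lo, hi = min(unlabeled), max(unlabeled)
    centroids = [lo + (hi - lo) * i / (k - 1) for i in range(k)]
    for _ in range(iterations):
        clusters = defaultdict(list)
        for x in unlabeled:
            clusters[nearest(centroids, x)].append(x)
        # Move each centroid to the mean of its cluster (keep it if empty).
        centroids = [
            sum(clusters[i]) / len(clusters[i]) if clusters[i] else centroids[i]
            for i in range(k)
        ]
    return centroids


def fit_label_map(centroids, labeled_pairs):
    """Stage 2 (toy): map each learned cluster to text using a few pairs."""
    votes = defaultdict(Counter)
    for x, label in labeled_pairs:
        votes[nearest(centroids, x)][label] += 1
    return {idx: counts.most_common(1)[0][0] for idx, counts in votes.items()}


def transcribe(centroids, label_map, features):
    """Label new features; '?' marks clusters that never saw a label."""
    return [label_map.get(nearest(centroids, x), "?") for x in features]


# Plenty of unlabeled "audio", but only two labeled examples.
unlabeled = [0.9, 1.1, 1.0, 1.2, 0.8, 4.9, 5.1, 5.0, 5.2, 4.8]
labeled = [(1.0, "a"), (5.0, "b")]

centroids = pretrain_centroids(unlabeled)
label_map = fit_label_map(centroids, labeled)
print(transcribe(centroids, label_map, [1.3, 4.6, 0.7]))  # ['a', 'b', 'a']
```

The point of the toy is the division of labour: the structure of the sounds is learned from the large unlabeled set, so only a handful of labeled pairs are needed to attach text to it.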

Today, our inclusive technology can understand over 50 languages, which represents more than half of the world’s population. Every time I think about that, I find it mind-blowing!

Try listening to this audio and count the number of words you can understand.

Now look at what our engine can do!

"Well, I did once, so I guess I kind of like a bra.

This is called a chocolate cardigan over Loyola, and it's required to become a scuba diver.

And you get the idea that you try to leave Guatemala with them for two weeks.

It was amazing seeing like an Aztec or Mayan little village making their own tortillas with corn on with one thing, chickens just walking around like it's nothing.

It's so cool.

And then they went to work for for a couple of days.

And they have all indigenous cats, jaguars and ocelots, margays, stuff like that. Tigers, lions.

It was it was a sanctuary.

So it was like set up shop there and just chill.

It was nice.

It was a nice place.

They had waterfalls and all these beautiful things.

Um, After that, I don't know.

We went scuba diving in the Blue Hole in Belize, which is like second best to the Great Barrier Reef, apparently.

And it was amazing."

Tackling AI bias is a continuous journey

Our mission to understand every voice is about more than technology — it’s about the people behind it. Building inclusive products requires us to think about the team that makes them possible.

Diverse perspectives lead to better ideas and better outcomes. This is something we actively champion at Speechmatics, and we are committed to keep improving on it.

The journey to inclusivity is ongoing, and there’s always more to learn. But by focusing on the voices that make us unique and the culture that empowers our work, we’re redefining what it means to create truly inclusive technology.

I invite you to share your experiences and approaches to building inclusive AI — let’s keep the conversation going and learn from each other as we strive toward a more inclusive future.