If you want to build the world’s most powerful and accurate speech-to-text technology, there’s one thing you really need – and that’s the right team in place. So what does that team look like, what sort of projects are they working on, and what traits do they need to succeed?
To find out, we sat down with Will Williams, our VP of Machine Learning. From the ins and outs of how machine learning is driving innovation at Speechmatics, to the key trait of a machine learning expert, here’s what he told us.
Will, how long have you been at Speechmatics and what does your work entail?
I’ve been here for over eight years now – in fact, as an intern, I was one of the company’s first employees! During that time I’ve worked on pretty much every part of our tech stack, so I’ve had great exposure to all aspects of our product.
Right now, I’m heading up our machine learning efforts – whether that’s building our capabilities to ensure we can support more speech recognition languages, working on models to improve our accuracy, or helping us become more deployable across a whole range of different devices.
How did you get into machine learning as a field, and speech-to-text more specifically?
I encountered this burgeoning field when I was quite young and just found it super exciting. I started teaching myself, following courses online. I actually moved to Cambridge to do a Master's degree in machine learning.
The name Tony Robinson (our founder) cropped up in this lecture series I was watching by Geoffrey Hinton, one of the godfathers of machine learning. He was talking about Tony’s heroic efforts in the eighties to make speech recognition work, which really piqued my interest.
So you can imagine my surprise when I put out a message on Google+ asking if anyone had any summer internships going, and Tony Robinson himself replied. I jumped at the chance to work with him, started learning on the job, and ended up forgetting about the Master’s altogether. I’ve been here ever since – and it’s been a brilliant place to learn and develop.
Speechmatics is heavily driven by its research and development arm. Could you talk about some of your team’s past focus areas? What sort of projects have you worked on?
The ultimate goal with speech recognition is to make it as accurate as possible. In other words: how do you get it to a point where you can rely on it in any scenario (and then add in those valuable extras, like metadata and sentiment analysis)? It’s an elusive problem, but one we enjoy trying to tackle.
A lot of my initial research was around the ‘language model’, which essentially tells you the probability of a given sentence. That’s really useful because it allows you to differentiate between close calls or constructions that sound similar. So if I say 'recognize speech', am I talking about 'recognizing speech', or am I talking about 'wrecking a nice beach'? That sort of thing.
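To make that concrete, here's a minimal, purely illustrative sketch of the idea – a toy bigram language model (nothing like the production system) scoring two acoustically similar candidates so that the more plausible one can win the close call. The tiny corpus below is made up for the example.

```python
import math
from collections import defaultdict

# Toy corpus standing in for real training text (purely illustrative).
corpus = [
    "it is easy to recognize speech",
    "our systems recognize speech in real time",
    "we recognize speech from many speakers",
]

# Count unigrams and bigrams over the corpus.
unigrams = defaultdict(int)
bigrams = defaultdict(int)
for sentence in corpus:
    tokens = ["<s>"] + sentence.split() + ["</s>"]
    for prev, curr in zip(tokens, tokens[1:]):
        unigrams[prev] += 1
        bigrams[(prev, curr)] += 1

vocab_size = len({w for s in corpus for w in s.split()}) + 2  # plus <s> and </s>

def log_prob(sentence: str) -> float:
    """Add-one-smoothed bigram log-probability of a sentence."""
    tokens = ["<s>"] + sentence.split() + ["</s>"]
    total = 0.0
    for prev, curr in zip(tokens, tokens[1:]):
        p = (bigrams[(prev, curr)] + 1) / (unigrams[prev] + vocab_size)
        total += math.log(p)
    return total

# Two candidates that sound nearly identical to an acoustic model.
candidates = ["recognize speech", "wreck a nice beach"]
for candidate in candidates:
    print(f"{candidate!r}: {log_prob(candidate):.2f}")
print("preferred:", max(candidates, key=log_prob))
```

With any realistic amount of text behind it, a model like this assigns far higher probability to 'recognize speech' than to 'wrecking a nice beach' in most contexts, which is exactly the tie-breaking signal the recognizer needs.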
What about your current and future goals? How are you and the team using machine learning to drive advancements in Speechmatics’ tech?
Another big focus recently has been the acoustic model, which takes a snippet of audio, and tells you: in that snippet, what was the probability I said ‘buh’ or ‘phuh’ or ‘guh’? It goes across all the phonemes, and you have a probability distribution for each of these slices, which is obviously very useful information for a speech recognition system to have.
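As a rough illustration of what that output looks like – with random numbers standing in for a real network, and a deliberately tiny phoneme set – each short frame of audio gets its own probability distribution over the phonemes:

```python
import numpy as np

# A toy phoneme inventory; real systems use a much larger set.
phonemes = ["b", "p", "g", "ah", "sil"]

rng = np.random.default_rng(0)

# Stand-in for the acoustic model's raw outputs: one score (logit) per
# phoneme for each short frame of audio. In a real system these come
# from a neural network, not a random number generator.
num_frames = 4
logits = rng.normal(size=(num_frames, len(phonemes)))

# Softmax turns each frame's scores into a probability distribution.
probs = np.exp(logits) / np.exp(logits).sum(axis=1, keepdims=True)

for t, frame in enumerate(probs):
    best = phonemes[frame.argmax()]
    dist = {p: round(float(pr), 2) for p, pr in zip(phonemes, frame)}
    print(f"frame {t}: best guess '{best}'", dist)
```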
We’ve been focusing on that for the last couple of years, building out giant neural networks that learn from unlabeled data to produce good representations and make that acoustic model work well. It’s called representation learning, and the idea is: let’s take a slice of audio and produce some representations inside a neural network that make all kinds of downstream tasks easy.
Transcription is the main application, but those same representations would also tell you which language the speaker is talking in, whether they were angry, and so on. We find these networks train really well: the representations get increasingly rich, and we have to do very little downstream training to get our actual speech recognition system working on top of them.
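Here's a small sketch of that downstream idea, assuming a hypothetical frozen encoder in place of the real self-supervised model: once the shared representations exist, each task only needs a lightweight head on top. The dimensions and task heads below are made-up placeholders.

```python
import torch
import torch.nn as nn

DIM = 256  # size of the shared representation (arbitrary for this sketch)

# Hypothetical pretrained encoder, standing in for a large self-supervised
# model trained on unlabeled audio; the real thing is far bigger.
encoder = nn.Sequential(
    nn.Linear(80, DIM),  # e.g. 80-dim log-mel features per frame as input
    nn.ReLU(),
    nn.Linear(DIM, DIM),
)
for p in encoder.parameters():
    p.requires_grad = False  # freeze: no fine-tuning of the encoder itself

# Lightweight task-specific heads that all share the same representations.
transcription_head = nn.Linear(DIM, 40)  # e.g. per-frame phoneme logits
language_id_head = nn.Linear(DIM, 10)    # e.g. 10 candidate languages
emotion_head = nn.Linear(DIM, 4)         # e.g. neutral / happy / angry / sad

features = torch.randn(1, 200, 80)       # ~2 seconds of dummy feature frames
reps = encoder(features)                 # (1, 200, DIM) shared representations

phoneme_logits = transcription_head(reps)             # (1, 200, 40)
language_logits = language_id_head(reps.mean(dim=1))  # pool over time -> (1, 10)
emotion_logits = emotion_head(reps.mean(dim=1))       # (1, 4)
print(phoneme_logits.shape, language_logits.shape, emotion_logits.shape)
```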
Is there such a thing as a ‘standard day’ in your role, when projects and research are always evolving?
Maybe not a standard day, but there are definitely some commonalities. I try to read maybe two papers every day, and I also like to attend stand-ups with different teams to get a sense of what’s going on. Often, I spend some time writing architecture docs for an upcoming research project, or even working with the marketing and sales teams and giving interviews.
At this precise moment in time, I’m spending a lot of time hiring. We need great people to keep leading in our field, so finding those people is a big priority.
What sort of person do you look for during the hiring process – are there any particular traits that you think make someone well suited for this line of work?
In my experience, a lot of people at Speechmatics are exceptional superstars: they’re smart, they’re focused, and they have a really good sense for where their research should go next. They can go heads-down and work alone when needed, but they’re also collaborative, with a clear ability to pull together to solve really complex problems.
One of the biggest things I look for in a hire is a desire to learn; someone who wants to take control of their learning. I’m also always looking out for what I can only describe as grit – the tenacity to chase down a problem until it’s fixed.
That’s really important in this work, because what makes machine learning so difficult is that when problems occur, it’s not just like fixing a bug in code. You could have a mathematical problem in your system that’s incredibly difficult to track down, so you have to be tenacious.
Final question – what are your hopes for the future of machine learning?
I think a lot about the nature of machine intelligence. And what it really boils down to – without getting into the world of business aphorisms – is being able to do more with less.
Think of it like this: I could prime somebody walking into an advanced physics lecture with the answers to all the questions the lecturer might ask. They could sit through the lecture and look like they have all the answers, and it would probably seem very impressive. But that’s not real intelligence.
Now, if you had a 12-year-old in the room who’d never studied physics before, but who could take the minimal input of that lecture and somehow grasp all of the concepts the lecturer was covering... that would be true intelligence. And that’s what we’re striving to achieve at Speechmatics. A situation where our models can encounter something they’ve never seen before, and still do something really smart with it.
It’s a pretty lofty ambition – but the team at Speechmatics has never been stronger, so it’s a challenge I look forward to tackling every single day.