Interestingly, many of the things people told us they care about in an ASR provider (languages, accents, etc.) are really a dataset problem!
With that aside, we moved on to discussing the metrics people use to differentiate between providers, and how we could build a tool capable of assessing them fairly. The idea is that, once presented with the plain facts, users can easily choose the provider that best suits them and the data they care about.
From these conversations, we learned that while word error rate is used almost exclusively as the standard measure of ASR performance, users are starting to care about much more than just the accuracy of the words. The overall readability of the transcript, the quality of punctuation and capitalisation, and the speaker diarisation were just some of the issues raised.
These factors cannot be fairly assessed by counting insertions, deletions and substitutions, and a tool that can present all of these data points is clearly needed in the community.
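To make that concrete, here is a minimal sketch of how WER is typically computed: a word-level edit distance between reference and hypothesis. This is not our tool or anyone's production code, and the normalisation choices (lower-casing and stripping punctuation before comparison) are assumptions about common practice, but they illustrate exactly why WER says nothing about readability.

```python
def wer(reference: str, hypothesis: str) -> float:
    """Return WER = (substitutions + deletions + insertions) / reference length."""
    # Typical (assumed) normalisation: lower-case and drop punctuation before
    # comparing, which is why WER cannot reward good punctuation or casing.
    ref = [w.strip(".,?!").lower() for w in reference.split()]
    hyp = [w.strip(".,?!").lower() for w in hypothesis.split()]

    # Dynamic-programming edit distance between the two word sequences.
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i          # i deletions
    for j in range(len(hyp) + 1):
        d[0][j] = j          # j insertions
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            sub_cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(
                d[i - 1][j] + 1,             # deletion
                d[i][j - 1] + 1,             # insertion
                d[i - 1][j - 1] + sub_cost,  # substitution (or match)
            )
    return d[len(ref)][len(hyp)] / max(len(ref), 1)


if __name__ == "__main__":
    # Both hypotheses score 0.0, despite very different readability.
    print(wer("Hello, Dr. Smith. How are you?", "hello dr smith how are you"))
    print(wer("Hello, Dr. Smith. How are you?", "Hello, Dr. Smith. How are you?"))
```

In this sketch, a perfectly punctuated transcript and an unpunctuated one get identical scores, which is precisely the gap the other metrics are meant to fill.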
Our motivation for working on this particular problem stemmed from discussions between James Dobson, Will Williams and myself around the frustration that multiple groups essentially have to replicate the same work to compare providers. Datasets aside, the evaluation code should be more or less identical, and it is a complete waste of resources for companies to reproduce the same code internally.
Creating a tool that evaluates in an agreed-upon, standardised way saves time and facilitates reproducible and fair comparisons.
We intend to keep contributing to this project, and to the discussion in general, off the back of this event, and we may even have some exciting news to come in the future, so stay tuned! In the meantime, you can try our real-time demo here!
Tom Wordsworth, Speechmatics