Accuracy Team Lead John Hughes explains how and why Word Error Rate, as an accuracy measure, is outdated and often misaligned with human judgment.
John Hughes, Accuracy Team Lead
Word Error Rate
Word Error Rate (WER) has been the primary accuracy metric used to evaluate Automatic Speech Recognition (ASR) systems for decades. Engineers use it to make research decisions and customers use it to evaluate vendors. Multi-million-pound deals can hinge on small differences between vendors. But WER has many flaws, the most notable being the weight it gives to certain kinds of mistake.
Is it important to penalize an extra um, a missing hyphen, or an alternative spelling? Or how about a difference in written numerical formatting (1,000 vs 1000 or eleven vs 11)? These examples increase WER but do not affect meaning in the slightest.
Then again, how about omitting the word ‘suspected’ before ‘murderer’? Or predicting a wrong word that changes the sentiment? An ideal metric would know these are serious errors and score the system appropriately.
We have another blog on this topic, looking at the novel work Speechmatics is doing to find a more meaningful metric, one aligned with human perception of quality, by harnessing the power of large language models.
How to calculate Word Error Rate (WER)
To calculate WER, you must first align the reference and recognized transcripts. This is done by minimizing the Levenshtein distance (or edit distance) between the two texts. You can then count the number of substitutions (S), insertions (I), and deletions (D). The WER is simply the total number of errors divided by the number of words in the reference (N): WER = (S + I + D) / N. If every word is correct, the WER is zero, and that’s what developers aim to drive towards with every research decision they take.
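As a concrete illustration, here’s a minimal Python sketch of that calculation. It assumes both transcripts are already normalized (lower-cased, punctuation stripped) and split on whitespace, and it’s meant to show the mechanics rather than serve as production code.

```python
def word_error_rate(reference: str, recognized: str) -> float:
    """WER = (S + I + D) / N, computed via a word-level Levenshtein alignment."""
    ref, hyp = reference.split(), recognized.split()

    # d[i][j] = minimum number of edits needed to turn the first i reference
    # words into the first j recognized words.
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(1, len(ref) + 1):
        d[i][0] = i                                # i deletions
    for j in range(1, len(hyp) + 1):
        d[0][j] = j                                # j insertions
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            sub = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j - 1] + sub,   # match or substitution
                          d[i - 1][j] + 1,         # deletion
                          d[i][j - 1] + 1)         # insertion

    return d[len(ref)][len(hyp)] / len(ref)
```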
To illustrate how you align a reference and recognized transcript, here’s an example:
Here, the ASR system omits the information about seeing the dog. There are two obvious ways to align the transcripts, but they lead to different WERs. In Alignment 1, "and I saw a cat" is aligned with the end of the reference, which leaves five deletions ("and I saw a dog") and a WER of 50%. Alignment 2 incurs an extra error because "dog" is substituted with "cat", giving a WER of 60%. By minimizing the edit distance, you ensure you get the alignment with the minimal WER (in this case, Alignment 1).
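Assuming the transcripts in the illustration are the reference "and I saw a dog and I saw a cat" and the recognized output "and I saw a cat" (a reconstruction consistent with the counts above), the sketch from the previous section picks the cheaper alignment automatically:

```python
reference  = "and i saw a dog and i saw a cat"  # 10 reference words
recognized = "and i saw a cat"                  # the dog clause is omitted

# Minimizing the edit distance selects Alignment 1: 5 deletions out of 10 words.
print(word_error_rate(reference, recognized))   # 0.5, i.e. a WER of 50%
```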
Downfalls of WER
As mentioned above, the main pitfall is that WER weights errors poorly, giving trivial mistakes the same importance as serious ones.
Minor problems such as misspellings of names or the wrong number of repeated words – say when someone speaks with a stutter – are penalized just as heavily as errors that cause misinformation.
In addition, it is very difficult to get a test set that is free of mistakes from human transcribers. This introduces an intrinsic floor, which makes it impossible for a system to reach 0% WER. On top of outright mistakes, transcribers often write the transcript to a different specification. After all, there are many ways to write the same thing. Here are two examples:
Numeric entities – can be in written or spoken form and there are differences in formatting such as commas in large numbers. Also, there are different ways of saying currency such as 10 pounds, 10 quid and a tenner. All are the same and will be £10 in written form.
Dates – there are countless ways of writing them, including different orderings and shortened forms.
If the reference formatting does not match the ASR output, the WER can increase significantly. It can be trivial to normalize some of these issues, but the problem is harder when comparing across vendors because each has its own formatting conventions.
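To give a flavor of what that normalization can look like, here’s a toy sketch. The rules and names below are purely illustrative and are not how Speechmatics, or any vendor, actually normalizes text:

```python
import re

# A deliberately tiny, made-up mapping from spoken to written number forms.
SPOKEN_NUMBERS = {"ten": "10", "eleven": "11"}

def normalize_numbers(text: str) -> str:
    # "1,000" -> "1000": drop thousands separators inside digit groups.
    text = re.sub(r"(?<=\d),(?=\d{3})", "", text)
    # "eleven" -> "11": map a handful of spelled-out numbers.
    return " ".join(SPOKEN_NUMBERS.get(word.lower(), word) for word in text.split())

print(normalize_numbers("It cost 1,000 pounds, not eleven"))
# -> "It cost 1000 pounds, not 11"
```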
Furthermore, in order to get an accurate alignment, you must strip all punctuation and convert characters to lower case. Punctuation and capitals at the start of sentences or for proper nouns are essential for readability, but WER doesn’t take any of that extra information into account.
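A pre-scoring clean-up step might look something like the hypothetical helper below; note how much readability information it simply throws away:

```python
import string

def strip_for_scoring(text: str) -> str:
    """Lower-case and remove punctuation before alignment: capitals, commas,
    full stops and even apostrophes are all discarded."""
    return text.lower().translate(str.maketrans("", "", string.punctuation))

print(strip_for_scoring("Yes, Dr. Smith arrived."))  # "yes dr smith arrived"
```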
Let’s see a particularly bad example of how WER can be completely misaligned with reality.
To the human eye, there doesn’t appear to be much wrong with this output. Yet it has a high WER of 125%! There are two substitution errors (‘I’ for ‘I’m’ and ‘5’ for ‘five-year-old’) and three insertions (‘am’, ‘year’ and ‘old’). The WER is greater than 100% because there are more errors than words in the reference.
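Feeding a hypothetical pair of transcripts that matches those error counts into the earlier sketch reproduces the score (the final word ‘boy’ is assumed here purely for illustration):

```python
reference  = "i'm a five-year-old boy"  # 4 reference words
recognized = "i am a 5 year old boy"    # 2 substitutions + 3 insertions

print(word_error_rate(reference, recognized))  # 1.25, i.e. a WER of 125%
```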
This second example has only one substitution error, so the WER is 20%. As you can see, the error is much more severe since it causes misinformation about the status of the message, yet the WER is lower than in the first example.
Wouldn’t it be great if we had a way of evaluating ASR that wasn’t affected by different ways of writing the same thing, and one that weighted errors in line with human judgment?
Interested to discover where we’re going next in solving this problem? Learn how we’re using a novel metric that aligns with human judgment by harnessing the power of large language models.