Footnotes

* Our quoted percentages are relative reductions in word error rate (WER) across 21 open-source test sets, comparing two systems. To illustrate, a gain of 10% means that on average 1 in 10 errors is removed. WER is calculated by dividing the number of errors by the number of words in the reference, so a lower number indicates a better system. Ursa's enhanced model is used for the comparisons unless stated otherwise. A worked example of these definitions follows these footnotes.

† Replicate our experiment showing that Speechmatics surpasses human-level transcription on the Kincaid46 dataset using this Python notebook. It uses our latest API, so it requires an API key, which can be generated on our portal.

‡ We would like to extend a special thanks to FluidStack, who provided the infrastructure and a month of GPU training time to make this possible.

** We are aware of the limitations of WER; one major issue is that errors involving misinformation are given the same weight as simple spelling mistakes. To address this, we normalize our transcriptions to reduce penalties for differences in contractions or spelling between British and American English that humans would still consider correct. Going forward, we intend to adopt a metric based on NER.

†† Tests conducted in January 2023 against Amazon Transcribe, Microsoft Azure Video Indexer, Google Cloud Speech-to-Text (latest_long model), and OpenAI's Whisper (large-v2 model), compared to Ursa's enhanced model in the Speechmatics Batch SaaS.

‡‡ Our quoted numbers for Whisper large-v2 differ from the paper[7] for a few reasons. First, we found that the Whisper models tend to hallucinate, causing increases in WER due to many insertion errors, and exhibit non-deterministic behavior. Second, though we endeavored to minimize this, our preparation of some of these test sets may differ, but the numbers in the tables show consistent comparisons.
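To make the WER and relative-reduction definitions above concrete, here is a minimal sketch in Python. This is not our evaluation code: the function names are illustrative, the edit-distance routine is the standard dynamic-programming formulation, and the tiny EQUIVALENTS table is a toy stand-in for the fuller normalization described in the ** footnote.

    from __future__ import annotations

    # Toy normalization table: a stand-in for the fuller normalization in the
    # ** footnote (contractions and British/American spelling variants).
    EQUIVALENTS = {"colour": "color", "normalise": "normalize", "cannot": "can not"}

    def normalize(text: str) -> list[str]:
        # Lowercase, split into words, and map each word through the table.
        words: list[str] = []
        for word in text.lower().split():
            words.extend(EQUIVALENTS.get(word, word).split())
        return words

    def word_edit_distance(ref: list[str], hyp: list[str]) -> int:
        # Standard dynamic-programming Levenshtein distance over words:
        # the minimum number of substitutions, insertions and deletions.
        prev = list(range(len(hyp) + 1))
        for i, r in enumerate(ref, start=1):
            curr = [i] + [0] * len(hyp)
            for j, h in enumerate(hyp, start=1):
                curr[j] = min(
                    prev[j] + 1,             # deletion (hyp missed a ref word)
                    curr[j - 1] + 1,         # insertion (extra word in hyp)
                    prev[j - 1] + (r != h),  # substitution, free if words match
                )
            prev = curr
        return prev[-1]

    def wer(reference: str, hypothesis: str) -> float:
        # Errors divided by the number of words in the (non-empty) reference.
        ref, hyp = normalize(reference), normalize(hypothesis)
        return word_edit_distance(ref, hyp) / len(ref)

    def relative_wer_reduction(wer_baseline: float, wer_new: float) -> float:
        # A value of 0.10 means 1 in 10 of the baseline's errors are removed.
        return (wer_baseline - wer_new) / wer_baseline

    print(wer("We cannot normalise the colour", "we can not normalize the color"))  # 0.0
    print(relative_wer_reduction(0.10, 0.09))  # ~0.1, i.e. a 10% relative gain

Note how normalization makes the WER in the first example zero: a human would consider those two transcripts equally correct, so neither system should be penalized.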
References

[1] Kincaid, Jason. "Which Automatic Transcription Service Is the Most Accurate? - 2018." Medium, 5 Sept. 2018. Accessed 24 Feb. 2023.

[2] Hoffmann, Jordan, et al. "Training compute-optimal large language models." arXiv preprint arXiv:2203.15556 (2022).

[3] Panayotov, Vassil, et al. "Librispeech: an ASR corpus based on public domain audio books." 2015 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2015.

[4] Del Rio, Miguel, et al. "Earnings-22: A Practical Benchmark for Accents in the Wild." arXiv preprint arXiv:2203.15591 (2022).

[5] Kendall, Tyler, and Charlie Farrington. "The Corpus of Regional African American Language." Version 6 (2018): 1.

[6] Ardila, Rosana, et al. "Common Voice: A massively-multilingual speech corpus." arXiv preprint arXiv:1912.06670 (2019).

[7] Radford, Alec, et al. "Robust speech recognition via large-scale weak supervision." arXiv preprint arXiv:2212.04356 (2022).

[8] Papineni, Kishore, et al. "BLEU: a method for automatic evaluation of machine translation." Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics. 2002.

[9] Wang, Changhan, Anne Wu, and Juan Pino. "CoVoST 2 and massively multilingual speech-to-text translation." arXiv preprint arXiv:2007.10310 (2020).
Author

John Hughes
Acknowledgements

Aaron Ng, Adam Walford, Ajith Selvan, Alex Raymond, Alex Wicks, Ana Olssen, Anand Mishra, Anartz Nuin, André Mansikkaniemi, Andrew Innes, Baskaran Mani, Ben Gorman, Ben Walker, Benedetta Cevoli, Bethan Thomas, Brad Phipps, Callum Hackett, Caroline Dockes, Chris Waple, Claire Schaefer, Daniel Nurkowski, David Agmen-Smith, David Gray, David Howlett, David MacLeod, David Mrva, Dominik Jochec, Dumitru Gutu, Ed Speyer, Edward Rees, Edward Weston, Ellena Reid, Gareth Rickards, George Lodge, Georgios Hadjiharalambous, Greg Richards, Hannes Unterholzner, Harish Kumar, James Gilmore, James Olinya, Jamie Dougherty, Jan Pesan, Janani T E, Jindrich Dolezal, John Hughes, Kin Hin Wong, Lawrence Atkins, Lenard Szolnoki, Liam Steadman, Manideep Karimireddy, Markus Hennerbichler, Matt Nemitz, Mayank Kalbande, Michal Polkowski, Neil Stratford, Nelson Kondia, Owais Aamir Thungalwadi, Owen O'Loan, Parthiban Selvaraj, Peter Uhrin, Philip Brown, Pracheta Phadnis, Pradeep Kumar, Rajasekaran Radhakrishnan, Rakesh Venkataraman, Remi Francis, Ross Thompson, Sakthy Vengatesh, Sathishkumar Durai, Seth Asare, Shuojie Fu, Simon Lawrence, Sreeram P, Stefan Fisher, Steve Kingsley, Stuart Wood, Tej Birring, Theo Clark, Tom Young, Tomasz Swider, Tudor Evans, Venkatesh Chandran, Vignesh Umapathy, Vyanktesh Tadkod, Waldemar Maleska, Will Williams, Wojciech Kruzel, Yahia Abaza.

Special thanks to Will Williams, Harish Kumar, Georgina Robertson, Liam Steadman, Benedetta Cevoli, Emma Davidson, Edward Rees and Lawrence Atkins for reviewing drafts.
Citation

For attribution in academic contexts, please cite this work as

BibTeX citation
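A minimal BibTeX entry along these lines; the author and year come from this page, while the title and remaining fields are assumptions to be checked against the original post:

    % BibTeX sketch: author and year are taken from this page; the title,
    % howpublished and note fields are assumptions, not confirmed values.
    @misc{hughes2023ursa,
      author       = {Hughes, John},
      title        = {Introducing Ursa from Speechmatics},
      year         = {2023},
      howpublished = {Speechmatics blog},
      note         = {https://www.speechmatics.com}
    }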