Most people would agree that the second phrase is a much more pleasant reading experience.
Punctuation greatly aids readability and, in certain cases, helps to avoid ambiguity. A recent example in the news was the furore over the omission of the so-called "Oxford comma" from the inscription "Peace, prosperity and friendship with all nations" on a recently minted fifty pence coin (with author Philip Pullman calling for it to be "boycotted by all literate people").
How punctuation works in automatic speech recognition
Traditionally, to make punctuation marks appear in transcribed text, the speaker had to pronounce each mark by name, such as “full stop”, “comma” or “question mark”. However, advances in machine learning have enabled the development of automatic punctuation placement. In this section, we’ll look at how Speechmatics handles punctuation in automatic speech recognition, according to Machine Learning Engineer Tom Ash.
When building our language models with advanced punctuation, we don't internalize the rules of The Oxford Style Manual, The Chicago Manual of Style, Fowler's Dictionary of Modern English Usage or any other venerable guide to good language usage. Instead, Speechmatics uses machine learning techniques to filter training data so we can get a good picture of what appropriate punctuation looks like for sentences in the target language, whether that be English, French, Japanese or Turkish, for example.
The steps we took
Our first step in creating advanced punctuation was to create relevant data to learn from. This actually meant undoing a lot of our standard pipelines for ASR, where we normally try to remove punctuation in preparation for language modeling. We spent time refactoring our code to leave the ‘good’ punctuation in and to filter out lines that had either too many or too few punctuation marks.
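As a rough illustration of that filtering step, here is a minimal Python sketch. It is not our production pipeline: the thresholds, the function names and the choice to measure punctuation marks per word are all assumptions made for the example, and only full stops, commas, question marks and exclamation marks are counted.

```python
import re

# Hypothetical thresholds, chosen only for illustration.
MIN_MARKS_PER_WORD = 0.02
MAX_MARKS_PER_WORD = 0.5

# Count only full stops, commas, question marks and exclamation marks.
PUNCTUATION = re.compile(r"[.,?!]")


def keep_line(line, lo=MIN_MARKS_PER_WORD, hi=MAX_MARKS_PER_WORD):
    """Return True if the line has a plausible amount of punctuation."""
    words = line.split()
    if not words:
        return False
    marks = len(PUNCTUATION.findall(line))
    return lo <= marks / len(words) <= hi


def filter_corpus(lines):
    """Keep only the lines that pass the punctuation-density check."""
    return [line for line in lines if keep_line(line)]
```

A line such as "Hello, how are you today?" has two marks across five words and is kept, while a line with no punctuation at all, or one dominated by runs like "!!!", is dropped.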
At first, we wanted to cover all the punctuation marks we could think of. However, we soon realized that any punctuation marks that come in pairs (quotation marks, parentheses) were going to be almost impossible to integrate with a streaming ASR system. In a streaming system, words come out in chunks smaller than a sentence. We would, therefore, have to insert opening quotes, for example, before we even realized we were in a quotation! We then looked at colons and semi-colons and realized that, because they are used so rarely, it was going to be hard to get enough data to train on. In the end, we boiled it down to full stops, commas, question marks and exclamation marks. These were also the punctuation marks that our customers were most interested in and that were most useful for their use cases.
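One way to picture the effect of this decision on training text is a small normalization pass that drops the paired and rare marks and leaves the four supported ones in place. The Python sketch below is an assumption about how such a step could look, not a description of our actual code; the `normalize` name and the exact set of removed characters are hypothetical.

```python
import re

# Paired marks (quotes, brackets) and rare marks (colons, semi-colons).
# This set is illustrative, not exhaustive.
UNSUPPORTED = re.compile(r'[“”"()\[\]:;]')


def normalize(text):
    """Drop unsupported marks, keep . , ? ! and collapse extra spaces."""
    stripped = UNSUPPORTED.sub(" ", text)
    return " ".join(stripped.split())
```

For example, `normalize('He said: "wait (please); now!"')` returns `He said wait please now!`, keeping the exclamation mark and discarding the colon, quotes, parentheses and semi-colon.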
Key decisions made
One of the biggest trade-offs we had to make was between latency and accuracy. The higher the latency, the more context you can take into account when deciding which punctuation mark to output. However, latency is a crucial issue for some of our use cases. We, therefore, had to balance it against accuracy to get the most appropriate system. We worked closely with the team that works on our streaming system to find the right operating point. This required us to take into account the intricacies of how we endpoint our chunks of text when streaming transcripts of live audio.
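The trade-off can be sketched in a few lines of Python. In this toy example, latency is measured in words of lookahead: each punctuation decision is held back until a fixed number of later words has arrived. The `stream_with_lookahead` and `toy_decide` names are hypothetical, and the hard-coded rule merely stands in for a trained model; none of this reflects how our streaming system is actually implemented.

```python
from collections import deque


def stream_with_lookahead(words, decide, lookahead):
    """Yield (word, mark) pairs, delaying each decision until
    `lookahead` later words have arrived or the stream has ended."""
    buffer = deque()
    for word in words:
        buffer.append(word)
        if len(buffer) > lookahead:
            current = buffer.popleft()
            # The remaining buffer is the right-hand context seen so far.
            yield current, decide(current, list(buffer))
    # Flush the tail of the stream with whatever context is left.
    while buffer:
        current = buffer.popleft()
        yield current, decide(current, list(buffer))


def toy_decide(word, right_context):
    """Stand-in for a model: place a comma before the word 'but'."""
    if right_context and right_context[0] == "but":
        return ","
    return ""
```

With `lookahead=0`, the words "it works but slowly" come out immediately but with no comma, because the next word is never visible when the decision is made. With `lookahead=1`, "works" is emitted one word later and gets its comma.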
Another key decision in designing our punctuation system was how much audio context versus textual context to use. The academic literature gives somewhat mixed views on this, and different researchers use different approaches in their systems. Taking both into account brings extra engineering challenges into play that are not present when using only one information source. However, it does give the fullest picture to train a model on.
What punctuation characters do people want in their transcription?
Research with our customers found that most people prefer transcriptions with full punctuation. Punctuation introduces pauses in the correct places and makes the transcript easier to follow. For captioning use cases, it significantly improves the readability of the captions and enables the audience to better understand the context of the audio.