Blog - Technical
Oct 17, 2023 | Read time 5 min

The Future of Media: ASR and Speech Intelligence

Capturing the spoken word - how ASR and AI are transforming media content.
Will Williams, Chief Technology Officer

Automatic speech recognition (ASR) can transcribe and caption video content in real-time. This saves human transcribers a huge amount of labor and shows regulators a commitment to accessible content.
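To make the captioning step concrete, here is a minimal sketch of how timestamped ASR output might be rendered as standard SRT caption cues. The segment schema is an assumption for illustration – every ASR service returns its own format – but the SRT layout itself is standard.

```python
# A minimal sketch of rendering timestamped ASR output as SRT captions.
# The `segments` schema here is an assumption for illustration; real ASR
# services each return their own format. The SRT layout itself is standard.

def srt_timestamp(seconds: float) -> str:
    """Format seconds as an SRT timestamp, HH:MM:SS,mmm."""
    millis = int(round(seconds * 1000))
    hours, millis = divmod(millis, 3_600_000)
    minutes, millis = divmod(millis, 60_000)
    secs, millis = divmod(millis, 1_000)
    return f"{hours:02d}:{minutes:02d}:{secs:02d},{millis:03d}"

def segments_to_srt(segments: list[dict]) -> str:
    """Render [{'start': s, 'end': s, 'text': str}, ...] as SRT cues."""
    cues = []
    for index, seg in enumerate(segments, start=1):
        cues.append(
            f"{index}\n"
            f"{srt_timestamp(seg['start'])} --> {srt_timestamp(seg['end'])}\n"
            f"{seg['text']}\n"
        )
    return "\n".join(cues)

print(segments_to_srt([
    {"start": 0.0, "end": 2.4, "text": "Welcome back to the show."},
    {"start": 2.4, "end": 5.1, "text": "Today we're talking about captions."},
]))
```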

But can it be more? Definitely.

Combine ASR with the latest innovations in artificial intelligence (AI), and it adds value to media companies that far exceeds acceptably accurate captioning.

The changing face of video

Video is fast becoming the preferred format for media consumption. In 2022, demand for video content skyrocketed, accounting for nearly 66% of total internet traffic volume in the first six months of the year – a 24% increase on the same period in 2021.

This growing preference for video content has sparked nuances in the way audiences consume it. Younger cohorts are showing a preference for the short-form videos native to TikTok, which helped the format steal internet traffic volume from more traditional social networking sites last year. Similarly, second screening is on the rise, with viewers consuming media on mobile devices at the same time as watching video on a larger screen.

The result is a devaluing of audio. If you're consuming video on a mobile phone outside of the home, sound might not always be appropriate. And if you're watching two things at once, you can only choose to listen to one.

Audiences that are increasingly used to seeing video content captioned are also more comfortable relying on those captions. Four out of five viewers aged 18-25 use subtitles all or part of the time when watching content, compared to just one in four viewers aged 56-75.

In short, captioning has moved on from an exercise in accessibility to an essential component of video content. Only two-thirds of uncaptioned content is watched until the end, while 91% of videos with captions are viewed in their entirety.

AI media captioning

The behaviors around video content consumption might be changing at pace, but the media industry has been slower to cotton on to the commercial implications of that shift.

Media companies and independent software vendors have seen AI develop to the point at which ASR and automated captioning have become accessible to even the smallest budgets. This, coupled with the belief that media captioning is little more than a compliance requirement, means that for many product teams, ASR has become a begrudged line on their budget: an item sought at the cheapest possible price per hour, or cost per minute, while still providing acceptable accuracy.

In a market where audiences increasingly expect and rely on captions, it's a short-sighted approach. Videos with captions have an increased reach of 16% compared to those without, and that's before we've even addressed the additional applications and functionality that ASR can support when combined with capabilities powered by large language models (LLMs).

Speech Intelligence for media

The development of LLMs has simplified automated captioning, enabling it to achieve better accuracy than human transcribers – saving time and reducing errors in the process. However, the introduction of LLMs and advancements in AI mean transcription is only a small part of a much bigger picture.

This bigger picture – the combination of ASR with AI capabilities – is Speech Intelligence. More than just transcription, Speech Intelligence is the key to unlocking value from the spoken word through a collection of features and capabilities powered by AI. Built on ASR and integrated into media distribution and captioning platforms, it can fuel the growth of both software providers and their end users.

Converting verbal content into text opens up a whole world of audiences to media companies. Translation capabilities mean captions can be provided in multiple languages, in real-time, making content accessible to the broadest possible audience with minimal additional work. Speechmatics' Ursa model, for example, can create live captions in the original spoken language and translate them across 69 supported language pairs.
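As an illustration of how dual-language live captions might be wired together, the sketch below pairs each finalized ASR segment with a translated counterpart. Both stream_transcript_segments and translate are hypothetical stand-ins, not any particular vendor's API.

```python
# A hedged sketch of dual-language live captioning: each finalized ASR segment
# is emitted in the original language and in a translated form. Both helper
# functions are hypothetical stand-ins, not a specific vendor's API.

from typing import Iterator

def stream_transcript_segments(audio_source: str) -> Iterator[dict]:
    """Yield finalized segments like {'start': s, 'end': s, 'text': str}. Hypothetical."""
    raise NotImplementedError("Connect a real-time ASR provider here.")

def translate(text: str, source_lang: str, target_lang: str) -> str:
    """Translate one caption line. Hypothetical."""
    raise NotImplementedError("Connect a translation provider here.")

def live_dual_captions(audio_source: str, source_lang: str, target_lang: str) -> None:
    """Emit original and translated captions as segments are finalized."""
    for seg in stream_transcript_segments(audio_source):
        original = seg["text"]
        translated = translate(original, source_lang, target_lang)
        # A real integration would push these cues to a player or captioning
        # platform; printing stands in for that delivery step.
        print(f"[{source_lang}] {original}")
        print(f"[{target_lang}] {translated}")
```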

These capabilities aren't limited to live audiences, either. Media companies can unlock additional value from their back catalogs with Speech Intelligence. Foreign language captions can be automatically applied to existing content, making it internationally accessible for the first time.

Features like summarization make content work even harder. They allow media companies to automatically create episode summaries, produce show notes, and highlight key insights – in multiple languages – at the click of a button. Similarly, topic detection enables a huge volume of existing content to be quickly tagged in multiple languages, ensuring back catalogs are easily navigable for staff and audiences around the world. Not only does this save end users time, it also expands their total addressable market without the need to grow their team, and enhances audience engagement.
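To show roughly how summarization and topic detection could sit on top of a transcript, here is a hedged sketch that prompts an LLM for show notes and topic tags. llm_complete is a hypothetical stand-in for whichever LLM API a platform integrates; the prompts illustrate the idea, not a vendor interface.

```python
# An illustrative sketch of LLM-powered summarization and topic tagging over a
# finished transcript. `llm_complete` is a hypothetical stand-in for whichever
# LLM API a platform integrates.

def llm_complete(prompt: str) -> str:
    """Send a prompt to an LLM and return its text response. Hypothetical."""
    raise NotImplementedError("Connect an LLM provider here.")

def summarize_episode(transcript: str, language: str = "English") -> str:
    """Produce short show notes for an episode, in the requested language."""
    prompt = (
        f"Summarize the following episode transcript in {language}, "
        "as three short paragraphs suitable for show notes:\n\n"
        f"{transcript}"
    )
    return llm_complete(prompt)

def detect_topics(transcript: str, max_topics: int = 5) -> list[str]:
    """Return up to `max_topics` short tags describing the transcript."""
    prompt = (
        f"List up to {max_topics} topics covered in this transcript, "
        "one short tag per line:\n\n"
        f"{transcript}"
    )
    return [line.strip() for line in llm_complete(prompt).splitlines() if line.strip()]
```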

Sentiment analysis of recorded speech can further increase reach, engagement and accessibility of content. For example, while captioning and transcription make content accessible to the deaf and hard of hearing, sight-impaired audiences can gain extra insight from audio tags giving information on the emotion and sentiment of the speaker.
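One way this might look in practice is a post-processing step that prefixes caption segments with a sentiment label. classify_sentiment is hypothetical; a real platform might use an LLM, a dedicated text classifier, or acoustic emotion detection on the audio itself.

```python
# A small sketch of enriching caption segments with sentiment tags.
# `classify_sentiment` is hypothetical; a real platform might use an LLM,
# a dedicated text classifier, or acoustic emotion detection.

def classify_sentiment(text: str) -> str:
    """Return a coarse label such as 'upbeat', 'neutral' or 'tense'. Hypothetical."""
    raise NotImplementedError("Connect a sentiment model here.")

def tag_captions_with_sentiment(segments: list[dict]) -> list[dict]:
    """Prefix each caption's text with its sentiment label, e.g. '[upbeat] ...'."""
    tagged = []
    for seg in segments:
        label = classify_sentiment(seg["text"])
        tagged.append({**seg, "text": f"[{label}] {seg['text']}"})
    return tagged
```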

With the right speech partner, media captioning platforms can leverage Speech Intelligence to deliver a host of valuable functionality for their customers – from the obvious efficiency gains of reducing manual transcription to more differentiated features that utilize translation, sentiment analysis and summarization.

Foundational accuracy

Speech Intelligence has the potential to increase engagement, platform utility and content reach – but the outputs are only as good as the accuracy of the underlying speech-to-text model these applications are built on. To successfully make use of the spoken word, it needs to be captured accurately and understood fully by ASR models that can reliably record a range of different dialects, accents and demographics. Without highly accurate speech-to-text capabilities, downstream applications will have limited usability or may not work at all for speakers from certain backgrounds.
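A simple way to check that foundation is to measure word error rate (WER) separately for each accent or dialect group in a test set. The sketch below uses the standard edit-distance definition of WER; the grouped test-set shape is an assumption for illustration.

```python
# A sketch of checking foundational accuracy across speaker groups: corpus-level
# word error rate (WER) per accent or dialect bucket. WER is computed as
# (substitutions + insertions + deletions) / reference word count.

from collections import defaultdict

def word_edit_distance(reference: str, hypothesis: str) -> int:
    """Levenshtein distance between word sequences."""
    ref, hyp = reference.split(), hypothesis.split()
    dist = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        prev, dist[0] = dist[0], i
        for j, hyp_word in enumerate(hyp, start=1):
            cur = min(
                dist[j] + 1,                    # deletion
                dist[j - 1] + 1,                # insertion
                prev + (ref_word != hyp_word),  # substitution (or match)
            )
            prev, dist[j] = dist[j], cur
    return dist[-1]

def wer_by_group(samples: list[dict]) -> dict[str, float]:
    """samples: [{'group': 'Scottish English', 'ref': '...', 'hyp': '...'}, ...]"""
    errors: dict[str, int] = defaultdict(int)
    words: dict[str, int] = defaultdict(int)
    for sample in samples:
        errors[sample["group"]] += word_edit_distance(sample["ref"], sample["hyp"])
        words[sample["group"]] += len(sample["ref"].split())
    return {group: errors[group] / max(words[group], 1) for group in errors}
```

In practice, a representative test set would cover the dialects, accents and recording conditions the platform actually serves, so that no speaker group is left with unusable downstream features.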

Stand out with Speech Intelligence

Moving away from a focus on just captioning and transcription gives product leaders scope to create media captioning and distribution platforms that delight their partners and add real value. The spoken word is our primary means of communication, and the applications that can be built on it are infinite.

As video consumption continues to change, Speech Intelligence will be the means by which product teams deliver a platform that is noticeably differentiated. It will ensure your product leads in the market – however that market may change.