Jan 19, 2023 | Read time 8 min

Q&A with Red Bee Media’s Tom Wootton: A First-Hand Account of Speech-to-Text in Media Captioning

Speechmatics' Ricardo Herreros-Symons sits down with Tom Wootton of Red Bee Media to talk about how the two companies work together to deliver closed captions, subtitles and audio description for streaming services and the broadcast television industry.
Ricardo Herreros-Symons, Chief Strategy Officer

In this article, Speechmatics’ Ricardo Herreros-Symons interviews Tom Wootton of Red Bee Media about how the two companies work together to deliver closed captions, subtitles and audio description for streaming services and the broadcast television industry. They also discuss Red Bee Media’s latest contract with BT Sport and how, together with Speechmatics, they’ll provide over 200 hours of live captioning a month, helping BT Sport become one of the most accessible sports channels in the world.

RH-S:

Can you tell us about yourself, about Red Bee Media, and about how the company integrates with Speechmatics’ speech recognition technology?

TW:

My name’s Tom. I head up the product area for our Access and Media Management services and Broadcast Services within Red Bee Media. Red Bee Media provides media services, primarily to broadcasters, but also to streaming services, content owners, government services and the like. They all come within the ambit of what Red Bee Media does.

We deal with ‘Supply’, which is how you get content from one place and show it successfully in another, and ‘Enrich’, which is how you add metadata to help people discover programs and, crucially, how people can access those programs if they have sensory impairments, whether they are deaf or hard of hearing, or blind or partially sighted. That's typically subtitles, closed captions, audio description and sign language translation. I should also say these services can provide considerable assistance to people with cognitive impairments. Then finally, 'Show' is how you present content successfully to audiences, whether that's broadcast television or streaming.

RH-S:

That’s where Speechmatics technology comes in...

TW:

We knew artificial intelligence and machine-learning-driven technologies would change how we produce subtitles and closed captions. Historically, when I first started doing subtitling, people just typed very quickly. It was a very manual process. Then we moved to speech recognition, where you have individual “re-speakers” who speak all the punctuation and all the speaker changes aloud. They are very, very skilled people. These are people literally sitting in a studio booth getting low-latency audio from a player or encoding system, listening to what's being said and repeating it with punctuation, switching between different colors for different speakers. It's incredible to listen to.

Accuracy obviously matters because we need to make sure the information is being conveyed correctly. Historically, the technology was called speaker-dependent speech recognition, because you would train the model to recognize that one person's voice very accurately.

Now, of course the big step change – which we saw coming in with companies like Speechmatics – was the ability to take automatic speech recognition and make it speaker independent and apply it to the content of the program rather than having “re-speakers.” This brings down costs and enables accessibility to be used across a far wider range of content than ever before.

RH-S:

With the sheer amount of content now, would the old ways be impossible to manage financially?

TW:

Speech recognition enables more accessibility. That’s always how we’ve seen it. We've had to work very hard with the likes of Speechmatics, but also internally, because you cannot just throw technology at that problem. As our Head of Technology said yesterday in a conversation about plug-and-play technologies: “Plug into what? Play what?” Those are crucial questions about how you take technology like automatic speech recognition and turn it into a service that can be used on broadcast television.

RH-S:

And currently, speech recognition is used in two main areas within broadcast: live and pre-recorded?

TW:

You're absolutely right. Sports and news will typically be live, but there's a vast amount of content out there that will be pre-recorded: drama, documentaries, things like that. Pre-recorded content is expected to be of extremely high quality because you have time to get it right; everyone expects that. So, we've used batch processing for a while to really speed up the process of getting to a final file that can be used.

Where you end up is effectively using that automatic batch process and then quality-controlling the output with a person. That improves efficiency and productivity, but you still have a person in there to make sure it's all coming out correctly.
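For readers who want a feel for that workflow, here is a minimal, hypothetical sketch of the "batch ASR, then human QC" pipeline Tom describes. The transcription and review functions below are placeholders we have invented for illustration, not Speechmatics' or Red Bee Media's actual APIs.

```python
# Illustrative sketch only: transcribe_batch and queue_for_review are
# placeholder names, not any specific vendor's API.
from dataclasses import dataclass

@dataclass
class CaptionSegment:
    start: float  # seconds from the start of the programme
    end: float
    text: str

def transcribe_batch(audio_path: str) -> list[CaptionSegment]:
    """Placeholder for a batch speech-to-text call returning timed segments."""
    raise NotImplementedError("swap in your ASR provider's batch API here")

def queue_for_review(segments: list[CaptionSegment]) -> None:
    """Placeholder: hand the draft captions to a human subtitler for correction."""
    raise NotImplementedError("swap in your captioning workflow tooling here")

def caption_prerecorded(audio_path: str) -> None:
    # 1. The machine produces a fast first draft of the caption file.
    draft = transcribe_batch(audio_path)
    # 2. A person still checks and corrects the draft before delivery,
    #    which is how the quality bar for pre-recorded content is met.
    queue_for_review(draft)
```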

RH-S:

Tell us about BT Sport and Red Bee and how that involves Speechmatics...

TW:

It really is fantastic. It's a UK-broadcast first. It relies on the customer, a broadcaster, who is really committed to providing more accessibility and will work with you to get there. And it relies on our internal teams, and working with Speechmatics, to get to a place where we can provide the service that they want.

We’re using automatic speech recognition to drive ARC, our automatic captioning product, to provide over 200 additional hours of accessible BT Sport programming a month, which is absolutely incredible. It means 200 hours of content that wouldn't otherwise have been accessible now is. It's a real milestone. But what you don't see with milestones is how much work has gone into getting there.

RH-S:

What do you see as the next change in speech recognition?

TW:

In terms of ASR specifically, we want to see improvements in speaker differentiation. It’s really important that you can tell when one politician is speaking as opposed to another. We've got quite a high bar for speaker differentiation. I think the technology is close, but it needs to be much more reliable at telling speakers apart.

RH-S:

What do you see as the next frontier in terms of the difficult use cases where speech isn't quite working?

TW:

Two consistent answers: music and quizzes. Quizzes are an interesting use case because there are a lot of sounds in quizzes to indicate wrong answers, right answers and things like that. Timing is also a challenge, as is working out who’s speaking and when. And you've got people sometimes giving single-word answers; if you get a misrecognition on a single word, it's very difficult for a deaf or hard of hearing person to tell whether that's accurate or not.

With music, we do a lot of work for festivals and things like that. We did quite a lot for Eurovision, including a lot of lyrics preparation, so we can make sure the lyrics we show are the actual lyrics.

Ultimately, you can't have a situation where some things are being conveyed less well than others. In these particularly hard use cases, there's a lot of room still for skilled subtitlers to be working side-by-side with automation.

RH-S:

Do you ever think we'll get to the point where we change how we train presenters to cope best with speech recognition itself?

TW:

Here are two examples, one on each side of the coin. There was a Los Angeles news station a few years ago that was looking to trial speech recognition. They had a Latino weather presenter. The output wasn't coming out very well for him, and they asked him to speak differently. That seems to me a very bad example of what we're talking about. You're asking people to adapt quite important aspects of who they are, and the community they may belong to, in order to make automation work. That seems suboptimal, to put it mildly.

On the other hand, I think we all have experiences where we try to ensure we're speaking clearly so that we're understood by all sorts of different audiences. There may well be appropriate times for at least that level of awareness. Equally, you don't want to take away the flavor of things. You wouldn't want sports commentary delivered in a way that was flattened and devoid of emotion. We need to find a balance in those areas.

RH-S:

Thanks very much for speaking with us.

TW:

It’s been a pleasure. I’m genuinely proud of what Red Bee and Speechmatics are achieving. It’s a paradigm shift in how accessibility is being provided.

Book a meeting today with a Media and Event Captioning specialist and we’ll give you the tools you need to differentiate your Media and Event Captioning in the market and help you deliver on constantly evolving customer expectations.

This article is an edited and abridged version of a LinkedIn Live conversation. You can watch the full video here.