Applause can serve a similar function. It signals the intended interpretation of the words spoken – without it, this meaning is lost or obscured.
What does this mean for our technology and mission?
True understanding requires more than speech-to-text
Our goal is one of understanding.
This is deliberately stronger than merely transcribing. Writing down the words spoken is, of course, a vital and valuable process, but it is not the end point.
We want to capture and harness as much meaning as possible from the spoken word and the media that contains speech. Given the above examples, we clearly cannot stop at simply writing down the words uttered, however accurately. We need to capture the full richness of information contained within audio, and aim for true understanding.
Why?
Well, beyond this being an interesting challenge and a noble goal in itself, there are three reasons:
1) Technology that understands us
Technology, and in particular AI, needs to be able to understand what we want. With LLMs and the rise of more conversational technology, it's vital that we are able to communicate in our ‘highest bandwidth’ mode, and that is speech. This is our default way of communicating ideas, wants, desires, needs. In order for our technology to be as helpful as possible, it too needs to be able to understand and interpret what we’re asking of it.
2) Technology to help us understand each other
Technology that understands everyone can also be used to help us understand each other. Global language barriers aren’t going anywhere, and if technology can understand what one person is saying, then it can help convey this meaning to others. Different languages may always exist, but they need not be a barrier in the future.
3) Speech as a source of value
A huge amount of collective knowledge is never written down. Since we talk by default to communicate, much insight and understanding may never make it onto the page. Even when it does, a lot of the context and meaning is lost by writing down only the words uttered. With deep understanding and record keeping of conversations, a brand new, unlimited and growing pool of insight and analytics can be unlocked.
All three reasons are worthy aims.
But where does one start with this?
Well, we have already launched Sentiment, which begins to understand positive, neutral and negative statements in speech, and now we are launching Audio Events.
Audio Events
By extending our ASR capabilities to include non-speech sound detection, we are opening new possibilities for media analytics and accessibility.
This feature can detect a variety of sounds that traditional speech recognition technologies miss, such as music, laughter, and applause, enhancing both the accuracy and richness of media transcripts.
Utilizing advanced AI, the Audio Events feature analyzes audio and video content to identify and label specific non-speech sounds. This capability is integrated seamlessly into our existing ASR API and is available for early access testing.
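As a sketch of how detected events might surface alongside a transcript, consider the snippet below. The field names used here – `audio_events`, `type`, `start_time`, `end_time` – are illustrative assumptions for the sake of the example, not the documented API schema:

```python
# Hypothetical sketch: the response shape and field names below are
# illustrative assumptions, not the documented Audio Events API schema.

def extract_audio_events(result: dict) -> list[tuple[str, float, float]]:
    """Pull (label, start, end) tuples for non-speech events
    from a transcription result shaped like the sample below."""
    return [
        (e["type"], e["start_time"], e["end_time"])
        for e in result.get("audio_events", [])
    ]

# A made-up result combining spoken-word output with detected events.
sample = {
    "results": [],  # word-level transcription entries would go here
    "audio_events": [
        {"type": "music", "start_time": 0.0, "end_time": 4.2},
        {"type": "applause", "start_time": 31.5, "end_time": 34.0},
    ],
}

events = extract_audio_events(sample)
for label, start, end in events:
    print(f"[{label.capitalize()}] {start:.1f}s to {end:.1f}s")
```

Markers such as `[Music]` or `[Applause]` can then be interleaved with the spoken-word captions at the matching timestamps.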
[Music] to our ears – the many benefits of Audio Events
The reliable identification of these events offers a number of benefits across industries:
Enhanced Media Accessibility
By providing more detailed captions that include non-speech sounds, content becomes more accessible to the deaf and hard of hearing community, fulfilling both legal and ethical standards for inclusivity.
Improved Content Analysis
Media companies can gain deeper insights into their content, understanding not just what is said, but also the context and emotional reactions conveyed through other sounds.
Efficiency in Operations
For industries such as EdTech and CCaaS, this feature helps streamline operations by automating the detection and captioning of audio cues, reducing manual labor and associated costs.
While primarily designed for the media sector, the Audio Events feature is valuable for educational technology as it enhances e-learning platforms with richer and more interactive audio descriptions.
Audio Events is hugely valuable today: it can provide additional captions for the 99% of content that has no audio descriptions. But it also represents a bolder long-term aim.
A direction of travel and an opportunity
The launch of the Audio Events feature is just the beginning.
We are committed to continuous innovation and are already exploring further enhancements, including the ability to identify more complex sound environments and integrate additional audio classifications.
For example, in the video below we’ve created a proof of concept, leveraging the power of LLMs to be able to provide more descriptive audio captions to any media.
All captioning in this video is AI generated, and whilst not perfect, you can see how descriptive and useful the captions are: