CSI User Experience Conference 2012 Part 3 – Live subtitles & voice recognition technology 

It’s clear that much of the frustration among UK TV viewers surrounds live subtitles, so voice recognition software and the respeaking process used to produce them were among the topics debated in a panel on the User Experience following Ofcom’s presentation.

Deluxe Media’s Claude Le Guyader made some interesting points:

In the case of live subtitling…it’s a lot of pressure on the person doing the work, the availability of the resource and the cost, it all means that the event of voice recognition was embraced by all the service providers as a service to palliate, to that lack of resource (in this case – stenographers). As we know voice recognition started, it’s not perfect, still not perfect, I don’t know if you have seen on your iPhone, it’s quite funny, with a French accent it’s even worse! (This is in reference to Siri which is not used so far as I am aware to create live subtitles but it is part of the same technology used – voice recognition). With the voice recognition you need to train the person. Each person (in this case – a subtitler or respeaker) needs to train it, now it’s moved on and there are people using voice recognition very successfully as well, so it’s evolving but the patience, you know, it does run out when you are round the table again years are discussing the same issue, but it’s not a lack of will, I think it’s just a difficult thing to achieve, because it involves so many different people.

Voice technology does seem to be constantly evolving, and I think the fact that it is being implemented in more and more products (the iPhone and Siri is a great example) is a positive thing. It increases consumer awareness of what this technology can do, and consequently I think people will expect this technology to work. There are numerous ways voice technology is being used. To move away for a moment from live subtitling and from summarising points made at the conference, but staying within a broadcast TV context, another use is illustrated by Google TV. In the video below you can see voice recognition technology allowing a viewer to navigate the TV:

Voice recognition technology is also used to create the automatically generated captions on YouTube videos. At the moment this illustrates the technology’s limitations, as I am sure most readers here are aware: the captions created this way are completely inaccurate most of the time and therefore useless. I think we can all agree that respeaking to produce live subtitles creates errors but currently produces a much better result than a machine. Google recently added automatic captioning support for six new languages. Investment in this technology, even if it is currently imperfect, shouldn’t be discouraged, because surely this is the only way for the technology to improve:

A new research paper out of Google describes in some detail the data science behind the company’s speech recognition applications, such as voice search and adding captions or tags to YouTube videos. And although the math might be beyond most people’s grasp, the concepts are not. The paper underscores why everyone is so excited about the prospect of “big data” and also how important it is to choose the right data set for the right job….No surprise, then, it turns out that more data is also better for training speech-recognition systems…The real key, however, as any data scientist will tell you, is knowing what type of data is best to train your models, whatever they are. For the voice search tests, the Google researchers used 230 billion words that came from “a random sample of anonymized queries from google.com that did not trigger spelling correction.” However, because people speak and write prose differently than they type searches, the YouTube models were fed data from transcriptions of news broadcasts and large web crawls…This research isn’t necessarily groundbreaking, but helps drive home the reasons that topics such as big data and data science get so much attention these days. As consumers demand ever smarter applications and more frictionless user experiences, every last piece of data and every decision about how to analyze it matters.


Following on from this example, the natural question to ask is: will Apple integrate its voice technology Siri into Apple TV? It has been rumoured but not yet confirmed. (Interestingly, it is already confirmed that Siri is being added to Chevrolet cars next year.) If there is competition between companies for innovation using this technology, all the better. I found an interesting blog post pondering the future of Siri for Apple here, although this blogger thinks that Google Voice is better. Voice technology is also being used in the world of translation. Last month Microsoft gave an impressive demo of voice recognition technology translating a speaker’s English speech into Chinese text and then speaking it back to him in Chinese in his own voice:

Skip to 4.22 to be able to read captions from this presentation.

I hope all of these examples will contribute in some small way to an improvement in live subtitling. Mark Nelson would disagree with me. He wrote an article explaining why he believes that a peak has been reached and that greater reliance on voice technology could lead to the deaf and hard of hearing being left behind.

What do you think? Do you think live subtitling will improve as a result of voice recognition technology or do you have another view? Please leave your comments below.