Who’s better at reading lips – humans or AI?

HAL 9000: I know that you and Frank were planning to disconnect me, and I’m afraid that’s something I cannot allow to happen.

Astronaut Dave Bowman: Where the hell did you get that idea, HAL?

HAL 9000: Dave, although you took very thorough precautions in the pod against my hearing you, I could see your lips move.

A bit behind schedule – if you don’t recognize that movie dialogue, it’s from Stanley Kubrick’s 2001 – but computers are moving rapidly towards mastering lip-reading.

They’re not there yet. (No wonder: for humans, lip-reading is brutally difficult and highly error-prone.) But new research shows they’re clearly outperforming humans, and improving fast. So if you’ve been captured on CCTV, with or without audio, it might soon be practical to decipher whatever you were talking about.

Lip-reading has been an active focus of AI research for years. Two new papers from Oxford University show just how far it’s come.

In the first, Oxford University computer science researchers trained their LipNet AI system on a painstakingly developed set of 29,000 short video training clips that offered the absolute best possible scenario for lip-reading.

According to Quartz:

Every person was facing forward, well-lit, and spoke in a standardized sentence structure.

The vocabulary was conveniently tiny, too.

The researchers then tested both humans and LipNet on 300 equally “ideal” videos. The humans still served up a woeful 47.7% error rate. (Think you can do better? Try one yourself.)

LipNet, however, only missed 6.6%. Its 93.4% accuracy blew away the previous record of 79.6%.
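To put those figures on a common footing, here is a small sketch of the arithmetic. It uses only the percentages quoted above; the function names are illustrative, and the "relative error reduction" is a derived figure rather than one reported in the article.

```python
def accuracy_from_error(error_pct):
    """Convert an error rate (in percent) to an accuracy (in percent)."""
    return round(100.0 - error_pct, 1)


def relative_error_reduction(old_accuracy_pct, new_accuracy_pct):
    """Fraction of the old system's errors that the new system eliminates."""
    old_error = 100.0 - old_accuracy_pct
    new_error = 100.0 - new_accuracy_pct
    return (old_error - new_error) / old_error


# Figures quoted in the article.
human_accuracy = accuracy_from_error(47.7)        # humans: 47.7% error
lipnet_accuracy = accuracy_from_error(6.6)        # LipNet: 6.6% error
reduction = relative_error_reduction(79.6, 93.4)  # previous record vs LipNet

print(f"Human accuracy:  {human_accuracy}%")
print(f"LipNet accuracy: {lipnet_accuracy}%")
print(f"Errors removed versus the previous record: {reduction:.0%}")
```

In other words, humans got roughly half the test right, and LipNet made about two-thirds fewer mistakes than the previous best system.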

What made it so good? It doesn’t just interpret “spatiotemporal” changes in the mouth’s shape as a human speaks; it also makes predictions based on the entire sentence being spoken. That way, it can use sentence context to improve its guesses. Check out the original paper for complete details.
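Why does sentence context help? Sounds like "b", "p" and "m" look almost identical on the lips, so a word-by-word guesser can easily pick the wrong one. The toy sketch below illustrates the general idea of scoring whole sentences instead of single words. It is not LipNet's actual architecture: the candidate words, the "visual" scores and the word-pair scores are all invented for illustration.

```python
from itertools import product
from math import prod

# Hypothetical visual scores: how well each candidate word matches the
# lip movements at each position in the sentence (made-up numbers).
VISUAL = [
    {"bat": 0.40, "pat": 0.35, "mat": 0.25},
    {"the": 1.00},
    {"dog": 0.55, "dock": 0.45},
]

# Hypothetical scores for how plausible each adjacent word pair is.
BIGRAM = {
    ("pat", "the"): 0.5,
    ("bat", "the"): 0.2,
    ("mat", "the"): 0.1,
    ("the", "dog"): 0.6,
    ("the", "dock"): 0.3,
}


def decode_word_by_word(visual):
    """Pick the best-looking word at each position, ignoring context."""
    return [max(candidates, key=candidates.get) for candidates in visual]


def decode_with_context(visual, bigram, default=0.01):
    """Score every candidate sentence as a whole and return the best one."""
    best_words, best_score = None, -1.0
    for words in product(*visual):
        score = prod(visual[i][word] for i, word in enumerate(words))
        score *= prod(bigram.get(pair, default) for pair in zip(words, words[1:]))
        if score > best_score:
            best_words, best_score = list(words), score
    return best_words


print(decode_word_by_word(VISUAL))          # ['bat', 'the', 'dog']
print(decode_with_context(VISUAL, BIGRAM))  # ['pat', 'the', 'dog']
```

Judged one word at a time, "bat" narrowly wins the first slot. Judged as a sentence, "pat the dog" comes out on top, because the surrounding words make it the more plausible reading.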

So, you’re thinking: well and good, but real-world video isn’t so carefully crafted for lip-reading. What about video that’s a bit more realistic?

For that, we turn to an entirely separate paper, with authors from Oxford University’s Department of Engineering Science and Google’s DeepMind project. This one’s based on 5,000 hours of news and debate video broadcast by the BBC, encompassing many different speakers and nearly 17,500 different words (an average native English speaker knows between 20,000 and 35,000 words). This video dataset is less artificial than LipNet’s, but still generally well-lit, with relatively few shot changes or distractions.

Once they’d trained their new AI system on old BBC video, the researchers set it loose on new BBC programmes. According to New Scientist, it achieved 46.8% accuracy. Not spectacular… but humans could only muster 12.4%.

Both sets of researchers have identified opportunities to improve their systems, and LipNet’s Yannis Assael says he’ll start experimenting with the BBC data. Most agree that bigger, more realistic datasets will help drive progress. It’s just a matter of time before those are built. So don’t be surprised if AI lip-reading improves significantly in the near future.

While advanced surveillance is clearly one application for this work – at least where long-range microphones can’t do even better – it’s not the only one. Aside from automating caption generation, it may improve hearing aids and enable better speech recognition in noisy places.

Still, as Jack Clark writes in his Import AI newsletter, in the future if you’ve got something revolutionary to say, you may need to wear a mask.