Imagine sitting in a soundproof room, talking to somebody, with a bag of potato chips lying on the floor.
Sound waves make every object they reach vibrate minutely, and that chip bag gives off enough information for an ordinary video camera to pick up and decipher your conversation, using only visual information.
MIT, Adobe and Microsoft researchers have done just that: they’ve created an algorithm that can reconstruct sound, including intelligible speech, from the tiny vibrations it creates in objects.
One of many experiments: the researchers filmed and deciphered the vibrations – quivers too subtle for the naked eye to discern – that reciting “Mary Had a Little Lamb” caused in a potato chip bag across a room, 15 feet away, with the video camera stationed behind a soundproof window.
Abe Davis, a graduate student in electrical engineering and computer science at MIT and first author on the researchers’ paper, told the Washington Post that the technology has limitations.
It might not, in fact, lead to better sound reconstruction than other methods already in use.
That means that filming vibrating snack bags – or plants, or earbuds, all of which the researchers experimented on – won’t necessarily be the next tool in the NSA’s arsenal, he said:
Big brother won't be able to hear anything that anyone ever says all of a sudden.
Davis is himself interested more in new kinds of imaging. But he’s not ruling out use of the technology by law enforcement or forensics, either:
It is possible that you could use this to discover sound in situations where you couldn't before. It's just adding one more tool for those forensic applications.
For the algorithm to work well, the researchers had to capture video at a frame rate – i.e., frames per second (fps) – higher than the frequency of the audio they were trying to decipher.
Thus, they sometimes captured video at 2,000 to 6,000 fps: a rate significantly higher than that of most commercial high-speed cameras.
But even with ordinary digital cameras shooting at 60 fps, they were able to reconstruct audio.
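To put those frame rates in perspective, here is a minimal sketch of the sampling-rate arithmetic. The Nyquist criterion says a signal sampled at rate fs can only be faithfully reconstructed up to fs / 2; the function name and the specific numbers below are illustrative, not taken from the researchers' paper.

```python
def max_recoverable_hz(fps: float) -> float:
    """Highest audio frequency recoverable at a given frame rate,
    per the Nyquist criterion (half the sampling rate)."""
    return fps / 2.0

# An ordinary 60 fps camera, sampled frame by frame, tops out far
# below the speech band...
print(max_recoverable_hz(60))    # 30.0 Hz
# ...while 2,000-6,000 fps high-speed capture covers much of the
# frequency range of a human voice.
print(max_recoverable_hz(2000))  # 1000.0 Hz
print(max_recoverable_hz(6000))  # 3000.0 Hz
```

This is why recovering intelligible speech from plain 60 fps footage, as described next, is surprising: frame-level sampling alone should not get anywhere near voice frequencies.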
Counterintuitively, the researchers found that the “rolling shutters” of consumer-grade video cameras gave much better results than they expected.
A rolling shutter sacrifices image quality to save on cost by sampling each line of the image in sequence rather than grabbing the whole frame in an instant.
As a result, rolling shutter cameras end up with what amounts to a series of very low resolution “image slices” at a much higher sampling rate.
That’s a big negative for visual image quality (if the object moves during the frame capture, you end up with distortion and blurring), but it turns out to be a big positive when the whole aim of the exercise is not to recognize the object but to measure its motion.
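The rolling-shutter effect above can be sketched with a little arithmetic: if each row of the image is read out in sequence, every row is a separate time sample, so the effective sampling rate is roughly the frame rate times the number of rows (ignoring idle time between frames). The function and numbers here are an illustrative assumption, not figures from the paper.

```python
def effective_sample_rate(fps: float, rows_per_frame: int) -> float:
    """Approximate per-row sampling rate of a rolling-shutter camera:
    each scanline is captured at a slightly different instant, so a
    frame contributes rows_per_frame samples rather than one."""
    return fps * rows_per_frame

# A 60 fps camera with 1080 scanlines samples motion far faster
# than its nominal 60 Hz frame rate suggests:
print(effective_sample_rate(60, 1080))  # 64800.0 row-samples per second
```

Even allowing for dead time between frames, that puts an ordinary camera's row-level sampling well into audio frequencies, which is what makes the 60 fps reconstructions possible at all.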
You can hear sound samples on the project page for what the researchers have dubbed the Visual Microphone.
I agree with one commenter on Gizmodo’s writeup, who noted that once again, we are betrayed by the snacks we hold most dear:
vj9c9: Curse you, potato chips! First you make me too big to run, and then you betray me to my enemies!
The researchers will present their paper at the SIGGRAPH computer graphics conference next week in Vancouver.
Think anybody from the FBI or the NSA might show up?