You’ve probably heard of metadata, which is a fancy name for “data about data.”
For example, a list of the phone calls you’ve made lately, and how long they lasted, but not what you said during the call.
Or a list of the filenames on your hard disk, along with how big they are and when you last edited them, but not what’s inside any of the files.
As you can imagine, metadata is gold dust to law enforcement during a criminal investigation: it can help with chronology; it can establish connections amongst a group of suspects; it can confirm or break alibis; and much more.
But metadata doesn’t feel like quite as much of a privacy invasion as full-blooded surveillance, so many countries tolerate collecting and using it on much more liberal terms than collecting the data itself, such as the actual contents of your files, or transcripts of your phone calls.
Of course, metadata is just as golden to social engineers – crooks who try to trick you into giving away information you’d usually keep to yourself by seeming to know “just enough” about you, your activities and your lifestyle.
Crunching through metadata to do with network connections is usually called traffic analysis, and you might be surprised how much it gives away, even when the traffic itself is strongly encrypted.
Here’s an intriguing example from a bevy of security researchers in Israel, who eavesdropped on encrypted web traffic (on their own network, of course).
They monitored a range of measurements about the TLS traffic that passed by, even though they couldn’t monitor anything inside the packets.
Using various machine learning techniques, they claim to have classified their packet captures well enough to make surprisingly insightful estimates of which combination of operating system, browser and web service was in play.
For example, they could guess fairly reliably that “this user was watching YouTube using Safari on OS X,” while “that user was using Twitter from Internet Explorer on Windows.”
That might not sound like a terribly important or worrying result, but remember that TLS encryption is supposed to provide confidentiality.
In other words, anything that leaks out about what’s inside a TLS-protected data stream is information that an eavesdropper isn’t supposed to be able to figure out.
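To give a feel for how much an eavesdropper can measure without decrypting anything, here is a minimal Python sketch that summarises a captured flow using only packet timestamps, sizes and directions. The function name `flow_features` and the particular features it computes are assumptions chosen for illustration; they are not the measurements the researchers actually used.

```python
from statistics import mean, pstdev


def flow_features(packets):
    """Summarise a flow from metadata alone.

    packets: list of (timestamp_seconds, size_bytes, direction) tuples,
    where direction is "out" (client to server) or "in" (server to client).
    The feature set here is hypothetical, picked for illustration only.
    """
    if not packets:
        raise ValueError("need at least one packet")

    times = [t for t, _, _ in packets]
    sizes = [s for _, s, _ in packets]
    # Gaps between consecutive packets; no payload access is needed.
    gaps = [later - earlier for earlier, later in zip(times, times[1:])]

    out_bytes = sum(s for _, s, d in packets if d == "out")
    in_bytes = sum(s for _, s, d in packets if d == "in")

    return {
        "packet_count": len(packets),
        "total_bytes": sum(sizes),
        "mean_size": mean(sizes),
        "size_stddev": pstdev(sizes),
        "mean_gap": mean(gaps) if gaps else 0.0,
        "in_out_ratio": in_bytes / out_bytes if out_bytes else float("inf"),
    }


# One small request followed by two large responses: a download-like shape.
capture = [(0.0, 100, "out"), (0.5, 1500, "in"), (1.0, 1500, "in")]
features = flow_features(capture)
```

Summaries like these, computed over many flows, are the kind of input a classifier could be trained on, which is why even strongly encrypted traffic can still give something away.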
What to do?
If you’re a programmer, you can take precautions against this sort of attack by introducing what you might call “deliberate inefficiencies.”
For example, by inserting random and redundant noise into your traffic, such as variable delays and additional random data, you can disguise patterns that might otherwise stand out.
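As a rough sketch of that idea, the Python below pads each message with random bytes up to the next multiple of a fixed bucket size and waits a small random interval before sending. The names `pad`, `unpad` and `send_with_jitter`, the 512-byte bucket and the 50 ms maximum delay are all assumptions made for illustration, and the padding is meant to be applied before the data is encrypted.

```python
import os
import random
import struct
import time

BUCKET = 512  # hypothetical padding granularity, in bytes

_rng = random.SystemRandom()  # OS-backed randomness for the delays


def pad(message: bytes, bucket: int = BUCKET) -> bytes:
    """Prefix the true length, then add random bytes up to a bucket multiple."""
    framed = struct.pack(">I", len(message)) + message
    pad_len = -len(framed) % bucket
    return framed + os.urandom(pad_len)


def unpad(padded: bytes) -> bytes:
    """Recover the original message by reading the 4-byte length prefix."""
    (length,) = struct.unpack(">I", padded[:4])
    return padded[4:4 + length]


def send_with_jitter(send, message: bytes, max_delay: float = 0.05) -> None:
    """Wait a random interval, then hand the padded message to `send`.

    `send` is any callable that accepts bytes, for example a function
    that encrypts the data and writes it to a socket.
    """
    time.sleep(_rng.uniform(0, max_delay))
    send(pad(message))


# Usage: a list stands in for the real transport.
outbox = []
send_with_jitter(outbox.append, b"hello")
```

With this scheme, a 5-byte message and a 500-byte message both go out as 512 bytes, so an observer learns less from packet sizes. The cost is extra bandwidth and latency, which is exactly the “deliberate inefficiency” trade-off.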
In the case of this research, however, there’s no need to panic.
Not yet, anyway.
The classifications that the researchers were able to perform so far were very broad indeed, and sometimes not at all certain.
After all, “Internet Explorer on Windows” is a good guess for most TLS traffic, and you can figure out which traffic is going to Twitter by looking at the packet destinations alone.
Nevertheless, this is a handy reminder that the argument “it’s harmless to collect metadata in bulk because it isn’t the actual data itself” is fundamentally flawed.