If you weren’t already worried about the privacy dangers of online ad tracking, now would be a good time to start. Researchers have found a way to de-anonymise web surfing records, putting a recent US privacy ruling in jeopardy.
Online ad networks track your browsing history across multiple sites so that they can serve you more effective advertising. Search for something on an ecommerce site that participates in one of these networks, and you’ll be shown related ads on another site that also participates.
This often creeps people out, but the counterargument has always been that the data is anonymous. Instead of linking your real name to your web surfing records, these trackers use a unique customer ID.
In this way, they may know that the same person who searched for Christmas getaways on one website is now reading an article about marmosets on another website, but because they don’t know who that person is, they can responsibly litter the marmoset article with holiday advertisements, without anyone on the internet knowing that you’re leaving your house unattended on December 25.
That’s all well and good, but what if you could deduce a person’s identity by matching their anonymous web surfing with their social media timeline? What if you could replace the customer ID with their Twitter handle?
Academics from Stanford and Princeton have done just that. Their research relies on the idea that people are more likely to click links that show up in their social media feeds, and in particular links posted by the people they follow on Twitter. They reasoned that because the set of links in a Twitter feed is often unique, you can match it against links in an anonymous surfing history.
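To make the matching idea concrete, here is a minimal sketch. It is not the researchers' method, which this article doesn't spell out: it swaps in a plain set-overlap (Jaccard) score, and the names `jaccard`, `best_match` and the sample handles are hypothetical.

```python
from typing import Dict, Optional, Set


def jaccard(a: Set[str], b: Set[str]) -> float:
    """Overlap between two link sets, from 0.0 (disjoint) to 1.0 (identical)."""
    if not a or not b:
        return 0.0
    return len(a & b) / len(a | b)


def best_match(history: Set[str], feeds: Dict[str, Set[str]]) -> Optional[str]:
    """Return the handle whose feed overlaps most with the anonymous history.

    Assumption: a simple overlap score stands in for whatever scoring
    the researchers actually used.
    """
    best_handle: Optional[str] = None
    best_score = 0.0
    for handle, links in feeds.items():
        score = jaccard(history, links)
        if score > best_score:
            best_handle, best_score = handle, score
    return best_handle


# Hypothetical feeds: the links each account would have seen.
feeds = {
    "@alice": {"t.co/a", "t.co/b", "t.co/c"},
    "@bob": {"t.co/c", "t.co/d"},
}
anonymous_history = {"t.co/a", "t.co/b"}
match = best_match(anonymous_history, feeds)
```

In this toy example the anonymous history shares two links with the first feed and none with the second, so the first handle wins. A real system would need a score that copes with hundreds of millions of feeds and with links that many feeds share.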
The group collected anonymous web browsing histories from almost 400 volunteers, and mined them for links visited in the last 30 days that came from Twitter (marked with the domain name t.co, which Twitter uses to shorten URLs). It then attempted to de-anonymise histories containing at least five such links by comparing them against 300m Twitter feeds.
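The filtering step described here might look something like the sketch below. The history format (a list of URL and visit-time pairs) and the function name `twitter_links` are assumptions for illustration; only the t.co domain, the 30-day window and the five-link minimum come from the research as reported.

```python
from datetime import datetime, timedelta
from typing import List, Optional, Tuple
from urllib.parse import urlparse

MIN_LINKS = 5                 # histories with fewer t.co links are skipped
WINDOW = timedelta(days=30)   # only recent visits are considered


def twitter_links(
    history: List[Tuple[str, datetime]], now: datetime
) -> Optional[List[str]]:
    """Return recent t.co URLs from a history, or None if there are too few.

    Assumption: each history entry is a (url, visited_at) pair.
    """
    cutoff = now - WINDOW
    links = [
        url
        for url, visited_at in history
        if visited_at >= cutoff and urlparse(url).hostname == "t.co"
    ]
    return links if len(links) >= MIN_LINKS else None


# Hypothetical history: six recent t.co visits, one stale one, one other site.
now = datetime(2017, 2, 1)
history = [(f"https://t.co/{i}", now - timedelta(days=i)) for i in range(6)]
history.append(("https://t.co/old", now - timedelta(days=40)))
history.append(("https://example.com/page", now))
usable = twitter_links(history, now)
```

Here the stale visit and the non-Twitter page are dropped, leaving six links, which is enough to attempt a match.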
The researchers found that they could identify more than 70% of volunteers on average. The more links in someone’s history that originated from Twitter, the more accurate the identification. The team correctly identified 86% of the participants whose histories contained between 50 and 75 URLs. So if you follow a lot of links from Twitter, you’re more likely to be identified.
This isn’t just a theoretical exercise. The team built a system to de-anonymise web browsing histories in under a minute using the concept, proving that it’s workable in practice.
The team at Princeton has a record of exposing flaws in anonymous datasets. Arvind Narayanan, one of the researchers, runs a blog called 33 Bits of Entropy, named for the fact that there are about 6.6bn people in the world, meaning that you only need 33 bits of information to single out any one of them. He has moved on a bit from his de-anonymising research, but in the past he has embarrassed Netflix by using its research dataset to work out who was watching what movies.
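The arithmetic behind the 33-bit figure is easy to check. The sketch below uses the 6.6bn population figure quoted above; the function name `bits_needed` is just illustrative.

```python
import math


def bits_needed(population: float) -> int:
    """Whole bits required to give every member of a population a unique label."""
    return math.ceil(math.log2(population))


population = 6.6e9                # figure quoted in the article
exact = math.log2(population)     # a little under 33
bits = bits_needed(population)
```

Since 2 to the power of 32 is about 4.3bn and 2 to the power of 33 is about 8.6bn, 32 bits fall short of 6.6bn people while 33 bits are enough.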
Here’s another tidbit from the research: it points out that the same principles apply to any set of items selected anonymously by someone with an identifiable historical record of selections. For example, anonymous papers might cite other work and could be compared with a broader spectrum of academic papers to see if similarities show up.
We wondered whether it’s possible to run it against the eight references in the original bitcoin paper, created by the mysterious Satoshi Nakamoto, to help track him down, assuming that he had published academic work before. Not necessarily, says Jessica Su, one of the researchers:
“I wouldn’t be able to tell you without having access to a dataset that included that paper. However, I am currently trying to deanonymize physics papers, and a modification of our method gives us 28% accuracy. If we limit to papers with exactly eight intra-database citations, we get 36% accuracy. These are very preliminary results.”
Who is likely to use social media history in combination with ad tracking? The trackers themselves could. The team looked at four such trackers (Google, Facebook, ComScore and AppNexus) and found that they all had enough information to de-anonymise their users.
Who else might use this information? The NSA, for one. It already tracks Google ads to find Tor users. The research points out that well-resourced adversaries could eavesdrop on network traffic to work out which domains a particular device is visiting (although thankfully HTTPS makes that more difficult).
Other potential users could include potential employers, anyone granting credit, or insurance companies who might love to know about your recent search for cancer symptoms or risky pursuits. Anyone who could benefit from knowing what you’re searching for would find this attack useful.
The good news is that commercial parties like these could only match your anonymous browsing history against your public social media profile if they had that data. The bad news is that it has long been for sale.
It’s particularly galling for privacy advocates, because the selling of customer data was about to get a lot harder. The FCC in the US issued an order restricting ISPs from collecting customers’ sensitive data unless those customers specifically opted in. Under the order, service providers must get customer permission to sell sensitive personal data, defined as “reasonably linkable” to an individual.
Anonymised data may be seen as not reasonably linkable, meaning that it can be collected and used. But clearly, with a bit of automated detective work, it’s pretty easy to make that link.
How can you stop this from happening? Tracker-blockers such as Ghostery, uBlock Origin or Privacy Badger can help, the researchers say, while not revealing your real-world identity on social media profiles is a useful albeit cumbersome form of protection. Given the recent actions of US border guards, the latter might be a good idea anyway.