A study by Timothy Libert, a doctoral student at the University of Pennsylvania, has found that nine out of ten visits to health-related web pages result in data being leaked to third parties like Google, Facebook and Experian:
There is a significant risk to your privacy whenever you visit a health-related web page. An analysis of over 80,000 such web pages shows that nine out of ten visits result in personal health information being leaked to third parties, including online advertisers and data brokers.
What Libert discovered is a widespread repetition of the flaw that the US government’s flagship Healthcare.gov website was dragged over the coals for in January.
The sites in question use code from third parties to provide things like advertising, web analytics and social media sharing widgets on their pages. Because of the way those kinds of widgets work, their third party owners can see what pages you’re visiting.
The companies supplying the code aren’t necessarily seeking information about what you’re looking at but they’re getting it whether they want it or not.
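To get a feel for how much third party code a page pulls in, here is a minimal sketch in Python. It is not the method used in the study: the page URL, the sample HTML and the hostnames are all made up, the class and function names are mine, and "third party" is simplified to mean "any host other than the page's own".

```python
from html.parser import HTMLParser
from urllib.parse import urljoin, urlsplit


class ThirdPartyFinder(HTMLParser):
    """Collect hosts that a page loads scripts, images or frames from."""

    # Tags whose src attribute makes the browser fetch something automatically.
    FETCHING_TAGS = {"script", "img", "iframe"}

    def __init__(self, page_url):
        super().__init__()
        self.page_url = page_url
        self.page_host = urlsplit(page_url).hostname
        self.third_party_hosts = set()

    def handle_starttag(self, tag, attrs):
        if tag not in self.FETCHING_TAGS:
            return
        src = dict(attrs).get("src")
        if not src:
            return
        # Resolve relative and protocol-relative URLs against the page URL.
        host = urlsplit(urljoin(self.page_url, src)).hostname
        # Simplification: any host that differs from the page's own counts
        # as a third party.
        if host and host != self.page_host:
            self.third_party_hosts.add(host)


def third_party_hosts(page_url, html):
    """Return the sorted third party hosts referenced by the HTML."""
    finder = ThirdPartyFinder(page_url)
    finder.feed(html)
    return sorted(finder.third_party_hosts)


# Hypothetical page: health.example.org stands in for a health site, and the
# two *.example hosts stand in for a sharing widget and an analytics provider.
SAMPLE_HTML = """
<html><body>
  <script src="/js/site.js"></script>
  <script src="//widgets.share-provider.example/share.js"></script>
  <img src="https://stats.analytics-provider.example/pixel.gif">
</body></html>
"""

print(third_party_hosts("https://health.example.org/conditions/some-condition", SAMPLE_HTML))
```

The site's own script is ignored and the two outside hosts are reported. A real audit would also need to follow requests made by the scripts themselves, which static HTML parsing like this cannot see.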
So if you browse the pages about genital herpes on the highly respected CDC (Centers for Disease Control and Prevention) site you’ll also be telling marketing mega-companies Twitter, Facebook and AddThis that you’ve an interest in genital herpes too.
It happens like this: when your browser fetches a web page, it also fetches any third party code embedded in it directly from the third parties’ websites. The requests sent by your browser contain an HTTP header (the annoyingly misspelled ‘referer’ header) that includes the URL of the page you’re looking at.
Since URLs tend to contain useful, human-readable information about what you’re reading, those requests can be quite informative.
For example, looking at a CDC page about genital herpes triggers a request to addthis.com like this:
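The sketch below is an illustration rather than the captured request: the page URL and the widget address are made-up placeholders, and the helper names are mine. It builds roughly the kind of request a browser would send for an embedded widget, then shows how much can be read straight out of the referer header.

```python
from urllib.parse import urlsplit

# Both URLs are hypothetical placeholders, not the real ones from the study.
PAGE_URL = "https://health.example.org/std/genital-herpes/default.htm"
WIDGET_URL = "https://widgets.share-provider.example/js/share.js"


def third_party_request(page_url, widget_url):
    """Return the request text a browser would roughly send for the widget."""
    widget = urlsplit(widget_url)
    lines = [
        f"GET {widget.path or '/'} HTTP/1.1",
        f"Host: {widget.netloc}",
        f"Referer: {page_url}",  # the page being read travels with the request
    ]
    return "\r\n".join(lines) + "\r\n\r\n"


def readable_words(referer):
    """Split the referring URL's path into the words a human would recognise."""
    path = urlsplit(referer).path
    words = []
    for segment in path.split("/"):
        stem = segment.rsplit(".", 1)[0]  # drop a file extension such as .htm
        words.extend(w for w in stem.split("-") if w)
    return words


print(third_party_request(PAGE_URL, WIDGET_URL))
print(readable_words(PAGE_URL))
```

The widget provider never asked what you were reading, but the words in the path arrive with the request anyway.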
The fact that embedded code gets URL data like this isn’t new – it’s part of how the web is designed and, like it or not, some third parties actually rely on it – Twitter uses it to power its Tailored Suggestions feature for example.
What’s new, or perhaps what’s changed, is that we’re becoming more sensitive to the amount of data we all leak about ourselves and, of course, health data is among the most sensitive.
While a single data point such as one visit to one web page on the CDC site doesn’t amount to much, the fact is we’re parting with a lot of data and sharing it with the same handful of marketing companies.
We do an awful lot of healthcare research online and we tend to concentrate those visits around popular sites.
A 2012 survey by the Pew Research Center found that 72% of internet users say they looked online for health information within the past year. That explains why one of the sites mentioned in the study, WebMD.com, is the 106th most popular website in the USA and ranked 325th in the world.
The study describes the data we share as follows:
...91 percent of health-related web pages initiate HTTP requests to third-parties. Seventy percent of these requests include information about specific symptoms, treatment, or diseases (AIDS, Cancer, etc.). The vast majority of these requests go to a handful of online advertisers: Google collects user information from 78 percent of pages, comScore 38 percent, and Facebook 31 percent. Two data brokers, Experian and Acxiom, were also found on thousands of pages.
If we assume that it’s possible to infer an individual’s recent medical history from the healthcare pages they’ve browsed over a number of years then, taken together, those innocuous individual page views add up to something very sensitive.
As the study’s author puts it:
Personal health information ... has suddenly become the property of private corporations who may sell it to the highest bidder or accidentally misuse it to discriminate against the ill.
There is no indication or suggestion that the companies Libert named are using the health data we’re sharing but they are at least being made unwitting custodians of it and that carries some serious responsibilities.
Although there is nothing in the leaked data that identifies our names or identities, it’s quite possible that the companies we’re leaking our health data to have them already.
Even if they don’t though, we’re not in the clear.
Even if Google, Facebook, AddThis, Experian and all the others are at pains to anonymise our data, I wouldn’t bet against individuals being identified in stolen or leaked data.
It’s surprisingly easy to identify named individuals within data sets that have been deliberately anonymised.
For example, somebody with access to my browsing history could see that I regularly visit Naked Security for long periods of time and that those long periods tend to happen immediately prior to the appearance of articles written by Mark Stockley.
For a longer and more detailed look at this phenomenon, take a look at Paul Ducklin’s excellent article ‘Just how anonymous are “anonymous” records?’
It’s possible to stop this kind of data leak by setting up your browser so it doesn’t send referer headers but I wouldn’t rely on that because there are other ways to leak data to third parties.
Instead I suggest you use browser plugins like NoScript, Ghostery or the EFF’s own Privacy Badger to control which third party sites you have any interaction with at all.
What the study hints at is bigger than that though – what it highlights is that we live in the era of Big Data and we’re only just beginning to understand some of the very big implications of small problems that have been under our noses for years.
11 comments on “How nine out of ten healthcare pages leak private data”
Will using TOR prevent this?
No. Or, perhaps, “No, not really.” Or, “It depends.” Tor does make it possible for you to browse to sites without leaving a direct trail back to yourself (e.g. to your IP number at home). The way the browser is set up in the Tor Browser Bundle means that you are less likely to retain information between browsing sessions that would make it obvious that you were the same person coming back for more.
But imagine, for example, that you use Tor to do your online banking…to make that work, you still have to tell the bank who you are by logging in.
Similarly for any Tor-based browsing: sometimes, the very things you want to find out about are inextricably tied to you, and if you give away enough apparently unimportant or anonymous facts about yourself to the same website or group of websites, you might narrow down the set of people that could possibly include you.
Here’s a very crude example, perhaps based on having a look around to see if you are eligible to donate blood:
1. You are an earthling. (That makes you a tidily anonymous 1 in 7,000,000,000)
2. You are male. (50% of 7 billion.)
3. You were born in the USA. (5% of 50%.)
4. You now live in the UK. (1% of 5% of 50%.)
5. You had a tattoo in the past 12 months. (You guess the percentage – let’s say 2% of 1% of 5% of 50%.)
6. You travelled to Africa in the past 12 months. (And so on – how about 2% of 2% of 1% of 5% of 50%?)
I’ve made up those percentages, but if they are correct, I’d say that narrows you down to about 1 in 700 people already.
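The arithmetic in that list can be checked with a short Python sketch. The percentages are the made-up ones from the list above, and the function name is just for illustration. Exact fractions are used so that rounding doesn't blur the final figure.

```python
from fractions import Fraction

WORLD_POPULATION = 7_000_000_000

# (fact, share of the remaining pool), using the made-up percentages above.
FACTS = [
    ("male", Fraction(50, 100)),
    ("born in the USA", Fraction(5, 100)),
    ("lives in the UK", Fraction(1, 100)),
    ("tattoo in the past 12 months", Fraction(2, 100)),
    ("travelled to Africa in the past 12 months", Fraction(2, 100)),
]


def shrink_pool(population, facts):
    """Yield (fact, people remaining) after applying each fact in turn."""
    remaining = Fraction(population)
    for fact, share in facts:
        remaining *= share
        yield fact, remaining


for fact, remaining in shrink_pool(WORLD_POPULATION, FACTS):
    print(f"{fact:45} {int(remaining):>13,}")
```

Five unremarkable facts take the pool from 7 billion down to 700, and each extra fact shrinks it again.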
Now suppose the above anonymous-sounding info happens to be shared along the way with another website where you subsequently narrow yourself down further, e.g. by revealing that you live in Glasgow and work in IT.
Mark linked to an article that looks into this issue in a more real-world way (disclaimer: I wrote it):
And of course the code can be changed by the provider, on a whim, like AddThis who added browser canvas fingerprinting to their sharing icons one day…
What about Control/Shift “N” in Chrome?
That doesn’t stop the referer but it does stop the cookies used to stitch your browsing history together. There are other methods for tracking that are more resilient than cookies though, such as fingerprinting.
I’ve been trying to train NoScript at work. Easier than it used to be, but still not something I can recommend to my highly educated co-workers. Let alone my parents or grandparents. My tech savvy parent already complains their computer has so much security there’s no room left for them. I’m not sure how we put our collective feet down about this, but we should. The web was supposed to free us, but I feel more like my digital identity has been sold into slavery against my will.
Hmmm. In the US, HIPAA requirements cover all medical information. It seems to me that an enterprising lawyer could sue both companies, and would likely win.
The medical site is collecting personally identifiable information. That’s clearly covered under the HIPAA restrictions (although sites like WebMD claim they’re not, I suspect they would lose rather handily in court).
If I’m correct above, then that means that all of their vendors (including 3rd-party web sites) are also limited by HIPAA. If any medical information (including simple medical terms like “herpes”) can be tied to an individual, then the company collecting that information is responsible for it. And HIPAA strictly limits what they can do with it, including strictures on how it can be stored.
I’m betting (and hoping) that Google’s lawyers (and other companies’ lawyers) have told them they need to purge that information, or store it with HIPAA-level security (something they’re not likely to want to do, since inspections would be involved).
It would be an interesting case because everything is anonymous and working as it normally does and is supposed to (in terms of HTTP), and there’s not necessarily a link between what we browse and what’s wrong with us.
Given a large enough dataset I suspect somebody could a) demonstrate that the link between what we look at and what’s wrong with us is quite strong and b) identify us in anonymous data given a public ‘seed’ – e.g. I know that Mark Stockley is an author for Naked Security and researched this topic at the end of February 2014, so can I find him in this immense pool of browsing data?
I’m thinking all that’s needed to sic the federal hounds on this one is a proof-of-concept by a white-hat.
And, the creative little gremlin in me also wonders just how much work I would have to do to get the site’s server to fork over the client’s IP address. That would expose a hole in the specs, but if it worked, think of the pile of dollars that data would bring.
Maybe I’m in the wrong business. 🙂
As I was entering a doctor’s appointment into my Google Calendar, I was typing in the office name and a list of the doctors there popped up. Made me realize that Google now knows everyone’s doctor. Definitely not anon any more.