If you thought that removing identifying information from a database of sensitive personal records was enough to retain privacy, it’s time to think again. A study published this week asserts that it’s even easier to re-identify people in anonymized data than we first thought.
The idea of de-identifying (anonymizing) data has been around for a while. It involves removing identifying information like names and exact addresses from databases so that you can still analyse the data without identifying specific people. Recital 28 of the EU’s GDPR recommends it as a way of reducing the risk to sensitive records.
The study, released in Nature Communications, calls all that into question. Its authors at the Université catholique de Louvain (Belgium) and at Imperial College London (UK) say that it’s easy to re-identify a high percentage of people in de-identified data sets.
Furthermore, the researchers challenge a key assumption among organizations that de-identify data, which is that releasing only a subset of a data set makes it much harder to re-identify people with confidence.
The conventional wisdom goes like this: Let’s say you’re an organization in charge of people’s sensitive data. You want to make that data public so that crowdsourced researchers can crunch the numbers and find patterns in it, but you want to stay compliant with privacy rules.
So, you release only a small sample of a large data set – say, 1,000 of 100,000 people. The data contains a postal code, birth date, and the results of a cancer treatment.
An employer might search that data set and find just one record matching one of its own employees. “Aha!” they would say. “Now we know that our employee has been getting cancer treatment. So much for your privacy!”
You’d counter that there might be other people with the same birth date and postcode in the rest of the data that you hadn’t released. This gives you plausible deniability – the employer can’t be sure that the person in the data set is its employee, John Smith.
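The deniability argument can be put into numbers with a small simulation. The sketch below is only illustrative: the population is synthetic, the 20 postal codes and ten years of birth dates are made-up parameters, and `simulate_release` is a hypothetical helper rather than anything from the study. It counts how many records look unique in a 1,000-person sample, and how many of those are still unique once the unreleased 99,000 records are taken into account.

```python
import random
from collections import Counter


def simulate_release(population_size=100_000, sample_size=1_000,
                     postal_codes=20, birth_dates=3_650, seed=0):
    """Return (records unique in the sample, of those: unique in the population).

    All parameters are illustrative assumptions, not figures from the study.
    """
    rng = random.Random(seed)

    # Synthetic population: each person is a (postal code, birth date) pair.
    population = [
        (rng.randrange(postal_codes), rng.randrange(birth_dates))
        for _ in range(population_size)
    ]

    # The organization releases only a small random sample.
    sample = rng.sample(population, sample_size)

    population_counts = Counter(population)
    sample_counts = Counter(sample)

    # Records that appear exactly once in the released sample.
    unique_in_sample = [r for r in sample if sample_counts[r] == 1]

    # Of those, records that have no twin anywhere in the full population.
    unique_in_population = [
        r for r in unique_in_sample if population_counts[r] == 1
    ]
    return len(unique_in_sample), len(unique_in_population)


sample_unique, population_unique = simulate_release()
print(f"unique in the sample:          {sample_unique}")
print(f"also unique in the population: {population_unique}")
```

With only two attributes and these made-up parameters, most records that look unique in the sample should turn out to share their postcode and birth date with someone in the unreleased data. That gap is the plausible deniability the organization is counting on.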
According to the researchers’ paper, that’s no longer true:
Our paper shows how the likelihood of a specific individual to have been correctly re-identified can be estimated with high accuracy even when the anonymized dataset is heavily incomplete.
The reason is that the more pieces of individual information a data set contains about you (say, the number of people you live with, the colour of the car you drive or whether you have a pet), the less likely it is that there’s another person with the same characteristics. Gather enough pieces of information, and it turns out that you’re a uniquely special flower after all. They said:
Using our model, we find that 99.98% of Americans would be correctly re-identified in any dataset using 15 demographic attributes.
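To get a feel for why each extra attribute matters, here is a toy simulation. It is not the researchers’ model: it assumes a synthetic population of 20,000 people whose attributes are independent and uniformly spread over four possible values each, which real demographic data is not, and `unique_fractions` is a hypothetical helper. It measures the share of people who are the only one with their combination of the first k attributes, for k from 1 to 15.

```python
import random
from collections import Counter


def unique_fractions(population_size=20_000, max_attributes=15,
                     values_per_attribute=4, seed=0):
    """Share of people who are unique on their first k attributes, k = 1..max.

    Attributes are independent and uniform here, a simplifying assumption.
    """
    rng = random.Random(seed)

    # Each person is a tuple of max_attributes small integer attributes.
    people = [
        tuple(rng.randrange(values_per_attribute) for _ in range(max_attributes))
        for _ in range(population_size)
    ]

    fractions = []
    for k in range(1, max_attributes + 1):
        # Count how many people share each combination of the first k attributes.
        counts = Counter(person[:k] for person in people)
        unique = sum(1 for person in people if counts[person[:k]] == 1)
        fractions.append(unique / population_size)
    return fractions


for k, fraction in enumerate(unique_fractions(), start=1):
    print(f"{k:2d} attributes: {fraction:.2%} of people are unique")
```

Because the number of possible combinations multiplies with every attribute added, the share of unique people in this toy population should climb from essentially nobody with one attribute to nearly everyone with 15.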
The researchers wrote a machine learning program and trained it on incomplete data sets to test their theory out. They used 210 demographic and survey data sets, and were able to identify people with a high degree of confidence, even in subsets representing just 1% of the data or less.
The result led them to question the whole de-identification concept:
Our results suggest that even heavily sampled anonymized datasets are unlikely to satisfy the modern standards for anonymization set forth by GDPR and seriously challenge the technical and legal adequacy of the de-identification release-and-forget model.
We’ve known for a while that people with good statistical chops can re-identify anonymous data sets. For example, back in 2015, MIT researchers showed how to make surprisingly accurate inferences about shoppers even from extremely vague purchasing data.
What this latest research shows is that it’s even easier than we thought to reconstruct people’s identities, even when only a tiny subset of the data is released. When it comes to de-identification, it might be time to go back to the drawing board.
The researchers have created an online tool that lets you check to see how identifiable you might be given your own characteristics.