The Big Data picture – just how anonymous are “anonymous” records?


On Naked Security we regularly write about, or at least make mention of, something called Big Data.

It’s what I think of as an “anything and everything” term.

There are things that obviously aren’t big data, like the modest collection of pictures you took of your cat until you snapped one that made a decent computer wallpaper image.

And there are things that obviously are big data, like the giant database of Wi-Fi access points from Google’s StreetView cars that it uses to aid and abet its geolocation services.

Of course, even your cat pictures – the ones that were captured with a single short press of the BURST option on your new iPhone – probably take up several times more storage than your first computer had in total.

But they fail to make the cut as “big data” not only because they’re small by modern standards, but also because you can’t dissect, compare and contrast them to look for patterns in the whole cat world, and from that to draw inferences about one particular cat in the database.

Now, if you had pictures of 1,000,000 different cats, organised by location, that would be big data.

Big Data versus privacy

Clearly, the words “big data” ought to raise privacy concerns.

Automatic Number Plate Recognition (ANPR) cameras are a good example, because your plate number stays constant while your location changes.

One ANPR data point might bust you for speeding, or running a red light, which is fair enough.

But put a city’s worth – or a State’s worth, or, heck, why think small, a whole country’s worth – of ANPR data together, and you have intrusive surveillance, especially if the data includes everyone, even drivers who haven’t come near breaking any laws.

You can nevertheless argue that, even though raw ANPR dumps on their own may be “big data,” they are essentially anonymous.

Therefore they are safe, and possibly valuable, to make available broadly:

State/Plate  Date        Time      Location
-----------  ----------  --------  -------------------------
NSW  NSG123  2014-12-01  11:54:11  Harbour Bridge N Approach 
QLD  556ARX  2014-12-01  11:54:14  Harbour Bridge N Approach
QLD  189BBQ  2014-12-01  11:54:17  Cahill Expressway
NSW  BA45MO  2014-12-01  11:54:22  Lang Park
NSW  AM99WA  2014-12-01  11:54:23  Harbour Bridge N Approach
VIC  RST776  2014-12-01  11:54:32  Carrington St  
NSW  XLR8    2014-12-01  11:54:33  Lang Park
NSW  BA45MO  2014-12-01  11:54:34  Carrington St 
NSW  44BSD   2014-12-01  11:54:37  Lang Park

After all, unless you also have a database to turn the plate numbers into vehicle owners, all you know is that a car, some car but the same car, plated BA.45.MO, passed Lang Park and made it into Carrington Street within 12 seconds. (You could do it, but you’d need the hammer down.)
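That 12-second figure falls straight out of the sample rows. Here is a minimal Python sketch of the idea, assuming the sightings are held as (plate, timestamp, location) tuples. That layout is invented here for illustration and is not any real ANPR export format.

```python
from collections import defaultdict
from datetime import datetime

# A few sightings copied from the sample table: (plate, timestamp, location).
SIGHTINGS = [
    ("NSG123", "2014-12-01 11:54:11", "Harbour Bridge N Approach"),
    ("BA45MO", "2014-12-01 11:54:22", "Lang Park"),
    ("XLR8",   "2014-12-01 11:54:33", "Lang Park"),
    ("BA45MO", "2014-12-01 11:54:34", "Carrington St"),
]

def trips_by_plate(sightings):
    """Group sightings by plate, with each plate's list sorted by time."""
    trips = defaultdict(list)
    for plate, stamp, place in sightings:
        when = datetime.strptime(stamp, "%Y-%m-%d %H:%M:%S")
        trips[plate].append((when, place))
    for visits in trips.values():
        visits.sort()
    return trips

def elapsed_seconds(visits):
    """Seconds between the first and last sighting of one plate."""
    return (visits[-1][0] - visits[0][0]).total_seconds()

trips = trips_by_plate(SIGHTINGS)
# Only plates seen more than once tell you anything about movement.
repeat_plates = {p: v for p, v in trips.items() if len(v) > 1}
for plate, visits in repeat_plates.items():
    route = " -> ".join(place for _, place in visits)
    print(plate, route, elapsed_seconds(visits), "seconds")
```

Even with no owner database, the constant plate number is what lets you join two otherwise unrelated rows into a journey.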

So for things like planning road safety measures, predicting traffic volumes, helping fuel companies decide where to build petrol stations, and so on, perhaps ANPR “big data” is OK, and useful, to release?

In fact, you could go one step further so that no-one, not even the vehicle licensing agencies, could actually work out which cars were there:

Hash    Date        Time      Location
------  ----------  --------  -------------------------
OEERIB  2014-12-01  11:54:11  Harbour Bridge N Approach 
7K5NR5  2014-12-01  11:54:14  Harbour Bridge N Approach
IFQS8K  2014-12-01  11:54:17  Cahill Expressway
ZJXJUN  2014-12-01  11:54:22  Lang Park
CPU069  2014-12-01  11:54:23  Harbour Bridge N Approach
6VJNJU  2014-12-01  11:54:32  Carrington St  
GG38UB  2014-12-01  11:54:33  Lang Park
ZJXJUN  2014-12-01  11:54:34  Carrington St 
6MBHSI  2014-12-01  11:54:37  Lang Park

→ If you ever need to do this sort of anonymisation, a salting-and-hashing system, like you might use for passwords, can help. But don’t make a hash of it and leave the data at risk of an attack that works backwards to the original plate data, like New York City did with its cab drivers.
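As a rough illustration of what that might look like, here is a Python sketch. It is an assumption-laden example, not a vetted design: it swaps the per-record salt you would use for passwords for a single secret key and a keyed hash (HMAC-SHA256), because every sighting of the same plate has to map to the same tag. The names SECRET_KEY, normalise and pseudonymise are made up for this example.

```python
import base64
import hashlib
import hmac
import secrets

# Hypothetical secret key: generated once, stored well away from the
# published data, and destroyed if nobody should ever recompute the tags.
SECRET_KEY = secrets.token_bytes(32)

def normalise(state, plate):
    """Canonical form, so 'nsw ba45mo' and 'NSW BA45MO' get the same tag."""
    return f"{state.strip().upper()}:{plate.strip().upper()}".encode("utf-8")

def pseudonymise(state, plate, key=SECRET_KEY, length=6):
    """Return a short tag derived from the plate using HMAC-SHA256.

    A plain, unkeyed hash would not be enough: there are few enough
    possible plates that an attacker could hash them all and match up
    the results.
    """
    digest = hmac.new(key, normalise(state, plate), hashlib.sha256).digest()
    # Base32 yields upper-case letters and the digits 2-7. Truncating the
    # tag keeps it readable but makes collisions possible in a big data set.
    return base64.b32encode(digest).decode("ascii")[:length]

tag_a = pseudonymise("NSW", "BA45MO")
tag_b = pseudonymise("nsw", " ba45mo ")
tag_c = pseudonymise("QLD", "556ARX")
print(tag_a, tag_b, tag_c)
```

The tags change every time the key changes, so the output differs from run to run. Note that a keyed hash only hides the plate itself. It does nothing about the patterns in the data.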

How random is random?

Of course, even the “randomly-assigned identifier” approach has some problems.

Let’s say I happen to know for sure that you turned onto the Cahill Expressway at 11:54:17 on the given date – perhaps I was tailing you in the car behind, or was able to match your car up with a CCTV camera feed of my own.

I can now assume that your anonymous tag is IFQS8K, and track you throughout the rest of the database.
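To see how little effort that takes, here is a rough Python sketch using the rows from the hashed sample table. The function names tags_seen_at and trail are invented for this example, and a real data dump would need a proper database rather than a list.

```python
# Rows copied from the hashed sample table: (tag, date, time, location).
RECORDS = [
    ("OEERIB", "2014-12-01", "11:54:11", "Harbour Bridge N Approach"),
    ("7K5NR5", "2014-12-01", "11:54:14", "Harbour Bridge N Approach"),
    ("IFQS8K", "2014-12-01", "11:54:17", "Cahill Expressway"),
    ("ZJXJUN", "2014-12-01", "11:54:22", "Lang Park"),
    ("CPU069", "2014-12-01", "11:54:23", "Harbour Bridge N Approach"),
    ("6VJNJU", "2014-12-01", "11:54:32", "Carrington St"),
    ("GG38UB", "2014-12-01", "11:54:33", "Lang Park"),
    ("ZJXJUN", "2014-12-01", "11:54:34", "Carrington St"),
    ("6MBHSI", "2014-12-01", "11:54:37", "Lang Park"),
]

def tags_seen_at(records, date, time, location):
    """Every tag recorded at exactly this date, time and place."""
    return {
        tag for tag, d, t, loc in records
        if (d, t, loc) == (date, time, location)
    }

def trail(records, tag):
    """All sightings of one tag, in the order they appear in the data."""
    return [(d, t, loc) for r_tag, d, t, loc in records if r_tag == tag]

# The one precise fact I happen to know about you.
candidates = tags_seen_at(
    RECORDS, "2014-12-01", "11:54:17", "Cahill Expressway"
)
if len(candidates) == 1:
    (your_tag,) = candidates
    print(your_tag, trail(RECORDS, your_tag))
```

In this tiny sample the trail has only one entry, but the same lookup against months of data would return everywhere that tag was seen.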

That’s worrying, but the privacy risk is mitigated by the fact that I need a precise data point of my own in order to zoom in on you so precisely.

In other words, it seems as though only someone already keenly interested in you, who already has a good picture of your movements, could use the anonymised ANPR data to construct a good picture of your movements.

What about vague data?

And that raises the question, “If I don’t have precise data to get you in my sights, how much vague data would I need instead?”

And the answer, of course, depends entirely on the nature of the data, and your definition of vague.

For example, with an Australia-wide ANPR data dump, how many cars would show up in three different states some time in three consecutive weeks?

I don’t know the answer, but you can see where this is going: it’s all about intersecting sets.

Of the 150,000 cars that cross the Harbour Bridge each day, you might guess that no more than 1% also go on the Melbourne City Link in the same month.

Of that 1%, let’s assume that only 1% went on the Gateway Bridge in Brisbane as well. (I suspect the ratios are smaller than 1%, but let’s keep things simple.)

So even if all I know is that you happened to go on those three roads at some time in the last month, I’ve already pinned you down to one of just 15 cars!
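The arithmetic is easy to check. Here is a small Python sketch that does it two ways: first by multiplying out the guessed percentages, and then with a toy set intersection. The 150,000 and 1% figures are the guesses from the text, and the tags in the toy sets are entirely made up.

```python
from fractions import Fraction

def expected_candidates(population, *fractions):
    """Shrink a starting population by each successive fraction."""
    remaining = Fraction(population)
    for fraction in fractions:
        remaining *= fraction
    return remaining

one_percent = Fraction(1, 100)
left = expected_candidates(150_000, one_percent, one_percent)
print(left)  # 15

# Toy illustration with made-up tags: the same narrowing done on real sets.
harbour_bridge = {"A1", "B2", "C3", "D4", "E5"}
city_link = {"B2", "D4", "Z9"}
gateway_bridge = {"D4", "Q7"}
suspects = harbour_bridge & city_link & gateway_bridge
print(suspects)  # {'D4'}
```

Each extra fact is just one more set to intersect, which is why a handful of vague-sounding facts can shrink the field so quickly.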

Now add in a bit more detail, such as that you used the Gateway Bridge once and only once, and it was in the morning, because you blogged about getting the sun in your eyes…


That’s a made-up, theoretical example of deanonymising so-called “safe” big data.

What about the real world?

But can this sort of thing work in the real world?

It certainly can!

This paper [paywall] by a group of MIT graduate students shows you why.

It’s a tricky read, because it’s weighed down by jargon, and it’s written for a mildly technical audience.

But even a non-technical skim-read proves the point.

The authors started with three months of credit card data, which was an anonymised transaction log a bit like the made-up ANPR data we showed above.

They tried to “mine” it – to match up individuals with their anonymous transaction tags – using ever-less precise information about each transaction.

Note that this imprecision can be applied either to what you know about the individual you are tracking, or to the data as a whole.

The authors were particularly interested in the latter: how big a privacy-sapping problem would remain even if the data points in the original data were all made wildly imprecise to “assure” privacy?

For example, in the ANPR sample, perhaps the data would be rendered harmless to privacy if all it said was:

Hash    Date        City
------  ----------  ------------
7K5NR5  2014-12-01  NORTH SYDNEY
IFQS8K  2014-12-01  SYDNEY
ZJXJUN  2014-12-01  SYDNEY
CPU069  2014-12-01  NORTH SYDNEY
6VJNJU  2014-12-01  SYDNEY
GG38UB  2014-12-01  SYDNEY
ZJXJUN  2014-12-01  SYDNEY
6MBHSI  2014-12-01  SYDNEY

How vague is vague?

Your gut feeling might be that this sort of vagueness would inevitably stop you from working out who’s who in the data set, no matter how much data it contained.

But with the credit card data, our MIT authors found that vague can still be surprisingly precise.

For example, they “defocused” the payment card records in three ways:

  • Grouped each payment into the first half or the second half of the month. (Actually, a 15-day window.)
  • Grouped payments into collections of shops near to each other. (Each group had 350 shops counted as if they were one.)
  • Grouped price into a series of ranges. (As an example, prices from $5 to $16 were considered as one.)

In other words, if you bought a jam doughnut and a coffee from the snack shop at the ferry terminal on the 12th of the month, your transaction would look the same, apart from its anonymous tag, as someone who bought a ticket to Ryde at the train station on the 7th of the month.
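Here is a rough Python sketch of that sort of coarsening, to show how two quite different purchases collapse into the same record. It is not the authors' code. Only the $5 to $16 band and the half-month grouping come from the description above. The other price bands, the shop names, the group name and the prices are assumptions invented for this example.

```python
from datetime import date

# Assumed price bands. Only the $5-$16 band comes from the article; the
# other edges are invented, and each band excludes its upper edge.
PRICE_BANDS = [(0, 5), (5, 16), (16, 50), (50, 200)]

# Hypothetical mapping from individual shops to a group of nearby shops.
SHOP_GROUP = {
    "Ferry Terminal Snacks": "harbourside-group",
    "Train Station Tickets": "harbourside-group",
}

def coarsen_date(d):
    """Keep only the year, the month and which half of the month."""
    half = 1 if d.day <= 15 else 2
    return (d.year, d.month, half)

def coarsen_price(amount):
    """Replace an exact price with the band it falls into."""
    if amount < 0:
        raise ValueError("amount must be non-negative")
    for low, high in PRICE_BANDS:
        if amount < high:
            return f"${low}-${high}"
    return f"${PRICE_BANDS[-1][1]}+"

def coarsen(tag, d, shop, amount):
    """Return the defocused version of one transaction."""
    group = SHOP_GROUP.get(shop, "other")
    return (tag, coarsen_date(d), group, coarsen_price(amount))

doughnut = coarsen("TAG001", date(2014, 12, 12), "Ferry Terminal Snacks", 7.50)
ticket = coarsen("TAG002", date(2014, 12, 7), "Train Station Tickets", 6.20)
print(doughnut)
print(ticket)
```

Apart from the tag in the first field, the two coarsened records come out identical.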

That’s pretty jolly vague, isn’t it?

Indeed, it’s vague enough that when the authors knew the details of any four transactions you’d made during the three-month data period (as would, for example, any shop that you had visited four times), they had a chance lower than 15% of guessing which anonymous tag in the file was yours.

But with 10 known transactions, something you might easily rack up with multiple retailers due to daily habits at a coffee shop, a parking lot, or a newsagent, their chance of pinpointing you rose above 80%.
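One way to picture how figures like these are measured is to pick a few points from one person's record, count how many tags in the file contain all of them, and call it a hit when exactly one does. The Python sketch below does that on a tiny made-up data set. It is an illustration of the idea only, not the authors' method or data, and the names matching_tags, unicity and TRACES are invented for this example.

```python
import random

# Made-up coarsened traces: tag -> set of (half-month, shop group, price band).
TRACES = {
    "TAG001": {("H1", "G1", "P1"), ("H1", "G2", "P1"), ("H2", "G1", "P2")},
    "TAG002": {("H1", "G1", "P1"), ("H1", "G2", "P1"), ("H2", "G3", "P1")},
    "TAG003": {("H1", "G1", "P1"), ("H2", "G1", "P2"), ("H2", "G3", "P1")},
}

def matching_tags(traces, known_points):
    """Tags whose trace contains every one of the known points."""
    known = set(known_points)
    return {tag for tag, points in traces.items() if known <= points}

def unicity(traces, k, trials_per_tag=20, seed=0):
    """Fraction of trials in which k points drawn from one person's own
    trace match that person's tag and nobody else's."""
    rng = random.Random(seed)
    unique = total = 0
    for tag, points in traces.items():
        if len(points) < k:
            continue  # not enough transactions to sample k of them
        for _ in range(trials_per_tag):
            sample = rng.sample(sorted(points), k)
            total += 1
            if matching_tags(traces, sample) == {tag}:
                unique += 1
    return unique / total if total else 0.0

for k in (1, 2, 3):
    print(k, "known points:", unicity(TRACES, k))
```

In this toy file, one known point never singles anyone out and three always do, which is the same shape as the jump from four known transactions to ten.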

Loosely speaking, the anonymous data they had access to, even when coarsened astonishingly, turned out to be not-so-anonymous after all.

Interestingly, and I offer this without comment or interpretation, they claim to be able to guess the identity of women about 1.2x more accurately than men.

Likewise, rich people are allegedly about 1.75x easier to pinpoint than poor people.

Big Data matters

And that, my friends, is why Big Data matters.

I’m afraid that I don’t really know what to advise you, except to say that when someone claims they have “anonymised” something, they simply might not be sure.

You can’t rely on your gut feeling about just how anonymous it ended up; nor can they.

Even the vaguest-looking data might have your name in it, if only you know how to look.

So, stick to the advice we gave on Safer Internet Day: if in doubt, don’t give it out.

Image of ginger kitten courtesy of Shutterstock.

Image of anonymous cats courtesy of Shutterstock.

Image of speed camera available under CC BY-SA 2.0 licence.