What do you do in your spare time if you’re a self-confessed “urbanist, data junkie and civic hacker,” like New York resident Chris Whong?
Well, you might be interested in the ebb and flow of urban transportation, for a start.
And there is surely a lot of ebbing and flowing in New York City (NYC).
For example, have you ever wondered whether there is an optimum time for taxicab shift changes to maximise availability?
→ Scheduling shift changes at 5pm certainly sound like a bad idea. I have been in a city where drivers went off duty at 17:00, and you could find any number of people who would complain about it for at least as long as it took to hail a rush-hour cab – but were they right? When roads are already busy, do you actually improve throughput by shoving more cabs into circulation, or not? You need some Big Data to answer that question objectively!
Whong decided he might be interested in doing some sort of analysis on taxi movements in his home city, and was pleasantly surprised that it took him just two days to extract some 50GB of highly detailed GPS taxi trip data from the NYC Taxi and Limousine Commission (TLC), using a Freedom of Information Law request, or FOIL for short.
And, oh dear, what a hash the Taxi and Limo folks had made of the data, literally and figuratively.
FOILs, of course, have to balance the right of and need for people to access public service data against the privacy of the people who are wrapped up in that data.
I’d say it’s reasonable for the Chris Whongs of the world to be able to find out that there was a taxi standing at 264 West 40th Street at 11pm on the Fourth of July that made a pickup for a journey that cost $9.50 with a $2 tip.
And it’s also reasonable for him to be able to work out that the same taxi was then off the road for a week, before returning to service with a different driver on a morning shift, and so on.
But we’d probably all agree that identifying exactly which taxi, and which driver, would be unfair to the driver, and perhaps even to his fares, since regular passengers often have their regular drivers.
So, “freedom of information” notwithstanding, it seems uncontroversial that identifying details such as taxi and driver ought to be anonymised.
And, as another big data fan named Vijay Pandurangan quickly realised, once he’d taken his own look at Whong’s data, the TLC had tried to do just that.
In every record, the taxi (medallion number) and driver (hack license number) identifiers were replaced with 32-character hexadecimal strings, something like this:
Call me old-fashioned, but my first thought on seeing data presented like that, 16 bytes at a time in hex, was, “Those are MD5 hashes.”
Pandurangan also noticed that one driver called CFCD208495D565EF66E7DFF9F98764DA seemed to do a lot of jobs, presuambly without ever sleeping and with the ability to be in several places at the same time.
When you see curious data spikes like that, much like IP numbers that “geolocate” to the open ocean at Latitude 0°E Longitude 0°S, you have to think, “Uninitialised or unknown data with some default value, probably zero.”
Pushing just the text string 0 through one loop of MD5 hash algorithm is pretty much the key to the castle.
According to Pandurangan, medallion numbers (which appear on the roof light of the taxi) are always in the form 9X99, XX999 or XXX999, where X represents a letter and 9 is a digit.
Even if all letters and numbers are allowed, that gives 10×26×10×10 + 26×26×10×10×10 + 26×26×26×10×10×10, or 26,000 + 676,000 + 17,576,000 possibilities – a total of about 18.3 million. [Not 22 million, as Pandurangan wrote.]
You can compute a lookup table [not a Rainbow table, as Pandurangan wrote] of all 18.3M MD5 hashes in seconds, even on a vanilla laptop.
→ The sort of graphics card often used for calculations in Bitcoin mining can compute billions of hashes per second. A special purpose $20,000 server made from commodity hardware can do hundreds of billions of hashes per second.
Similarly, NYC taxi drivers’ license numbers are all six or seven digits long, and always start with a 5.
So there are just 100,000 + 1,000,000, or 1.1 million [not 2 million, as Pandurangan wrote] of those, exhaustively hashable in about the time it takes to blink.
In other words, with not too much effort, you can plough through the data provided by the TLC and deanonymise it by rewriting the hash of every taxi and every driver with the actual medallion and hack license data from the lookup tables you just calculated.
One obvious solution would be for the TLC to have generated a random 16-byte salt, then done the hashing, and then deleted the salt for evermore, thus effectively randomising the medallion and hack numbers while always using the same hash value for the same taxi or driver.
That would make it computationally infeasible to produce a table mapping the hashes to their original values.
Of course, whether that would be anonymous enough for public consumption is another matter altogether.
For example, if you have other data that identifies a specific cab and a specific driver (for example, because you were the passenger and recorded the time and the location of your journey), you can then geolocate that driver for the whole time he was at work for the entire year of 2013, from Whong’s FOIL data alone.
And that raises the question, “Should your freedom of information trump the driver’s own privacy?”