What do you do in your spare time if you’re a self-confessed “urbanist, data junkie and civic hacker,” like New York resident Chris Whong?
Well, you might be interested in the ebb and flow of urban transportation, for a start.
And there is surely a lot of ebbing and flowing in New York City (NYC).
For example, have you ever wondered whether there is an optimum time for taxicab shift changes to maximise availability?
→ Scheduling shift changes at 5pm certainly sound like a bad idea. I have been in a city where drivers went off duty at 17:00, and you could find any number of people who would complain about it for at least as long as it took to hail a rush-hour cab – but were they right? When roads are already busy, do you actually improve throughput by shoving more cabs into circulation, or not? You need some Big Data to answer that question objectively!
Whong decided he might be interested in doing some sort of analysis on taxi movements in his home city, and was pleasantly surprised that it took him just two days to extract some 50GB of highly detailed GPS taxi trip data from the NYC Taxi and Limousine Commission (TLC), using a Freedom of Information Law request, or FOIL for short.
And, oh dear, what a hash the Taxi and Limo folks had made of the data, literally and figuratively.
FOILs, of course, have to balance the right of and need for people to access public service data against the privacy of the people who are wrapped up in that data.
I’d say it’s reasonable for the Chris Whongs of the world to be able to find out that there was a taxi standing at 264 West 40th Street at 11pm on the Fourth of July that made a pickup for a journey that cost $9.50 with a $2 tip.
And it’s also reasonable for him to be able to work out that the same taxi was then off the road for a week, before returning to service with a different driver on a morning shift, and so on.
But we’d probably all agree that identifying exactly which taxi, and which driver, would be unfair to the driver, and perhaps even to his fares, since regular passengers often have their regular drivers.
So, “freedom of information” notwithstanding, it seems uncontroversial that identifying details such as taxi and driver ought to be anonymised.
And, as another big data fan named Vijay Pandurangan quickly realised, once he’d taken his own look at Whong’s data, the TLC had tried to do just that.
In every record, the taxi (medallion number) and driver (hack license number) identifiers were replaced with 32-character hexadecimal strings, something like this:
Call me old-fashioned, but my first thought on seeing data presented like that, 16 bytes at a time in hex, was, “Those are MD5 hashes.”
Pandurangan also noticed that one driver called CFCD208495D565EF66E7DFF9F98764DA seemed to do a lot of jobs, presuambly without ever sleeping and with the ability to be in several places at the same time.
When you see curious data spikes like that, much like IP numbers that “geolocate” to the open ocean at Latitude 0°E Longitude 0°S, you have to think, “Uninitialised or unknown data with some default value, probably zero.”
Like this:
Ouch.
Pushing just the text string 0 through one loop of MD5 hash algorithm is pretty much the key to the castle.
According to Pandurangan, medallion numbers (which appear on the roof light of the taxi) are always in the form 9X99, XX999 or XXX999, where X represents a letter and 9 is a digit.
Even if all letters and numbers are allowed, that gives 10×26×10×10 + 26×26×10×10×10 + 26×26×26×10×10×10, or 26,000 + 676,000 + 17,576,000 possibilities – a total of about 18.3 million. [Not 22 million, as Pandurangan wrote.]
You can compute a lookup table [not a Rainbow table, as Pandurangan wrote] of all 18.3M MD5 hashes in seconds, even on a vanilla laptop.
→ The sort of graphics card often used for calculations in Bitcoin mining can compute billions of hashes per second. A special purpose $20,000 server made from commodity hardware can do hundreds of billions of hashes per second.
Similarly, NYC taxi drivers’ license numbers are all six or seven digits long, and always start with a 5.
So there are just 100,000 + 1,000,000, or 1.1 million [not 2 million, as Pandurangan wrote] of those, exhaustively hashable in about the time it takes to blink.
In other words, with not too much effort, you can plough through the data provided by the TLC and deanonymise it by rewriting the hash of every taxi and every driver with the actual medallion and hack license data from the lookup tables you just calculated.
One obvious solution would be for the TLC to have generated a random 16-byte salt, then done the hashing, and then deleted the salt for evermore, thus effectively randomising the medallion and hack numbers while always using the same hash value for the same taxi or driver.
That would make it computationally infeasible to produce a table mapping the hashes to their original values.
Of course, whether that would be anonymous enough for public consumption is another matter altogether.
For example, if you have other data that identifies a specific cab and a specific driver (for example, because you were the passenger and recorded the time and the location of your journey), you can then geolocate that driver for the whole time he was at work for the entire year of 2013, from Whong’s FOIL data alone.
And that raises the question, “Should your freedom of information trump the driver’s own privacy?”
Image of yellow cab courtesy of Shutterstock.
You’re being a little harsh on Vijay. His page acknowledges that “Rainbow tables” is not the right term for a reverse-hash table, and he clearly states that the 2.2 million calculation is approximate–he states “about 2.2 million” and shows the approximation he made.
Your article would have been a little more interesting if you had included Whong’s conclusion and if you had explained the repeated hash value. (Vijay states it’s the hash of 0, probably inserted when the real number was unavailable.) Your point, that a hash is not an encryption , was obscured.
When will the problem of not getting notified about responses to posts be fixed?
Laurence Marks
Vijay actually said that “people have pointed out that [my lookup table] may not meet the exact requirements to be called a rainbow table.” But he was wrong. What he should have said is, “Sorry, what I created was a straight lookup table, not a rainbow table at all,” and corrected his article.
Terminology *is* important here, since the “rainbow table” is used specifically to denote a special sort of time-space tradeoff *when it is infeasible to create a full lookup table*. In other words, if a cryptographic crack can be reduced to a straight lookup table, you are looking at a problem that is much more dramatic than one that can only be attacked using rainbow tables. Therefore using the term “rainbow table” not only plain wrong, it misleadingly understates the problem.
He also wrote that the total count of medallion numbers was “1000*26**3 = 21952000,” which is not an approximation, is incorrectly computed (1000*26**3 is 17,576,000), and is not the total count anyway, since he left out the other arrangements of medallion numbers.
Considering the deliberately precise evaluation he gave before reducing it to an approximation, I think I am perfectly justified in explaining why my figure is significantly different (his is about 25% higher). After all, if I’d simply accepted his 22M approximation without comment, I’d be wrong, too.
As for explaining the repeated hash value, I did (clearly enough, I thought), [a] show and state that a reasonable initial guess at its significance would be “[u]ninitialised or unknown data with some default value, probably zero,” and [b] actually showed the text string “0” being MD5ed to give the hash in question, under the heading “Calculating the MD5 checksum of the string represented by zero.”
Lastly, we don’t have a “notify me about responses” option, I’m afraid. (Strictly speaking, that means we can’t “fix” it, since it doesn’t exist.)
Most people come back to see what’s going on with discussions they are interested in, not least because relevant responses aren’t always published as replies…some people prefer to start a new thread referring back to earlier posts, especially if they plan to introduce extra information or questions.
Any way to get such or similar data on the ride-sharing cars ?