One month ago today, we wrote about Adobe's giant data breach.
As far as anyone knew, including Adobe, it affected about 3,000,000 customer records, which made it sound pretty bad right from the start.
But worse was to come, as recent updates to the story bumped the number of affected customers to a whopping 38,000,000.
We took Adobe to task for a lack of clarity in its breach notification.
One of our complaints was that Adobe said that it had lost encrypted passwords, when we thought the company ought to have said that it had lost hashed and salted passwords.
As we explained at the time:
[T]he passwords probably weren't encrypted, which would imply that Adobe could decrypt them and thus learn what password you had chosen.
Today's norms for password storage use a one-way mathematical function called a hash that [...] uniquely depends on the password. [...] This means that you never actually store the password at all, encrypted or not.
[...And] you also usually add some salt: a random string that you store with the user's ID and mix into the password when you compute the hash. Even if two users choose the same password, their salts will be different, so they'll end up with different hashes, which makes things much harder for an attacker.
It seems we got it all wrong, in more than one way.
Here's how, and why.
The breach data
A huge dump of the offending customer database was recently published online, weighing in at 4GB compressed, or just a shade under 10GB uncompressed, listing not just 38,000,000 breached records, but 150,000,000 of them.
As breaches go, you may very well see this one in the book of Guinness World Records next year, which would make it astonishing enough on its own.
But there's more.
We used a sample of 1,000,000 items from the published dump to help you understand just how much more.
→ Our sample wasn't selected strictly randomly. We took every tenth record from the first 300MB of the comressed dump until we reached 1,000,000 records. We think this provided a representative sample without requiring us to fetch all 150 million records.
The dump looks like this:
By inspection, the fields are as follows:
Fewer than one in 10,000 of the entries have a username - those that do are almost exclusively limited to accounts at adobe.com and stream.com (a web analytics company).
The user IDs, the email addresses and the usernames were unnecessary for our purpose, so we ignored them, simplifying the data as shown below.
We kept the password hints, because they were very handy indeed, and converted the password data from base64 encoding to straight hexadecimal, making the length of each entry more obvious, like this:
Encryption versus hashing
The first question is, "Was Adobe telling the truth, after all, calling the passwords encrypted and not hashed?"
Remember that hashes produce a fixed amount of output, regardless of how long the input is, so a table of the password data lengths strongly suggests that they aren't hashed:
The password data certainly looks pseudorandom, as though it has been scrambled in some way, and since Adobe officially said it was encrypted, not hashed, we shall now take that claim at face value.
The encryption algorithm
The next question is, "What encryption algorithm?"
We can rule out a stream cipher such as RC4 or Salsa-20, where encrypted strings are the same length as the plaintext.
Stream ciphers are commonly used in network protocols so you can encrypt one byte at a time, without having to keep padding your input length to a multiple of a fixed number of bytes.
With all data lengths a multiple of eight, we're almost certainly looking at a block cipher that works eight bytes (64 bits) at a time.
That, in turn, suggests that we're looking at DES, or its more resilient modern derivative, Triple DES, usually abbreviated to 3DES.
→ Other 64-bit block ciphers, such as IDEA, were once common, and the ineptitude we are about to reveal certainly doesn't rule out a home-made cipher of Adobe's own devising. But DES or 3DES are the most likely suspects.
The use of a symmetric cipher here, assuming we're right, is an astonishing blunder, not least because it is both unnecessary and dangerous.
Anyone who computes, guesses or acquires the decryption key immediately gets access to all the passwords in the database.
On the other hand, a cryptographic hash would protect each password individually, with no "one size fits all" master key that could unscramble every password in one go - which is why UNIX systems have been storing passwords that way for about 40 years already.
The encryption mode
Now we need to ask ourselves, "What cipher mode was used?"
There are two modes we're interested in: the fundamental 'raw block cipher mode' known as Electronic Code Book (ECB), where patterns in the plaintext are revealed in the ciphertext; and all the others, which mask input patterns even when the same input data is encrypted by the same key.
The reason that ECB is never used other than as the basis for the more complex encryption modes is that the same input block encrypted with the same key always gives the same output.
Even repetitions that aren't aligned with the blocksize retain astonishingly recognisable patterns, as the following images show.
We took an RGB image of the Sophos logo, where each pixel (most of which are some sort of white or some sort of blue) takes three bytes, divided it into 8-byte blocks, and encrypted each one using DES in ECB mode.
Treating the resulting output file as another RGB image delivers almost no disguise:
Cipher modes that disguise plaintext patterns require more than just a key to get them started - they need a unique initialisation vector, or nonce (number used once), for each encrypted item.
The nonce is combined with the key and the plaintext in some way, so that that the same input leads to a different output every time.
If the shortest password data length above had been, say, 16 bytes, a good guess would have been that each password data item contained an 8-byte nonce and then at least one block's worth - another eight bytes - of encrypted data.
Since the shortest password data blob is exactly one block length, leaving no room for a nonce, that clearly isn't how it works.
Perhaps the encryption used the User ID of each entry, which we can assume is unique, as a counter-type nonce?
But we can quickly tell that Adobe didn't do that by looking for plaintext patterns that are repeated in the encrypted blobs.
Because there are 264 - close to 20 million million million - possible 64-bit values for each cipertext block, we should expect no repeated blocks anywhere in the 1,000,000 records of our sample set.
That's not what we find, as the following repetition counts reveal:
Remember that if ECB mode were not used, each block would be expected to appear just once every 264 times, for a minuscule prevalence of about 5 x 10-18%.
Now let's work out, "What is the password that encrypts as 110edf2294fb8bf4 and the other common repeats?"
If the past, all other things being equal, is the best indicator of the present, we might as well start with some statistics from a previous breach.
When Gawker Media got hacked three years ago, for example, the top passwords that were extracted from the stolen hashes came out like this:
(The word lifehack is a special case here - Lifehacker being one of Gawker's brands - but the others are easily-typed and commonly chosen, if very poor, passwords.)
This previous data combined with the password hints leaked by Adobe makes building a crib sheet pretty easy:
Note that the 8-character passwords 12345678 and password are actually encrypted into 16 bytes, denoting that the plaintext was at least 9 bytes long.
A highly likely explanation for this is that the input text consisted of: the password, followed by a zero byte (ASCII NUL), used to denote the end of a string in C, followed by seven NUL bytes to pad the input out to a multiple of 8 bytes to match the encryption's block size.
In other words, we are on safe ground if we infer that e2a311ba09ab4707 is the ciphertext that signals an input block of eight zero bytes.
That data shows up in the second ciphertext block in a whopping 27% of all passwords, which, if our assumption is correct, immediately leaks to us that all those 27% are exactly eight characters long.
The scale of the blunder
With very little effort, we have already recovered an awful lot of information about the breached passwords, including: identifying the top five passwords precisely, plus the 2.75% of users who chose them; and determining the exact password length of nearly one third of the database.
So, now we've showed you how to get started in a case like this, you can probably imagine how much more is waiting to be squeezed out of "the greatest crossword puzzle in the history of the world," as satirical IT cartoon site XKCD dubbed it.
Bear in mind that salted hashes - the recommended programmatic approach here - wouldn't have yielded up any such information - and you appreciate the magnitude of Adobe's blunder.
There's more to concern youself with.
Adobe also decribed the customer credit card data and other PII (Personally Identifiable Information) that was stolen in the same attack as "encrypted."
And, as fellow Naked Security writer Mark Stockley asked, "Was that data encrypted with similar care and expertise, do you think?
If you were on Adobe's breach list (and the silver lining is that all passwords have now been reset, forcing you to pick a new one), why not get in touch and ask for clarification?