Anatomy of a cryptographic oracle – understanding (and mitigating) the BREACH attack

A whole lot has been talked, over the past week, about a newly-documented attack against secure HTTP, better known as HTTPS.

The attack is dubbed BREACH, fancifully expanded as Browser Reconnaissance and Exfiltration via Adaptive Compression of Hypertext.

Its notoriety comes from the fact that it was presented at the BlackHat 2013 conference.

There seems to be a fair bit of misunderstanding of exactly how BREACH works, and therefore how it might be mitigated or fixed.

Some articles have likened BREACH to previous attacks against secure HTTP, such as Lucky 13 and BEAST.

But that’s not the full story, because those earlier attacks were against TLS.

TLS is the cryptographic wrapper, known as Transport Layer Security, that provides a secure conduit for electronic transactions of many sorts, notably web and email.

BREACH is really an attack against HTTP itself, the hypertext transfer protocol that is used ubiquitously for delivering web pages.

Indeed, you could even use the BREACH attack against an unencrypted web session, if you were unable to sniff the entire session but could only capture certain metadata about it, such as the lengths of HTTP replies.

What sort of attack is BREACH?

BREACH is what’s known as an oracle attack.

That’s where you are able to fire some questions at a computer system, observe the answers that come back, and use them to infer facts that the answers didn’t intend to disclose – regardless of how garbled or irrelevant those answers might seem at first.

→ The name comes from the Oracle at Delphi, a Greek priestess who would answer questions while under the influence of psychotropic vapours. Presumably, the answers were trippily disconnected enough to be useless on their own, but were nevertheless interpreted by priestly intermediaries in an attempt to make them shed light on the issues under consideration.

A easy-to-understand example is given in a discussion on StackExchange from nearly a year ago, where the poster, Thomas Pornin, describes a trick he says was used in World War One by French codebreakers.

They noticed that the Germans included the rank of the intended recipient in the preamble of their messages, and though it was encrypted, the Germans did not disguise the length of this field.

So messages to the Colonel (Oberst) would stand out from messages to a junior officer (Oberleutnant) simply because of the difference in length of the two words.

Since messages to the Colonel were generally much more important, they were attempted first.

How does BREACH work?

The attack relies on the fact that many web pages are compressed to save bandwidth before they are sent out.

Loosely speaking, the web server takes the body of an HTTP reply and squashes it with an algorithm known as deflate, the same compression that is used in PKZIP and gzip.

On arrival, your browser inflates it and processes it from there as if the HTTP session had been uncompressed all along.

Even when a compressed HTTP response is wrapped in TLS encryption to produce a secure HTTPS connection, the actual length of the encrypted data stream (more precisely, the length of the compressed data inside the encrypted stream) can be determined.

And by trying lots of different inputs, each slightly different from the last, and seeing how well or badly they compress, you may be able to guess some of what’s inside, even though you can’t decrypt any of the messages

A practical example

Imagine a vendor’s website that has a page on which you can search for invoices.

Of course, you need to login to get access, so when you first visit the search page, you’ll be asked to do so:

Once that’s out of the way, the web server will probably set a session cookie that your browser will send in the headers of future HTTP requests in order to denote that you have successfully logged in.

(An attacker can’t just use an automatic tool like Firesheep to grab the cookie by sniffing – the traffic is encrypted over HTTPS.)

Now you can carry out a search:

You get back an answer which looks like this:

The raw HTML, decrypted and inflated, looks like this:

Because the content is encrypted in transmission, it doesn’t seem inappropriate to include what amounts to Personally Identifiable Information (PII) in the body of the web page, in this case the login ID 10462897.

But notice that even though the web page is encrypted, there is one part that you can not only predict, but control precisely: the search term that you entered into the search box.

For example, once you’ve logged in and a suitable session cookie has been set by the vendor’s server, you can bypass the search box and control the search results by directly tweaking the search URL:

Now stop to ask yourself, “In the web page above, where the login ID appears twice in close proximity, what would happen to the compression ratio?”

The deflate algorithm works by detecting repeated byte sequences in the input, and avoiding repeating them in the output.

Instead of outputting 10462897 a second time, the compressor says, “I already saw the next eight characters 66 characters ago.”

So if you run a series of searches, with each search term being a guess at the login ID, you’d expect to get back HTTP replies of varying length.

In theory, the better your guess, the better the reply will compress (because of the repeated data), and the shorter it will be.

From oracular response to probable fact

For our imaginary website, I wrote a tiny HTML page that uses JavaScript to force you to perform 10000 searches of the vulnocorp.test website, each embedded in a tiny IFRAME:

I left each IFRAME visible, just for visual confirmation that the multi-search script was working and harvesting results for every four-digit combination, despite the script’s tiny size:

Using the Firebug debugger shows us the HTML that was generated and actually rendered in the browser:

(In practice, those IFRAMES could be be made smaller, and hidden in amongst legitimate-looking content, or even made entirely invisible.)

I measured the compressed sizes of the 10,000 HTML pages that were returned.

There wasn’t much variety in the compressed lengths, which ranged from 267 to 270 bytes, but there was a massive skew in how many times each length showed up, to the extent that I had to use a logarithmic scale to make the graph fit:

It would be a good guess that, amongst the 30 replies that gave a minimum compressed length (i.e. compressed the best through having most repetition), we would find:

  • Search terms with four repeated digits, e.g. 7777.
  • Search terms with three or four characters that match a substring in the login ID.

And so it turned out, as you can see from the list of the 30 “best compressed” search terms:

In the above list I was easily able to construct a chain of four-digit values in which the last three digits of one value overlapped with the first three of the next.

There’s a potential ambiguity at 6289, the fourth block, which can be followed by 2890, 2893 or 2897.

But only 2897 is followed by other sequences in the list (8790 and 8793), suggesting that the three digits 897 make up the end of the login ID.

Technically, I haven’t decrypted anything, but I have nevertheless recovered a modest but significant block of data from the encrypted stream: the value of the eight-digit login ID, 10462897.

And that, in an admittedly fairly extensive nutshell, is the nature of the BREACH attack.

Defending againt BREACH

Clearly, to use my imaginary attack in real life, the following are required:

  • The HTTP pages must include both the PII that I want to guess and some chosen plaintext that I can control.
  • The server must be requested, and configured, to serve the pages with compression turned on.
  • I need to persuade you, while you are logged in, to visit a page containing my malicious do-lots-of-searches JavaScript.
  • I have to know what sort of PII I am looking for in order to structure my searches correctly.
  • I have to be able to sniff your traffic in sufficient detail to recover the length of each HTTP reply.

So, if you run a web server and you turn off compression for any pages that include PII, you neutralise this attack.

On the other side of the equation, you may be able to force your browser not to invite the use of HTTP compression in the first place, thus neutralising the attack.

For example, in Firefox, the about:config page allows you to edit the Accept-Encoding: header sent by the browser, changing it from the default of gzip, deflate to an empty string, meaning, “No compression, thanks”:

If any readers know how to achieve similar results in their favourite browsers, please let us know in the comments!

NB. Be aware that if you configure your browser to disallow compressed HTTP replies, your bandwidth use is likely to increase significantly. Compression will be off even for unencrypted web pages (where an attacker who can sniff your HTTP reply lengths can probably just grab the raw data anyway), and for pages where BREACH is irrelevant because there is no PII to extract. By way of example, I just loaded a Naked Security page that used up 183 KBytes with compression enabled, but would have taken more than 500 KBytes without.