Anatomy of a cryptographic oracle - understanding (and mitigating) the BREACH attack

Filed Under: Cryptography, Featured

A whole lot has been talked, over the past week, about a newly-documented attack against secure HTTP, better known as HTTPS.

The attack is dubbed BREACH, fancifully expanded as Browser Reconnaissance and Exfiltration via Adaptive Compression of Hypertext.

Its notoriety comes from the fact that it was presented at the BlackHat 2013 conference.

There seems to be a fair bit of misunderstanding of exactly how BREACH works, and therefore how it might be mitigated or fixed.

Some articles have likened BREACH to previous attacks against secure HTTP, such as Lucky 13 and BEAST.

But that's not the full story, because those earlier attacks were against TLS.

TLS is the cryptographic wrapper, known as Transport Layer Security, that provides a secure conduit for electronic transactions of many sorts, notably web and email.

BREACH is really an attack against HTTP itself, the hypertext transfer protocol that is used ubiquitously for delivering web pages.

Indeed, you could even use the BREACH attack against an unencrypted web session, if you were unable to sniff the entire session but could only capture certain metadata about it, such as the lengths of HTTP replies.

What sort of attack is BREACH?

BREACH is what's known as an oracle attack.

That's where you are able to fire some questions at a computer system, observe the answers that come back, and use them to infer facts that the answers didn't intend to disclose - regardless of how garbled or irrelevant those answers might seem at first.

→ The name comes from the Oracle at Delphi, a Greek priestess who would answer questions while under the influence of psychotropic vapours. Presumably, the answers were trippily disconnected enough to be useless on their own, but were nevertheless interpreted by priestly intermediaries in an attempt to make them shed light on the issues under consideration.

A easy-to-understand example is given in a discussion on StackExchange from nearly a year ago, where the poster, Thomas Pornin, describes a trick he says was used in World War One by French codebreakers.

They noticed that the Germans included the rank of the intended recipient in the preamble of their messages, and though it was encrypted, the Germans did not disguise the length of this field.

So messages to the Colonel (Oberst) would stand out from messages to a junior officer (Oberleutnant) simply because of the difference in length of the two words.

Since messages to the Colonel were generally much more important, they were attempted first.

How does BREACH work?

The attack relies on the fact that many web pages are compressed to save bandwidth before they are sent out.

Loosely speaking, the web server takes the body of an HTTP reply and squashes it with an algorithm known as deflate, the same compression that is used in PKZIP and gzip.

On arrival, your browser inflates it and processes it from there as if the HTTP session had been uncompressed all along.

Even when a compressed HTTP response is wrapped in TLS encryption to produce a secure HTTPS connection, the actual length of the encrypted data stream (more precisely, the length of the compressed data inside the encrypted stream) can be determined.

And by trying lots of different inputs, each slightly different from the last, and seeing how well or badly they compress, you may be able to guess some of what's inside, even though you can't decrypt any of the messages

A practical example

Imagine a vendor's website that has a page on which you can search for invoices.

Of course, you need to login to get access, so when you first visit the search page, you'll be asked to do so:

Once that's out of the way, the web server will probably set a session cookie that your browser will send in the headers of future HTTP requests in order to denote that you have successfully logged in.

(An attacker can't just use an automatic tool like Firesheep to grab the cookie by sniffing - the traffic is encrypted over HTTPS.)

Now you can carry out a search:

You get back an answer which looks like this:

The raw HTML, decrypted and inflated, looks like this:

Because the content is encrypted in transmission, it doesn't seem inappropriate to include what amounts to Personally Identifiable Information (PII) in the body of the web page, in this case the login ID 10462897.

But notice that even though the web page is encrypted, there is one part that you can not only predict, but control precisely: the search term that you entered into the search box.

For example, once you've logged in and a suitable session cookie has been set by the vendor's server, you can bypass the search box and control the search results by directly tweaking the search URL:

Now stop to ask yourself, "In the web page above, where the login ID appears twice in close proximity, what would happen to the compression ratio?"

The deflate algorithm works by detecting repeated byte sequences in the input, and avoiding repeating them in the output.

Instead of outputting 10462897 a second time, the compressor says, "I already saw the next eight characters 66 characters ago."

So if you run a series of searches, with each search term being a guess at the login ID, you'd expect to get back HTTP replies of varying length.

In theory, the better your guess, the better the reply will compress (because of the repeated data), and the shorter it will be.

From oracular response to probable fact

For our imaginary website, I wrote a tiny HTML page that uses JavaScript to force you to perform 10000 searches of the vulnocorp.test website, each embedded in a tiny IFRAME:

I left each IFRAME visible, just for visual confirmation that the multi-search script was working and harvesting results for every four-digit combination, despite the script's tiny size:

Using the Firebug debugger shows us the HTML that was generated and actually rendered in the browser:

(In practice, those IFRAMES could be be made smaller, and hidden in amongst legitimate-looking content, or even made entirely invisible.)

I measured the compressed sizes of the 10,000 HTML pages that were returned.

There wasn't much variety in the compressed lengths, which ranged from 267 to 270 bytes, but there was a massive skew in how many times each length showed up, to the extent that I had to use a logarithmic scale to make the graph fit:

It would be a good guess that, amongst the 30 replies that gave a minimum compressed length (i.e. compressed the best through having most repetition), we would find:

  • Search terms with four repeated digits, e.g. 7777.
  • Search terms with three or four characters that match a substring in the login ID.

And so it turned out, as you can see from the list of the 30 "best compressed" search terms:

In the above list I was easily able to construct a chain of four-digit values in which the last three digits of one value overlapped with the first three of the next.

There's a potential ambiguity at 6289, the fourth block, which can be followed by 2890, 2893 or 2897.

But only 2897 is followed by other sequences in the list (8790 and 8793), suggesting that the three digits 897 make up the end of the login ID.

Technically, I haven't decrypted anything, but I have nevertheless recovered a modest but significant block of data from the encrypted stream: the value of the eight-digit login ID, 10462897.

And that, in an admittedly fairly extensive nutshell, is the nature of the BREACH attack.

Defending againt BREACH

Clearly, to use my imaginary attack in real life, the following are required:

  • The HTTP pages must include both the PII that I want to guess and some chosen plaintext that I can control.

  • The server must be requested, and configured, to serve the pages with compression turned on.
  • I need to persuade you, while you are logged in, to visit a page containing my malicious do-lots-of-searches JavaScript.
  • I have to know what sort of PII I am looking for in order to structure my searches correctly.
  • I have to be able to sniff your traffic in sufficient detail to recover the length of each HTTP reply.

So, if you run a web server and you turn off compression for any pages that inlcude PII, you neutralise this attack.

On the other side of the equation, you may be able to force your browser not to invite the use of HTTP compression in the first place, thus neutralising the attack.

For example, in Firefox, the about:config page allows you to edit the Accept-Encoding: header sent by the browser, changing it from the default of gzip, deflate to an empty string, meaning, "No compression, thanks":

If any readers know how to achieve similar results in their favourite browsers, please let us know in the comments!

NB. Be aware that if you configure your browser to disallow compressed HTTP replies, your bandwidth use is likely to increase significantly. Compression will be off even for unencrypted web pages (where an attacker who can sniff your HTTP reply lengths can probably just grab the raw data anyway), and for pages where BREACH is irrelevant because there is no PII to extract. By way of example, I just loaded a Naked Security page that used up 183 KBytes with compression enabled, but would have taken more than 500 KBytes without.

, , ,

You might like

16 Responses to Anatomy of a cryptographic oracle - understanding (and mitigating) the BREACH attack

  1. Mr C · 789 days ago

    Very interesting, do you know if IIS 7. 5 with PHP would by default compress pages? I think we have the PHP cache installed.

    • Paul Ducklin · 789 days ago

      No idea. But in logging stuff for my images I got a bunch of pages from IIS 7.5 servers that were (from memory) all compressed where that made sense, e.g. for HTML, CSS and JS but not for already-compressed stuff like GIFs and PNGs.

      Of course, that was for HTTP pages, not HTTPS - the IIS default for secure pages *might* be uncompressed precisely because this "oracle attack" is not a new idea.

      Anyone know the answer for Mr C?

      (Not to naysay the BEAST paper, of course - it's a new PoC to make the dangers clear - but this sort of danger has been known, if perhaps not appreciated, for some time.)

  2. Obliterator · 789 days ago

    Good article. Thanks.

    So could I not just configure the server to output some random hidden text to each page each time to mitigate against this whilst still using compression?

    • 2072 · 789 days ago

      I think so but make sure this random text is also random in length and compresses badly.

    • Paul Ducklin · 789 days ago

      I suppose you could add an HTML comment of randomly-varying length to mess up the stats so that good compression might be made to look bad by a long random comment, and bad compression made to look good by a short one.

      It's the length that needs to vary. If you dump a fixed-size random string into each HTML page you won't make any difference to which outputs compress the most , only to how long those shortest outputs are. (Generally speaking, this attack is about relative sizes, not absolutes.)

      Just be careful that by adding a hack like this you don't introduce some other flaw :-)

      (And use a good PRNG!)

  3. I think there is a typo here. Notice the difference in numbers. You have mentioned 98 in some places where it should be 89:

    There's a potential ambiguity at 6298, the fourth block, which can be followed by 2980, 2893 or 2897.

    But only 2987 is followed by other sequences in the list (8790 and 8793), suggesting that the three digits 897 make up the end of the login ID.

    Technically, I haven't decrypted anything, but I have nevertheless recovered a modest but significant block of data from the encrypted stream: the value of the eight-digit login ID, 10462987."

    It should be:
    There's a potential ambiguity at 6289, the fourth block, which can be followed by 2890, 2893 or 2897.

    But only 2897 is followed by other sequences in the list (8970 and 8793), suggesting that the three digits 897 make up the end of the login ID.

    Technically, I haven't decrypted anything, but I have nevertheless recovered a modest but significant block of data from the encrypted stream: the value of the eight-digit login ID, 10462897."

    • Paul Ducklin · 788 days ago

      Aaaargh. I kept looking at that and thinking, "Something is wrong here."

      I think the problem started when I mistyped the 8-digit code the first time, then copy-n-pasted it so I was sure to have exactly the same text for the deflate algorithm to match :-)

      All 2987s are now 2897s. (At least the error wasn't in the image of the 30 numbers, as that would have been more hassle to fix :-)

      Thanks for spotting!

  4. Carson · 789 days ago

    I think you mixed the numbers a bit up in the "2897" part. But on the other hand, Im a bit sleepy.

    • Paul Ducklin · 788 days ago

      I mixed them up a lot. All due to one transposition (89->98) at the start :-(

      Would you mind having another look and seeing if I have got it right at last? (I thought I had it right before, so a second look would be handy.)

      • Carson · 788 days ago

        Yes, I think now it is all good. Thats what happens if you fiddle with numbers (thats why I scored pretty low on my first College math exam).

  5. Thank you for this article! At first, I was sceptical about whether I would stand even a remote chance of understanding it, but you dumbed it down to my level nicely! :P

  6. Steve · 787 days ago

    Exceptionally well-presented, as usual. Is that too contradictory? Let me clarify: "as usual", for Mr. Ducklin; exceptional, relative to most of what passes for "professional" writing these days.

    And the little nugget about the Oracle at Delphi was a treat... "trippily disconnected"? LOL!

  7. Felix · 786 days ago

    Great article, thanks for helping people to understand this breach attack! One question that arises, then, is: are subjected to this attack only those pages that send some tokens several times (like the ID in your example)? For example: sites using wordpress do not use the address bar to sending tokens, so we could say that are something more protected against this attack?

  8. Thrawn · 783 days ago

    Rather than inserting random HTML comments, couldn't the website set a random-length cookie each time?

    That way, you can compress the HTML response body just like always; you can be sure that the true recipient will ignore your 'noise'; you can set the 'Secure' flag to avoid wasting bandwidth on unencrypted pages; and since modern browsers no longer use TLS compression due to CRIME, you don't have to worry about whether the noise is compressible. Plus, if you're running a web server like Apache/Nginx with an application server behind it, you could apply this at the web server level and not have to touch the application server at all.

    As a bonus, if you do ensure that the noise is (mostly) incompressible, then in theory you could safely turn TLS compression back on, because the browser would send your noise cookie back each time and scramble CRIME attacks.

  9. Thrawn · 782 days ago

    Instead of inserting random text into the page, why not set a random-length 'noise' cookie on each request?

    - You don't tamper with the page contents, so you're not likely to break anything.
    - If you use a web proxy for your SSL termination, then it can do the work without any changes to your application.
    - The browser will send the cookie for requests as well, so regular CRIME is defeated into the bargain.
    - Since browsers have disabled TLS compression to fight CRIME, your noise definitely won't be compressed (although you should make it mostly-incompressible anyway, in case they ever turn it back on).
    - You can set the Secure flag, so the cookie won't waste bandwidth on unencrypted requests.

    The attacker could still crack it by making a large number of requests and analysing the average change for each inserted string. But then their attack time blows out from 30 seconds (feasible) to several hours (probably not feasible in the real world).

  10. I don't understand this algorithm. It's do tricky for me!

Leave a Reply

Fill in your details below or click an icon to log in: Logo

You are commenting using your account. Log Out / Change )

Twitter picture

You are commenting using your Twitter account. Log Out / Change )

Facebook photo

You are commenting using your Facebook account. Log Out / Change )

Google+ photo

You are commenting using your Google+ account. Log Out / Change )

Connecting to %s

About the author

Paul Ducklin is a passionate security proselytiser. (That's like an evangelist, but more so!) He lives and breathes computer security, and would be happy for you to do so, too. Paul won the inaugural AusCERT Director's Award for Individual Excellence in Computer Security in 2009. Follow him on Twitter: @duckblog