Anatomy of a pseudorandom number generator – visualising Cryptocat’s buggy PRNG

We recently wrote about a security advisory from Cryptocat, an open-source web-based secure messaging project.

Cryptocat’s hacktivist credibility was cemented in 2012 when its Canadian developer, Nadim Kobeissi, was stopped at the US border and interviewed about his Cryptocat-related programming activities.

The software enjoyed a surge in popularity as a result, which made last week’s Fourth-of-July vulnerability announcement all the more worrying.

It turned out that the Cryptocat code had for some time contained a number of cryptographic flaws that made it measurably less secure than its users might reasonably have assumed.

One of these flaws was found in the heart of Cryptocat’s PRNG (pseudo-random number generator).

PRNGs are a key component of cryptographic software

For the same reason that you don’t use the same password on more than one website, so that one cracked password doesn’t open up your whole online life, Cryptocat generates new keys every time you chat to someone.

So, to generate a steady supply of unpredictable, unguessable, one-use encryption keys, Cryptocat needs a steady supply of high-quality random numbers.

With this in mind, I thought it might be interesting, and educational, to drill into one of Cryptocat’s now-patched PRNG bugs.

Cryptocat random numbers, old-style

Cryptocat is written in JavaScript, which represents numbers internally as 8-byte IEEE-754 floating point numbers.

Usefully, 8-byte floats allow you to represent integers up to and beyond 32 bits, though annoyingly this numeric range doesn’t extend all the way to 64 bit integers.

For technical reasons, IEEE-754 numbers run out of integer precision at 253, which corresponds to nearly, but not quite, 16 decimal digits of precision.

Since Cryptocat’s core PRNG function, Cryptocat.random(), returns a JavaScript number, the coders made sure they extracted all possible variability every time it was called.

They constructed a text string representing a random 16-digit (53 bit) decimal number somewhere from 0 to 0.999999999999999 inclusive. In pseudocode:

// Start building a number (0<=N<1) as a text string

numstr = '0.'

// tack on 16 random ASCII digits

for i = 1 to 16 do
   numstr = numstr + randomASCIIdigit()
end

// Convert the string to an 8-byte float

return stringtonumber(numstr)

Curiously, the coders generated each random ASCII digit by extracting a random byte from a stream cipher of excellent repute known as Salsa20, and converting that byte into one of the characters from ‘0’ to ‘9’.

But a byte (8 bits) can take on 256 (28) different values, and 256 isn’t a multiple of 10.

Since they couldn’t divide 256 bytes into 10 equally-sized groups of values, the coders resorted to a trick:

repeat
   byte250 = randomSalsaByte()
until byte250 <= 250

randomdigit = remainder(byte250,10)

The idea is simple: 250 is evenly divisible by 10, so it’s easy to map the numbers from 1..250 onto the digits 0..9.

And 250 is very close to 256, so by simply throwing away random bytes from 251..255, the problem of indivisibility by 10 is sidestepped.

The problem

The code above is certainly an inelegant solution, since it is, in theory, at least, a potentially infinite loop.

→ If the randomSalsaByte() function is truly random, there is no mathematical guarantee that it won’t continue to return values above 250 for seconds, or minutes, or months.

It is also flat-out wrong.

The less-than-or-equal-to sign should be a less-than sign.

There are 251 different integers such that 0 ≤ X ≤ 250.

The programmer meant to specify that 0 ≤ X < 250 or, equivalently, that 0 ≤ X ≤ 249.

This is known as a fencepost error, by analogy with the fact that it takes seven fenceposts one metre apart to create a fence six metres long.

That’s because you need an extra fencepost at the far end of the sixth section:

Quantifying the error

When you divide numbers from 0..249 by 10 and take the remainder, the values 0 to 9 come out on average 25 times each, or 10% of the time, for a uniform distribution of digits.

But when you do the same with numbers from 0..250, there are 25 ways to come up with each of the digits 1..9, and 26 ways to come up with 0.

Instead of coming up on average 10% of the time (25/250, or 0.1), 0 will come up approximately 10.36% of the time (26/251, or 0.1036), and 1..9 will come up 9.96% each (25/251, or 0.0996).

The difference is easily visualised.

Here is a graph showing three tests, each generating 1,000,000 random digits using a good PRNG (the file /dev/urandom on OS X).

You’d expect each digit to come up about 100,000 times, and that’s what happened:

But the Cryptocat generator, with its one-part-in-251 inaccuracy, gives a graph that on casual inspection doesn’t look right.

Each time, the count for 0 is somewhere around 103,500 – approximately 10.36% of the 1,000,000 digits in each sample:

Of course, the visible differences between the red graphs and the blue graphs might not be statistically significant, since they could have arisen by chance.

There is, after all, no law that prevents unusual looking results, just as there is nothing that stops a casino’s roulette wheel from landing on red 20 times in a row.

But there are statistical tools we can use to measure the likelihood that a set of non-random-looking results really did happen by chance, not due to some flaw in the PRNG.

Chi-squared

One common method is known as the Chi-squared test.

You compute the difference between each reading and the average value you expected.

Then you square the differences, divide each one by the expected value, and add those scaled-squares-of-differences together.

Because you square the differences, readings that are unusually far from what you expected have a much larger effect on the Chi-squared value than readings that were close to correct.

You then do some fancy mathematical calculations (or look up the result in a pre-computed table) to convert the Chi-squared value into a probability figure; the lower the probability, the worse your PRNG looks.

Loosely speaking, the Chi-squared probability tells you the chance that the variation in results that you observed really did happen by accident.

So, the lower the figure, the less likely that the inaccuracies you observed can be explained away as natural and reasonable.

→ Lower than 0.1% is considered very suspicious, since it suggests that your observed result should only happen “against the odds” once in every 1000 tries. So there’s a 99.9% chance that your weird results are down to a flaw in your assumptions. In this case our assumption is that the numbers come from a reliable PRNG; if we reject that hypothesis, we effectively accept that the PRNG is faulty.

Generating 30 sets of 100,000 samples using the /dev/urandom generator produced a wide range of probability figures, none particularly alarming:

But similar results for 30 sets of Cryptocat PRNG digits have a much shabbier look:

Stepping up the number of samples in each test – simply put, giving more time for genuinely random variations to even themselves out – polarises the results even further.

With 30 sets of 1,000,000 samples, /dev/urandom seems to satisfy our assumption, namely that it is a decent PRNG:

Cryptocat’s fencepost error, however, has really put a spanner in the works:

The error propagated

Clearly, the PRNG’s bias towards 0 is consistent and quantifiable: it’s off-kilter by 1/251, as we showed above.

Does that sort of bias still cause a problem when we combine sixteen digits into a floating point string and convert it back to a 53-bit number?

What if we generate 53 bits of floating point precision, and then do this:

number = rounddown(Cryptocat.random()*100) + 1

Since the random() function produces a number from 0 to 0.9999999999999999, multiplying by 100 and rounding down limits us to numbers from 0..99, and adding 1 produces an integer from 1..100.

We’ve converted 16 flawed digits into 53 bits of IEEE-754 floating point fraction, then converted that again to fewer than seven bits’ worth of positive integer.

Perhaps that is enough to hide the bias we saw above?

It isn’t.

Visualising the flaw

Below, I created a 100×100 grid and then generated 10,000,000 pairs of Cryptocat-style floats.

I mapped each pair of 53-bit numbers to a pair of integers from 1..100, as explained above, and used that pair as an (X,Y) co-ordinate in my 100×100 grid, keeping tally of how many times each point in the grid got a “hit”.

I then used those counts to choose a colour for each point in the grid; more “hits” meant a brighter colour.

Using /dev/urandom to produce my digits from 0..9 in my PRNG, I ended up with what looks like (and, indeed, is) just so much random noise:

But using the flawed Cryptocat PRNG produced a ghostly grid of colours that clearly reveals the systematic prejudices in the PRNG:

And that, my friends, is why PRNGs are important, and why they should be tested thoroughly every time you build and ship the latest version of your software.

A PRNG which passes the Chi-squared test may very well still be deeply flawed, but one that fails, and which produces such clear visual regularities, is definitely unfit for purpose.

Random number generators, as the joke goes, are far too important to be left to chance.