When is a file not a file?

Sometimes it is easy to examine a file and to tell what it is.

Program files, at least on Windows, start with the two bytes ‘MZ’, after Mark Zbikowski, the Microsoft coder who invented the original EXE file format for DOS. Image files in the GIF format, often seen on web pages, begin with ‘GIF89a’. Many files carry a tell-tale format marker in their header bytes. Such markers are quaintly known as “magic numbers”.
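To make the idea concrete, here is a minimal Python sketch of magic-number sniffing. The table and the function names (`MAGIC_NUMBERS`, `sniff`, `sniff_file`) are made up for illustration, and the table covers only the two formats mentioned above, plus ‘GIF87a’, the older GIF variant.

```python
# Illustrative table of magic numbers; real tools know hundreds more.
MAGIC_NUMBERS = [
    (b"MZ", "DOS/Windows executable"),
    (b"GIF87a", "GIF image"),
    (b"GIF89a", "GIF image"),
]


def sniff(header: bytes) -> str:
    """Guess a file type from its leading bytes, or return 'unknown'."""
    for magic, description in MAGIC_NUMBERS:
        if header.startswith(magic):
            return description
    return "unknown"


def sniff_file(path: str) -> str:
    """Read just enough of a file to compare it against the table."""
    longest = max(len(magic) for magic, _ in MAGIC_NUMBERS)
    with open(path, "rb") as f:
        return sniff(f.read(longest))
```

A match is a hint, not proof: a text file that happens to begin with the letters ‘MZ’ would be reported as an executable by this sketch.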

Other file formats have no official magic, but are still recognisable. Programs written in Python, for instance, are just plain text files. But the idioms of the Python language usually make such programs stand out – the first word in a Python file is often ‘import’, denoting the libraries the program uses; lines which don’t start with spaces often start with the word ‘def’, since that is how Python functions are defined, and so on.
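A rough sketch of that sort of idiom-spotting might look like the following. The function name `looks_like_python`, the list of keywords and the threshold of two hints are all arbitrary choices made for this example, not an established detection rule.

```python
def looks_like_python(text: str) -> bool:
    """Crude guess: do at least two unindented lines start with a Python keyword?"""
    keywords = ("import ", "from ", "def ", "class ")
    hints = 0
    for line in text.splitlines():
        # startswith() sees leading spaces, so indented lines never match.
        if line.startswith(keywords):
            hints += 1
    return hints >= 2
```

Requiring two hints rather than one avoids flagging ordinary prose in which a single line happens to begin with the word ‘import’, but the guess can still be wrong in either direction.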

But what of encrypted files? How can you tell if a file is encrypted? Technically – assuming that the file is not re-encoded in some structured way after encryption – you can’t. At least, you can’t if the encryption is any good.

Strongly encrypted data is indistinguishable from a stream of strictly random bytes, since to be strongly encrypted, the data must contain no discernible patterns which might be used to infer its original form.

Written English, for example, contains the letters ETAOIN very much more frequently – and predictably so – than VKXJQZ. Similarly, in English, Q is very much more often followed by U than by any other letter.
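You can measure both sorts of skew for yourself with a few lines of Python. This is only a sketch: the helper names `letter_counts` and `followers` are invented here, and only the unaccented letters A to Z are counted.

```python
from collections import Counter
import string


def letter_counts(text: str) -> Counter:
    """Count how often each letter A-Z occurs, ignoring case."""
    return Counter(c for c in text.upper() if c in string.ascii_uppercase)


def followers(text: str, letter: str) -> Counter:
    """Count which letters come immediately after `letter`, ignoring case."""
    upper = text.upper()
    target = letter.upper()
    return Counter(
        b
        for a, b in zip(upper, upper[1:])
        if a == target and b in string.ascii_uppercase
    )
```

Run over a reasonably long English text, `letter_counts(text).most_common(6)` should show letters such as E, T and A near the top, and `followers(text, "q")` should be dominated by U.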

When encrypting data, it is vital that this, or any other sort of frequency skew, is removed. This leaves you with a file in which every possible byte value is equally likely at each byte offset in the file, and in which any byte is equally likely to be followed by any other. To an external observer, such a data stream appears random, even though you can easily reconstruct the original file using the decryption key.
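One common way to check the first of those properties is to compute the Shannon entropy of the file’s byte histogram. The sketch below assumes the whole file fits in memory; `byte_entropy` is a name chosen for this example.

```python
import math
from collections import Counter


def byte_entropy(data: bytes) -> float:
    """Shannon entropy of the byte histogram, in bits per byte (0.0 to 8.0)."""
    if not data:
        return 0.0
    total = len(data)
    return -sum(
        (n / total) * math.log2(n / total) for n in Counter(data).values()
    )
```

A result well below 8 bits per byte shows that some byte values are more common than others. A result close to 8 shows only that the histogram is flat: compressed data scores highly too, and this measure says nothing about which bytes tend to follow which.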

The difference between random and *really* random can be subtle, and the difference may go unnoticed for years. The stream cipher RC4, for example, produces output which is nearly, but not quite, random. One particular flaw in RC4 means that the second byte of any RC4 cipher stream has the value zero twice as often as it should. Related statistical weaknesses in RC4 led to the cracking of WEP, once considered suitable for WiFi security.
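The second-byte bias is easy to observe experimentally. The sketch below implements textbook RC4 in plain Python and counts how often the second keystream byte is zero across many random keys. The function names, the 16-byte key length and the fixed seed are choices made for this example; the code is for experimenting only, not for encrypting anything.

```python
import random


def rc4_keystream(key: bytes, length: int) -> bytes:
    """Return the first `length` bytes of the RC4 keystream for `key`."""
    # Key-scheduling algorithm: shuffle the state under control of the key.
    s = list(range(256))
    j = 0
    for i in range(256):
        j = (j + s[i] + key[i % len(key)]) % 256
        s[i], s[j] = s[j], s[i]
    # Pseudo-random generation algorithm: emit one byte per step.
    i = j = 0
    out = bytearray()
    for _ in range(length):
        i = (i + 1) % 256
        j = (j + s[i]) % 256
        s[i], s[j] = s[j], s[i]
        out.append(s[(s[i] + s[j]) % 256])
    return bytes(out)


def second_byte_zero_rate(trials: int, key_len: int = 16, seed: int = 1) -> float:
    """Fraction of random keys whose second keystream byte is zero."""
    rng = random.Random(seed)  # seeded so that runs are repeatable
    zeros = 0
    for _ in range(trials):
        key = bytes(rng.getrandbits(8) for _ in range(key_len))
        if rc4_keystream(key, 2)[1] == 0:
            zeros += 1
    return zeros / trials
```

A truly random stream would give a rate of about 1/256, or roughly 0.0039. With a few tens of thousands of trials, `second_byte_zero_rate` should come out close to double that.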

This raises the question: how can you tell, after encrypting a file, whether it really is encrypted? How can you be sure your encryption software is working correctly?

And the answer is: you can’t. (You may be able to prove that it *isn’t* working properly. But absence of proof isn’t proof of absence.) You really do need to trust your vendor!