When is a file not a file?

Filed Under: Cryptography, Data loss, Privacy

Sometimes it is easy to examine a file and to tell what it is.

Program files, at least on Windows, start with the two bytes 'MZ', after Mark Zbikowski, the Microsoft coder who invented the original EXE file format for DOS. Image files in the GIF format, often seen on web pages, begin with 'GIF89a'. Many files carry a tell-tale format marker in their header bytes. Such markers are quaintly known as "magic numbers".
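To make the idea concrete, here is a minimal sketch of a magic-number check in Python. The table holds only the two markers mentioned above, plus the older 'GIF87a' variant as an addition of ours; the names `MAGIC`, `sniff` and `sniff_file` are made up for illustration.

```python
# Known header markers, checked in order: (magic bytes, description).
# 'MZ' and 'GIF89a' come from the text; 'GIF87a' is an extra example.
MAGIC = [
    (b"MZ", "DOS/Windows executable"),
    (b"GIF89a", "GIF image"),
    (b"GIF87a", "GIF image (older 87a variant)"),
]


def sniff(header: bytes) -> str:
    """Guess a file type from its first few bytes."""
    for magic, name in MAGIC:
        if header.startswith(magic):
            return name
    return "unknown"


def sniff_file(path: str) -> str:
    """Read just enough of a file to compare against the table."""
    with open(path, "rb") as f:
        return sniff(f.read(8))
```

A match is only a hint, not a guarantee: a text file that happens to begin with the letters 'MZ' would be reported as a program.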

Other file formats have no official magic, but are still recognisable. Programs written in Python, for instance, are just plain text files. But the idioms of the Python language usually make such programs stand out - the first word in a Python file is often 'import', denoting the libraries the program uses; lines which don't start with spaces often start with the word 'def', since that is how Python functions are defined, and so on.
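The two idioms just described can be turned into a rough heuristic. The sketch below is an illustration only, not a reliable detector: the function name `python_hints` and the shape of its result are our own invention, and plenty of real Python files would show neither hint.

```python
def python_hints(text: str) -> dict:
    """Collect the two tell-tale signs of Python source described above."""
    # Ignore blank lines so that spacing between functions doesn't matter.
    lines = [ln for ln in text.splitlines() if ln.strip()]
    first_word = lines[0].split()[0] if lines else ""
    # "Unindented" means the line does not begin with a space or tab.
    unindented = [ln for ln in lines if not ln[0].isspace()]
    def_lines = sum(1 for ln in unindented if ln.startswith("def "))
    return {
        "starts_with_import": first_word == "import",
        "unindented_lines": len(unindented),
        "unindented_def_lines": def_lines,
    }
```

A file whose first word is 'import' and whose unindented lines include a good share of 'def' lines is probably Python, but only probably.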

But what of encrypted files? How can you tell if a file is encrypted? Technically - assuming that the file is not re-encoded in some structured way after encryption - you can't. At least, you can't if the encryption is any good.

Strongly-encrypted data is indistinguishable from a stream of strictly random bytes, since to be strongly encrypted, the data must contain no discernible patterns which might be used to infer its original form.

Written English, for example, contains the letters ETAOIN very much more frequently - and predictably so - than VKXJQZ. Similarly, in English, Q is very much more often followed by U than by any other letter.
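You can measure this sort of skew yourself. The following sketch counts single letters and the letters that follow a chosen letter; the helper names `letter_counts` and `followers` are hypothetical, and it only considers the unaccented letters A to Z.

```python
from collections import Counter


def letter_counts(text: str) -> Counter:
    """Count how often each letter A-Z occurs, ignoring case."""
    return Counter(ch for ch in text.upper() if "A" <= ch <= "Z")


def followers(text: str, letter: str) -> Counter:
    """Count which letters come directly after the given letter."""
    upper = text.upper()
    target = letter.upper()
    return Counter(
        nxt
        for cur, nxt in zip(upper, upper[1:])
        if cur == target and "A" <= nxt <= "Z"
    )
```

Fed a reasonably long piece of English, `letter_counts(text).most_common(6)` should lean heavily towards letters such as E and T, and `followers(text, "q")` should be dominated by U.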

When encrypting data, it is vital that this, or any other sort of frequency skew, is removed. This leaves you with a file in which every possible byte value is equally likely at each byte offset in the file, and in which any byte is equally likely to be followed by any other. To an external observer, such a data stream appears random, even though you can easily reconstruct the original file using the decryption key.
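The "every byte value equally likely" property can be checked roughly with two standard statistics: Shannon entropy per byte and a chi-squared statistic against a uniform distribution. This is a sketch with function names of our own choosing. Note the limits: a low score shows that data is *not* random-looking, but a high score proves nothing, since well-compressed data scores highly too, and these functions do not examine which bytes follow which.

```python
import math
from collections import Counter


def shannon_entropy(data: bytes) -> float:
    """Bits of entropy per byte: 0.0 for one repeated value, 8.0 for uniform."""
    if not data:
        return 0.0
    total = len(data)
    return -sum(
        (n / total) * math.log2(n / total) for n in Counter(data).values()
    )


def chi_squared(data: bytes) -> float:
    """Chi-squared statistic of the byte counts against a flat distribution.

    Small values mean the counts are close to uniform; very large values
    mean some byte values are heavily over- or under-represented.
    """
    if not data:
        return 0.0
    counts = Counter(data)
    expected = len(data) / 256
    return sum((counts.get(b, 0) - expected) ** 2 / expected for b in range(256))
```

Plain English text typically comes out well under 8 bits per byte, while output from a good cipher should sit very close to 8.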

The difference between random and *really* random can be subtle, and may go unnoticed for years. The stream cipher RC4, for example, produces output which is nearly, but not quite, random. One particular flaw in RC4 means that the second byte of any RC4 cipher stream has the value zero twice as often as it should. Weaknesses in RC4's early output bytes were exploited in the cracking of WEP, once considered suitable for WiFi security.
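The second-byte bias is easy to observe for yourself. The sketch below is a plain textbook RC4, written for this experiment only and not for real use, run over many keys to count how often the second output byte is zero. The 16-byte key length, the seeded pseudo-random key source and the function names are all assumptions made for the sake of a repeatable demonstration.

```python
import random


def rc4_keystream(key: bytes, n: int) -> bytes:
    """Return the first n bytes of the RC4 keystream for the given key."""
    # Key scheduling: shuffle the 256-entry state under control of the key.
    s = list(range(256))
    j = 0
    for i in range(256):
        j = (j + s[i] + key[i % len(key)]) & 0xFF
        s[i], s[j] = s[j], s[i]
    # Output generation: keep shuffling and emit one byte per step.
    out = bytearray()
    i = j = 0
    for _ in range(n):
        i = (i + 1) & 0xFF
        j = (j + s[i]) & 0xFF
        s[i], s[j] = s[j], s[i]
        out.append(s[(s[i] + s[j]) & 0xFF])
    return bytes(out)


def second_byte_zero_rate(trials: int, seed: int = 1) -> float:
    """Fraction of keys whose second keystream byte is zero."""
    rng = random.Random(seed)  # seeded so that runs are repeatable
    zeros = 0
    for _ in range(trials):
        key = bytes(rng.getrandbits(8) for _ in range(16))
        if rc4_keystream(key, 2)[1] == 0:
            zeros += 1
    return zeros / trials
```

For a truly random stream you would expect a rate of about 1/256, or roughly 0.0039. With a few tens of thousands of trials, `second_byte_zero_rate` should instead land near 2/256, roughly 0.0078.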

This raises the question: how can you tell, after encrypting a file, whether it really is encrypted? How can you be sure your encryption software is working correctly?

And the answer is: you can't. (You may be able to prove that it *isn't* working properly. But absence of proof isn't proof of absence.) You really do need to trust your vendor!




About the author

Paul Ducklin is a passionate security proselytiser. (That's like an evangelist, but more so!) He lives and breathes computer security, and would be happy for you to do so, too. Paul won the inaugural AusCERT Director's Award for Individual Excellence in Computer Security in 2009. Follow him on Twitter: @duckblog