You can keep the hash as a placeholder for the original input, and use it later to validate that the original input data hasn't changed.
The size of the hash (usually measured in bits) needs to be large enough that it's vanishingly unlikely for you to get a match by chance, and the hash function needs to scramble up the input unpredictably enough that it's impossible for you to rustle up your own input data to get the same hash as someone else's file.
This is necessary for collision resistance. It must be computationally infeasible to find two files that share a hash. Clearly, if you can manufacture a collision, then the hash is not a reliable digital fingerprint.
For several decades, hash functions have usually used the Merkle-Damgård Construction. This, in turn, relies on a compression function. That's a bit-scrambling algorithm that takes two inputs and produces one output that is the same size as only one of the inputs.
In the well-known hash function MD5, for instance, each lap of the compression function takes 128 bits (16 bytes) of internal state information and 512 bits (64 bytes) of the file you want to hash. It munges and compresses these 640 bits of input into a 128-bit output, which becomes the new internal hash state. At the end, the final state becomes the 128-bit hash.
In recent years, however, some clouds have begun to cast a shadow over the Merkle-Damgård system.
Perhaps the biggest problem is that Merkle-Damgård is merely a way to build a known-good compression function into a known-good hash function. But it turns out that creating a known-good, collision-resistant compression function is surprisingly tricky. MD5, mentioned above, was found, after many years of apparently safe use, to have a real problem with collisions.
They published a seminal paper to make their point, aptly entitled MD5 considered harmful today.
The problems with MD5 and Merkle-Damgård understandably caused concerns over the long-term safety of the SHA-1 and SHA-2 hash families. They are today's recommended algorithms, and are in very widepsread use. But SHA-1 and SHA-2 were built in the same way as MD5.
What to do?
The US National Institute of Standards and Technology (NIST) decided to be proactive. NIST announced, in late 2007, a public competition to deliver Secure Hash Algorithm 3 (SHA-3). Here was a chance for anyone, without fear or favour, to advance the state of cryptography, and perhaps to leave the Merkle-Damgård construction behind forever.
By operating publicly, NIST ensured that the world's best cryptographers would enjoy a politically unencumbered chance to prove their smarts, and that the world's best cryptanalysts would take them on. It's tough for any submitter to get away with an accidental weakness, and even harder to sneak in a deliberate backdoor, if the entire world is watching. Especially if they're watching competitively!
Five years, 64 entries and three rounds of cryptographic cook-off later, and we finally have a winner: Keccak.
(Five years sounds like a long time, but in cryptanalytical terms it isn't. Nowhere is the saying "act in haste, repent at leisure" more apposite than in the field of cryptography. Remember that MD5 held out under years of scrutiny before crumbling.)
Keccak (which I am led to believe is pronounced "ketchak", like "ketchup") uses what is known as a Sponge Construction. It's called a sponge because it soaks up input at one rate, and then squeezes out its output at another.
Remember how MD5 mixes 512 bites of input into 128 bytes of internal state at each iteration? Keccak is quite different.
Keccack mixes 576 bits of input into an internal state of 1600 bits at every iteration, and then permutes - mixes up - all 1600 bits before soaking up the next 576 bits. At the end, 512 bits of the 1600 are squeezed out as the final hash.
Part of the cryptographic strength of a sponge is the large number of "spare" internal bits that aren't directly affected by the input bits each time the sponge sucks in new data. This is known as its capacity - and in Keccak, it's deliberately and conservatively set to twice the number of bits that you want to squeeze out at the end to form the hash. (1600 state bits minus 576 input bits leaves 1024 bits of capacity - exactly double the 512 bits of the hash.)
Heady stuff! By the way, get used to it as soon as you can. NIST didn't go to all this trouble to engender SHA-3 in order to keep on recommending SHA-2 for ever.
Well done, Keccak, and congratulations to Guido Bertoni, Joan Daemen, Michaël Peeters, and Gilles Van Assche, its inventors.