NCD – interesting file similarity metric

One of the important areas of malware research is the ability to estimate similarity between files. Every month we receive thousands of files and use different methods to sort files into groups and extract best candidates to be targeted for detection using our Behavioral Genotype technology.

Today, I had a brief look at another method for finding similarities between two files. The method defines a metric called NCD (Normalized Compression Distance) and uses file compression to compare files.

NCD attempts to estimate how much one file can be compressed given that we already have the information of the second file. The formula for NCD is:

NCD = (cl(f1+f2) – min(cl(f1),cl(f2)))/max(cl(f1),cl(f2))

where cl is the length of a compressed file.

Let us take, for example, notepad.exe and compare it with itself. The algorithm executes three compression cycles. First, it compresses two files separately and saves the length of new resulting files, in our case, comparing notepad with itself, new lengths should be identical. Next, it concatenates two files and compresses that new file using the same compression method and saves the length of the new file.

Any good compression algorithm will take in account the fact that there are lot of repetitions in the resulting concatenated file and the compression will be more efficient. The compressed file size of two concatenated similar files will be only slightly bigger than the size of the single file. If two files are different, the compression of concatenated files will not be so efficient and the length of the concatenated compressed file will be closer to the sum of lengths of individual compressed files.

For very similar files we expect NCD to approach 0 and for different files 1 (it will always be in that interval). Initial results looked encouraging until I compared two already compressed files. Compressed files account for over 70% of files in our malware collection. It is well known that only very limited size improvements can be achieved by compressing a file multiple times. Therefore, two compressed or packed files cannot be compared using this method.

Experimentally, for two early variants of W32/Blaster worm (W32/Blaster-A and W32/Blaster-B) we get NCD of 0.91 and 0.12 for compressed and uncompressed samples respectively which unfortunatelly, confirms the previous conclusion that the method does not work for previously compressed files.

For more information about NCD please see the paper “Clustering by compression”.