Quality versus Quantity

A certain blockbuster movie would have us believe that, at the ancient battle of Thermopylae, 300 Spartans managed to hold off over 1 million Persians. Not quite the whole story, but it made for a good evening’s entertainment.

Meanwhile, how can 50 virus researchers deal with, say, 15 million suspicious files?

There is no doubt that one of the challenges in the battle against malware is one of sheer quantity. Here at Sophos our systems currently receive a new file about every 2 seconds, and the growth continues exponentially. So how do we meet that challenge? Should we respond to quantity with quantity, maybe employing 500 or even 5000 researchers? Or is there another way?

Let’s explore the different ways in which we might work, maybe give a secret or two away, and get a bit of (largely personal) historical perspective on the life of a Sophos virus researcher while we are at it. Figures in the following should not be taken as definitive, but they are drawn from experience of various ways of working at Sophos.

  • A virus researcher might retrospectively sort out two or three cases of existing infection per day.

Sometimes such work is necessary. There will always be some targeted attacks of new malware designed to evade existing detection. There will always be new customers only discovering an infection when they switch to Sophos. However, this is the most time-consuming way of dealing with malware, not only for virus researchers, but also for our technical support staff and the infected customer. Sorting out cases of customer infection is the highest priority, but it is much better to be pro-active and prevent infection in the first place.

Therefore, dealing with cases of actual infection is the exception rather than the norm. Only a handful of SophosLabs researchers need to be involved in such work each day.

  • A researcher might write simple, signature-based detection for twenty “obvious” malware samples per day, or about five more subtle samples.

Customer samples are only a tiny percentage of the files we receive. We also analyse many files from other sources, and provide detection not only for potentially malicious files but also for PUAs (such as adware) and controlled applications. Again, this is important work, and working on individual files is sometimes necessary, but it has some drawbacks. Whilst analysis of a new malware technique can be exciting, it only happens occasionally. Most samples are run-of-the-mill stuff, and researchers processing these one after another soon die of boredom.

  • An automated system may process thousands of “run of the mill” samples per day.

Sophos certainly has systems like this, and they do a lot of important work for us, not least saving researchers from the above-mentioned boredom. However, basic systems can still only make retrospective judgments on files that have already been seen, and malware is getting better at disguising its behaviour and confusing such systems. Updating a system to deal with new trends and challenges can be a full-time task in itself, and the decision logic within such systems is far from trivial.

For example, if you see the same code section 1000 times but with different appended data in each sample, is that an ircbot worm which appends different random data each time it replicates? Or is it legitimate self-extracting archive code, with appended data which sometimes might be an archived piece of malware, and sometimes might be a legitimate application? In the former case, detection of the exact file with appended data is pointless, because the same sample will never be seen twice in the wild. Detection should be based on sections common to all the samples. However, in the latter case, detection on the code of a legitimate installer would be disastrous!

Therefore, such automated systems tend to be very conservative in the detection they produce.
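To make that dilemma concrete, here is a minimal sketch in Python of the sort of conservative decision rule such a system might apply. It is purely illustrative and is not how Sophos systems actually work: the `Sample` fields, the action names and the `min_cluster` threshold are all assumptions made up for this example.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Sample:
    code_hash: str     # hash of the shared code section (hypothetical field)
    overlay_hash: str  # hash of the appended data (hypothetical field)
    verdict: str       # "malicious", "clean" or "unknown"


def suggest_actions(samples, min_cluster=100):
    """Group samples by code section and suggest a cautious action per group."""
    clusters = defaultdict(list)
    for sample in samples:
        clusters[sample.code_hash].append(sample)

    actions = {}
    for code_hash, members in clusters.items():
        verdicts = {m.verdict for m in members}
        overlays = {m.overlay_hash for m in members}
        if len(members) < min_cluster or len(overlays) < 2:
            # Too few samples, or no variation in appended data: no pattern yet.
            actions[code_hash] = "insufficient-data"
        elif "clean" in verdicts:
            # The shared code also carries legitimate payloads, as an
            # installer or self-extracting archive stub would.
            actions[code_hash] = "do-not-detect-on-code"
        elif verdicts == {"malicious"}:
            # Same code, varying appended data, never seen clean:
            # detecting on the common code section looks reasonable.
            actions[code_hash] = "detect-on-code"
        else:
            # Unknown verdicts remain: leave the decision to a human.
            actions[code_hash] = "refer-to-researcher"
    return actions
```

Note that a single clean sample is enough to veto detection on the shared code, and any remaining uncertainty is passed to a researcher. That bias towards doing nothing is exactly the conservatism described above.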

  • A researcher relying on their own insights might provide generic detection for about 100 samples per day.

A human researcher can make much better contextual judgements about what material to detect on. A good piece of generic detection can detect 10s, 100s, or sometimes even 1000s, of variants of a malware family in one go. However, if researchers just rely on their own insight to spot patterns and trends then, as the volume of malware and the number of researchers both rise, a law of diminishing returns applies. Three years ago, when I was still quite new at Sophos, each researcher could still see a large percentage of what was coming in. We still aimed to hand analyse most samples in those days, and during analysis a researcher might sometimes think “hey, I’ve seen that technique twice already this week.” After a bit of command line wizardry, plus a good deal of hand-written code, a new -Gen or -Fam detection would be produced.

The teams in SophosLabs are still quite closely knit. We still prefer to be a small bunch of expert researchers instead of a large group of “sample processing monkeys”, but a lot has changed in recent years. A critical point was reached where researchers could not be expected to see everything and spot all the patterns. Nor could the increasing amount of generic work be coordinated if it relied purely on individual initiative and creativity. Therefore creative energies were put not only into new detection technologies, such as our Behavioural Genotypes, but also into database and system work to automatically spot patterns for us and provide essential statistical data.

  • A researcher aided by automated systems can provide pro-active detection for several hundred or even a few thousand samples per day.

This is by far the most powerful way of working. Let systems do some automated analysis and data mining, and suggest material for detection, but let experienced human researchers make the final judgments about context and classification. For example, it is easy for a system to spot that a certain string is present in several thousand malicious files, but it often needs a human to judge the meaning of that string. Furthermore, the position of that string might vary widely, and a virus engine which searched every file for it would be prohibitively slow. A researcher can make queries to discover other distinctive properties about the files in question. They may discover something unusual about the file structure, or maybe a rare instruction sequence appearing quite early in the code during emulation. The researcher can then put together a strategy for detection that eliminates 99.9% of files very quickly, and only occasionally performs that expensive string search.
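The idea of ruling out most files cheaply before running the expensive search can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the Sophos engine or its Virus Description Language: the function names, the byte sequence and the marker string are invented for the example, and a plain look at the first few kilobytes of the file stands in for "early in the code during emulation".

```python
from typing import Callable, Iterable

Check = Callable[[bytes], bool]


def starts_with_mz(data: bytes) -> bool:
    """Cheap structural check: DOS/Windows executables begin with 'MZ'."""
    return data[:2] == b"MZ"


def has_early_sequence(sequence: bytes, window: int = 4096) -> Check:
    """Build a cheap check for a byte sequence near the start of the file."""
    def check(data: bytes) -> bool:
        return sequence in data[:window]
    return check


def staged_detect(data: bytes, cheap_checks: Iterable[Check], marker: bytes) -> bool:
    """Run the cheap checks first; search the whole file only if all pass."""
    # all() stops at the first failing check, so most files exit here.
    if not all(check(data) for check in cheap_checks):
        return False
    # Expensive step: the marker may sit anywhere in the file.
    return marker in data


# Hypothetical usage with made-up bytes.
checks = [starts_with_mz, has_early_sequence(b"\x90\x90\xeb\xfe")]
suspect = b"MZ" + b"\x00" * 10 + b"\x90\x90\xeb\xfe" + b"\x00" * 5000 + b"EVIL-MARKER"
print(staged_detect(suspect, checks, b"EVIL-MARKER"))
```

The ordering is the design choice that matters: the checks that are cheapest and reject the most files go first, so the full-file search runs only on the few files that survive them.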

Each piece of data can also be fed back into the system in such a way that it keeps a close watch on such properties, flagging up potential variations when malware authors change things in an attempt to evade detection.

Of course not every detection is for several thousand files. There are lots of small malware families, where we just see a few samples in total, but the important thing is to spot patterns where there are patterns and to make our response as efficient and generic as possible. Not all malware authors leave telltale strings in their files; many use polymorphic encryption techniques. Spotting such obfuscation techniques, or classifying new packers, is a large part of our work. Furthermore, systems have to be very flexible to respond to such challenges. We can extract all sorts of information from files, but the “genes” that feed our behavioural genotype detection are still low-level, often hand-crafted pieces of code in Sophos’ “Virus Description Language”. This has itself evolved into a programming language in its own right, and is still evolving. Likewise, our systems for much of the above are in constant development. There are some exciting things in the pipeline.

Historians reckon the real figures at the battle of Thermopylae were 7,000 Greeks versus 200,000 Persians. After a few days the Greeks were defeated, but it was still an impressive stand. Moreover, it bought critical time for the Greek navy to prepare for the subsequent battle of Salamis, reckoned to be a turning point in the war. Meanwhile, work in SophosLabs is a team effort, fighting on the front line where necessary, but also researching and developing new technology to handle the ever increasing hordes of malware samples. Neither do we accept defeat on the front line: any customer infections will be dealt with. Even more effective is the pro-active prevention of infection in the first place. It makes for challenging and creative work, but we prefer to meet quantity with quality.