In my last article, I discussed anti-virus tests, particularly certification schemes. Today I'll focus on comparatives and group tests.
This is a much murkier area of the testing world, with certifications tending to be limited to well-known, usually well-respected expert testers.
On the other hand, it sometimes seems like anyone with a computer and more than one brain cell feels qualified to do comparative testing.
There are a lot of pitfalls to look out for, which often trip up unwary would-be testers, and regularly lead to wonky data and biased, inaccurate and occasionally completely off-the-wall conclusions.
1 - Cleanup tests
The majority of comparatives focus on one or more of four main areas - detection/protection tests, false positive tests, speed/performance tests, and cleanup/removal tests.
Of these, cleanup tests are perhaps the most technically demanding, as they require a complete understanding of what the malware being used does when it infects a system, and how serious and lasting the changes it makes are.
As well as the skill to break this down and measure it reliably and accurately, a bit of judgement is required to rate the importance of the changes, as some products will leave behind what might be considered innocuous traces, while others might actually cause damage by ‘reverting’ changes to inappropriate states.
Given the amount of knowledge required, and the work involved, it’s no big surprise that this sort of test is one of the least commonly performed, especially by amateurs.
2 - Speed tests
Speed testing seems a little easier, as you don’t really need to know much about malware.
Everyone’s favourite complaint about anti-virus is that it slows your system down (a close second being, of course, that it doesn’t always protect against 100% of bad things – and never will), so there is always lots of interest in seeing just how much overhead various products impose.
There are some pretty complex pitfalls here too though, with the choice of what is measured as likely to trip up the unwary as the way measurements are taken. The best tests tend to combine a number of metrics representing realistic uses of a computer, and measure them multiple times for accuracy. The worst take a single, rather random factor and trumpet it as an indicator of overall performance.
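As a rough illustration of the "multiple metrics, multiple runs" principle, here is a minimal sketch in Python. The workload names and bodies are purely hypothetical stand-ins for realistic tasks (file copying, application launches and so on); the point is taking repeated measurements and reporting the median, which resists outliers such as first-run caching effects.

```python
import statistics
import time

def measure(task, runs=10):
    """Time a workload several times and summarise the results.

    The median resists outliers (e.g. a slow first run while caches
    warm up); the standard deviation shows how noisy the measure is.
    """
    timings = []
    for _ in range(runs):
        start = time.perf_counter()
        task()
        timings.append(time.perf_counter() - start)
    return statistics.median(timings), statistics.stdev(timings)

# Hypothetical workloads standing in for realistic uses of a machine.
workloads = {
    "file_copy": lambda: sum(range(200_000)),
    "app_launch": lambda: sorted(range(100_000), reverse=True),
}

for name, task in workloads.items():
    med, spread = measure(task)
    print(f"{name}: median {med:.4f}s (stdev {spread:.4f}s)")
```

A single run of a single metric, by contrast, is exactly the "rather random factor" trap described above.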
3 - False positive tests
Speed testing may be viable for thorough and thoughtful non-experts, but there’s another caveat – a product may well be the lightest out there, and not get in the way of your super-high-speed online gaming experience at all, but that’s no use if it’s not providing decent security.
So it’s advisable for speed tests to be considered only in conjunction with tests of actual protection. The same goes for false positive tests, which are considered a vital companion to any protection test. A product which detects all malware may sound great, but it's no good if it also alerts on all good software or websites.
Again, false positive tests can be done reasonably well by a careful amateur, though they come with their own set of obvious and not-so-obvious pitfalls. You may not need access to lots of quality malware samples or live threat URLs, but you do need to be able to show that the clean files you're using really are clean, and that they matter to a reasonable number of users.
A false alarm on a minor component of Lawn Mowing Simulator 2008 isn’t going to cause much damage on a global scale, but detecting a core part of Windows and bricking half the world’s machines certainly will.
So, when a report claims that product A’s false alarm rate is high, look beyond the headlines and see what it actually alerted on, and whether that seems like something important enough to really matter.
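One way to look beyond raw false alarm counts is to weight each alarm by how widespread the flagged file is. This is a sketch of an assumed scoring scheme, not a standard metric; the file names and prevalence figures are hypothetical.

```python
# A false alarm on a core system file matters far more than one on an
# obscure game component. One simple (assumed) approach: weight each
# false positive by a rough 0-1 estimate of the flagged file's prevalence.

def weighted_fp_score(false_positives):
    """false_positives: list of (name, prevalence) pairs."""
    return sum(prevalence for _, prevalence in false_positives)

product_a = [("lawnsim2008.dll", 0.0001)]  # niche game component
product_b = [("kernel32.dll", 0.99)]       # core Windows file

# Both products raised exactly one false alarm, but the weighted score
# shows product B's single mistake is vastly more damaging.
print(weighted_fp_score(product_a))  # 0.0001
print(weighted_fp_score(product_b))  # 0.99
```

The headline "one false positive each" hides the fact that one of those alarms could brick machines worldwide while the other would barely be noticed.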
4 - Detection tests
Moving on to the meat and drink of most comparatives, the protection or detection test is the main part of most group tests and as such brings the most opportunities for muddy thinking and plain craziness.
The simplest form of measure is a standard on-demand scan of a bunch of files, which can be performed relatively easily on a large scale. Its results are becoming less reflective of the full potential of many modern multi-layered solutions, but it remains a useful indicator of quality, particularly in the corporate world where gateway and server solutions make good use of ‘simple’ detection technology.
Even with this there are a wealth of issues to bear in mind. The most important is the selection of appropriate samples, which just like clean samples need to be carefully checked to confirm they are what they are supposed to be, and are representative of the diversity of threats around at any given moment.
The beauty of scanning tests is that they can be performed at very large scales to ensure statistical relevance, but this is only worthwhile if the tester has the skill, knowledge and time required to properly choose what to use.
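To see what "statistical relevance" buys, consider a confidence interval on a measured detection rate. The sketch below uses the standard Wilson score interval; the sample sizes and the 95% detection figure are hypothetical, chosen only to show how much tighter the interval becomes at large scale.

```python
import math

def wilson_interval(detected, total, z=1.96):
    """95% Wilson score interval for a detection rate: how much
    confidence a given sample size actually provides."""
    p = detected / total
    denom = 1 + z**2 / total
    centre = (p + z**2 / (2 * total)) / denom
    half = (z / denom) * math.sqrt(
        p * (1 - p) / total + z**2 / (4 * total**2)
    )
    return centre - half, centre + half

# The same 95% detection rate, measured at two (hypothetical) scales.
for total in (100, 100_000):
    lo, hi = wilson_interval(int(total * 0.95), total)
    print(f"n={total}: 95% detected, CI {lo:.3f}-{hi:.3f}")
```

At a hundred samples the interval spans several percentage points, so small apparent differences between products are essentially noise; at a hundred thousand samples it narrows dramatically. This is why scale only helps if the samples themselves are well chosen.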
5 - Protection tests
The holy grail of protection tests is a fully ‘holistic’ test which subjects solutions to a completely realistic real-world attack, exercising all available layers of protection.
At first glance this seems fairly easy; people seem to manage to get themselves infected pretty regularly without even meaning to, so that in itself clearly doesn’t need much expertise. To do it regularly though, and to ensure all products being compared are exposed to the same threat in the same way, and to properly measure just how each product responds to each threat, is extremely difficult.
As it’s a one-by-one kind of job rather than a bulk thing, it also takes much longer to perform, so sample sets tend to be massively smaller. Even the most experienced and well-funded expert labs, having developed advanced automation techniques, rarely cover more than a few hundred test cases in tests spanning several months.
This makes the choice of samples that much more important too, as ensuring a representative sampling at these smaller scales is much more difficult.
6 - Methodology
All good comparatives should be accompanied by some form of methodology providing full details of each part of the test.
With any test component, the switched-on reader should look closely at the small print, and find out what sort of measures were taken, how they were taken, and how the data gathered was interpreted.
A lack of detail on how a test was performed is a strong hint that the test may not have been thoroughly thought through; if a test description sounds suspect on close analysis, it’s likely that the results themselves are similarly shaky.
Tips for testers
If I’m making this all sound complicated, that’s the idea – it is a deeply complex business.
I’ve only really skimmed the surface here; we’ve not even looked at the issues around product selection, drawing appropriate conclusions from raw data, and much more besides.
A few times here I’ve used the word ‘amateur’ to refer to testers from outside the rather enclosed world of the anti-malware specialty; I really don’t mean to insult anyone by this.
If you are one of the many part-time, occasional comparers of security products, or are considering doing your own comparison tests, I hope I’ve given you some food for thought.
If you want to take it further, have a look at some of the documents AMTSO has published, which go into all of these issues, and many more, in considerably more depth.
If there’s anything you don’t get, try asking someone who may have been there – most testers are fairly friendly and willing to share ideas. My email address is on the VB website, and other organisations should have easy-to-find contact points.
Why you should care
For the consumer of comparative tests who is looking for information to help with purchasing decisions, it really comes down to one of two options.
The first is the easy way - just believe the headline figures or rankings any random reviewer puts out, blindly and unquestioningly, and trust to luck.
The alternative is to invest some time and effort into understanding the hows and whys of your chosen test, at every level you can. Dig up the test methodology, consider its implications, and if necessary go further and read the relevant AMTSO documents if something seems fishy. Compare multiple comparatives too, as all good ones have something different to offer.
I'd say the second path should give better results.
It's in everyone’s interest to encourage educated and thoughtful consumption of test data, and informed but insistent questioning of anything which doesn’t smell right.
Switched-on readers make for switched-on, accurate and useful tests; blindly trusting catchy or controversial conclusions without caring if they’re based on sound work only encourages sloppy and thoughtless testing, which will continue to mislead.
There are several highly professional expert labs (the people at AV-Comparatives recently pointed me at a useful list of some of them), whose output should be reasonably reliable.
But don't take my word for it - find out for yourself.