Last week saw the first Workshop on Anti-Malware Testing Research (WATeR), held in Montréal, Canada. The conference brought together security and testing experts from industry and academia to discuss testing-related matters.
Among the papers presented were several looking at the sort of things current tests of anti-malware solutions reveal, and some things they do not.
Several of the papers updated topics that were previously discussed at Virus Bulletin conferences and elsewhere.
One in-depth talk analysed just how many samples, or “test cases”, a test needs to include to provide a statistically significant picture of performance against the huge numbers of new threats appearing each day (the short answer: a lot), and which aspects of sample selection may bias results.
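To give a rough feel for why the answer is “a lot”, here is a minimal sketch of the textbook sample-size calculation for estimating a detection rate to within a chosen margin of error. It is not the method presented in the talk. The function name `samples_needed` and its parameters are made up for illustration, and the formula assumes samples are drawn independently and at random from the threat population, which real test sets rarely manage.

```python
import math
from statistics import NormalDist


def samples_needed(expected_rate, margin, confidence=0.95):
    """Estimate how many samples are needed to measure a detection rate.

    Uses the normal approximation for a proportion:
        n = z^2 * p * (1 - p) / margin^2
    Assumes independent, randomly selected samples (an idealisation).
    """
    # Two-sided critical value for the chosen confidence level.
    z = NormalDist().inv_cdf(0.5 + confidence / 2)
    n = z ** 2 * expected_rate * (1 - expected_rate) / margin ** 2
    return math.ceil(n)


# Worst case (rate near 50%), within one percentage point at 95% confidence.
print(samples_needed(0.5, 0.01))

# Telling a 99% product from a 99.1% product needs a much tighter margin.
print(samples_needed(0.99, 0.001))
```

The first call returns 9604. The second shows that separating products which differ by only a tenth of a percentage point takes tens of thousands of samples, and that is before any bias in how the samples were selected is taken into account.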
There was also a description of the methods used by the École Polytechnique de Montréal (which hosted the conference) in a “field trial” of anti-malware.
The trial mirrored techniques used in clinical trials: laptops were handed out to real-world users, who were left to do what they wanted with them, and the machines were then checked periodically to see what threats they’d been hit with and what, if anything, had got past the defences installed.
One of the more thought-provoking talks came from Florida Institute of Technology professor and AMTSO president Dr Richard Ford, who asked “Do we measure resilience?”
Ford differentiated between “robustness”, defined as the ability of solutions to prevent malware from penetrating systems at all, which is covered by most anti-malware tests, and “resilience”, by which he meant the ability of protection and protected systems to recover from attacks which do manage to get through the border controls and establish a foothold on the machine.
He argued that the resilience side of things is important to end users and sysadmins, but is rarely covered in much depth in public tests.
For the most part, the leading comparative and certification tests look mainly at detection or protection metrics. We measure how many threats a product can pick up with its scanners, or how many it can block with the various other layers of filters and monitors included in most products these days.
These would all be robustness measures.
Resilience might perhaps be covered by a removal or clean-up test – seeing how well a product can deal with an infected machine. Some tests include these, but they tend to be performed separately from the “robustness” tests, as it’s hard to tell how well a product can clean something up if it doesn’t let the machine get infected in the first place.
Ideally, Ford argued, a clean-up test would be run as part of a protection test – any threats which are not blocked initially should be allowed to run to see if they are blocked or removed later on.
If a threat can disable the security product and take complete control of the machine permanently, that’s basically zero for resilience; however, if the threat can only run for a while before fresh updates allow the protection to recover and clean the infection up, that’s a little better.
Of course, most threats are about more than simply staying on the machine – it’s all about gathering up your data and sending it off to be abused by the bad guys. But how this is handled could also be considered a resilience measure.
If a machine gets infected with a keylogger, which is not initially spotted, some products might then detect it when it starts trying to read your bank account login details, or when it tries to send that information out to the internet.
In the case of the CryptoLocker threat currently grabbing the headlines, it might be that the malware is allowed to run, but blocked when it starts trying to make changes to files you’ve marked out as sensitive.
An analogy might be that robbers manage to break into a bank, but a security guard manages to pin them in the staff canteen until reinforcements arrive.
How well a product copes in these kinds of situations might well be very important, but it’s rather tricky to measure.
It means first of all getting systems infected with malware, which in turn means finding items which defeat the “robustness” layer, then leaving them infected, ideally with realistic everyday actions going on, until such time as the product under test either does something about them or gives up the ghost.
That’s pretty labour-intensive work, and tricky to automate. There’s also a need for caution, as running a machine infected with unknown malware risks creating unnecessary dangers to the outside world – the machine could start spewing out spam for example. So the tester needs to ensure the risks are kept as tightly controlled as possible.
Even if you do manage to do all that, there’s then a further issue of rating the relative successes of different products.
Resilience is highly dependent on the setting – in some situations, it might be fine for a system to go down completely as long as it bounces back quickly, while in others it’s OK for the recovery to take a long time if the initial outage is only minor.
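To illustrate how much the rating depends on the setting, here is a small sketch of one possible scoring scheme. It is purely hypothetical and not something proposed at the workshop: the `Incident` fields, the `resilience_score` function, the weights and the 72-hour cap are all assumptions chosen for the example.

```python
from dataclasses import dataclass


@dataclass
class Incident:
    depth: float           # fraction of functionality lost, 0.0 to 1.0
    recovery_hours: float  # time until the system is back to normal


def resilience_score(incident, depth_weight, duration_weight, max_hours=72.0):
    """Return a score from 0.0 (no resilience) to 1.0 (no impact).

    The weights express how much a given setting cares about the depth
    of an outage versus how long recovery takes. Recovery times beyond
    max_hours (including "never") count as the worst case.
    """
    duration = min(incident.recovery_hours, max_hours) / max_hours
    penalty = depth_weight * incident.depth + duration_weight * duration
    return 1.0 - penalty / (depth_weight + duration_weight)


# Product A: the system goes down completely but bounces back in an hour.
product_a = Incident(depth=1.0, recovery_hours=1.0)
# Product B: only a minor outage, but recovery takes two days.
product_b = Incident(depth=0.1, recovery_hours=48.0)

# Setting 1: a deep outage is tolerable as long as it is brief.
print(resilience_score(product_a, 1, 4), resilience_score(product_b, 1, 4))
# Setting 2: slow recovery is tolerable as long as the outage is minor.
print(resilience_score(product_a, 4, 1), resilience_score(product_b, 4, 1))
```

Under the first set of weights product A scores higher; under the second, product B does. The same two test results produce opposite rankings, which is exactly the difficulty a tester faces when trying to publish a single resilience rating.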
So, a tough proposition for us testers to work on, but one that could have some useful outcomes. Testing should show where products are less than perfect; if the world requires resilience then we need to see if products are providing it, and encourage them to do so if not.
The meeting was rounded off by two sessions. The first was a talk suggesting that in certain circumstances, and with the proper caution, it might be considered appropriate to create new malware for testing purposes; this generated the expected controversy. The second was a panel debating what areas might be ripe for deeper analysis by academic researchers.
The panel concluded that there is room for much more active collaboration between industry and academia, with the resulting cross-pollination of ideas and resources leading to good things for both sides, and indeed the world at large.
On the evidence so far, I’d be inclined to agree. Events like WATeR can shift our thinking in all kinds of interesting new directions.