Just about every security company publishes some sort of prevalence data - those little bar charts and top tens showing the most important and widespread threats spotted in the last few days, weeks or months.
These lists are simple to absorb and make for easy, eye-catching PR stories: 90% of new malware comes from Greenland, 75% of spam is sent by the Vatican, etc etc.
At a more technical level, the raw data these summaries are based on can be a great resource for malware researchers, testers, security admins and academics studying the malware ecosystem.
Gathering, compiling and interpreting the data is a rather involved and difficult process though, with plenty of opportunities for poor methodology choices to produce inaccurate or misleading results.
This week, I've been attending the AMTSO meetings in Bratislava, Slovakia, where prevalence issues have been the subject of some intense debate.
Picking the brains of the assembled experts from across the anti-malware industry (not to mention the major testing organisations, specialist security media and academia) has opened my eyes to some issues I'd not previously considered.
First up, why is prevalence so important? For product developers, it's great to know what the biggest issues are. You can make sure you're putting the effort into the right areas going forward.
Looking back, you can see how well your various techniques and technologies have performed, which ones need improving and which ones can be left as they are.
For sysadmins, it's great to have a heads-up on a major new threat, to make sure your company networks are well secured and ready for the onslaught.
For testers, it allows tests to cover what really matters; there's no way a test can hope to include every possible threat, so a subset has to be chosen, and accurate prevalence data allows that subset to be more representative of the real world, making the results more accurate.
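To make the idea of a representative subset a little more concrete, here is a minimal sketch of prevalence-weighted sampling. The function name `pick_test_set`, the threat names and the counts are all made up for illustration, and it assumes every count is positive; real testers will have their own selection rules.

```python
import random


def pick_test_set(prevalence, size, seed=0):
    """Draw `size` distinct threats, favouring the more prevalent ones."""
    rng = random.Random(seed)  # fixed seed so a test run can be repeated
    pool = dict(prevalence)    # copy so the caller's data is left untouched
    chosen = []
    while pool and len(chosen) < size:
        names = list(pool)
        weights = [pool[name] for name in names]
        pick = rng.choices(names, weights=weights, k=1)[0]
        chosen.append(pick)
        del pool[pick]         # sample without replacement
    return chosen


# Hypothetical report counts, invented for this example.
seen_counts = {"ThreatA": 5000, "ThreatB": 1200, "ThreatC": 40, "ThreatD": 3}
test_set = pick_test_set(seen_counts, size=2)
```

Common threats are more likely to be picked, but rare ones still have a small chance of making the cut, which is roughly what a real-world-weighted test is after.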
It’s no surprise that so many people invest so much effort into gathering this kind of data. Most products have some kind of ‘phone home’ feature, reporting back to base when a threat is spotted. Nowadays, many products use cloud look-up systems, which record reams of data on what’s being looked up at the server side.
In the enterprise, clients report back to central management systems, which may in turn feed back to product developers.
In these ways companies get to hear a lot about what their products are detecting.
Limitations of detection data
This is the first problem – what they are detecting. Prevalence is necessarily based on what people already know about, as it’s pretty hard to measure what you can’t see. So, a lot of things will go unreported, at least until such a time as they are spotted and detection for them is implemented.
For testers this is particularly problematic, especially when trying to test protection against emerging, targeted and zero-day threats.
But, in some cases, tests can be carried out on day one, as soon as something is picked up by the tester, and the importance of emerging threats measured retrospectively later on, as more information becomes available.
In some cases, testers may even prefer to work with vanishingly rare samples, the highly targeted attacks crafted for a specific purpose, as those can be the most devastating to targeted businesses.
Prevalence data can also help pin down just where these are, if only by their absence from the record.
This limited vision can be improved by using data from a range of sources. The prevalence tables my team publish monthly at Virus Bulletin have long been compiled by merging together reports from several major firms (no easy task, given variations in the way data is recorded), and the disparity between what various people see most can be quite stark.
There is a cross-industry IEEE initiative offering a standardised format to facilitate sharing of metadata, operating alongside existing sample-sharing systems. The idea is that whenever people share samples with each other, they also share associated telemetry, info on when and where it was seen, how often, what it was classified as, and much more besides.
Once it is widely implemented, this system offers some opportunities for simplified and more accurate merging of data. This will hopefully lead to a clearer picture of the biggest threats.
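As a rough illustration of what merging might look like, the sketch below averages each threat's share of detections across vendors, so that a firm with a huge user base doesn't simply drown out the rest. This is not how any particular published table is compiled; `merge_reports`, the vendor names and the numbers are hypothetical, and it assumes the hard part (reconciling threat names between vendors) has already been done.

```python
from collections import defaultdict


def merge_reports(reports):
    """Average each threat's share of detections across all vendors.

    `reports` maps a vendor name to {threat_name: detection_count}.
    Assumes threat names have already been reconciled between vendors.
    """
    combined = defaultdict(float)
    for counts in reports.values():
        total = sum(counts.values())
        if total == 0:
            continue  # an empty report contributes nothing
        for threat, count in counts.items():
            # Convert raw counts to a share, then weight vendors equally.
            combined[threat] += count / total / len(reports)
    return sorted(combined.items(), key=lambda item: item[1], reverse=True)


# Invented figures, purely for illustration.
reports = {
    "vendor_a": {"ThreatX": 80, "ThreatY": 20},
    "vendor_b": {"ThreatY": 50, "ThreatZ": 50},
}
merged = merge_reports(reports)
```

Weighting every vendor equally is itself a methodology choice; weighting by the size of each vendor's user base would give a different, and arguably equally valid, picture.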
The second issue is defining exactly what a *threat* is.
For the most part, when looking at binary files, unique items are recorded by file hashes. But in many cases a single piece of malware will be morphed numerous times, either locally in the case of old-fashioned file-infecting viruses, or at the server side with modern polymorphic Trojans or poly-obfuscated script attacks.
So a single file hash may not be seen more than once, but it should ideally be classed as just one instance of a much bigger threat.
When looking only at one company’s data this can be avoided by simply splitting threat data by detection IDs rather than file hashes. But when trying to cross-match reports between different products, these detection IDs will rarely if ever match up, making accurate clustering very difficult.
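The difference between the two ways of counting is easy to show with a toy example. The event records, the field names (`sha256`, `detection_id`) and the detection names below are invented for illustration and don't reflect any vendor's actual telemetry format.

```python
from collections import Counter


def count_by(events, key):
    """Count telemetry events grouped by the given field."""
    return Counter(event[key] for event in events)


# Hypothetical telemetry from one product: three morphed copies of the
# same threat, each with a different hash, plus one unrelated detection.
events = [
    {"sha256": "aa11", "detection_id": "Poly.A"},
    {"sha256": "bb22", "detection_id": "Poly.A"},
    {"sha256": "cc33", "detection_id": "Poly.A"},
    {"sha256": "dd44", "detection_id": "Other.B"},
]

by_hash = count_by(events, "sha256")             # every hash appears once
by_detection = count_by(events, "detection_id")  # Poly.A appears three times
```

Counted by hash, nothing here looks prevalent at all; counted by detection ID, one threat clearly dominates. The catch, as noted above, is that the second view only works within a single product's naming scheme.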
Similar issues apply to URL-related prevalence info. A test might use a URL as the sample, rather than a file, but the prevalence of that URL is difficult to measure, not least thanks to the tendency of malicious sites to come and go, sometimes serving up malware and sometimes not.
Some attempts may be made by reporting products to match up binary samples to source URLs, but this is very difficult and resource-hungry if the file is not detected immediately on download. It is also difficult to attribute a URL to a prevalent cluster of threats, as a given URL may redirect to different places each visit, serve multiple morphed versions of the same threat, or may just as well serve completely different threats from visit to visit.
Clustering these dangers by original source vector or by final infection type would give useful pictures from different angles, but both are very tricky to do.
Actual threat danger
In anti-malware testing, when we look at false positives, it’s a good idea to consider the prevalence of clean files. This ensures we’re not penalising products for detections on rare and obscure things, which are not going to cause anyone any problems in the real world.
But we should also consider the importance of files, and how much damage detecting them could cause. If a product false alarms on a component of your favourite game, it’s a minor annoyance; if it alerts on a key system DLL and cripples your machine, it’s a serious problem.
Likewise in detection tests, it is perhaps just as important to consider the actual danger of a malware sample as how widespread it is.
An infection which runs a click fraud scam is not a welcome thing; as well as using up system resources, it opens up the system to further, more serious compromise. But in itself it doesn’t really harm the infected user much – it’s whoever is paying for the scammed Google ad clicks that’s losing out. Compare this to a banker Trojan which steals bank login info and drains your account.
So even if the first threat is seen much more often than the second, perhaps the significance of protecting against it should be considered slightly lower.
Measuring this significance is a major challenge though, and something that tends to be based on human intuition rather than nice clean science.
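One simple way to express the idea, if not to solve it, is to multiply how often a threat is seen by a severity weight. The sketch below does just that; the counts and weights are entirely made up, and choosing the weights is exactly the human-intuition part that remains unscientific.

```python
def significance(seen, severity):
    """Combine prevalence and danger into one score (a naive product)."""
    return seen * severity


# (times seen, severity weight): both columns are invented for illustration.
threats = {
    "click_fraud": (9000, 1),
    "banker_trojan": (400, 50),
}

ranked = sorted(
    threats,
    key=lambda name: significance(*threats[name]),
    reverse=True,
)
```

With these particular weights the rarer banker Trojan outranks the far more common click fraud infection, but a different set of weights could just as easily reverse the order.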
A brighter tomorrow
All in all, it’s clear that prevalence and telemetry data is vital stuff, hard to gather and handle, tricky to properly interpret, but bursting with promise.
New viewpoints on the same issue can bring up entirely new problems, and also new approaches to doing things better. That’s partly why expert groups like AMTSO exist: to facilitate this pooling of ideas, experience and knowledge.
The pooling of data from multiple sources is clearly the best way to produce the broadest, deepest and most accurate prevalence information.
I hope that cross-industry, cross-sector collaboration can overcome these problems to produce reliable, usable insights into just what’s going on out there.