Google has expanded its Transparency Report data to include stats from its ‘Safe Browsing’ system, which keeps tabs on where malware and phishing sites are hosted.
The data is a little short on definition, but it does give some interesting insights into which hosting providers are doing the worst job of keeping their IP space clean.
The twice-yearly Transparency Report has traditionally covered more politically-sensitive topics – which countries are blocking access to Google services, and who’s been asking Google to provide data on their users (or “product”), or to take down stuff that might be found offensive for some reason, or in breach of copyright.
Some of this stuff is interesting in itself, not least when it very nearly names-and-shames dodgy political and judicial figures trying to abuse their authority and silence their critics.
There’s also quite a big question mark hanging over just how “transparent” it all is, in the light of the whole PRISM brouhaha.
For the most part it seems fairly detailed and fine-grained though, or at least gives the impression of trying to be, as far as “the man” will let them, with some of the data even provided as spreadsheets for proper looking at by proper science-y types.
The new data is based on the Safe Browsing programme, which combines scanning by Google and reports from the wider web world to keep tabs on where the bad stuff is at; Google uses the data to flag search results, and browsers use it to warn their users about potential malware and phishing.
It’s a little less detailed; much of it consists of little graphs showing trends of malware and phishing spotted over time. Some of it is rather hard to find much value in, with data for related topics covering wildly different time periods and thus hard to compare.
Some of the graphs seem more useful, but may not be; an apparently clear, if somewhat loose, correlation between the number of malware sites and phishing sites picked up at any given time may imply a definite link between the two activities, but could also simply be showing how hard the Google scanning crew were working that week.
The one graph which does seem clear is the contrast between “attack” and “compromised” sites – i.e., sites deliberately set up to get you, versus legitimate sites that have been taken over by the bad guys. The graph shows actual attack sites on the increase recently, but still barely registering – it seems the compromised sites outnumber them massively, and always have.
Again, there is, of course, room for some sampling bias here – it’s quite possible that the attack sites are better at hiding from Google, and of course they have no legit owners or admins to spot the compromise and report it.
Some numbers are available for these graphs, but they require some mouse skills to hover over the exact spot you’re interested in.
The real detail is on the “Malware Dashboard” page though. This breaks down the sites recorded by the Safe Browsing scheme by Autonomous System (AS – basically an ISP or other large-ish body responsible for a subsection of the internet).
It provides a rather undramatic world map highlighting which geographic regions are especially malware-ridden (nowhere’s that much worse than anywhere else, it turns out), but then also breaks down the data by AS, including details of how many threats have been spotted in each.
The clear leader recently, using the default three-month view, is one called “Webair Internet Development”, a US-based ISP where Google found 43% of the sites it checked to be malicious.
Looking at a sample of the domains they host seems to confirm some old stereotypes – it seems to be remarkably popular with gambling, pharmacy and porn sites, with domain names like “top3casino”, “247-pharmacy” and “seemyass” jumping out of the list.
This impression is reversed by checking into the next two in the list though, American Access Integrated Technologies and Spain’s True Records; both are listed as hosting 40% bad sites, but both are apparently hosting a random selection of legit-sounding domains (although, of course, there seems to be a fair amount of porn in both).
Again we come back to sampling error though.
The Webair listing says 43%, but as you may have spotted, that’s 43% of sites checked. In the period covered, Google has only actually looked at 2% of the sites hosted there. So, it all comes down to how good the Safe Browsing team are at deciding which sites to check.
If they’re super hot and have pinpointed all the bad stuff in the whole AS with just a few misses, we’ve got 43% of 2%, aka 0.86% – not such bad guys after all.
On the other hand, if they’re really terrible and have foolishly started their scanning with the handful of clean sites on a seriously malware-riddled section, it could be as high as 98.86% danger.
That’s the problem with stats, really – and we’re not even considering whether the results of the Safe Browsing checks could be in error.
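The best-case and worst-case figures for Webair are easy to check with a few lines of arithmetic. The sketch below is only an illustration: the function name is made up for this example, and the only inputs are the two numbers quoted above (2% of sites checked, 43% of those found malicious). It assumes the checks themselves are accurate, which, as noted, is not a given.

```python
def malicious_share_bounds(checked_fraction, bad_fraction_of_checked):
    """Return the (lowest, highest) possible share of ALL hosted sites
    that are malicious, given how many were checked and how many of
    those checked turned out bad.

    Hypothetical helper for illustration; assumes every check is correct.
    """
    # Sites we know are bad, as a share of everything hosted on the AS.
    known_bad = checked_fraction * bad_fraction_of_checked
    # Sites nobody has looked at: anywhere from all clean to all bad.
    unchecked = 1.0 - checked_fraction
    lowest = known_bad               # every unchecked site is clean
    highest = known_bad + unchecked  # every unchecked site is bad
    return lowest, highest


low, high = malicious_share_bounds(0.02, 0.43)
print(f"best case:  {low:.2%}")   # best case:  0.86%
print(f"worst case: {high:.2%}")  # worst case: 98.86%
```

The gap between the two results is simply the 98% of sites nobody has looked at, which is why the headline 43% says so little on its own.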
Looking at the longer term, by turning the dial up to the maximum 1 year, the top five are all in the 80s and 90s, apart from number 1 which, rather intriguingly, is listed as “unknown” – they know it’s the biggest, but can’t say why.
All of this top five also list the % of the total AS scanned as “unknown”. Not much for those real science-y people to play with here, unfortunately.
So what’s the use of it all?
Well, the actual data on whether or not your site is listed is made available to site admins, which is helpful, but there’s nothing new here. The main value of this new regular report, it would seem, is to highlight potentially dodgy providers.
So, if you’re running a website and your provider comes high up in one of these lists, get in touch with them. Ask them, hey, what’s up, are you some sort of haven for crooks, or just incompetent?
If they really are dirty, you might just get them to clean up their act. If not, you’ll at least be helping keep them on their toes.
And if you’ve somehow got your mum’s flower arranging club website registered with a Russian ‘bulletproof’ provider, then maybe this should give you fair warning it’s time to move it on.