Have you ever seen really early photographs of urban areas?
They never seem to have any signs of activity: no horses; no carriages; no people on the busy streets.
The reason isn’t that photography was such an expensive novelty that the streets were shut down and tidied up to produce the best-looking pictures.
It’s because exposure times were so long that no-one was still enough to impinge on the image, so only the architectural solidity of the cityscape was recorded.
(The earliest known photo of a person is said to be the man at the bottom left of the image above, who had apparently stopped on a Paris street for a shoeshine some time in April or May of 1838.)
We’ve had a similar problem taking reliable snapshots of the internet, because doing a network scan of the whole thing has typically taken weeks or months to complete.
If there was a server dishing out web pages from IP number 198.51.100.42 when you started your scan, who’s to say it was still there when you finished?
If the network 203.0.113.0/24 seemed totally deserted when you passed by, who’s to say it didn’t suddenly burst into life once you’d moved on to the next network block?
You may have heard of Nmap, which we’re fond of at Naked Security (its creator, Fyodor, is the owner of one of our T-shirts), and if you have, you might wonder why we need yet another network mapping tool.
The answer is that Zmap doesn’t aim to compete with or replace existing, general-purpose mappers like Nmap, which is excellent for scanning subnetworks in depth.
Zmap was built expressly (pun intended) to do a shallow scan – typically of a single port or service – of the entire internet, or at least the IPv4 internet, from a single, dedicated computer, in under an hour.
If you’ve done any network scanning, that probably sounds like an outrageously unachievable goal, especially when previous internet scanning projects have needed weeks or even months to achieve a similar result.
But Zmap has, by all accounts, done so, thanks to a few new tricks.
For a start, how do you proceed quickly yet systematically?
If you go in numeric address sequence, much as a Google StreetView car has to (it can’t be in Caracas, Venezuela at 14:00 and in Brisbane, Queensland at 14:01), there will be whole subnets where you can only proceed slowly because the target network is slow.
Your probe packets – even though you are only sending one Ethernet frame per probe – will enter the target network much more slowly than you can send them out, so you will receive your replies correspondingly slowly.
The outcome might also be extremely antisocial to the network you are probing, effectively producing a DoS, or Denial of Service.
Zmap solved this problem by using what are known as cyclic multiplicative groups.
Before each scan, the software comes up with an iterative formula that visits each integer from 1 to 2^32 – 1 (all possible 32-bit numbers except zero) once each, in a pseudorandom order.
Although each successive probe follows a strict algorithmic sequence, the IP numbers bounce around throughout the IPv4 address space.
As a result, you don’t get hundreds or thousands of probes converging on a single subnet at the same time.
When your loop gets back to the first IP number you visited, you’ve completed your random journey through the internet, but without ever bunching up your traffic in any one part of it.
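To make the idea concrete, here is a minimal Python sketch of that kind of traversal, scaled down to a toy 8-bit address space so that it runs instantly. The prime 257, the helper names and the fixed starting point are our own illustrative choices, not anything lifted from Zmap’s source code; for the real 32-bit space, the Zmap paper describes using a prime slightly larger than 2^32.

```python
def prime_factors(n):
    """Return the set of prime factors of n (trial division; fine for small n)."""
    factors = set()
    d = 2
    while d * d <= n:
        while n % d == 0:
            factors.add(d)
            n //= d
        d += 1
    if n > 1:
        factors.add(n)
    return factors


def is_generator(g, p):
    """True if g generates the whole multiplicative group modulo the prime p."""
    return all(pow(g, (p - 1) // q, p) != 1 for q in prime_factors(p - 1))


def find_generator(p):
    """Smallest generator of the group; a real scanner would pick one at random."""
    return next(g for g in range(2, p) if is_generator(g, p))


def scan_order(bits, p, g, start):
    """Yield every address from 1 to 2**bits - 1 exactly once, in scrambled order.

    p must be a prime just above 2**bits, g a generator modulo p,
    and start any value from 1 to p - 1.
    """
    limit = 1 << bits
    x = start
    while True:
        if x < limit:          # values >= 2**bits are not addresses: skip them
            yield x
        x = (x * g) % p        # one multiplication takes us to the next target
        if x == start:         # back where we began: the lap is complete
            break


TOY_BITS = 8                   # assumption: a pretend 8-bit "internet"
TOY_PRIME = 257                # smallest prime above 2**8
generator = find_generator(TOY_PRIME)
order = list(scan_order(TOY_BITS, TOY_PRIME, generator, start=100))

print("first ten targets:", order[:10])
print("targets visited:", len(order))
```

Because the prime is slightly bigger than the address space, the loop occasionally lands on a number that isn’t a valid address (256 in this toy case) and simply moves on. The only thing the scanner has to remember between probes is the current value of x.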
The other trick Zmap used was to avoid worrying about maintaining what a computer scientist would call state.
Instead of keeping a giant list of the probes it’s sent, how long each has been out there, and how much longer it should wait for each one, and painstakingly updating that list with every recognised response, Zmap just lets rip through its cyclic multiplicative group.
It has one software component that spews forth the probe packets, using raw network sockets to avoid any overhead in the kernel’s TCP software stack; and another that collects and saves any replies.
With a 1 Gbit/sec outbound connection, and by bypassing the TCP stack, one computer can just about fill the pipe, successfully producing more than 1 million probe packets per second.
And with a maximum of about 3.7 billion addresses available for use out of the 2^32 theoretical IPv4 addresses, and 3600 seconds in an hour, Zmap really can crawl across the entire surface of Planet Internet in under an hour, without needing any special network drivers.
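If you’d like to check the arithmetic for yourself, here is a back-of-the-envelope calculation in Python. It assumes every probe is a minimum-sized Ethernet frame (64 bytes, plus 8 bytes of preamble and a 12-byte inter-frame gap on the wire) and takes the 3.7 billion figure above at face value; the variable names are ours.

```python
LINK_BITS_PER_SEC = 1_000_000_000      # a 1 Gbit/sec outbound connection
WIRE_BYTES_PER_PROBE = 64 + 8 + 12     # assumed: min frame + preamble + gap
ADDRESSES_TO_PROBE = 3_700_000_000     # usable IPv4 addresses, per the article
SECONDS_PER_HOUR = 3600

# How many minimum-sized frames fit down the pipe each second?
packets_per_sec = LINK_BITS_PER_SEC / (WIRE_BYTES_PER_PROBE * 8)

# How long does one full lap take at that rate?
scan_seconds = ADDRESSES_TO_PROBE / packets_per_sec

# What is the slowest rate that still finishes inside an hour?
rate_needed_for_one_hour = ADDRESSES_TO_PROBE / SECONDS_PER_HOUR

print(f"line rate:       {packets_per_sec:,.0f} probes/sec")
print(f"full lap:        {scan_seconds / 60:.1f} minutes")
print(f"needed for 1 hr: {rate_needed_for_one_hour:,.0f} probes/sec")
```

On those assumptions the link tops out at a little under 1.5 million probes per second, so a full lap takes roughly 41 minutes, and anything above about 1.03 million probes per second gets you home inside the hour.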
Thanks to the cyclic random traverse through IPv4 space, that 1 Gbit/sec of probe traffic is spread in an egalitarian fashion through the internet, and so nothing gets choked up.
Replies either come back and are explicitly logged as successes, or don’t come back and are implicitly logged as failures.
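You might wonder how the collecting component can tell a genuine reply from stray traffic if nobody is keeping a list of what was sent. One way to do it, sketched below in Python, is to derive a short token from the destination address and a per-scan secret, put it in the probe, and recompute it when a reply turns up. Treat this as an illustration of the general idea and not as a description of Zmap’s actual code: the function names, the SHA-256 HMAC and the 32-bit token size are all our assumptions.

```python
import hashlib
import hmac
import ipaddress
import os

# Assumption: one fresh random secret is generated per scan.
SCAN_KEY = os.urandom(16)


def probe_token(dest_ip, key=SCAN_KEY):
    """Derive a 32-bit token from the destination address and the scan key.

    The sender would place this in a field that the target echoes back.
    """
    packed = ipaddress.IPv4Address(dest_ip).packed
    digest = hmac.new(key, packed, hashlib.sha256).digest()
    return int.from_bytes(digest[:4], "big")


def is_reply_to_our_probe(src_ip, echoed_token, key=SCAN_KEY):
    """Recompute the token from the reply's source address and compare.

    No table of outstanding probes is consulted: the address and the key
    are all the receiver needs.
    """
    return echoed_token == probe_token(src_ip, key)


# The sender computes the token for a target and then forgets about it...
token = probe_token("198.51.100.42")

# ...and the receiver can still validate whatever comes back.
print(is_reply_to_our_probe("198.51.100.42", token))
print(is_reply_to_our_probe("198.51.100.42", (token + 1) % 2**32))
```

The sender and the receiver share nothing except the key, which is why the two components can run flat out without waiting for each other.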
The result – to me, at any rate – is quite astonishing: it just works!
As for why you would want to do this, and what you might learn, I suggest you check the Zmap paper, which has some interesting graphs measuring things like HTTPS adoption rates (see above); prevalence of buggy UPnP implementations in routers; and the number of weak cryptographic keys from Debian’s 2008 randomness bug that are still in circulation.
Of course, because of the speed at which Zmap can “make a lap” of the internet, these figures can be tracked regularly and swiftly, thus allowing their rate of change to be measured.
This was never possible before, because each “lap” typically took long enough that significant change took place while the scan was in progress, thus making its results largely useless.
Only one giant question remains.
Whatever will we do when IPv6 takes over?