Anyone who used the World Wide Web in the nineties will know that web search has come a long way. Sure, it was easy to get more search results than you knew what to do with in 1999 but it was really hard to get good ones.
What Google did better than AltaVista, HotBot, Yahoo and the others at the dawn of the millennium was to figure out which search results were the most relevant and respected.
And so it’s been ever since – search engines have become fast, simple interfaces that compete based on relevance and earn money from advertising.
Meanwhile, the methods for finding things to put in the search results have remained largely the same – you either tell the search engines your site exists or they find it by following a link on somebody else’s website.
That business model has worked extremely well but there’s one thing that it does not excel at – depth.
If you don’t declare your site’s existence and nobody links to it, it doesn’t exist – in search engine land at least.
Google’s stated aim may be to organize the world’s information and make it universally accessible and useful but it hasn’t succeeded yet. That’s not just because it’s difficult, it’s also because Google is a business and there isn’t a strong commercial imperative for it to index everything.
Estimates of how much of the web has been indexed vary wildly (I’ve seen figures of 0.04% and 76%, so we can perhaps narrow it down to somewhere between almost none and almost all) but one thing is sure: there’s enough stuff that hasn’t been indexed that it’s got its own name – the Deep Web.
It’s not out of the question to suggest that the part of the web that hasn’t been indexed is actually bigger than the part that has.
A subset of it – the part hosted on Tor Hidden Services and referred to as the Dark Web – is very interesting to those in law enforcement.
There are all manner of people, sites and services that operate over the web that would rather not appear in your Google search results.
If you’re a terrorist, paedophile, gun-runner, drug dealer, sex trafficker or serious criminal of that ilk then the shadows of the Deep Web, and particularly the Dark Web, offer a safer haven than the part occupied by, say, Naked Security or Wikipedia.
Enter Memex, brainchild of the boffins at DARPA, the US government agency that built the internet (then ARPANET).
DARPA describes Memex as a set of search tools that are better suited to government (presumably law enforcement and intelligence) use than commercial search engines.
Whereas Google and Bing are designed to be good-enough systems that work for everyone, Memex will end up powering domain-specific searches that are the very best solution for specific narrow interests (such as certain types of crime).
Today's web searches use a centralized, one-size-fits-all approach that searches the internet with the same set of tools for all queries. While that model has been wildly successful commercially, it does not work well for many government use cases.
The goal is for users to ... quickly and thoroughly organize subsets of information based on individual interests ... and to improve the ability of military, government and commercial enterprises to find and organize mission-critical publically [sic] available information on the internet.
Although Memex will eventually have a very broad range of applications, the project’s initial focus is on tackling human trafficking and slavery.
According to DARPA, human trafficking has a significant Dark Web presence in the form of forums, advertisements, job postings and hidden services (anonymous sites available via Tor).
Memex has been available to a few law enforcement agencies for about a year and has already been used with some success.
In September 2014, sex trafficker Benjamin Gaston was sentenced to a minimum of 50 years in prison having been found guilty of “Sex Trafficking, as well as Kidnapping, Criminal Sexual Act, Rape, Assault, and Sex Abuse – all in the First Degree”.
Scientific American reports that Memex was in the thick of it:
A key weapon in the prosecutor's arsenal, according to the NYDA's Office: an experimental set of internet search tools the US Department of Defense is developing to help catch and lock up human traffickers.
The magazine also reports that Memex is used by the New York County District Attorney’s Office in every case pursued by its Human Trafficking Response Unit, and it has played a role in generating at least 20 active sex trafficking investigations.
If Memex carries on like this then we’ll have to think of a new name for the Dark Web.
Image of Fractal Texture spiral Dark Web Abstract Nether licensed under Creative Commons, courtesy of TextureX on DeviantArt
21 comments on “Memex – DARPA’s search engine for the Dark Web”
Political dissenters will have to find another planet pretty soon
This system is not designed to identify anonymous users or to uncover the true location of things hidden by hidden services (the ‘hidden’ bit of hidden services is their virtual location, not their content).
It’s designed to find things that are dark because the current generation of search engines can’t or won’t find them, not because they’re encrypted.
So, using your example, a dissenter who publishes dissent using a Tor hidden service and takes steps to avoid identifying themselves in the words they write would not be identified as a specific individual and their site’s location wouldn’t be revealed but the fact that their site exists might be.
So if you’re trying to get word out, it might be a good thing.
The Dark Web is dark for the same reason that lots of the countryside is – it’s just not worth the cost of putting up street lights.
Interstate 60 comes to mind. 🙂
TBH, most of the so-called dark web is not even technically ‘WWW’, AKA http/https traffic. It’s stuff like IRC and FTP (old school!) and P2P file sharing and even, yes, anti-spying (privacy) networks. Of course, Tor and Freenet are forms of P2P but not intended primarily for sharing files.
Somebody needs to check what Memex means in bahasa Indonesia, just saying.
There is no such word in bahasa Indonesia, which has no letter x.
It’s slang – it means a woman’s genitals.
The Internet was BETTER in the late 90’s. More freedom, more choices, less control, more privacy. Free information actually existed instead of this bullshit cloaked censorship that Google has manufactured to support a corrupt government owned by industry and advertisers who support them. How stupid do you think people are? Do you think they’ll not figure it out? Get a clue.
The internet was never anonymous, and security was always an afterthought, if it was a thought at all.
Great point, just another freedom we’ve lost right before our eyes or under our noses
Great quote Mark (“The Dark Web is dark for the same reason that lots of the countryside is – it’s just not worth the cost of putting up street lights.”)
And here’s another one, which is so true in so many situations: “we can perhaps narrow it down to somewhere between almost none and almost all”
Excellent writing, as always. Nice job, Mark!
Thanks Mark for the great summary – makes a lot of sense.
Mark wrote “Meanwhile, the methods for finding things to put in the search results have remained largely the same – you either tell the search engines your site exists or they find it by following a link on somebody else’s website.”
I thought that they simply tried to open port 80, iterating over IP addresses 0.0.0.1, 0.0.0.2, …, 255.255.255.253, 255.255.255.254. That’s what they would have had to do in the first place anyway. And it’s the easiest refresh scheme for providers like Yahoo! and Google.
The industry uses name-based virtual hosting extensively, so any given IP address could harbour any number of websites, none of which will be accessible with the IP alone (and you need to supply a hostname for a valid HTTP/1.1 request anyway).
So you need to get a list of hostnames from somewhere and you can’t get them by reverse DNS on the IP.
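To make the virtual-hosting point concrete, here is a minimal sketch (hypothetical hostnames, no network traffic) of the raw HTTP/1.1 request a client sends. Two sites on the same IP address are distinguished only by the mandatory Host header, which is why iterating over bare IPs can’t enumerate them:

```python
def http11_get(host, path="/"):
    """Build a minimal HTTP/1.1 GET request as raw text.

    The Host header is mandatory in HTTP/1.1; without it, a server doing
    name-based virtual hosting cannot tell which of its sites you want.
    """
    return (
        f"GET {path} HTTP/1.1\r\n"
        f"Host: {host}\r\n"
        "Connection: close\r\n"
        "\r\n"
    )

# Same IP, two different sites - only this header tells them apart:
print(http11_get("site-one.example"))
print(http11_get("site-two.example"))
```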
The search engines use spidering – they follow links or you give them a sitemap. This works on the regular World Wide Web because people want to have their sites indexed so Google and Bing don’t have to try hard to find them.
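The link-following half of that can be sketched as a breadth-first crawl. This is a toy illustration, not Memex or Google’s actual crawler: it uses an in-memory dict as a stand-in for the web, where a real spider would fetch each URL over HTTP.

```python
from html.parser import HTMLParser

class LinkExtractor(HTMLParser):
    """Collects the href of every <a> tag seen in a page."""
    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)

# Stand-in for the live web: a tiny three-page site (hypothetical URLs).
FAKE_WEB = {
    "http://a.example/": '<a href="http://b.example/">B</a>',
    "http://b.example/": '<a href="http://a.example/">A</a> '
                         '<a href="http://c.example/">C</a>',
    "http://c.example/": 'no outbound links here',
}

def crawl(seed):
    """Breadth-first spidering: visit a page, queue every link found."""
    visited, frontier = set(), [seed]
    while frontier:
        url = frontier.pop(0)
        if url in visited or url not in FAKE_WEB:
            continue  # already seen, or outside our toy web
        visited.add(url)
        parser = LinkExtractor()
        parser.feed(FAKE_WEB[url])
        frontier.extend(parser.links)
    return visited

print(sorted(crawl("http://a.example/")))
```

Note the consequence the article describes: a page in FAKE_WEB that nothing links to, and that isn’t the seed, is never visited – in search engine land, it doesn’t exist.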
How does one log into Memex in order to use it as a search engine? Do I enter it in IE’s search engine screen?
Does Memex index websites in a manner that ignores robots.txt rules? If so, is there any liability if such indexing causes damage to a web server, accidentally or otherwise?
There are no ‘laws’ saying spiders MUST obey robots.txt files. It’s a courtesy, and a way for engines to stop themselves from being blacklisted (which you can do easily in your config files).
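Well-behaved crawlers check robots.txt voluntarily; Python even ships a parser for it in the standard library. A quick sketch (the robots.txt content and URLs are made up for illustration):

```python
from urllib.robotparser import RobotFileParser

# A site's robots.txt, inlined here; a crawler would normally fetch it
# from http://<host>/robots.txt before crawling anything else.
robots_txt = """\
User-agent: *
Disallow: /private/
"""

rp = RobotFileParser()
rp.parse(robots_txt.splitlines())

print(rp.can_fetch("MyCrawler", "http://example.org/public/page.html"))   # True
print(rp.can_fetch("MyCrawler", "http://example.org/private/page.html"))  # False
```

Nothing stops a crawler from ignoring the answer – which is exactly the commenter’s point: compliance is a convention, not an enforcement mechanism.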
The memex (a portmanteau of “memory” and “index”) is the name of the hypothetical proto-hypertext system that Vannevar Bush described in his 1945 The Atlantic Monthly article “As We May Think”.
You can downvote this all you want, but Mike is correct on the history of the term “Memex”
What is the best dark web search engine for Android systems, i.e. for things like name searches, user profiles, gamer tags or general sites the person has joined?
What exactly do we think we mean by “the internet”?