Who is archiving the web, and what happens when people ask for information to be ‘un-archived’?
The internet found out recently, when a company with a questionable marketing history reportedly asked the world’s best-known web archive to eradicate its information.
The Wayback Machine, which is run by the non-profit Internet Archive, has been quietly archiving as much of the web as it can to create a permanent record of our fast-moving, volatile digital landscape.
The archive’s preservation of online data has proven valuable on several occasions. In 2014, Ukrainian separatist leader Igor Girkin bragged on social media about downing a Ukrainian military cargo plane. After the aircraft was revealed to be Malaysia Airlines Flight 17, the post was deleted, but the Wayback Machine still had the original message.
Clearly, archiving information has its benefits. So what happens when someone doesn’t want information about them to stick around?
This issue came up recently when Thailand-based FlexiSpy reportedly asked the Internet Archive to delete its webpages from the Wayback Machine. FlexiSpy, which sells software for monitoring phones and desktop computers, used to market its software as a tool to spy on cheating spouses. As Motherboard points out, another archive still maintains images of the company’s site from several years ago.
Search the Wayback Machine’s archive for FlexiSpy, however, and it reports that the URL has been excluded. Does that mean it complied with the request?
The Internet Archive did not respond to requests for comment about its policy. However, its terms and conditions say that if asked by an author or publisher, it “may remove that portion of the Collections without notice”. Its FAQ says that site owners can “send an email request for us to review”.
Traditionally, the Archive has based its approach to exclusion requests on a policy created by UC Berkeley (archived version here). Under this policy, archivists should provide a ‘self-service’ approach that site owners can use to remove their materials using robots.txt files.
Robots.txt files are instructions left on sites for crawlers, telling them what they should not look at. Under the policy, a site owner could simply add one of these files at the top level of their site with a specific instruction for the Internet Archive, and then submit their site using a form.
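To make that concrete, here is a minimal sketch of how a crawler might interpret such a file, using Python's standard urllib.robotparser module. It assumes the user-agent token ia_archiver, which the Archive has historically documented for exclusion requests; treat that token, the example.com address and the helper name is_allowed as illustrative assumptions rather than a description of the Archive's current crawler.

```python
from urllib.robotparser import RobotFileParser

# A robots.txt that asks one specific crawler to stay away from the whole site.
# "ia_archiver" is assumed here to be the token the Internet Archive honours.
ROBOTS_TXT = """\
User-agent: ia_archiver
Disallow: /
"""


def is_allowed(robots_txt: str, user_agent: str, url: str) -> bool:
    """Return True if robots_txt permits user_agent to fetch url."""
    parser = RobotFileParser()
    # parse() takes the file's lines, so no network request is made.
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    page = "https://example.com/page.html"  # hypothetical site
    print(is_allowed(ROBOTS_TXT, "ia_archiver", page))
    print(is_allowed(ROBOTS_TXT, "SomeOtherBot", page))
```

Because the rule names a single crawler, other bots that find no matching entry remain free to fetch the page. A robots.txt file is also only a request: nothing in it technically prevents a crawler from ignoring it.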
That policy had significant implications for the Archive. In 2006, it settled with a firm called Healthcare Advocates, which was in the middle of a trademark dispute with a similarly-named company. Healthcare Advocates had added a robots.txt file to its site to stop crawlers spidering it. Under the Archive’s policy at the time, this should have triggered the site’s complete deletion from the Wayback Machine, but it didn’t.
Since then, the Archive’s policy on crawling has relaxed. In December 2016 it began ignoring robots.txt files on government sites, and then in April 2017 announced that it was “looking to do this more broadly”. However, the ability to request a deletion via email remains, as it always has done.
FlexiSpy’s request isn’t the first that the Archive has received. There have been many others, and some have resulted in legal cases. In 2007, it settled with activist Suzanne Shell, who had demanded that it take down records of her family rights site after alleging copyright infringement. The Internet Archive said at the time:
Internet Archive has no interest in including materials in the Wayback Machine of persons who do not wish to have their Web content archived.
Nevertheless, the Archive doesn’t appear ready to roll over at every request. Nor does it seem to have completely abandoned robots.txt-based removals.
MSNBC host Joy Ann Reid has recently been the subject of controversy after Wayback Machine searches unearthed homophobic comments on her blog. She has said that someone hacked the Wayback Machine, an unsubstantiated claim that the Archive denies. The interesting part is that the Archive refused an emailed request from her lawyers to delete the offending posts, due to:
Reid’s being a journalist (a very high-profile one, at that) and the journalistic nature of the blog archives.
So, the Archive won’t always comply with takedown requests made by email. Its automated robots.txt process, however, apparently still honours exclusions. The Archive’s plan to ignore robots.txt files more widely clearly hasn’t kicked in yet, because someone put a robots.txt file on Reid’s live blog, and the automated removal process played out – the blog posts are no longer visible in the Wayback Machine. Perhaps that highlights the need for a more manual process?
A broader question is: Does honouring takedown requests, manually or automatically, affect the Archive’s value?
In an age of fake news, shrinking government trustworthiness and changing official narratives, scientists have already had to rush to preserve information in the face of political change. Without a reliable archive, how can we be sure that we are fixing statements in time and holding people to account?
The Archive is a small non-profit with around $17.5m in revenues, and yet it is currently our best hope for documenting the internet’s ephemera and making it permanent. However, without substantially more funding, it will have to pick its legal battles wisely.