Let’s say a site devoted to letting people download files has a URL that contains a bunch of numbers. What happens if you go into the URL window of your browser and bump that number up by 1?
Well, in this case, you get yet another downloadable file, and so maybe you bump the number again to see if you get another file. Say you do, and you keep increasing that number by one to get even more files. Make that a lot more files, as in, 7,000, achieved with automatic scraping of the site.
And then, surprise surprise, your younger brother is arrested as he walks to school; your home is raided; your family is corralled in the living room; your sister starts to cry; and law enforcement agents dump out drawers, turn over mattresses, and seize everybody’s laptops and mobile phones (meaning that your dad can’t work).
Oh, and of course, you’re now facing a criminal charge for being a “hacker.” For downloading files from Nova Scotia’s freedom-of-information (FOI) portal.
CBC News reports that this is what happened to a 19-year-old in Halifax on 11 April.
His name hasn’t been released because he hasn’t yet been arraigned. Also, his family requested anonymity. The young man says he’s worried that a conviction could skewer his chances of getting hired. He hopes the charges will be dropped. CBC News quoted him:
I don’t know if I’ll be able to get a job if this gets on my record… I don’t know what my future will be like.
The government says he’s a hacker. There isn’t supposed to be that much freedom in the freedom-of-information portal, so it’s charging him with unauthorized computer access.
The “hacker” – or non-maliciously curious archivist, depending on how credible you find the teen vs. government prosecutors – downloaded about 7,000 freedom-of-information releases, the majority of which were already scrubbed of personal information and had been made publicly available.
About 250 of the records – around 4% – were prepared for Nova Scotians requesting their own government files. The files were un-redacted, contained highly sensitive personal information such as birth dates, addresses and social insurance numbers, and hence weren’t intended for public release.
Nor were they password-protected. They were just there for the taking for anybody who likes to save stuff. And this young man is definitely one of those online archivist types, of which there are many.
Archivists don’t always care if they’re downloading material that’s been posted publicly or that’s been stolen from locked accounts. For example, in September, we heard about redditors trying to rip every single image from Instagram. Why? Because they could.
But the Halifax man says he wasn’t that type of archivist. He thought the records were all public, he told news outlets, and he didn’t download them out of malice.
I didn’t do anything to try to hide myself. I didn’t think any of this would be wrong if it’s all public information. Since it was public, I thought it was free to just download, to save.
Does that make it OK? Twitter users so far have been pretty vocal in the teen’s defense. Likewise for privacy and security advocates who’ve talked to news outlets.
Evan D’Entremont, a software engineer, told CBC News that as more details emerge, it’s looking more and more like “this kid’s being railroaded.”
He didn’t actually do anything wrong, and the government’s looking for somebody to blame in this.
VIDEO: Nova Scotia's government is accusing a 19-year-old of breaching their government website's security ~ Privacy experts disagree.— Brett Ruskin (@Brett_CBC) April 13, 2018
Oh, and here's how the teen did it: pic.twitter.com/FQ2qXJoP89
(For technical details about the portal and what the teen did, check out this post from D’Entremont.)
Others, calling the case a “travesty,” have started crowdfunding the teen’s legal defense. He’s facing up to 10 years in prison if convicted.
At Naked Security, there’s a bit of skepticism about the archivist’s claimed ignorance about scraping private information. The thinking: he’s done this before. In the past, his archivist inclinations have led him to amass data that includes the typically quickly submerged pages of sites such as 4chan and Reddit. He knows he was using the same loophole to get the Freedom of Information files.
In this case, he says he was curious to get to the bottom of a labor dispute about teachers. He didn’t find what he was after, so he wrote a simple one-line piece of code to automatically, sequentially increment the URLs and download the files. A few hours later, he had his 7,000 records.
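To make the mechanics concrete, here is a minimal Python sketch of what "sequentially incrementing the URLs" amounts to. The host name, path, and `id` query parameter are hypothetical stand-ins, not the portal's real address, and the sketch only builds the URL strings; it does not download anything.

```python
from urllib.parse import parse_qs, urlencode, urlsplit, urlunsplit

# Hypothetical example URL; the real portal's address and parameter name are assumptions.
START_URL = "https://foi.example.org/document?id=1000"


def bump_id(url: str, step: int = 1, param: str = "id") -> str:
    """Return the same URL with its numeric `param` increased by `step`."""
    parts = urlsplit(url)
    query = parse_qs(parts.query)
    query[param] = [str(int(query[param][0]) + step)]
    return urlunsplit(parts._replace(query=urlencode(query, doseq=True)))


def sequential_urls(url: str, count: int) -> list:
    """List `count` URLs, starting at `url` and adding 1 to the ID each time."""
    return [bump_id(url, step=i) for i in range(count)]
```

Nothing here touches a login form or a password: each generated address is an ordinary URL of the kind a browser would request, which is the crux of the dispute over whether this counts as unauthorized access.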
If he’d quickly examined those files, he might have realized he was treading on other people’s privacy. Or then again, maybe not. According to what’s been reported, any single file he opened would have had roughly a 4% chance of being one of the 250 records, out of 7,000, that held private information.
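How quickly would a spot check have turned up a sensitive file? Here is a rough calculation, assuming files were opened uniformly at random without repeats. That is an assumption; nothing in the reports says how, or whether, he looked at them.

```python
from math import comb

TOTAL = 7000      # files downloaded, per the reports
SENSITIVE = 250   # files containing personal information, per the reports


def chance_of_hit(sampled: int) -> float:
    """Probability that at least one of `sampled` randomly chosen files is sensitive."""
    clean = TOTAL - SENSITIVE
    # Complement of "every sampled file came from the clean set".
    return 1 - comb(clean, sampled) / comb(TOTAL, sampled)


for n in (1, 10, 20, 50):
    print(f"{n:>3} files opened: {chance_of_hit(n):.1%}")
```

Under that assumption, one file gives 250/7,000, or about 3.6%, and by around 20 files the odds of having seen at least one sensitive record are roughly even.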
The Electronic Frontier Foundation (EFF) has called the prosecution “ginned up.” The FOI portal apparently hasn’t put up “minimal technical safeguards” to keep widely known indexing tools such as Google search and the Internet Archive from archiving all the records published on the site. The portal has since been taken down, but D’Entremont has found several requests that Google indexed and cached. From his post:
This system is literally designed for facilitating “access to information.” …There are no authentication mechanisms, no password protection, no access restrictions. It’s very clear that the software is intended to serve as a public repository of documents.
The case is being compared to that of Aaron Swartz, an American who downloaded millions of journal articles from JSTOR over MIT’s network and whose prosecution was widely seen as overreach.
Readers, what’s your take on who’s to blame: the teen or the government?
Should the young man have put a bit more effort into ensuring he wasn’t asking for things he shouldn’t have asked for? Should the government be blamed for not redacting, or password-protecting, records published on a portal designed to let the public get at them? Is this the same as arguing that leaving your window open doesn’t make it OK for somebody to reach in and snatch your TV? Or is it different? Everybody knows you’re not supposed to walk into somebody’s private residence, even if the door’s unlocked. Is it criminal to download files that are supposed to be public?
The calendar pages are quickly flipping toward 25 May: the date when the European Union’s General Data Protection Regulation (GDPR) privacy law goes into effect. It’s leading companies to put quite a bit of effort into being careful about what kind of data they ask for, what they take and what they keep.
Should we all be held to that standard? Or should we expect that a portal made to provide access to public files is only going to provide files meant to be public?
32 comments on “Is scraping files from a Freedom of Information website ‘hacking’?”
It’s a “freedom of information” portal. The files are not password protected. One can assume that the content of those files is for public consumption. That’s how the Internet tends to work: pages with no access restriction are for anyone to read.
Downloading the files for archiving (or for reading later without searching for them again) is not uncommon. The number of files downloaded (7,000) is a bit odd. But odd ain’t illegal yet…
However, the Nova Scotia government is at fault for failing to properly secure the content. To access a webpage you have several possibilities:
– search it with Duckduckgo or Google or Bing or [insert your choice]
– click on a link someone sent you (be careful with those)
– type the address in your browser and land on highly sensitive but unprotected content because you made a typo!
This is true, and yet at the same time, if you’re in possession of something you shouldn’t have, then there is a problem of some kind, and no, you don’t get to keep it.
Whether the provincial government blundered or not (seems they did) by publishing stuff that wasn’t supposed to be accessible this easily doesn’t excuse rooting around for it “because it’s there”. The fact that a simple script with a numeric increment could leech the data files, and that no serious hacking was required, isn’t really an excuse.
It’s a bit like seeing a huge pile of documents left on a colleague’s desk after everyone else has gone home – a pile that you might reasonably assume would contain at least some documents that shouldn’t have been left out but ought to have been returned to the filing cabinet.
You know you oughtn’t to peek – in fact, you should probably pop them back in the filing cabinet yourself, or ask security to do so – but you can’t resist it, so you say to yourself, “Hey, I’ll just snap a photo of each and every page with my mobile phone, and then reassemble the pile and leave it undisturbed. Then I can riffle through my copies at my leisure later on, and see if there are any nuggets of naughtiness in there that I could, errrrr, well, that I could keep along with all the other documents I’ve impertinently and sneakily copied over the last few years. And if anyone finds out that I looked through files that a more scrupulous person would have decided weren’t actually for peeking at, hey I’ll blame my colleague for leaving the documents out in the first place, and insist that I was ‘making a backup’ in case the documents got lost overnight.”
This youngster didn’t exactly stumble upon this stuff, or retrieve it in an expected or usual fashion.
OK, so what he did wasn’t serious hacking and perhaps he’ll be exonerated or given just a slap on the wrist – but he purposefully went after a large tranche of data, deliberately working in a “let me rummage around systematically and see what I can dig out” fashion. So for all that he might not be much of a cybercrook, he’s not exactly a wide-eyed innocent, wouldn’t you say?
Paul, I agree with you on lots of stuff (and you’re smarter than I am anyway), but any coworker has a reasonable expectation of privacy on their own desk, and the Freedom of Information site shouldn’t.
I vote a nearer “meatspace” analogy is visiting a library to find a specific microfilm article–let’s say my great-grandparents’ wedding announcement/photo in the local paper. Once I’ve printed that, not much (besides closing time and boredom) prevents my rifling through the remaining articles on that particular film, the other films in the drawer, then all the other drawers.
Overly nosy? Yep. Unnecessary? Yep. Overstepping the intended spirit of my visit? Sure.
But if I unearth nuclear launch codes (or even just my neighbor’s SSN), it’s not because _I_ made the egregious decision** to store it where John Q. Public could find it. The librarian or board of directors should be more judicious with what they store unattended.
Too much time on my hands is (so far) not even a crime in old Styx songs.
** and certainly not because anyone’s ported my buddy grep to meatspace yet.
erg. 9,000 prior comments also invoke a library comparison.
Ain’t the kid’s fault that the gov did the digital equivalent of putting some of their furniture on the side of the road and expecting no one to think that they are giving it away. Most of the time on the internet you just gotta assume that if it’s there and isn’t asking you for a password, then it’s OK for you to look at it. If we didn’t do that we wouldn’t be able to browse anywhere without sending out at least 5 emails a minute asking web admins if it’s OK to browse their site.
If this is really the entire story then yes, the teen did nothing wrong. There needs to be more pressure put on web site creators and governments to secure their data. If something isn’t secured then it would be assumed to be free for taking. Someone from the government should be spending 10 years in prison instead.
Wait, Google indexed and cached some of these documents? Why isn’t Google being charged with hacking, then?
How come the guy who finds the problem is in trouble and the guy tasked with making the information secure hasn’t even been brought forward to answer for himself?
How come the Private gets in more trouble for losing his rifle than the General does for losing a war?
“Is this the same as arguing that leaving your window open doesn’t make it OK for somebody to reach in and snatch your TV? Or is it different?”
Actually this doesn’t need a house analogy. It is the same as publicly hosting documents containing personal information free for anyone to access without authentication. Nova Scotia failed to protect the privacy of said individuals by PUBLICLY HOSTING THE FILES WITH NO PROTECTION.
A house analogy is completely false equivalence. It’s more like a public library or a shop – if the door opens when you approach it, you assume it’s fair to walk in. And when there’s self-checkout available and working, you assume it’s fair to use it.
If they didn’t want him taking it, they should have put up a password or account check. Now if he bypassed one of those, even a poorly designed one, they would have a case.
Actually, the house analogy makes sense. The kid didn’t access the sensitive documents by following a hyperlink; he “wiggled through the window” to get “the goods”.
Nevertheless, the main responsibility of protecting the data lies with the entity entrusted with it.
The key bit about “hacking” is that it means you are gaining *UNAUTHORIZED* access to a system. In order for that to be true, there must be a mechanism on the system to challenge a user to authenticate before serving them data. Bypassing that challenge and getting to the data anyway would be hacking. But this site didn’t have an authentication mechanism. No authentication = open to everyone. On top of that, the site’s stated purpose was to provide access to publicly available information. All this kid did was look at some unindexed files. It’s like going into the public library, skipping the card catalog and going directly to the shelves, going through all the books, and then being accused of trespass and burglary because you were only SUPPOSED to look at the books that were in the catalog, not just browse any book you liked.
It’s not the kid’s fault that the government carelessly intermingled private data in with the public stuff.
I don’t know the details of Canadian data protection law, but if FOI means the same there as it means in the UK, then personal records should never have been made available under FOI. That would be a Subject Access Request, not a Freedom of Information Request. FOI is for making public bodies disclose information that is in the public interest. SAR applies to both public and private bodies, and is meant to make them disclose person-identifiable details to the data subject. I cannot use FOI to find out what information the bank holds about me, but I could use SAR. Someone has made a major privacy mistake if FOI is being used to disclose personal information.
The comparison with theft of physical goods is not irrelevant. If someone “steals” (copies and uses) your identity then its value and utility to you is pretty much destroyed, and you will need to renew or replace it. The act of “theft” is not inconsequential.
I think it’s comparable to putting illegal books in a public library. Books with blank covers and library bar code. And then prosecute the person who borrows one of those books.
Even if he had malicious intent, which seems doubtful, he accessed publicly available, completely unsecured information. This wasn’t even a database unintentionally exposed, it was a database created for the public to access. Guess what, the public accessed it. One of them in an unexpected way, but there was nothing to indicate he was doing anything wrong, given the purpose and design of the site.
This is as if the government stored books full of sensitive information on the shelves of a public library and thought leaving them out of the card catalog was sufficient security.
Their claim is akin to calling looking at and pulling books directly off the shelf, “hacking”. It’s not “unauthorized access” if there are no authorization steps.
The UK used not to mark the Royal Ordnance Factory on the maps even though it was a huge site and had a sign proclaiming it at the gate. This was so the Russians would not know where it was (in the days before Google Earth, Maps and Streetview). This was their approach to security – pretend it’s not there.
Say a library puts documents (books) on the public shelves and doesn’t bother (or hasn’t gotten around to) indexing them.
Would it have the right to get upset if a patron systematically walked down the shelves reading the documents/books without bothering to try the indexes first?
If it was the stacks area, which is private and accessed with permissions (passwords), then they might have a case, but not in public (unpassworded) areas.
“Is this the same as arguing that leaving your window open doesn’t make it OK for somebody to reach in and snatch your TV? Or is it different?”
I reckon it’s more like standing in the street and watching the pay TV the TV owner paid for. The TV owner still has his TV. He might be pissed that the guy in street is saving himself the money for the pay TV but then again he could have just drawn the curtains.
In here the window is the authentication. Certainly the house is not for the general public but authorized public (in other words only known people) will have access. An ideal example is the library.
Even if the server had been password-protected for authorized persons, would it be called hacking if they pulled all the information? It should not be, because the intention of the server was to allow the general public to pull all information. Further, the government should have had a mechanism to monitor and block mass access to information if it did not intend to allow it.
Anyone who uses ID references in a URL without encoding them in some way is (in my view) making them open. If you go to a URL that is example.com/companyreports/1Q2018.pdf you would expect to be able to amend it to 1Q2017.pdf and get the document. Equally if it was /document_216.pdf you would legitimately expect to be able to change 216 to 215 or 217 to find the previous or subsequent document. It is like going to a book in the library or a bookshop and looking at the books either side of it. If there is no security, and references are not encoded, then there should be an expectation that you are entitled to look at others in the sequence, as there is absolutely no indication that these are not public. It is assumed that anything on a public web site is public unless there is clear and obvious security.
It is ironic that a Freedom of Information site with no security is claiming that people are hacking it by changing a number. Surely it should be the person who placed the information into a public space without protection or authorisation who is in the wrong, not the person who accessed it.
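The "encoding" the commenter has in mind can be sketched as follows: instead of exposing a counter, a site hands out a long random token per document, so that neighbouring documents cannot be guessed from one URL. This is an illustrative sketch with hypothetical names (`publish`, `lookup`, `DOCUMENTS`), not a description of how the Nova Scotia portal worked, and unguessable URLs are no substitute for authentication on records that are meant to be private.

```python
import secrets
from typing import Dict, Optional

# Hypothetical in-memory store mapping public tokens to internal document IDs.
DOCUMENTS: Dict[str, int] = {}


def publish(document_id: int) -> str:
    """Register a document and return a URL path containing a random token."""
    token = secrets.token_urlsafe(16)  # 16 random bytes rendered as URL-safe text
    DOCUMENTS[token] = document_id
    return f"/document/{token}"


def lookup(path: str) -> Optional[int]:
    """Resolve a path back to its document ID, or None if the token is unknown."""
    token = path.rsplit("/", 1)[-1]
    return DOCUMENTS.get(token)
```

With this scheme, knowing the path for document 216 says nothing about the path for 217, and a request for `/document/217` simply finds nothing.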
This poor kid is getting railroaded big time. The Canadian government has egg all over its face and instead of acting like responsible adults, they’re blaming someone else for their mistake. The kid should be suing the government for harassment, pain & suffering, and the cost of fighting this battle.
A smacked wrist for the lad who was pushing his luck, and prosecution for the gov who used a public system for personal data and provided no security.
If all it takes to be labeled as a criminal is sending an HTTP GET request, we’re all in for a world of hurt.
That’s silly, the contents of that GET request can vary wildly, now can’t they?
Sounds like the kid unknowingly embarrassed the government for their stupidity and now they will make him pay for it.
If we did a Save as MHT, would that do the same thing? I used to do that to some sites to read them offline back in the slow modem days (while sleeping) Sometimes I found it downloaded tons more than expected.
I can’t speak for MHT but web crawlers, or tools like wget, rely on reaching pages by following links from other pages rather than guesswork (there are a lot of potential URLs out there).
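The distinction drawn in that reply can be shown in a few lines: a link-following crawler only learns about URLs that some page actually points to. The sketch below uses only Python's standard library and a made-up HTML snippet to collect the links on a page. A document that no page links to would never appear in its output, whereas the incrementing approach finds it by guessing.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class LinkCollector(HTMLParser):
    """Collect absolute URLs from the href attributes of <a> tags."""

    def __init__(self, base_url: str) -> None:
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    # Resolve relative links against the page's own address.
                    self.links.append(urljoin(self.base_url, value))


def extract_links(html: str, base_url: str) -> list:
    """Return every link found in `html`, in document order."""
    parser = LinkCollector(base_url)
    parser.feed(html)
    return parser.links


# Made-up page: it links to documents 215 and 217 but not 216.
PAGE = '<a href="/document_215.pdf">A</a> <a href="/document_217.pdf">B</a>'
print(extract_links(PAGE, "https://example.com/index.html"))
```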
The government agency should be put up for review; it’s a clear case of stupidity on their part to assume that no one would notice a numerical GET variable in the address bar and try to increment it to index, archive, or see what else it would yield. Give the kid a thanks and send him on his way, and hire new staff for the FOI.