On 25 May 2018, the EU’s General Data Protection Regulation (GDPR) became enforceable.
Mind you, the law itself had actually been in place for more than two years. The game changer: as of May, people could now demand that organizations hand over the data they hold on them – via subject access requests (SARs) – for free.
…which is how technology policy researcher Michael Veale, of University College London, wound up banging on the door of Facebook’s data warehouse.
As The Register reports, Veale submitted an SAR to the platform on 25 May, asking for whatever data it had collected on his browsing behavior and activities away from Facebook.
Facebook’s response: to slam the door in his face. Sorry, it told Veale: it’s too tough to find your information in our ginormous data warehouse.
That’s not going to fly, Veale has argued, given that the information Facebook picks up can be used to suss out highly personal information about somebody, including their religion, medical history or sexuality… and that goes for both Facebook users and non-Facebookers alike.
In particular, we’re talking about data scooped up by Facebook Pixel: a tiny but powerful snippet of code embedded on many third-party sites that Facebook has lauded as a clever way to serve targeted ads to people, including non-members.
Veale is taking the matter up with the Irish Data Protection Commissioner (DPC), given that Facebook’s European headquarters are in Ireland.
The Irish DPC has launched an inquiry into the matter, telling Veale that the case will likely be referred to the European Data Protection Board, given that it involves cross-border processing.
Veale shared his complaint with The Register. In it, he seeks to find out whether Facebook holds web history on him that pertains to medical domains and sexuality, areas in which Facebook is known to do highly targeted marketing. As he told The Register:
Both of these concerns have been triggered and exacerbated by the way in which the Facebook platform targets adverts in highly granular ways, and I wish to understand fair processing.
Veale says that he’s used the tools Facebook offers the public to find out what it knows about us. Such tools include Download Your Information and Ads Preferences, for example. But whichever specific tools Veale availed himself of proved “insufficient,” he said.
As Mark Zuckerberg repeatedly said over the course of two days of testimony in front of the US Congress in April, and as Facebook reiterated yet again in a “Hard Questions” blog post in the aftermath of that question-fest, Facebook uses data collected – even when users aren’t on Facebook – in order to improve safety and security, and to improve its own and its partners’ products and services.
But unlike Google, which offers a tool to see what it knows about us, Facebook earlier this year revealed to activist Paul Olivier Dehaye that it can’t share users’ data with them.
We’re all stuck in the Hive
As Facebook said in an emailed response that Dehaye shared with the UK House of Commons digital committee, he had asked for data regarding what ads he saw as a result of advertisers’ use of Facebook’s Custom Audiences product. He also asked what data Facebook got on him via Facebook Pixel on third-party sites: data that’s not available through its self-service tools because it’s tucked away in a Hive data warehouse.
The Hive data is kept separate from the relational databases that power the Facebook site, Facebook told him, and is primarily organized by hour, in log format. That warehouse is vast, and it’s stuffed with people’s personal data, but it’s way too hard to get at it, Facebook said, and if everybody lines up to ask for their data, we’ll blow a gasket.
The data isn’t indexed by user, Facebook explained. In order to extract one user’s data from Hive, each partition would need to be searched for all possible dates in order to find any entries relating to a particular user’s ID.
From the company’s response to Dehaye:
Facebook simply does not have the infrastructure capacity to store log data in Hive in a form that is indexed by user in the way that it can for production data used for the main Facebook site.
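The trade-off Facebook describes can be made concrete with a toy sketch in Python. This is not Facebook's schema and it does not use Hive's API: the partition keys, user IDs and URLs below are invented for illustration. It only contrasts a full scan of hour-partitioned logs (every row must be read to answer one person's request) with a secondary index keyed by user, which answers the same request directly but costs extra storage to build and keep.

```python
from collections import defaultdict

# Hypothetical hour-partitioned log store: partition key -> rows of (user_id, url).
# All values are made up for this sketch.
PARTITIONS = {
    "2018-05-25T09": [("u1", "news.example/a"), ("u2", "shop.example/b")],
    "2018-05-25T10": [("u2", "health.example/c")],
    "2018-05-25T11": [("u1", "health.example/d"), ("u3", "news.example/a")],
}


def scan_for_user(partitions, user_id):
    """Full scan: read every row in every partition and keep the matches.

    The work grows with the total size of the logs, not with the
    amount of data held on the one user being looked up.
    """
    hits = []
    rows_read = 0
    for hour in sorted(partitions):
        for row_user, url in partitions[hour]:
            rows_read += 1
            if row_user == user_id:
                hits.append((hour, url))
    return hits, rows_read


def build_user_index(partitions):
    """One pass over the logs that builds user_id -> [(hour, url), ...]."""
    index = defaultdict(list)
    for hour in sorted(partitions):
        for row_user, url in partitions[hour]:
            index[row_user].append((hour, url))
    return index


hits, rows_read = scan_for_user(PARTITIONS, "u1")
index = build_user_index(PARTITIONS)
print(hits, rows_read)   # two matching rows found after reading all five
print(index["u1"])       # same two rows, fetched with a single lookup
```

In this sketch the index holds a second copy of every row, which is the storage cost Facebook's response points to. Whether that cost justifies refusing an access request is the question Veale's complaint puts to the regulator.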
As Dehaye points out, Facebook’s claims mean that as its user base grows, its data protection obligation “effectively decreases, as a result of deliberate architecture choices.”
Likewise, Veale isn’t buying Facebook’s argument. He pointed out that those who research Big Data have already clearly established that even if such data isn’t stored alongside a user ID, web browsing histories can be linked to individuals using only publicly available data. Toss machine learning into the mix, and even more patterns begin to emerge, he told The Register, including information on sexuality, purchasing habits, health information or political leanings:
Web browsing history is staggeringly sensitive.
Any balancing test, such as legitimate interests, must recognize that this data is among the most intrusive data that can be collected on individuals in the 21st century.
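Veale's point about linkage can be illustrated with a small Python sketch. It assumes a simplified setting: an "anonymous" browsing history is compared against sets of links that named accounts have shared publicly, and the closest match is taken as the likely owner. The profiles, URLs and the choice of Jaccard similarity are all assumptions made for this example, not the method of any particular study or platform.

```python
def jaccard(a, b):
    """Set overlap: size of the intersection divided by size of the union."""
    a, b = set(a), set(b)
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def best_match(anonymous_history, public_profiles):
    """Score every public profile against the history and return the top one."""
    scored = {
        name: jaccard(anonymous_history, links)
        for name, links in public_profiles.items()
    }
    return max(scored, key=scored.get), scored


# Invented data: a history with no user ID attached...
anonymous_history = {"a.example/1", "b.example/2", "c.example/3"}

# ...and links that hypothetical named accounts have posted in public.
public_profiles = {
    "alice": {"a.example/1", "b.example/2", "c.example/3", "d.example/4"},
    "bob": {"a.example/1", "x.example/9"},
    "carol": {"y.example/7"},
}

name, scored = best_match(anonymous_history, public_profiles)
print(name, scored)
```

Here the history overlaps most with the links "alice" has posted, so it is attributed to her even though no user ID was ever stored with it. Real histories are far larger and noisier, so a real attack would need more care than this.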
He told The Register that he wants to debunk the notion that it’s beyond the technical wherewithal of Facebook – or of any other online platform, for that matter – to handle requests like his:
I hope to refute emerging arguments that the data processing operations of big platforms relating to tracking are too big or complex to regulate.
By choosing to give user-friendly information (like ad interests) instead of the raw tracking data, it has the effect of disguising some of its creepiest practices. It’s also hard to tell how well ad or tracker blockers work without this kind of data.
13 comments on “Facebook: It’s too tough to find personal data in our huge warehouse”
Yes, Facebook has very cleverly constructed its data mountain in a way that seems as though it’s unsearchable, but if that were true, why collect the data in the first place or keep it so carefully? This sort of lie should be challenged, and I wish Michael Veale every success in his Herculean pursuit.
So let’s imagine Facebook have a shop in the high street. Mr Zuckerberg is behind the counter with an assistant. A passer-by stops to look in the window of the shop next door. Mr Zuckerberg steps out of his shop and starts making a speech to no-one in particular as to how his shop will zealously guard your personal information if you buy in his store. While the shopper is distracted, the pickpocket assistant whips out his wallet, memorises all the personal and financial details, then pops it back. The shopper walks away, and Mr Zuckerberg and his assistant file the taken information in their index card system, ready for sale to anyone with the necessary cash. Couldn’t happen, could it?
Good for you Michael Veale. I can’t wait to read more about this stuff as GDPR continues to get tested and enforced. Someone needs to hold these giant corporations responsible for what they do with the massive amounts of personal data that they collect (mostly without our permission).
Won’t they just be fined, by a lot, if they don’t comply? If they collect user data, does it make a difference if they just say “we can’t find it”?
Well, let’s try to think objectively for a moment. Data schema are designed for efficiency and effectiveness. Regulations are designed for reporting and compliance. These are not quite mutually exclusive goals, but nearly so. Thus, if a government says to re-architect your entire global infrastructure so that it complies with our demands, but we’re not going to fund or offer you any resources (or enough time — granted, GDPR has been “coming” for a couple of years, but this would be a massive, multi-year project.) Who pays for compliance? The company’s shareholders. What’s to keep a government from making increasingly expensive demands while shorting a company’s stock? If you don’t like it, don’t use it. That’s simple enough. What about data for non-users? Well, I don’t subscribe to any credit bureaus, but they sure know a lot about me. If the data were obtained legally, then saying it’s not legal to keep it is a bit of a logical stretch. I’d be interested in seeing audited metrics that show cost of compliance for a single user’s data in the “mountain” of information — what if it really costs $10,000 of time and effort? Why should stockholders pay for a disgruntled customer? And how about a lifetime limit of one? Otherwise, one person can continually press the GDPR button and try to bankrupt a company. Not asking you to agree or disagree; just think about additional aspects and accept that as Scott McNealy from Sun Microsystems famously said in 1999, “You have zero privacy anyway. Get over it.”
Gene M wrote “Well, let’s try to think objectively for a moment. Data schema are designed for efficiency and effectiveness. Regulations are designed for reporting and compliance. These are not quite mutually exclusive goals, but nearly so. Thus, if a government says to re-architect your entire global infrastructure so that it complies with our demands, but we’re not going to fund or offer you any resources (or enough time — granted, GDPR has been “coming” for a couple of years, but this would be a massive, multi-year project.) Who pays for compliance? The company’s shareholders. What’s to keep a government from making increasingly expensive demands while shorting a company’s stock?”
You could say the same thing about Sarbanes-Oxley.
If you don’t like following regulations, don’t collect user data. That’s simple enough. If it was more expensive to follow the regulations, than to not collect so much data and sell it, they wouldn’t do it. So, let them pay, let them bleed for every last cent until they finally treat our personal data as more than just a cheap currency, or until they stop collecting it.
If they can’t search the “Hive” for my data because it isn’t indexed in any way associated with my user ID, then how do they provide targeted ads based on my browsing activity, application usage or search history?
What would interest me, as someone who would not touch a Facebook account with a disinfected bargepole, is how I can find out what information, if any, Facebook has amassed about me, which of course would be without my permission.
If Facebook cannot comply with the law, owing to technical difficulties, the solution is simple, Facebook: stop this activity until you can. On a personal note I would like to see both Facebook and Twitter removed from the Internet. They do far more harm to society than they do good.
The former is simply an ego ride for Zuckerberg.
If you are not a fakebook user, how do you apply for details of what data they hold on you without giving them personal information about you?
If you make multiple applications (say because you have multiple devices / IP addresses / email addresses) what is the best way to ask without giving them the keys to the kingdom by allowing them to knit together your multiple internet identities?
“Facebook Pixel: a tiny but powerful snippet of code embedded on many third-party sites that Facebook has lauded as a clever way to serve targeted ads to people, including non-members.”
Yes, that’s how they sell it. The real benefit from (reason for) all their data aggregation is the user profiles they can sell.
So if they are selling “user profiles” it has to key on some form of “user ID” – so what is it that goes with the request for the pixel that forms the basis for knowing against which profile (individual or collective) to file the pixel request? (IP address or Browser Fingerprint?) If we know that, a user access request with the same details should trigger the same profile.
Or is that too simplistic?