Published personal data on 70,000 OkCupid users taken down after DMCA order

Last week, without users’ permission, Danish researchers publicly released data scraped from 70,000 OkCupid profiles, including their usernames, age, gender, location, what kind of relationship (or sex) they’re interested in, personality traits, and answers to thousands of profiling questions used by the site.

On Friday, the Open Science Framework (OSF) removed the data following a Digital Millennium Copyright Act (DMCA) complaint from OkCupid.

The researchers, Emil Kirkegaard and Julius Daugbjerg Bjerrekær, got the user data with a scraper – an automated tool for saving information from a website – between November 2014 and March 2015.

They did so without asking either the users or OkCupid if it was OK. By dumping the data online, they broke what many consider to be the cardinal rule of social science research ethics: never take personally identifiable information (PII) without permission.

Many observers pointed out that users could easily be re-identified from the dataset.

But when asked whether they planned to anonymize the data, Kirkegaard said no, arguing that it was already public.

The data was, technically, semi-public. You had to be logged in, and to have answered the same questions, to see somebody’s answers. Their answers could be scraped, but only if they’d chosen to answer a given question publicly rather than privately.

Kirkegaard told Retraction Watch that the researchers were surprised at the outcry:

We did not anticipate any strong reaction, no. We wanted to contribute a nice open dataset to science, we did not want to be famous for it.

Brian Nosek, the executive director of the Center for Open Science, which maintains the OSF, told Retraction Watch that the center had first heard on Wednesday that the file might contain users’ identifying information, and had initiated an investigation.

By Wednesday evening, the Center for Open Science had learned enough to ask that the dataset be removed or made private. That’s when the researchers made it password-protected, though Motherboard found that the open version was still accessible after clicking through various versions on the site.

On Thursday, after a full review, the group determined that it should be removed entirely. On Friday, OkCupid sent over the DMCA takedown order.

OkCupid’s terms and conditions state that “The contents of this website are protected by copyright and may not be copied or otherwise reproduced” without written permission and that “users may not publish or create derivative works from the contents of this website for any public or commercial purposes.”

The takedown doesn’t mean the data won’t reappear. The paper had been submitted for review, not published. Only its data was published at the time of submission, as required.

The pair did the research independently, on their own time. Aarhus University, where Kirkegaard is pursuing his Master’s degree, has distanced itself from the whole mess.

Kirkegaard said that if the journal winds up not taking the paper, the researchers will potentially publish it elsewhere.

The paper itself should be fairly uncontroversial as none of the findings are new – in fact, they were explicitly chosen as calibration tests for the dataset.

Retraction Watch asked him two good questions:

1) Why did he believe he didn’t need users’ permission? He responded with a Q&A the researchers had written, with this answer to the question of whether the data are public:

This depends on the definition used, but in our opinion yes. The profile information of many users can be freely seen from Google. This includes pictures, age, gender, sexual identity and the profile text. To see users’ answers to questions, however, one must have answered the same question. This means that one must be logged in with a user that has answered that question. OkCupid itself clearly states in their terms of service that the information may be public…Furthermore, when users answer a question, they get the option to answer the question privately…Most users do not choose to answer privately. We did not and could not scrape the private answers because they are not possible to see for others.

2) Why did they publish user names? He had two reasons. One was that the names are…

…an interesting topic of research. Usernames play a crucial part in a person’s presentation and so are not randomly chosen. One can thus research what predicts choice of username. For instance, do people who include ‘hot’ in their username see themselves as more attractive? Many users use animals in their names. Are people who chose the same animal more similar than people who don’t?

The second reason: they forgot to scrape some information, including the profile text. With the username, they could come back and do it later.

It is possible that the usernames will be removed in a future version of the dataset as one may argue that the two scientific goals above do not outweigh the privacy concern from the usernames being available.