Fake text generator is so good its creators don’t want to release full version

Researchers at Elon Musk’s AI think tank OpenAI have created what amounts to a text version of a deepfake, and they are too worried about its misuse to release the full version.

Its AI writing tool generates reasonable-looking text on a wide range of subjects. According to the organization’s blog post on the topic, it is based on research into predicting the next word in a sequence of text. The tool takes a sample piece of text written by a human and then writes the rest of an article, producing dozens of sentences from a single introductory phrase.
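To make the "predict the next word, then repeat" idea concrete, here is a deliberately tiny sketch. It is not OpenAI's method: the real tool is a large neural network trained on a huge corpus, whereas this toy simply counts which word most often follows another in a short sample sentence. The function names and the sample corpus are invented for illustration.

```python
from collections import Counter, defaultdict


def build_bigram_table(text):
    """Count, for every word in the text, which words follow it and how often."""
    words = text.split()
    table = defaultdict(Counter)
    for current, following in zip(words, words[1:]):
        table[current][following] += 1
    return table


def continue_text(table, prompt, max_new_words=5):
    """Extend the prompt one word at a time with the most frequent follower."""
    words = prompt.split()
    for _ in range(max_new_words):
        candidates = table.get(words[-1])
        if not candidates:
            # The last word was never seen with a follower, so stop here.
            break
        words.append(candidates.most_common(1)[0][0])
    return " ".join(words)


# Hypothetical miniature "training set"; the real one was far larger.
corpus = "the train was stolen and the train was stolen again"
table = build_bigram_table(corpus)
print(continue_text(table, "the", max_new_words=3))  # the train was stolen
```

Even this crude version shows why the output can sound plausible without being true: the program only knows which words tend to follow which, not whether the resulting sentence describes anything real.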

The tool doesn’t discriminate between topics. Instead, it uses over 40GB of text gathered from the internet to help it produce convincing-sounding copy on anything from Miley Cyrus to astrophysics.

The problem is that while the copy sounds convincing, all the facts in it are fabricated. The tool writes names, facts and figures that it has effectively synthesized from things the system read online. It’s like an electronic version of that old school friend whose Facebook invitation you regrettably accepted and who now keeps writing bizarre posts with ‘alternative facts’. For example, it takes the following phrase…

A train carriage containing controlled nuclear materials was stolen in Cincinnati today. Its whereabouts are unknown.

…and builds an entire news story around a fictional event. It fabricates a quote from Tom Hicks, who it says is the US Energy Secretary. At the time of writing, that role is occupied by Rick Perry.

OpenAI built the training data set, consisting of eight million web pages, by scanning Reddit for links that received at least three karma (the site’s reward for popular content). The researchers were not necessarily looking for truth here, so much as interesting text that was either educational or funny.

According to OpenAI, the tool also shows a rudimentary ability to read and understand text, summarize it, answer questions about it, and translate it.

This isn’t going to replace factual reporting anytime soon (phew), but it could automate some darker things online. It’s an article spinner’s dream, and as OpenAI points out, it could easily be used to write fake Amazon reviews by the thousand.

Perhaps the most worrying use case is the production of fake news via social media and blog posts. Marry it with other forms of deepfake, such as the fake faces generated by the recently launched ThisPersonDoesNotExist (built on NVIDIA’s StyleGAN) and deepfake video and audio, and you have the makings of an automated disinformation-spewing social media machine.

OpenAI realises this. It says:

These findings, combined with earlier results on synthetic imagery, audio, and video, imply that technologies are reducing the cost of generating fake content and waging disinformation campaigns. The public at large will need to become more skeptical of text they find online, just as the “deep fakes” phenomenon calls for more skepticism about images.

No wonder the researchers decided not to release the fully-trained model. Instead, they released a much smaller version of it, along with the sampling code. They didn’t release the 40GB dataset or the code used to train the model. However, reproducing what they did is only a matter of time, they admitted:

We are aware that some researchers have the technical capacity to reproduce and open source our results. We believe our release strategy limits the initial set of organizations who may choose to do this, and gives the AI community more time to have a discussion about the implications of such systems.

That’s the problem in a world where knowledge – or the power to get it – is easily distributed. Secrets are difficult to keep. And with computing power increasingly cheap, AI’s processor-intensive training is becoming easier to reproduce.