A team of 13 analysts at the Internet Watch Foundation (IWF) has used machine learning to help figure out what secret code words online communities of perverts use to covertly talk about child sexual abuse images.
The IWF is a UK-based charity that every year removes tens of thousands of depraved images.
Sarah Smith, the technical projects officer who has overseen the IWF’s work, told Wired that the charity has been working on its database of paedophile slang for more than 10 years.
The abusers who trade this imagery have been developing a private, secret language over that time. At the dawn of the IWF’s work, over a decade ago, predators were openly sharing content through newsgroups, forums and on dedicated websites, often with clear descriptions of what the pictures depicted.
Chris Hughes, who leads the IWF’s team of 13 analysts, told Wired that back then, finding the content was as simple as a web search. You didn’t have to go to the Dark Web to find the material, given that it was easily available on the open web, he said:
It was possible to go to a search engine, type it in and get exactly what you wanted.
Up until a few weeks ago, the IWF’s database of paedophile slang contained about 450 words and phrases used to refer to abuse images. Over the last few weeks, that database has grown by 3,681 entries, with several hundred more still to be added.
Smith told Wired that the breakthrough came from the IWF’s development of an intelligent crawler that identifies new potential keywords. It works similarly to the crawlers used by search engines such as Google: the IWF’s crawler scans parts of the web for potentially abusive content, including comments left on images or videos and metadata attached to files.
It’s targeted at what the IWF already knows, scanning sites that the charity has already identified as potentially hosting child sexual abuse material.
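The article doesn’t describe the crawler’s internals, but the idea it outlines can be sketched in miniature: pull the comment text and metadata out of a page, then surface recurring phrases that aren’t already in the known-term list. Everything here is a hypothetical illustration, assuming HTML pages with `<meta>` tags and comment `<div>`s; the IWF’s actual system is far more sophisticated.

```python
import re
from collections import Counter
from html.parser import HTMLParser


class CandidateTermExtractor(HTMLParser):
    """Loosely mimics the crawler's inspection step: collect the text
    of metadata tags and of user comments (assumed here to sit in
    <div class="comment"> elements -- a made-up convention)."""

    def __init__(self):
        super().__init__()
        self.texts = []
        self._in_comment = False

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "meta" and "content" in attrs:
            self.texts.append(attrs["content"])
        if tag == "div" and attrs.get("class") == "comment":
            self._in_comment = True

    def handle_endtag(self, tag):
        if tag == "div":
            self._in_comment = False

    def handle_data(self, data):
        if self._in_comment:
            self.texts.append(data)


def candidate_terms(pages, known_terms, min_pages=2):
    """Count two-word phrases across pages; return those seen on at
    least `min_pages` distinct pages that aren't already known."""
    seen = Counter()
    for html in pages:
        parser = CandidateTermExtractor()
        parser.feed(html)
        words = re.findall(r"[a-z]+", " ".join(parser.texts).lower())
        # one count per page, however often the phrase repeats on it
        for phrase in {" ".join(pair) for pair in zip(words, words[1:])}:
            seen[phrase] += 1
    return sorted(phrase for phrase, n in seen.items()
                  if n >= min_pages and phrase not in known_terms)
```

Using the article’s deliberately innocuous example, a phrase like “purple cushions” would only be surfaced if it recurred across multiple pages, which is what prompts the follow-up searching Hughes describes.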
The IWF has a huge database of URLs that it’s taken down over two decades of working on the scourge of abusive imagery. It’s now also incorporating machine learning technologies to help identify phrases commonly used.
Words and phrases don’t get added unless they appear in multiple places and are verified by humans. Otherwise, were innocuous phrases to be added automatically, it could lead to censorship.
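That two-part gate, multiple independent sightings plus human sign-off, is simple to express in code. The sketch below is an assumption-laden illustration of the rule as the article states it (the threshold of three sites is invented), not the IWF’s actual pipeline.

```python
from dataclasses import dataclass, field


@dataclass
class KeywordPipeline:
    """A phrase only enters the database once it has been spotted on
    several distinct sites AND a human analyst has approved it.
    Automatic addition alone could sweep up innocuous phrases."""
    min_sites: int = 3  # hypothetical threshold, not from the article
    sightings: dict = field(default_factory=dict)  # phrase -> set of sites
    pending: set = field(default_factory=set)      # awaiting human review
    database: set = field(default_factory=set)     # confirmed keywords

    def record(self, phrase, site):
        """Log a sighting; queue the phrase for review once it has
        appeared on enough distinct sites."""
        self.sightings.setdefault(phrase, set()).add(site)
        if len(self.sightings[phrase]) >= self.min_sites:
            self.pending.add(phrase)

    def review(self, phrase, analyst_confirms):
        """Human verification step: only a confirmed phrase is added."""
        self.pending.discard(phrase)
        if analyst_confirms:
            self.database.add(phrase)
```

The design point is that neither signal alone is sufficient: frequency without review risks censoring everyday language, and review without frequency would drown analysts in one-off noise.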
The IWF isn’t publishing its list of keywords, for obvious reasons: it doesn’t want to show its cards to the predators. What the charity can say is that paedophiles use quotidian language, or made-up words, to refer to various types of abuse.
Some of them are almost alien. They don’t necessarily make a nice tidy word or phrase. They could be a collection of characters that don’t make an actual word.
He said it could be something as simple as a phrase like “purple cushions.” That’s not an actual example from the newly enlarged database. It’s just something that Hughes could see in front of him during his interview with Wired. “Purple cushions” is an illustration of how simple words can be used to indicate particular content, where that content can be found, the victim’s name, or a specific set of images.
Here’s Hughes again:
If you were to read something like that on a forum, where every other conversation is perhaps less covert, then we would take that phrase, do some additional searching on different sites and see if it produces results that give us an indication that ‘purple cushions’ is a phrase that people are using openly.
Abusers sometimes combine keywords with other words to impart meaning. Sometimes, they use several at one time to refer to certain images or behaviors, and sometimes they’re used in a particular combination.
Most of the newly expanded list of keywords are in English, but there are also terms in Dutch and German. In 2018, when the IWF removed 105,000 sites that were hosting abuse imagery, it found that some of the terms had been translated from Spanish. To further obscure meaning, some of the keywords were acronyms taken from one language, such as Spanish, and then used within English-language text, Hughes said.
Given the many layers of obfuscation, and given the danger of censoring everyday language that doesn’t have darker meaning, discerning context is crucial in this work, Smith said:
We have to try and follow the offender mindset and look at how they might be going about finding this content and try to disrupt that and cut those routes off.
The IWF expects that eventually, its members will implement the expanded keyword list. It has more than 140 members, including Apple, Amazon, Google, Microsoft and Facebook, as well as Zoom, law enforcement groups and mobile phone operators.
Implementation will take some time, but the hope is that it will eventually uncover and eradicate more child sexual abuse imagery than the existing technique does: matching uploads against a database of hashed images to block content that has already been identified.
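The existing hash-matching technique mentioned above works, in essence, by fingerprinting each known image and rejecting uploads whose fingerprint is already on the list. The sketch below uses an exact SHA-256 hash for simplicity; real systems in this space (such as Microsoft’s PhotoDNA) use perceptual hashes that survive resizing and re-encoding, which an exact hash does not.

```python
import hashlib


def file_fingerprint(data: bytes) -> str:
    """Exact cryptographic hash of the file bytes. Note: a single
    changed byte yields a different hash, which is why production
    systems prefer perceptual hashing."""
    return hashlib.sha256(data).hexdigest()


def should_block(upload: bytes, blocklist: set) -> bool:
    """Reject an upload whose fingerprint matches previously
    identified abusive material."""
    return file_fingerprint(upload) in blocklist
```

This is exactly why the keyword list matters: hash matching can only stop content that has been seen and catalogued before, whereas slang terms can lead analysts to imagery that has never been hashed.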
By having a greater understanding of these slang terms that are associated with these images, we can find websites and locate images that we haven’t seen before. The significant amount of keywords we have now identified will make it very much harder for them to be able to use those to identify and locate this type of content.