A gargantuan all-seeing eye is watching you on popular websites

Next time you visit a popular website imagine that your arrival on the site coincides with the arrival of a film crew at whatever home, office, coffee shop or bus stop you happen to be occupying.

A cameraman erects his tripod and rests the camera’s lens just above your shoulder with such haste that the page you’re visiting hasn’t even finished loading before a lens is locked on to it, greedily inhaling everything that happens on your screen.

Your head and hands are out of shot but every mouse wobble, scroll, click and keystroke is recorded.

You forget he’s there as you browse around, dropping things in and out of your shopping cart but the lens sees and saves everything. You move through the checkout process and get to a page that wants your name, address and credit card details. You fill them in before having a change of heart and deciding, no, you don’t need another pair of khakis. You don’t hit ‘submit’.

As the data sits unsent on your screen the cameraman reaches across, unrolls a short length of sticky tape, slaps it over your credit card number and then films everything you decided not to share.

As he flashes you a look that tells you exactly how clever he thinks he was for covering up that credit card number, you notice that your name, address, email, CVV and credit card expiry date didn’t get any tape. Come to think of it, you can’t remember him doing that to your password when you logged in earlier either.

Of course that story isn’t true: it all happens without a cameraman.

The eye of Sauron

It happens because of JavaScript, a programming language that can be embedded in web pages and which, more than any other technology, turns the World Wide Web from a collection of documents into a collection of interactive apps.

Its very old, featureful and well-established tool bag bulges with such useful things as: the clientX and clientY properties that capture the exact location of your cursor at any moment; the onkeypress event handler that coughs up whatever keys you’ve pressed; the value property that holds the contents of input fields; as well as countless other objects, properties and events that give websites access to everything from your physical location to the amount of battery charge in your laptop.
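
To give a feel for how little code that takes, here is a minimal sketch that boils those events down to plain records. The toRecord function and the log array are hypothetical names invented for illustration, not taken from any real replay product, and the listeners are only attached when the code runs inside a browser page.

```javascript
// Hypothetical helper: reduce a DOM-style event to the details a recorder might keep.
function toRecord(event) {
  switch (event.type) {
    case "mousemove":
      // clientX/clientY give the cursor position within the viewport.
      return { type: "mousemove", x: event.clientX, y: event.clientY };
    case "keypress":
      // key identifies the key that was pressed.
      return { type: "keypress", key: event.key };
    case "input":
      // value holds the current contents of the field, submitted or not.
      return { type: "input", value: event.target.value };
    default:
      return { type: event.type };
  }
}

// Hypothetical in-memory buffer; a real script would send this somewhere.
const log = [];

// Attach listeners only when running inside a browser page.
if (typeof document !== "undefined") {
  for (const type of ["mousemove", "keypress", "input"]) {
    document.addEventListener(type, (event) => log.push(toRecord(event)), true);
  }
}
```

Nothing here waits for a form to be submitted: the input listener fires on every change to a field.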

JavaScript’s features can be woven together to make everything from live chat clients and games to cryptocurrency miners and session replay scripts that record every single thing you do on a website.

Session replay scripts act just like a silent cameraman, recording aspects of your visit like how long you spend on each page, where your mouse goes and what you type, so that the whole visit can be played back like a movie by the site’s owner.

They exist to help website owners improve their sites by observing how users engage with them.

But how many users realise that this is even possible, that so much data is being gathered, that their choice to click “submit” or not doesn’t matter and that all the data that’s harvested is under the care of third-party tracking companies?

A recent study by researchers at Princeton University called No boundaries: Exfiltration of personal data by session-replay scripts revealed the extent to which session replay code is used on popular websites, and highlighted a number of serious privacy concerns that occur as a consequence.

Collection of page content by third-party replay scripts may cause sensitive information such as medical conditions, credit card details and other personal information displayed on a page to leak to the third-party as part of the recording. This may expose users to identity theft, online scams, and other unwanted behavior. The same is true for the collection of user inputs during checkout and registration processes.

Using a fairly conservative methodology the researchers looked for evidence of recording by seven of the top session replay companies: Yandex, FullStory, Hotjar, UserReplay, Smartlook, Clicktale, and SessionCam.

They were found on 482 of the top 50,000 sites, on domains as interesting as hp.com, intel.com, comcast.net, lenovo.com, costco.com and gap.com. Alongside their writeup the researchers have made a full list of the sites available.

The researchers identified three serious issues:

Failure to redact passwords

The research notes that all the services that were monitored took steps to prevent the accidental capture of passwords by excluding HTML password input fields. The trouble is that doesn’t always work (my emphasis):

…mobile-friendly login boxes that use text inputs to store unmasked passwords are not redacted by this rule, unless the publisher manually adds redaction tags to exclude them. We found at least one website where the password entered into a registration form leaked to SessionCam, even if the form is never submitted.
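
The gap described in that quote can be sketched as follows. This is a simplified model rather than any vendor’s actual rule: form fields are plain objects, and the redact flag is a hypothetical stand-in for whatever manual redaction tag a publisher might add.

```javascript
// Simplified model of an automatic rule: skip fields whose HTML type is "password",
// plus anything the publisher has tagged by hand (hypothetical `redact` flag).
function isExcludedFromRecording(field) {
  return field.type === "password" || field.redact === true;
}

// A conventional masked password box is caught by the rule.
const maskedBox = { name: "password", type: "password" };

// A "show password" style box uses a text input, so the type rule misses it.
const unmaskedBox = { name: "password", type: "text" };

// The same box after the publisher adds a manual redaction tag.
const taggedBox = { ...unmaskedBox, redact: true };

console.log(isExcludedFromRecording(maskedBox));   // true
console.log(isExcludedFromRecording(unmaskedBox)); // false
console.log(isExcludedFromRecording(taggedBox));   // true
```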

Failure to redact sensitive data

The replay scripts all take steps to automatically exclude the sensitive data you use when logging in, searching or making purchases, and provide tools for site owners to configure it for themselves. That’s laudable but, like all the best plans, it doesn’t survive contact with real life.

Four of the six tracking systems – FullStory, Hotjar, Yandex and Smartlook – will happily suck up your name, email address, phone number, address, date of birth and social security number if they fall into their maw. Hotjar and Yandex extend that laissez-faire attitude to your credit card’s CVV number and expiry date.

The automatic redaction rules that do exist, to exclude things like passwords and credit card numbers, rely on websites to do their data capture in the same, predictable ways. They do not, which leads to cases like this (my emphasis):

FullStory redacts credit card fields with the `autocomplete` attribute set to `cc-number`, but will collect any credit card numbers included in forms without this attribute.
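
A rule like that only works if every site labels its fields the same way. The sketch below is a generic model of an attribute-based rule, not FullStory’s implementation; the collectedFields function and both forms are made up for illustration.

```javascript
// Generic model of an attribute-based rule: a field is treated as a card
// number only if its autocomplete attribute says so.
function collectedFields(fields) {
  return fields
    .filter((field) => field.autocomplete !== "cc-number")
    .map((field) => field.name);
}

// Two hypothetical checkout forms asking for the same details.
const labelledForm = [
  { name: "card", autocomplete: "cc-number" },
  { name: "cardholder", autocomplete: "cc-name" },
];
const unlabelledForm = [
  { name: "card" }, // no autocomplete attribute at all
  { name: "cardholder" },
];

console.log(collectedFields(labelledForm));   // ["cardholder"]
console.log(collectedFields(unlabelledForm)); // ["card", "cardholder"]
```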

Automatic redaction also only applies to one type of data gathering done by replay scripts.

Alongside the data captured by monitoring key strokes or the contents of input fields, the scripts also capture “rendered page content” (screen grabs). Automatically redacting information that appears in screen grabs is hard (my emphasis):

…none of the companies appear to provide automated redaction of displayed content by default; all displayed content in our tests ended up leaking.

Instead, session recording companies expect sites to manually label all personally identifying information included in a rendered page.

That’s right, it’s up to individual sites to make sure your data isn’t hoovered up by the all-seeing, all-screen-grabbing eye of Sauron. And that, dear reader, means you cannot rely on it happening.

Why? Because the path most often taken in software development is the path of least resistance. In this case that path leads to your data being hoovered up in screen grabs of the websites you’re visiting.

For it to not happen, this has to happen:

…a site’s web application developers would need to work with the site’s marketing and analytics teams to iteratively scrub personally identifying information from recordings as it’s discovered. Any [small] change to the site design … requires a review of the redaction rules.

Not. Going. To. Happen.
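
To see why, here is a rough sketch of what manual labelling amounts to. The page model, the sensitive label and the captureText function are all hypothetical; the point is only that anything added to a page later is captured unless somebody remembers to label it.

```javascript
// Hypothetical page model: each node has text, optional children, and an
// optional `sensitive` label that the site's developers must add by hand.
function captureText(node) {
  if (node.sensitive) return "[redacted]"; // a label hides the whole subtree
  const parts = [node.text ?? "", ...(node.children ?? []).map(captureText)];
  return parts.filter((part) => part !== "").join(" ");
}

const accountPage = {
  text: "Your account",
  children: [
    { text: "jane@example.com", sensitive: true }, // labelled at launch
    { text: "Prescription: insulin" },             // added later, never labelled
  ],
};

console.log(captureText(accountPage));
// Your account [redacted] Prescription: insulin
```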

Your data leaks during recording and playback

So far we’ve only concerned ourselves with the actual snagging of your data, but that’s just half the story. Once it’s been gobbled up your data has to be shunted somewhere else, stored and then made available for playback.

These days it’s more common than not to move data, any data, around the web using HTTPS, the secure and encrypted form of HTTP. It protects against MitM (Man-in-the-Middle) attacks that can steal or change your data, and it provides a degree of assurance that data is being sent to where it’s supposed to go.

Since the data captured by replay scripts could potentially contain passwords, credit card numbers, social security numbers, dates of birth, medical data or other highly sensitive, personal information, we’d expect HTTPS to be used when your data is sent to the third-party recording services’ websites…

Yandex and Hotjar deliver the publisher page content over HTTP — data that was previously protected by HTTPS is now vulnerable to passive network surveillance.

…and when it’s played back to the site owner.

The publisher dashboards for Yandex, Hotjar, and Smartlook all deliver playbacks within an HTTP page, even for recordings which take place on HTTPS pages.
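
The downgrade the researchers describe is easy to express as a check. The sketch below uses the standard URL class; the losesEncryption function and the example addresses are hypothetical.

```javascript
// Returns true when content from an HTTPS page would travel over plain HTTP.
function losesEncryption(pageUrl, transferUrl) {
  const page = new URL(pageUrl);
  const transfer = new URL(transferUrl);
  return page.protocol === "https:" && transfer.protocol === "http:";
}

// Hypothetical addresses, for illustration only.
console.log(losesEncryption("https://shop.example/checkout", "http://replay.example/upload"));  // true
console.log(losesEncryption("https://shop.example/checkout", "https://replay.example/upload")); // false
```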

What to do?

Session recording is a complex business so it’s normally carried out by third-party services. Those third-party services can be disrupted in the same way that you might disrupt other forms of unwanted online tracking or analytics, by using third-party browser plugins like Ghostery or Privacy Badger.
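
Blocking tools differ in how they decide what to stop. The simplest approach, matching script addresses against a list of known tracker hosts, can be sketched like this. The host names are placeholders and the isBlockedScript function is hypothetical; it is not how any particular plugin works.

```javascript
// Hypothetical blocklist of tracker hostnames; real tools maintain their own lists.
const blockedHosts = new Set(["replay.example", "tracker.example"]);

// Block a script if its host, or any parent domain of its host, is listed.
function isBlockedScript(scriptUrl) {
  const labels = new URL(scriptUrl).hostname.split(".");
  for (let i = 0; i < labels.length - 1; i++) {
    if (blockedHosts.has(labels.slice(i).join("."))) return true;
  }
  return false;
}

console.log(isBlockedScript("https://cdn.replay.example/record.js")); // true
console.log(isBlockedScript("https://shop.example/app.js"));          // false
```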

The research also shows the hopeless, toothless pointlessness of the not-quite-dead-yet DNT (Do Not Track) proposal that hopes to get websites to behave themselves by asking them nicely (my emphasis):

At least one of the five companies we studied (UserReplay) allows publishers to disable data collection from users who have Do Not Track (DNT) set in their browsers. We scanned the configuration settings of the Alexa top 1 million publishers using UserReplay on their homepages, and found that none of them chose to honor the DNT signal.
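
Honouring the signal would not be hard. Browsers that support DNT expose the user’s choice to scripts as navigator.doNotTrack, so a script that wanted to respect it could do something like the sketch below. The mayRecord function is hypothetical and says nothing about how UserReplay’s setting is implemented.

```javascript
// Decide whether to record, given a navigator-like object.
// Browsers that support DNT report "1" when the user has asked not to be tracked.
function mayRecord(nav) {
  return nav.doNotTrack !== "1";
}

console.log(mayRecord({ doNotTrack: "1" }));  // false
console.log(mayRecord({ doNotTrack: null })); // true
```

In a page, the call would be mayRecord(navigator).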

Disrupting known, third-party scripts only goes so far though. Developing a full session recording and playback capability is too big a job for most websites but using some of its techniques – such as tracking mouse movements or key strokes – isn’t very difficult at all (and it never has been).

Short of reading each website’s source code and forming a judgement about the intentions of its developers, there’s nothing you can do about that.

GDPR

It’s just possible that the tide is about to turn on services like this though.

In May 2018, Europe’s new data protection rules will come into effect. They deal with how data is collected, stored, accessed and used; how users are told about those things; and what happens to organisations that fail to comply.

What’s got everyone’s attention about the General Data Protection Regulation (GDPR) though is the size of the punch it packs. Firms that fall foul of the new rules could face fines of up to €20m (about $24m) or 4% of global annual turnover, whichever is bigger.
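
As a quick worked example of “whichever is bigger”, using only the figures quoted above (the maxFineEuros function name is made up):

```javascript
// Upper limit of a fine under the figures quoted above:
// the greater of EUR 20m or 4% of global annual turnover.
function maxFineEuros(annualTurnoverEuros) {
  return Math.max(20000000, (annualTurnoverEuros * 4) / 100);
}

console.log(maxFineEuros(100000000));  // 20000000
console.log(maxFineEuros(2000000000)); // 80000000
```

The 4% figure only takes over once turnover passes €500m; below that, the €20m ceiling applies.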

Storing the wrong data in the wrong way is about to become very expensive, making certain types of data much more of a liability and much less of an asset.

When we asked Sophos’s own Senior Cybersecurity Director, Ross McKerchar, in October for his cybersecurity predictions for the next six months, it was top of his list:

I expect to spend a lot of time in the next 6 months deleting unnecessary data and generally being very careful about what we store and where. It’s a defence in depth measure – the less you store the less you have to lose.

Let’s hope he’s not the only one.

If you want to know more about the problems of session replay tracking, read the original research.


Image of 2015 NASM “Violent Universe”: Jeremy Schnittman courtesy of Flickr user goddard studio 13 under Creative Commons license.