The curiously-named system known as punycode is a way of converting words that can’t be written in ASCII, such as the Ancient Greek phrase ΓΝΩΘΙΣΕΑΥΤΟΝ (know yourself), into an ASCII encoding, like this:
This makes it possible to encode so-called Internationalized Domain Names (IDNs) – ones that include non-ASCII characters – using only the Roman letters A to Z, the digits 0 to 9 and the hyphen (-) character.
That’s handy, because the global Domain Name System (DNS), responsible for turning human-friendly server names into computer-friendly network numbers, is restricted to that limited subset of ASCII characters in domain names.
(Back when DNS was codified, storage and network bandwidth were much more precious resources than today, with the result that limits on the maximum size of everything from character sets to network packets are typically much more restrictive in older protocols.)
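If you want to try the conversion yourself, here is a minimal sketch in Python, which ships with a punycode codec in its standard library. The helper names to_ace_label and from_ace_label are our own, chosen purely for illustration, and the sketch skips the normalisation steps (such as lower-casing) that full IDN processing applies before encoding, so treat it as a demonstration rather than a production-grade converter.

```python
def to_ace_label(label: str) -> str:
    """Return an ASCII-only form of one domain label.

    Pure-ASCII labels are returned unchanged; anything else is
    punycoded and given the 'xn--' prefix used in DNS.
    """
    if label.isascii():
        return label
    return "xn--" + label.encode("punycode").decode("ascii")


def from_ace_label(label: str) -> str:
    """Reverse to_ace_label(): decode an 'xn--' label for display."""
    if label.lower().startswith("xn--"):
        return label[4:].encode("ascii").decode("punycode")
    return label


phrase = "ΓΝΩΘΙΣΕΑΥΤΟΝ"
encoded = to_ace_label(phrase)
print(encoded)                  # ASCII letters, digits and hyphens only
print(from_ace_label(encoded))  # round-trips back to the Greek phrase
```

Note that the encoding is applied one label at a time: a name such as something.com would be split at the dots first, and only the labels containing non-ASCII characters would be converted.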
Homographs – when two words look alike
If you were to register the domain…
…some modern apps may recognise the punycoding, and automatically convert the name for display as…
You can see where this is going.
Some letters in the Roman alphabet are the same shape (if not always the same sound) as letters in the Greek, Cyrillic and other alphabets, such as the letters I, E, A, Y, T, O and N in the example above.
So you may be able to register a punycode domain name that looks nothing like a well-known ASCII company name, but nevertheless displays very much like it.
For example, consider the text string consisting of these lower-case Greek letters: alpha, rho, rho, iota, epsilon.
In punycode you get xn--mxail5aa, but when displayed (depending on the fonts you have installed), you get:
Punycode considered harmful
A security researcher called Xudong Zheng recently wrote an article describing how different browsers take different approaches to the homograph problem.
He registered the domain xn--80ak6aa92e.com, which is a Cyrillic version of the above Greek apple trick – an unlikely Cyrillic domain name that just happens to come out as аррӏе when converted back from punycode to “Russian” text.
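To see what is really inside a label like this, you can decode it and list the official Unicode name of each character. The sketch below assumes Python, using its standard punycode codec and unicodedata module; the function name describe_label is ours and is only there for illustration.

```python
import unicodedata


def describe_label(ace_label: str) -> list[tuple[str, str]]:
    """Decode an 'xn--' label and pair each code point with its Unicode name."""
    if not ace_label.lower().startswith("xn--"):
        raise ValueError("not a punycode (xn--) label")
    decoded = ace_label[4:].encode("ascii").decode("punycode")
    return [(f"U+{ord(ch):04X}", unicodedata.name(ch, "UNKNOWN")) for ch in decoded]


# The label from the domain above, without the ".com" part.
for code_point, name in describe_label("xn--80ak6aa92e"):
    print(code_point, name)
```

Every character in the result has a name beginning with CYRILLIC, even though the label looks like five Roman letters on screen.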
Interestingly, many browsers take an aggressive stance against this sort of jiggery-pokery.
Safari and Edge, for example, just display it as plain old xn--80ak6aa92e.com, at least if your system settings don’t include any Cyrillic languages:
After all, if you can’t read Cyrillic text in the first place, you don’t lose anything by seeing the domain name in its punycode format – in fact, you gain a lot by not seeing it as misleading faux-English text.
Internet Explorer shows the plain punycode URL, too, if your settings aren’t appropriate for the language that’s been punycoded, and also pops up a handy notification about the presence of “letters and symbols that can’t be displayed” in the web address:
Chrome and Firefox won’t automatically decode punycode URLs if they mix multiple alphabets or languages, on the grounds that such text strings are highly unlikely in real life and therefore suspicious.
But both Chrome and Firefox will autoconvert punycode URLs that contain all their characters in the same language, like this:
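The mixed-alphabet rule can be sketched roughly as follows. This is not how Chrome or Firefox actually implement it; it is a simplified Python illustration that assumes the first word of a character's Unicode name (LATIN, GREEK, CYRILLIC and so on) is a good enough stand-in for its alphabet. The function names scripts_in and display_form are hypothetical.

```python
import unicodedata


def scripts_in(label: str) -> set[str]:
    """Crude alphabet guess: the first word of each character's Unicode name."""
    scripts = set()
    for ch in label:
        if ch.isdigit() or ch == "-":
            continue  # digits and hyphens are shared across alphabets
        scripts.add(unicodedata.name(ch, "UNKNOWN").split()[0])
    return scripts


def display_form(ace_label: str) -> str:
    """Decode an 'xn--' label only if it sticks to a single alphabet."""
    if not ace_label.lower().startswith("xn--"):
        return ace_label
    decoded = ace_label[4:].encode("ascii").decode("punycode")
    # Mixed alphabets are treated as suspicious, so keep the ASCII form.
    return decoded if len(scripts_in(decoded)) == 1 else ace_label


# An all-Cyrillic label passes the check and is shown decoded...
print(display_form("xn--80ak6aa92e"))

# ...but a label mixing one Cyrillic letter with Roman ones stays as punycode.
mixed = "xn--" + "\u0430pple".encode("punycode").decode("ascii")
print(display_form(mixed))
```

As the all-Cyrillic case shows, a single-alphabet check on its own does nothing to stop a lookalike name built entirely from one non-Roman alphabet.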
Apparently, Chrome will be adding extra protection to prevent this autoconversion, starting in the next version (Chrome 58), even though there’s a risk that some genuine non-ASCII domains might subsequently appear in the browser as punycode URLs.
Firefox programmers, on the other hand, are arguing strongly that because the Mozilla Foundation’s desire is to avoid favouritism, and to treat all languages equally, this sort of protection is culturally insensitive and technically undesirable.
They say that the browser isn’t the place for deciding when ASCII should take “first class status” over some other system of writing. (ASCII, by the way, stands for American Standard Code for Information Interchange.)
Some of the Mozilla team suggest, not unreasonably, that the responsibility for preventing “confusable” domains, such as the one used by Xudong in his blog article, lies with the registrars of each top-level domain.
If registrars are, in general, supposed to stop fraudulent or deliberately misleading domain registrations, Mozilla says, then they should be stopping “confusables”, too, in the same way that countries expect their Motor Registries to avoid issuing personalised number plates with potentially offensive or B16OTED combinations of letters and numbers.
Not all of the Mozillans agree, however, pointing out that the risk of appearing “culturally insensitive” in respect of a small number of non-ASCII domain names is a small price to pay for making life harder for phishers and scammers in real life.
After all, deciding whether to allow or disallow a “confusable” domain name in the first place is itself a culturally subjective exercise.
Oh, what a tangled web we weave…
What to do?
Xudong has two good suggestions, to which we’ve added a third of our own:
- Use a password manager, which helps reduce the risk of pasting passwords into any incorrectly-named site. The password manager won’t match your Apple-in-ASCII password with the Apple-in-Cyrillic domain name, no matter what character encoding system is used.
- Force Firefox always to display punycode names. If you don’t (or can’t) read any non-Roman alphabets or writing systems, you lose nothing by going to the about:config page and setting network.IDN_show_punycode to true.
- Click on the padlock to display the HTTPS certificate. This shows the domain name for which the certificate was issued using the DNS-friendly, ASCII-only format, so if the name starts xn-- then you are looking at a punycode domain, whatever it may look like in the address bar. (Note. Drill right down to the
21 comments on “Phishing with ‘punycode’ – when foreign letters spell English words”
Is there any benefit to removing the Cyrillic fonts from the system?
These days, there are plenty of fonts that include glyphs (a fancy word for “character and symbol shapes”) for multiple languages, so fonts really group together glyphs for a particular visual style rather than for an individual language or writing system.
Anyway, it’s not only Cyrillic letters that can be used to look like letters from other languages, so short of going back to 1971 and having only 7-bit ASCII characters supported (no typographic symbols, special punctuation, accented characters, maths notation…no emojis!) you are on a bit of a hiding to nothing if you try to “fix” this issue by breaking your display 🙂
Setting network.IDN_show_punycode to false in Firefox is incorrect. Set it to TRUE to show the real URL.
Oops, I’ll fix that in the article. I guess I was thinking of “false” implying “don’t use the fancy feature of autodecoding the punycodes”.
Now fixed, sorry about that.
That’s what he said
I just tried suggestion #2 on Firefox 52.0.2 on Mac OS X 10.11.6 and it appears Firefox now uses “False” as the default. Maybe Mozilla decided to let each user choose an appropriate level of cultural insensitivity.
Thanks to Ralph Hartwell and Anonymous for the correction. I changed my setting from the default “False” to “True” and verified the difference at the test site. I was surprised at how normal the Cyrillic characters looked – it would easily fool most people if something else didn’t give the site away. Considering how easy it is to make the change in about:config I am disappointed that Mozilla doesn’t make the safe approach the default. People who don’t want to appear “culturally insensitive” would be free to make the change while others would be less likely to be sucked in by a phony site.
There’s a link in the article to the Mozilla discussion thread where numerous precautionary techniques are proposed, such as those apparently used in Edge and Safari.
One approach that I think other browsers use is to show the punycode if there are any other parts of the domain name that are in plain ASCII (i.e. not punycoded) and don’t otherwise seem to match.
The theory is that if you have an International Domain Name in the “.com” top-level domain then whoever reads the address bar will have to be comfortable with Roman letters anyway in order to read the “.com” part. In that case, given that you can make sense of the “.com”, you’re unlikely to be thrown by the ASCII punycode characters.
In other words, any domain in “.com” is safest to display in pure ASCII throughout – this gives consistency and security, with a very low likelihood of ever causing confusion or misunderstanding.
Change network.IDN_show_punycode to True in Firefox. The default is False.
I wrote the option the wrong way round originally – you turn ON show_punycode to turn OFF the display of domain names in non-ASCII character sets.
How about Internet Explorer? Despite Microsoft not liking it any more, some people still do.
Trying it now…
[wait a moment for my VM to start]
…IE 11.0.15063.0 on Windows 10.0.15063 comes up with the punycode (pure ASCII) URL by default. I have my system set to “English (UK)” with a US keyboard.
Interestingly, IE popped up a warning at the bottom of the window to say “This web address contains letters or symbols that can’t be displayed”, which was a handy giveaway!
I’ve now added an IE screenshot to the article…thanks for asking the question – it was a good one.
Perfect! I kind of figured Microsoft would implement both correctly (if they got one right — which they did).
I have found lately that both Chrome and Edge make it very hard to actually look at the certificate without delving into the F12 dev tools. I’m not sure why they are obfuscating it now, but it used to be much easier in previous versions.
We bemoaned this ourselves in a recent podcast (the Chrome item starts at the 12’00” mark):
Chrome has been updated and doesn’t show punycode now!
*Not* showing the punycode means it *is* showing the International Domain Name version, which is the basis of this trick.
(The version I used in the article shows itself as up-to-date [2017-04-20T09:26Z] and shows what looks almost the same as “apple” written using Cyrillic letters. That’s Chrome 57.)
I take it that using Sophos email gateway would protect any links in emails regardless of whether they contain punycode?
Yes. The issue with punycode is that the text of the link itself is different from how the link gets displayed when converted back into human-readable characters. The link itself, as encoded in the email, has to consist of ASCII characters, and can be reliably blocked on that basis.