Malicious PDFs: A summary of my VB2010 presentation

Last week I gave a talk at VB2010 that was well received, both in the room and on the wires. A number of people have requested copies of, or links to, my presentation and paper (thanks to Helen Martin of Virus Bulletin for permission). Presentations are difficult to read without the commentary, so I will expand on a few slides here.

In the presentation I give five heuristics for detection and/or more in-depth parsing:

Heuristic 1

For my paper I gathered a corpus of ~130 000 PDF files, of which half were malicious. Scanning the corpus for the tag /JavaScript gave the following results:

For the presentation I gathered a larger corpus (concentrating on files with JavaScript) and that gave:

The definition of malicious differs between the two corpora (the first is overly aggressive and the second less so). However, it appears that:

“While JavaScript is not necessary for maliciousness, it is also not necessary for the majority of clean files.”

Heuristic 1: If the PDF contains JavaScript, look more closely
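
As a rough illustration of the scan behind this heuristic, the Python sketch below searches a file's raw bytes for the /JavaScript name. The function name contains_javascript_tag and the sample bytes are made up for the example; this is not the tooling used for the paper, and a scan this simple will miss tags hidden inside compressed streams or obfuscated names.

```python
import re

# Match the /JavaScript name, but not a longer name that merely starts with it.
JS_TAG = re.compile(rb"/JavaScript(?![A-Za-z0-9])")


def contains_javascript_tag(data: bytes) -> bool:
    """Return True if the raw bytes contain a /JavaScript name."""
    return JS_TAG.search(data) is not None


# Hypothetical fragment of a PDF action dictionary, for illustration only.
sample = b"1 0 obj\n<< /Type /Action /S /JavaScript /JS (app.alert(1)) >>\nendobj\n"

if contains_javascript_tag(sample):
    print("JavaScript tag found: look more closely")
```

To check a real file, read it in binary mode and pass the bytes to contains_javascript_tag.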

Heuristic 2

Within PDF files you have indirect objects, which take the form N R obj (where N is the object number and R is the revision number, which the PDF specification calls the generation number). Each indirect object is closed by an endobj tag. An indirect object can hold binary data in a stream, which opens with a stream tag and is closed by an endstream tag.

Within the first corpus we see:

The second corpus shows:

“Because the PDFs are those reported to SophosLabs, some of them are actually corrupt. Writing a parser to know when the files are maliciously corrupt is non-trivial.”

Heuristic 2: If the objects or streams are mismatched, look more closely
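
One way to approximate this check, sketched in Python, is to count opening and closing tags and compare the totals. The names tag_counts and looks_mismatched are invented for the example, and the approach is an assumption rather than the method used for the paper: binary data inside a stream can contain these keywords by chance, so a mismatch is only a prompt to look more closely.

```python
import re

# Opening tags look like "12 0 obj"; \b stops "obj" matching inside "endobj"
# and "stream" matching inside "endstream".
OBJ_OPEN = re.compile(rb"\b\d+\s+\d+\s+obj\b")
OBJ_CLOSE = re.compile(rb"\bendobj\b")
STREAM_OPEN = re.compile(rb"\bstream\b")
STREAM_CLOSE = re.compile(rb"\bendstream\b")


def tag_counts(data: bytes) -> dict:
    """Count object and stream opening/closing tags in the raw bytes."""
    return {
        "obj": len(OBJ_OPEN.findall(data)),
        "endobj": len(OBJ_CLOSE.findall(data)),
        "stream": len(STREAM_OPEN.findall(data)),
        "endstream": len(STREAM_CLOSE.findall(data)),
    }


def looks_mismatched(data: bytes) -> bool:
    """Return True if the opening and closing tag counts differ."""
    counts = tag_counts(data)
    return (counts["obj"] != counts["endobj"]
            or counts["stream"] != counts["endstream"])


# Hypothetical well-formed object, and the same object with endobj cut off.
good = b"1 0 obj\n<< /Length 5 >>\nstream\nhello\nendstream\nendobj\n"
truncated = good[:-len(b"endobj\n")]

print(tag_counts(good))
print(looks_mismatched(truncated))
```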

Heuristic 3

There are two main ways of parsing a PDF file:

  • Use the Cross Reference (XRef) Table which points to the position of each object and build the tree
  • Brute force the file. Scan for starting and ending object tags and build the tree

Over the first corpus I attempted to validate the XRef table:

“It appears that readers and parsers must use both methods”

Heuristic 3: If the Cross-Reference (XRef) Table is invalid, look more closely
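
The sketch below shows one way an XRef check could work, in Python. It is a simplified illustration under several assumptions: it handles only a classic, single-section xref table (not the cross-reference streams of newer PDFs, and not chains of incremental updates), and xref_problems and build_sample are hypothetical names. It follows the last startxref offset, reads each table entry, and confirms that every in-use entry really points at the start of the object it claims.

```python
import re

STARTXREF = re.compile(rb"startxref\s+(\d+)")
SUBSECTION = re.compile(rb"\s*(\d+) (\d+)[ \t]*[\r\n]+")
ENTRY = re.compile(rb"(\d{10}) (\d{5}) ([nf])\s*")


def xref_problems(data: bytes) -> list:
    """Return a list of problems found in a classic xref table (empty if none)."""
    offsets = STARTXREF.findall(data)
    if not offsets:
        return ["no startxref keyword"]
    pos = int(offsets[-1])  # use the last startxref in the file
    if data[pos:pos + 4] != b"xref":
        return ["startxref does not point at an xref keyword"]
    pos += 4
    problems = []
    while True:
        header = SUBSECTION.match(data, pos)
        if header is None:
            break  # no further subsection; normally the trailer follows
        first, count = int(header.group(1)), int(header.group(2))
        pos = header.end()
        for number in range(first, first + count):
            entry = ENTRY.match(data, pos)
            if entry is None:
                problems.append("malformed entry for object %d" % number)
                return problems
            pos = entry.end()
            offset, generation = int(entry.group(1)), int(entry.group(2))
            if entry.group(3) == b"n":  # in-use entry: must point at "N G obj"
                expected = re.compile(rb"%d\s+%d\s+obj" % (number, generation))
                if expected.match(data, offset) is None:
                    problems.append(
                        "object %d is not at offset %d" % (number, offset))
    return problems


def build_sample() -> bytes:
    """Build a tiny, hypothetical PDF skeleton with a correct xref table."""
    body = b"%PDF-1.4\n"
    object_at = len(body)
    body += b"1 0 obj\n<< /Type /Catalog >>\nendobj\n"
    xref_at = len(body)
    body += b"xref\n0 2\n"
    body += b"0000000000 65535 f \n"
    body += b"%010d 00000 n \n" % object_at
    body += b"trailer\n<< /Size 2 /Root 1 0 R >>\n"
    body += b"startxref\n%d\n" % xref_at
    body += b"%%EOF\n"
    return body


print(xref_problems(build_sample()))
```

A non-empty result does not prove malice, since corrupt files fail in the same way, but it marks the file as one to parse by brute force as well.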

Heuristic 4

Binary data within streams can be encoded with various Filters (Adobe parlance for compression methods). Scanning the first corpus for the different Filters shows the following prevalence:

Over the second corpus we see slightly different results:

“LZWDecode is suggestive of older PDFs and DCTDecode is used to store certain graphics”

Heuristic 4: The presence of the LZWDecode, ASCII85Decode, DCTDecode and Encrypt Filters is indicative of clean files
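
A prevalence scan of this kind can be sketched in Python by counting how often each Filter name appears in the raw bytes. The function name filter_counts and the sample are hypothetical, and the list of names is an assumption: it holds the four named in the heuristic plus FlateDecode for comparison. It does not reproduce the scan used for the paper.

```python
import re
from collections import Counter

# Names to look for; extend the tuple to track other Filters.
NAMES = (b"FlateDecode", b"LZWDecode", b"ASCII85Decode", b"DCTDecode", b"Encrypt")
PATTERN = re.compile(rb"/(" + b"|".join(NAMES) + rb")(?![A-Za-z0-9])")


def filter_counts(data: bytes) -> Counter:
    """Count occurrences of each tracked name in the raw bytes."""
    return Counter(m.group(1).decode("ascii") for m in PATTERN.finditer(data))


# Hypothetical stream dictionaries, for illustration only.
sample = (b"<< /Filter /FlateDecode >> "
          b"<< /Filter [/ASCII85Decode /LZWDecode] >> "
          b"<< /Filter /FlateDecode >>")

for name, count in sorted(filter_counts(sample).items()):
    print(name, count)
```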

Heuristic 5

The standard allows Font names to contain non-ASCII characters. To do this, the non-ASCII characters are hash encoded: a hash followed by a two-digit hexadecimal number, e.g. #61 (ASCII ‘a’). Scanning the corpus for Filters that use hash encoding gives:

“When I rescanned the 257 files with later data, they were all malicious.”

Heuristic 5: Hash (#) encoded tags are indicative of malicious files
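
To make the encoding concrete, the Python sketch below finds names that contain #xx escapes and decodes them, so that /J#61vaScript is reported as /JavaScript. The function name hash_encoded_names and the sample are invented for the example, and because it scans raw bytes it may report spurious matches from binary stream data.

```python
import re

# A name is "/" followed by characters that are not whitespace or delimiters.
NAME = re.compile(rb"/[^\s\x00()<>\[\]{}/%]+")
HASH_ESCAPE = re.compile(rb"#([0-9A-Fa-f]{2})")


def _unescape(match):
    """Turn one #xx escape into the single byte it stands for."""
    return bytes([int(match.group(1), 16)])


def hash_encoded_names(data: bytes) -> list:
    """Return (raw, decoded) pairs for every name that uses #xx escapes."""
    pairs = []
    for match in NAME.finditer(data):
        raw = match.group(0)
        if HASH_ESCAPE.search(raw):
            pairs.append((raw, HASH_ESCAPE.sub(_unescape, raw)))
    return pairs


# Hypothetical action dictionary hiding its /JavaScript tag.
sample = b"<< /Type /Action /S /J#61vaScript /JS (app.alert(1)) >>"

for raw, decoded in hash_encoded_names(sample):
    print(raw, "->", decoded)
```

Decoding names before running the other scans keeps an obfuscated tag from slipping past a plain search for /JavaScript.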


Adobe have done a great deal of work to try and fix the problems of malicious PDFs: they have changed the update frequency of their products, changed the update mechanisms, and joined MAPP. Even so, there are still things they could improve:

Conclusion 1

Heuristic 1 suggests that JavaScript isn’t that common, so lightweight readers, especially browser plugins, shouldn’t implement it.

Conclusion 2

If a PDF is going to run code (via JavaScript or Flash), that code should be signed so that you can have some level of trust in it. This isn’t a fail-safe method, but it helps.

Conclusion 3

It would help if readers warned, by default, when trying to open corrupt files. Browser plugins should even try to.

Conclusion 4

Redesigning PDF has already begun, and PDF/A is actually a good start. History has shown that the problems with Microsoft Office macros went away with newer versions because of a redesign.

Conclusion 5

PDF Reader is being redesigned to have a sandbox, but care must be taken not to allow sloppy code that relies on the sandbox to catch errors.

I finished the presentation by stating:

This house believes that PDF as a file format is no longer fit for purpose and that a new SDF (Safe Document Format) should take its place.

Of the ~200 people in the room ~75% agreed with the statement and ~3% disagreed.

My colleague Mike Wood – who also presented at VB2010 – joined Chet and me in a podcast.