Last week I gave a talk at VB2010 that was well received in the room and on the wires. A number of people have requested copies of, or links to, my presentation and paper (thanks to Helen Martin of Virus Bulletin for permission). Reading a presentation without the commentary is difficult, so I will expand on a few slides here.
In the presentation I give five heuristics for detection and/or more in-depth parsing:
- Heuristic 2: If the objects or streams are mismatched look more closely
- Heuristic 3: If the Cross-Reference (XRef) Table is invalid look more closely
- Heuristic 4: The presence of the LZWDecode, ASCII85Decode, DCTDecode and Encrypt Filters is indicative of clean files
- Heuristic 5: Hash (#) encoded tags are indicative of malicious files
The definition of malicious differs between the two corpora: the first is overly aggressive and the second less so. However, it appears that:
PDF files contain indirect objects of the form N R obj (where N is the object number and R is the revision number, which the specification calls the generation number). Each indirect object is closed by an endobj tag. Indirect objects can store binary data in a stream tag, and each stream is closed by an endstream tag.
Within the first corpus we see:
The second corpus shows:
"Because the PDFs are those reported to SophosLabs, some of them are actually corrupt. Writing a parser to know when the files are maliciously corrupt is non-trivial"
Heuristic 2: If the objects or streams are mismatched look more closely
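The mismatch check behind this heuristic can be sketched in a few lines of Python. This is a minimal illustration, not the tooling used for the talk: the function names `count_tags` and `is_mismatched` are my own, and it assumes that counting tokens over the raw bytes is good enough. A real parser would also have to ignore tokens that happen to appear inside stream data or string literals.

```python
import re

# Assumption: tags are matched as whole words anywhere in the raw bytes.
# Occurrences inside binary stream data would be counted too.
OBJ_START = re.compile(rb"\b\d+\s+\d+\s+obj\b")   # e.g. "12 0 obj"
OBJ_END = re.compile(rb"\bendobj\b")
STREAM_START = re.compile(rb"\bstream\b")         # \b stops a match inside "endstream"
STREAM_END = re.compile(rb"\bendstream\b")


def count_tags(data: bytes) -> dict:
    """Count opening and closing object/stream tags in a PDF byte string."""
    return {
        "obj": len(OBJ_START.findall(data)),
        "endobj": len(OBJ_END.findall(data)),
        "stream": len(STREAM_START.findall(data)),
        "endstream": len(STREAM_END.findall(data)),
    }


def is_mismatched(data: bytes) -> bool:
    """True if the file deserves a closer look under Heuristic 2."""
    c = count_tags(data)
    return c["obj"] != c["endobj"] or c["stream"] != c["endstream"]
```

A file flagged by `is_mismatched` is not necessarily malicious (it may simply be corrupt), which is why the heuristic only says to look more closely.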
There are two main ways of parsing a PDF file:
- Use the Cross Reference (XRef) Table which points to the position of each object and build the tree
- Brute force the file. Scan for starting and ending object tags and build the tree
Over the first corpus I attempt to validate the XRef table:
"It appears that readers and parsers must use both methods"
Heuristic 3: If the Cross-Reference (XRef) Table is invalid look more closely
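To give a flavour of what validating the XRef table involves, here is a rough Python sketch. It is an illustration under stated assumptions rather than the validator I ran over the corpus: `validate_xref` is a hypothetical name, it only handles a single classic (plain text) cross-reference table, and it ignores cross-reference streams and chains of incremental updates.

```python
import re


def validate_xref(data: bytes) -> list:
    """Return a list of problems; an empty list means every in-use entry
    pointed at the start of the object it claims to describe."""
    pointers = list(re.finditer(rb"startxref\s+(\d+)", data))
    if not pointers:
        return ["no startxref pointer found"]
    pos = int(pointers[-1].group(1))          # the last pointer wins
    if data[pos:pos + 4] != b"xref":
        return ["startxref does not point at an xref table"]

    problems = []
    obj_num = None
    for line in data[pos + 4:].splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(b"trailer"):
            break
        subsection = re.fullmatch(rb"(\d+)\s+(\d+)", line)
        entry = re.fullmatch(rb"(\d{10})\s+(\d{5})\s+([nf])", line)
        if subsection:
            obj_num = int(subsection.group(1))   # first object number in this run
        elif entry and obj_num is not None:
            offset, gen = int(entry.group(1)), int(entry.group(2))
            if entry.group(3) == b"n":           # in-use entry: check its target
                header = rb"%d\s+%d\s+obj\b" % (obj_num, gen)
                if not re.match(header, data[offset:offset + 32]):
                    problems.append("object %d not found at offset %d" % (obj_num, offset))
            obj_num += 1
        else:
            problems.append("unparseable xref line: %r" % line)
            break
    return problems
```

Any non-empty result is a reason to fall back to brute-force scanning and to look at the file more closely.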
Binary data within streams can be stored using various Filters (Adobe parlance for compression methods). Scanning the first corpus for the different Filter types shows the following prevalence:
Over the second corpus we see slightly different results:
"LZWDecode is suggestive of older PDFs and DCTDecode is used to store certain graphics"
Heuristic 4: The presence of the LZWDecode, ASCII85Decode, DCTDecode and Encrypt Filters is indicative of clean files
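A prevalence scan of this kind can be approximated as follows. This is only a sketch: `filters_present` and `prevalence` are names I have made up for the example, the scan looks for the four names from the heuristic as plain name tokens in the raw bytes, and it will miss names that are hash encoded or hidden inside compressed object streams.

```python
import re
from collections import Counter

# The four names from Heuristic 4. \b keeps /Encrypt from matching
# longer names such as /EncryptMetadata.
FILTER_NAME = re.compile(rb"/(LZWDecode|ASCII85Decode|DCTDecode|Encrypt)\b")


def filters_present(data: bytes) -> set:
    """Return the set of Heuristic 4 names that occur in one file."""
    return {name.decode("ascii") for name in FILTER_NAME.findall(data)}


def prevalence(files) -> Counter:
    """Count, per name, how many files in an iterable of byte strings use it."""
    counts = Counter()
    for data in files:
        counts.update(filters_present(data))
    return counts
```

Running `prevalence` separately over a clean set and a malicious set gives the kind of comparison the heuristic is based on.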
The standard allows Font names to contain non-ASCII characters. To do this, the non-ASCII characters are hash encoded: a hash followed by a two-digit hexadecimal number, e.g. #61 is the hash followed by hexadecimal 61 (ASCII 'a'). Scanning the corpus for Filters that use hash encoding gives:
"When I rescanned the 257 files with later data they were all malicious"
Heuristic 5: Hash (#) encoded tags are indicative of malicious files
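Hash encoding is simple to undo, which makes this heuristic cheap to check. The sketch below is illustrative only: `decode_name` and `hash_encoded_names` are hypothetical helpers, the name-matching pattern is a simplification of the real tokenising rules, and the Fl#61teDecode sample in the usage note is an invented example of an obfuscated Filter name.

```python
import re

HASH_ESCAPE = re.compile(rb"#([0-9A-Fa-f]{2})")
# Simplified: a name is a slash followed by anything up to whitespace or a delimiter.
NAME = re.compile(rb"/([^\s/<>\[\](){}%]+)")


def decode_name(raw: bytes) -> bytes:
    """Replace every #xx escape with the byte it encodes."""
    return HASH_ESCAPE.sub(lambda m: bytes([int(m.group(1), 16)]), raw)


def hash_encoded_names(data: bytes) -> list:
    """Return (raw, decoded) pairs for every name that uses hash encoding."""
    return [(n, decode_name(n)) for n in NAME.findall(data) if HASH_ESCAPE.search(n)]
```

For example, `hash_encoded_names(b"<< /Filter /Fl#61teDecode >>")` reports the pair `(b"Fl#61teDecode", b"FlateDecode")`: an ordinary letter has been escaped for no legitimate reason, which is exactly the pattern worth flagging.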
Adobe have done a great deal of work to try and fix the problems of malicious PDFs: changed the update frequency of their products; changed the update mechanisms; and joined MAPP. Even so, there are still things they could improve:
- Conclusion 2: Only run signed external and internal code
- Conclusion 3: Implement strict parsing modes in reader (esp. browser plugins)
- Conclusion 4: Redesign PDF
- Conclusion 5: Flying Wallendas
It would help if readers warned by default when asked to open a corrupt file. Browser plugins should even try to.
The redesign of PDF has already begun, and PDF/A is actually a good start. History has shown that the problems with Microsoft Office macros went away with newer versions because of a redesign.
PDF Reader is being redesigned to have a sandbox, but care must be taken not to allow sloppy code that relies on the sandbox to catch errors.
I finished the presentation by stating:
This house believes that PDF as a file format is no longer fit for purpose and that a new SDF (Safe Document Format) should take its place.
Of the ~200 people in the room ~75% agreed with the statement and ~3% disagreed.