They emailed me a PDF. It opened fine with evince and looked like a simple doc at first. Then I clicked on a field in the form. Strangely, instead of simply populating the field with my text, a PDF note window popped up so my text entry went into a PDF note, which many viewers present as a sticky note icon.
If I were to fax this PDF, the PDF comments would just get lost. So to fill out the form I fed it to LaTeX and used the overpic package to place text wherever I chose. LaTeX rejected the file; it could not handle this PDF. Then I ran the file command to see what I was dealing with:
$ file signature_page.pdf
signature_page.pdf: Java serialization data, version 5
WTF is that? I know PDF supports JavaScript (shitty indeed). Is that what this is? “Java” is not JavaScript, so I’m baffled. Why is Java in a PDF? (edit: explainer on Java serialization, and some analysis)
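For what it's worth, the `file` output above is what you would expect if the file simply begins with the two-byte Java serialization magic number 0xACED followed by a two-byte stream version (5). That says nothing by itself about Java code being executed. Below is a minimal Python sketch, not something from this thread, that checks those leading bytes and reports where a `%PDF-` marker first appears. The function name `describe_header` is made up for illustration.

```python
JAVA_STREAM_MAGIC = b"\xac\xed"  # magic number that opens a Java serialization stream
PDF_MARKER = b"%PDF-"            # marker that opens a PDF header line


def describe_header(data: bytes) -> dict:
    """Report whether data starts with the Java serialization magic
    and at which byte offset the first %PDF- marker appears."""
    starts_with_java = data[:2] == JAVA_STREAM_MAGIC
    version = None
    if starts_with_java and len(data) >= 4:
        # The two bytes after the magic hold the stream version, big-endian.
        version = int.from_bytes(data[2:4], "big")
    return {
        "java_serialization": starts_with_java,
        "stream_version": version,
        "pdf_offset": data.find(PDF_MARKER),  # -1 if no marker is present
    }
```

Feeding it `open("signature_page.pdf", "rb").read()` would show how many bytes, if any, sit in front of the PDF marker.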
My workaround was to use evince to print the PDF to PDF (using a PDF-building printer driver or whatever evince uses), then feed that into LaTeX. That worked.
My question is, how common is this? Is it going to become a mechanism to embed a tracking pixel like corporate assholes do with HTML email?
I probably need to change my habits. I know PDF docs can serve as carriers of copious malware anyway. Some people go to the extreme of creating a single-use virtual machine with a PDF viewer, printing the PDF to a new PDF inside it, and then destroying the VM on the assumption that it is compromised.
My temptation is to take a less tedious approach. E.g. something like:
$ firejail --net=none evince untrusted.pdf
I should be able to improve on that by doing something non-interactive. My first guess:
Error: /invalidfileaccess in --file--
Operand stack:
(untrusted_input.pdf) (r)
Execution stack:
%interp_exit .runexec2 --nostringval-- --nostringval-- --nostringval-- 2 %stopped_push --nostringval-- --nostringval-- --nostringval-- false 1 %stopped_push 1990 1 3 %oparray_pop 1989 1 3 %oparray_pop 1977 1 3 %oparray_pop 1833 1 3 %oparray_pop --nostringval-- %errorexec_pop .runexec2 --nostringval-- --nostringval-- --nostringval-- 2 %stopped_push --nostringval-- --nostringval-- --nostringval-- %array_continue --nostringval--
Dictionary stack:
--dict:769/1123(ro)(G)-- --dict:0/20(G)-- --dict:87/200(L)-- --dict:0/20(L)--
Current allocation mode is local
Last OS error: Permission denied
Current file position is 10479
GPL Ghostscript 10.00.0: Unrecoverable error, exit code 1
What’s my problem? Better ideas? I would love it if attempts to reach the cloud could be trapped and recorded to a log file in the course of neutering the PDF.
(note: I also wonder what happens when Firefox opens this PDF considering Mozilla is happy to blindly execute whatever code it receives no matter the context.)
For many years malicious PDF files had the shameful honor of being the number one way people's PCs got infected, and it's because of bullshit like this.
"Surprise, here's some Java code to execute on your personal computer without asking!" isn't being done by anyone who is actually your ally.
We're just discussing how shitty a shitty person has been toward you, at this point. There's no good pro-social reason to deliver you an app while calling it a document.
Do we think it's a virus? Probably not, but maybe.
Do we think there's a tracker? Certainly. The average organization shitty enough to build or use this technology layer has over 500 separate relationships with companies that track you.
Someone tried to put a tracker in this PDF.
Whether people like me made it too hard for them is up for analysis.
I guarantee you that someone tried.
They're not yet good enough at hiding this stuff to feel confident lying about it, so it is likely disclosed in the fine print somewhere, if you're feeling patient enough to read all of it.
This could just be a really stupid format, put out by a specific application for creating PDFs, because the original authors didn't want to pay Adobe (never attribute to malice, that which can be sufficiently explained with stupidity).
Does pdfinfo give any indication of the application used to create the document? If it chokes on the Java bit up front, can you extract just the PDF from the file and look at that? You might also dig through the PDF a bit using Didier Stevens's tools, looking for JavaScript or other indicators of PDF fuckery.
Does the file contain any other Java bytecode? If so, can you pass that through a decompiler?
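As a rough first pass before reaching for dedicated tools, the kind of keyword triage suggested above can be approximated in a few lines of Python. This is only a sketch under stated assumptions: the list of names is my own pick of PDF dictionary keys that tend to be worth a look (script, auto-run, launch, URI and form-submit actions), and the function name `count_names` is hypothetical. It only scans raw bytes, so it will miss names written with hex escapes and anything inside compressed streams.

```python
import re

# PDF name objects that often deserve a closer look (an illustrative selection).
SUSPICIOUS_NAMES = [
    b"/JavaScript", b"/JS", b"/OpenAction", b"/AA",
    b"/Launch", b"/URI", b"/EmbeddedFile", b"/SubmitForm",
]


def count_names(data: bytes) -> dict:
    """Count raw occurrences of each suspicious name in the file bytes."""
    counts = {}
    for name in SUSPICIOUS_NAMES:
        # Require a non-alphanumeric character (or end of data) after the
        # name, so that a short name is not counted inside a longer one.
        pattern = re.escape(name) + rb"(?![A-Za-z0-9])"
        counts[name.decode("ascii")] = len(re.findall(pattern, data))
    return counts
```

A count of zero here is not proof of a clean file. A nonzero count for `/URI` or `/SubmitForm` would at least point at where a phone-home could live.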
would love it if attempts to reach the cloud could be trapped and recorded to a log file in the course of neutering the PDF.
This is possible, but it takes a bit of setup. In my own lab, I have PolarProxy running in one virtual machine (VM) under QEMU/KVM. That VM acts as a gateway between an isolated network and a network with internet access. It does transparent TLS break-and-inspect on port 443/tcp, runs tcpdump to capture port 80/tcp, and also serves DNS using BIND.
Then there is the "victim" VM, which runs bog-standard Windows 10. The PolarProxy root cert has been added to its Trusted Roots certificate store, and its default gateway and DNS servers are hard-coded to the PolarProxy VM. Suspicious stuff is tested on this system, and all network traffic is recorded on the PolarProxy system in standard pcap format for analysis.
Does pdfinfo give any indication of the application used to create the document?
Oracle Documaker PDF Driver
PDF version: 1.3
If it chokes on the Java bit up front, can you extract just the PDF from the file and look at that?
Not sure how to do that, but I did just try pdfimages -all, which was not useful since it’s a vector PDF. pdfdetach -list shows 0 attachments. It just occurred to me that pdftocairo could be useful as a CLI way to neuter the doc and make it usable, but that’s kind of a lossy meat-grinder option that doesn’t help with analysis.
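A non-interactive version of that meat-grinder idea could look roughly like the Python sketch below, which combines the `firejail --net=none` call from the original post with `pdftocairo -pdf`. It is an untested assumption that firejail's default profile lets pdftocairo read the input and write the output where you point it. The helper names `build_command` and `rerender` are made up for illustration.

```python
import subprocess


def build_command(src: str, dst: str) -> list:
    """Build the argv for re-rendering src to dst with networking disabled."""
    return ["firejail", "--net=none", "pdftocairo", "-pdf", src, dst]


def rerender(src: str, dst: str) -> None:
    """Run the sandboxed re-render; raises CalledProcessError on failure."""
    subprocess.run(build_command(src, dst), check=True)
```

Calling `rerender("untrusted.pdf", "neutered.pdf")` would be the whole workflow. This only blocks network access during rendering; it does not log attempted connections.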
You might also dig through the PDF a bit using Didier Stevens's tools,
Thanks for the tip. I might have to look into that. No readme.. I guess this is a /use the source, Luke/ scenario. (edit: found this).
I appreciate all the tips. I might be tempted to dig into some of those options.
It's literally just the format of the file here. If you skip the java serialisation header it's a normal pdf file. I said nothing about the pdf file itself.
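If that is right, extracting "just the PDF" amounts to dropping everything in front of the first `%PDF-` marker. Here is a minimal Python sketch of that idea; the function names are hypothetical. One caveat: if the cross-reference offsets inside the PDF were computed from the start of the original file rather than from the PDF header, they will no longer line up after stripping, and a strict consumer may need the table rebuilt.

```python
from pathlib import Path

PDF_MARKER = b"%PDF-"  # marker that opens a PDF header line


def strip_leading_bytes(data: bytes) -> bytes:
    """Return data starting at the first %PDF- marker."""
    offset = data.find(PDF_MARKER)
    if offset < 0:
        raise ValueError("no %PDF- marker found")
    return data[offset:]


def strip_file(src: str, dst: str) -> int:
    """Write a copy of src without its leading non-PDF bytes.
    Returns the number of bytes that were dropped."""
    original = Path(src).read_bytes()
    stripped = strip_leading_bytes(original)
    Path(dst).write_bytes(stripped)
    return len(original) - len(stripped)
```

Something like `strip_file("signature_page.pdf", "stripped.pdf")` would leave the original untouched, so the two copies can be compared afterwards.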
I did explain what it is. I just don't know why certain programs encode it this way. It's supported by multiple pdf readers so it must be semi common but I can't find a reason for it to be encoded this way.
I'm trying to help you out; there's no need to be a dick.