Ever wondered how much metadata is included within the PDF files you email or share with others? Believe it or not, a lot can be determined from a PDF you've created. This post looks at how to clean the metadata from your PDF files before you send them, and how to protect them so they aren't easily edited or copied by a recipient. These techniques are sometimes referred to as anti-forensics, with the goal of limiting the amount of forensic information you provide within a file that you have produced.
If you're after the quick copy and paste solution, skip to the bottom of this post where I show you how to create a function in bash to automate the whole process.
Prerequisites
The solutions in this blog post assume you're using Ubuntu or another Debian-flavoured OS, and that you have the following tools installed.
apt install exiftool
apt install qpdf
apt install pdftk
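Before going further, it can help to confirm the tools are actually on your PATH. The following is a minimal sketch rather than part of the original workflow; missing_tools is a hypothetical helper name of my own.

```shell
# Print the name of every tool passed as an argument that is not on the PATH.
# Prints nothing when all of them are available.
missing_tools() {
    for tool in "$@"; do
        command -v "$tool" >/dev/null 2>&1 || printf '%s\n' "$tool"
    done
}
```

For this post you would call it as: missing_tools exiftool qpdf pdftk pdfinfo (pdfinfo is used later in the post to check the results).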
Cleaning Metadata From Your PDF Files
Let's start by running the following command on a PDF document to see what metadata it actually contains. The command only reads the file, so you can run it safely on any PDF.
$ exiftool -all My_Secrect_Document.pdf
Running the above command on an example PDF produces the following output, which shows all the metadata currently associated with the file.
ExifTool Version Number : 10.10
File Name : My_Secrect_Document.pdf
Directory : .
File Size : 318 kB
File Modification Date/Time : 2017:04:17 23:55:19+10:00
File Access Date/Time : 2017:04:17 23:57:09+10:00
File Inode Change Date/Time : 2017:04:17 23:55:33+10:00
File Permissions : rw-------
File Type : PDF
File Type Extension : pdf
MIME Type : application/pdf
PDF Version : 1.5
Linearized : No
Page Count : 5
Language : en-AU
Tagged PDF : Yes
XMP Toolkit : 3.1-701
Producer : Microsoft® Word 2016
Title : Josh Lemon
Creator : Josh Lemon
Creator Tool : Microsoft® Word 2016
Create Date : 2017:04:17 23:55:19+10:00
Modify Date : 2017:04:17 23:55:19+10:00
Document ID : uuid:B3ABB980-2F6F-4FCD-9650-D015ED64C528
Instance ID : uuid:B3ABB980-2F6F-4FCD-9650-D015ED64C528
Author : Josh Lemon
You'll notice the metadata in the PDF file gives a fair amount of information away, including:
- timestamps for creation and changes to the file,
- my local language pack,
- the timezone my computer is set to,
- the application I used to create the file and version number - in this case Microsoft Word,
- and the name I've used to register Microsoft Word.
While it's unlikely that this information on its own could lead to something as sinister as my machine being compromised, it does give away a fair amount of information about the computer I created this PDF on. Gathering information and reconnaissance about a target is where an attacker starts, so limiting this footprint helps limit an attacker in their initial research before a compromise begins. For example, if an attacker wants to exploit the Word application I use, they now know exactly which version to concentrate on when developing an exploit.
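To focus on just the fields listed above, the exiftool output can be filtered. This is a rough sketch: identifying_fields is a hypothetical helper, and the field list is my own selection based on the example output, not an exhaustive one.

```shell
# Read exiftool output on stdin and keep only the fields that identify the
# author, their software, their language or their timezone.
# The field names are taken from the example output earlier in this post.
identifying_fields() {
    grep -E '^(Author|Creator|Creator Tool|Title|Producer|Language|Create Date|Modify Date|Document ID|Instance ID) *:'
}
```

It would be used as: exiftool My_Secrect_Document.pdf | identifying_fields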
Let's look at how we can trim some of this metadata down. There are two main tools I prefer to use: qpdf and exiftool. Qpdf can linearize a PDF file, among other PDF manipulation functions; traditionally, linearizing is used to create web-optimised PDF files for faster downloading and viewing. Exiftool, on the other hand, allows you to view and update metadata for files. Exiftool isn't limited to PDF files, but for this example we'll stick with PDFs; feel free to run the above exiftool command on other files to see what results you get. It's worth noting that these tools remove the common metadata from a PDF, but some metadata may still exist in the file, including font metadata and object metadata.
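First, we start with qpdf, which rewrites the PDF into a new file and strips its metadata. The command below doesn't pass --linearize, which is why the output that follows still reports Linearized as No.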
qpdf My_Secrect_Document.pdf My_Secrect_Document_CLEAN.pdf
When we now view the metadata of the file, you'll notice that a lot has been stripped out compared to the above output.
$ exiftool My_Secrect_Document_CLEAN.pdf
ExifTool Version Number : 10.10
File Name : My_Secrect_Document_CLEAN.pdf
Directory : .
File Size : 307 kB
File Modification Date/Time : 2017:04:18 00:25:51+10:00
File Access Date/Time : 2017:04:18 00:25:51+10:00
File Inode Change Date/Time : 2017:04:18 00:25:51+10:00
File Permissions : rw-r--r--
File Type : PDF
File Type Extension : pdf
MIME Type : application/pdf
PDF Version : 1.5
Linearized : No
Language : en-AU
Tagged PDF : Yes
Page Count : 5
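Once you've used qpdf, you can then run exiftool over the file. In the below command I'm telling exiftool to blank every metadata field that it can. The fields then no longer show up when the file is queried, although the old values are not necessarily erased from the file itself: exiftool edits a PDF by appending an update and leaves the previous objects in place. It also keeps a backup of the file it changed, with _original appended to the name.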
exiftool -all:all= My_Secrect_Document_CLEAN.pdf
When I again run exiftool to simply query the metadata in the file, I get the following results.
$ exiftool My_Secrect_Document_CLEAN.pdf
ExifTool Version Number : 10.10
File Name : My_Secrect_Document_CLEAN.pdf
Directory : .
File Size : 307 kB
File Modification Date/Time : 2017:04:18 00:25:51+10:00
File Access Date/Time : 2017:04:18 00:25:58+10:00
File Inode Change Date/Time : 2017:04:18 00:25:51+10:00
File Permissions : rw-r--r--
File Type : PDF
File Type Extension : pdf
MIME Type : application/pdf
PDF Version : 1.5
Linearized : No
Language : en-AU
Tagged PDF : Yes
Page Count : 5
You'll notice very little in the output here has actually changed. This is mainly because there is no XMP metadata left in the PDF document I've used for this example. The reason I recommend using both qpdf and exiftool together is that exiftool also removes all the XMP metadata, whereas qpdf optimises the file while also removing any orphaned objects, like the ones we've just set to null. In the final script I've provided at the end of this post, I actually run qpdf again after exiftool to remove any orphaned objects that exiftool may have created.
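Since exiftool only reports the fields it knows how to read, a crude extra check is to search the raw bytes of the cleaned file for strings you expect to be gone. The sketch below assumes GNU grep (for the -a option), and leftover_strings is a hypothetical helper of my own. A clean result is not proof: values stored in compressed streams or in a different text encoding won't be found this way.

```shell
# Usage: leftover_strings FILE STRING...
# Prints each STRING that still appears in the raw bytes of FILE and
# returns 1 if any were found, 0 otherwise.
leftover_strings() {
    local file="$1" found=0 needle
    shift
    for needle in "$@"; do
        # -a treats the binary PDF as text, -F matches the string literally,
        # -i ignores case, -q suppresses grep's own output.
        if grep -a -F -i -q -- "$needle" "$file"; then
            printf 'still present: %s\n' "$needle"
            found=1
        fi
    done
    return "$found"
}
```

For the example document it would be called as: leftover_strings My_Secrect_Document_CLEAN.pdf "Josh Lemon" "Microsoft"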
Securing Your PDF Files
Once you've removed as much of the metadata from the PDF file as possible, I usually take the stance that I don't really want anyone to alter or be able to easily copy my work - which is generally the reason I've created a PDF in the first place. This section looks at how to best lock down a PDF while still allowing the recipient of it to view and print your PDF file. While the following section talks about how to secure a PDF, you should note this isn't bulletproof, and there are techniques to still copy items out of a PDF. This process will simply slow down the more advanced recipients, and help prevent the less advanced recipients from altering your work.
I'm going to provide two different techniques for securing a PDF, and they have slightly different results. The first one again uses qpdf, which can linearize a document and encrypt its contents against editing in a single command. This command essentially says that:
- the file is password protected with the owner password MyOwnerPassword,
- no password is required to open the file,
- a recipient can print the file in high resolution,
- a recipient can't make any modifications,
- a recipient can't extract text or images from the file,
- the file is protected with 128-bit AES encryption,
- the PDF to protect is called My_Secrect_Document_CLEAN.pdf,
- and the newly created, protected PDF is called My_Secrect_Document_CLEAN_PROTECTED.pdf.
Here is the full command:
$ qpdf --linearize --encrypt "" "MyOwnerPassword" 128 --print=full --modify=none --extract=n --use-aes=y -- My_Secrect_Document_CLEAN.pdf My_Secrect_Document_CLEAN_PROTECTED.pdf
This next technique has been around for a while and is still relatively common among older web applications that manipulate PDF files, which is why I've included it. Unlike the above example, pdftk does not use AES encryption to protect the file. Instead, it uses the less secure RC4 encryption algorithm. The upside is that the resulting PDF can be opened with earlier versions of Adobe Reader. Most people should no longer be using Adobe Reader 5 or 6, so you'd hope never to come across someone who can't read a PDF produced with the qpdf method above, which needs Adobe Reader 7 or later because AES encryption was introduced in PDF 1.6.
This command essentially says that:
- the PDF to protect is called My_Secrect_Document_CLEAN.pdf,
- the newly created, protected PDF is called My_Secrect_Document_CLEAN_PROTECTED.pdf,
- the file is password protected, and pdftk prompts the author for the owner password,
- no password is required to open the file,
- a recipient can print the file in high resolution,
- a recipient can't make any modifications,
- and the file is protected with 128-bit RC4 encryption.
Here is the full command:
$ pdftk "My_Secrect_Document_CLEAN.pdf" output "My_Secrect_Document_CLEAN_PROTECTED.pdf" owner_pw PROMPT encrypt_128bit allow printing
Once you've completed one of the above methods to lock down your PDF file, you can check that it has worked and see which permissions are set on the PDF file. You can use the pdfinfo tool to achieve this; on Ubuntu and Debian it is part of the poppler-utils package. The below example is a PDF file that was secured using pdftk. You'll notice the encryption algorithm used is RC4.
$ pdfinfo "My_Secrect_Document_CLEAN_PROTECTED.pdf"
Tagged: yes
UserProperties: no
Suspects: no
Form: none
JavaScript: no
Pages: 5
Encrypted: yes (print:yes copy:no change:no addNotes:no algorithm:RC4)
Page size: 595.32 x 841.92 pts (A4)
Page rot: 0
File size: 343106 bytes
Optimized: no
PDF version: 1.5
The below example is a file that was secured using qpdf with the --use-aes=y parameter. In this example, you will see it has the AES encryption algorithm noted.
$ pdfinfo "My_Secrect_Document_CLEAN_PROTECTED.pdf"
Tagged: yes
UserProperties: no
Suspects: no
Form: none
JavaScript: no
Pages: 5
Encrypted: yes (print:yes copy:no change:no addNotes:no algorithm:AES)
Page size: 595.32 x 841.92 pts (A4)
Page rot: 0
File size: 319560 bytes
Optimized: yes
PDF version: 1.6
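Rather than reading the Encrypted line by eye each time, the check can be scripted. This is a minimal sketch that assumes the pdfinfo output format shown above, which may differ between poppler versions; is_locked_down is a hypothetical helper name.

```shell
# Read pdfinfo output on stdin and succeed only if the document is encrypted
# with AES and both copying and changing are denied.
is_locked_down() {
    local line
    # Fail straight away if there is no "Encrypted:" line at all.
    line=$(grep '^Encrypted:') || return 1
    case $line in
        *copy:no*change:no*algorithm:AES*) return 0 ;;
        *) return 1 ;;
    esac
}
```

It would be used as: pdfinfo "My_Secrect_Document_CLEAN_PROTECTED.pdf" | is_locked_down && echo "locked down"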
Putting It All Together
So now you've learnt to strip out metadata and also lock down a PDF to prevent it from being edited. Surely by now you're thinking "I just want a simple way to put this all together in one command". So here it is....
The below code snippet works by adding a new function to your ~/.bash_aliases file. You may need to create this file if you've never used .bash_aliases before. You'll need to edit this file with your preferred command line editor; I've used vi in the example below.
$ vi ~/.bash_aliases
Then you'll need to add the following function.
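strip_pdf() {
    # Usage: strip_pdf FILE.pdf (run it from the directory containing FILE.pdf)
    local src="$1"
    echo "Original Metadata for $src"
    exiftool "$src"
    echo "Removing Metadata...."
    echo ""
    # Rewrite with qpdf, blank the metadata with exiftool, then rewrite again
    # so the objects exiftool orphaned are dropped from the file.
    qpdf --linearize "$src" "stripped1-$src"
    exiftool -all:all= "stripped1-$src"
    qpdf --linearize "stripped1-$src" "stripped2-$src"
    # exiftool leaves a backup of the file it edited with an _original suffix.
    rm "stripped1-$src" "stripped1-${src}_original"
    echo "New Metadata for stripped2-$src"
    exiftool "stripped2-$src"
    echo ""
    echo "Securing stripped2-$src...."
    # Random 40-character alphanumeric owner password.
    local password
    password=$(LC_ALL=C tr -dc 'a-zA-Z0-9' < /dev/urandom | head -c 40)
    echo "Password will be: $password"
    echo ""
    qpdf --linearize --encrypt "" "$password" 128 --print=full --modify=none --extract=n --use-aes=y -- "stripped2-$src" "stripped-$src"
    rm "stripped2-$src"
    echo "Final status of stripped-$src"
    pdfinfo "stripped-$src"
}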
Once you've added the above to your .bash_aliases file, you'll need to run the below command (yes, there is a "." in there) so your current Bash session uses the changes you've added. The below command is only needed once; after that, all new Bash sessions will already have the .bash_aliases file loaded.
$ . ~/.bash_aliases
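One thing to be aware of: strip_pdf builds its output names by putting a prefix in front of the whole argument, so it expects a bare file name in the current directory rather than a path such as docs/report.pdf. If you want to pass paths, the output name could be derived along these lines. This is only a sketch, and stripped_path is a hypothetical helper that the function above does not use.

```shell
# Print the output path for a given input PDF: same directory as the input,
# with the "stripped-" prefix added to the file name only.
stripped_path() {
    local dir base
    dir=$(dirname -- "$1")
    base=$(basename -- "$1")
    printf '%s/stripped-%s\n' "$dir" "$base"
}
```

For example, stripped_path "docs/My File.pdf" prints docs/stripped-My File.pdf.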
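That should be it. Running strip_pdf My_Secrect_Document.pdf will now produce a cleaned and protected copy called stripped-My_Secrect_Document.pdf. Good luck keeping your metadata footprint down and your PDF files secured.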