How to Work With PDF Documents Using Python
I really admire Portable Document Format (PDF) files. I remember the days when such files solved the formatting issues that came up when exchanging documents, whether because of differences between Word versions or for other reasons.
We are mainly talking about Python here, aren't we? And we are interested in tying that to working with PDF documents. Well, you may say that's so simple, especially if you have used Python with text files (txt) before. But, it is a bit different here. PDF documents are binary files and more complex than just plaintext files, especially since they contain different font types, colors, etc.
That doesn't mean that it is hard to work with PDF documents using Python. It is rather simple, and using an external module solves the issue.
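To make the binary-versus-text point a bit more concrete, here is a small sketch that uses only the standard library. It writes a few bytes that imitate the start of a PDF file and reads them back in binary mode. The sample bytes and the helper names looks_like_pdf and is_plain_ascii are made up for this illustration and are not part of any library.

```python
import os
import tempfile

# Hypothetical stand-in for the first bytes of a real PDF: the "%PDF-" header
# followed by a comment line made of non-ASCII bytes.
FAKE_PDF_START = b"%PDF-1.4\n%\xe2\xe3\xcf\xd3\n"


def looks_like_pdf(path):
    """Return True if the file starts with the PDF header bytes."""
    with open(path, "rb") as f:  # 'rb' reads raw bytes, no text decoding
        return f.read(5) == b"%PDF-"


def is_plain_ascii(path):
    """Return True if every byte in the file decodes as ASCII text."""
    with open(path, "rb") as f:
        data = f.read()
    try:
        data.decode("ascii")
        return True
    except UnicodeDecodeError:
        return False


# Write the fake bytes to a temporary file, inspect it, then clean up.
with tempfile.NamedTemporaryFile(suffix=".pdf", delete=False) as tmp:
    tmp.write(FAKE_PDF_START)
    demo_path = tmp.name

header_ok = looks_like_pdf(demo_path)
ascii_ok = is_plain_ascii(demo_path)
os.remove(demo_path)

print(header_ok, ascii_ok)
```

You can point the two helpers at any PDF file you have. A file that passes the header check but is not plain ASCII is a good hint that it should be opened in binary mode.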
As we mentioned above, using an external module would be the key. The module we will be using in this tutorial is PyPDF2. As it is an external module, the first step we have to take is to install that module. For that, we will be using pip, which is (based on Wikipedia):
A package management system used to install and manage software packages written in Python. Many packages can be found in the Python Package Index (PyPI).
You can follow the steps mentioned in the Python Packaging User Guide for installing pip, but if you have Python 2.7.9 and higher, or Python 3.4 and higher, you already have pip.
PyPDF2 can now be installed by typing the following command (in Mac OS X's Terminal):
pip install pypdf2
Great! You now have PyPDF2 installed, and you're ready to start playing with PDF documents.
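If you want to double-check the installation from Python itself, something like the following sketch may help. It assumes Python 3.8 or higher (for importlib.metadata), and the helper names is_installed and installed_version are my own. The version is worth a look because the class and method names used in this tutorial (PdfFileReader, getNumPages(), and so on) belong to PyPDF2 releases before 3.0.

```python
import importlib.util
from importlib import metadata


def is_installed(module_name):
    """Return True if a top-level module with this name can be found."""
    return importlib.util.find_spec(module_name) is not None


def installed_version(distribution_name):
    """Return the installed version string, or None if it is not installed."""
    try:
        return metadata.version(distribution_name)
    except metadata.PackageNotFoundError:
        return None


# The import name is PyPDF2, even though pip accepts the lowercase "pypdf2".
print("PyPDF2 importable:", is_installed("PyPDF2"))
print("PyPDF2 version:", installed_version("PyPDF2"))
```

If the first line prints False, the pip command above most likely ran against a different Python installation than the one you are using now.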
Reading a PDF Document
The sample file we will be working with in this tutorial is sample.pdf. Go ahead and download the file to follow the tutorial, or you can simply use any PDF file you like.
Let's now go ahead and read the PDF document. Since we will be using PyPDF2, we first need to import the module. After importing it, we will be using the PdfFileReader class. So, the script for reading the PDF document looks as follows:
import PyPDF2

# PDF files are binary, so open the file in 'rb' mode
pdf_file = open('sample.pdf', 'rb')
read_pdf = PyPDF2.PdfFileReader(pdf_file)
More Operations on PDF Documents
After reading the PDF document, we can now carry out different operations on the document, as we will see in this section.
Number of Pages
number_of_pages = read_pdf.getNumPages()
print(number_of_pages)
In this case, the returned value will be 1, since sample.pdf has only one page.
Let's now check the number of some page in the PDF document. We can use the method getPageNumber(page). Notice that we have to pass an object of type page to the method. To retrieve a page, we will use the getPage(number) method, where number represents the page number in the PDF document. The argument number starts with the value 0.
Well, I know that when you use getPage(number) you already know the page number, but this is just to illustrate how to use those methods together. This can be demonstrated in the following script:
page = read_pdf.getPage(0)
page_number = read_pdf.getPageNumber(page)
print(page_number)
Go ahead, try the script. What output did you get?
We know that in sample.pdf (the file we are experimenting with), we only have one page (number 0). What if we passed the number 1 as the page number to getPage(number)? In this case, you will get the following error:
Traceback (most recent call last):
  File "test.py", line 6, in <module>
    page = read_pdf.getPage(1)
  File "/usr/local/lib/python2.7/site-packages/PyPDF2/pdf.py", line 1158, in getPage
    return self.flattenedPages[pageNumber]
IndexError: list index out of range
This is because we are using a page number that is out of range: the document has no page with that number.
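One way to avoid that IndexError is to compare the page number with the page count before asking for the page. Here is a minimal sketch of the idea. FakeReader is a hypothetical stand-in that only mimics the getNumPages() and getPage(number) methods used above, so the example runs without a PDF file, and get_page_or_none is a helper name I made up.

```python
class FakeReader(object):
    """Hypothetical stand-in for a PdfFileReader, used only for this demo."""

    def __init__(self, pages):
        self._pages = list(pages)

    def getNumPages(self):
        return len(self._pages)

    def getPage(self, number):
        return self._pages[number]


def get_page_or_none(reader, number):
    """Return the page at a zero-based number, or None if it is out of range."""
    if 0 <= number < reader.getNumPages():
        return reader.getPage(number)
    return None


one_page_doc = FakeReader(["first page"])
print(get_page_or_none(one_page_doc, 0))  # the only page in the document
print(get_page_or_none(one_page_doc, 1))  # out of range, so None
```

Since the helper only relies on those two methods, passing it the read_pdf object from the earlier script should work the same way.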
A PDF document comes with different page modes, which are as follows:
| Mode | Description |
|---|---|
| /UseNone | Do not show outlines or thumbnails panels |
| /UseOutlines | Show outlines (aka bookmarks) panel |
| /UseThumbs | Show page thumbnails panel |
| /UseOC | Show Optional Content Group (OCG) panel |
| /UseAttachments | Show attachments panel |
In order to check our page mode, we can use the following script:
page = read_pdf.getPage(0)
page_mode = read_pdf.getPageMode()
print(page_mode)
In the case of our PDF document (sample.pdf), the returned value is None, which means that the page mode is not specified. If you want to specify a page mode, you can use the setPageMode(mode) method of the PdfFileWriter class, where mode is one of the modes listed in the table above.
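To turn the returned value into something more readable, the modes from the table above can be kept in a dictionary. This is only a sketch: PAGE_MODES and describe_page_mode are names I made up, and the fallback messages are my own wording rather than anything PyPDF2 returns.

```python
# Descriptions copied from the table of page modes above.
PAGE_MODES = {
    "/UseNone": "Do not show outlines or thumbnails panels",
    "/UseOutlines": "Show outlines (aka bookmarks) panel",
    "/UseThumbs": "Show page thumbnails panel",
    "/UseOC": "Show Optional Content Group (OCG) panel",
    "/UseAttachments": "Show attachments panel",
}


def describe_page_mode(mode):
    """Return a readable description for a page mode value."""
    if mode is None:
        # This is the case we hit with sample.pdf.
        return "Page mode is not specified"
    return PAGE_MODES.get(mode, "Unknown page mode: %s" % mode)


print(describe_page_mode(None))
print(describe_page_mode("/UseOutlines"))
```

In the page mode script, you could then print describe_page_mode(page_mode) instead of the raw value.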
We have been wandering around the file so far, so let's see what's inside. The method extractText() will be our friend in this task.
Let me show you the full script this time, rather than only the lines required to perform the operation as I did above. The script to extract text from the PDF document is as follows:
import PyPDF2

# PDF files are binary, so open the file in 'rb' mode
pdf_file = open('sample.pdf', 'rb')
read_pdf = PyPDF2.PdfFileReader(pdf_file)
number_of_pages = read_pdf.getNumPages()
page = read_pdf.getPage(0)
page_content = page.extractText()
print(page_content)
I was surprised when I got the following output rather than the text in sample.pdf:
!"#$%#$%&%$&'()*%+,-%./01'*23%4 5'%1$#26%3/%7/))/8%&)/26%8#3"%3"*%313/9#&) %
This is most likely due to a font issue, such that the character codes map to other values. So it is sometimes an issue with the PDF document itself, as the PDF document might not contain the data required to restore the content.
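To get a feeling for how character codes can end up mapping to other values, here is a toy model. It is an assumption made for illustration, not a description of what PyPDF2 or sample.pdf actually does. Imagine an embedded font that gives each distinct character the next free code in order of first appearance, starting at 0x21, and a reader that has no table to map those codes back. The input string and all function names are hypothetical.

```python
def encode_by_first_appearance(text, first_code=0x21):
    """Give each distinct character the next free code, in order of appearance."""
    codes = {}
    encoded = []
    for ch in text:
        if ch not in codes:
            codes[ch] = first_code + len(codes)
        encoded.append(codes[ch])
    return encoded, codes


def read_codes_as_ascii(encoded):
    """What a reader sees if it treats the codes as ordinary ASCII."""
    return "".join(chr(code) for code in encoded)


def decode_with_map(encoded, codes):
    """Recover the text when the code-to-character table is available."""
    reverse = {code: ch for ch, code in codes.items()}
    return "".join(reverse[code] for code in encoded)


encoded, codes = encode_by_first_appearance("This is a sample")
garbled = read_codes_as_ascii(encoded)
restored = decode_with_map(encoded, codes)

print(garbled)   # punctuation-like characters instead of letters
print(restored)  # the original text, recovered through the table
```

Without the table, all that is left is a run of punctuation-like characters with the same look as the output above. That fits the font explanation, although it does not prove it.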
I thus tried another file, which is a paper of mine: paper.pdf. Go ahead and replace sample.pdf in the code with paper.pdf. The output in this case was:
But where is the rest of the text in the page? Well, the extractText() method does not seem to be perfect, and some improvements need to be made to it. The goal here, however, is to show you how to work with PDF files using Python.
As we can see, Python makes it simple to work with PDF documents. This tutorial just scratched the surface on this topic, and you can find more details on different operations you can perform on PDF documents on the PyPDF2 documentation page.
Source: Tuts Plus