Tutorial

This brief tutorial should give you an introduction and orientation to pikepdf’s paradigm and syntax. From there, we refer you to various topics.
Opening and saving PDFs
In contrast to better known PDF libraries, pikepdf uses a single object to
represent a PDF, whether reading, writing or merging. We have cleverly named
this pikepdf.Pdf. In this documentation, a Pdf is a class that
allows you to manipulate the PDF, meaning the file.
from pikepdf import Pdf
new_pdf = Pdf.new()
with Pdf.open('sample.pdf') as pdf:
    pdf.save('output.pdf')
You may of course use from pikepdf import Pdf as ... if the short class
name conflicts or from pikepdf import Pdf as PDF if you prefer uppercase.
pikepdf.open() is a shorthand for pikepdf.Pdf.open().
The Pdf class API follows the example of the widely-used
Pillow image library. For clarity,
there is no default constructor since the arguments used for creation and
opening are different. Pdf.open() also accepts seekable streams as input,
and Pdf.save() accepts streams as output.
Inspecting pages
Manipulating pages is fundamental to PDFs. pikepdf presents the pages in a PDF
through the pikepdf.Pdf.pages property, which follows the list
protocol. As such, page numbers begin at 0.
Let’s open a simple PDF that contains four pages.
In [1]: from pikepdf import Pdf
In [2]: pdf = Pdf.open('../tests/resources/fourpages.pdf')
How many pages?
In [3]: len(pdf.pages)
Out[3]: 4
pikepdf integrates with IPython and Jupyter’s rich object APIs so that you can view PDFs, PDF pages, or images within a PDF in an IPython window or Jupyter notebook. This makes it easier to test visual changes.
In [4]: pdf
Out[4]: « In Jupyter you would see the PDF here »
In [5]: pdf.pages[0]
Out[5]: « In Jupyter you would see an image of the PDF page here »
You can also examine individual pages, which we’ll explore in the next section. Suffice it to say that you can access pages by indexing and slicing them.
In [6]: pdf.pages[0]
Out[6]: « In Jupyter you would see an image of the PDF page here »
Note
pikepdf.Pdf.open() can open almost all types of encrypted PDF! Just
provide the password= keyword argument.
For more details on document assembly, see PDF split, merge and document assembly.
Pages are dictionaries
In PDFs, the main data structure is the dictionary, a key-value data
structure much like a Python dict or attrdict. The major difference is
that the keys can only be names, and the values can only be PDF types, including
other dictionaries.
PDF dictionaries are represented as pikepdf.Dictionary, and names
are of type pikepdf.Name. A page is just a dictionary with a few
required fields and a reference from the document’s “page tree”. (pikepdf manages
the page tree for you.)
In [7]: from pikepdf import Pdf
In [8]: example = Pdf.open('../tests/resources/congress.pdf')
In [9]: page1 = example.pages[0]
repr() output
Let’s examine the page’s repr() output:
In [10]: page1
Out[10]:
<pikepdf.Dictionary(type_="/Page")({
  "/Contents": pikepdf.Stream(stream_dict={
      "/Length": 50
    }, data=<...>),
  "/MediaBox": [ 0, 0, 200, 304 ],
  "/Parent": <reference to /Pages>,
  "/Resources": {
    "/XObject": {
      "/Im0": pikepdf.Stream(stream_dict={
          "/BitsPerComponent": 8,
          "/ColorSpace": "/DeviceRGB",
          "/Filter": [ "/DCTDecode" ],
          "/Height": 1520,
          "/Length": 192956,
          "/Subtype": "/Image",
          "/Type": "/XObject",
          "/Width": 1000
        }, data=<...>)
    }
  },
  "/Type": "/Page"
})>
The angle brackets in the output indicate that this object cannot be constructed
with a Python expression because it contains a reference. When angle brackets
are omitted from the repr() of a pikepdf object, then the object can be
replicated with a Python expression, such that eval(repr(x)) == x. Pages
typically contain indirect references to themselves and other pages, so they
cannot be represented as an expression.
In Jupyter and IPython, pikepdf will instead attempt to display a preview of the PDF page, assuming a PDF rendering backend is available.
Item and attribute notation
Dictionary keys may be looked up using attributes (page1.MediaBox) or
keys (page1['/MediaBox']).
In [11]: page1.MediaBox # preferred notation for required names
Out[11]: pikepdf.Array([ 0, 0, 200, 304 ])
In [12]: page1['/MediaBox'] # also works