Working with images
PDFs embed images as binary stream objects within the PDF’s data stream. The stream object’s dictionary describes properties of the image such as its dimensions and color space. The same image may be drawn multiple times on multiple pages, at different scales and positions.
In some cases, such as JPEG2000, the image's standard file format is used verbatim, even though that format contains headers and information that are repeated in the stream dictionary. In other cases, such as PNG-style encoding, the image file format is not used directly.
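As a rough illustration of this distinction, the stream dictionary's /Filter entry indicates how the image data is encoded. The sketch below maps a few filter names from the PDF specification to informal descriptions. The helper names describe_encoding and describe_image are hypothetical, the file path and image name are placeholders, and the sketch assumes that PdfImage.filters yields filter names such as '/JPXDecode'.

```python
# Filter names are defined by the PDF specification; the notes are informal.
FILTER_NOTES = {
    "/DCTDecode": "JPEG data stored verbatim",
    "/JPXDecode": "JPEG2000 data stored verbatim",
    "/FlateDecode": "zlib-compressed samples (PNG-style), not a PNG file",
}


def describe_encoding(filters):
    """Summarize a list of filter names such as ['/JPXDecode']."""
    if not filters:
        return "no filter (raw samples)"
    notes = []
    for name in filters:
        key = str(name)
        notes.append(FILTER_NOTES.get(key, f"other filter {key}"))
    return " + ".join(notes)


def describe_image(pdf_path, page_index=0, name="/Im0"):
    """Hypothetical helper: describe the encoding of one image in a file."""
    # Imported here so describe_encoding() works without pikepdf installed.
    from pikepdf import Pdf, PdfImage

    with Pdf.open(pdf_path) as pdf:
        raw = pdf.pages[page_index].images[name]
        # Assumption: PdfImage.filters lists the stream's filter names.
        return describe_encoding(PdfImage(raw).filters)
```

For example, describe_image('input.pdf') would report the encoding of /Im0 on the first page, if an image with that name exists there.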
pikepdf currently has no facility to embed new images into PDFs; we recommend img2pdf for that, because it does the job so well. What pikepdf offers instead is image inspection and "pdf2img" extraction that is lossless and transcode-free where possible.
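A minimal sketch of the "pdf2img" direction, assuming PdfImage.extract_to(fileprefix=...) writes the image to disk and returns the path it chose, including the file extension. The function names output_prefix and extract_images, and the page-based naming scheme, are illustrative choices rather than part of pikepdf.

```python
import os


def output_prefix(outdir, page_number, name):
    """Build a file prefix such as 'out/page1_Im0' from a name like '/Im0'."""
    return os.path.join(outdir, f"page{page_number}_{str(name).lstrip('/')}")


def extract_images(pdf_path, outdir):
    """Write every image on every page into outdir; return the paths written."""
    # Imported here so output_prefix() stays usable without pikepdf installed.
    from pikepdf import Pdf, PdfImage

    os.makedirs(outdir, exist_ok=True)
    written = []
    with Pdf.open(pdf_path) as pdf:
        for page_number, page in enumerate(pdf.pages, start=1):
            for name in page.images.keys():
                pdfimage = PdfImage(page.images[name])
                prefix = output_prefix(outdir, page_number, name)
                # Assumption: extract_to() appends a suitable extension
                # and returns the resulting file path.
                written.append(pdfimage.extract_to(fileprefix=prefix))
    return written
```

Calling extract_images('input.pdf', 'out') would then return a list of the files written. Images that pikepdf cannot extract may raise an exception, which this sketch does not handle.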
Playing with images
pikepdf provides a helper class PdfImage for manipulating images in a PDF. The helper class helps manage the complexity of the image dictionaries.
In [1]: from pikepdf import Pdf, PdfImage, Name
In [2]: example = Pdf.open('../tests/resources/congress.pdf')
In [3]: page1 = example.pages[0]
In [4]: list(page1.images.keys())
Out[4]: ['/Im0']
In [5]: rawimage = page1.images['/Im0'] # The raw object/dictionary
In [6]: pdfimage = PdfImage(rawimage)
In [7]: pdfimage
Out[7]: <pikepdf.PdfImage image mode=RGB size=1000x1520 at 0x7fe48794d050>
In Jupyter (or IPython with a suitable backend) the image will be displayed.
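The session above looks at one image on one page. Because the same image object can be referenced from several pages, it can be useful to group page image entries by the underlying object. The sketch below does this with the objgen attribute of each raw image object. The function names image_usage and shared_images are hypothetical, and a resource entry only shows that a page references an image, not that the page's content stream actually draws it.

```python
from collections import defaultdict


def image_usage(pdf):
    """Map each image's (object id, generation) to the places that reference it."""
    usage = defaultdict(list)
    for page_number, page in enumerate(pdf.pages, start=1):
        for name in page.images.keys():
            raw = page.images[name]
            # Two entries with the same objgen refer to the same PDF object.
            usage[raw.objgen].append((page_number, str(name)))
    return dict(usage)


def shared_images(pdf):
    """Keep only the images referenced from more than one place."""
    return {key: refs for key, refs in image_usage(pdf).items() if len(refs) > 1}
```

With an open file, for instance the example object from the session above, shared_images(example) returns an empty dictionary when no image is reused.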
You can also inspect the properties of the image. The parameters are similar to Pillow’s.
In [8]: pdfimage.colorspace
Out[8]: '/DeviceRGB'
In [9]: pdfimage.width, pdfimage.height