Working with content streams¶
A content stream is a stream object associated with either a page or a Form XObject that describes where and how to draw images, vectors, and text.
Content streams are binary data that can be thought of as a list of operators and zero or more operands. Operands are given first, followed by the operator. It is a stack-based language based loosely on PostScript, but without any programmable features. There are no variables, loops or conditionals.
A typical example is as follows (with additional whitespace):
q # push graphics stack
100 0 0 100 0 0 cm # set current transformation matrix
/Image1 Do # draw the object named /Image1 from the /Resources dictionary
Q # pop graphics stack
pikepdf provides a C++ optimized content stream parser and a filter. The parser is best used for reading and interpreting content streams; the filter is best used for rewriting them.
In [1]: pdf = pikepdf.open("../tests/resources/congress.pdf")
In [2]: page = pdf.pages[0]
In [3]: for operands, operator in pikepdf.parse_content_stream(page):
...: print("Operands {}, operator {}".format(operands, operator))
...:
Operands [], operator q
Operands [Decimal('200.0000'), 0, 0, Decimal('304.0000'), Decimal('0.0000'), Decimal('0.0000')], operator cm
Operands [pikepdf.Name("/Im0")], operator Do
Operands [], operator Q
Extracting text from PDFs¶
If you guessed that the content streams were the place to look for text inside a PDF – you’d be correct. Unfortunately, extracting the text is fairly difficult because content stream actually specifies as a font and glyph numbers to use. Sometimes, there is a 1:1 transparent mapping between Unicode numbers and glyph numbers, and dump of the content stream will show the text. In general, you cannot rely on there being a transparent mapping; in fact, it is perfectly legal for a font to specify no Unicode mapping at all, or to use an unconventional mapping (when a PDF contains a subsetted font for example).
We strongly recommend against trying to scrape text from the content stream.
pikepdf does not currently implement text extraction. We recommend pdfminer.six, a read-only text extraction tool. If you wish to write PDFs containing text, consider reportlab.