Why is the extraction of text from a PDF document such a hassle?
When I use a text editing tool such as Microsoft Word, it is quite natural that I can select a portion of text, copy it to the clipboard, and paste it into a window of any other tool. Not so with PDF, or at least not with every kind of PDF document. Why is that?
In PDF, as in other document formats, text is based on fonts. A font contains, among other things, a collection of characters that can be used to assemble text. PDF supports various font formats such as Type 1, CFF, TrueType and OpenType. Fonts may be embedded in the document file or merely referred to by name.
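Whether a font is embedded can be checked programmatically. The following is a minimal sketch using the pypdf library (the file name sample.pdf is a placeholder); an embedded font carries its font program in a /FontFile, /FontFile2 or /FontFile3 entry of its font descriptor:

```python
from pypdf import PdfReader

reader = PdfReader("sample.pdf")
for page_no, page in enumerate(reader.pages, start=1):
    resources = page.get("/Resources")
    if resources is None:
        continue
    fonts = resources.get_object().get("/Font", {})
    for name, font in fonts.items():
        font = font.get_object()
        # For composite (Type 0) fonts the descriptor sits in the descendant font.
        if font.get("/Subtype") == "/Type0":
            font = font["/DescendantFonts"][0].get_object()
        descriptor = font.get("/FontDescriptor")
        embedded = descriptor is not None and any(
            key in descriptor.get_object()
            for key in ("/FontFile", "/FontFile2", "/FontFile3")
        )
        print(page_no, name, font.get("/BaseFont"),
              "embedded" if embedded else "not embedded")
```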
In a TrueType font, each character is associated with a Unicode value, a standardized number that describes the meaning of a character independently of its appearance; for example, the letter "a" in a regular, italic or bold style always has the same Unicode value but a different appearance. In a font, the description of the appearance of a character is called a glyph. A Microsoft Word document stores text as Unicode values. PDF, in contrast, selects a character of an embedded font by its glyph number. The glyph number is local to the font and only valid in conjunction with that particular font.
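To make extraction possible despite this indirection, a font in a PDF file can carry an optional ToUnicode CMap that maps the character codes used on the page back to Unicode values. A simplified excerpt of such a CMap (the codes and values here are invented for illustration) looks like this:

```
2 beginbfchar
<0048> <0041>   % character code 0x48 means U+0041 ("A")
<0049> <0042>   % character code 0x49 means U+0042 ("B")
endbfchar
```

A text extractor reads the codes from the page content and translates them through this table; a renderer never needs it.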
This architecture has some advantages: glyphs can be numbered without regard to the Unicode standard, different appearances of the same character can be bundled in the same font, glyphs can be used without knowing their Unicode values, and so on. However, there are also disadvantages.
In order to reduce the size of a PDF file, some producer programs omit the Unicode mapping information that associates the glyphs with their meanings. Text extraction from such documents is therefore inhibited. One might think that at least these kinds of documents can be detected and processed accordingly. But even this is not true in general.
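A first, rough detection heuristic is to look for fonts that carry neither a ToUnicode CMap nor an encoding entry. A sketch along the lines of the pypdf example above, again with sample.pdf as a placeholder:

```python
from pypdf import PdfReader

reader = PdfReader("sample.pdf")
suspect = set()
for page in reader.pages:
    resources = page.get("/Resources")
    if resources is None:
        continue
    for font in resources.get_object().get("/Font", {}).values():
        font = font.get_object()
        # Without a /ToUnicode CMap and without an /Encoding entry,
        # an extractor has no way to recover the meaning of the glyphs.
        if "/ToUnicode" not in font and "/Encoding" not in font:
            suspect.add(str(font.get("/BaseFont", "unnamed")))

print("Fonts without any Unicode mapping information:", suspect or "none")
```

Such a structural check is only a necessary condition, though: as described next, the mapping may be present but wrong, which no inspection of the file structure can reveal.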
There is producer software on the market that creates PDF documents with correct glyph selection information but wrong or misleading Unicode information. Such documents look as if the Unicode values of all the characters used were present, but the association between a character's appearance and its meaning is wrong. In this case the extracted text comes out as garbage.
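A toy illustration of the effect (the codes and mappings are invented): rendering uses only the glyph codes, so the page looks identical in both cases, while extraction uses the mapping and silently produces nonsense when the mapping is wrong.

```python
codes = [0x48, 0x65, 0x6C, 0x6C, 0x6F]  # glyph codes drawn on the page

correct = {0x48: "H", 0x65: "e", 0x6C: "l", 0x6F: "o"}  # faithful mapping
wrong   = {0x48: "q", 0x65: "7", 0x6C: "#", 0x6F: "z"}  # scrambled mapping

# The renderer draws glyphs by code; both documents display "Hello".
# A text extractor, however, sees only the mapping:
print("".join(correct[c] for c in codes))  # -> Hello
print("".join(wrong[c] for c in codes))    # -> q7##z
```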
A standard such as PDF/A-2u, which requires that all text can be mapped to Unicode, is particularly affected: it does not guarantee that the Unicode mapping is correct, although the text appears meaningful when it is displayed in reader software. In general, not even validator software can detect such a situation.
The only automated, and to some extent reliable, way to find out whether a document contains extractable text is to render the document, run the result through an OCR engine, and compare the recognized text with the text that extraction produces.
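A minimal sketch of such a cross-check, assuming the pdf2image package (which needs Poppler) and pytesseract (which needs Tesseract) are installed; sample.pdf is again a placeholder and only the first page is checked:

```python
from difflib import SequenceMatcher

import pytesseract
from pdf2image import convert_from_path
from pypdf import PdfReader

# Text as an extractor sees it (via the Unicode mappings in the file).
extracted = PdfReader("sample.pdf").pages[0].extract_text() or ""

# Text as a human sees it (render the page, then recognize the pixels).
image = convert_from_path("sample.pdf", dpi=300, first_page=1, last_page=1)[0]
recognized = pytesseract.image_to_string(image)

def normalize(s: str) -> str:
    # Compare after collapsing whitespace; layout differences are expected.
    return " ".join(s.split())

ratio = SequenceMatcher(None, normalize(extracted), normalize(recognized)).ratio()
print(f"similarity: {ratio:.2f}")  # a low value hints at missing or wrong mappings
```

Since OCR itself is not error-free, the comparison yields a confidence score rather than a hard verdict.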