Does OCR make sense for digitally generated PDFs?
Scanned PDF files usually consist of one raster image for each page. The OCR engine can recognize the text in this image and make the document searchable. But what about digitally generated documents?
Born-digital documents contain individually generated content objects, such as text, geometric figures and raster images. The objects are often overlaid using transparency and may use spot colors for printing. In addition, the documents may be enriched with structural information such as articles, reading direction and tags (title, paragraph, header, footer, etc.).
In many cases, the text is embedded so that it is machine-readable. However, it is not uncommon for this information to be missing. In such documents the text is often embedded in the form of geometric lines and curves or as part of a raster image.
A naive approach would be to rasterize the whole page and then pass it to the OCR engine. However, this would discard the details of the digitally generated page, such as its text objects, vector graphics and structural information. It is therefore worthwhile to take a different approach.
A good OCR tool for digitally generated PDF files can enrich unreadable fonts with Unicode information, recognize text in embedded images, and even create missing structure information, thus preparing the document for PDF/A conformance level A. Furthermore, the tool should be able to recognize barcodes and QR codes and write their content to the document's metadata. With all these features, the tool may serve as an essential component of a Robotic Process Automation (RPA) solution.
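The object-by-object idea can be illustrated with a minimal sketch. It assumes a simplified, hypothetical page model (`ContentObject`, `Kind`, `Action` and `plan_page` are names invented for this illustration, not part of any product API) and only decides which treatment each content object would need. It does not call an OCR engine.

```python
from dataclasses import dataclass
from enum import Enum, auto


class Kind(Enum):
    """Simplified content object types found on a PDF page (assumed model)."""
    TEXT = auto()    # text drawn with a font
    VECTOR = auto()  # geometric lines and curves, possibly outlined text
    IMAGE = auto()   # embedded raster image


class Action(Enum):
    """Treatment planned for one content object (hypothetical names)."""
    KEEP = auto()         # already machine-readable, leave untouched
    ADD_UNICODE = auto()  # recognize glyphs to recover Unicode information
    OCR_IMAGE = auto()    # recognize text inside the embedded image
    OCR_VECTOR = auto()   # check whether the outlines form text


@dataclass
class ContentObject:
    kind: Kind
    has_unicode: bool = False  # only meaningful for TEXT objects


def plan_object(obj: ContentObject) -> Action:
    """Choose a treatment for a single object instead of rasterizing the page."""
    if obj.kind is Kind.TEXT:
        return Action.KEEP if obj.has_unicode else Action.ADD_UNICODE
    if obj.kind is Kind.IMAGE:
        return Action.OCR_IMAGE
    return Action.OCR_VECTOR


def plan_page(objects: list[ContentObject]) -> list[Action]:
    """Plan every object on a page, preserving the original object order."""
    return [plan_object(obj) for obj in objects]


if __name__ == "__main__":
    page = [
        ContentObject(Kind.TEXT, has_unicode=True),
        ContentObject(Kind.TEXT),
        ContentObject(Kind.IMAGE),
        ContentObject(Kind.VECTOR),
    ]
    for obj, action in zip(page, plan_page(page)):
        print(obj.kind.name, "->", action.name)
```

The point of the sketch is that readable text is left alone, so only the objects that actually lack machine-readable text would be handed to the OCR engine.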
Of course, such a tool should be able to handle scanned, born-digital and mixed files. As usual, scanned pages are straightened and cleaned of stains, and the recognized text is placed invisibly on top of the image, making the page as searchable as a digitally generated document.
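How a file might be sorted into these three groups can be sketched as follows. This is a simple heuristic under stated assumptions, not the method of any particular tool: a page counts as scanned if it consists of exactly one raster image that covers most of the page and nothing else. `PageSummary`, `is_scanned`, `classify_file` and the 0.9 coverage threshold are all hypothetical.

```python
from dataclasses import dataclass


@dataclass
class PageSummary:
    """Per-page statistics, assumed to come from a PDF parser (hypothetical)."""
    image_count: int         # number of raster images on the page
    image_coverage: float    # share of the page area covered by the largest image, 0..1
    other_object_count: int  # number of text and vector objects


def is_scanned(page: PageSummary, min_coverage: float = 0.9) -> bool:
    """Heuristic: one image covering most of the page and no other content."""
    return (
        page.image_count == 1
        and page.image_coverage >= min_coverage
        and page.other_object_count == 0
    )


def classify_file(pages: list[PageSummary]) -> str:
    """Return 'scanned', 'born-digital' or 'mixed' for a whole file."""
    if not pages:
        raise ValueError("a file needs at least one page")
    flags = [is_scanned(page) for page in pages]
    if all(flags):
        return "scanned"
    if not any(flags):
        return "born-digital"
    return "mixed"


if __name__ == "__main__":
    scan = PageSummary(image_count=1, image_coverage=1.0, other_object_count=0)
    digital = PageSummary(image_count=2, image_coverage=0.3, other_object_count=40)
    print(classify_file([scan, digital]))
```

A real tool would likely need more signals than this, for example pages that already carry an invisible text layer from an earlier OCR run.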
With the 3-Heights™ PDF OCR Tool, we have created such a tool. As part of the 3-Heights™ PDF Quality Gate solution, it ensures that documents are enriched for further processing. The 3-Heights™ PDF OCR Tool also optimizes the number of calls to the OCR engine to keep license costs low and increase performance.