Scan to PDF/A - some insights
Traditionally a scanner produces a TIFF or JPEG image for each page. Some scanners can directly produce PDF files, and newer devices produce files conforming to the PDF/A standard. However, the quality of the produced files differs significantly. Why is this, and why is it worth using a central scan server?
Of course, the scan to PDF conversion process is not just about embedding an image in a PDF envelope. It can also involve text and barcode recognition, and the embedding of metadata and digital signatures. But in this article I'd like to concentrate on image data compression, which is marketed as a main advantage of PDF/A over TIFF. It is said that PDF/A is better because it offers more advanced compression mechanisms than TIFF. So, let us have a closer look at this particular topic.
One of the main requirements in the scan to PDF/A conversion process is to reduce the file size. A smaller size is often achieved at the price of lower quality. There are some factors which influence the quality / size ratio:
Color vs. Gray vs. Black / White
Choice of compression algorithm (lossless vs. lossy)
Multi vs. single page
MRC (Mixed Raster Content) mechanism
The most widely used bi-tonal (black and white) compression algorithms are G4 (standard name ITU-T T.6) and JBIG2. G4 is lossless, whereas JBIG2 can be operated in lossless and lossy mode. In order to achieve a better compression rate, lossy JBIG2 may store symbols such as text characters in a table and reuse them. The symbol table can save a significant amount of space, especially in multi-page documents, since it can be shared across all pages. The downside of this mechanism is that it may unexpectedly mix up similar-looking symbols. That is why the lossy mode of JBIG2 is often disabled. But even in lossless mode, JBIG2 generally achieves a better compression rate than G4.
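To get a feeling for how well G4 compresses a typical bi-tonal page, here is a minimal sketch using the Pillow imaging library. The synthetic page is an assumption for demonstration only; note that Pillow does not encode JBIG2, so a dedicated encoder such as jbig2enc would be needed for that.

```python
from PIL import Image, ImageDraw
import io

# Synthetic bi-tonal page: a white A4-sized canvas at ~300 dpi with black
# bars standing in for lines of text (mode "1" = 1 bit per pixel).
page = Image.new("1", (2480, 3508), 1)
draw = ImageDraw.Draw(page)
for y in range(200, 3300, 120):
    draw.line([(200, y), (2280, y)], fill=0, width=3)

raw_size = page.width * page.height // 8  # uncompressed size at 1 bit/pixel

buf = io.BytesIO()
page.save(buf, format="TIFF", compression="group4")  # lossless G4 (ITU-T T.6)
g4_size = buf.tell()

print(f"raw: {raw_size} bytes, G4: {g4_size} bytes")
```

For a sparse page like this, G4 typically shrinks the data by orders of magnitude; real scanned pages with dense text compress less dramatically but still very well.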
For gray and color images the most commonly used algorithms are JPEG and JPEG2000. JPEG can only be used in lossy mode, whereas JPEG2000, again, can be used in both modes. In lossy mode, both algorithms offer a parameter which controls the quality / size ratio. Although JPEG2000 is more modern, it cannot simply be called 'better' than JPEG. Measurements show that at higher quality settings JPEG2000 achieves better compression rates, whereas at lower quality settings JPEG is generally better. The quality loss introduces image artifacts such as shadows, which are typical for both algorithms. JPEG has an additional artifact called blocking. It originates in the subdivision of the image into 8 x 8 pixel blocks which are compressed independently. In addition, the JPEG algorithm usually reduces the resolution of the chromaticity signal by a factor of 2 with respect to the luminosity signal, which increases the compression rate but amplifies the blocking artifacts.
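The effect of the quality parameter can be demonstrated in a few lines of Python using Pillow. The gradient image and the quality values are arbitrary assumptions for illustration; Pillow's JPEG2000 support depends on an OpenJPEG build, so this sketch sticks to JPEG.

```python
from PIL import Image
import io

# Synthetic color image: a smooth RGB gradient.
w, h = 512, 512
img = Image.new("RGB", (w, h))
img.putdata([(x % 256, y % 256, (x + y) % 256)
             for y in range(h) for x in range(w)])

def jpeg_size(image, quality):
    """Encode as JPEG at the given quality setting and return the byte count."""
    buf = io.BytesIO()
    image.save(buf, format="JPEG", quality=quality)
    return buf.tell()

high_q = jpeg_size(img, 90)  # better quality, larger file
low_q = jpeg_size(img, 30)   # stronger compression, more visible artifacts
print(f"quality 90: {high_q} bytes, quality 30: {low_q} bytes")
```

The lower quality setting trades visible artifacts for a smaller file; where the break-even point against JPEG2000 lies depends on the image content.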
When converting color scans to PDF, often some sort of mixed raster content mechanism is used. MRC separates the color information into layers: a background layer, a mask layer and a number of foreground layers. A typical example is a page that contains black text with some words emphasized in red and blue. The mask then contains the shapes of the characters, and the background layer the color of the text. It is obvious that the mask can be efficiently compressed with G4 or JBIG2, and the background layer with either JPEG or JPEG2000 at a very low resolution. Using this mechanism, a scanned page can be reduced to approximately 40 KB with good quality. This result cannot be achieved by just using a lossy compression algorithm. However, if the page contains graphics or images, then these have to be isolated and compressed with good quality in one or several foreground layers. This isolation process is called segmentation, and it is an essential part of the MRC mechanism.
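A very crude sketch of the layer separation could look like the following Python snippet. It only derives a mask and a low-resolution background; the threshold and down-sampling factor are arbitrary assumptions, and real MRC additionally performs segmentation to place graphics in foreground layers.

```python
from PIL import Image

def mrc_layers(page, threshold=128, bg_scale=8):
    """Split a scanned color page into a bi-tonal mask (character shapes)
    and a heavily down-sampled background layer (text / paper color).
    Real MRC adds segmentation and foreground layers for images."""
    gray = page.convert("L")
    # Mask: black (0) where the page is dark, i.e. where there is 'ink'.
    mask = gray.point(lambda v: 0 if v < threshold else 255, mode="1")
    # Background: low-resolution color layer, cheap to compress with JPEG.
    bg = page.resize((max(1, page.width // bg_scale),
                      max(1, page.height // bg_scale)))
    return mask, bg

# Toy page: off-white paper with a dark red "text" block.
page = Image.new("RGB", (800, 1000), (250, 250, 245))
page.paste((120, 20, 20), (100, 100, 700, 140))
mask, bg = mrc_layers(page)
print(mask.mode, bg.size)
```

The mask would then be compressed with G4 or JBIG2 and the tiny background with JPEG or JPEG2000, which is where the large size reduction comes from.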
Now, after reviewing the various compression schemes, it is time to discuss them in the context of archiving systems. The file size is often the most important issue, but not always. In many scenarios the display speed is the crucial issue. And with respect to this requirement, JPEG2000 has often proved too slow, especially when combined with an MRC mechanism. As we have seen, JPEG is better at higher compression rates, so why not use it at least for the background layer? The disturbing blocking artifacts can be reduced by disabling the down-sampling of the chromaticity signal. A bigger problem is that many scanners deliver color images in JPEG compression only, which significantly reduces the effectiveness of server-based compression software: the JPEG artifacts make segmentation and MRC compression much more difficult. But why not use the scanner's built-in image-to-PDF conversion feature? It may be useful in a personal environment, but in enterprise applications there are many reasons to use a central server. The most important are better quality, smaller file sizes, better OCR quality, and post-scan processing steps.
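The chroma down-sampling mentioned above can be switched off in most JPEG encoders; in Pillow the `subsampling` save parameter controls it. The synthetic image and quality value are assumptions for illustration:

```python
from PIL import Image
import io

# Colorful synthetic image so that the chroma channels actually carry data.
w, h = 512, 512
img = Image.new("RGB", (w, h))
img.putdata([((x * 7) % 256, (y * 5) % 256, (x ^ y) % 256)
             for y in range(h) for x in range(w)])

def jpeg_size(image, subsampling):
    """Encode as JPEG; '4:2:0' halves the chroma resolution in both
    directions, '4:4:4' keeps it at full resolution."""
    buf = io.BytesIO()
    image.save(buf, format="JPEG", quality=75, subsampling=subsampling)
    return buf.tell()

downsampled = jpeg_size(img, "4:2:0")   # smaller, more blocking at color edges
full_chroma = jpeg_size(img, "4:4:4")   # larger, smoother color detail
print(f"4:2:0: {downsampled} bytes, 4:4:4: {full_chroma} bytes")
```

Keeping full chroma resolution costs some file size, which is the trade-off behind the suggestion to disable the down-sampling for the background layer.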
And, last but not least: is PDF/A better than TIFF? The answer is definitely yes! But not with respect to compression. TIFF offers essentially the same compression algorithms as PDF/A does. The real strength of PDF/A is that it provides for the embedding of color profiles, metadata and optically recognized text in a standardized manner. Furthermore, PDF/A is a uniform standard for scanned as well as born-digital documents.