The discipline of converting PDF to PDF/A
We all know that converting one file format to another is not as easy as one might wish and can lead to unpleasant surprises. It is much less widely known, however, that this also holds for the conversion from PDF to PDF/A. Why is that?
PDF/A is a subset of PDF. That much is obvious. Typical examples of its restrictions are that fonts must be embedded and colors must be calibrated. Less well known is that the PDF/A standard also adds further, more stringent rules. One example is that text characters are not allowed to refer to the .notdef glyph.
PDF/A was designed with document creation in mind, not conversion. Nevertheless, a PDF to PDF/A converter must generate a new PDF file that follows the rules of the standard. On the one hand, this is often not easy. On the other hand, there are many possible mappings and strategies to choose from. Here are some examples.
Uncalibrated color spaces can easily be replaced with calibrated ones by choosing an ICC color profile for each of the device-dependent color spaces DeviceGray, DeviceRGB and DeviceCMYK.
It is not necessary to introduce an output intent if none is present in the input file. However, if the input file already has an output intent profile, e.g. a CMYK profile, then it is advisable to keep it together with the device-dependent colors that refer to it.
Embedding missing font programs is easy only if the original font is available, which is often not the case. If the font program is not available, it has to be replaced by one whose appearance is as close as possible to the original. Viewer applications have built-in font replacement strategies. A converter should follow these strategies as well, since the resulting file should look the same regardless of whether the fonts are embedded or not.
If transparency is prohibited, such as with PDF/A-1, then the converter must perform some sort of transparency flattening or refuse the file if it cannot.
With prohibited features such as JavaScript, multimedia content, certain kinds of actions etc., the converter has the option of removing the features or of refusing the file if the user does not want them removed.
Text characters that map to the .notdef glyph can be remapped to a new glyph which is a copy of the .notdef glyph.
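The color space choice described above can be sketched in a few lines of Python. This is only an illustration under simplifying assumptions: the function name plan_color_space, the profile file names and the idea of describing the output intent by the name of its matching device color space are all hypothetical, and a real converter has to consider many more cases.

```python
from typing import Optional

# Hypothetical ICC profile file names, one per device-dependent color space.
# A real converter would let the user choose these profiles.
DEFAULT_PROFILES = {
    "DeviceGray": "gray.icc",
    "DeviceRGB": "srgb.icc",
    "DeviceCMYK": "cmyk.icc",
}


def plan_color_space(device_space: str, output_intent: Optional[str] = None) -> str:
    """Decide what to do with one device-dependent color space.

    output_intent is the device color space that the input file's output
    intent profile corresponds to (e.g. "DeviceCMYK"), or None if the input
    file has no output intent. This is a simplification for illustration.
    """
    if device_space not in DEFAULT_PROFILES:
        raise ValueError(f"not a device-dependent color space: {device_space}")
    if output_intent == device_space:
        # The existing output intent already calibrates these colors: keep them.
        return "keep"
    # Otherwise replace the color space with an ICC-based one.
    return "ICCBased:" + DEFAULT_PROFILES[device_space]


if __name__ == "__main__":
    # Input file with a CMYK output intent that also uses DeviceRGB colors.
    for space in ("DeviceGray", "DeviceRGB", "DeviceCMYK"):
        print(space, "->", plan_color_space(space, output_intent="DeviceCMYK"))
```

In this sketch the DeviceCMYK colors are kept because the output intent covers them, while DeviceGray and DeviceRGB are mapped to ICC-based color spaces.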
The above list can only give an impression; there are certainly many more cases than shown here. There are also further tasks that a converter must perform. Here are some of them:
Pre-validation: If the input document already conforms to the requested standard, there is no need to perform the conversion. This is particularly useful with digitally signed documents, since modifying such a file would invalidate its signature.
Post-validation: After the conversion the user wants to verify that the converted result conforms to the requested standard.
Repair: The user expects that the converter repairs the input file if it contains minor corruptions such as missing mandatory dictionary entries, damaged cross-reference tables etc.
Return status: A fine-grained return status value allows for a well-designed, user-controlled conversion process.
Log file: An indispensable means of locating and eliminating conversion problems.
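How these tasks fit together can be sketched as a small driver in Python. It is a minimal sketch, not a real converter: the validate and convert callables, the Status values and the logger name are assumptions made for illustration, and the repair step is assumed to happen inside convert.

```python
import enum
import logging
from typing import Callable, Tuple

log = logging.getLogger("pdfa_sketch")  # hypothetical logger name


class Status(enum.Enum):
    """Fine-grained return status (illustrative values only)."""
    ALREADY_CONFORMING = 0
    CONVERTED = 1
    CONVERSION_FAILED = 2
    POST_VALIDATION_FAILED = 3


def run_conversion(
    data: bytes,
    validate: Callable[[bytes], bool],  # hypothetical validator
    convert: Callable[[bytes], bytes],  # hypothetical converter (incl. repair)
) -> Tuple[Status, bytes]:
    # Pre-validation: leave conforming input untouched (e.g. signed files).
    if validate(data):
        log.info("input already conforms, no conversion needed")
        return Status.ALREADY_CONFORMING, data
    try:
        result = convert(data)
    except Exception as exc:
        log.error("conversion failed: %s", exc)
        return Status.CONVERSION_FAILED, data
    # Post-validation: verify the converted result before handing it back.
    if not validate(result):
        log.error("converted file does not conform to the requested standard")
        return Status.POST_VALIDATION_FAILED, result
    log.info("conversion succeeded")
    return Status.CONVERTED, result


if __name__ == "__main__":
    logging.basicConfig(level=logging.INFO)
    # Toy stand-ins: a "conforming" file is one that starts with b"OK".
    status, _ = run_conversion(b"x", lambda d: d.startswith(b"OK"), lambda d: b"OK" + d)
    print(status.name)
```

Passing the validator and converter in as parameters keeps the control flow separate from any particular PDF library, and the status value lets the calling process decide how to react to each outcome.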