
Extract XML from PDF workflow

Workflow identifier

This workflow is identified as Extraction in the Conversion Service.

This workflow extracts the OCR information embedded in a PDF and outputs it as XML. If the PDF contains no embedded OCR information, the workflow outputs an empty <document> tag.

The workflow supports these features:

  • Extraction of OCR-related XML information from PDF documents
  • Simple configuration with minimal options
  • XML output with document data, or an empty <document> tag if no embedded OCR data is found

Supported file formats for Extract XML from PDF workflow

This workflow supports these file formats:

Content type | File type
PDF formats | PDF 1.x, PDF 2.0, PDF/A-1, PDF/A-2, PDF/A-3

info

If the input PDF does not contain embedded OCR data, the workflow outputs an XML file with an empty <document> tag.
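
A downstream script may want to tell an empty result apart from one that carries OCR data. The following is a minimal sketch, assuming the output root element is named document as described above. The helper name has_ocr_data and the page element in the sample are hypothetical and not part of the workflow.

```python
import xml.etree.ElementTree as ET


def has_ocr_data(xml_text: str) -> bool:
    """Return True if the <document> root contains child elements or text."""
    root = ET.fromstring(xml_text)
    # Drop a namespace prefix such as "{uri}document" if one is present.
    local_name = root.tag.rsplit("}", 1)[-1]
    # Assumption: the root element is named "document".
    if local_name != "document":
        raise ValueError(f"unexpected root element: {root.tag!r}")
    has_children = len(root) > 0
    has_text = bool((root.text or "").strip())
    return has_children or has_text


# An empty <document> tag means no embedded OCR data was found.
print(has_ocr_data("<document/>"))
# The <page> element is a made-up example of extracted content.
print(has_ocr_data("<document><page/></document>"))
```

The check treats whitespace-only content as empty, so a pretty-printed empty tag is still reported as having no OCR data.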

Job options for the Extract XML from PDF workflow

The Extract XML from PDF workflow lets you pass job-specific values as job options when processing documents.

note

Job and document options you pass at runtime affect only the current job. For the next job, the workflow reverts to the profile settings (the saved defaults) unless you pass the job or document options again.
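
The precedence described in this note can be modeled as an overlay of runtime options on the saved profile. This is an illustrative sketch only, not the Conversion Service API. The function effective_settings and the setting name EXAMPLE.SETTING are hypothetical.

```python
from typing import Optional


def effective_settings(profile: dict, runtime_options: Optional[dict] = None) -> dict:
    """Overlay runtime options on the profile defaults for a single job."""
    merged = dict(profile)  # copy, so the saved profile is never modified
    merged.update(runtime_options or {})
    return merged


# Hypothetical setting name; the workflow's real options may differ.
profile = {"EXAMPLE.SETTING": "default"}

# The first job passes an option at runtime, the second job passes none.
job1 = effective_settings(profile, {"EXAMPLE.SETTING": "override"})
job2 = effective_settings(profile)

print(job1)
print(job2)
```

Because each job works on a copy, the second job sees the profile defaults again, which mirrors the revert behavior described above.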

Document options

Document options apply only to a specific input document. They let you set properties for an individual document rather than as a global setting (determined by the job or the profile). Subsequent jobs processed with the workflow profile use the default profile settings.

Type | Option | Description
Document property | DOC.PASSWORD | Set the password for the document.
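
To illustrate the per-document scope, the sketch below keeps document options in a mapping keyed by input file, so that DOC.PASSWORD reaches only the document it belongs to. The file names, the password value, and the helper options_for are hypothetical. How options are actually submitted to the Conversion Service is not shown here.

```python
def options_for(document: str, document_options: dict) -> dict:
    """Return a copy of the options that apply to one input document only."""
    return dict(document_options.get(document, {}))


# Hypothetical inputs: only the protected file carries a password.
document_options = {
    "protected.pdf": {"DOC.PASSWORD": "example-password"},
}

print(options_for("protected.pdf", document_options))
print(options_for("plain.pdf", document_options))
```

A document without an entry receives no document options, so it is processed with the job or profile settings alone.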