PDF OCR

The PDF OCR (optical character recognition) step converts images of text embedded in PDF files into machine-encoded text. Images of text are typically found in PDFs of scanned documents.

Overview

The PDF OCR step extracts text from files of MIME type application/pdf that don’t contain any machine-readable text.
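The eligibility rule above can be sketched as a small predicate. This is only an illustration, not the step's actual implementation: the function name `needs_ocr` and the idea of passing in any previously extracted text are assumptions made for this example.

```python
from typing import Optional


def needs_ocr(mime_type: str, extracted_text: Optional[str]) -> bool:
    """Return True for a PDF that has no machine-readable text.

    Hypothetical helper: how the real step detects missing text
    is not described in this document.
    """
    if mime_type != "application/pdf":
        return False
    # Missing or whitespace-only text counts as "no machine-readable text".
    return not (extracted_text or "").strip()
```

Under these assumptions, a scanned PDF with no text layer qualifies, while a PDF that already carries text or a file of another MIME type is skipped.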

The Content Augmentation step must run before the PDF OCR step, and the Content Extraction (formerly Content Conversion) step must run after it.
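The ordering requirement can be checked with a short sketch like the one below. The step identifiers (`content-augmentation`, `pdf-ocr`, `content-extraction`) are hypothetical names chosen for this example, not identifiers taken from the product.

```python
from typing import Sequence

# Hypothetical step identifiers, used only for this illustration.
AUGMENTATION = "content-augmentation"
PDF_OCR = "pdf-ocr"
EXTRACTION = "content-extraction"


def is_valid_order(steps: Sequence[str]) -> bool:
    """Check that Content Augmentation < PDF OCR < Content Extraction."""
    try:
        positions = [
            steps.index(name) for name in (AUGMENTATION, PDF_OCR, EXTRACTION)
        ]
    except ValueError:
        return False  # at least one of the three steps is missing
    # The names are distinct, so a sorted list means strictly increasing.
    return positions == sorted(positions)
```

Other steps may sit between the three. Only their relative order is checked.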

Configuration

| Field | Default | Description | UI Setting |
| --- | --- | --- | --- |
| replace_file | True | Replace the original PDF file with a PDF file containing the extracted text overlay. | image4 |
| ocr_timeout | 60 | Maximum time in seconds spent on OCR per document. | image5 |
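As a rough sketch, the two fields and their defaults could be modelled as follows. The class `PdfOcrConfig` and the function `load_config` are hypothetical names for this example. The rejection of unknown fields and of non-positive timeouts is an assumption and not documented behavior.

```python
from dataclasses import dataclass
from typing import Any, Mapping


@dataclass(frozen=True)
class PdfOcrConfig:
    """Hypothetical container mirroring the documented fields and defaults."""

    replace_file: bool = True  # replace the original PDF with the text-overlay PDF
    ocr_timeout: int = 60      # maximum seconds spent on OCR per document


def load_config(overrides: Mapping[str, Any]) -> PdfOcrConfig:
    """Merge user overrides with the defaults listed in the table."""
    unknown = set(overrides) - {"replace_file", "ocr_timeout"}
    if unknown:
        raise ValueError(f"Unknown PDF OCR fields: {sorted(unknown)}")
    config = PdfOcrConfig(**overrides)
    # Assumption: a timeout of zero or less is treated as a configuration error.
    if config.ocr_timeout <= 0:
        raise ValueError("ocr_timeout must be a positive number of seconds")
    return config
```

For example, `load_config({"ocr_timeout": 120})` raises the per-document limit to two minutes and leaves `replace_file` at its default of `True`.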