PDF OCR

The PDF OCR (optical character recognition) step converts images of text embedded in PDF files into machine-encoded text. Images of text are typically found in PDFs of scanned documents.

Overview

The PDF OCR step extracts text from files of MIME type application/pdf that don’t contain any machine-readable text.
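The eligibility rule above can be sketched as a small predicate. This is only an illustration: the helper name `needs_ocr` and its `extracted_text` argument are hypothetical, and the documentation does not say how the step actually detects machine-readable text.

```python
PDF_MIME_TYPE = "application/pdf"


def needs_ocr(mime_type: str, extracted_text: str) -> bool:
    """Return True for a PDF that has no machine-readable text.

    Hypothetical helper: `extracted_text` stands for whatever text
    could be read from the file before OCR.
    """
    if mime_type != PDF_MIME_TYPE:
        # Only files of MIME type application/pdf are candidates.
        return False
    # Assumption: whitespace-only text counts as "no text" in this sketch.
    return extracted_text.strip() == ""
```

Under these assumptions, a scanned PDF with no text qualifies, while a PNG image or a PDF that already contains text does not.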

The Content Augmentation step must run before the PDF OCR step, and the Content Extraction (formerly Content Conversion) step must run after it.

Configuration

Enabling redo_ocr or force_ocr applies the corresponding setting to the tool used for OCR. For details on these settings, see the ocrmypdf documentation.

| Field | Default | Description |
| --- | --- | --- |
| replace_file | True | Replace the original PDF file with a PDF file containing the extracted text overlay. |
| ocr_timeout | 60 | Maximum time in seconds spent on OCR per document. |
| redo_ocr | False | Applies OCR to every page, including pages that already have text, and replaces any existing text layers with OCR text. Aims to create a single, clean, and ideally accurate text layer, at the expense of potentially increased file size and processing time. |
| force_ocr | False | Applies OCR to every page, including pages that already have text. New OCR text is layered over any existing text without replacing it. This ensures that all elements in the PDF are OCR’d, but it can significantly increase file size and processing time. force_ocr takes precedence over redo_ocr when both options are enabled. |
| confidence | False | Annotate the document with an informative OCR confidence score. Enabling this may slow down OCR processing and can even influence its results. Unless explicitly needed, leave this disabled. |
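As a rough illustration of the defaults and of the precedence rule between force_ocr and redo_ocr, the sketch below models the fields as a Python dataclass and picks the matching ocrmypdf command-line flag. The names `PdfOcrConfig` and `ocr_mode_flags` are hypothetical. `--force-ocr` and `--redo-ocr` are ocrmypdf command-line options, but whether the step invokes the command line or the Python API is not stated here, so treat the mapping as an assumption.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class PdfOcrConfig:
    """Hypothetical container mirroring the documented fields and defaults."""

    replace_file: bool = True
    ocr_timeout: int = 60  # maximum seconds spent on OCR per document
    redo_ocr: bool = False
    force_ocr: bool = False
    confidence: bool = False


def ocr_mode_flags(config: PdfOcrConfig) -> List[str]:
    """Return the ocrmypdf mode flag implied by the configuration."""
    if config.force_ocr:
        # force_ocr takes precedence when both options are enabled.
        return ["--force-ocr"]
    if config.redo_ocr:
        return ["--redo-ocr"]
    # Neither option enabled: no mode flag is added.
    return []
```

For example, `ocr_mode_flags(PdfOcrConfig(redo_ocr=True, force_ocr=True))` yields only the force flag, which matches the documented precedence.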