PDF OCR

The PDF OCR (optical character recognition) step converts images of text embedded in PDF files into machine-encoded text. Images of text are typically found in PDFs of scanned documents.

Overview

The PDF OCR step extracts text from files of MIME type application/pdf that don’t contain any machine-readable text.

The Content Augmentation step must run before the PDF OCR step, and the Content Extraction (formerly Content Conversion) step must run after it.

Configuration

Enabling `redo_ocr` or `force_ocr` applies the corresponding setting to the tool used for OCR. More information about these settings can be found in the ocrmypdf documentation.

Field	Default	Description
`replace_file`	True	Replace original PDF file with a PDF file containing the extracted text overlay.
`ocr_timeout`	60	Maximum time in seconds spent on OCR per document.
`redo_ocr`	False	Applies OCR on every page, including those that already have text. It replaces any existing text layers with OCR text. The aim is a single, clean, and hopefully accurate text layer, at the expense of potentially increased file size and processing time.
`force_ocr`	False	Applies OCR on every page, including those that already have text. It layers new OCR text over any existing text without replacing it. This ensures that every element in the PDF is OCR’d, but it can significantly increase file size and processing time. `force_ocr` takes precedence over `redo_ocr` when both options are enabled.
`confidence`	False	Annotates the document with an informative OCR confidence score. Enabling this may slow down OCR processing and can even influence its results. Unless explicitly needed, leave this disabled.
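
The precedence rule between `force_ocr` and `redo_ocr` can be illustrated with a minimal Python sketch. This is not the step's actual implementation: the `PdfOcrConfig` class and the `build_ocrmypdf_command` helper are hypothetical names introduced only for illustration, and the sketch assumes the settings are passed to the ocrmypdf command-line tool through its `--force-ocr` and `--redo-ocr` flags. How the step applies `ocr_timeout`, `replace_file`, and `confidence` internally is not covered here.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class PdfOcrConfig:
    """Hypothetical container mirroring the fields and defaults in the table."""
    replace_file: bool = True
    ocr_timeout: int = 60
    redo_ocr: bool = False
    force_ocr: bool = False
    confidence: bool = False


def build_ocrmypdf_command(config: PdfOcrConfig, src: str, dst: str) -> List[str]:
    """Build an ocrmypdf command line for the given settings (illustrative only)."""
    cmd = ["ocrmypdf"]
    if config.force_ocr:
        # Force OCR takes precedence over Redo OCR when both are enabled,
        # so only one of the two flags is ever passed on.
        cmd.append("--force-ocr")
    elif config.redo_ocr:
        cmd.append("--redo-ocr")
    cmd += [src, dst]
    return cmd


if __name__ == "__main__":
    both = PdfOcrConfig(redo_ocr=True, force_ocr=True)
    print(build_ocrmypdf_command(both, "scan.pdf", "scan_ocr.pdf"))
```

One way to enforce a per-document limit in such a sketch, again as an assumption, would be to run the resulting command with `subprocess.run(cmd, timeout=config.ocr_timeout, check=True)`.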