PDF OCR
The PDF OCR (optical character recognition) step converts images of text embedded in PDF files into machine-encoded text. Images of text are typically found in PDFs of scanned documents.
Overview
The PDF OCR step extracts text from files of MIME type application/pdf that don’t contain any machine-readable text.
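The eligibility rule above can be sketched as a small predicate. This is only an illustration: the function name `needs_ocr` and its parameters are assumptions, and the documentation does not describe how the step itself detects machine-readable text.

```python
from typing import Optional

PDF_MIME_TYPE = "application/pdf"


def needs_ocr(mime_type: str, extracted_text: Optional[str]) -> bool:
    """Return True for a PDF file that has no machine-readable text.

    Hypothetical helper: `extracted_text` stands for whatever text an
    earlier step managed to read from the file (None or blank if nothing).
    """
    # Only PDF files are candidates for this step.
    if mime_type.strip().lower() != PDF_MIME_TYPE:
        return False
    # Whitespace-only text counts as "no machine-readable text" here.
    return not (extracted_text or "").strip()
```

Under these assumptions, a scanned PDF with no text layer would be selected for OCR, while a PDF that already carries text, or a file of any other MIME type, would be skipped.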
The Content Augmentation step must run before the PDF OCR step, and the Content Extraction (formerly Content Conversion) step must run after it.
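The ordering requirement can be checked mechanically. The sketch below assumes a pipeline is represented as a plain list of step identifiers; the identifiers shown are hypothetical placeholders, not the names the product uses.

```python
from typing import Sequence

# Hypothetical step identifiers, listed in the order the steps must run.
REQUIRED_ORDER = ("content-augmentation", "pdf-ocr", "content-extraction")


def check_step_order(steps: Sequence[str]) -> bool:
    """Return True if all three steps are present in the required relative order.

    Other steps may sit between them; only the relative order matters.
    """
    try:
        positions = [steps.index(name) for name in REQUIRED_ORDER]
    except ValueError:
        # At least one of the three steps is missing from the pipeline.
        return False
    return positions == sorted(positions)
```

For example, `check_step_order(["content-augmentation", "pdf-ocr", "content-extraction"])` returns `True`, while a pipeline that lists the extraction step before the OCR step returns `False`.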
Configuration
| Field | Default | Description | UI Setting |
|---|---|---|---|
| | True | Replace original PDF file with a PDF file containing the extracted text overlay. | |
| | 60 | Maximum time in seconds spent on OCR per document | |
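To illustrate how the two settings might be used, here is a minimal sketch of a per-document time budget. Everything named here is an assumption: the field names `replace_original` and `timeout_seconds`, the `ocr_page` callable, and the choice to return `None` when the budget runs out are illustrative only, since the documentation does not state what the step does on timeout.

```python
import time
from dataclasses import dataclass
from typing import Callable, Iterable, List, Optional


@dataclass
class PdfOcrSettings:
    # Hypothetical field names; only the defaults come from the table above.
    replace_original: bool = True
    timeout_seconds: float = 60


def ocr_within_budget(
    pages: Iterable[str],
    ocr_page: Callable[[str], str],
    settings: PdfOcrSettings,
    clock: Callable[[], float] = time.monotonic,
) -> Optional[List[str]]:
    """Run `ocr_page` on each page; return None if the time budget is used up.

    The deadline is checked between pages, so a page that is already being
    processed is not interrupted.
    """
    deadline = clock() + settings.timeout_seconds
    texts: List[str] = []
    for page in pages:
        if clock() >= deadline:
            return None
        texts.append(ocr_page(page))
    return texts
```

Passing the clock as a parameter keeps the sketch easy to test with a fake time source. A real implementation would more likely enforce the limit by terminating the OCR process itself.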