PDF Conversion
The PDF Conversion step converts standard Microsoft Office documents (such as Microsoft Word, Excel, and PowerPoint documents) to PDF documents.
Overview
The PDF Conversion step converts file formats other than plain text to PDF.
Documents from popular office suites, such as Microsoft Office, LibreOffice, or Google Docs, are accepted as input.
The file format is detected through the document's MIME type. The PDF Conversion step can convert documents with the following MIME types:
application/msword
application/rtf
application/vnd.lotus-1-2-3
application/vnd.ms-excel
application/vnd.ms-excel.sheet.macroEnabled.12
application/vnd.ms-excel.template.macroEnabled.12
application/vnd.ms-powerpoint
application/vnd.ms-powerpoint.presentation.macroEnabled.12
application/vnd.ms-powerpoint.slideshow.macroEnabled.12
application/vnd.ms-powerpoint.template.macroEnabled.12
application/vnd.ms-word.document.macroEnabled.12
application/vnd.ms-word.template.macroEnabled.12
application/vnd.ms-works
application/vnd.oasis.opendocument.chart
application/vnd.oasis.opendocument.formula
application/vnd.oasis.opendocument.graphics
application/vnd.oasis.opendocument.graphics-template
application/vnd.oasis.opendocument.presentation
application/vnd.oasis.opendocument.presentation-template
application/vnd.oasis.opendocument.spreadsheet
application/vnd.oasis.opendocument.spreadsheet-template
application/vnd.oasis.opendocument.text
application/vnd.oasis.opendocument.text-master
application/vnd.oasis.opendocument.text-template
application/vnd.oasis.opendocument.text-web
application/vnd.openxmlformats-officedocument.presentationml.presentation
application/vnd.openxmlformats-officedocument.presentationml.slideshow
application/vnd.openxmlformats-officedocument.presentationml.template
application/vnd.openxmlformats-officedocument.spreadsheetml.sheet
application/vnd.openxmlformats-officedocument.spreadsheetml.template
application/vnd.openxmlformats-officedocument.wordprocessingml.document
application/vnd.openxmlformats-officedocument.wordprocessingml.template
application/vnd.sun.xml.calc
application/vnd.sun.xml.calc.template
application/vnd.sun.xml.draw
application/vnd.sun.xml.draw.template
application/vnd.sun.xml.impress
application/vnd.sun.xml.impress.template
application/vnd.sun.xml.math
application/vnd.sun.xml.writer
application/vnd.sun.xml.writer.global
application/vnd.sun.xml.writer.template
application/vnd.wordperfect
application/x-dbf
application/x-extension-txt
application/x-quattropro
application/x-t602
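As an illustration of the MIME-type check described above, the following minimal sketch tests whether a document would qualify for conversion. The names `CONVERTIBLE_MIME_TYPES` and `is_convertible` are hypothetical and not part of the Squirro API, and the set contains only a subset of the list above.

```python
# Hypothetical helper, not part of the Squirro API.
# Subset of the supported MIME types listed above; extend as needed.
CONVERTIBLE_MIME_TYPES = {
    "application/msword",
    "application/rtf",
    "application/vnd.ms-excel",
    "application/vnd.ms-powerpoint",
    "application/vnd.oasis.opendocument.text",
    "application/vnd.openxmlformats-officedocument.wordprocessingml.document",
}


def is_convertible(mime_type: str) -> bool:
    """Return True if the MIME type is in the convertible set.

    Parameters such as "; charset=binary" are dropped and the comparison
    is case-insensitive.
    """
    base_type = mime_type.split(";", 1)[0].strip().lower()
    return base_type in CONVERTIBLE_MIME_TYPES
```

A plain text document (`text/plain`) is not in the list, so `is_convertible("text/plain")` returns `False`.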
The PDF Conversion step belongs to the Enrich section and is part of the Binary Documents pipeline preset.
When the PDF Conversion step is used together with the Content Augmentation and Content Extraction steps, the Pipeline Editor automatically places it before those two steps. This lets the subsequent enrichments act on the generated PDF representation, which improves processing and display of the item in the Squirro UI.
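The ordering rule can be sketched as follows. This is only an illustration of the behavior described above, not the Pipeline Editor's actual implementation; the function `place_pdf_conversion` and the representation of a pipeline as a list of step names are assumptions.

```python
from typing import List

PDF_STEP = "PDF Conversion"
DEPENDENT_STEPS = ("Content Augmentation", "Content Extraction")


def place_pdf_conversion(steps: List[str]) -> List[str]:
    """Return a copy of `steps` with PDF Conversion ahead of its dependents.

    Hypothetical helper: the step is moved only if it currently sits after
    the first Content Augmentation or Content Extraction step.
    """
    ordered = list(steps)
    if PDF_STEP not in ordered:
        return ordered
    dependents = [i for i, name in enumerate(ordered) if name in DEPENDENT_STEPS]
    if not dependents:
        return ordered
    first_dependent = min(dependents)
    pdf_index = ordered.index(PDF_STEP)
    if pdf_index > first_dependent:
        # Remove the step and re-insert it just before the first dependent.
        ordered.pop(pdf_index)
        ordered.insert(first_dependent, PDF_STEP)
    return ordered
```

A pipeline that already has PDF Conversion ahead of both steps is returned unchanged.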
Configuration
Field | Default | Description |
| False | Ignore errors and let the item continue through the Pipeline even if errors occur. |
| False | Enable image compression inside text files, to help reduce the size of the resulting PDF file. |
| 75 | Quality of images in a PDF file (from 1 to 100). |
| 75 | Maximum image resolution (possible values are 75, 150, 300, 600, and 1200). |
| False | If a file is too big, the system tries to split it into sheets, convert the parts, and merge them again. |
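The value constraints in the table can be checked before a configuration is applied. The sketch below is an assumption-laden illustration: the table does not show the field names, so every attribute name in `PdfConversionOptions` is hypothetical. Only the defaults and the allowed ranges come from the table.

```python
from dataclasses import dataclass

# Allowed values for the maximum image resolution, as listed in the table.
ALLOWED_RESOLUTIONS = (75, 150, 300, 600, 1200)


@dataclass
class PdfConversionOptions:
    # All attribute names are hypothetical; defaults mirror the table.
    ignore_errors: bool = False
    compress_images: bool = False
    image_quality: int = 75
    max_image_resolution: int = 75
    split_large_files: bool = False

    def validate(self) -> None:
        """Raise ValueError if a value falls outside the documented range."""
        if not 1 <= self.image_quality <= 100:
            raise ValueError("image_quality must be between 1 and 100")
        if self.max_image_resolution not in ALLOWED_RESOLUTIONS:
            raise ValueError(
                f"max_image_resolution must be one of {ALLOWED_RESOLUTIONS}"
            )
```

Calling `PdfConversionOptions().validate()` succeeds because the defaults satisfy both constraints.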
Note the related pdfconversion.pdf-cache-ttl configuration option, which is visible and configurable only by server admins. It controls the number of seconds that the generated PDF remains in the pdf_conversion cache and defaults to 1 day (86400 seconds). Whenever an item with a PDF representation is accessed, its TTL is refreshed. If the TTL expires without anyone accessing the item, the PDF is removed from the cache. The next time the item is requested, its PDF representation is generated again and displayed in the Squirro UI. This mechanism is transparent to the end user.
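The refresh-on-access behavior described above is commonly called a sliding expiration. The following minimal sketch shows the idea under stated assumptions: `SlidingTtlCache` and `get_pdf` are hypothetical names, the cache is a plain in-memory dictionary, and nothing here reflects how the pdf_conversion cache is actually implemented.

```python
import time


class SlidingTtlCache:
    """In-memory cache whose entries expire unless they keep being read."""

    def __init__(self, ttl_seconds=86400, clock=time.monotonic):
        self._ttl = ttl_seconds
        self._clock = clock  # injectable so the behavior can be tested
        self._entries = {}   # key -> (value, expires_at)

    def put(self, key, value):
        self._entries[key] = (value, self._clock() + self._ttl)

    def get(self, key):
        entry = self._entries.get(key)
        if entry is None:
            return None
        value, expires_at = entry
        now = self._clock()
        if now >= expires_at:
            # TTL ran out without an access: drop the entry.
            del self._entries[key]
            return None
        # Each successful access refreshes the TTL.
        self._entries[key] = (value, now + self._ttl)
        return value


def get_pdf(cache, item_id, convert):
    """Return the cached PDF, regenerating it with `convert` on a miss."""
    pdf = cache.get(item_id)
    if pdf is None:
        pdf = convert(item_id)
        cache.put(item_id, pdf)
    return pdf
```

From the caller's point of view, `get_pdf` always returns a PDF, which mirrors how the regeneration stays invisible to the end user.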
In addition, the pdfconversion.pdf-conversion-timeout configuration option sets the maximum time, in seconds, that the pdfconversion service allows for converting a file to PDF. When increasing this value, ensure that it does not exceed the proxy_read_timeout directive configured in the pdfconversion Nginx configuration file.
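A quick sanity check of the two timeouts could look like the sketch below. The helper names are hypothetical, and the parser is deliberately simplified: it assumes the directive appears once, with a single integer value and an optional `s`, `m`, or `h` suffix. Other Nginx time formats are not handled.

```python
import re

# Seconds per unit; a value without a suffix is interpreted as seconds.
_UNIT_SECONDS = {"": 1, "s": 1, "m": 60, "h": 3600}


def read_proxy_read_timeout(nginx_conf: str):
    """Return proxy_read_timeout in seconds, or None if it is not found."""
    match = re.search(
        r"^\s*proxy_read_timeout\s+(\d+)([smh]?)\s*;", nginx_conf, re.MULTILINE
    )
    if match is None:
        return None
    return int(match.group(1)) * _UNIT_SECONDS[match.group(2)]


def conversion_timeout_fits(conversion_timeout: int, nginx_conf: str) -> bool:
    """Check that the conversion timeout does not exceed proxy_read_timeout."""
    limit = read_proxy_read_timeout(nginx_conf)
    if limit is None:
        raise ValueError("proxy_read_timeout not found in the given configuration")
    return conversion_timeout <= limit
```

With a configuration containing `proxy_read_timeout 5m;`, a conversion timeout of 120 seconds passes the check, while 600 seconds does not.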