Content Extraction

The content extraction enrichment converts incoming content to HTML. This is used to extract textual content from PDF, office and other binary file formats.

For the best document search experience, combine this step with Content Augmentation and PDF OCR. For more information, see Indexing Common Formats.

Enrichment name: content-conversion

Stage: content

Overview

The content-conversion step is used to convert incoming content to HTML. For supported document formats, incoming documents are split into individual pages, each represented as a separate HTML document.

The converted content is used to set the body attribute.

Supported Content MIME Types

The following content MIME types are supported for conversion.

Pages Support indicates whether a document is split into individual pages during conversion. Display Support refers to how documents are displayed to the user. To display all office formats with full display support, insert the PDF Conversion step before this step. See Indexing Common Formats for a full guide.

| File Extension | MIME Type | Pages Support | Display Support |
|---|---|---|---|
| .pdf | application/pdf | Yes | Full |
| .doc | application/msword | No | HTML only |
| .docx | application/vnd.openxmlformats-officedocument.wordprocessingml.document | No | HTML only |
| .xls | application/vnd.ms-excel | No | HTML only |
| .xlsx | application/vnd.openxmlformats-officedocument.spreadsheetml.sheet | No | HTML only |
| .ppt | application/vnd.ms-powerpoint | No | HTML only |
| .pptx | application/vnd.openxmlformats-officedocument.presentationml.presentation | No | HTML only |
| .rtf | text/rtf | No | HTML only |
| .odt | application/vnd.oasis.opendocument.text | No | HTML only |
| .ods | application/vnd.oasis.opendocument.spreadsheet | No | HTML only |
| .odp | application/vnd.oasis.opendocument.presentation | No | HTML only |
| .sxw | application/vnd.sun.xml.writer | No | HTML only |
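
When preparing content for indexing, it can help to check up front whether a file is covered by the list above. The following is a minimal client-side sketch of such a check, not part of the product itself. The names `FormatSupport`, `SUPPORTED_FORMATS`, `lookup_support`, and `needs_pdf_conversion` are hypothetical and exist only for illustration; the data is copied from the list of supported MIME types.

```python
import os
from typing import NamedTuple, Optional


class FormatSupport(NamedTuple):
    """One supported format (hypothetical helper type)."""
    mime_type: str
    pages: bool    # "Pages Support": is the document split into pages?
    display: str   # "Display Support": "Full" or "HTML only"


# Data copied from the supported MIME types list; the dict is illustrative.
_OOXML = "application/vnd.openxmlformats-officedocument."
_ODF = "application/vnd.oasis.opendocument."

SUPPORTED_FORMATS = {
    ".pdf": FormatSupport("application/pdf", True, "Full"),
    ".doc": FormatSupport("application/msword", False, "HTML only"),
    ".docx": FormatSupport(_OOXML + "wordprocessingml.document", False, "HTML only"),
    ".xls": FormatSupport("application/vnd.ms-excel", False, "HTML only"),
    ".xlsx": FormatSupport(_OOXML + "spreadsheetml.sheet", False, "HTML only"),
    ".ppt": FormatSupport("application/vnd.ms-powerpoint", False, "HTML only"),
    ".pptx": FormatSupport(_OOXML + "presentationml.presentation", False, "HTML only"),
    ".rtf": FormatSupport("text/rtf", False, "HTML only"),
    ".odt": FormatSupport(_ODF + "text", False, "HTML only"),
    ".ods": FormatSupport(_ODF + "spreadsheet", False, "HTML only"),
    ".odp": FormatSupport(_ODF + "presentation", False, "HTML only"),
    ".sxw": FormatSupport("application/vnd.sun.xml.writer", False, "HTML only"),
}


def lookup_support(filename: str) -> Optional[FormatSupport]:
    """Return the support entry for a file name, or None if it is not listed."""
    extension = os.path.splitext(filename)[1].lower()
    return SUPPORTED_FORMATS.get(extension)


def needs_pdf_conversion(filename: str) -> bool:
    """True if the format is listed but does not have full display support."""
    support = lookup_support(filename)
    return support is not None and support.display != "Full"
```

For example, `lookup_support("Report.PDF")` returns the PDF entry with `pages=True`, while `needs_pdf_conversion("slides.pptx")` returns `True`, which suggests placing the PDF Conversion step before this one if full display support is wanted.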

Configuration

There are no configuration options for this enrichment.