Content Extraction

The content extraction enrichment converts incoming content to HTML. This is used to extract textual content from PDF, office and other binary file formats.

This step works best in combination with Content Augmentation and PDF OCR, which together improve the search experience for documents. For more information, see Indexing Common Formats.

Enrichment name	content-conversion
Stage	content

Overview

The content-conversion step converts incoming content to HTML. For formats with pages support (see the table of supported MIME types below), incoming documents are split into individual pages, each represented as a separate HTML document.

The converted content is used to set the body attribute.
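The output shape described above can be pictured with a minimal Python sketch. This is a conceptual model only, not the enrichment's implementation: the function name `to_page_records` and the `page` field are hypothetical, and only the `body` attribute comes from this page.

```python
def to_page_records(html_pages):
    """Model one record per converted page.

    `html_pages` is a list of HTML strings, one per page. A format
    without pages support would yield a single-element list.
    The `page` key is a hypothetical illustration; `body` mirrors the
    attribute that the converted content is used to set.
    """
    return [
        {"page": number, "body": html}
        for number, html in enumerate(html_pages, start=1)
    ]


records = to_page_records(["<p>First page</p>", "<p>Second page</p>"])
```

Each element of `records` stands for one of the separate HTML documents produced for a paged format such as PDF.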

Supported Content MIME Types

The following content MIME types are supported for conversion.

Pages Support indicates whether a document is split into individual pages during conversion. Display Support refers to how the documents are displayed to the user. To display all office formats with full display support, insert the PDF Conversion step before this step. See Indexing Common Formats for a full guide.

File Extension	Mime Type	Pages Support	Display Support
`.pdf`	`application/pdf`	Yes	Full
`.doc`	`application/msword`	No	HTML only
`.docx`	`application/vnd.openxmlformats-officedocument.wordprocessingml.document`	No	HTML only
`.xls`	`application/vnd.ms-excel`	No	HTML only
`.xlsx`	`application/vnd.openxmlformats-officedocument.spreadsheetml.sheet`	No	HTML only
`.ppt`	`application/vnd.ms-powerpoint`	No	HTML only
`.pptx`	`application/vnd.openxmlformats-officedocument.presentationml.presentation`	No	HTML only
`.rtf`	`text/rtf`	No	HTML only
`.odt`	`application/vnd.oasis.opendocument.text`	No	HTML only
`.ods`	`application/vnd.oasis.opendocument.spreadsheet`	No	HTML only
`.odp`	`application/vnd.oasis.opendocument.presentation`	No	HTML only
`.sxw`	`application/vnd.sun.xml.writer`	No	HTML only
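When preparing content upstream, it can help to check a document's MIME type against this table before sending it for conversion. The following Python sketch transcribes the table into a lookup; the names `SUPPORTED_MIME_TYPES` and `conversion_support` are hypothetical and are not part of the product's API.

```python
# MIME types without pages support, taken from the table above.
_HTML_ONLY_TYPES = [
    "application/msword",
    "application/vnd.openxmlformats-officedocument.wordprocessingml.document",
    "application/vnd.ms-excel",
    "application/vnd.openxmlformats-officedocument.spreadsheetml.sheet",
    "application/vnd.ms-powerpoint",
    "application/vnd.openxmlformats-officedocument.presentationml.presentation",
    "text/rtf",
    "application/vnd.oasis.opendocument.text",
    "application/vnd.oasis.opendocument.spreadsheet",
    "application/vnd.oasis.opendocument.presentation",
    "application/vnd.sun.xml.writer",
]

# Hypothetical lookup: MIME type -> pages support and display support.
SUPPORTED_MIME_TYPES = {
    "application/pdf": {"pages": True, "display": "Full"},
}
for _mime in _HTML_ONLY_TYPES:
    SUPPORTED_MIME_TYPES[_mime] = {"pages": False, "display": "HTML only"}


def conversion_support(mime_type):
    """Return the table entry for a MIME type, or None if it is not listed."""
    return SUPPORTED_MIME_TYPES.get(mime_type.strip().lower())
```

For example, `conversion_support("application/pdf")` returns the entry with pages support enabled, while an unlisted type such as `image/png` returns `None`.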

Configuration

There are no configuration options for this enrichment.
