Content Extraction
The content extraction enrichment converts incoming content to HTML. This is used to extract textual content from PDF, office and other binary file formats.
This step works best when combined with Content Augmentation and PDF OCR, which together give the best search experience for documents. For more information, see Indexing Common Formats.
Enrichment name | content-conversion
Stage | content
Overview
The content-conversion step is used to convert incoming content to HTML. For supported document formats, incoming documents are split into individual pages, each represented as a separate HTML document.
The converted content is used to set the body attribute.
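The page splitting described above can be pictured with a minimal sketch. It is illustrative only and is not the enrichment's actual implementation: the helper name `pages_to_items`, the `id` field, and the paragraph-based HTML markup are all assumptions. Only the `body` attribute comes from the description above.

```python
from html import escape


def pages_to_items(doc_id, page_texts):
    """Hypothetical helper: build one item per page, with the page as HTML in `body`."""
    items = []
    for number, text in enumerate(page_texts, start=1):
        # Treat blank-line-separated chunks as paragraphs and escape HTML metacharacters.
        paragraphs = "".join(
            f"<p>{escape(chunk)}</p>" for chunk in text.split("\n\n") if chunk.strip()
        )
        items.append(
            {
                "id": f"{doc_id}-page-{number}",  # assumed naming scheme
                "body": f"<html><body>{paragraphs}</body></html>",
            }
        )
    return items


items = pages_to_items("report", ["Hello <world>", "Second\n\nThird"])
```

Each entry in `items` stands for one page of the source document, mirroring the "one HTML document per page" behaviour for formats with pages support.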
Supported Content MIME Types
The following content MIME types are supported for conversion.
Display Support describes how converted documents are displayed to the user. To give all office formats full display support, insert the PDF Conversion step before this step. See Indexing Common Formats for a full guide.
File Extension | Mime Type | Pages Support | Display Support
 | | Yes | Full
 | | No | HTML only
 | | No | HTML only
 | | No | HTML only
 | | No | HTML only
 | | No | HTML only
 | | No | HTML only
 | | No | HTML only
 | | No | HTML only
 | | No | HTML only
 | | No | HTML only
 | | No | HTML only
Configuration
There are no configuration options for this enrichment.