Indexing Common Formats

Squirro’s pipeline can process complex document types, including common office formats such as Word and Excel files, as well as PDFs. This page explains how to set up the pipeline to achieve this.

Overview

Squirro can index and display many common document formats. This enables end users to interact with those documents directly and search within them.

When set up correctly, these documents become searchable, and document thumbnails are shown in the result list.

They can also be displayed directly in the user interface, so users do not have to switch to a desktop application.

Setup

Squirro ships with the default configurations required to index these documents out of the box.

Set Up Pipeline Workflow

In the Setup space, navigate to Pipeline.
In the top right, click the pencil icon to enter edit mode.
In the bottom left, choose New Pipeline to create a new pipeline.
A list of pipeline presets is displayed. Select the Binary Documents workflow.
In the Pipeline Properties on the right side, give the pipeline workflow a meaningful name, for example “Documents”.

This concludes the setup. The steps that ensure documents are correctly indexed are:

PDF Conversion, which converts Office formats to PDF.
Content Augmentation, which makes the contents of the document available to the rest of the pipeline.
Content Extraction, which extracts the text content from the document to make the document searchable.
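
The order of the three steps above can be illustrated with a small, self-contained sketch. This is a toy model only, not Squirro's implementation or API: the `Item` class, the step functions, the `OFFICE_EXTENSIONS` set, and the placeholder "conversion" are all assumptions made for illustration. It only shows how each step hands its result to the next one.

```python
import os
from dataclasses import dataclass, field
from typing import Callable, List, Optional

# Assumption: extensions treated as "Office formats" in this toy model.
OFFICE_EXTENSIONS = {".docx", ".xlsx"}


@dataclass
class Item:
    """Hypothetical stand-in for a document travelling through the pipeline."""
    filename: str
    raw: bytes
    pdf: Optional[bytes] = None
    content: Optional[bytes] = None
    body: str = ""
    trace: List[str] = field(default_factory=list)


def pdf_conversion(item: Item) -> Item:
    # Placeholder: a real converter would render the Office file as a PDF.
    ext = os.path.splitext(item.filename)[1].lower()
    if ext in OFFICE_EXTENSIONS:
        item.pdf = b"[pdf] " + item.raw
    elif ext == ".pdf":
        item.pdf = item.raw
    item.trace.append("PDF Conversion")
    return item


def content_augmentation(item: Item) -> Item:
    # Expose the document contents to the steps that follow.
    item.content = item.pdf if item.pdf is not None else item.raw
    item.trace.append("Content Augmentation")
    return item


def content_extraction(item: Item) -> Item:
    # Placeholder: decode bytes as text so the item has a searchable body.
    item.body = (item.content or b"").decode("utf-8", errors="replace")
    item.trace.append("Content Extraction")
    return item


# The steps run in the order listed in the text above.
WORKFLOW: List[Callable[[Item], Item]] = [
    pdf_conversion,
    content_augmentation,
    content_extraction,
]


def run_workflow(item: Item) -> Item:
    for step in WORKFLOW:
        item = step(item)
    return item


if __name__ == "__main__":
    result = run_workflow(Item("report.docx", b"quarterly numbers"))
    print(result.trace)
    print(result.body)
```

In this sketch, text extraction runs last so that it always sees content prepared by the earlier steps. That ordering is the only part taken from the text above.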

Ingest Data

Once the pipeline workflow is set up, data ingestion proceeds as with any other data source.

Important: Make sure to select the newly created workflow so that the data is processed accordingly.
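
When ingestion is scripted rather than configured in the user interface, a pre-flight check can catch a source that points at the wrong workflow. The sketch below is a hypothetical illustration: `SourceConfig`, its `pipeline_workflow_id` field, and `ensure_workflow` are assumed names, not part of any Squirro API, and the workflow IDs are made-up placeholders.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class SourceConfig:
    """Hypothetical description of a data source about to be ingested."""
    name: str
    pipeline_workflow_id: Optional[str] = None


def ensure_workflow(source: SourceConfig, expected_workflow_id: str) -> SourceConfig:
    """Raise if the source is not bound to the expected pipeline workflow."""
    if source.pipeline_workflow_id != expected_workflow_id:
        raise ValueError(
            f"Source {source.name!r} uses workflow "
            f"{source.pipeline_workflow_id!r}, expected {expected_workflow_id!r}"
        )
    return source


if __name__ == "__main__":
    # Placeholder ID standing in for the newly created "Documents" workflow.
    documents_workflow = "documents-workflow-id"
    ok = SourceConfig("contracts", pipeline_workflow_id=documents_workflow)
    ensure_workflow(ok, documents_workflow)  # passes silently

    try:
        ensure_workflow(SourceConfig("reports"), documents_workflow)
    except ValueError as err:
        print(err)
```

Failing early here is a design choice: it is cheaper than discovering after ingestion that the documents were processed by a different workflow.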