Pipeline Reruns#

This page provides an overview of pipeline reruns and explains how to perform the following types of reruns:

Rerun pipeline workflows
Rerun individual pipeline steps
Rerun enrichments

Rerun Pipeline Workflows#

Pipeline workflows support reprocessing data that has already been processed once.

A workflow can be used by zero, one, or several data sources. The rerun functionality of a workflow takes as input the data that these data sources have already retrieved and ingested.

For example, in the above screenshot, a workflow for processing Binary Documents is configured to be used by four data sources. The rerun functionality will therefore use the processed data of these four sources as input.

Note

For the Rerun from Index mode, you can enable the project configuration option datasource.rerun.index.consider-all-project-items to use all the processed data in the project as input, not only the data from the data sources configured for the workflow.

This functionality is useful when you modify your workflow, for example by adding a new step or changing the configuration of an existing step, and want to execute it against the same set of data. The rerun may then produce a different set of Squirro Items that is more relevant to your needs.

To rerun a workflow, click its three dots menu and select the Rerun option.

When you click on the Rerun option, a popup window is displayed where you can choose which rerun mode you would like to invoke.

Rerun Modes#

There are two rerun modes available:

Rerun from Raw Data
Rerun from Index

Rerun from Raw Data#

The rerun from raw data utilizes the data as they were retrieved by a Squirro data loader plugin from the actual data source.

When this mode is available, it is the recommended option because it ensures a clean rerun.

Rerun from Index#

The rerun from index uses the actual Squirro Items that are stored in the storage layer of your Squirro instance.

This mode is always available. However, it is considered less stable than Rerun from Raw Data: when the workflow includes a non-idempotent step, the resulting Squirro Item will differ from the one produced when it was first ingested, even if the workflow has not changed.

As an example, consider a pipelet that modifies the title of an item by appending an exclamation mark. If we invoke the rerun functionality of a workflow on this item n times, the resulting Squirro Item will contain n exclamation marks in its title.
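The effect of a non-idempotent step can be illustrated with a small, self-contained sketch. It does not use the Squirro SDK: plain dictionaries stand in for Squirro Items, and the function names `append_exclamation`, `ensure_exclamation`, and `rerun_from_index` are hypothetical helpers introduced only for this illustration.

```python
# Plain dicts stand in for Squirro Items; this is NOT the Squirro SDK.

def append_exclamation(item):
    """Non-idempotent step: every run changes the title again."""
    item = dict(item)
    item["title"] = item["title"] + "!"
    return item


def ensure_exclamation(item):
    """Idempotent variant: running it again leaves the title unchanged."""
    item = dict(item)
    if not item["title"].endswith("!"):
        item["title"] += "!"
    return item


def rerun_from_index(item, step, times):
    """Feed the stored item back through the step `times` times,
    the way repeated reruns from index would (simplified assumption)."""
    for _ in range(times):
        item = step(item)
    return item


stored = {"title": "Quarterly report"}
print(rerun_from_index(stored, append_exclamation, 3)["title"])  # Quarterly report!!!
print(rerun_from_index(stored, ensure_exclamation, 3)["title"])  # Quarterly report!
```

Writing steps so that they check the current state before modifying it, as in the second function, is one way to keep reruns from index repeatable.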

Rerun Individual Pipeline Steps#

The Pipeline Editor provides the functionality to rerun individual steps of a pipeline workflow on data for which the pipeline is configured.

Overview#

Rerunning an individual step is typically required when you have added or changed an enrichment in your pipeline workflow and want to (re)apply the enrichment to already indexed data.

Example: Say you have ingested a set of PDF documents using a computationally expensive workflow for binary documents. You would now like to run the documents through a recently developed text classification model to enrich the data with labels resulting from the classification. To avoid rerunning the entire workflow, you can choose to run only the classification step.

Configuration#

Navigate to the pipeline that contains the step you want to rerun. In the pipeline editor, hover over the step, click the three dots, then select Rerun from the dropdown (see screenshot below).

You can configure the following two options:

Name (UI setting): Query
Default: (none)
Description: The query filters the set of items for which you want to rerun the step. Use the standard Squirro query syntax. Providing a query is optional. If no query is provided, the step runs on all the items of the data sources that are configured to use this pipeline workflow.

Name (UI setting): Run linked steps
Default: False
Description: Check to run any linked steps that are required for ingesting and enriching the data successfully. Enrichments of the step you are rerunning are not persisted if you omit running the linked steps. Omitting linked steps is meant for development and testing purposes, to check that items successfully run through the enrichment step.

Note that after you add a new step to a pipeline workflow, its Rerun option becomes accessible only once you save the workflow and refresh the page.

Example#

In the above screenshot, we submit the query source_type:ZIP. The query selects items that were indexed using a ZIP data source. The rerun will not affect any other items from data sources that are configured to use this workflow. With the Run linked steps option checked, the 5 steps linked to the Proximity Filter step also run when you click the RERUN button.
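The interaction between the query and the Run linked steps option can be pictured with a toy model. The sketch below is not how Squirro implements reruns. It only mirrors the two rules described on this page: the query restricts which items the step runs on, and results are persisted only when the linked steps run. The function `rerun_step`, the `matches` predicate that stands in for a real Squirro query, and the `classify` step are all hypothetical.

```python
def rerun_step(index, step, matches=None, run_linked_steps=False):
    """Toy model of a single-step rerun (illustrative assumption only).

    index: dict mapping item id -> item dict (stand-in for stored items)
    step: function applied to a copy of each selected item
    matches: optional predicate standing in for the query; None selects all
    run_linked_steps: when False, results are returned but not written back
    """
    selected = {
        item_id: item
        for item_id, item in index.items()
        if matches is None or matches(item)
    }
    results = {item_id: step(dict(item)) for item_id, item in selected.items()}
    if run_linked_steps:
        index.update(results)  # persisted only when the linked steps run
    return results


def classify(item):
    """Hypothetical enrichment step that adds a label."""
    item["label"] = "report"
    return item


index = {
    1: {"source_type": "ZIP", "title": "a.pdf"},
    2: {"source_type": "Feed", "title": "news"},
}
is_zip = lambda item: item["source_type"] == "ZIP"  # plays the role of source_type:ZIP

dry = rerun_step(index, classify, matches=is_zip)
print(sorted(dry), "label" in index[1])  # [1] False

rerun_step(index, classify, matches=is_zip, run_linked_steps=True)
print("label" in index[1], "label" in index[2])  # True False
```

The first call corresponds to a test run without linked steps: the step executes on the matching item, but the stored item is unchanged. The second call corresponds to the configuration in the example, where only the ZIP item is updated.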

Rerun Enrichments#

Warning

The Studio plugin for rerunning enrichments has been deprecated since Squirro version 3.4.4. It is recommended that you use the rerun functionality provided in the Pipeline Editor instead.

Usage#

Rerun Enrichments is used to apply enrichments after ingestion.

Select Rerun Enrichments in the AI Studio tab.

The drop-down provides a list of the enrichments configured in the project. Select the enrichment that should be executed.

Optionally, enter a query to restrict the items on which the enrichment runs. For search tagging enrichments, the query is pre-filled automatically but can be changed. An empty query causes the enrichment to run on all the items in the project.

Limitations#

Rerunning enrichments currently has a few limitations. If any of them is an issue, the best alternative is to do a fresh load of the data. If that is not possible, contact Squirro support.

Type of Enrichments#

Rerunning is only supported for search tagging and pipelets. Built-In Steps and AI Studio models cannot be rerun.

Pipelet Features#

Number of items: only pipelets that return exactly one item are supported. There is no support for pipelets that return None (to remove an item from the index) or that yield more than one item.
Type of updates: only keywords can be updated with a rerun. Any other updates, such as changing the title or body of an item, are silently ignored.
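To make these two limitations concrete, here is a minimal sketch under stated assumptions. The class `TitleLengthPipelet` is a plain Python stand-in with a `consume` method. A real pipelet would extend the base class from the Squirro SDK, which is deliberately left out so the example runs on its own. The helper `apply_rerun_result` is hypothetical and only models the documented behavior that keywords are taken over while other changes are ignored. Replacing the keywords wholesale is a simplification.

```python
import copy


class TitleLengthPipelet:
    """Stand-in for a pipelet; NOT the Squirro SDK base class."""

    def consume(self, item):
        # Supported by a rerun: adding or changing keywords.
        item.setdefault("keywords", {})["title_length"] = [str(len(item["title"]))]
        # Not supported by a rerun: this title change would be ignored.
        item["title"] = item["title"].upper()
        # Exactly one item is returned: never None, never several items.
        return item


def apply_rerun_result(stored_item, pipelet_output):
    """Hypothetical model of the limitation: only keywords are taken over."""
    updated = copy.deepcopy(stored_item)
    updated["keywords"] = copy.deepcopy(pipelet_output.get("keywords", {}))
    return updated


stored = {"title": "Annual report", "keywords": {}}
output = TitleLengthPipelet().consume(copy.deepcopy(stored))
result = apply_rerun_result(stored, output)
print(result)  # {'title': 'Annual report', 'keywords': {'title_length': ['13']}}
```

The title stays unchanged in the result even though the pipelet modified it, which is the kind of silently ignored update described above.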