Pipeline Prioritization#

This page provides an overview of how to prioritize items within the Squirro pipeline.

Overview#

The Squirro pipeline supports prioritization of ingested data. This allows some data sets to be processed more quickly than others.

This becomes helpful when some data sources are more important than others. For example, data coming from a premium data provider might be more valuable than data obtained from a public RSS feed.

Priorities#

The pipeline supports three priority levels:

  • Low

  • Normal

  • High

Each of these priorities has its own processor, thus ensuring that the items from different priorities do not block each other.

Using Priorities#

Priorities can be defined in the data source and can be changed using the Flow (Change Pipeline) pipeline step.

Data Source#

image1

By default, all the data sources are created with the Normal priority level. It is possible to define the priority level during the creation of the data source and later by editing the data source.

The priority level can also be specified in the Data Loader CLI Tool using the --priority-level argument. See the Data Loader Reference section for more information.
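As a rough illustration of the --priority-level argument, the following Python sketch assembles a Data Loader command line without running it. The executable name squirro_data_load, the helper with_priority, and the lowercase values low, normal and high are assumptions made for this example; check the Data Loader Reference for the accepted spellings and the other required arguments, which are omitted here.

```python
import shlex

# Assumed spellings of the three priority levels; verify them against
# the Data Loader Reference before relying on them.
ASSUMED_LEVELS = ("low", "normal", "high")


def with_priority(base_command: list[str], priority_level: str) -> list[str]:
    """Return a copy of base_command with --priority-level appended.

    Hypothetical helper for illustration; it only builds the argument
    list and does not invoke the Data Loader.
    """
    if priority_level not in ASSUMED_LEVELS:
        raise ValueError(f"unknown priority level: {priority_level!r}")
    return [*base_command, "--priority-level", priority_level]


# "squirro_data_load" is an assumed executable name; all other
# arguments a real load needs are left out of this sketch.
command = with_priority(["squirro_data_load"], "high")
print(shlex.join(command))
```

Building the argument list separately makes it easy to review the final command before passing it to a process runner.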

When choosing the priority level of a data source, consider how valuable the data from this source is to you compared to the data from your other sources.

Change Pipeline#

The priority can also be changed when queuing work in a new workflow using the Flow (Change Pipeline) pipeline step.

This allows a setup where an initial pipeline workflow does the minimum work required to index the data. From that point on, the data is available and searchable for users. The more resource-intensive processing can then be deferred to a secondary pipeline workflow, which is invoked using the Change Pipeline step. To keep those steps from clogging up the initial processing of items, the Change Pipeline step can reduce the priority at this point.

image2
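The two-stage setup can be pictured with a small toy model. This is only a conceptual sketch, not Squirro code: the QueuedItem class, the change_pipeline function, and the workflow names are hypothetical and exist purely to show an item leaving the initial workflow and being re-queued for a secondary workflow at a lower priority.

```python
from dataclasses import dataclass

PRIORITIES = ("low", "normal", "high")


@dataclass(frozen=True)
class QueuedItem:
    """Hypothetical stand-in for an item waiting in a pipeline queue."""
    item_id: str
    workflow: str
    priority: str


def change_pipeline(item: QueuedItem, target_workflow: str, new_priority: str) -> QueuedItem:
    """Model re-queuing an item into another workflow at a new priority."""
    if new_priority not in PRIORITIES:
        raise ValueError(f"unknown priority: {new_priority!r}")
    return QueuedItem(item.item_id, target_workflow, new_priority)


# Stage 1: a minimal workflow indexes the item at normal priority.
initial = QueuedItem("doc-1", "minimal-indexing", "normal")

# Stage 2: heavier processing is deferred to a secondary workflow
# and demoted so it does not compete with newly arriving items.
deferred = change_pipeline(initial, "enrichment", "low")
print(deferred)
```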

Configuration#

The setup of the pipeline priorities can be configured in the Configuration Service. See the Configuring the Ingester for Prioritized Data Sources section later on this page for more information.

Monitoring#

To monitor how busy the different queues are, use the Monitoring Plugin in the Server space.

Configuring the Ingester for Prioritized Data Sources#

The Ingester service offers a set of configuration options that control how it handles the priorities set on the data batches that derive from the configured data sources.

To access these configuration options, navigate to the Server space and select Configuration. To find the options, type ingester in the search bar.

image3

The following table describes these configuration options.

| Name | Default | Description |
|---|---|---|
| ingester.priorities.optimize-processors-utilization | false | Enables or disables the optimization that allows processors of lower priorities to consume batches of higher priorities. For example, when this optimization is enabled (true), the processors in the low and normal pools consume high priority batches in addition to the batches of their designated priority. By default, this optimization is disabled. |
| ingester.priorities.pool-high-processors | 1 | Defines the number of processors in the pool that consumes high priority batches. |
| ingester.priorities.pool-normal-processors | 1 | Defines the number of processors in the pool that consumes normal priority batches. |
| ingester.priorities.pool-low-processors | 1 | Defines the number of processors in the pool that consumes low priority batches. |

Note: Changing the value of any of the above configuration options requires a restart of the Ingester service for the new value to take effect.
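The effect of ingester.priorities.optimize-processors-utilization can be sketched as a small lookup. This is a conceptual model rather than the Ingester's implementation: the function consumable_priorities is hypothetical, and it assumes one possible reading of the option, namely that each pool may take batches of its own priority and of every higher priority when the flag is enabled.

```python
# Priorities ordered from lowest to highest.
PRIORITY_ORDER = ["low", "normal", "high"]


def consumable_priorities(pool: str, optimize: bool) -> list[str]:
    """Return the batch priorities a processor pool may consume.

    Assumption: with the optimization on, a pool serves its own
    priority plus all higher ones; with it off, only its own.
    """
    index = PRIORITY_ORDER.index(pool)  # raises ValueError for unknown pools
    if not optimize:
        return [pool]
    return PRIORITY_ORDER[index:]


for pool in PRIORITY_ORDER:
    print(pool, consumable_priorities(pool, optimize=True))
```

Under this reading, enabling the flag never changes what the high pool consumes; it only lets otherwise idle low and normal processors help with more urgent batches.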