Scaling Pipelet Execution#

Profile: Project Creator, System Administrator

This page describes how to scale the execution of pipelets.

This is an advanced topic; apply these settings only if you have a large number of documents to process.

Typically, this type of work is performed by the project creator or a system administrator.

Note

The Ingester Service process forwards items, together with the configured pipelets, to the Plumber Service and waits for its response.
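To make the request/response nature of this exchange concrete, here is a minimal sketch; the endpoint URL, port, and payload fields are hypothetical placeholders and not the actual Plumber API.

import requests

def forward_batch_to_plumber(items, pipelet_config, timeout=300):
    # Send one batch of items plus the pipelet configuration and block
    # until the Plumber responds or the timeout expires.
    # URL and JSON shape are illustrative assumptions only.
    response = requests.post(
        "http://localhost:8100/process",
        json={"items": items, "pipelets": pipelet_config},
        timeout=timeout,
    )
    response.raise_for_status()
    return response.json()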

Plumber Bottleneck#

By default, only one Plumber worker process is spawned.

This can make pipeline processing inefficient, because a single slow step holds up the entire batch.

It also means the Ingester process may fail with a TimeoutError if the Plumber does not manage to respond in time.

That situation typically occurs with batches (default N=1000) that contain mostly large PDFs, combined with a pipelet step that performs computationally heavy, CPU-bound tasks (such as the NLP tagger).
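The pattern can be reproduced with plain Python (this is an illustration, not Squirro code): one worker grinds through a whole batch of CPU-bound work while the caller enforces a fixed time budget and gives up when it is exceeded.

import time
from concurrent.futures import ThreadPoolExecutor, TimeoutError as FuturesTimeout

def plumber_process(batch):
    # Stand-in for a CPU-bound pipelet step (e.g. NLP tagging of large PDFs).
    time.sleep(0.01 * len(batch))
    return [f"processed:{item}" for item in batch]

batch = [f"doc-{i}" for i in range(1000)]  # default batch size N=1000

with ThreadPoolExecutor(max_workers=1) as pool:  # a single worker
    future = pool.submit(plumber_process, batch)
    try:
        result = future.result(timeout=5)  # the caller waits, then gives up
    except FuturesTimeout:
        print("Timeout: the worker did not respond in time")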

Configuration to Increase Throughput#

Ingester Service#

The Ingester service can spawn multiple worker processes to parallelize the processing of batched steps like pipelet, language-detection, ml-workflow, etc.

One Ingester worker process consumes one batch and splits it into √len(batch_items) mini-batches to allow further parallelization and increase throughput.
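For example, a full batch of 1,000 items gives √1000 ≈ 31.6, so roughly 32 mini-batches of about 32 items each. A square-root split can be sketched as follows (illustrative only, not the actual Ingester code):

import math

def split_into_mini_batches(batch_items):
    # Split a batch into ~sqrt(N) mini-batches of ~sqrt(N) items each.
    n = len(batch_items)
    if n == 0:
        return []
    num_mini_batches = max(1, round(math.sqrt(n)))
    size = math.ceil(n / num_mini_batches)
    return [batch_items[i:i + size] for i in range(0, n, size)]

mini_batches = split_into_mini_batches(list(range(1000)))
print(len(mini_batches), len(mini_batches[0]))  # 32 mini-batches, 32 items in the first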

Those mini-batches are sent to the Plumber Service concurrently, using a ThreadPool with step_plumber_mini_batch_threads threads.
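Conceptually, the dispatch can be pictured like this; send_mini_batch_to_plumber is a hypothetical stand-in for the call to the Plumber Service, and the sketch is not the actual implementation:

from multiprocessing.pool import ThreadPool

step_plumber_mini_batch_threads = 2  # value taken from ingester.ini

def send_mini_batch_to_plumber(mini_batch):
    # Hypothetical stand-in for the request to the Plumber Service.
    return [f"processed:{item}" for item in mini_batch]

def process_batch(mini_batches):
    with ThreadPool(step_plumber_mini_batch_threads) as pool:
        results = pool.map(send_mini_batch_to_plumber, mini_batches)
    # Flatten the per-mini-batch results back into one batch.
    return [item for result in results for item in result]

From the Ingester's point of view this work is mostly waiting on the Plumber's responses, so threads (rather than extra processes) are sufficient for the fan-out.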

# /etc/squirro/ingester.ini
[pipeline]
step_plumber_mini_batch_threads = 2

Note: ingester.ini has a related setting, processor.workers. That setting is used only for pipeline steps that are executed in parallel using a thread pool (such as the webshot step).
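If you need to adjust that setting, it lives in the same file; the section name below is an assumption derived from the dotted key, mirroring the [pipeline] example above.

# /etc/squirro/ingester.ini (section name assumed from the dotted key)
[processor]
workers = 4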