Scaling Pipelet Execution#

Profile: Project Creator, System Administrator

This page describes how to scale the execution of pipelets.

This is an advanced topic and should only be used if you have a large number of documents to process.

Typically, this type of work is performed by the project creator or a system administrator.

Note

The Ingester Service process forwards items and the configured pipelet-configuration to the Plumber Service and waits for its response.

Bottleneck Plumber#

Previously, per default, one plumber worker process gets spawned.

That may lead to inefficient pipeline processing because of one slow step.

It also means the ingester process may fail with a timeout if the plumber doesn’t manage to respond in time (TimeoutError)

That situation usually happens for batches (default N=1000) that contain mostly large PDFs combined with a Pipelet-Step that performs computational heavy CPU-bound tasks (like the NLP tagger).

Configuration to Increase Throughput#

Ingester Service#

The Ingester service can spawn multiple worker processes to parallelize the processing of batched steps like pipelet, language-detection, ml-workflow, etc.

One Ingester worker process consumes one batch and splits it into √len(batch_items) mini-batches to allow further parallelization and increase throughput.

Those mini-batches are handled and sent concurrently to the Plumber Service, using a ThreadPool maintaining step_plumber_mini_batch_threads threads.

$ /etc/squirro/ingester.ini
[ingester]
processors = 2

[pipeline]
step_plumber_mini_batch_threads = 2

Note: ingester.ini has a related configuration processor.workers. The setting is used only for pipeline-steps that get executed in parallel using a thread-pool (like the webshot-step).

Plumber Service#

With the example configuration above, the Plumber Service should spawn 4 workers to have always enough resources ready to handle incoming mini-batches served by Ingester processes at any time

(ingester.processors x ingester.pipeline.step_plumber_mini_batch_threads | = plumber.server.max_spare = 4)

$ /etc/squirro/plumber.ini
[server]
fork = true
max_spare = 4

image2