Scaling Pipelet Execution#
Profile: Project Creator, System Administrator
This page describes how to scale the execution of pipelets.
This is an advanced topic; these steps are only needed if you have a large number of documents to process.
Typically, this type of work is performed by the project creator or a system administrator.
Note
The Ingester Service process forwards items, together with the configured pipelet configuration, to the Plumber Service and waits for its response.
Bottleneck Plumber#
Previously, only one Plumber worker process was spawned by default.
That can lead to inefficient pipeline processing, because a single slow step delays the entire batch.
It also means the Ingester process may fail with a timeout (TimeoutError) if the Plumber does not respond in time.
This situation usually occurs for batches (default N=1000) that consist mostly of large PDFs combined with a pipelet step performing computationally heavy, CPU-bound work (such as the NLP tagger).
Configuration to Increase Throughput#
Ingester Service#
The Ingester Service can spawn multiple worker processes to parallelize the processing of batched steps such as pipelet, language-detection, and ml-workflow.
One Ingester worker process consumes one batch and splits it into √len(batch_items)
mini-batches to allow further parallelization and increase throughput.
These mini-batches are sent to the Plumber Service concurrently, using a ThreadPool that maintains step_plumber_mini_batch_threads
threads.
/etc/squirro/ingester.ini
[pipeline]
step_plumber_mini_batch_threads = 2
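To make the mini-batching behavior concrete, here is a minimal Python sketch of the idea, assuming a batch is simply a list of items. The functions split_into_mini_batches and send_mini_batch are hypothetical stand-ins for the Ingester's internal logic, not the actual Squirro API:

import math
from concurrent.futures import ThreadPoolExecutor

def split_into_mini_batches(batch_items):
    # Split a batch into roughly sqrt(len(batch_items)) mini-batches,
    # so both the number of mini-batches and their size grow with sqrt(n).
    n = len(batch_items)
    if n == 0:
        return []
    size = math.ceil(n / math.isqrt(n))
    return [batch_items[i:i + size] for i in range(0, n, size)]

def send_mini_batch(mini_batch):
    # Hypothetical placeholder for forwarding one mini-batch to the Plumber Service.
    return len(mini_batch)

def process_batch(batch_items, step_plumber_mini_batch_threads=2):
    # Send the mini-batches concurrently, bounded by the configured thread count.
    mini_batches = split_into_mini_batches(batch_items)
    with ThreadPoolExecutor(max_workers=step_plumber_mini_batch_threads) as pool:
        return list(pool.map(send_mini_batch, mini_batches))

In this sketch, a default batch of 1000 items yields about 31 mini-batches of up to 33 items each, which the two configured threads work through in parallel.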
Note: ingester.ini
has a related configuration processor.workers
. That setting is used only for pipeline steps that are executed in parallel using a thread pool (such as the webshot step).
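For illustration only, and assuming the dotted name processor.workers maps to a [processor] section with a workers key in the same file, raising that worker count could look like the following (the value 4 is just an example):

/etc/squirro/ingester.ini
[processor]
workers = 4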