Scaling Pipelet Execution
The Ingester Service process forwards items and the configured pipelet configuration to the Plumber Service and waits for its response. Previously, only one Plumber worker process was spawned by default. That can make pipeline processing inefficient, because a single slow step blocks the whole batch. It also means the Ingester process may fail with a TimeoutError if the Plumber does not manage to respond in time. That situation typically occurs for batches (default size N=1000) that contain mostly large PDFs, combined with a pipelet step that performs computationally heavy, CPU-bound tasks (like the NLP tagger).
Configuration to Increase Throughput
The Ingester Service can spawn multiple worker processes to parallelize the processing of batched steps such as pipelet, language-detection, and ml-workflow. Each Ingester worker process consumes one batch and splits it into √len(batch_items) mini-batches to allow further parallelization and increase throughput. These mini-batches are sent concurrently to the Plumber Service using a thread pool whose size is set by step_plumber_mini_batch_threads:
/etc/squirro/ingester.ini:

```ini
[ingester]
processors = 2

[pipeline]
step_plumber_mini_batch_threads = 2
```
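The splitting and fan-out described above can be sketched as follows. This is a minimal illustration, not the actual Ingester implementation; the function names (`split_into_mini_batches`, `process_batch`) and the `send_to_plumber` callable are hypothetical.

```python
import math
from concurrent.futures import ThreadPoolExecutor


def split_into_mini_batches(batch_items):
    """Split a batch into ~sqrt(n) mini-batches, as described above."""
    n = len(batch_items)
    num_batches = max(1, int(math.sqrt(n)))
    size = math.ceil(n / num_batches)
    return [batch_items[i:i + size] for i in range(0, n, size)]


def process_batch(batch_items, send_to_plumber, threads=2):
    """Send mini-batches to the Plumber concurrently.

    `threads` plays the role of step_plumber_mini_batch_threads:
    it bounds how many mini-batches are in flight at once.
    """
    mini_batches = split_into_mini_batches(batch_items)
    with ThreadPoolExecutor(max_workers=threads) as pool:
        return list(pool.map(send_to_plumber, mini_batches))
```

For a default batch of 1000 items this yields 31 mini-batches of at most 33 items each, processed two at a time with the example configuration.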
ingester.ini also has a related configuration option, processor.workers. That setting applies only to pipeline steps that are executed in parallel using a thread pool (like the webshot step).
With the example configuration above, the Plumber Service should spawn four workers so that enough resources are always ready to handle the incoming mini-batches served by the Ingester processes (ingester.processors × ingester.pipeline.step_plumber_mini_batch_threads = plumber.server.max_spare = 2 × 2 = 4):
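The sizing rule can be expressed as a one-line calculation. The helper name is hypothetical; it simply restates the formula above.

```python
def plumber_max_spare(processors: int, mini_batch_threads: int) -> int:
    # Each Ingester worker can have `mini_batch_threads` mini-batches
    # in flight, so the Plumber needs processors * threads spare workers.
    return processors * mini_batch_threads


print(plumber_max_spare(2, 2))  # 4
```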
/etc/squirro/plumber.ini:

```ini
[server]
fork = true
max_spare = 4
```