Scaling Pipelet Execution
Profile: Project Creator, System Administrator
This page describes how to scale the execution of pipelets.
This is an advanced topic; these settings are only needed if you have a large number of documents to process.
Typically, this type of work is performed by the project creator or a system administrator.
The Ingester service process forwards items, together with the configured pipelet configuration, to the Plumber service and waits for its response.
Previously, by default, a single Plumber worker process was spawned.
This can make pipeline processing inefficient, because one slow step stalls the entire batch.
It also means the Ingester process may fail with a TimeoutError if the Plumber does not manage to respond in time.
That situation typically occurs for batches (default size N=1000) that contain mostly large PDFs, combined with a pipelet step that performs computationally heavy, CPU-bound tasks (such as the NLP tagger).
Configuration to Increase Throughput
The Ingester service can spawn multiple worker processes to parallelize the processing of batched steps such as
pipelet, language-detection, and ml-workflow.
Each Ingester worker process consumes one batch and splits it into
√len(batch_items) mini-batches to allow further parallelization and increase throughput.
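The square-root splitting described above can be sketched as follows. This is a minimal illustration of the idea, not the actual Ingester implementation; the function name is hypothetical.

```python
import math

def split_into_mini_batches(batch_items):
    """Split a batch into roughly sqrt(n) mini-batches.

    Hypothetical helper illustrating the splitting scheme described
    in the text; not the actual Ingester code.
    """
    n = len(batch_items)
    if n == 0:
        return []
    # Aim for sqrt(n) mini-batches, so each mini-batch also holds
    # roughly sqrt(n) items.
    num_batches = max(1, round(math.sqrt(n)))
    size = math.ceil(n / num_batches)
    return [batch_items[i:i + size] for i in range(0, n, size)]
```

For a default batch of 1000 items this yields about 32 mini-batches of up to 32 items each, which can then be processed concurrently.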
Those mini-batches are handled and sent concurrently to the Plumber service, using a thread pool that maintains processors * step_plumber_mini_batch_threads threads:

processors = 2
step_plumber_mini_batch_threads = 2
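A sketch of this concurrency model, assuming the two settings above: the pool size is the product of the two values, and send_to_plumber stands in for the HTTP call to the Plumber service (both names here are illustrative, not the real API).

```python
from concurrent.futures import ThreadPoolExecutor

# Assumed values mirroring the configuration above.
PROCESSORS = 2
STEP_PLUMBER_MINI_BATCH_THREADS = 2

def send_to_plumber(mini_batch):
    # Placeholder for the request sent to the Plumber service;
    # here it just reports the mini-batch size.
    return len(mini_batch)

def process_mini_batches(mini_batches):
    """Send mini-batches concurrently, using a pool of
    processors * step_plumber_mini_batch_threads threads."""
    pool_size = PROCESSORS * STEP_PLUMBER_MINI_BATCH_THREADS
    with ThreadPoolExecutor(max_workers=pool_size) as pool:
        # map() preserves the order of the input mini-batches.
        return list(pool.map(send_to_plumber, mini_batches))
```

With the defaults shown, up to four mini-batches are in flight at once; raising either setting increases throughput at the cost of more concurrent load on the Plumber service.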
ingester.ini has a related configuration option,
processor.workers. That setting is used only for pipeline steps that are executed in parallel using a thread pool (such as the webshot step).
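As a sketch, the option might appear in ingester.ini as follows; the value shown and its placement are assumptions, so check the defaults shipped with your installation:

```ini
; Hypothetical ingester.ini fragment; the value is an assumption.
processor.workers = 4
```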