Pipelets and the Dataloader


Profile: Project Creator

This page describes how to use pipelets with the Data Loader CLI Tool.

Project creators working from the command line can use the --pipelets-file and --pipelets-error-behavior arguments to specify pipelets and their behaviour.

Overview

The Data Loader CLI Tool can run multiple pipelets before or after the templating step.

Each pipelet is instantiated once per parallel process, but its consume method is executed once for each item.

Note: Opening files in the consume method of the pipelet is not recommended, because that method runs for every item. Open files in __init__() instead.
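The note above can be illustrated with a minimal sketch. It is not taken from the Squirro SDK: the class is a plain stand-in that only mirrors the __init__/consume shape of a pipelet (a real one would subclass squirro.sdk.PipeletV1, as in sample_pipelet.py), and the lookup_file config key and the country and region item fields are hypothetical names chosen for the example.

```python
import json
import os
import tempfile


class LookupPipelet(object):
    """Hypothetical pipelet-style class, not part of the Squirro SDK."""

    def __init__(self, config):
        self.config = config
        # Runs once per parallel process: read the file here.
        # 'lookup_file' is an assumed config key.
        with open(config['lookup_file']) as handle:
            self.lookup = json.load(handle)

    def consume(self, item):
        # Runs once per item: only use the data already held in memory.
        region = self.lookup.get(item.get('country'))
        if region is not None:
            item['region'] = region
        return item


# Usage with a throwaway lookup file.
with tempfile.NamedTemporaryFile('w', suffix='.json', delete=False) as tmp:
    json.dump({'CH': 'Europe'}, tmp)

pipelet = LookupPipelet({'lookup_file': tmp.name})
result = pipelet.consume({'title': 'Call', 'country': 'CH'})
os.remove(tmp.name)
print(result)
```

Because the file is read in __init__, the cost of opening it is paid once per process rather than once per item.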

--pipelets-file

A pipelet’s location, input arguments, and execution order are described in a JSON file, which is passed to the loader script with the --pipelets-file argument.

Reference: See Labels for more information.

--pipelets-error-behavior

The --pipelets-error-behavior argument specifies the job’s behaviour when a pipelet raises an exception.

The default is error, which stops the load when a pipelet fails. The other option is warning, which only logs a warning and continues the load.

squirro_data_load -v ^
    --token %token% ^
    --cluster %cluster% ^
    --project-id %project_id% ^
    --source-name csv_interactions ^
    --source-type csv ^
    --map-title InteractionSubject ^
    --source-file interaction.csv ^
    --facets-file config/sample_facets.json ^
    --body-template-file template/template_body_interaction.html ^
    --pipelets-error-behavior error ^
    --pipelets-file config/sample_pipelets.json

Note: The command above uses Windows syntax, with a circumflex ^ at the end of each line to continue the command on the next line. On Mac and Linux, use a backslash \ instead, and write variables as $token rather than %token%.
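The two --pipelets-error-behavior modes can be pictured with the following sketch. It is not the Data Loader's actual implementation: run_pipelets and Failing are hypothetical names, and passing an item on unchanged after a warning is an assumption made for the example.

```python
import logging

log = logging.getLogger('pipelet_sketch')


def run_pipelets(items, pipelets, error_behavior='error'):
    """Apply each pipelet's consume() to every item (illustrative only)."""
    processed = []
    for item in items:
        for pipelet in pipelets:
            try:
                item = pipelet.consume(item)
            except Exception as exc:
                if error_behavior == 'error':
                    # 'error': re-raise so the whole load stops.
                    raise
                # 'warning': log and carry on. Keeping the item
                # unchanged here is an assumption of this sketch.
                log.warning('Pipelet %s failed: %s',
                            type(pipelet).__name__, exc)
        processed.append(item)
    return processed


class Failing(object):
    """Hypothetical pipelet that always raises."""

    def consume(self, item):
        raise ValueError('bad item')


items = [{'title': 'a'}, {'title': 'b'}]
kept = run_pipelets(items, [Failing()], error_behavior='warning')
print(len(kept))
```

With error_behavior='error', the same call raises the ValueError on the first item and no items are returned.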

Configuration

Use the following pipelets config file:

config/sample_pipelets.json

{
    "Sample": {
        "file_location": "sample_pipelet.py",
        "config": {}
    }
}
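Before starting a load, it can help to check that the pipelets file parses as JSON and that each entry names a script. The sketch below does this for the sample configuration. describe_pipelets is a hypothetical helper, not part of the Data Loader, and treating file_location as required is an assumption based on the sample above.

```python
import json


def describe_pipelets(config_text):
    """Return (name, file_location, config) for each pipelet entry."""
    config = json.loads(config_text)
    summary = []
    for name, settings in config.items():
        # Assumption: every entry needs a 'file_location'.
        if 'file_location' not in settings:
            raise ValueError('Pipelet %r has no file_location' % name)
        summary.append(
            (name, settings['file_location'], settings.get('config', {}))
        )
    return summary


# Same content as config/sample_pipelets.json.
SAMPLE = '''
{
    "Sample": {
        "file_location": "sample_pipelet.py",
        "config": {}
    }
}
'''

summary = describe_pipelets(SAMPLE)
print(summary)
```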

In the Python pipelet script, the consume method appends the name Squirro to the title of each item:

sample_pipelet.py

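import squirro.sdk

VERSION = '0.0.1'


@squirro.sdk.require('log')
class Sample(squirro.sdk.PipeletV1):
    """Appends ' - Squirro' to the title of every item."""

    def __init__(self, config):
        # Called once per parallel process.
        self.config = config

    def consume(self, item):
        # Called once per item; a missing title is treated as an empty string.
        item['title'] = item.get('title', '') + ' - Squirro'
        return item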

The result of the example pipelet is shown below:
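The effect of the consume logic can also be checked locally without running a load. The sketch below copies the two lines of consume from sample_pipelet.py into a standalone function so that it runs without squirro.sdk installed. The item contents are made up for illustration.

```python
def consume(item):
    # Same logic as Sample.consume in sample_pipelet.py.
    item['title'] = item.get('title', '') + ' - Squirro'
    return item


with_title = consume({'title': 'Quarterly review call'})
without_title = consume({'body': 'no title field'})

print(with_title)
print(without_title)
```

An item without a title ends up with the title ' - Squirro', since the missing value defaults to an empty string.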