How to Use Pipelets with the Squirro Data Loader

Note

This functionality will be deprecated soon. Please consider using server-side pipelets directly in the Squirro pipeline, instead of relying on data loader pipelets. If you have a use case that seems unsupported with server-side pipelets, please contact us.

This example shows how pipelets can be used by the data loader to enrich data before it is uploaded to a Squirro server.

What is a pipelet? Pipelets are plugins to the Squirro pipeline, used to customize how data is processed. For more information on pipelets, see the Pipelets Tutorial or the Pipelets documentation.

Pipelet Configuration Basics

To use a pipelet with the data loader, the data load script must reference a pipelet configuration file. This file specifies which pipelets to include, where the source code for each pipelet is located, when each pipelet should run, and any configuration the pipelet requires.

{
    "CompanySizePipelet": {
        "file_location":"company_size.py",
        "stage":"before templating",
        "config": {}
    }
}

Let’s walk through this configuration one line at a time. The key of the pipelet configuration is the name of the class implemented in the pipelet file, which inherits from the PipeletV1 base class of the Squirro SDK. The key must match the class definition in the pipelet file exactly. In this case the key is:

"CompanySizePipelet":

This is because the class defined in the pipelet file is:

class CompanySizePipelet(PipeletV1):

The next step is to point the data loader to the pipelet’s source Python file. This is typically done using a path relative to the location of the pipelets.json file.

"file_location":"company_size.py"

Next, we want to tell the data loader when to run the pipelet within the load process. Our options here are either before templating or after templating. Typically, you run a pipelet before templating if you want to have the results of that enrichment available for use in creating the title or body template.

"stage":"before templating"

Finally, we can pass in any configuration the pipelet requires. This is typically where confidential information such as API keys or tokens is supplied. In this case, the pipelet needs no configuration, so we pass an empty object.

"config": {}

For more information on pipelet configuration files for the Squirro data loader, see the data loader pipelet configuration reference in the documentation.

Test Data Set

We will continue to use our test CSV data set from the previous examples. The data set looks like this:

Test Data

id   company     ticker   ipo_date              number_employees   link
1    Apple       AAPL     1980-12-12T00:00:00   116000             https://finance.yahoo.com/quote/AAPL
2    Google      GOOG     2004-08-19T00:00:00   73992              https://finance.yahoo.com/quote/GOOG
3    Microsoft   MSFT     1986-03-13T00:00:00   120849             https://finance.yahoo.com/quote/MSFT
4    Amazon      AMZN     1997-05-15T00:00:00   341400             https://finance.yahoo.com/quote/AMZN
5    Intel       INTC     1978-01-13T00:00:00   106000             https://finance.yahoo.com/quote/INTC

Test Pipelet

Our test pipelet is the company size pipelet shown below. It classifies each company as small, medium, large, or huge based on its number of employees.

"""
This pipelet adds a facet with the company
size based on the number of employees
"""

from squirro.sdk import PipeletV1, require

@require('log')
class CompanySizePipelet(PipeletV1):

    def __init__(self, config):

        self.config = config

    def consume(self, item):

        # Keyword values are lists of strings, so take the first entry
        number_of_employees = item['keywords']['number_employees'][0]
        number_of_employees = int(number_of_employees)

        if number_of_employees >= 250000:
            company_size = 'Huge'

        elif number_of_employees >= 100000:
            company_size = 'Large'

        elif number_of_employees >= 50000:
            company_size = 'Medium'

        else:
            company_size = 'Small'

        item['keywords']['company_size'] = [company_size]

        return item
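
Before wiring the pipelet into a load script, it can be useful to exercise it locally. The following is a minimal sketch that assumes the Squirro SDK is installed (so the squirro.sdk import in company_size.py resolves) and that company_size.py is in the current directory; the hand-built item only mimics what the data loader produces for the first CSV row:

from company_size import CompanySizePipelet

# Minimal stand-in for the item the data loader builds from the first CSV row;
# keyword values arrive as lists of strings.
item = {
    'title': 'Apple',
    'keywords': {
        'company': ['Apple'],
        'number_employees': ['116000'],
    },
}

pipelet = CompanySizePipelet(config={})
enriched = pipelet.consume(item)

# Prints ['Large'], since 116000 is at least 100000 but below 250000
print(enriched['keywords']['company_size'])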

Constructing a Load Script

To use a pipelet with the data loader, reference the pipelet configuration file from your load script with the --pipelets-file option:

--pipelets-file 'pipelets.json' \
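
For context, a complete invocation might look roughly like the sketch below. The file names, source name, and mappings are illustrative and should be adapted to match the load script from the previous examples; only the --pipelets-file option is new here, and the other options may differ in your environment:

squirro_data_load -v \
    --cluster "$CLUSTER" \
    --token "$TOKEN" \
    --project-id "$PROJECT_ID" \
    --source-type csv \
    --source-file 'companies.csv' \
    --source-name 'Company Data' \
    --map-title 'company' \
    --map-id 'id' \
    --facets-file 'facets.json' \
    --pipelets-file 'pipelets.json'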