# How to Use Pipelets With the Squirro Data Loader

> **Note:** This example shows how pipelets can be used by the data loader to enrich data before it is uploaded to a Squirro server.

## What is a Pipelet?

Pipelets are plugins to the Squirro pipeline, used to customize how data is processed. For more information on pipelets, see Pipelets Tutorial or the Pipelets documentation.

## Pipelet Configuration Basics

To use a pipelet with the data loader, the data load script must reference a pipelet configuration file. This pipelet configuration file specifies which pipelets will be included, where the source code for each pipelet can be found, when the pipelets should be run, and any configuration required by the pipelet.

```json
{
    "CompanySizePipelet": {
        "file_location": "company_size.py",
        "stage": "before templating",
        "config": {}
    }
}
```


The key of each pipelet configuration entry is the name of the class that the pipelet file implements, which inherits from the PipeletV1 base class of the Squirro SDK. This key must exactly match the class definition within the pipelet file. In this case:

```json
"CompanySizePipelet":
```


This matches because the pipelet class in the pipelet file is defined as:

```python
class CompanySizePipelet(PipeletV1):
```


The next step is to point the data loader to the source Python file for the pipelet. This is typically done using a relative path from the location of the pipelets.json file:

```json
"file_location": "company_size.py"
```


Next, tell the data loader when to run the pipelet within the load process. The options are `before templating` and `after templating`. Typically, you run a pipelet before templating if you want the results of that enrichment to be available when building the title or body template:

```json
"stage": "before templating"
```


Finally, pass in any configuration required by the pipelet. This is typically where confidential information such as API keys or tokens is provided. In this case, the pipelet requires no configuration, so an empty object is passed:

```json
"config": {}
```

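If a pipelet did require configuration, the values under `"config"` in pipelets.json would be passed to the pipelet's `__init__` method as a dict. The sketch below illustrates this with a hypothetical `api_key` entry (not part of this example); the class name and key are made up for illustration:

```python
# Hypothetical pipelet showing how values from the "config" object in
# pipelets.json reach the pipelet. With
#     "config": {"api_key": "secret-token"}
# the data loader passes that dict to __init__.

class EnrichmentPipelet:  # a real pipelet would inherit from PipeletV1
    def __init__(self, config):
        self.config = config
        # Read a configured value, with a fallback if it is missing
        self.api_key = config.get('api_key', '')

    def consume(self, item):
        # Tag the item with the configured value (illustration only)
        item.setdefault('keywords', {})['enriched_with'] = [self.api_key]
        return item

pipelet = EnrichmentPipelet({'api_key': 'secret-token'})
item = pipelet.consume({'keywords': {}})
print(item['keywords']['enriched_with'])  # ['secret-token']
```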

## Test Data Set

We will continue to use our test CSV data set from the previous examples. The data set looks like this:

Test Data

| id | company   | ticker | ipo_date            | number_employees | link                                 |
|----|-----------|--------|---------------------|------------------|--------------------------------------|
| 1  | Apple     | AAPL   | 1980-12-12T00:00:00 | 116000           | https://finance.yahoo.com/quote/AAPL |
| 2  | Google    | GOOG   | 2004-08-19T00:00:00 | 73992            | https://finance.yahoo.com/quote/GOOG |
| 3  | Microsoft | MSFT   | 1986-03-13T00:00:00 | 120849           | https://finance.yahoo.com/quote/MSFT |
| 4  | Amazon    | AMZN   | 1997-05-15T00:00:00 | 341400           | https://finance.yahoo.com/quote/AMZN |
| 5  | Intel     | INTC   | 1978-01-13T00:00:00 | 106000           | https://finance.yahoo.com/quote/INTC |
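When the data loader ingests this CSV, each row becomes an item whose column values end up in `item['keywords']` as lists of strings, which is why the pipelet reads `item['keywords']['number_employees'][0]`. A minimal sketch of that mapping, using a couple of rows inlined so it runs standalone:

```python
import csv
import io

# Two rows from the test data set, inline so the sketch is runnable
# (in the real load, the data loader reads the CSV file directly).
CSV_DATA = """id,company,ticker,ipo_date,number_employees
1,Apple,AAPL,1980-12-12T00:00:00,116000
3,Microsoft,MSFT,1986-03-13T00:00:00,120849
"""

def rows_to_items(fileobj):
    """Mimic how CSV rows become items: every column value is stored
    as a single-element list of strings under 'keywords'."""
    items = []
    for row in csv.DictReader(fileobj):
        items.append({'keywords': {k: [v] for k, v in row.items()}})
    return items

items = rows_to_items(io.StringIO(CSV_DATA))
print(items[0]['keywords']['number_employees'])  # ['116000'] -- a list of strings
```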

## Test Pipelet

Our test pipelet is the company-size pipelet provided as an example. This pipelet classifies each company as small, medium, large, or huge based on its number of employees.

```python
"""
This pipelet adds a facet with the company
size based on the number of employees
"""

from squirro.sdk import PipeletV1, require


@require('log')
class CompanySizePipelet(PipeletV1):

    def __init__(self, config):
        self.config = config

    def consume(self, item):
        number_of_employees = item['keywords']['number_employees'][0]
        number_of_employees = int(number_of_employees)

        if number_of_employees >= 250000:
            company_size = 'Huge'
        elif number_of_employees >= 100000:
            company_size = 'Large'
        elif number_of_employees >= 50000:
            company_size = 'Medium'
        else:
            company_size = 'Small'

        item['keywords']['company_size'] = [company_size]
        return item
```
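The threshold logic can be sanity-checked without a Squirro installation by copying it into a standalone function; `classify_company_size` below is a local helper for this check, not part of the SDK:

```python
def classify_company_size(number_of_employees):
    """Standalone copy of the pipelet's threshold logic, so the
    classification can be verified without the Squirro SDK installed."""
    if number_of_employees >= 250000:
        return 'Huge'
    elif number_of_employees >= 100000:
        return 'Large'
    elif number_of_employees >= 50000:
        return 'Medium'
    return 'Small'

# Check each company from the test data set against its expected size
expected_sizes = [(116000, 'Large'),   # Apple
                  (73992, 'Medium'),   # Google
                  (120849, 'Large'),   # Microsoft
                  (341400, 'Huge'),    # Amazon
                  (106000, 'Large')]   # Intel
for employees, expected in expected_sizes:
    assert classify_company_size(employees) == expected
```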


To include the pipelet in a data load, reference the pipelet configuration file from the data load script by adding the `--pipelets-file` argument to the data loader command:

```bash
--pipelets-file 'pipelets.json' \
```