Labels Tutorial#

This tutorial imports an extract of the Panama Papers into Squirro.

Its focus is on how to model and use the labels in this context.

Setup#

Installation#

To get started, make sure the Squirro Toolbox is installed.

Reference: See Squirro Toolbox to familiarize yourself with how to execute the commands contained in the Squirro Toolbox.

Folder Setup#

For this tutorial, create a new folder on your file system and use the command line to change into this folder. On Windows this is achieved with the following procedure:

  1. Start the “Command Prompt” from the Start menu.

  2. Type cd FOLDER where FOLDER is the path to the folder where you are executing the tutorial.

Download Example Data#

Download /entities.zip and extract it into the tutorial folder.

This file is just a small extract of the real data.

The complete file can be downloaded directly from the original website.

Set Up Connection#

The data loader connects with a Squirro instance. For this, three pieces of information are required:

  • cluster

  • token

  • project-id

Reference: See Connecting to Squirro to learn more about how to get these values.

Once you have this information, go to the command prompt window and type the following commands:

Setting variables

set CLUSTER="..."
set TOKEN="..."
set PROJECT_ID="..."

Replace all the “…” placeholders with their actual values.

Initial Import#

You can use the Data Loader CLI Tool to import the data into Squirro. This requires a bit of setup for an initial import without any labels.

This is the data loader command to execute for this:

squirro_data_load -v ^
    --cluster %CLUSTER% --token %TOKEN% --project-id %PROJECT_ID% ^
    --source-name Entities ^
    --bulk-index ^
    --source-type csv --source-file Entities.csv ^
    --map-id node_id ^
    --map-title name

Note: The lines have been wrapped with the circumflex (^) at the end of each line. On Mac and Linux you will need to use backslash (\) instead.

This command imports the Entities.csv file into the Squirro project without specifying any labels. The result is not yet very satisfactory:

image1
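
Because the line-continuation character differs between shells, it can be convenient to assemble the same command in Python and let the subprocess module pass the arguments. The following is a minimal sketch and not part of the Squirro Toolbox: build_load_command and run_initial_import are hypothetical helpers, and the sketch assumes the three connection values are available as environment variables named CLUSTER, TOKEN and PROJECT_ID. It uses only the flags shown in the command above.

```python
import os
import subprocess


def build_load_command(cluster, token, project_id, source_file="Entities.csv"):
    """Return the initial-import command as an argument list (hypothetical helper)."""
    return [
        "squirro_data_load", "-v",
        "--cluster", cluster,
        "--token", token,
        "--project-id", project_id,
        "--source-name", "Entities",
        "--bulk-index",
        "--source-type", "csv",
        "--source-file", source_file,
        "--map-id", "node_id",
        "--map-title", "name",
    ]


def run_initial_import(env=os.environ):
    """Run the import; assumes CLUSTER, TOKEN and PROJECT_ID are set in env."""
    command = build_load_command(env["CLUSTER"], env["TOKEN"], env["PROJECT_ID"])
    # check=True raises CalledProcessError if the data loader exits with an error
    subprocess.run(command, check=True)
```

Calling run_initial_import() from the tutorial folder then behaves like the command line shown above, without any shell-specific line wrapping.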

Specify Labels#

The next step is to load some of the structured columns as labels.

To do so, first create a facets.json file in the project folder as follows:

facets.json

{
    "incorporation_date": {
        "data_type": "datetime",
        "input_format_string": "%d-%b-%Y"
    },
    "inactivation_date": {
        "data_type": "datetime",
        "input_format_string": "%d-%b-%Y"
    },
    "struck_off_date": {
        "data_type": "datetime",
        "input_format_string": "%d-%b-%Y"
    },
    "status": {"name": "status"},
    "service_provider": {"name": "service_provider"},
    "country_codes": {
        "name": "country_code",
        "delimiter": ";"
    },
    "countries": {
        "name": "country",
        "delimiter": ";"
    },
    "jurisdiction": {"name": "jurisdiction"},
    "jurisdiction_description": {"name": "jurisdiction_description"},
    "address": {"name": "address"}
}

This facets configuration file configures the data loader to import the indicated columns into Squirro.

A few notes:

  • The three columns incorporation_date, inactivation_date and struck_off_date are treated as datetime labels. The input_format_string tells the data loader how to parse the dates as they appear in the file: day, abbreviated month name and four-digit year, separated by hyphens.

  • A few columns are simply copied from the CSV to Squirro without any modifications.

  • The two columns country_codes and countries are renamed to singular names. Most labels in Squirro use singular names as that makes more sense when selecting a value in the label dropdown.
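
To make the two parsing options concrete, the following sketch reproduces them with the Python standard library. It assumes that input_format_string follows Python's strptime directives (which the %d-%b-%Y pattern suggests) and that delimiter splits one cell into several label values. The helper names are hypothetical, and the sample values are invented for illustration rather than taken from Entities.csv.

```python
from datetime import datetime

INPUT_FORMAT = "%d-%b-%Y"  # same pattern as input_format_string in facets.json


def parse_date(value):
    """Parse a date such as '23-Mar-2006' using the facets.json pattern."""
    return datetime.strptime(value, INPUT_FORMAT)


def split_values(cell, delimiter=";"):
    """Split a multi-value cell into individual label values."""
    return cell.split(delimiter) if cell else []


# Invented sample values, for illustration only.
print(parse_date("23-Mar-2006"))           # 2006-03-23 00:00:00
print(split_values("Switzerland;Panama"))  # ['Switzerland', 'Panama']
```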

Now this labels configuration file can be used in the data loader command (adding one line at the end):

squirro_data_load -v ^
    --cluster %CLUSTER% --token %TOKEN% --project-id %PROJECT_ID% ^
    --source-name Entities ^
    --bulk-index ^
    --source-type csv --source-file Entities.csv ^
    --map-id node_id ^
    --map-title name ^
    --facets-file facets.json

After executing this command, the Squirro project looks a lot more promising:

image2

Managing labels#

The labels can be customized in the labels management interface (in the Data tab, then Labels).

Here you can add some flavor by setting display names and grouping.

The following initial screen shows the labels as they were created based on the facets.json above.

image3

This can be modified label by label to result in this screen:

image4

The labels have now been grouped into some custom groups and the date labels are no longer shown in the typeahead.

Date labels take an additional configuration, which is the format. The following screenshot illustrates this for one example:

image5

Using Labels#

Searching#

The labels are shown in the search with type-ahead functionality.

image7

Dashboards#

Dashboards can be built using these labels. One example is shown below:

Visualization of Incorporation Dates#

image9

Managing Labels from Data Loader#

Using the user interface to manage the labels can quickly become cumbersome for larger data imports.

So, going forward, we’ll manage the grouping in the facets.json file.

{
    "incorporation_date": {
        "display_name": "Incorporation Date",
        "group_name": "Dates",
        "data_type": "datetime",
        "typeahead": false,

        "input_format_string": "%d-%b-%Y"
    },
    "inactivation_date": {
        "display_name": "Inactivation Date",
        "group_name": "Dates",
        "data_type": "datetime",
        "typeahead": false,

        "input_format_string": "%d-%b-%Y"
    },
    "struck_off_date": {
        "display_name": "Struck-off Date",
        "group_name": "Dates",
        "data_type": "datetime",
        "typeahead": false,

        "input_format_string": "%d-%b-%Y"
    },
    "status": {
        "display_name": "Status",
        "group_name": "Company"
    },
    "service_provider": {
        "display_name": "Service Provider",
        "group_name": "Company"
    },
    "country_codes": {
        "name": "country_code",
        "display_name": "Country Code",
        "group_name": "Geography",

        "delimiter": ";"
    },
    "countries": {
        "name": "country",
        "display_name": "Country",
        "group_name": "Geography",

        "delimiter": ";"
    },
    "jurisdiction": {
        "display_name": "Jurisdiction Code",
        "group_name": "Geography"
    },
    "jurisdiction_description": {
        "display_name": "Jurisdiction",
        "group_name": "Geography"
    },
    "address": {
        "display_name": "Address",
        "group_name": "Geography"
    }
}

Note: Observant readers will notice that the display format of the date labels, which was configured in the label management interface earlier, is not specified here. It cannot be set through the data loader facets.json file.

Run the import with this facets.json file and all the labels will have the right display name, groups, etc.
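
Hand-edited configuration files easily pick up a missing or stray comma. As a hedged sketch, the check below parses the content of a facets file with Python's strict json module and lists the label name each column maps to. It assumes that a strict parser is an acceptable stand-in for the data loader's own parsing, and that a missing name key means the column name is reused as the label name, as the entries above suggest. The function name summarize_facets is hypothetical.

```python
import json


def summarize_facets(text):
    """Parse facets.json content and map each column to its label name."""
    try:
        facets = json.loads(text)
    except json.JSONDecodeError as err:
        raise ValueError(f"facets file is not strict JSON: {err}") from err
    # Assumption: without an explicit "name", the column name is reused.
    return {column: config.get("name", column) for column, config in facets.items()}


sample = """
{
    "status": {"display_name": "Status", "group_name": "Company"},
    "countries": {"name": "country", "delimiter": ";"}
}
"""
print(summarize_facets(sample))  # {'status': 'status', 'countries': 'country'}
```

To check the file on disk, pass it the file's text, for example the result of pathlib.Path("facets.json").read_text().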

Searchable Labels#

When searching for “Switzerland”, the type-ahead will provide some useful output. But when you then ignore the type-ahead and execute the search, no results will come back.

That’s because none of the labels are searchable and the documents don’t contain much data in their titles and bodies.

image10

To remedy this, we’ll change the address label to be searchable. There are two ways of doing this:

  1. Use the admin interface and change the address field to searchable.

  2. Use the facets.json file to declare a field as being searchable.

    "address": {
        "display_name": "Address",
        "group_name": "Geography",
        "searchable": true
    }
    

Independent of the way you choose, the search will now return the relevant results:

image11
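
If the facets file is maintained by a script, the second option can also be applied programmatically. The following is a minimal sketch, assuming the facets configuration has been loaded as a Python dictionary; make_searchable is a hypothetical helper and not part of the Squirro tooling.

```python
import json


def make_searchable(facets, column):
    """Return a copy of the facets mapping with `column` marked as searchable."""
    updated = {name: dict(config) for name, config in facets.items()}
    if column not in updated:
        raise KeyError(f"no facet configured for column {column!r}")
    updated[column]["searchable"] = True
    return updated


facets = {"address": {"display_name": "Address", "group_name": "Geography"}}
print(json.dumps(make_searchable(facets, "address"), indent=4))
```

The helper works on a copy, so the original mapping stays unchanged until the result is written back to facets.json.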

Numeric Labels#

The data set does not contain any numeric labels, so this section creates one with a pipelet.

With it, you can calculate the survival time of all companies that have closed and store that as a label.

Importing#

This requires one preparatory change followed by three steps.

First, create a new label survival_time with the data type int. If this is not done, any int values will be rejected and the indexing will not work.

You can do this in the user interface or with the following snippet for facets.json:

"survival_time": {
    "group_name": "Dates",
    "data_type": "int",
    "typeahead": false
}

Then carry out the following steps:

  1. Create a pipelet and store it in survival_time.py. The following example will get us going:

    survival_time.py

    from datetime import datetime

    from squirro.sdk import PipeletV1, require

    # Format of the datetime label values parsed below
    DATE_FORMAT = '%Y-%m-%dT%H:%M:%S'


    @require('log')
    class SurvivalTimePipelet(PipeletV1):
        """Store the number of days a company existed in survival_time."""

        def consume(self, item):
            kw = item.setdefault('keywords', {})

            started = kw.get('incorporation_date')
            # Either date is taken as the end of the company's life
            ended = kw.get('inactivation_date') or kw.get('struck_off_date')
            if not started or not ended:
                # Ignore items without a start or end date
                return item

            # Label values are lists; use the first value of each
            started = datetime.strptime(started[0], DATE_FORMAT)
            ended = datetime.strptime(ended[0], DATE_FORMAT)

            if ended <= started:
                # Protect against invalid ranges
                return item

            # Convert the difference to whole days
            survival_time = int((ended - started).total_seconds() / 86400)
            kw['survival_time'] = [survival_time]
            return item
    
  2. Create a pipelets.json configuration file to use this pipelet in the data loader:

    pipelets.json

    {
        "SurvivalTimePipelet": {
            "file_location": "survival_time.py",
            "stage": "after templating"
        }
    }
    
  3. Finally, re-execute the data load step with this command:

    squirro_data_load -v ^
        --cluster %CLUSTER% --token %TOKEN% --project-id %PROJECT_ID% ^
        --source-name Entities ^
        --bulk-index ^
        --source-type csv --source-file Entities.csv ^
        --map-id node_id ^
        --map-title name ^
        --map-created-at incorporation_date ^
        --facets-file facets.json ^
        --pipelets-file pipelets.json
    

Compared to the previous command, only the last line is needed to enable the pipelet: it points the data loader to the pipelets.json file.

As a bonus, the command also adds the --map-created-at flag, which results in a better output on the search screen.
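
The pipelet's calculation can also be tried out locally, without the Squirro SDK. The sketch below repeats the logic as a plain function. The function name survival_days and the sample values are invented for illustration, and the date strings assume the same format as DATE_FORMAT in the pipelet.

```python
from datetime import datetime

DATE_FORMAT = "%Y-%m-%dT%H:%M:%S"


def survival_days(keywords):
    """Return the survival time in whole days, or None if it cannot be computed."""
    started = keywords.get("incorporation_date")
    ended = keywords.get("inactivation_date") or keywords.get("struck_off_date")
    if not started or not ended:
        return None
    start = datetime.strptime(started[0], DATE_FORMAT)
    end = datetime.strptime(ended[0], DATE_FORMAT)
    if end <= start:
        # Same guard as the pipelet: skip invalid ranges
        return None
    return (end - start).days


# Invented sample values, for illustration only.
sample = {
    "incorporation_date": ["2010-01-01T00:00:00"],
    "struck_off_date": ["2011-01-01T00:00:00"],
}
print(survival_days(sample))  # 365
print(survival_days({"incorporation_date": ["2010-01-01T00:00:00"]}))  # None
```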

Using#

The data can now be used in a number of ways.

For example, which companies closed after less than a year?

image12

Or, what is the average survival time of companies by country?

image13
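
These two questions correspond to a filter and a grouped average over the survival_time label. As an illustration of the arithmetic only (the dashboard widgets do this inside Squirro), here is a sketch over a few invented items that mimic the keywords structure used by the pipelet. The company names and numbers are made up.

```python
from collections import defaultdict

# Invented items for illustration; real items come from the Squirro project.
items = [
    {"title": "Alpha Ltd", "keywords": {"country": ["Panama"], "survival_time": [200]}},
    {"title": "Beta SA", "keywords": {"country": ["Panama"], "survival_time": [1000]}},
    {"title": "Gamma AG", "keywords": {"country": ["Switzerland"], "survival_time": [730]}},
]

# Companies that closed after less than a year.
short_lived = [i["title"] for i in items if i["keywords"]["survival_time"][0] < 365]

# Average survival time per country; an item counts once per country value.
by_country = defaultdict(list)
for item in items:
    for country in item["keywords"]["country"]:
        by_country[country].append(item["keywords"]["survival_time"][0])
averages = {country: sum(days) / len(days) for country, days in by_country.items()}

print(short_lived)  # ['Alpha Ltd']
print(averages)     # {'Panama': 600.0, 'Switzerland': 730.0}
```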

This particular label is not very useful in the label drill-down. So after using it in the dashboard widget, you may want to hide it by unchecking the “Visible” checkbox.

Conclusion#

This concludes the labels tutorial. Based on an extract of the Panama Papers, we experimented with the various label types and how they can be used in dashboards and search.