Labels Tutorial#
This tutorial imports an extract of the Panama Papers into Squirro.
Its focus is on how to model and use the labels in this context.
Setup#
Installation#
To get started, make sure the Squirro Toolbox is installed.
Reference: See Squirro Toolbox to familiarize yourself with how to execute the commands contained in the Squirro Toolbox.
Folder Setup#
For this tutorial, create a new folder on your file system and use the command line to go into this folder. On Windows, this is achieved with the following procedure:
Start the “Command Prompt” from the Start menu.
Type
cd FOLDER
where FOLDER is the path to the folder where you are executing the tutorial.
Download Example Data#
Download /entities.zip
and extract it into the tutorial folder.
This file is just a small extract of the real data.
The complete file can be downloaded directly from the original website.
Set Up Connection#
The data loader connects with a Squirro instance. For this, three pieces of information are required:
cluster
token
project-id
Reference: See Connecting to Squirro to learn more about how to get these values.
Once you have this information, go to the command prompt window and type the following commands:
Setting variables
set CLUSTER="..."
set TOKEN="..."
set PROJECT_ID="..."
Replace each “…” placeholder with its actual value.
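Because every later command relies on these three variables, it can help to confirm they are set before starting an import. The following is a minimal Python sketch and not part of the Squirro Toolbox; the helper name `missing_settings` is hypothetical, and the only thing it assumes is the variable names used above.

```python
import os

# Variable names taken from the "set" commands above.
REQUIRED = ("CLUSTER", "TOKEN", "PROJECT_ID")


def missing_settings(env=None):
    """Return the names of required variables that are unset or empty."""
    env = os.environ if env is None else env
    return [name for name in REQUIRED if not env.get(name, "").strip()]


if __name__ == "__main__":
    missing = missing_settings()
    if missing:
        print("Missing settings: " + ", ".join(missing))
    else:
        print("All connection settings are present.")
```

Run it from the same command prompt window, because variables defined with `set` are only visible to that session and the programs started from it.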
Initial Import#
You can use the Data Loader CLI Tool to import the data into Squirro. This requires a bit of setup for an initial import without any labels.
This is the data loader command to execute for this:
squirro_data_load -v ^
--cluster %CLUSTER% --token %TOKEN% --project-id %PROJECT_ID% ^
--source-name Entities ^
--bulk-index ^
--source-type csv --source-file Entities.csv ^
--map-id node_id ^
--map-title name
Note: The lines have been wrapped with the circumflex (^) at the end of each line. On Mac and Linux you will need to use a backslash (\) instead.
This command imports the Entities.csv file into the Squirro project without specifying any labels. The result is not yet very satisfactory:
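Since the line-continuation character differs between Windows and Mac/Linux, one way to sidestep the issue is to assemble the same arguments as a list in Python. This is only a sketch: the function name `build_initial_import_command` is hypothetical, the flags are exactly those of the command above, and it assumes the three connection variables are set in the environment.

```python
import os


def build_initial_import_command(env=None):
    """Build the argument list for the initial import (no labels yet)."""
    env = os.environ if env is None else env

    def setting(name):
        # "set NAME="value"" on Windows stores the quotes as part of the
        # value, so strip them before passing the value on.
        return env[name].strip('"')

    return [
        "squirro_data_load", "-v",
        "--cluster", setting("CLUSTER"),
        "--token", setting("TOKEN"),
        "--project-id", setting("PROJECT_ID"),
        "--source-name", "Entities",
        "--bulk-index",
        "--source-type", "csv",
        "--source-file", "Entities.csv",
        "--map-id", "node_id",
        "--map-title", "name",
    ]
```

Passing the resulting list to `subprocess.run(..., check=True)` should then start the import on any platform, provided the Squirro Toolbox is installed and `squirro_data_load` can be found on the `PATH`.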
Specify Labels#
The next step is to load some of the structured columns as labels.
To do so, first create a facets.json file in the project folder as follows:
facets.json
{
    "incorporation_date": {
        "data_type": "datetime",
        "input_format_string": "%d-%b-%Y"
    },
    "inactivation_date": {
        "data_type": "datetime",
        "input_format_string": "%d-%b-%Y"
    },
    "struck_off_date": {
        "data_type": "datetime",
        "input_format_string": "%d-%b-%Y"
    },
    "status": {"name": "status"},
    "service_provider": {"name": "service_provider"},
    "country_codes": {
        "name": "country_code",
        "delimiter": ";"
    },
    "countries": {
        "name": "country",
        "delimiter": ";"
    },
    "jurisdiction": {"name": "jurisdiction"},
    "jurisdiction_description": {"name": "jurisdiction_description"},
    "address": {"name": "address"}
}
This facets configuration file configures the data loader to import the indicated columns into Squirro.
A few notes:
The three columns incorporation_date, inactivation_date and struck_off_date are treated as datetime labels. The date format is specified so that the values can be parsed from the format used in the file.
A few columns are simply copied from the CSV to Squirro without any modifications.
The two columns country_codes and countries are renamed to singular names. Most labels in Squirro use singular names, as that makes more sense when selecting a value in the label dropdown.
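To make the date format and the delimiter in these notes more concrete, the sketch below applies them to two sample cells in plain Python. It assumes that the data loader interprets `input_format_string` with the same directives as Python's `strptime`, which this tutorial does not state explicitly; the helper names and the sample values are hypothetical.

```python
from datetime import datetime

# Format string copied from facets.json:
# day of month, abbreviated month name, four-digit year.
INPUT_FORMAT = "%d-%b-%Y"


def parse_date(value):
    """Parse one date cell using the facets.json input format."""
    return datetime.strptime(value, INPUT_FORMAT)


def split_values(cell, delimiter=";"):
    """Split a multi-value cell into individual label values."""
    return [part for part in cell.split(delimiter) if part]


# Hypothetical sample cells, not taken from the data set.
print(parse_date("23-Mar-2006").date().isoformat())
print(split_values("CHE;GBR"))
```

Note that `%b` depends on the locale: English month abbreviations such as "Mar" parse as expected in the default locale, but may not in others.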
Now this labels configuration file can be used in the data loader command (adding one line at the end):
squirro_data_load -v ^
--cluster %CLUSTER% --token %TOKEN% --project-id %PROJECT_ID% ^
--source-name Entities ^
--bulk-index ^
--source-type csv --source-file Entities.csv ^
--map-id node_id ^
--map-title name ^
--facets-file facets.json
When executing this command, the Squirro project now looks a lot more promising:
Managing Labels#
In the labels management interface (in the Data tab, then labels) the labels can be customized.
Here you can add some flavor by setting display names and grouping.
The following initial screen shows the labels as they were created based on the facets.json file above.
This can be modified label by label to result in this screen:
The labels have now been grouped into some custom groups and the date labels are no longer shown in the typeahead.
Date labels take an additional configuration, which is the format. The following screenshot illustrates this for one example:
Using Labels#
Searching#
The labels are shown in the search with type-ahead functionality.
Dashboards#
Dashboards can be built using these labels, with two examples shown below:
Visualization of Incorporation Dates#
Managing Labels from Data Loader#
Using the user interface to manage the labels can quickly become cumbersome for larger data imports.
So, going forward, we’ll manage the display names and grouping in the facets.json file.
{
    "incorporation_date": {
        "display_name": "Incorporation Date",
        "group_name": "Dates",
        "data_type": "datetime",
        "typeahead": false,
        "input_format_string": "%d-%b-%Y"
    },
    "inactivation_date": {
        "display_name": "Inactivation Date",
        "group_name": "Dates",
        "data_type": "datetime",
        "typeahead": false,
        "input_format_string": "%d-%b-%Y"
    },
    "struck_off_date": {
        "display_name": "Struck-off Date",
        "group_name": "Dates",
        "data_type": "datetime",
        "typeahead": false,
        "input_format_string": "%d-%b-%Y"
    },
    "status": {
        "display_name": "Status",
        "group_name": "Company"
    },
    "service_provider": {
        "display_name": "Service Provider",
        "group_name": "Company"
    },
    "country_codes": {
        "name": "country_code",
        "display_name": "Country Code",
        "group_name": "Geography",
        "delimiter": ";"
    },
    "countries": {
        "name": "country",
        "display_name": "Country",
        "group_name": "Geography",
        "delimiter": ";"
    },
    "jurisdiction": {
        "display_name": "Jurisdiction Code",
        "group_name": "Geography"
    },
    "jurisdiction_description": {
        "display_name": "Jurisdiction",
        "group_name": "Geography"
    },
    "address": {
        "display_name": "Address",
        "group_name": "Geography"
    }
}
Note: Observant readers will notice that the display format of the date labels is not specified. This cannot be done through the data loader facets.json file.
Run the import with this facets.json file and all the labels will have the right display names, groups, etc.
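As the facets.json file grows, a missing or stray comma becomes an easy mistake to make. A quick way to catch such problems before running the import is to parse the file with Python's standard `json` module, as in the sketch below. The helper name `check_json` is hypothetical, and whether the data loader itself tolerates looser syntax is not covered here; the check only tests for strict JSON.

```python
import json


def check_json(text):
    """Return None if text is valid JSON, else a short error description."""
    try:
        json.loads(text)
    except json.JSONDecodeError as err:
        return f"line {err.lineno}, column {err.colno}: {err.msg}"
    return None
```

For example, `check_json(open("facets.json", encoding="utf-8").read())` returns `None` for a well-formed file and otherwise reports the position of the first syntax error.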
Searchable Labels#
When searching for “Switzerland”, the type-ahead will provide some useful output. But when you then ignore the type-ahead and execute the search, no results will come back.
That’s because none of the facets are searchable and the documents don’t contain much data in their titles and bodies.
To remedy this, we’ll change the address label to be searchable. There are two ways of doing this:
Use the admin interface and change the address field to searchable.
Use the facets.json file to declare a field as being searchable:
"address": {
    "display_name": "Address",
    "group_name": "Geography",
    "searchable": true
}
Independent of the way you choose, the search will now return the relevant results:
Numeric Labels#
The data set does not contain any numeric labels. So, now you can create one with a pipelet.
With it, you can calculate the survival time of all companies that have closed and store that as a label.
Importing#
This requires a number of steps:
Create a new label survival_time with the data type int. If this is not done, any int values will be rejected and the indexing will not work. You can do this in the user interface or with the following snippet for facets.json:
"survival_time": {
    "group_name": "Dates",
    "data_type": "int",
    "typeahead": false
}
Create a pipelet and store it in survival_time.py. The following example will get us going:
survival_time.py
from squirro.sdk import PipeletV1, require
from datetime import datetime

DATE_FORMAT = '%Y-%m-%dT%H:%M:%S'


@require('log')
class SurvivalTimePipelet(PipeletV1):

    def consume(self, item):
        kw = item.setdefault('keywords', {})

        started = kw.get('incorporation_date')
        ended = kw.get('inactivation_date') or kw.get('struck_off_date')
        if not started or not ended:
            # Ignore this item
            return item

        started = datetime.strptime(started[0], DATE_FORMAT)
        ended = datetime.strptime(ended[0], DATE_FORMAT)
        if ended <= started:
            # Protect against invalid ranges
            return item

        survival_time = int((ended - started).total_seconds() / 86400)
        kw['survival_time'] = [survival_time]
        return item
Create a pipelets.json configuration file to use this pipelet in the data loader:
pipelets.json
{
    "SurvivalTimePipelet": {
        "file_location": "survival_time.py",
        "stage": "after templating"
    }
}
Finally, re-execute the data load step with this command:
squirro_data_load -v ^
--cluster %CLUSTER% --token %TOKEN% --project-id %PROJECT_ID% ^
--source-name Entities ^
--bulk-index ^
--source-type csv --source-file Entities.csv ^
--map-id node_id ^
--map-title name ^
--map-created-at incorporation_date ^
--facets-file facets.json ^
--pipelets-file pipelets.json
Compared to the previous command, only the last line is required: it points the data loader to the pipelets.json file.
As a bonus, the --map-created-at flag has also been added, which results in better output on the search screen.
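Before running the full import, the pipelet's date arithmetic can be tried locally without the Squirro SDK. The sketch below mirrors the calculation in a plain function; the name `survival_days` and the sample item are hypothetical, and the date format is the one the pipelet expects.

```python
from datetime import datetime

DATE_FORMAT = "%Y-%m-%dT%H:%M:%S"  # same format the pipelet parses


def survival_days(keywords):
    """Mirror the pipelet's calculation.

    Returns None in the cases where the pipelet leaves the item unchanged.
    """
    started = keywords.get("incorporation_date")
    ended = keywords.get("inactivation_date") or keywords.get("struck_off_date")
    if not started or not ended:
        return None

    start = datetime.strptime(started[0], DATE_FORMAT)
    end = datetime.strptime(ended[0], DATE_FORMAT)
    if end <= start:
        return None

    # Whole days between the two dates, rounded down.
    return int((end - start).total_seconds() / 86400)


# Hypothetical sample item, not taken from the data set.
sample = {
    "incorporation_date": ["2000-01-01T00:00:00"],
    "inactivation_date": ["2001-01-01T00:00:00"],
}
print(survival_days(sample))
```

Items that lack an end date, or whose end date is not after the start date, yield `None` here, which corresponds to the pipelet returning the item without a survival_time label.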
Using#
The data can now be used in a number of ways.
For example, which companies closed after less than a year?
Or, what is the average survival time of companies by country?
This particular label is not very useful in the label drill-down. So after using it in the dashboard widget, you may want to hide it by unchecking the “Visible” checkbox.
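In Squirro these questions are answered with search and dashboard widgets. Purely to illustrate the underlying calculations, the following sketch runs them over a few in-memory items. The item structure (a title plus a `keywords` dictionary of lists), the function names and the sample values are all assumptions made for this example.

```python
from collections import defaultdict
from statistics import mean


def closed_within(items, max_days=365):
    """Titles of items whose survival_time is below max_days."""
    titles = []
    for item in items:
        values = item.get("keywords", {}).get("survival_time", [])
        if values and values[0] < max_days:
            titles.append(item["title"])
    return titles


def average_survival_by_country(items):
    """Average survival_time per country; items without the label are skipped."""
    by_country = defaultdict(list)
    for item in items:
        kw = item.get("keywords", {})
        values = kw.get("survival_time", [])
        if not values:
            continue
        # An item with several countries counts towards each of them.
        for country in kw.get("country", []):
            by_country[country].append(values[0])
    return {country: mean(values) for country, values in by_country.items()}


# Hypothetical sample items, not taken from the data set.
items = [
    {"title": "Company A", "keywords": {"country": ["Switzerland"], "survival_time": [200]}},
    {"title": "Company B", "keywords": {"country": ["Switzerland", "Panama"], "survival_time": [800]}},
    {"title": "Company C", "keywords": {"country": ["Panama"]}},
]
print(closed_within(items))
print(average_survival_by_country(items))
```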
Conclusion#
This concludes the labels tutorial. Based on the Panama Papers export we experimented with the various label types and how they can be used in the dashboard and search.