Data Loader Reference#

This page explains available options for the Data Loader and its usage.

Basic Usage#

The purpose of the Data Loader tool is to extract data from different sources, including the following: supported databases, CSV and Excel files (xls and xlsx) and then upload them into Squirro.

The Data Loader is called from the command line with multiple arguments, some of which are mandatory.

Arguments#

The following table lists all the arguments:

CSV Source Options#

When using a CSV file as a data source, only full loading is supported.

Note: Data must have a header to determine the schema.

The command line parameters used for a CSV data source include the following:

Argument

Mandatory

Description

--csv-delimiter CHARACTER

A one-character string used to separate fields. It defaults to ,.

--csv-quotechar CHARACTER

A one-character string used to quote fields containing special characters, such as the delimiter or quotechar, or which contain new-line characters. It defaults to ".

--csv-encoding STRING

A string specifying file encoding to be used, e.g. utf8. If not provided, the loader will try to best guess the encoding.

--source-file PATH

Yes

Path of csv data file.

The following example shows a simple load from a CSV file, mapping the title, id and body of the Squirro item to columns from the sample.csv file, without using any of the additional files for facets, templating, pipelets etc.

All the rows will be loaded and the delimiter between fields is considered any , (comma) found in a row. To quote fields containing special characters the double quote character " must be used.

squirro_data_load -v \
    --token $token \
    --cluster $cluster \
    --project-id $project_id \
    --source-name csv_sample \
    --source-file sample.csv \
    --source-type csv \
    --csv-delimiter , \
    --csv-quotechar " \
    --map-title Title \
    --map-id ID \
    --map-body Description

Note: The lines have been wrapped with the backslash () at the end of each line. On a bash/windows setup, you will need to use circumflex (^) instead.

This example assumes that $token, $cluster, and $project_id have been declared beforehand.

Excel Source Options#

When using an excel file as the data source, only full loading is supported and data must have a header to determine the schema.

If the first row of the data (after applying the boundaries, if needed) is not the header, a KeyError exception will be raised and the job will stop.

In this case, it’s not possible to determine the schema of the data.

The command line parameters used for an excel data source:

Argument

Mandatory

Description

--excel-sheet STRING

Excel sheet name. Default: get first sheet.

--excel-boundaries NUMBER: NUMBER

Limit rows loaded from excel. Format is: start_row:rows_discarded_from_end.

--source-file PATH

Yes

Path of excel data file.

The example below shows a simple load from an excel file, mapping only the title, id, and body of the Squirro item to columns from the sample.xlsx excel file, without using any of the additional files for facets, templating, pipelets etc.

The Data Loader tool will only load the Products sheet of the file and from this sheet the rows starting at 1 up to the last 100 rows, which will not be loaded.

squirro_data_load -v \
    --token $token \
    --cluster $cluster \
    --project-id $project_id \
    --source-name excel_sample \
    --source-file sample.xlsx \
    --source-type excel \
    --excel-sheet Products \
    --excel-boundaries 1:100 \
    --map-title Title \
    --map-id ID \
    --map-body Description

Note: The lines have been wrapped with the backslash () at the end of each line. On a bash/windows setup, you will need to use circumflex (^) instead.

This example assumes that $token, $cluster, and $project_id have been declared beforehand.

JSON Source Options#

When using a JSON file as the data source, only full load is supported.

Schema of the data is determined using the first item found, therefore assuming that all items have the same structure. If that’s not the case, the loader might fail.

The command line parameters used for a JSON data source include the following:

Argument

Mandatory

Description

--item-schema STRING

No

In case the JSON objects are not available as a top-level structure, use this parameter to un-nest the JSON structure.

--source-file PATH

No, one of --source-file or --source-folder must be provided.

Path of JSON data file.

--source-folder PATH

No, one of --source-file or --source-folder must be provided.

Path of directory containing multiple JSON files. Only available in CLI mode.

The example below shows a load from a nested JSON file:

squirro_data_load -vv \
    --cluster "$CLUSTER" \
    --token "$TOKEN" \
    --project-id "$PROJECT_ID" \
    --source-type "json" \
    --map-title "data.headline" \
    --map-body "data.body" \
    --map-id "data.id" \
    --map-created-at "data.versionCreated" \
    --source-name "JSON WIKI TEST" \
    --item-schema "Items" \
    --facets-file "facets.json" \
    --source-file "$SOURCE_FILE"

Note: The lines have been wrapped with the backslash () at the end of each line. On a bash/windows setup, you will need to use circumflex (^) instead.

This example assumes that $token, $cluster, and $project_id have been declared beforehand.

Database Options#

When loading from a database, both full and incremental load are supported, using a select query supplied as a string or in a file. The script uses uses SQLAlchemy to connect to any database.

Tested databases:

  • Postgres and all databases using the postgres driver for connection (Greenplum, Redshift etc)

  • Microsoft SQL

  • Oracle

  • MySQL

  • SQLite

The command line parameters used for a database source:

Argument

Mandatory

Description

--db-connection STRING

Yes

Database connection string.

--input-file PATH

File containing the SQL code.

--input-query STRING

SQL query.

Note that the --input-file and --input-query arguments are mutually exclusive.

The following example shows a simple load from the database mapping the title, id, and body of the Squirro item to columns from a database table that is interrogated in the sample.sql file.

The Data Loader tool makes a full load of all the rows specified in the sample.sql file since the argument --incremental-column is not set.

squirro_data_load -v \
    --token $token \
    --cluster $cluster \
    --project-id $project_id \
    --db-connection $db_connection_string \
    --source-name db_sample \
    --input-file $script_dir/interaction.sql \
    --source-type database \
    --map-title Title \
    --map-id ID \
    --map-body Description

Note: The lines have been wrapped with the backslash () at the end of each line. On a bash/windows setup, you will need to use circumflex (^) instead.

This example assumes that $token, $cluster, and $project_id have been declared beforehand.

Filesystem Options#

The command line parameters used for a filesystem source:

Option

Mandatory

Description

--folder PATH

No, one of the –folder or –file-path must be provided

Filesystem location that will be loaded in Squirro.

--file-path

No, one of the –folder or –file-path must be provided

Absolute path to a file to be loaded in Squirro. If the file is a zip archive, it will be extracted, and the contents will be loaded.

This option replaces the deprecated –zip-file-path.

--deletions

No

If set, then any files that are no longer present on the file system are also removed from Squirro. To use this, the --map-flag option also needs to be used to ensure new/updated and deleted files are handled correctly:

--map-flag flag

--include-file PATH

No

Path to a file containing inclusion rules.

This is a list of patterns that files need to be match to be indexed. If provided, then only files that match at least one pattern are indexed.

--exclude-file PATH

No

Path to a file containing exclusion rules.

This is a list of patterns for files that should not be indexed. Any file that matches at least one such pattern is not indexed, independent of whether it also matches the include rules.

--skip-errors

No

Ignore any file system errors when processing individual files. This way a single file system read error does not prevent the entire load from succeeding. If the error is temporary, then the file will be picked up in the next load.

Performance Optimisations

--convert-file PATH

No

Path to a file containing conversion file patterns.

Files that match any of these rules will be indexed with full content. See Content Extraction for the file types that Squirro supports full indexing for. By limiting to a smaller number of extensions, this allows the file system loader to only process content in Squirro for which indexing will be effective.

--file-size-limit

No

Maximum size in megabytes of files that should be indexed with content. Also see --index-all below.

--index-all

No

If set, then files over the --file-size-limit are indexed, but without their content. In the default case of this not being set, those files are skipped entirely.

--batch-size-limit

No

Maximum size of requests sent to Squirro’s API for indexing of files.

--deduplicate

No

Deduplicate files based on file content. Exact duplicates are only ever indexed ones, with duplicates ignored.

Logging and Debugging

--log-excludes

No

Log matches for inclusion/exclusion rules.

--progress

No

Log progress verbosely.

File system loading is implemented as a data loader plugin and invoked with the usual data loader.

squirro_data_load -v \
    --token $TOKEN \
    --cluster $CLUSTER \
    --project-id $PROJECT_ID \
    --source-type filesystem \
    --folder FOLDER_TO_INDEX \
    --map-title "title" \
    --map-file-name "file_name" \
    --map-file-mime "file_mime" \
    --map-file-data "file_data" \
    --map-id "id" \
    --map-url "link" \
    --map-created-at "created_at" \
    --facets-file facets.json

Note: The lines have been wrapped with the backslash () at the end of each line. On a bash/windows setup, you will need to use circumflex (^) instead.

This example assumes that $token, $cluster and $project_id have been declared beforehand.

Squirro Options#

When Loading data from any squirro source following command line parameters can be used:

Argument

Mandatory

Description

--source-cluster

Yes

Source Squirro Cluster URL

--source-token

Yes

Source Squirro Token

--source-project-id

Yes

Source Squirro Project ID

--source-query

No

Squirro query

--include-facets

No

If set, then keywords are included

--facet-delimiter

No

Character or string used to delimit facets with multiple values

--include-entities

No

If set, then entities are included

--include-webshot

No

If set, then webshot are included only when one of –map-webshot-url or –map-webshot-picture-hint is set

--progress

No

If set, detailed per row progress information is logged

--deduplicate

No

If set, items will be deduplicated based on titles

--retry

No

Number of retries to make in case of an error

A simple example for loading a data from a Squirro source is given below:

squirro_data_load -v \
    --token $TOKEN \
    --cluster $CLUSTER \
    --project-id $PROJECT_ID \
    --source-cluster $SOURCE_CLUSTER \
    --source-token $SOURCE_TOKEN \
    --source-project-id $SOURCE_PROJECT_ID \
    --source-type squirro \
    --source-query "*" \
    --include-facets \
    --include-entities \
    --map-title "title" \
    --map-id "id" \
    --map-url "link" \
    --map-created-at "created_at" \
    --progress \
    --deduplicate \
    --retry 5

Note: The lines have been wrapped with the backslash () at the end of each line. On a bash/windows setup, you will need to use circumflex (^) instead.

This example assumes that $token, $cluster, and $project_id have been declared beforehand.

Feed Options#

When Loading data from any feed source following command line parameters can be used:

Argument

Mandatory

Description

--feed-sources

Yes

Space-separated list of URLs (strings).

--query-timeout

No

Timeout (in seconds) for fetching the feed

--max-backoff

No

Maximum number of hours to wait if the feed update frequency is low

--custom-date-field

No

For non-standard rss datetime fields, enter the field

--custom-date-format

No

For non-standard rss datetime formats, enter the format i.e. %m/%d/%Y.

--rss-username

No

Username for RSS Basic Authentication

--rss-password

No

Password for RSS Basic Authentication

A simple example for loading data from feed source is given by:

squirro_data_load -v \
    --token $TOKEN \
    --cluster $CLUSTER \
    --project-id $PROJECT_ID \
    --source-type feed \
    --source-name feed_sample \
    --feed-sources 'https://www.theregister.co.uk/headlines.atom' 'http://rss.nytimes.com/services/xml/rss/nyt/HomePage.xml' \
    --map-title "title" \
    --map-id "id" \
    --map-body "description" \
    --map-created-at "created_at" \
    --batch-size 100 \
    --source-batch-size 100 \
    --priority-level low \
    --facets-file facets.json

Note: The lines have been wrapped with the backslash () at the end of each line. On a bash/windows setup, you will need to use circumflex (^) instead.

This example assumes that $token, $cluster, and $project_id have been declared beforehand.

User-Defined Sources#

If data needs to be extracted from other sources than the ones described above there is the option to write a custom source.

To do this a new Python module must be created and has to implement the abstract base class.

Reference: To learn more, see Data Loader Plugin Reference.

In this way the Data Loader can index data from other sources without modifications.

It is very simple to use the custom source, just supply the full path of the Python module by using the command line argument –source-script instead of using –source-type.

Tip: Adding the custom module to the PYTHONPATH and importing it will be done automatically by the loader.