How To Write a Custom Data Loader Plugin

This page describes in detail how to build a custom data loader plugin for data formats or inputs that are not supported out of the box.

Prerequisites#

Follow the steps outlined in the Data Loader Command Line Interface Tool Tutorial.

Installing the Squirro Toolbox package first is highly encouraged.

Creating a Python virtual environment to isolate the installed packages is also recommended.

Introduction#

Create a new Python file for each new data loader plugin.

The Data Loader Plugin Boilerplate template can be used to get started.

SDK reference#

The plugin is implemented as a subclass of the DataSource class. It must implement a number of methods to provide the intended functionality. These methods are all documented in the DataSource class reference.
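The general shape of such a plugin can be sketched as follows. This is a minimal illustration, not the SDK's actual interface: the local `DataSource` class is a stand-in so the example runs without the Squirro Toolbox installed, and the method names (`connect`, `disconnect`, `getDataBatch`, `getSchema`) and their simplified signatures are assumptions that should be checked against the DataSource class reference. `ListSource` is a hypothetical plugin.

```python
from abc import ABC, abstractmethod


class DataSource(ABC):
    """Local stand-in for the SDK base class (assumption, not the real class)."""

    @abstractmethod
    def connect(self):
        """Open the connection to the source."""

    @abstractmethod
    def disconnect(self):
        """Release the connection to the source."""

    @abstractmethod
    def getDataBatch(self, batch_size):
        """Yield lists of at most batch_size rows."""

    @abstractmethod
    def getSchema(self):
        """Return the field names the source can provide."""


class ListSource(DataSource):
    """Hypothetical plugin that serves rows from an in-memory list."""

    def __init__(self, rows):
        self._rows = list(rows)
        self.connected = False

    def connect(self):
        self.connected = True

    def disconnect(self):
        self.connected = False

    def getDataBatch(self, batch_size):
        # Slice the rows into consecutive batches of at most batch_size items.
        for start in range(0, len(self._rows), batch_size):
            yield self._rows[start:start + batch_size]

    def getSchema(self):
        # Collect every field name that appears in any row.
        return sorted({key for row in self._rows for key in row})


source = ListSource([
    {"id": 1, "title": "a"},
    {"id": 2, "title": "b"},
    {"id": 3, "title": "c"},
])
source.connect()
batches = list(source.getDataBatch(2))
schema = source.getSchema()
source.disconnect()
```

In a real plugin the base class would be imported from the SDK rather than defined locally, and the Data Loader Plugin Boilerplate template is the better starting point for the exact method set.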

Frontend-compatible loaders#

Uploading#

To provide a data loader plugin to the user in the user interface, it needs to be uploaded to the server.

This is done using the squirro_asset command line tool.

The following command uploads the data loader plugin contained in the pubmed folder:

squirro_asset dataloader_plugin upload --folder pubmed --token $TOKEN --cluster $CLUSTER

Preview#

Apart from technical implementation differences between command line and frontend data loading, which are not visible to users, the main consideration when writing a UI-compatible loader is the preview mode.

See Data Loader Plugin Preview for details.

Preview mode is a UI feature that lets the user peek at the data before it is ingested into the system. It shows the first 10 items. For most use cases this presents no difficulties, but a few cases might result in data loss.
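The page does not list the problematic cases, so the following sketch shows only one plausible scenario, stated here as an assumption: a source that removes items as it reads them. `QueueSource`, `preview`, and `PREVIEW_LIMIT` are hypothetical names chosen for illustration and are not part of the SDK.

```python
from itertools import islice

PREVIEW_LIMIT = 10  # the preview shows the first 10 items


class QueueSource:
    """Hypothetical source that removes each item once it has been read."""

    def __init__(self, items):
        self._pending = list(items)

    def read(self):
        # Reading is destructive: a returned item is gone from the source.
        while self._pending:
            yield self._pending.pop(0)


def preview(source, limit=PREVIEW_LIMIT):
    # Take only the first `limit` items, as a preview would.
    return list(islice(source.read(), limit))


source = QueueSource(range(25))
previewed = preview(source)
loaded = list(source.read())  # the actual load, run after the preview
```

Here the 10 previewed items never reach the actual load, which receives only the remaining 15. A loader with this kind of side effect would need to avoid consuming or acknowledging items while a preview is running.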

Caching & Data Storage#

Data loader plugins often need to cache information or store progress information. For these purposes, two types of stores are available inside a data loader plugin:

key_value_cache
key_value_store

Both are covered in API for Caching and Custom State Management.
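The two purposes named above can be sketched as follows. This is an illustration under stated assumptions: plain dictionaries stand in for `key_value_store` and `key_value_cache`, whose real interface is described in the API reference and may differ, and `CheckpointedSource` with its `load` method is hypothetical.

```python
class CheckpointedSource:
    """Hypothetical loader that remembers the last item it delivered."""

    def __init__(self, rows, key_value_store, key_value_cache):
        self._rows = rows
        self.key_value_store = key_value_store  # progress information
        self.key_value_cache = key_value_cache  # reusable lookup results
        self.lookups = 0

    def _author_name(self, author_id):
        # Resolve each author only once; later rows reuse the cached value.
        if author_id not in self.key_value_cache:
            self.lookups += 1
            self.key_value_cache[author_id] = f"author-{author_id}"
        return self.key_value_cache[author_id]

    def load(self):
        last_id = self.key_value_store.get("last_id", 0)
        items = []
        for row in self._rows:
            if row["id"] <= last_id:
                continue  # already delivered by an earlier run
            items.append({
                "id": row["id"],
                "author": self._author_name(row["author_id"]),
            })
            # Record progress so the next run can resume after this row.
            self.key_value_store["last_id"] = row["id"]
        return items


rows = [
    {"id": 1, "author_id": 7},
    {"id": 2, "author_id": 7},
    {"id": 3, "author_id": 8},
]
store, cache = {}, {}  # dict stand-ins for the two stores (assumption)
first = CheckpointedSource(rows, store, cache)
first_run = first.load()
second_run = CheckpointedSource(rows, store, cache).load()
```

The first run delivers all three rows with only two author lookups, and the second run delivers nothing because the stored `last_id` marks every row as already processed.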