Data Loader Templates#

Data loader templates allow you to accelerate the process of loading data into Squirro. By making use of pre-defined custom mappings, the loading process can be reduced to just a few clicks. When constructing a data loader plugin, the author can create a default template, as elaborated in this section.

Construct a Data Loader Plugin With a Template#

A data loader plugin with a default template can be constructed with the help of the dataloader_plugin.json file placed in the plugin directory.

{
    "title": "Feed Testing",
    "description": "Subscribe to an RSS or Atom feed.",
    "plugin_file": "feed_plugin.py",
    "scheduling_options_file": "scheduling_options.json",
    "dataloader_options_file": "mappings.json",
    "pipeline_workflow_file": "pipeline_workflow.json",
    "category": "web",
    "thumbnail_file": "feed.png",
    "auth_file": "auth.py",
    "override": "feed"
}

The following fields in the above JSON contain the default template information and are crucial if the plugin is to be used with a template:

  1. The scheduling_options_file field of the dataloader_plugin.json requires a JSON file which defines the default scheduling parameters for the plugin. The scheduling_options.json shown below exemplifies such a file.

{
    "schedule": true,
    "first_run": "2020-08-31T11:30:00",
    "repeat": "15m"
}
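Here, schedule enables periodic loading, first_run sets the time of the first execution (ISO 8601), and repeat defines the interval between runs, for example 15m for every 15 minutes.
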
  2. The pipeline_workflow_file field of the dataloader_plugin.json requires a JSON file which defines the steps of the default ingestion pipeline workflow. The pipeline_workflow.json shown below depicts the usage of such a file.

{
    "steps": [
        {
            "config": {
                "policy": "replace"
            },
            "id": "deduplication",
            "name": "Deduplication",
            "type": "deduplication"
        },
        {
            "id": "language-detection",
            "name": "Language Detection",
            "type": "language-detection"
        },
        {
            "id": "cleanup",
            "name": "Content Standardization",
            "type": "cleanup"
        },
        {
            "id": "index",
            "name": "Indexing",
            "type": "index"
        }
    ]
}
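The steps are executed in the order in which they are listed, with each entry referencing a pipeline step by its type and id.
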
  3. The dataloader_options_file field requires a JSON file which maps the various fields coming from the source to the corresponding Squirro item fields. The mappings.json below exemplifies such usage.

{
    "map_id": "id",
    "map_title": "title",
    "map_created_at": "created_at",
    "map_url": "link",
    "map_body": "body",
    "facets_file": "facets.json"
}

Note: The code above is an example of a situation where the legacy term facets persists in the code instead of labels.
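For illustration, a minimal facets.json as referenced by the mapping above could look as follows (the source field category and its options are hypothetical):

{
    "category": {
        "name": "Category",
        "searchable": true
    }
}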

Once the above-described fields are set, the data loader plugin with its template can be created and uploaded using squirro_asset.

An example of the creation of the feed_plugin can be found here: squirro/dataloader-plugins

Validate Default Mappings#

To validate whether a plugin can be used with a template, has_default_mappings is used. has_default_mappings is True if the data loader plugin has the three fields specified above set:

required_keys = [
    "dataloader_options_file",
    "scheduling_options_file",
    "pipeline_workflow_file",
]

NOTE: If the above required_keys are not set at the time of upload, the data loader plugin cannot be used as a template.
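As a minimal sketch, this check could be reproduced stand-alone as follows (hypothetical code, not the actual server implementation):

import json

# Load the plugin configuration from the plugin directory
with open("dataloader_plugin.json") as f:
    plugin_config = json.load(f)

required_keys = [
    "dataloader_options_file",
    "scheduling_options_file",
    "pipeline_workflow_file",
]

# The plugin can serve as a template only if all three files are configured
has_default_mappings = all(key in plugin_config for key in required_keys)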

Squirro Client Usage for Default Fields#

To leverage a template's default settings with the Squirro client, a config parameter is passed to the client methods. The config can be defined as shown below.

config = {
    "dataloader_options": {"plugin_name": "feed_plugin"},
    "dataloader_plugin_options": {
        "feed_sources": ["https://www.nzz.ch/recent.rss"],
        "query_timeout": 30,
        "max_backoff": 24,
        "custom_date_field": "",
        "custom_date_format": "",
        "rss_username": "",
        "rss_password": "",
    }
}
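In this config, dataloader_options identifies the plugin to use by name, while dataloader_plugin_options carries the plugin-specific parameters, in this case those of the feed plugin.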

After the successful upload of the data loader plugin, a new source can be created by the client using:

new_source = client.new_source(
    project_id=project_id,
    name=name,
    config=config,
    pipeline_workflow_id=None,
    scheduling_options={'schedule': True, 'repeat': '30m'},
    # Apply the defaults stored with the plugin's template
    use_default_options=True
)
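With use_default_options set to True, the defaults stored with the template (default mappings, scheduling options, and pipeline workflow) are applied when creating the source.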

To fetch a source, including its config:

get_source = client.get_source(
    project_id=project_id,
    source_id=new_source["id"],
    include_config=True
)
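The returned source then includes the stored configuration, which can be inspected, for example (assuming the config structure shown above):

# Access the stored plugin options from the fetched source
plugin_options = get_source["config"]["dataloader_plugin_options"]
print(plugin_options["feed_sources"])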