Data Loader Templates#
Data loader templates allow you to accelerate the process of loading data into Squirro. By making use of pre-defined custom mappings while loading data, the loading process can be reduced to just a few clicks. When constructing a data loader plugin, the author can create a default template, as elaborated in this section.
Construct a Data Loader Plugin With a Template#
A data loader plugin with a default template can be constructed with the help of the dataloader_plugin.json file placed in the plugin directory.
{
"title": "Feed Testing",
"description": "Subscribe to an RSS or Atom feed.",
"plugin_file": "feed_plugin.py",
"scheduling_options_file": "scheduling_options.json",
"dataloader_options_file": "mappings.json",
"pipeline_workflow_file": "pipeline_workflow.json",
"category": "web",
"thumbnail_file": "feed.png",
"auth_file": "auth.py",
"override": "feed"
}
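Since every *_file entry in dataloader_plugin.json points to a file shipped alongside the plugin, a quick sanity check can catch missing files before upload. The sketch below is illustrative only and assumes it is run from the plugin directory.
import json
import os

# Minimal sketch: verify that every file referenced by dataloader_plugin.json
# actually exists in the plugin directory (run from that directory).
with open("dataloader_plugin.json") as f:
    plugin_config = json.load(f)

for key, filename in plugin_config.items():
    if key.endswith("_file") and not os.path.exists(filename):
        print(f"Missing file referenced by {key}: {filename}")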
The following fields of dataloader_plugin.json contain the default template information and are required if the plugin is to be used with a template.
The scheduling_options_file field of dataloader_plugin.json requires a JSON file that defines the default scheduling parameters for the plugin. The scheduling_options.json shown below is an example of such a file.
{
"schedule": true,
"first_run": "2020-08-31T11:30:00",
"repeat": "15m"
}
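For illustration, the values can be read as follows: schedule enables periodic runs, first_run is an ISO timestamp for the initial run, and repeat uses a number-plus-unit shorthand. The interpretation happens server-side in Squirro; the sketch below only demonstrates the semantics, and the unit mapping (m for minutes, h for hours, d for days) is an assumption.
from datetime import datetime, timedelta

# Illustrative only: Squirro's scheduler interprets these values server-side.
options = {
    "schedule": True,
    "first_run": "2020-08-31T11:30:00",
    "repeat": "15m",
}

# Assumed shorthand: m = minutes, h = hours, d = days
units = {"m": "minutes", "h": "hours", "d": "days"}
value, unit = int(options["repeat"][:-1]), options["repeat"][-1]
interval = timedelta(**{units[unit]: value})

first_run = datetime.fromisoformat(options["first_run"])
print("First run: ", first_run)
print("Second run:", first_run + interval)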
The pipeline_workflow_file field of dataloader_plugin.json requires a JSON file that sets the steps of the default ingestion pipeline workflow. The pipeline_workflow.json shown below depicts the usage of this field.
{
"steps": [
{
"config": {
"policy": "replace"
},
"id": "deduplication",
"name": "Deduplication",
"type": "deduplication"
},
{
"id": "language-detection",
"name": "Language Detection",
"type": "language-detection"
},
{
"id": "cleanup",
"name": "Content Standardization",
"type": "cleanup"
},
{
"id": "index",
"name": "Indexing",
"type": "index"
}
]
}
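Items pass through the workflow steps in the order of the steps list. As a quick illustration, the default steps can be listed programmatically (a sketch that assumes the file above is on disk):
import json

# Sketch: print the default workflow steps in order
with open("pipeline_workflow.json") as f:
    workflow = json.load(f)

for step in workflow["steps"]:
    print(f"{step['name']} ({step['type']})")
# Deduplication (deduplication)
# Language Detection (language-detection)
# Content Standardization (cleanup)
# Indexing (index)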
The dataloader_options_file field requires a JSON file that maps the various fields coming from the source to the corresponding Squirro item fields. The mappings.json shown below exemplifies such usage.
{
"map_id": "id",
"map_title": "title",
"map_created_at": "created_at",
"map_url": "link",
"map_body": "body",
"facets_file": "facets.json"
}
Note: The code above is an example of a situation where the legacy term facets persists in the code instead of labels.
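To make the mapping semantics concrete, the sketch below shows how the map_* options relate a hypothetical source record to Squirro item fields; the record and its values are invented for illustration, and the actual mapping is performed by the data loader itself.
# Hypothetical illustration of the map_* semantics; names are invented.
mappings = {
    "map_id": "id",
    "map_title": "title",
    "map_created_at": "created_at",
    "map_url": "link",
    "map_body": "body",
}

source_record = {
    "id": "item-1",
    "title": "Example feed entry",
    "created_at": "2020-08-31T11:30:00",
    "link": "https://example.com/entry",
    "body": "Entry text ...",
}

# Each map_<item_field> option names the source field to read from
item = {
    option[len("map_"):]: source_record[source_field]
    for option, source_field in mappings.items()
}
print(item["url"])  # https://example.com/entry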
Once the fields described above are set, the data loader plugin with its template can be created. It can be uploaded using squirro_asset.
An example of the creation of the feed_plugin can be found here: squirro/dataloader-plugins
Validate Default Mappings#
To validate whether a plugin can be used with a template, has_default_mappings is used. has_default_mappings is True if the data loader plugin has all three of the fields listed below set.
required_keys = [
"dataloader_options_file",
"scheduling_options_file",
"pipeline_workflow_file",
]
NOTE: If the above required_keys are not set at upload time, the data loader plugin cannot be used as a template.
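A minimal sketch of this check, assuming the dataloader_plugin.json shown earlier is in the current directory:
import json

# required_keys as listed above
required_keys = [
    "dataloader_options_file",
    "scheduling_options_file",
    "pipeline_workflow_file",
]

with open("dataloader_plugin.json") as f:
    plugin_config = json.load(f)

# The plugin qualifies as a template only if all three keys are present
has_default_mappings = all(key in plugin_config for key in required_keys)
print(has_default_mappings)  # True for the example above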
Squirro Client Usage for Default Fields#
To leverage the default template settings with the Squirro client, a config parameter is passed to the client methods. The config can be defined as shown below.
config = {
"dataloader_options": {"plugin_name": "feed_plugin"},
"dataloader_plugin_options": {
"feed_sources": ["https://www.nzz.ch/recent.rss"],
"query_timeout": 30,
"max_backoff": 24,
"custom_date_field": "",
"custom_date_format": "",
"rss_username": "",
"rss_password": "",
}
}
After the successful upload of the data loader plugin, a new source can be created by the client using:
new_source = client.new_source(
    project_id=project_id,
    name=name,
    config=config,
    pipeline_workflow_id=None,
    scheduling_options={'schedule': True, 'repeat': '30m'},
    # Set to True to apply the plugin's default template options
    use_default_options=use_default_dataloader_options,
)
To fetch a source, including its config:
get_source = client.get_source(
project_id=project_id,
source_id=new_source["id"],
include_config=True)
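With include_config=True, the returned source contains the stored configuration. For example, the plugin options passed at creation time could be inspected as follows (a sketch, assuming the response is a dictionary keyed as shown):
# Sketch: inspect the config returned with the source
print(get_source["config"]["dataloader_plugin_options"]["feed_sources"])
# ['https://www.nzz.ch/recent.rss']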