Data Loader Templates
Data loader templates accelerate the process of loading data into Squirro. By using pre-defined custom mappings while loading data, the loading process can be reduced to just a few clicks. When constructing a data loader plugin, the author can create a default template, as elaborated in this section.
Construct a Data Loader Plugin With a Template
A data loader plugin with a default template is defined by the dataloader_plugin.json file placed in the plugin directory:
{
"title": "Feed Testing",
"description": "Subscribe to an RSS or Atom feed.",
"plugin_file": "feed_plugin.py",
"scheduling_options_file": "scheduling_options.json",
"dataloader_options_file": "mappings.json",
"pipeline_workflow_file": "pipeline_workflow.json",
"category": "web",
"thumbnail_file": "feed.png",
"auth_file": "auth.py",
"override": "feed"
}
The following fields in the above JSON contain the default template information and are required for the plugin to be usable with a template.

The scheduling_options_file field of the dataloader_plugin.json requires a JSON file that defines the default scheduling parameters for the plugin. The scheduling_options.json shown below exemplifies such a file:
{
"schedule": true,
"first_run": "2020-08-31T11:30:00",
"repeat": "15m"
}
The pipeline_workflow_file field of the dataloader_plugin.json requires a JSON file that sets the steps of the default ingestion pipeline workflow. The pipeline_workflow.json shown below illustrates such a file:
{
"steps": [
{
"config": {
"policy": "replace"
},
"id": "deduplication",
"name": "Deduplication",
"type": "deduplication"
},
{
"id": "language-detection",
"name": "Language Detection",
"type": "language-detection"
},
{
"id": "cleanup",
"name": "Content Standardization",
"type": "cleanup"
},
{
"id": "index",
"name": "Indexing",
"type": "index"
}
]
}
The dataloader_options_file requires a JSON file that maps the various fields coming from the source to the corresponding Squirro item fields. mappings.json exemplifies such usage:
{
"map_id": "id",
"map_title": "title",
"map_created_at": "created_at",
"map_url": "link",
"map_body": "body",
"facets_file": "facets.json"
}
Note: The example above illustrates a situation where the legacy term facets persists in the code instead of labels.
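The mappings.json above references a facets_file. As an illustrative sketch (the field name "category" and the chosen facet properties are assumptions, not taken from the feed plugin), such a facets.json could look like:

```json
{
    "category": {
        "name": "Category",
        "default_value": "News"
    }
}
```

Here the top-level key is the source field, and the nested object configures the resulting facet (its display name and a default value when the source field is empty).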
Once the fields described above are set, the data loader plugin with its template can be created and uploaded using squirro_asset. An example of the creation of the feed_plugin can be found in the squirro/dataloader-plugins repository.
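As a sketch, the upload with squirro_asset could look like the following; the exact subcommand and flags may vary between Squirro versions, so consult squirro_asset --help on your installation:

```shell
# Hypothetical invocation: the cluster URL, token variable, and plugin
# folder name are placeholders, not values from this document.
squirro_asset dataloader_plugin upload \
    --cluster https://your-squirro-cluster.example.com \
    --token "$SQUIRRO_TOKEN" \
    feed_plugin/
```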
Validate Default Mappings
To validate whether a plugin can be used with a template, the has_default_mappings flag is used. has_default_mappings is True if the data loader plugin has all three of the following fields set:
required_keys = [
"dataloader_options_file",
"scheduling_options_file",
"pipeline_workflow_file",
]
NOTE: If the above required_keys are not set at upload time, the data loader plugin cannot be used as a template.
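The validation can be sketched in Python as follows; the function name and the idea of checking the parsed dataloader_plugin.json dictionary directly are assumptions for illustration, not Squirro's internal implementation:

```python
REQUIRED_KEYS = [
    "dataloader_options_file",
    "scheduling_options_file",
    "pipeline_workflow_file",
]


def has_default_mappings(plugin_config: dict) -> bool:
    """Return True if the plugin config sets all three template-related files."""
    return all(plugin_config.get(key) for key in REQUIRED_KEYS)


# The feed plugin configuration shown above qualifies as a template.
config = {
    "title": "Feed Testing",
    "plugin_file": "feed_plugin.py",
    "scheduling_options_file": "scheduling_options.json",
    "dataloader_options_file": "mappings.json",
    "pipeline_workflow_file": "pipeline_workflow.json",
}
print(has_default_mappings(config))  # True
print(has_default_mappings({"title": "No template"}))  # False
```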
Squirro Client Usage for Default Fields
To leverage the default template settings with the Squirro client, a config parameter is passed to the client methods. The config can be defined as below:
config = {
"dataloader_options": {"plugin_name": "feed_plugin"},
"dataloader_plugin_options": {
"feed_sources": ["https://www.nzz.ch/recent.rss"],
"query_timeout": 30,
"max_backoff": 24,
"custom_date_field": "",
"custom_date_format": "",
"rss_username": "",
"rss_password": "",
}
}
After the successful upload of the data loader plugin, a new source can be created by the client using:
new_source = client.new_source(
    project_id=project_id,
    name=name,
    config=config,
    pipeline_workflow_id=None,
    scheduling_options={'schedule': True, 'repeat': '30m'},
    # Pass True here to fall back to the plugin's default template options.
    use_default_options=use_default_dataloader_options,
)
To fetch a source, including its config:
get_source = client.get_source(
project_id=project_id,
source_id=new_source["id"],
include_config=True)