ItemUploader

class squirro_client.item_uploader.ItemUploader(token=None, project_id=None, project_title=None, object_id=None, source_name=None, source_ext_id=None, cluster=None, client_cls=None, batch_size=None, config_file=None, config_section=None, processing_config=None, steps_config=None, source_id=None, source_secret=None, pipeline_workflow_name=None, pipeline_workflow_id=None, timeout_secs=None, dataloader_options=None, dataloader_plugin_options=None, non_retry_list=[200, 202, 400, 401, 403, 404], **kwargs)

Item uploader class. Defaults are loaded from the .squirrorc file in the current user’s home directory.

Parameters
  • token – User refresh token.

  • project_id – Identifier of the project, optional but one of project_id or project_title has to be passed in.

  • project_title – Title of the project. This will use the first project found with the given title. If two projects with the same title exist, the project that is used is not predictable.

  • object_id – This parameter is deprecated, and is no longer needed.

  • source_name – Name of the source to be used. If no source with this name exists, a new source with this name is created. If more than one source with this name exists, processing is aborted and can only be resumed by specifying the source_id of the desired source to load into.

  • source_ext_id – External identifier of the source; if not provided, defaults to source_name.

  • cluster – Cluster to connect to. This only needs to be changed for on-premise installations.

  • batch_size – Number of items to send in one request. Keep this below 100, depending on your setup. If set to -1, the optimal batch size is calculated from the items. Defaults to -1.

  • config_file – Configuration file to use; defaults to ~/.squirrorc.

  • config_section – Section of the .ini file to use, defaults to squirro.

  • source_id – Source to use. If a source with this identifier exists, it is reused and no new source is created.

  • source_secret – This option is deprecated now and is ignored.

  • pipeline_workflow_name – Pipeline workflow name. Either the name or the ID needs to be set.

  • pipeline_workflow_id – Pipeline workflow ID.

  • dataloader_options – Dataloader options to store when creating a new data source.

  • dataloader_plugin_options – Dataloader plugin options to store when creating a new data source.

  • non_retry_list

    List of status codes for which we don’t want retry/backoff logic. Defaults to [200, 202, 400, 401, 403, 404].

    200, 202

    Success codes; no retry is needed.

    401, 403

    These already have a retry block in the _perform_request method of the Squirro client.

    400, 404

    Retrying does not help for these codes, as it won’t fix the underlying issue.
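The retry behaviour controlled by these parameters is a standard exponential backoff. A minimal, generic sketch (not the actual ItemUploader implementation) of how num_retries, delay, backoff, and non_retry_list interact:

```python
import time

def with_backoff(request, num_retries=10, delay=1, backoff=2,
                 non_retry_list=(200, 202, 400, 401, 403, 404)):
    """Call `request` (a callable returning an HTTP status code),
    retrying with exponentially growing delays, and return as soon
    as a status code from non_retry_list is seen."""
    wait = delay
    status = None
    for _ in range(num_retries):
        status = request()
        if status in non_retry_list:
            return status
        time.sleep(wait)  # back off before the next attempt
        wait *= backoff   # e.g. 1s, 2s, 4s, ... with backoff=2
    return status
```

With the defaults (delay=1, backoff=2), a persistently unavailable service is retried after 1, 2, 4, 8, … seconds, up to num_retries attempts.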

Typical usage:

>>> from squirro_client import ItemUploader
>>> uploader = ItemUploader(project_title='My Project',
...                         token='<your token>')
>>> items = [{'id': 'squirro-item1',
...           'title': 'Items arrived in Squirro!'}]
>>> uploader.upload(items)
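The batch_size parameter above controls how many items go into each request; conceptually, the upload splits the item list into chunks. An illustrative (non-Squirro) helper showing that chunking:

```python
def chunked(items, batch_size):
    """Yield successive batches of at most batch_size items."""
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]

items = [{'id': 'squirro-item%d' % n} for n in range(5)]
batches = list(chunked(items, 2))
# 5 items with batch_size=2 -> batches of sizes 2, 2, 1
```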

Project selection:

The ItemUploader creates a source in your project. The project must exist before the ItemUploader is instantiated.

Source selection:

The source is created or re-used; the parameters above define how the source is named.

Configuration:

The ItemUploader can load its settings from a configuration file. The default section is squirro and can be overridden with the config_section parameter to allow for multiple sources/projects.

Example configuration:

[squirro]
project_id = 2sic33jZTi-ifflvQAVcfw
token = 9c2d1a9002a8a152395d74880528fbe4acadc5a1
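The file above is standard INI syntax; a quick sketch with Python’s configparser showing that these defaults are plain key/value pairs (reading from a string here instead of ~/.squirrorc):

```python
import configparser

config = configparser.ConfigParser()
config.read_string("""
[squirro]
project_id = 2sic33jZTi-ifflvQAVcfw
token = 9c2d1a9002a8a152395d74880528fbe4acadc5a1
""")
# The uploader reads values like these from the configured section.
project_id = config.get('squirro', 'project_id')
```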

upload(items, priority=0, pipeline_workflow_id=None, num_retries=10, delay=1, backoff=2)

Sends items to Squirro.

Parameters
  • items – A list of items. See the Item Format documentation for the keys and values of the individual items.

  • priority – int, priority of ingestion for the dataset to be loaded. Currently only a value of 0 or 1 is supported: 0 loads the items asynchronously, 1 loads them synchronously.

  • pipeline_workflow_id – str, ID of an existing pipeline workflow which should be used to process the current batch of items. Can only be used when priority is set to 1.

  • num_retries – int, number of retries to make when the service is unavailable.

  • delay – int, initial delay in seconds between retries.

  • backoff – int, backoff multiplier, e.g. a value of 2 doubles the delay on each retry.

upload_rows(rows, priority=0, pipeline_workflow_id=None, num_retries=10, delay=1, backoff=2)

Sends rows to Squirro.

Parameters
  • rows – A list of rows. A row is a single row of data that will eventually be transformed and enriched to become a Squirro item; in other words, a single raw data point/sample, as extracted by a Squirro dataloader plugin.

  • priority – int, priority of ingestion for the dataset to be loaded. Currently only a value of 0 or 1 is supported: 0 loads the items asynchronously, 1 loads them synchronously.

  • pipeline_workflow_id – str, ID of an existing pipeline workflow which should be used to process the current batch of items. Can only be used when priority is set to 1.

  • num_retries – int, number of retries to make when the service is unavailable.

  • delay – int, initial delay in seconds between retries.

  • backoff – int, backoff multiplier, e.g. a value of 2 doubles the delay on each retry.
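To illustrate the difference from upload(): a row is raw plugin output, not yet a Squirro item. A hypothetical mapping of one CSV-style row into an item-like dict (field names here are illustrative, not the actual Squirro item schema; the real transformation happens server-side in the pipeline workflow):

```python
def row_to_item(row):
    """Illustrative mapping of a raw dataloader row to an item-like
    dict. The actual enrichment is performed by the configured
    pipeline workflow, not by client code like this."""
    return {
        'id': row['uid'],
        'title': row['headline'],
        'body': row.get('text', ''),
    }

rows = [{'uid': 'r1', 'headline': 'Hello', 'text': 'World'}]
items = [row_to_item(r) for r in rows]
```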