ItemUploader

class squirro_client.item_uploader.ItemUploader(token=None, project_id=None, project_title=None, object_id=None, source_name=None, source_ext_id=None, cluster=None, client_cls=None, batch_size=None, config_file=None, config_section=None, processing_config=None, steps_config=None, source_id=None, source_secret=None, pipeline_workflow_name=None, pipeline_workflow_id=None, timeout_secs=None, dataloader_options=None, dataloader_plugin_options=None, non_retry_list=[200, 202, 400, 401, 403, 404], **kwargs)

Item uploader class. Defaults are loaded from the .squirrorc file in the current user’s home directory.

Parameters
  • token – User refresh token.

  • project_id – Identifier of the project, optional but one of project_id or project_title has to be passed in.

  • project_title – Title of the project. This will use the first project found with the given title. If two projects with the same title exist, the project that is used is not predictable.

  • object_id – This parameter is deprecated, and is no longer needed.

  • source_name – Name of the source to be used. If no source with this name exists, a new source with this name is created. If more than one source with this name exists, processing is aborted and can only be resumed by specifying the source_id of the desired source to load into.

  • source_ext_id – External identifier of the source; if not provided, defaults to source_name.

  • cluster – Cluster to connect to. This only needs to be changed for on-premise installations.

  • batch_size – Number of items to send in one request. Keep this below 100, depending on your setup. If set to -1, the optimal batch size is calculated from the items. Defaults to -1.

  • config_file – Configuration file to use; defaults to ~/.squirrorc.

  • config_section – Section of the .ini file to use, defaults to squirro.

  • source_id – Source to use. If a source with this identifier exists, it is reused and no new source is created.

  • source_secret – This option is deprecated now and is ignored.

  • pipeline_workflow_name – Pipeline workflow name. Either the name or the ID needs to be set.

  • pipeline_workflow_id – Pipeline workflow ID.

  • dataloader_options – Dataloader options to store when creating a new data source.

  • dataloader_plugin_options – Dataloader plugin options to store when creating a new data source.

  • non_retry_list

    List of status codes for which we don’t want retry/backoff logic. Defaults to [200, 202, 400, 401, 403, 404].

    200, 202

    Success codes; no retry is needed.

    401, 403

    These already have a retry block in the _perform_request method of the Squirro client.

    400, 404

    Retrying does not help for these codes, as it won’t fix the underlying issue.
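The retry behaviour controlled by these parameters is a standard exponential backoff. A minimal, generic sketch (not the actual ItemUploader implementation) of how num_retries, delay, backoff, and non_retry_list interact:

```python
import time

def with_backoff(request, num_retries=10, delay=1, backoff=2,
                 non_retry_list=(200, 202, 400, 401, 403, 404)):
    """Call `request` (a callable returning an HTTP status code),
    retrying with exponentially growing delays, and return as soon
    as a status code from non_retry_list is seen."""
    wait = delay
    status = None
    for _ in range(num_retries):
        status = request()
        if status in non_retry_list:
            return status
        time.sleep(wait)  # back off before the next attempt
        wait *= backoff   # e.g. 1s, 2s, 4s, ... with backoff=2
    return status
```

With the defaults (delay=1, backoff=2), a persistently unavailable service is retried after 1, 2, 4, 8, … seconds, up to num_retries attempts.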

Typical usage:

>>> from squirro_client import ItemUploader
>>> uploader = ItemUploader(project_title='My Project',
...                         token='<your token>')
>>> items = [{'id': 'squirro-item1',
...           'title': 'Items arrived in Squirro!'}]
>>> uploader.upload(items)
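The batch_size parameter above controls how many items go into each request; conceptually, the upload splits the item list into chunks. An illustrative (non-Squirro) helper showing that chunking:

```python
def chunked(items, batch_size):
    """Yield successive batches of at most batch_size items."""
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]

items = [{'id': 'squirro-item%d' % n} for n in range(5)]
batches = list(chunked(items, 2))
# 5 items with batch_size=2 -> batches of sizes 2, 2, 1
```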

Project selection:

The ItemUploader creates a source in your project. The project must exist before the ItemUploader is instantiated.

Source selection:

The source is created or re-used; the parameters above define how the source is named.

Configuration:

The ItemUploader can load its settings from a configuration file. The default section is squirro and can be overridden with the config_section parameter to allow for multiple sources/projects.

Example configuration:

[squirro]
project_id = 2sic33jZTi-ifflvQAVcfw
token = 9c2d1a9002a8a152395d74880528fbe4acadc5a1
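The file above is standard INI syntax; a quick sketch with Python’s configparser showing that these defaults are plain key/value pairs (reading from a string here instead of ~/.squirrorc):

```python
import configparser

config = configparser.ConfigParser()
config.read_string("""
[squirro]
project_id = 2sic33jZTi-ifflvQAVcfw
token = 9c2d1a9002a8a152395d74880528fbe4acadc5a1
""")
# The uploader reads values like these from the configured section.
project_id = config.get('squirro', 'project_id')
```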

upload(items, priority=0, pipeline_workflow_id=None, num_retries=10, delay=1, backoff=2)

Sends items to Squirro.

Parameters
  • items – A list of items. See the Item Format documentation for the keys and values of the individual items.

  • priority – int, priority of ingestion for the dataset to be loaded. Currently only a value of 0 or 1 is supported: 0 loads the items asynchronously, 1 loads them synchronously.

  • pipeline_workflow_id – str, ID of an existing pipeline workflow which should be used to process the current batch of items. Can only be used when priority is set to 1.

  • num_retries – int, number of retries to make when the service is unavailable.

  • delay – int, initial delay in seconds between retries.

  • backoff – int, backoff multiplier, e.g. a value of 2 doubles the delay on each retry.

upload_rows(rows, priority=0, pipeline_workflow_id=None, num_retries=10, delay=1, backoff=2)

Sends rows to Squirro.

Parameters
  • rows – A list of rows. A row is a single row of data that will eventually be transformed and enriched to become a Squirro item; in other words, a single raw data point/sample, as extracted by a Squirro dataloader plugin.

  • priority – int, priority of ingestion for the dataset to be loaded. Currently only a value of 0 or 1 is supported: 0 loads the items asynchronously, 1 loads them synchronously.

  • pipeline_workflow_id – str, ID of an existing pipeline workflow which should be used to process the current batch of items. Can only be used when priority is set to 1.

  • num_retries – int, number of retries to make when the service is unavailable.

  • delay – int, initial delay in seconds between retries.

  • backoff – int, backoff multiplier, e.g. a value of 2 doubles the delay on each retry.
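To illustrate the difference from upload(): a row is raw plugin output, not yet a Squirro item. A hypothetical mapping of one CSV-style row into an item-like dict (field names here are illustrative, not the actual Squirro item schema; the real transformation happens server-side in the pipeline workflow):

```python
def row_to_item(row):
    """Illustrative mapping of a raw dataloader row to an item-like
    dict. The actual enrichment is performed by the configured
    pipeline workflow, not by client code like this."""
    return {
        'id': row['uid'],
        'title': row['headline'],
        'body': row.get('text', ''),
    }

rows = [{'uid': 'r1', 'headline': 'Hello', 'text': 'World'}]
items = [row_to_item(r) for r in rows]
```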