Squirro Client Reference#

This page is the AI Studio reference for the SquirroClient (Python SDK).

Ground Truth#

# Create a new ground truth
config = {
    "type": "text",
    "tagging_level": "sentences",
    "label": ["dog", "no dog"],
    "description": "In this ground truth we select sentences that are dog or not dog related.",
    "candidateset_ids": [CANDIDATE_SET_ID],
}

client.new_groundtruth(PROJECT_ID, 'Dog Ground Truth', config)

# Modify an existing ground truth
config = {
    "type": "text",
    # the tagging_level cannot be changed
    "label": ["dog", "no dog"],
    "description": "In this ground truth we select sentences that are dog or not dog related.",
    "candidateset_ids": [CANDIDATE_SET_ID],
}

client.modify_groundtruth(PROJECT_ID, GROUNDTRUTH_ID, name='Dog Ground Truth (modified name)', config=config)

# Delete a ground truth
client.delete_groundtruth(PROJECT_ID, GROUNDTRUTH_ID)

# List all ground truths of a project
client.get_groundtruths(PROJECT_ID)

# Get a single ground truth
client.get_groundtruth(PROJECT_ID, GROUNDTRUTH_ID)
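
The create and modify configurations above differ only in `tagging_level`, which cannot be changed after creation. A minimal sketch of a helper that builds both variants follows; `build_groundtruth_config` is a hypothetical name and not part of the SDK, and the placeholder IDs are illustrative.

```python
from typing import Any, Dict, List, Optional


def build_groundtruth_config(
    labels: List[str],
    description: str,
    candidateset_ids: List[str],
    tagging_level: Optional[str] = None,
) -> Dict[str, Any]:
    """Build a ground truth config dict (hypothetical helper, not an SDK call)."""
    if not labels:
        raise ValueError("at least one label is required")
    config: Dict[str, Any] = {
        "type": "text",
        "label": list(labels),
        "description": description,
        "candidateset_ids": list(candidateset_ids),
    }
    # Include tagging_level only when creating; it cannot be changed afterwards.
    if tagging_level is not None:
        config["tagging_level"] = tagging_level
    return config


# Config for creation (includes tagging_level).
create_config = build_groundtruth_config(
    ["dog", "no dog"],
    "Dog-related sentences.",
    ["candidate-set-id"],
    tagging_level="sentences",
)

# Config for modification (tagging_level is left out).
modify_config = build_groundtruth_config(
    ["dog", "no dog"],
    "Dog-related sentences (updated).",
    ["candidate-set-id"],
)
```

The resulting dictionaries can then be passed as the `config` argument of `new_groundtruth` and `modify_groundtruth` respectively.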

Labeled Extract#

label = {
    "item_id": SQUIRRO_ITEM_ID,
    "extract": "The dog (Canis familiaris when considered a distinct species or Canis lupus familiaris when considered a subspecies of the wolf) is a domesticated carnivore of the family Canidae.",
    "label": "dog",
    "language": "en",
    "keywords": {},
    "candidateset_id": CANDIDATE_SET_ID,
}

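# Create a new labeled extract
client.new_groundtruth_label(PROJECT_ID, GROUNDTRUTH_ID, label)

# Change the label of an existing labeled extract
client.modify_groundtruth_label(PROJECT_ID, GROUNDTRUTH_ID, LABELED_EXTRACT_ID, 'positive')

# Delete a labeled extract
client.delete_groundtruth_label(PROJECT_ID, GROUNDTRUTH_ID, LABELED_EXTRACT_ID)

# List all labeled extracts of a ground truth
client.get_groundtruth_labels(PROJECT_ID, GROUNDTRUTH_ID)

# Get a single labeled extract
client.get_groundtruth_label(PROJECT_ID, GROUNDTRUTH_ID, LABELED_EXTRACT_ID)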

Rule#

rule = {
    "query": "dog sitter",
    "proximity": 6,
    "is_sequence": True,
    "type": "inclusive",
    "labeled_item_id": LABELED_EXTRACT_ID,
}

# Create a new rule
client.new_groundtruth_rule(PROJECT_ID, GROUNDTRUTH_ID, rule)

# Modify an existing rule
client.modify_groundtruth_rule(PROJECT_ID, GROUNDTRUTH_ID, RULE_ID, rule)

# Delete a rule
client.delete_groundtruth_rule(PROJECT_ID, GROUNDTRUTH_ID, RULE_ID)

# Get a single rule
client.get_groundtruth_rule(PROJECT_ID, GROUNDTRUTH_ID, RULE_ID)

Labels Balancer#

The balancer step evens out the number of elements per class in a data set. Balancing helps the ML algorithm generalize instead of overfitting to the most populated class.

Note

If the batch size is smaller than the data set size, balancing is applied within each batch only, not across the whole data set.

Parameters#

  • class_field: key name in which the classes are located.

  • classes: list of all classes which are used in the classification.

  • not_class: boolean that states whether a not class should be instantiated.

  • output_label_field: field in which the labels are stored (only relevant if not_class is True).

  • deviation (optional): Maximum deviation from the smallest class bucket to the largest bucket (1.0 = 100%, 0.0 = 0%).

  • seed (optional): Seed for the randomization process.

Example#

{
  "step": "balancer",
  "type": "balancer",
  "name": "balancer",
  "classes": ["A","B","C","D"],
  "class_field": "label",
  "not_class": false,
  "output_label_field": "balanced_label"
}
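
To illustrate what balancing with a `deviation` means, here is a minimal sketch in plain Python. It is not the AI Studio implementation: it assumes that balancing is done by random downsampling, that `deviation` caps every class at `(1 + deviation)` times the smallest class size, and that documents outside `classes` are dropped. The `not_class` and `output_label_field` options are not modeled.

```python
import random
from collections import defaultdict
from typing import Any, Dict, List, Optional


def balance(
    docs: List[Dict[str, Any]],
    class_field: str,
    classes: List[str],
    deviation: float = 0.0,
    seed: Optional[int] = None,
) -> List[Dict[str, Any]]:
    """Downsample so that no class exceeds (1 + deviation) * smallest class size."""
    rng = random.Random(seed)

    # Group documents by class, ignoring classes that are not listed.
    buckets: Dict[str, List[Dict[str, Any]]] = defaultdict(list)
    for doc in docs:
        if doc.get(class_field) in classes:
            buckets[doc[class_field]].append(doc)
    if not buckets:
        return []

    # Assumption: the cap is derived from the smallest non-empty bucket.
    smallest = min(len(bucket) for bucket in buckets.values())
    cap = int(smallest * (1.0 + deviation))

    balanced: List[Dict[str, Any]] = []
    for name in classes:
        bucket = buckets.get(name, [])
        if len(bucket) > cap:
            bucket = rng.sample(bucket, cap)  # random downsampling
        balanced.extend(bucket)
    return balanced


# Six documents of class "A" and two of class "B".
docs = [{"id": i, "label": "A"} for i in range(6)]
docs += [{"id": i, "label": "B"} for i in range(6, 8)]

strict = balance(docs, "label", ["A", "B"], deviation=0.0, seed=1)
relaxed = balance(docs, "label", ["A", "B"], deviation=0.5, seed=1)
```

With `deviation=0.0` both classes end up with two documents each; with `deviation=0.5` class "A" may keep up to three.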

Data Randomizer#

The randomizer step shuffles the order of the documents. The randomization of the data set allows the ML algorithm to come up with a more generally-applicable solution.

Note

If the batch size is smaller than the data set size, documents are shuffled only within each batch.

Parameters#

  • seed (optional): Seed for the randomization process.

Example#

{
  "step": "randomizer",
  "type": "randomizer"
}
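
The effect of shuffling only within batches can be sketched in plain Python as follows. This is an illustration under assumptions, not the step's actual code: `shuffled_batches` and its `batch_size` argument are hypothetical names, and the real step may batch documents differently.

```python
import random
from typing import Iterable, Iterator, List, Optional, TypeVar

T = TypeVar("T")


def shuffled_batches(
    docs: Iterable[T], batch_size: int, seed: Optional[int] = None
) -> Iterator[List[T]]:
    """Yield batches of documents, each shuffled internally."""
    rng = random.Random(seed)
    batch: List[T] = []
    for doc in docs:
        batch.append(doc)
        if len(batch) == batch_size:
            rng.shuffle(batch)  # shuffle only the documents of this batch
            yield batch
            batch = []
    if batch:  # last, possibly smaller batch
        rng.shuffle(batch)
        yield batch


batches = list(shuffled_batches(range(10), batch_size=4, seed=0))
```

Each document stays in the batch it was read into, so a document from the first batch never moves to a later one. Using the same `seed` reproduces the same order.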

Batch Randomizer#

This step creates a checkpoint and shuffles the order of the batches before the execution of the next step.

Parameters#

  • checkpoint_processing (optional): Boolean that indicates whether a checkpoint is created in a non-training execution.

Example#

{
  "step": "batch_randomizer",
  "type": "batch_randomizer",
  "checkpoint_processing": true
}
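
In contrast to shuffling documents, shuffling batches changes the order in which whole batches are processed while leaving each batch's content untouched. A minimal sketch, assuming the batches are already materialized in a list (a stand-in for the checkpoint, which is not modeled here); `shuffle_batch_order` is a hypothetical name.

```python
import random
from typing import List, Optional, Sequence, TypeVar

T = TypeVar("T")


def shuffle_batch_order(
    batches: Sequence[List[T]], seed: Optional[int] = None
) -> List[List[T]]:
    """Return the batches in random order; the input sequence is not modified."""
    rng = random.Random(seed)
    order = list(batches)  # shallow copy, so the caller's sequence keeps its order
    rng.shuffle(order)
    return order


original = [[1, 2], [3, 4], [5, 6]]
reordered = shuffle_batch_order(original, seed=3)
```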

Squirro Ground Truth Loader#

The squirro_groundtruth step loads your ground truth and transforms it into the Document structure so that the data points can be used in the pipeline for training and validating a model.

Parameters#

  • temporal_version: Date that defines which Ground Truth version is selected.

  • groundtruth_id: ID of the Squirro Ground Truth.

  • project_id: ID of the Squirro project.

  • cluster: URL of the cluster.

  • token: Squirro token.

Example#

{
  "step": "loader",
  "type": "squirro_groundtruth",
  "fields": [],
  "temporal_version": "2020-10-07T16:24:01.36052",
  "groundtruth_id": GROUNDTRUTH_ID,
  "project_id": PROJECT_ID,
  "cluster": CLUSTER,
  "token": TOKEN
}
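
Because the example above uses placeholders, the step can also be assembled in Python and serialized to JSON. The sketch below is one way to do that and keeps the token out of the source file; `build_loader_step` and the environment variable names `SQUIRRO_CLUSTER` and `SQUIRRO_TOKEN` are assumptions made for illustration, not names defined by Squirro.

```python
import json
import os
from typing import Any, Dict


def build_loader_step(
    groundtruth_id: str, project_id: str, temporal_version: str
) -> Dict[str, Any]:
    """Assemble a squirro_groundtruth loader step (hypothetical helper)."""
    return {
        "step": "loader",
        "type": "squirro_groundtruth",
        "fields": [],
        "temporal_version": temporal_version,
        "groundtruth_id": groundtruth_id,
        "project_id": project_id,
        # Assumed environment variable names; adapt them to your setup.
        "cluster": os.environ.get("SQUIRRO_CLUSTER", "https://example.invalid"),
        "token": os.environ.get("SQUIRRO_TOKEN", ""),
    }


step = build_loader_step(
    groundtruth_id="my-groundtruth-id",
    project_id="my-project-id",
    temporal_version="2020-10-07T16:24:01.36052",
)
step_json = json.dumps(step, indent=2)
```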