AI Studio Step 3: Models#

The Model is the component that performs text classification. It consists of two main parts:

  • Ground Truth

  • Machine Learning Workflow

Usage#

All models within a project are listed in the Models overview screen as shown below:

Tip: A new Model can be created by clicking the plus icon in the top-right corner of the screen.

image1

Besides listing all models, the overview screen offers the following actions:

  • Copy: Copy the config of a model.

  • Re-Train: Re-train a model based on the latest ground truth.

  • Validate: Switch to the validation screen.

  • Publish: Directly publish a model.

  • Delete: Delete an unpublished model.

image2

To create a Model, select a ground truth and a machine-learning template, as shown above.

Further metadata is also required, as shown below:

image3

Labels Balancer#

The balancer step evens out the number of elements per class in a data set. Balancing lets the ML algorithm learn more generally instead of overfitting to the most populated class bucket.

Note

If the batch size is smaller than the data set size, the balancer only balances the classes within each batch.

Parameters#

  • class_field: key name in which the classes are located.

  • classes: list of all classes which are used in the classification.

  • not_class: boolean which states whether a "not" class should be instantiated.

  • output_label_field: field in which the labels are stored (only relevant if not_class is true).

  • deviation (optional): Max deviation from the smallest class bucket to the largest bucket (1. = 100%, 0. = 0%).

  • seed (optional): Seed for the randomization process.

Example#

{
  "step": "balancer",
  "type": "balancer",
  "name": "balancer",
  "classes": ["A","B","C","D"],
  "class_field": "label",
  "not_class": false,
  "output_label_field": "balanced_label"
}
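
To make the effect of deviation more concrete, the following Python sketch downsamples every class to a cap derived from the smallest class bucket. It is an illustration only, not the balancer's actual implementation: the function name balance and the rule "cap = smallest bucket size × (1 + deviation)" are assumptions chosen for this example.

```python
import random
from collections import Counter, defaultdict


def balance(docs, class_field, deviation=0.0, seed=None):
    """Downsample each class to at most smallest_bucket * (1 + deviation) documents.

    Assumption: this reading of `deviation` is illustrative and may differ
    from what the real balancer step does.
    """
    rng = random.Random(seed)

    # Group the documents into one bucket per class.
    buckets = defaultdict(list)
    for doc in docs:
        buckets[doc[class_field]].append(doc)
    if not buckets:
        return []

    smallest = min(len(bucket) for bucket in buckets.values())
    cap = int(smallest * (1.0 + deviation))

    balanced = []
    for label in sorted(buckets):
        bucket = buckets[label]
        if len(bucket) > cap:
            # Randomly keep only `cap` documents of an over-represented class.
            bucket = rng.sample(bucket, cap)
        balanced.extend(bucket)
    return balanced


docs = (
    [{"id": i, "label": "A"} for i in range(10)]
    + [{"id": 10 + i, "label": "B"} for i in range(4)]
    + [{"id": 14 + i, "label": "C"} for i in range(5)]
)
result = balance(docs, "label", deviation=0.25, seed=42)
print(sorted(Counter(doc["label"] for doc in result).items()))
# [('A', 5), ('B', 4), ('C', 5)]
```

With a smallest bucket of 4 documents and a deviation of 0.25, the cap is 5, so only class A is reduced.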

Data Randomizer#

The randomizer step shuffles the order of the documents. Randomizing the data set helps the ML algorithm arrive at a more generally applicable solution.

Note

If the batch size is smaller than the data set size, documents are only shuffled within each batch.

Parameters#

  • seed (optional): Seed for the randomization process.

Example#

{
  "step": "randomizer",
  "type": "randomizer"
}
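
The note about batches can be illustrated with a short Python sketch. It assumes the data set is processed in consecutive slices of batch_size documents; the function name shuffle_in_batches is hypothetical and not part of the pipeline.

```python
import random


def shuffle_in_batches(docs, batch_size, seed=None):
    """Shuffle documents inside each batch; documents never leave their batch."""
    rng = random.Random(seed)  # a fixed seed makes the order reproducible
    shuffled = []
    for start in range(0, len(docs), batch_size):
        batch = docs[start:start + batch_size]  # slicing copies, input is untouched
        rng.shuffle(batch)
        shuffled.extend(batch)
    return shuffled


docs = list(range(10))
out = shuffle_in_batches(docs, batch_size=5, seed=7)

# The first five documents are still 0..4 in some order, the last five 5..9.
print(sorted(out[:5]), sorted(out[5:]))
```

When batch_size is at least as large as the data set, the same function shuffles the whole data set at once.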

Batch Randomizer#

This step creates a checkpoint and shuffles the order of the batches before the execution of the next step.

Parameters#

  • checkpoint_processing (optional): Boolean which indicates whether a checkpoint is created during a non-training execution.

Example#

{
  "step": "batch_randomizer",
  "type": "batch_randomizer",
  "checkpoint_processing": true
}
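
As a rough illustration of the shuffling part of this step, the sketch below reorders whole batches while leaving the documents inside each batch untouched. Checkpoint creation is deliberately left out, and the function name shuffle_batches and its seed argument are hypothetical.

```python
import random


def shuffle_batches(batches, seed=None):
    """Return the batches in a new random order without modifying their content."""
    rng = random.Random(seed)
    order = list(batches)  # shallow copy so the caller's list keeps its order
    rng.shuffle(order)
    return order


batches = [[0, 1], [2, 3], [4, 5], [6, 7]]
shuffled = shuffle_batches(batches, seed=3)

# Every batch is still present and internally unchanged; only the order differs.
print(shuffled)
```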

Squirro Ground Truth loader#

The squirro_groundtruth step loads your ground truth and transforms it into the Document structure so that the data points can be used in the pipeline for training and validating a model.

Parameters#

  • temporal_version: Date which defines which Ground Truth version is selected.

  • groundtruth_id: ID of the Squirro Ground Truth.

  • project_id: ID of the Squirro project.

  • cluster: URL of the cluster.

  • token: Squirro token.

Example#

{
  "step": "loader",
  "type": "squirro_groundtruth",
  "fields": [],
  "temporal_version": "2020-10-07T16:24:01.36052",
  "groundtruth_id": GROUNDTRUTH_ID,
  "project_id": PROJECT_ID,
  "cluster": CLUSTER,
  "token": TOKEN
}
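
GROUNDTRUTH_ID, PROJECT_ID, CLUSTER, and TOKEN in the example above are placeholders that must be replaced with real values before the configuration is valid JSON. One way to do that is to build the step in Python, as sketched below. The environment variable names and the helper build_loader_step are assumptions made for this example, as is the choice to pass the IDs as strings.

```python
import json
import os


def build_loader_step(env=os.environ):
    """Build the loader step, filling the placeholders from a mapping.

    Assumption: the variable names below are hypothetical; use whatever
    configuration mechanism your deployment provides.
    """
    return {
        "step": "loader",
        "type": "squirro_groundtruth",
        "fields": [],
        "temporal_version": "2020-10-07T16:24:01.36052",
        "groundtruth_id": env.get("GROUNDTRUTH_ID", "<groundtruth-id>"),
        "project_id": env.get("PROJECT_ID", "<project-id>"),
        "cluster": env.get("CLUSTER", "<cluster-url>"),
        "token": env.get("TOKEN", "<token>"),
    }


# json.dumps quotes the values, so the result is valid JSON.
print(json.dumps(build_loader_step(), indent=2))
```

Reading the token from the environment also keeps it out of configuration files that might be committed to version control.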

Cross Validation#

K-fold cross-validation is primarily used in applied machine learning to estimate the performance of a machine learning model on unseen data.

It is a resampling procedure for evaluating machine learning models on a limited data sample.

The k-fold step splits the data set into k different subsets and iterates over them, using one subset as the test set and the remaining k-1 subsets as the training set.

The figure below shows an example for k = 10:

image4
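
The splitting described above can be sketched in a few lines of Python. This is a generic illustration of k-fold splitting, not the code used by the kfold step; the function name kfold_indices is hypothetical, and the sketch splits the data in its given order without shuffling.

```python
def kfold_indices(n_samples, k):
    """Yield (train, test) index lists for k consecutive folds."""
    # Spread the remainder over the first folds so sizes differ by at most one.
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    indices = list(range(n_samples))
    start = 0
    for size in fold_sizes:
        test = indices[start:start + size]
        train = indices[:start] + indices[start + size:]
        yield train, test
        start += size


folds = list(kfold_indices(n_samples=10, k=5))
for train, test in folds:
    print("test:", test, "train:", train)
```

Each document appears in exactly one test set, so every document is used for evaluation once and for training k-1 times.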

Parameters#

  • k: how many pieces the data set gets split into.

  • output_path: file path in which the output of the k-fold validation gets stored.

  • output_field: field in which the predictions are stored.

  • classifier_params: parameters of the lib.nlp classifier to be used during the k-fold validation.

Note: The classifier and label_field fields exist only for inheritance reasons and are not used.

Example#

{
  "step": "classifier",
  "type": "kfold",
  "k": 5,
  "output_path": "./output.json",
  "output_field": "prediction",
  "label_field": "class",
  "classifier": "none",
  "classifier_params": {
    "explanation_field": "explanation",
    "input_fields": [
      "normalized_extract"
    ],
    "label_field": "label",
    "model_kwargs": {
      "probability": true
    },
    "model_type": "SVC",
    "output_field": "prediction",
    "step": "classifier",
    "type": "sklearn"
  }
}

Output#

The step outputs the success rate for each of the k folds (listed as groups).

Additionally, it lists the overall metrics across all folds.

Note: In the case of multiclass classification, the precision, recall, and f1-score are macro averaged.

****** KFOLD VALIDATION OUTPUT *****
Group 0:{'successful predicted': 26, 'total': 27, 'ratio': 0.9629629629629629}
Group 1:{'successful predicted': 27, 'total': 27, 'ratio': 1.0}
Group 2:{'successful predicted': 27, 'total': 27, 'ratio': 1.0}
Group 3:{'successful predicted': 25, 'total': 27, 'ratio': 0.9259259259259259}
Group 4:{'successful predicted': 27, 'total': 27, 'ratio': 1.0}
Report saved into: ./output.json
**********************************

****** OVERALL METRICS *****
metrics: {'metrics': {'accuracy': 97.2, 'precision': 97.39999999999999, 'recall': 97.2, 'f1-score': 97.2}, 'confusion_matrix': {'labels': ['cat', 'not_cat'], 'values': [17, 0, 1, 18]}}
Report stored at: ./kfold_validation.json
**********************************
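
To show what macro averaging means for the overall metrics, the following Python sketch recomputes them from the confusion matrix values in the sample output. It assumes the flat values list is stored row by row, with rows as predicted labels and columns as true labels. The output does not document this layout, but under this assumption the sketch reproduces the reported figures. The function name macro_metrics is hypothetical.

```python
def macro_metrics(values, n_classes):
    """Compute accuracy and macro-averaged precision, recall, and f1-score.

    Assumption: `values` is a row-major confusion matrix whose rows are
    predicted labels and whose columns are true labels.
    """
    matrix = [values[i * n_classes:(i + 1) * n_classes] for i in range(n_classes)]
    total = sum(values)
    correct = sum(matrix[i][i] for i in range(n_classes))

    precisions, recalls, f1_scores = [], [], []
    for i in range(n_classes):
        true_positives = matrix[i][i]
        predicted = sum(matrix[i])                             # row sum
        actual = sum(matrix[r][i] for r in range(n_classes))   # column sum
        precision = true_positives / predicted if predicted else 0.0
        recall = true_positives / actual if actual else 0.0
        f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
        precisions.append(precision)
        recalls.append(recall)
        f1_scores.append(f1)

    # Macro averaging: every class counts equally, regardless of its size.
    return {
        "accuracy": correct / total,
        "precision": sum(precisions) / n_classes,
        "recall": sum(recalls) / n_classes,
        "f1-score": sum(f1_scores) / n_classes,
    }


metrics = macro_metrics([17, 0, 1, 18], n_classes=2)
print({name: round(value * 100, 1) for name, value in metrics.items()})
# {'accuracy': 97.2, 'precision': 97.4, 'recall': 97.2, 'f1-score': 97.2}
```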