# AI Studio Step 3: Models

The Model performs text classification and is built from two main components:

• Ground Truth

• Machine Learning Workflow

## Usage

All models within a project are listed in the Models overview screen as shown below:

Tip: A new Model can be created by clicking the plus icon in the top-right corner of the screen.

In addition to listing all models, the overview screen offers several useful actions:

• Copy: Copy the config of a model.

• Re-Train: Re-train a model based on the latest ground truth.

• Validate: Switch to the validation screen.

• Publish: Directly publish a model.

• Delete: Delete an unpublished model.

To create a Model you need to select a ground truth and a machine-learning template as shown above.

Further metadata is also required, as shown below:

## Labels Balancer

The balancer step equalizes the number of elements per class in a data set. Balancing helps the ML algorithm learn more general patterns instead of overfitting to the most heavily populated class bucket.

Note: The balancer only works within a batch if the batch size is smaller than the data set size.
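As an illustration, balancing by downsampling could be sketched as follows. This `balance` helper is hypothetical and not the actual step implementation; it assumes the `deviation` parameter caps how much larger any class bucket may be than the smallest one.

```python
import random

def balance(docs, class_field, classes, deviation=0.0, seed=42):
    """Downsample each class bucket so that no bucket exceeds the
    smallest bucket by more than the allowed deviation (sketch only)."""
    rng = random.Random(seed)
    # Group documents into one bucket per class.
    buckets = {c: [d for d in docs if d[class_field] == c] for c in classes}
    smallest = min(len(b) for b in buckets.values())
    limit = int(smallest * (1 + deviation))  # max allowed bucket size
    balanced = []
    for items in buckets.values():
        rng.shuffle(items)
        balanced.extend(items[:limit])
    rng.shuffle(balanced)
    return balanced
```

With `deviation=0.0`, every class ends up with exactly as many elements as the smallest class.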

### Parameters

• class_field: key name in which the classes are located.

• classes: list of all classes which are used in the classification.

• not_class: boolean which states whether a not class should be instantiated.

• output_label_field: field in which the labels are stored (only relevant if not_class is True).

• deviation (optional): Max deviation from the smallest class bucket to the largest bucket (1. = 100%, 0. = 0%).

• seed (optional): Seed for the randomization process.

### Example

```
{
    "step": "balancer",
    "type": "balancer",
    "name": "balancer",
    "classes": ["A", "B", "C", "D"],
    "class_field": "label",
    "not_class": false,
    "output_label_field": "balanced_label"
}
```


## Data Randomizer

The randomizer step shuffles the order of the documents. Randomizing the data set helps the ML algorithm arrive at a more generally applicable solution.

Note: Documents are only shuffled within a batch if the batch size is smaller than the data set size.
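The shuffling within one batch can be sketched as below; this `randomize` helper is illustrative, not the step's actual code. Passing a fixed seed makes the shuffle reproducible across runs.

```python
import random

def randomize(docs, seed=None):
    """Shuffle the documents of one batch (illustrative sketch).
    A fixed seed yields the same order on every run."""
    random.Random(seed).shuffle(docs)
    return docs
```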

### Parameters

• seed (optional): Seed for the randomization process

### Example

```
{
    "step": "randomizer",
    "type": "randomizer"
}
```


## Batch Randomizer

This step creates a checkpoint and shuffles the order of the batches before the execution of the next step.
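Ignoring checkpoint handling, the batch-level shuffle could be sketched like this (a hypothetical helper, not the step's implementation): whole batches change position, but each batch keeps its internal document order.

```python
import random

def shuffle_batches(batches, seed=None):
    """Shuffle the order of whole batches while leaving each batch's
    internal document order intact (illustrative sketch)."""
    order = list(batches)  # copy so the input list is untouched
    random.Random(seed).shuffle(order)
    return order
```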

### Parameters

• checkpoint_processing (optional): Boolean which indicates whether a checkpoint is created in a non-training execution.

### Example

```
{
    "step": "batch_randomizer",
    "type": "batch_randomizer",
    "checkpoint_processing": true
}
```


## Squirro Ground Truth

The squirro_groundtruth step loads your ground truth and transforms it into the Document structure so that the data points can be used in the pipeline for training and validating a model.

### Parameters

• temporal_version: Date which defines which Ground Truth version is selected.

• groundtruth_id: ID of the Squirro Ground Truth.

• project_id: ID of the Squirro project.

• cluster: URL of the cluster.

• token: Squirro token.

### Example

```
{
    "type": "squirro_groundtruth",
    "fields": [],
    "temporal_version": "2020-10-07T16:24:01.36052",
    "groundtruth_id": GROUNDTRUTH_ID,
    "project_id": PROJECT_ID,
    "cluster": CLUSTER,
    "token": TOKEN
}
```


## Cross Validation

K-fold cross-validation is primarily used in applied machine learning to estimate the performance of a machine learning model on unseen data.

It is a re-sampling procedure used to evaluate machine learning models on a limited data sample.

The k-fold step splits the data set into k subsets and iterates over them, using each one in turn as the test set and the remaining k-1 subsets as the training set.

The figure below shows an example for k = 10:
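The splitting scheme can be sketched in a few lines; this `kfold_splits` helper is illustrative only and works on document indices rather than the actual pipeline structures.

```python
def kfold_splits(n_docs, k):
    """Yield (train_idx, test_idx) pairs: each of the k folds is used
    once as the test set, and the remaining k-1 folds form the
    training set (illustrative sketch over document indices)."""
    # Distribute any remainder over the first folds so sizes differ by at most 1.
    fold_sizes = [n_docs // k + (1 if i < n_docs % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        test = list(range(start, start + size))
        train = list(range(0, start)) + list(range(start + size, n_docs))
        yield train, test
        start += size
```

Every document appears in exactly one test fold across the k iterations.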

### Parameters

• k: number of pieces the data set gets split into.

• output_path: file path in which the output of the k-fold validation gets stored.

• output_field: field in which the predictions are stored.

• classifier_params: parameters of a lib.nlp classifier to be used during the k-fold validation.

Note: classifier and label fields exist for inheritance and are not used.

### Example

```
{
    "step": "classifier",
    "type": "kfold",
    "k": 5,
    "output_path": "./output.json",
    "output_field": "prediction",
    "label_field": "class",
    "classifier": "none",
    "classifier_params": {
        "explanation_field": "explanation",
        "input_fields": ["normalized_extract"],
        "label_field": "label",
        "model_kwargs": {
            "probability": true
        },
        "model_type": "SVC",
        "output_field": "prediction",
        "step": "classifier",
        "type": "sklearn"
    }
}
```


### Output

The step outputs the success rate for each of the k folds.

Additionally, it reports the overall metrics of the run.

Note: In the case of multiclass classification, the precision, recall, and f1-score are macro averaged.
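Macro averaging means taking the unweighted mean of the per-class scores, so every class counts equally regardless of its size. A sketch for precision from a confusion matrix follows; the row/column convention (rows = true class, columns = predicted class) is an assumption and may differ from the step's actual output format.

```python
def macro_precision(confusion):
    """Macro-averaged precision from a square confusion matrix,
    assuming rows = true class and columns = predicted class."""
    n = len(confusion)
    precisions = []
    for c in range(n):
        # Precision of class c: correct predictions of c over all predictions of c.
        predicted_c = sum(confusion[r][c] for r in range(n))
        precisions.append(confusion[c][c] / predicted_c if predicted_c else 0.0)
    # Unweighted mean: each class contributes equally.
    return sum(precisions) / n
```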

```
****** KFOLD VALIDATION OUTPUT *****
Group 0:{'successful predicted': 26, 'total': 27, 'ratio': 0.9629629629629629}
Group 1:{'successful predicted': 27, 'total': 27, 'ratio': 1.0}
Group 2:{'successful predicted': 27, 'total': 27, 'ratio': 1.0}
Group 3:{'successful predicted': 25, 'total': 27, 'ratio': 0.9259259259259259}
Group 4:{'successful predicted': 27, 'total': 27, 'ratio': 1.0}
Report saved into: ./output.json
**********************************

****** OVERALL METRICS *****
metrics: {'metrics': {'accuracy': 97.2, 'precision': 97.39999999999999, 'recall': 97.2, 'f1-score': 97.2}, 'confusion_matrix': {'labels': ['cat', 'not_cat'], 'values': [17, 0, 1, 18]}}
Report stored at: ./kfold_validation.json
**********************************
```