# AI Studio Step 3: Models

The Model performs text classification and is built from two main components:

• Ground Truth

• Machine Learning Workflow

## Usage

All models within a project are listed in the Models overview screen as shown below:

Tip: A new Model can be created by clicking the plus icon in the top-right corner of the screen.

In addition to listing all models, the overview screen offers several useful actions:

• Copy: Copy the config of a model.

• Re-Train: Re-train a model based on the latest ground truth.

• Validate: Switch to the validation screen.

• Publish: Directly publish a model.

• Delete: Delete an unpublished model.

To create a Model you need to select a ground truth and a machine-learning template as shown above.

Further metadata is also required, as shown below:

## Labels Balancer

The balancer step equalizes the number of elements per class in a data set. Balancing helps the ML algorithm learn more general patterns instead of overfitting to the most heavily populated class bucket.

Note: The balancer only works within a batch if the batch size is smaller than the data set size.
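As an illustration, balancing by downsampling could be sketched as follows. This `balance` helper is hypothetical and not the actual step implementation; it assumes the `deviation` parameter caps how much larger any class bucket may be than the smallest one.

```python
import random

def balance(docs, class_field, classes, deviation=0.0, seed=42):
    """Downsample each class bucket so that no bucket exceeds the
    smallest bucket by more than the allowed deviation (sketch only)."""
    rng = random.Random(seed)
    # Group documents into one bucket per class.
    buckets = {c: [d for d in docs if d[class_field] == c] for c in classes}
    smallest = min(len(b) for b in buckets.values())
    limit = int(smallest * (1 + deviation))  # max allowed bucket size
    balanced = []
    for items in buckets.values():
        rng.shuffle(items)
        balanced.extend(items[:limit])
    rng.shuffle(balanced)
    return balanced
```

With `deviation=0.0`, every class ends up with exactly as many elements as the smallest class.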

### Parameters

• class_field: key name in which the classes are located.

• classes: list of all classes which are used in the classification.

• not_class: boolean which states whether a not class should be instantiated.

• output_label_field: field in which the labels are stored (only relevant if not_class is True).

• deviation (optional): Max deviation from the smallest class bucket to the largest bucket (1. = 100%, 0. = 0%).

• seed (optional): Seed for the randomization process.

### Example

```
{
    "step": "balancer",
    "type": "balancer",
    "name": "balancer",
    "classes": ["A", "B", "C", "D"],
    "class_field": "label",
    "not_class": false,
    "output_label_field": "balanced_label"
}
```


## Data Randomizer

The randomizer step shuffles the order of the documents. Randomizing the data set helps the ML algorithm arrive at a more generally applicable solution.

Note: Documents are only shuffled within a batch if the batch size is smaller than the data set size.
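The shuffling within one batch can be sketched as below; this `randomize` helper is illustrative, not the step's actual code. Passing a fixed seed makes the shuffle reproducible across runs.

```python
import random

def randomize(docs, seed=None):
    """Shuffle the documents of one batch (illustrative sketch).
    A fixed seed yields the same order on every run."""
    random.Random(seed).shuffle(docs)
    return docs
```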

### Parameters

• seed (optional): Seed for the randomization process

### Example

```
{
    "step": "randomizer",
    "type": "randomizer"
}
```


## Batch Randomizer

This step creates a checkpoint and shuffles the order of the batches before the execution of the next step.
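Ignoring checkpoint handling, the batch-level shuffle could be sketched like this (a hypothetical helper, not the step's implementation): whole batches change position, but each batch keeps its internal document order.

```python
import random

def shuffle_batches(batches, seed=None):
    """Shuffle the order of whole batches while leaving each batch's
    internal document order intact (illustrative sketch)."""
    order = list(batches)  # copy so the input list is untouched
    random.Random(seed).shuffle(order)
    return order
```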

### Parameters

• checkpoint_processing (optional): Boolean which indicates whether a checkpoint is created in a non-training execution.

### Example

```
{
    "step": "batch_randomizer",
    "type": "batch_randomizer",
    "checkpoint_processing": true
}
```


## Squirro Ground Truth

The squirro_groundtruth step loads your ground truth and transforms it into the Document structure so that the data points can be used in the pipeline for training and validating a model.

### Parameters

• temporal_version: Date which defines which Ground Truth version is selected.

• groundtruth_id: ID of the Squirro Ground Truth.

• project_id: ID of the Squirro project.

• cluster: URL of the cluster.

• token: Squirro token.

### Example

```
{
    "type": "squirro_groundtruth",
    "fields": [],
    "temporal_version": "2020-10-07T16:24:01.36052",
    "groundtruth_id": GROUNDTRUTH_ID,
    "project_id": PROJECT_ID,
    "cluster": CLUSTER,
    "token": TOKEN
}
```


## Cross Validation

K-fold cross-validation is primarily used in applied machine learning to estimate the performance of a machine learning model on unseen data.

It is a re-sampling procedure used to evaluate machine learning models on a limited data sample.

The k-fold step splits the data set into k subsets and iterates over them, using each one in turn as the test set and the remaining k-1 subsets as the training set.

The figure below shows an example for k = 10:
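The splitting scheme can be sketched in a few lines; this `kfold_splits` helper is illustrative only and works on document indices rather than the actual pipeline structures.

```python
def kfold_splits(n_docs, k):
    """Yield (train_idx, test_idx) pairs: each of the k folds is used
    once as the test set, and the remaining k-1 folds form the
    training set (illustrative sketch over document indices)."""
    # Distribute any remainder over the first folds so sizes differ by at most 1.
    fold_sizes = [n_docs // k + (1 if i < n_docs % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        test = list(range(start, start + size))
        train = list(range(0, start)) + list(range(start + size, n_docs))
        yield train, test
        start += size
```

Every document appears in exactly one test fold across the k iterations.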

### Parameters

• k: number of pieces the data set gets split into.

• output_path: file path in which the output of the k-fold validation gets stored.

• output_field: field in which the predictions are stored.

• classifier_params: parameters of a lib.nlp classifier to be used during the k-fold validation.

Note: classifier and label fields exist for inheritance and are not used.

### Example

```
{
    "step": "classifier",
    "type": "kfold",
    "k": 5,
    "output_path": "./output.json",
    "output_field": "prediction",
    "label_field": "class",
    "classifier": "none",
    "classifier_params": {
        "explanation_field": "explanation",
        "input_fields": ["normalized_extract"],
        "label_field": "label",
        "model_kwargs": {
            "probability": true
        },
        "model_type": "SVC",
        "output_field": "prediction",
        "step": "classifier",
        "type": "sklearn"
    }
}
```


### Output

The step outputs the success rate for each of the k folds.

Additionally, it reports the overall metrics of the run.

Note: In the case of multiclass classification, the precision, recall, and f1-score are macro averaged.
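Macro averaging means taking the unweighted mean of the per-class scores, so every class counts equally regardless of its size. A sketch for precision from a confusion matrix follows; the row/column convention (rows = true class, columns = predicted class) is an assumption and may differ from the step's actual output format.

```python
def macro_precision(confusion):
    """Macro-averaged precision from a square confusion matrix,
    assuming rows = true class and columns = predicted class."""
    n = len(confusion)
    precisions = []
    for c in range(n):
        # Precision of class c: correct predictions of c over all predictions of c.
        predicted_c = sum(confusion[r][c] for r in range(n))
        precisions.append(confusion[c][c] / predicted_c if predicted_c else 0.0)
    # Unweighted mean: each class contributes equally.
    return sum(precisions) / n
```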

```
****** KFOLD VALIDATION OUTPUT *****
Group 0:{'successful predicted': 26, 'total': 27, 'ratio': 0.9629629629629629}
Group 1:{'successful predicted': 27, 'total': 27, 'ratio': 1.0}
Group 2:{'successful predicted': 27, 'total': 27, 'ratio': 1.0}
Group 3:{'successful predicted': 25, 'total': 27, 'ratio': 0.9259259259259259}
Group 4:{'successful predicted': 27, 'total': 27, 'ratio': 1.0}
Report saved into: ./output.json
**********************************

****** OVERALL METRICS *****
metrics: {'metrics': {'accuracy': 97.2, 'precision': 97.39999999999999, 'recall': 97.2, 'f1-score': 97.2}, 'confusion_matrix': {'labels': ['cat', 'not_cat'], 'values': [17, 0, 1, 18]}}
Report stored at: ./kfold_validation.json
**********************************
```