How to Integrate a Custom ML Classifier#

Profiles: Data Scientist, Project Creator

The integration of a custom step into libNLP is split into four major stages:

  • Exploration and implementation

  • Local testing

  • Upload

  • Usage

Exploration and Implementation#

Step 1#

Use the Jupyter notebook setup described in How to Interact with Squirro Using Jupyter Notebook to do your exploratory data analysis (EDA), develop your classifier, and, if dedicated hardware is available, train your machine learning models.
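As a rough illustration of this exploration phase, the sketch below trains a simple text classifier and persists it to disk so that a later libNLP step can load it. The use of scikit-learn, the TF-IDF plus logistic regression baseline, and the file name fake_news_model.pkl are assumptions for this example rather than requirements of libNLP.

import pickle

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Toy training data mirroring the fake/not-fake example used later in this guide.
texts = [
    "This is a fake Squirro Item. It is composed of a couple fake sentences.",
    "This is not a fake Squirro Item. It is composed of a couple not fake sentences."
]
labels = ["fake", "not fake"]

# A simple TF-IDF + logistic regression baseline for the exploration phase.
model = make_pipeline(TfidfVectorizer(), LogisticRegression())
model.fit(texts, labels)

# Persist the trained model so the custom libNLP step can load it later.
with open("fake_news_model.pkl", "wb") as f:
    pickle.dump(model, f)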

Step 2#

Create a libNLP step that includes the classifier or that can run from a pre-trained model:

  1. To start, download and install libNLP.

  2. Refer to the Classifiers Package documentation. To make it easier, you can inherit from the classifier base class, which comes with some pre-defined parameters like input_field, input_fields, label_field and output_field.

Below is a template you can fill in with the code for your classifier:

"""Custom classifier class"""

from squirro.lib.nlp.steps.classifiers.base import Classifier

class CustomClassifier(Classifier):
    """Custom #Classifier.

    # Parameters
    type (str): `my_custom_classifier`
    my_parameter (str): my parameter
    """

    def __init__(self, config):
        super(CustomClassifier, self).__init__(config)

    def process(self, docs):
        """ process/execute inference job on the incoming data """
        return docs

    def train(self, docs):
        """ train your model """
        return self.process(docs)
  3. To make it work, you also need to examine the incoming data structure. Both the train and process functions receive a list of Documents.

Which fields are populated depends on the prior steps and their configuration.
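As a starting point, the sketch below shows what a minimal process implementation could look like once you know which fields arrive. It assumes that a Document exposes its data through a fields dictionary and that the base-class parameters input_fields and output_field are configured; verify the exact attribute names against the Document class documentation.

def process(self, docs):
    """Inspect and enrich the incoming documents (sketch; assumes a `fields` dict)."""
    for doc in docs:
        # See which fields the prior steps populated on this document.
        print(sorted(doc.fields.keys()))

        # Read the configured input field and write a dummy prediction to the output field.
        text = doc.fields.get(self.input_fields[0], "")
        doc.fields[self.output_field] = "fake" if "fake" in text else "not fake"
    return docs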

Note

The train function can be a pseudo function, especially if a pre-trained model is used: it does not need to actually train a model when a trained model is already provided and the step is only meant for inference.
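For example, if you trained and pickled a model during exploration, the step can load it once in __init__ and keep train as a pass-through, as sketched in this variant of the template. The pickle format, the file name fake_news_model.pkl, and the fields dictionary access are assumptions carried over from the sketches above.

"""Custom classifier running inference from a pre-trained model (sketch)."""

import pickle

from squirro.lib.nlp.steps.classifiers.base import Classifier


class CustomClassifier(Classifier):
    """Custom Classifier backed by a pre-trained model.

    # Parameters
    type (str): `custom_classifier`
    """

    def __init__(self, config):
        super().__init__(config)
        # Assumption: the pickled model is stored next to this step (see the Upload section).
        with open("fake_news_model.pkl", "rb") as f:
            self.model = pickle.load(f)

    def process(self, docs):
        """Run inference on the incoming documents (assumes a `fields` dict)."""
        for doc in docs:
            text = doc.fields.get(self.input_fields[0], "")
            doc.fields[self.output_field] = self.model.predict([text])[0]
        return docs

    def train(self, docs):
        """Pseudo train function: the model is already trained, so only run inference."""
        return self.process(docs)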

Local Testing#

To test that your custom step works, you can run it locally with the following workflow. The example embeds items directly, but you can also load data from CSV, JSON, and so on:

workflow = {
    "dataset": {
        "items": [{
        "id": "0",
        "keywords": {
            "label": ["fake"]
        },
        "body": "<html><body><p>This is a fake Squirro Item. It is composed of a couple fake sentences.</p></body></html>"
        },{
        "id": "1",
        "keywords": {
            "label": ["not fake"]
        },
        "body": "<html><body><p>This is not a fake Squirro Item. It is composed of a couple not fake sentences.</p></body></html>"
        }]
    },
    "pipeline": [
        {
            "fields": [
                "body",
                "keywords.label"
            ],
            "step": "loader",
            "type": "squirro_item"
        },
        {
            "input_fields": ["body"],
            "output_field": "prediction",
            "label_field": "keywords.label",
            "step": "custom",
            "type": "custom_classifier",
            "name": "custom_classifier",
            "my_parameter": "my_value"
        }
    ]
}

Note: The name field refers to the file name of the custom classifier (custom_classifier.py in this example).

With libNLP installed, the workflow can be run in either training or inference mode:

  • Training Mode

    from squirro.lib.nlp.runner import Runner
    runner = Runner(workflow)
    try:
        for _ in runner.train():
            continue
    
    finally:
        runner.clean()
    
  • Inference Mode

    from squirro.lib.nlp.runner import Runner
    runner = Runner(workflow)
    result = []
    try:
        for item in runner.infer():
            result.append(item)
    finally:
        runner.clean()
    

Note: Execute the scripts above from the same folder where you stored the custom classifier.
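To see where the prediction was written, inspect one of the returned items. The exact structure of a result item depends on the loader and your output_field configuration, so the snippet below simply pretty-prints the first result.

import json

# Pretty-print the first inference result to locate the predicted field.
if result:
    print(json.dumps(result[0], indent=2, default=str))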

Upload#

  1. Your custom classifier step and (optionally) the trained model need to be stored in the same folder before you can upload your custom classifier:

    custom_ml_workflow
    ├── custom_classifier.py
    └── (optional: physical_model)
    
  2. Now the folder can be uploaded using the SquirroClient:

from squirro_client import SquirroClient

client = SquirroClient(None, None, cluster=CLUSTER_URL)
client.authenticate(refresh_token=TOKEN)
client.new_machinelearning_workflow(
    project_id=PROJECT_ID,
    name="My Custom Workflow",
    config=workflow,
    ml_models="custom_ml_workflow"
)

Note

The workflow passed as config is a dict containing the ML workflow as shown in the local testing section; it can be adjusted later. ml_models is the path to the folder where the classifier and the model have been stored.
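If you adjust the workflow dict later, you can push the change from the client as well. The modify call below mirrors new_machinelearning_workflow and its exact signature is an assumption here; check the SquirroClient reference for the supported parameters, and obtain the workflow ID from the API or the AI Studio plugin.

# Assumption: the client exposes modify_machinelearning_workflow analogous to
# new_machinelearning_workflow; verify the signature in the SquirroClient reference.
client.modify_machinelearning_workflow(
    project_id=PROJECT_ID,
    ml_workflow_id=ML_WORKFLOW_ID,  # ID of the uploaded workflow (hypothetical variable)
    config=workflow                 # the adjusted ML workflow dict
)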

  3. When you open the ML Workflow studio plugin in the AI Studio tab, you can see the uploaded ML workflow and edit it, as shown below:

    image1

Usage#

Now the custom classifier can be used in three different scenarios:

  • Training and Inference Jobs

  • Trained Model With Custom Classifier Step

  • Custom Step Within a ML Template

Training and Inference Jobs#

Use the ML Job studio plugin to run training and inference jobs. To run the jobs on data indexed in Squirro, it’s recommended you use the SquirroQueryLoader.

To store classifications and predictions, Squirro suggests using the SquirroEntityFilter step for sentence-level jobs and the SquirroItemSaver step for document-level jobs (see the sketch below).
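As a rough sketch, a document-level job workflow could chain the query loader, the custom classifier, and the item saver as shown below. The step and type identifiers and the loader/saver parameters are assumptions inferred from the naming pattern of the local-testing example; consult the SquirroQueryLoader, SquirroEntityFilter, and SquirroItemSaver documentation for the exact values.

# Hedged sketch of a document-level inference pipeline; identifiers and parameters
# below are assumptions, not the verified step configuration.
job_workflow = {
    "pipeline": [
        {
            "step": "loader",
            "type": "squirro_query",  # assumed identifier for the SquirroQueryLoader
            "query": "*",             # which indexed Squirro items to load
            "fields": ["body", "keywords.label"]
        },
        {
            "step": "custom",
            "type": "custom_classifier",
            "name": "custom_classifier",
            "input_fields": ["body"],
            "output_field": "prediction",
            "label_field": "keywords.label"
        },
        {
            "step": "saver",
            "type": "squirro_item",   # assumed identifier for the SquirroItemSaver
            "fields": ["prediction"]  # write the prediction back to the indexed items
        }
    ]
}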

Trained Model With Custom Classifier Step#

If a trained model is uploaded with the custom classifier step, you can also make it available in the enrich pipeline.

This allows you to classify documents directly as they are loaded.

Reference: For more information, see How To Publish ML Models Using the Squirro Client.

Custom Step Within a ML Template#

The last option is to make a custom step available within a ML Template in the AI Studio.

Currently, the only option is to transform the custom step into an actual libNLP step and then create a new ML template.