How To Publish ML Models Using the Squirro Client#

Besides publishing models directly from the AI Studio or importing published models through project import, you can also use the SquirroClient to publish ML models to the Data Processing Pipeline.

This guide explains:

  • How to publish a new ML model by submitting a complete ML workflow configuration.

  • How to publish an already existing ML workflow as a model to the Squirro pipeline.

Reference: To learn more about the SquirroClient, see SquirroClient (Python SDK).

Publish a New ML Model#

Using Python, connect and authenticate with the SquirroClient:

from squirro_client import SquirroClient

cluster = '<YOUR CLUSTER>'
project_id = '<YOUR PROJECT_ID>'
token = '<YOUR TOKEN>'

client = SquirroClient(None, None, cluster=cluster)
client.authenticate(refresh_token=token)
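To keep the token out of source files, the three connection values can be read from the environment before the client is created. The following is a minimal sketch; the variable names `SQUIRRO_CLUSTER`, `SQUIRRO_PROJECT_ID`, and `SQUIRRO_TOKEN` and the helper `load_squirro_settings` are hypothetical and not part of the Squirro SDK.

```python
import os


def load_squirro_settings(env=os.environ):
    """Collect connection settings from environment variables.

    The variable names are illustrative assumptions, not Squirro conventions.
    """
    required = ("SQUIRRO_CLUSTER", "SQUIRRO_PROJECT_ID", "SQUIRRO_TOKEN")
    # Fail early with one message that lists every missing variable.
    missing = [name for name in required if not env.get(name)]
    if missing:
        raise RuntimeError("Missing environment variables: " + ", ".join(missing))
    return {
        "cluster": env["SQUIRRO_CLUSTER"],
        "project_id": env["SQUIRRO_PROJECT_ID"],
        "token": env["SQUIRRO_TOKEN"],
    }
```

The returned values can then be passed to `SquirroClient(None, None, cluster=...)` and `client.authenticate(refresh_token=...)` exactly as in the snippet above.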

Define the workflow configuration as shown below:

config = {
 'dataset': {},
 'pipeline': [{'fields': ['body'], 'step': 'loader', 'type': 'squirro_query'},
              {'fields': ['body'],
               'mark_as_skipped': True,
               'step': 'filter',
               'type': 'empty'},
              {'cleaning': {'approx.': 'approx',
                            'etc.': 'etc',
                            'i.e.': 'ie'},
               'input_fields': ['body'],
               'output_fields': ['extract_sentences'],
               'rules': ['**',
                         '...',
                         '…',
                         ': '],
               'step': 'tokenizer',
               'type': 'sentences_nltk'},
              {'fields': ['extract_sentences'],
               'step': 'filter',
               'type': 'doc_split'},
              {'input_fields': ['extract_sentences'],
               'output_fields': ['extract_sentences'],
               'step': 'tokenizer',
               'type': 'html'},
              {'fields': ['extract_sentences'],
               'step': 'filter',
               'type': 'doc_split'},
              {'input_fields': ['extract_sentences'],
               'output_fields': ['sentences_normalized'],
               'step': 'normalizer',
               'type': 'html'},
              {'fields': ['sentences_normalized'],
               'mark_as_skipped': True,
               'step': 'filter',
               'type': 'regex',
               'whitelist_regexes': ['^.{20,}$']},
              {'blacklist_terms': [],
               'fields': ['sentences_normalized'],
               'matching_label': 'tax_rate1',
               'name': './models/ais/proximity',
               'non_matching_label': 'not_tax_rate1_tax_rate2',
               'output_field': 'prediction_tax_rate1',
               'step': 'filter',
               'type': 'proximity',
               'whitelist_terms': ['tax rate of~1|','tax rate~2|']},
              {'blacklist_terms': [],
               'fields': ['sentences_normalized'],
               'matching_label': 'tax_rate2',
               'name': './models/ais/proximity',
               'non_matching_label': 'not_tax_rate1_tax_rate2',
               'output_field': 'prediction_tax_rate2',
               'step': 'filter',
               'type': 'proximity',
               'whitelist_terms': ['tax rate~4|']},
              {'delimiter': ',',
               'input_fields': ['prediction_tax_rate1', 'prediction_tax_rate2'],
               'output_field': 'prediction',
               'step': 'filter',
               'type': 'merge'},
              {'input_field': 'prediction',
               'output_field': 'prediction',
               'step': 'filter',
               'type': 'split'},
              {'fields': ['sentences_normalized', 'prediction'],
               'step': 'filter',
               'type': 'doc_join'},
              {'entity_name_field': 'Catalyst',
               'entity_type': 'Catalyst',
               'excluded_values': ['not_tax_rate1_tax_rate2'],
               'extract_field': 'sentences_normalized',
               'format_values': False,
               'global_property_field_map': {},
               'modes': ['process'],
               'property_field_map': {'Catalyst': ['prediction']},
               'required_properties': ['Catalyst'],
               'source_field': 'body',
               'step': 'filter',
               'type': 'squirro_entity'}
               ]
    }
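Because the configuration is a plain Python dictionary, it can be sanity-checked locally before it is submitted. The sketch below assumes, based only on the example above, that every pipeline step carries a `step` and a `type` key. The helpers `check_workflow_config` and `proximity_labels` are hypothetical and not part of the Squirro SDK.

```python
def check_workflow_config(config):
    """Return a list of problems found in a workflow config (empty if none)."""
    pipeline = config.get("pipeline")
    if not isinstance(pipeline, list) or not pipeline:
        return ["'pipeline' must be a non-empty list of steps"]
    problems = []
    for index, step in enumerate(pipeline):
        # Assumption: each step needs both keys, as in the example config.
        for key in ("step", "type"):
            if key not in step:
                problems.append(f"step {index} is missing '{key}'")
    return problems


def proximity_labels(config):
    """Collect the labels emitted by proximity steps, without duplicates."""
    labels = []
    for step in config.get("pipeline", []):
        if step.get("type") != "proximity":
            continue
        for key in ("matching_label", "non_matching_label"):
            label = step.get(key)
            if label and label not in labels:
                labels.append(label)
    return labels


# A shortened stand-in for the full configuration shown above.
sample = {
    "dataset": {},
    "pipeline": [
        {"fields": ["body"], "step": "loader", "type": "squirro_query"},
        {"matching_label": "tax_rate1",
         "non_matching_label": "not_tax_rate1_tax_rate2",
         "step": "filter", "type": "proximity"},
        {"matching_label": "tax_rate2",
         "non_matching_label": "not_tax_rate1_tax_rate2",
         "step": "filter", "type": "proximity"},
    ],
}
problems = check_workflow_config(sample)
labels = proximity_labels(sample)
```

Comparing `set(labels)` with the `labels` argument passed at publish time is one way to catch a label that was renamed in the workflow but not in the publish call.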

Publish the model using client.ml_publish_model. The following example shows what the call might look like for the workflow configuration above:

client.ml_publish_model(
    project_id,
    published_as='Proximity Model Tax Rate',
    description='Proximity Model for Tax Rate v1',
    external_model=True,
    global_id='<UNIQUE_HASH>',
    location='<LOCATION_OF_ORIGIN>',
    labels=['tax_rate1', 'tax_rate2', 'not_tax_rate1_tax_rate2'],
    tagging_level='sentence',
    workflow_name='[PUB] prox config import',
    workflow_config=config,
)
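The guide leaves the value of `global_id` open (`<UNIQUE_HASH>`). One possible way to obtain a stable, unique string is to hash the workflow configuration itself, as sketched below. Whether Squirro expects a particular format for this value is not stated here, so treat the SHA-256 choice and the helper name `config_fingerprint` as assumptions.

```python
import hashlib
import json


def config_fingerprint(config):
    """Return a hex digest that depends only on the content of the config."""
    # Sorting keys makes the digest independent of dictionary ordering.
    canonical = json.dumps(config, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()
```

The result could be passed as `global_id=config_fingerprint(config)`. A content-derived value changes whenever the configuration changes, which may or may not be what you want for versioned models.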

Publish an Existing ML Workflow#

To publish an existing ML Workflow, retrieve its ID from ML Workflows under the AI STUDIO tab:

image1

Publish the model using client.ml_publish_model, passing the workflow_id instead of a workflow configuration:

client.ml_publish_model(
    project_id,
    published_as='Proximity Model Tax Rate',
    description='Proximity Model for Tax Rate v1',
    external_model=True,
    global_id='<UNIQUE_HASH>',
    location='<LOCATION_OF_ORIGIN>',
    labels=['tax_rate1', 'tax_rate2', 'not_tax_rate1_tax_rate2'],
    tagging_level='sentence',
    workflow_id='VLGRAEbLRZ2v5Uq_MPt77w',
)
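The two publishing paths differ only in whether `workflow_config` (with `workflow_name`) or `workflow_id` is supplied. A small wrapper can make that choice explicit. This is a sketch only: `build_publish_kwargs` is a hypothetical helper, and whether the client itself rejects a call that passes both arguments is not stated in this guide.

```python
def build_publish_kwargs(published_as, description, labels, tagging_level,
                         global_id, location,
                         workflow_config=None, workflow_name=None,
                         workflow_id=None):
    """Assemble keyword arguments for exactly one of the two publishing paths."""
    # Exactly one of workflow_config / workflow_id must be provided.
    if (workflow_config is None) == (workflow_id is None):
        raise ValueError("Pass exactly one of workflow_config or workflow_id")
    kwargs = {
        "published_as": published_as,
        "description": description,
        "external_model": True,
        "global_id": global_id,
        "location": location,
        "labels": list(labels),
        "tagging_level": tagging_level,
    }
    if workflow_id is not None:
        kwargs["workflow_id"] = workflow_id
    else:
        kwargs["workflow_config"] = workflow_config
        if workflow_name is not None:
            kwargs["workflow_name"] = workflow_name
    return kwargs
```

The result would be used as `client.ml_publish_model(project_id, **kwargs)`, with the same arguments as in the calls shown in this guide.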

Notes#

For document-level tagging, you must include the keywords.prediction field in the output_fields of the workflow so that the predictions are stored in a keyword.

See the example below:

{
    "dataset": {
        "infer": {
            "count": 10000,
            "query_string": "language:en"
        }
    },
    "pipeline": [
        {
            "fields": [
                "body"
            ],
            "step": "loader",
            "type": "squirro_query"
        },
        {
            "fields": [
                "body"
            ],
            "step": "filter",
            "type": "empty"
        },
        {
            "input_fields": [
                "body"
            ],
            "output_fields": [
                "clean_body"
            ],
            "step": "normalizer",
            "type": "html"
        },
        {
            "input_fields": [
                "clean_body"
            ],
            "output_fields": [
                "extract_sentences"
            ],
            "step": "tokenizer",
            "type": "sentences_nltk"
        },
        {
            "fields": [
                "extract_sentences"
            ],
            "step": "filter",
            "type": "doc_split"
        },
        {
            "input_fields": [
                "extract_sentences"
            ],
            "label_field": "",
            "output_field": "prediction",
            "step": "classifier",
            "type": "vadersentiment"
        },
        {
            "fields": [
                "extract_sentences",
                "prediction"
            ],
            "step": "filter",
            "type": "doc_join"
        },
        {
            "input_fields": [
                "prediction"
            ],
            "output_fields": [
                "keywords.prediction"
            ],
            "step": "filter",
            "type": "vote"
        },
        {
            "fields": [
                "keywords.prediction"
            ],
            "step": "saver",
            "type": "squirro_item"
        }
    ]
}
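The document-level requirement can be checked programmatically before publishing. The sketch below looks for `keywords.prediction` in any step's `output_fields` and, mirroring the example above, in the `fields` of a saver step if one is present. The helper `document_level_issues` is hypothetical, and the saver check is an assumption drawn from the example rather than a documented rule.

```python
def document_level_issues(config, field="keywords.prediction"):
    """Return a list of reasons why predictions might not reach a keyword."""
    pipeline = config.get("pipeline", [])
    issues = []
    # Some step has to write its result into the keyword field.
    if not any(field in step.get("output_fields", []) for step in pipeline):
        issues.append(f"no step lists '{field}' in output_fields")
    # Assumption: if the workflow has a saver step, it should save that field.
    savers = [step for step in pipeline if step.get("step") == "saver"]
    if savers and not any(field in step.get("fields", []) for step in savers):
        issues.append(f"no saver step lists '{field}' in fields")
    return issues
```

An empty list means the configuration follows the same pattern as the example. It does not guarantee that the workflow runs successfully on the server.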

When editing the published model step in the Pipeline Editor, you can then select the keyword that stores the predictions from the Label dropdown, as shown below:

image3