How to Create Custom Query-Processing Steps#

This page discusses the following:

  • How to write a custom libNLP step to extend the capabilities of the default query-processing workflow.

  • How to upload a new query-processing workflow to your Squirro project.

Reference: For more information about Squirro’s proprietary Natural Language Processing library, see the libNLP docs page.

Overview#

Each Squirro project is preconfigured with a default query-processing workflow containing multiple libNLP steps.

Python engineers and data scientists can create and upload custom libNLP steps to expand the workflow. This may include, but is not limited to, custom:

  • Boosters

  • Classifiers

  • Expanders

  • Language detectors

  • Modifiers

  • Parsers

Quick Summary#

To build and upload a custom libNLP query-processing step, perform the following:

  1. Verify prerequisites are installed and available.

  2. Create a new folder, placing the following inside:
    • The JSON config file.

    • Your new custom step Python file.

  3. Write your new custom step using the template.

  4. Locally test your files.

  5. Update the workflow within your JSON config file.

  6. Upload to Squirro.

Squirro Profiles#

  • Python engineers and data scientists build and upload custom libNLP steps.

  • Data scientists configure and optimize existing libNLP steps for project needs.

  • Project creators enable and disable libNLP steps for a given project.

Squirro Products#

Custom libNLP query-processing steps are used in Squirro’s Cognitive Search and Insight Engine products.

Prerequisites#

Local Build Requirements

To build a custom libNLP step, you will require local installations of the libNLP library and its dependencies. For example, to download the en_core_web_sm SpaCy model, use the following command:

python -m spacy download en_core_web_sm

Upload Requirements

To upload the workflow to your Squirro project, you’ll need the following authentication information handy:

  • token

  • cluster

  • project_id
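
To verify the model is available locally, you can load it once as a quick sanity check (a minimal sketch):

import spacy

# Sanity check: load the downloaded model and tokenize a sample query.
nlp = spacy.load("en_core_web_sm")
doc = nlp("main symptoms of flu vs covid")
print([token.text for token in doc])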

Workflow Structure and Templates#

Custom Step Template#

Create your custom libNLP query-processing step using the following template:

Reference: Custom Step Python Template
from squirro.lib.nlp.steps.batched_step import BatchedStep
from squirro.lib.nlp.document import Document


class MyStep(BatchedStep):
    """
    Every step has to provide docstring documentation within the docstrings.

    Parameters:
        step (str, "custom"): What kind of step (all custom steps are labeled 'custom')
        type (str, "classifier"): What category of step, classifier, tokenizer etc.
        name (str, "my_classifier"): Name as it is referenced
        path (str, "."): Path to step storage
    """
    def process_doc(self, doc: Document):
        # input & output data is accessed/written to the documents `fields` dictionary
        doc.fields["my_new_tag"] = "Test"

        # returned modified document
        return doc
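
To try the template quickly, you can instantiate the step and run it over a single document. The following is a minimal sketch that assumes the class above is saved as my_step.py:

from my_step import MyStep  # hypothetical module name for the template above
from squirro.lib.nlp.document import Document

# Instantiate the custom step with an empty configuration and process one document.
step = MyStep(config={})
doc = step.process_doc(Document(doc_id="", fields={}))
print(doc.fields["my_new_tag"])  # -> "Test"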

After the step has been created and tested, your workflow is then configured within config.json. The workflow can include any combination of default and custom steps.

Important: You must place your custom step file and the JSON configuration file in the same folder.

/custom_query_processing
    config.json
    my_query_classifier.py

Default Config File#

Use Squirro’s current default JSON config file as shown below:

Reference: Default JSON Query Processing Workflow
config.json#
{
    "cacheable": true,
    "dataset": {
        "items": []
    },
    "pipeline": [
        {
            "fields": ["query"],
            "step": "loader",
            "type": "squirro_item"
        },
        {
            "step": "app",
            "type": "query_processing",
            "name": "syntax_parser"
        },
        {
            "step": "app",
            "type": "query_processing",
            "name": "lang_detection",
            "fallback_language": "en"
        },
        {
            "step": "flow",
            "type": "condition",
            "condition": {
                "healthy_nlp_service": {
                    "service": "spacy",
                    "language": "*",
                    "worker": "*"
                }
            },
            "true_step": {
                "step": "external",
                "type": "remote_spacy",
                "name": "remote_spacy",
                "field_mapping": {
                    "user_terms_str": "nlp"
                },
                "disable_pipes__default": ["merge_noun_chunks"]
            },
            "false_step": {
                "step": "app",
                "type": "query_processing",
                "name": "custom_spacy_normalizer",
                "model_cache_expiration": 345600,
                "infix_split_hyphen": false,
                "infix_split_chars": ":<>=",
                "merge_noun_chunks": false,
                "merge_phrases": true,
                "merge_entities": true,
                "fallback_language": "en",
                "exclude_spacy_pipes": [],
                "spacy_model_mapping": {
                    "en": "en_core_web_sm",
                    "de": "de_core_news_sm"
                }
            }
        },
        {
            "step": "app",
            "type": "query_processing",
            "name": "pos_booster",
            "phrase_proximity_distance": 10,
            "min_query_length": 2,
            "pos_weight_map": {
                "PROPN": "-",
                "NOUN": "-",
                "VERB": "-",
                "ADV": "-",
                "CCONJ": "-",
                "ADP": "-",
                "ADJ": "-",
                "X": "-",
                "NUM": "-",
                "SYM": "-"
            }
        },
        {
            "step": "app",
            "type": "query_processing",
            "name": "lemma_tagger"
        },
        {
            "step": "app",
            "type": "query_processing",
            "name": "query_classifier",
            "model": "svm-query-classifier"
        },
        {
            "step": "app",
            "type": "query_processing",
            "name": "query_modifier"
        },
        {
            "step": "debugger",
            "type": "log_fields",
            "fields": [
                "user_terms",
                "facet_filters",
                "pos_mutations",
                "type",
                "enriched_query",
                "lemma_map"
            ],
            "log_level": "info"
        }
    ]
}

All default query-processing steps share the following identifiers:

  • step: app

  • type: query_processing
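
For contrast, a custom step is referenced by its own identifiers. Both entries below are excerpted from the configurations on this page; only the identifiers differ:

# A default step entry:
{
    "step": "app",
    "type": "query_processing",
    "name": "lemma_tagger"
}

# A custom step entry:
{
    "step": "custom",
    "type": "classifier",
    "name": "my_query_classifier"
}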

Example Custom Classifier Step Creation#

The example discussed in this section shows how to add a custom query classifier. The classifier allows users to search within a smaller, filtered subset of project data.

Squirro refers to this as inferred faceted search, which improves the overall search experience by returning documents that share the same topic as the user query.

In this section, you will find the following:

  1. Example parameters.

  2. An example Python custom classifier step.

  3. An example JSON custom step configuration.

  4. An example config.json showing the updated workflow with the new classifier step.

The following are example classifier parameters:

Input

  • Query-processing input: main symptoms of flu vs covid

Output

  • Classified label: topic:"health care"

  • Resulting query: (main symptoms of flu vs covid) AND (topic:"health care")
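
Conceptually, the built-in query_modifier step later combines the original query with the classified label. The following is a simplified sketch of that rewrite, not the actual implementation:

# Simplified illustration of the downstream query rewrite performed by the
# built-in `query_modifier` step; the real step is part of libNLP.
original_query = "main symptoms of flu vs covid"
classified_label = 'topic:"health care"'

enriched_query = f"({original_query}) AND ({classified_label})"
print(enriched_query)  # -> (main symptoms of flu vs covid) AND (topic:"health care")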

The following is an example custom classifier step:

Reference: Example Custom Classifier Step
import functools
import logging

from squirro.lib.nlp.steps.batched_step import BatchedStep
from squirro.lib.nlp.document import Document
from squirro.lib.nlp.utils.cache import CacheDocument

from squirro.common.profiler import SlowLog

from transformers import Pipeline as ZeroShotClassifier
from transformers import pipeline


class MyClassifier(BatchedStep):
    """
    Classify query into predefined classes using zero-shot-classification.


    Parameters:
        input_field (str, "user_terms_str"): raw user query strings
        model (str, "valhalla/distilbart-mnli-12-1"): zero shot classification to use
        target_facet (str): Target squirro-label used for faceted search
        target_classes (list, ["stocks", "sport", "music"]): Possible classes
        output_field (str, "my_classified_topic"): new facet filters to append to the query
        confidence_threshold (float, 0.3): Use classified labels only if model predicted it with high enough confidence
        step (str, "custom"): my classifier
        type (str, "classifier"): my classifier
        name (str, "my_classifier"): my classifier
        path (str, "."): my classifier
    """

    def quote_facet_name(self, label):
        if len(label.split()) > 1:
            label = f'"{label}"'
        return label

    @CacheDocument
    @SlowLog(logger=logging.info, suffix="0-shot-classifier", threshold=100)
    def process_doc(self, doc: Document):
        try:
            classifier: ZeroShotClassifier = self.model_cache.get_and_save_model(
                self.model,
                functools.partial(
                    pipeline, task="zero-shot-classification", model=self.model
                ),
            )
        except Exception:
            logging.exception("Huggingface pipeline crashed")
            # make sure that aborted tasks are not used for caching
            return doc.abort_processing()

        query = doc.fields.get(self.input_field)
        predictions = classifier(query, self.target_classes)
        value = predictions["labels"][0]
        score = predictions["scores"][0]

        if score > self.confidence_threshold:
            doc.fields[
                self.output_field
            ] = f"{self.target_facet}:{self.quote_facet_name(value)}"
        return doc

The following is an example custom step configuration:

Reference: Custom Step Configuration
# 1) Custom step that writes new metadata: `my_classified_topic`
{
    "step": "custom",
    "type": "classifier",
    "name": "my_query_classifier",
    "model": "valhalla/distilbart-mnli-12-1",
    "target_facet": "topic",
    "target_classes": [
        "login tutorial", "sports", "health care",
        "merge and acquisition", "stock market"
    ],
    "output_field": "my_classified_topic"
},

# 2) The built-in `query_modifier` step rewrites the original query based on metadata added in prior steps in the pipeline
#    -> like: `query = f"{original_query} AND {my_classified_topic}"`
{
    "step": "app",
    "type": "query_processing",
    "name": "query_modifier",
    "term_mutations_metadata": [
              "pos_mutations",
              "my_classified_topic"
    ]
}

The following is an example config.json with the custom step included:

Reference: Example config.json File
{
    "cacheable": true,
    "pipeline": [
        {
            "fields": ["query"],
            "step": "loader",
            "type": "squirro_item"
        },
        {
            "step": "app",
            "type": "query_processing",
            "name": "syntax_parser"
        },
        {
            "step": "app",
            "type": "query_processing",
            "name": "lang_detection",
            "fallback_language": "en"
        },
        {
            "step": "app",
            "type": "query_processing",
            "name": "custom_spacy_normalizer",
            "cache_document": true,
            "infix_split_hyphen": false,
            "infix_split_chars": ":<>=",
            "merge_noun_chunks": false,
            "merge_phrases": true,
            "merge_entities": true,
            "fallback_language": "en",
            "exclude_spacy_pipes": [],
            "spacy_model_mapping": {
                "en": "en_core_web_sm",
                "de": "de_core_news_sm"
            }
        },
        {
            "step": "app",
            "type": "query_processing",
            "name": "pos_booster",
            "phrase_proximity_distance": 15,
            "pos_weight_map": {
                "PROPN": 10,
                "NOUN": 10,
                "VERB": 2,
                "ADJ": 5,
                "X": "-",
                "NUM": "-",
                "SYM": "-"
            }
        },
        {
            "step": "custom",
            "type": "classifier",
            "name": "my_query_classifier",
            "model": "valhalla/distilbart-mnli-12-1",
            "target_facet": "topic",
            "target_classes": [
                "login tutorial", "sports", "health care",
                "merge and acquisition", "stock market"
            ],
            "output_field": "my_classified_topic"
        },
        {
            "step": "app",
            "type": "query_processing",
            "name": "query_modifier",
            "term_mutations_metadata": [
                "pos_mutations",
                "my_classified_topic"
            ]
        },
        {
            "step": "debugger",
            "type": "log_fields",
            "fields": [
                "user_terms", "facet_filters", "pos_mutations",
                "type", "enriched_query", "my_classified_topic"
            ],
            "log_level": "info"
        }
    ]
}

Local Testing#

You should test your newly created step locally during development. To do so, perform the following:

  1. Instantiate your step class.

  2. Provide a squirro.lib.nlp.document.Document together with the configuration for the step you want to test.

Example Content for Baseline Testing#

You can use the following example content, saved as test_my_classifier.py, to perform a simple baseline test:

from my_query_classifier import MyClassifier
from squirro.lib.nlp.document import Document

if __name__ == "__main__":
    # Documents are tagged with facet called `topic`
    target_facet = "topic"
    # The facet `topic` can be one of the following values from `target_classes`
    target_classes = ['login tutorial', 'sports', 'health care', 'merge and acquisition', 'stock market']

    # Instantiate custom classifier step
    step = MyClassifier(config={
        "target_facet": "topic",
        "target_classes": target_classes,
    })

    # Setup simple test cases
    queries = [
        "how to connect to wlan",
        "elon musk buys shares at twitter",
        "main symptoms of flu vs covid"
    ]

    for query in queries:
        doc = Document(doc_id="", fields={"user_terms_str": query})
        step.process_doc(doc)
        print("=================")
        print(f"Classified Query")
        print(f"\tQuery:\t{query}")
        print(f"\tLabel:\t{doc.fields.get('facet_filters')}")

Demo Output#

The following is the demo output of test_my_classifier.py:

$ python test_my_classifier.py

=================
Query Classified
        Query:  'how to connect to wlan'
        Label:  'topic:"login tutorial"'
=================
Query Classified
        Query:  'elon musk buys shares at twitter'
        Label:  'topic:"stock market"'
=================
Query Classified
        Query:  'main symptoms of flu vs covid'
        Label:  'topic:"health care"'
=================

Uploading#

Perform the following steps to upload to your Squirro project:

  1. Upload the workflow using the following upload_workflow.py script:

Reference: upload_workflow.py
import argparse
import json
from pathlib import Path

from squirro_client import SquirroClient

if __name__ == "__main__":
    parser = argparse.ArgumentParser(
        formatter_class=argparse.ArgumentDefaultsHelpFormatter
    )
    parser.add_argument(
        "--cluster", required=False, help="Squirro API", default="http://localhost:80"
    )
    parser.add_argument("--project-id", required=True, help="Squirro project ID")
    parser.add_argument("--token", required=True, help="Api Token")
    parser.add_argument(
        "--config", default="config.json", help="Path to workflow configuration"
    )
    parser.add_argument(
        "--custom-steps", default=".", help="Path to custom step implementation"
    )

    args = parser.parse_args()

    client = SquirroClient(None, None, cluster=args.cluster)
    client.authenticate(refresh_token=args.token)
    config = json.load(open(args.config))
    client.new_machinelearning_workflow(
        project_id=args.project_id,
        name=config.get("name", "Uploaded Ml-Workflow"),
        config=config,
        ml_models=str(Path(args.custom_steps).absolute()) + "/",
        type="query"
    )
  2. Execute the script from the folder containing your workflow, or provide the correct path to your custom steps:

python upload_workflow.py --cluster=$cluster \
            --project-id=$project_id \
            --token=$token \
            --config=config.json \
            --custom-steps="."
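
To confirm the upload succeeded, you can list the project's ML workflows. The sketch below assumes the SquirroClient getter get_machinelearning_workflows; check the SquirroClient reference for the exact method:

from squirro_client import SquirroClient

# Assumed verification step: list ML workflows and look for the upload.
client = SquirroClient(None, None, cluster="<cluster>")
client.authenticate(refresh_token="<token>")
workflows = client.get_machinelearning_workflows(project_id="<project-id>")
for workflow in workflows:
    print(workflow)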

Enabling Your Custom Workflow#

Your project creator can now enable your custom workflow from the project dashboard by completing the following steps:

  1. Open the project in your browser.

  2. Navigate to the AI Studio tab.

  3. Click ML Workflows from the sidebar menu.

  4. Hover over your step and click Set Active.

[Screenshot: ML Dashboard within the Squirro Platform]

Note: You can modify the configuration on the uploaded steps by hovering over the step and clicking Edit.

Troubleshooting and FAQ#

Q1: How is the workflow executed?

Currently, the workflow is integrated into a Squirro application via the natural language understanding plugin.

The search bar first reaches out to the natural language query plugin’s /parse endpoint, which triggers the configured query-processing workflow.
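
For illustration, a request to that endpoint could look like the sketch below. The URL path, parameters, and authentication are assumptions for this example, not the documented API; see the Query Processing documentation for the real contract:

import requests

# Hypothetical request: the endpoint path and parameter names are assumptions.
CLUSTER = "https://squirro.example.com"
TOKEN = "<your-api-token>"

response = requests.get(
    f"{CLUSTER}/parse",  # placeholder path to the plugin's /parse endpoint
    params={"query": "how to connect to wlan"},
    headers={"Authorization": f"Bearer {TOKEN}"},
    timeout=10,
)
print(response.json())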

Reference: Valid Query Processing API Response
{
    "original_query":"how to connect to wlan",
    "language":[
        "en"
    ],
    "type":[
        "question_or_statement"
    ],

    "query":"connect^5 wlan^10",
    "user_terms":[
        "how",
        "to",
        "connect",
        "to",
        "wlan"
    ],
    "facet_filters":[
    ],
    "my_classified_topic":['topic:"login tutorial"']
}

For further information, see Query Processing.

Q2: Where can I find query processing logs?

The machinelearning service runs the configured query-processing workflow end to end and logs debugging and detailed error data.

The pipeline itself logs enriched metadata as configured via the log_fields debugger step, as follows:

{
    "step": "debugger",
    "type": "log_fields",
    "fields": [
        "user_terms",
        "type",
        "enriched_query",
        "my_classified_topic"  # appended by our classifier
    ],
    "log_level": "info"
}