Query Processing#

Profile: Search Engineer

This page provides an overview of query-processing workflows in the Squirro platform.

Query-processing workflows are modified by search engineers to fine-tune the search experience for users.

Overview#

Query processing improves a user’s search experience by providing more relevant search results.

Squirro achieves this improvement by running each user query through a customizable query-processing workflow that parses, filters, enriches, and expands queries before performing the actual search and presenting the search results to the user.

Example: Part of speech (POS) boosting and filtering removes irrelevant terms like conjunctions from the query and gives more weight to relevant parts of the query, like nouns. The returned search results rank items that match boosted query terms higher.

Query-Processing Flow Diagram#

The figure below illustrates how query processing fits into Squirro’s overall architecture.

Overview of Squirro Query-Processing Workflow

In the example above, the user enters the query `country:us 2020-10 covid cases in new york` in the global search bar.

Squirro then sends the query through the Query Understanding Plugin (1) to the ML-Service where the query-processing workflow (a Squirro ML-Workflow) executes and applies the following steps on the incoming query:

Language detection.
Language-specific spaCy analysis. This uses the pre-trained spaCy language model (see example) for the detected language. The analysis includes:
- Tokenization and lemmatization
- POS tagging
- Named Entity Recognition (NER)
POS booster and filter.
Query modifier. The final query modifier step applies all modifications to the initial query to produce the Enriched Query (2) which is then used to retrieve the candidate documents that best match the query from the Elasticsearch index (3).

Query processing and rewriting improve the search experience by ranking items that match boosted terms higher and by reducing the number of irrelevant search results for the query. They achieve this by combining terms that belong together. Entities like “New York” are treated as a unit in the query, which prevents multipage items (e.g., PDFs) that have “new” on one page and “york” on a different page from being matched and appearing in the search results.
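As a rough illustration of how terms that belong together can be combined, the sketch below rewrites an entity span into the `(terms OR "terms"~10)` form used in this page's rewrite example. The function name `rewrite_entities` and the token-offset representation of entities are assumptions made for this example; they are not part of the Squirro API.

```python
from typing import List, Tuple


def rewrite_entities(
    tokens: List[str], entities: List[Tuple[int, int]], slop: int = 10
) -> str:
    """Rewrite each entity span as (terms OR "terms"~slop); keep other tokens.

    `entities` holds (start, end) token offsets with an exclusive end and is
    assumed to be non-overlapping. Hypothetical helper for illustration only.
    """
    spans = {}
    for start, end in entities:
        if end <= start:
            raise ValueError("entity spans must cover at least one token")
        spans[start] = end

    parts = []
    i = 0
    while i < len(tokens):
        if i in spans:
            end = spans[i]
            phrase = " ".join(tokens[i:end])
            # Keep the loose terms and add a proximity phrase query for them.
            parts.append(f'({phrase} OR "{phrase}"~{slop})')
            i = end
        else:
            parts.append(tokens[i])
            i += 1
    return " ".join(parts)


print(rewrite_entities(["cases", "in", "new", "york"], [(2, 4)]))
```

In the real workflow, the entity spans come from the spaCy NER step rather than from hand-written offsets.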

Configuration#

Each project is pre-configured with a default query-processing workflow. The workflow is installed on the server as a global asset and cannot be deleted via the user interface.

The query-processing workflow is enabled by default.

You can manage the behavior of the workflow in the project configuration under the Settings tab.

Name	Value	Description
`topic.search.query-workflow-enabled`	`false`	Disable query processing.
`topic.search.query-workflow-enabled`	`true`	Enable query processing.
`topic.search.query-workflow`	`${workflow_id}`	Set the value to the `workflow_id` of the ML-workflow you want to use for query processing. By default, the `workflow_id` is set to the ID of the pre-configured workflow that is set up upon project creation.
`topic.search.query-workflow-mode`	`always`	Execute workflow for every request to the `/query` endpoint. This mode is useful when Squirro is used as an API only.
`topic.search.query-workflow-mode`	`global`	Execute the query-processing workflow once for the whole dashboard (triggered via the Global Search Bar widget).
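The interaction between the enabled flag and the two modes can be summarized in a few lines of Python. This is only a sketch of the behavior described in the table, not Squirro's implementation; the function `should_run_workflow` and its `from_global_search_bar` parameter are hypothetical.

```python
from typing import Mapping


def should_run_workflow(
    settings: Mapping[str, object], from_global_search_bar: bool
) -> bool:
    """Decide whether query processing runs for a request (illustrative only)."""
    # The workflow is enabled by default, so a missing key counts as enabled.
    if not settings.get("topic.search.query-workflow-enabled", True):
        return False

    mode = settings.get("topic.search.query-workflow-mode")
    if mode == "always":
        # Every request to the /query endpoint is processed.
        return True
    if mode == "global":
        # Only requests triggered by the Global Search Bar widget are processed.
        return from_global_search_bar
    raise ValueError(f"unknown query-workflow mode: {mode!r}")


api_only = {"topic.search.query-workflow-mode": "always"}
print(should_run_workflow(api_only, from_global_search_bar=False))
```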

Workflow Management#

Configuring Available Workflows#

You can configure the available workflows under AI Studio > ML Workflows.

Every project has a default query-processing workflow. This default workflow is read-only and cannot be deleted or modified. The Machine-Learning (ML) Service manages this workflow and automatically updates it to the latest version.

The default query-processing workflow is set as the Active Query Processor and is listed along with any other custom workflow, as shown in the screen capture below:

If you want to customize the behavior of the default query-processing workflow, perform the following steps:

Caution: Following these steps will make the workflow you select the active query processor.

Clone the workflow.
Edit its configuration.
Hover over the newly created workflow and click Set Active.

The screen capture below shows how Set Active becomes visible on hover:

Setting the Active Query Workflow in the Squirro Dashboard

Disabling the Default Query-Processing Workflow#

The default query-processing workflow cannot be deleted, but it can be disabled. To disable query processing, perform the following steps:

Navigate to Settings > Project Configuration.
Change the topic.search.query-workflow-enabled option by clicking Edit.
Uncheck the checkbox.

Query-Processing Workflow Steps#

For information on creating custom query-processing workflow steps, see How to Create Custom Query-Processing Steps.

The default query-processing workflow uses the following built-in libNLP app.query_processing steps:

Reference: Pre-configured Query Processing Pipeline Steps

config.json#

{
    "cacheable": true,
    "dataset": {
        "items": []
    },
    "pipeline": [
        {
            "fields": ["query"],
            "step": "loader",
            "type": "squirro_item"
        },
        {
            "step": "app",
            "type": "query_processing",
            "name": "syntax_parser"
        },
        {
            "step": "app",
            "type": "query_processing",
            "name": "lang_detection",
            "fallback_language": "en"
        },
        {
            "step": "flow",
            "type": "condition",
            "condition": {
                "healthy_nlp_service": {
                    "service": "spacy",
                    "language": "*",
                    "worker": "*"
                }
            },
            "true_step": {
                "step": "external",
                "type": "remote_spacy",
                "name": "remote_spacy",
                "field_mapping": {
                    "user_terms_str": "nlp"
                },
                "disable_pipes__default": ["merge_noun_chunks"]
            },
            "false_step": {
                "step": "app",
                "type": "query_processing",
                "name": "custom_spacy_normalizer",
                "model_cache_expiration": 345600,
                "infix_split_hyphen": false,
                "infix_split_chars": ":<>=",
                "merge_noun_chunks": false,
                "merge_phrases": true,
                "merge_entities": true,
                "fallback_language": "en",
                "exclude_spacy_pipes": [],
                "spacy_model_mapping": {
                    "en": "en_core_web_sm",
                    "de": "de_core_news_sm"
                }
            }
        },
        {
            "step": "app",
            "type": "query_processing",
            "name": "pos_booster",
            "phrase_proximity_distance": 10,
            "min_query_length": 2,
            "pos_weight_map": {
                "PROPN": "-",
                "NOUN": "-",
                "VERB": "-",
                "ADV": "-",
                "CCONJ": "-",
                "ADP": "-",
                "ADJ": "-",
                "X": "-",
                "NUM": "-",
                "SYM": "-"
            }
        },
        {
            "step": "app",
            "type": "query_processing",
            "name": "lemma_tagger"
        },
        {
            "step": "app",
            "type": "query_processing",
            "name": "query_classifier",
            "model": "svm-query-classifier"
        },
        {
            "date_match_on_facet": "item_created_at",
            "date_match_rewrite_mode": "boost_query",
            "label_lookup_match_ngram_field": true,
            "label_lookup_fuzzy": true,
            "label_lookup_prefix_queries": false,
            "label_lookup_most_common_rescoring": true,
            "label_match_category_weights": {
            },
            "label_lookup_rescore_parameters": {
            },
            "label_match_rewrite_mode": "boost_query",
            "match_entity_phrase_slop": 1,
            "name": "intent_detector",
            "step": "app",
            "type": "query_processing"
        },
        {
            "step": "app",
            "type": "query_processing",
            "name": "query_modifier"
        },
        {
            "step": "debugger",
            "type": "log_fields",
            "fields": [
                "user_terms",
                "facet_filters",
                "pos_mutations",
                "type",
                "enriched_query",
                "lemma_map"
            ],
            "log_level": "info"
        }
    ]
}
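When reviewing a cloned workflow, it can help to print the order of its steps before editing anything. The sketch below does this for a config with the same shape as the reference above. The helper `step_names` and the trimmed `EXAMPLE` config are assumptions made for illustration and are not shipped with Squirro.

```python
import json
from typing import Any, Dict, List


def step_names(config: Dict[str, Any]) -> List[str]:
    """Return a readable label for each pipeline step, in execution order."""
    names = []
    for step in config.get("pipeline", []):
        if step.get("step") == "flow" and step.get("type") == "condition":
            # A condition step wraps two alternative steps; show both branches.
            branches = [
                step[key].get("name", step[key].get("type", "?"))
                for key in ("true_step", "false_step")
                if key in step
            ]
            names.append("condition(" + " | ".join(branches) + ")")
        else:
            # Steps without a name (loader, debugger) fall back to their type.
            names.append(step.get("name", step.get("type", "?")))
    return names


# Trimmed, hypothetical config that mirrors the structure of config.json.
EXAMPLE = json.loads("""
{
  "pipeline": [
    {"step": "loader", "type": "squirro_item", "fields": ["query"]},
    {"step": "app", "type": "query_processing", "name": "syntax_parser"},
    {"step": "flow", "type": "condition",
     "true_step": {"step": "external", "type": "remote_spacy", "name": "remote_spacy"},
     "false_step": {"step": "app", "type": "query_processing", "name": "custom_spacy_normalizer"}},
    {"step": "app", "type": "query_processing", "name": "query_modifier"}
  ]
}
""")

for name in step_names(EXAMPLE):
    print(name)
```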

The workflow is set to:

Parse Squirro query syntax and detect the query language based on the available natural-language terms.
Perform named entity recognition (NER). The entity compound is then rewritten into an additional phrase query.

Example: `cases in new york` is rewritten as `cases in (new york OR "new york"~10)`
Boost important terms based on their POS tags. (For more information, see Universal POS tags).

Example: Boost nouns (tags NOUN and PROPN) by assigning higher weights in the pos_weight_map.

Example: Reduce the impact of verbs (VERB) by assigning lower weights.
Remove terms like determiners and conjunctions from the query.
Perform query classification: question_or_statement vs keyword.
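The boosting and filtering steps listed above can be sketched in a few lines. The example assumes the terms are already POS-tagged, uses made-up numeric weights, and writes boosts in Lucene-style `term^weight` notation. None of these choices is taken from the default workflow, whose `pos_weight_map` values are shown in the reference config.

```python
from typing import Dict, List, Set, Tuple

# Hypothetical weights: boost nouns and proper nouns, reduce the impact of verbs.
POS_WEIGHTS: Dict[str, float] = {"PROPN": 3, "NOUN": 2, "VERB": 0.5}
# Hypothetical filter list: determiners and conjunctions are removed.
DROPPED_POS: Set[str] = {"DET", "CCONJ"}


def boost_terms(
    tagged: List[Tuple[str, str]],
    weights: Dict[str, float] = POS_WEIGHTS,
    dropped: Set[str] = DROPPED_POS,
) -> str:
    """Drop filtered POS tags and append a boost to every weighted term."""
    parts = []
    for term, pos in tagged:
        if pos in dropped:
            continue
        weight = weights.get(pos)
        if weight is None or weight == 1:
            # Terms without a configured weight stay unchanged.
            parts.append(term)
        else:
            parts.append(f"{term}^{weight}")
    return " ".join(parts)


tagged_query = [
    ("the", "DET"),
    ("cases", "NOUN"),
    ("and", "CCONJ"),
    ("deaths", "NOUN"),
    ("rise", "VERB"),
]
print(boost_terms(tagged_query))
```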

Reminder: You can configure the steps of the query-processing workflow in AI Studio.

Lemmatized Search#

To return better and more relevant search results, Squirro uses lemmatized search as a step within the default query-processing workflow.

Lemmatized search is a more advanced alternative to stemming, a technique that focuses on explicit word roots.

Lemmatized Search Explained#

Lemmatization aims to reduce multiple similar words to common root forms. As opposed to more basic reduction techniques, such as stemming, lemmatization considers more than the physical word itself.

Rather than removing prefixes and suffixes to identify a stem, lemmatization looks at the word itself combined with the context of the words around it to identify a lemma tied to a dictionary definition.

Example: If the words good and better appear in your documents, stemming or simple keyword matching won’t relate the two words. Lemmatization, however, would relate the two words because it would treat good as the lemma of better.
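The difference can be made concrete with a toy comparison. Both functions below are deliberately simplistic stand-ins written for this example: `naive_stem` only strips a few suffixes, and `toy_lemma` looks words up in a tiny hand-written dictionary, whereas a real lemmatizer such as spaCy's also uses the surrounding context.

```python
def naive_stem(word: str) -> str:
    """Strip one common English suffix; a crude stand-in for a stemmer."""
    for suffix in ("ing", "er", "ed", "s"):
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)]
    return word


# Toy lemma dictionary; a real lemmatizer covers the whole vocabulary.
LEMMAS = {"better": "good", "best": "good", "ran": "run", "mice": "mouse"}


def toy_lemma(word: str) -> str:
    """Return the dictionary form of a word if it is known."""
    return LEMMAS.get(word, word)


for word in ("good", "better"):
    print(word, "stem:", naive_stem(word), "lemma:", toy_lemma(word))
```

The stems of good and better differ, so a stem-based match fails, while both words share the lemma good.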

Pros versus Cons#

Generally speaking, lemmatization provides better, more relevant results than stemming or using simple keyword matching.

However, performing lemmatized search requires some additional query-processing time and resources.

Disabling Lemmatization#

For projects that do not require anything beyond straightforward keyword matching, or projects where minimizing processing time is a priority, you may want to consider disabling lemmatized search.

If you wish to disable lemmatized search, follow the steps below:

Log in to your Squirro project.
Navigate to Setup > AI Studio.
Click ML Workflows in the left menu.
Hover over your project’s active query processor and click Edit.
In the configuration code, locate the exclude_spacy_pipes setting and add "lemmatizer" to its list. (This prevents the nlp-analysis step from computing lemmas.)

Delete the lemma_tagger step as shown in the image below:

Click Save.
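If you keep workflow configurations in version control, the same two changes can be applied to the config programmatically. This is a minimal sketch that assumes the config has the shape of the reference config.json; the function `disable_lemmatization` is hypothetical and not part of any Squirro tooling.

```python
import copy
from typing import Any, Dict


def disable_lemmatization(config: Dict[str, Any]) -> Dict[str, Any]:
    """Return a copy of the config with lemmatization switched off."""
    updated = copy.deepcopy(config)

    def exclude_lemmatizer(node: Any) -> None:
        # exclude_spacy_pipes may sit inside a nested step (e.g. false_step),
        # so walk the whole structure instead of only the top-level steps.
        if isinstance(node, dict):
            pipes = node.get("exclude_spacy_pipes")
            if isinstance(pipes, list) and "lemmatizer" not in pipes:
                pipes.append("lemmatizer")
            for value in node.values():
                exclude_lemmatizer(value)
        elif isinstance(node, list):
            for value in node:
                exclude_lemmatizer(value)

    exclude_lemmatizer(updated)
    # Remove the lemma_tagger step from the pipeline.
    updated["pipeline"] = [
        step for step in updated.get("pipeline", [])
        if step.get("name") != "lemma_tagger"
    ]
    return updated


example = {
    "pipeline": [
        {"step": "flow", "type": "condition",
         "false_step": {"name": "custom_spacy_normalizer", "exclude_spacy_pipes": []}},
        {"step": "app", "type": "query_processing", "name": "lemma_tagger"},
        {"step": "app", "type": "query_processing", "name": "query_modifier"},
    ]
}
print(disable_lemmatization(example))
```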

Creating Custom Query-Processing Steps#

See How to Create Custom Query-Processing Steps for instructions on how to create custom query-processing steps.

How to Install a spaCy Language Model#

You can install additional language models from the available spaCy models.

To install Japanese, for example, you would run the following command:

python -m spacy download ja_core_news_sm
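Models installed with the download command are regular Python packages, so you can check whether one is available in the current environment without loading it. The helper name `is_model_installed` is an assumption made for this example.

```python
import importlib.util


def is_model_installed(model_name: str) -> bool:
    """Return True if a package with this name can be imported."""
    return importlib.util.find_spec(model_name) is not None


print(is_model_installed("ja_core_news_sm"))
```

Run the check with the same Python interpreter that the query-processing workflow uses, since each environment has its own set of installed packages.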