Significant Terms Extraction in Squirro#

This section provides information on how the Significant Terms Extractor works and how it can be used inside a Squirro project.

Extracting Significant Terms#

Given a set of documents, the goal of the Significant Terms Extractor is to identify the words and phrases within the text that best describe each document.

It is an unsupervised technique that adapts itself at every call to keep its model up to date.

The extracted terms can then be used in various use cases:

  • Fast Look: From a group of documents, generate a word cloud and immediately see what the core topics are.

  • Clustering: Group documents that share the same set of significant terms and rank them by distance.

  • Sentiment Analysis: Show the negative and positive terms in the corpus.
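As a minimal illustration of the Fast Look use case, the per-document term lists (the sample data below is hypothetical) can be aggregated into frequency counts that feed a word cloud:

```python
from collections import Counter

# Hypothetical per-document significant-term lists, as produced in an
# output field such as keywords.terms.
doc_terms = [
    ["interest rate", "central bank", "inflation"],
    ["central bank", "monetary policy", "inflation"],
    ["earnings", "central bank"],
]

# Aggregate term frequencies across the document set; the counts can be
# used directly as word-cloud weights.
counts = Counter(term for terms in doc_terms for term in terms)
print(counts.most_common(3))
```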

How to use Significant Terms in Squirro#

Example workflow:

{
    "dataset": {
        "infer": {
            "count": 1000,
            "query_string": "NOT ml_terms_version:1"
        },
        "train": {
            "count": 2500,
            "query_string": "*"
        }
    },
    "pipeline": [
        {
            "fields": [
                "body",
                "keywords.ml_terms_version",
                "keywords.terms"
            ],
            "batch_size": 100,
            "step": "loader",
            "type": "squirro_query"
        },
        {
            "fields": [
                "body"
            ],
            "step": "filter",
            "type": "empty"
        },
        {
            "fields": [
                "keywords.terms"
            ],
            "step": "filter",
            "type": "clear"
        },
        {
            "add_lemmization": false,
            "cache_lemmas": false,
            "filter_list": [],
            "input_field": [
                "body"
            ],
            "max_chunk_len": 2,
            "min_word_len": 4,
            "output_field": "keywords.terms",
            "p_significant_terms": 0.2,
            "save_model": true,
            "step": "embedder",
            "type": "terms_extraction"
        },
        {
            "fields": [
                "keywords.terms"
            ],
            "step": "saver",
            "tracking_facet_name": "ml_terms_version",
            "tracking_facet_value": "1",
            "type": "squirro_item"
        }
    ]
}
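The extraction logic itself lives in libNLP. As a rough sketch of the underlying idea only (not Squirro's actual algorithm), significant terms can be selected by comparing a document's term frequencies against a background model and keeping the top fraction, analogous to the p_significant_terms parameter:

```python
from collections import Counter

def significant_terms(doc_tokens, background_counts, total_background, p=0.2):
    """Toy significance score: ratio of in-document frequency to
    (smoothed) background frequency; keep the top fraction `p` of terms.
    Illustrative only, not Squirro's actual implementation."""
    doc_counts = Counter(doc_tokens)
    scores = {}
    for term, count in doc_counts.items():
        fg = count / len(doc_tokens)
        bg = (background_counts.get(term, 0) + 1) / (total_background + 1)
        scores[term] = fg / bg
    ranked = sorted(scores, key=scores.get, reverse=True)
    keep = max(1, int(len(ranked) * p))
    return ranked[:keep]

# Rare terms relative to the background rank highest.
background = Counter({"the": 1000, "market": 50, "inflation": 5})
doc = ["the", "inflation", "report", "the", "market"]
print(significant_terms(doc, background, total_background=1055))
```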

The parameters of the step are:

{
    "add_lemmization": false, # apply lemmatization to the words
    "cache_lemmas": false, # cache the lemmas to improve performance
    "filter_list": [], # list of terms to exclude from the output
    "input_field": [ "body"], # field(s) to read the input text from
    "max_chunk_len": 2, # maximum number of words in a chunk
    "min_word_len": 4, # minimum number of characters per word
    "output_field": "terms", # field to store the output list in
    "p_significant_terms": 0.2,  # fraction of terms to keep as significant
    "save_model": true, # save the generated model
    "step": "embedder",
    "type": "terms_extraction"
},

The input must be plain text. HTML cleaning and word splitting are performed by the embedder itself, which improves parallelization and makes the step scalable.

The output is a list of significant words per document.
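To illustrate how max_chunk_len and min_word_len constrain candidate terms, here is a hedged sketch (not the embedder's actual code) that splits cleaned text into chunks of up to two words, discarding words shorter than four characters:

```python
import re

def candidate_chunks(text, max_chunk_len=2, min_word_len=4):
    """Generate candidate term chunks of 1..max_chunk_len words,
    keeping only words with at least min_word_len characters.
    Illustrative only; the real embedder also handles HTML cleaning."""
    words = [w for w in re.findall(r"[a-zA-Z]+", text.lower())
             if len(w) >= min_word_len]
    chunks = []
    for size in range(1, max_chunk_len + 1):
        for i in range(len(words) - size + 1):
            chunks.append(" ".join(words[i:i + size]))
    return chunks

# Short stop-word-like tokens ("the", "set", "new") are filtered out.
print(candidate_chunks("The central bank set new interest rates"))
```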

The lemmatization process reduces words to their primitive forms (e.g., better becomes good). It is very useful, but it also impacts overall performance.
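A minimal sketch of lemmatization with caching, in the spirit of the cache_lemmas option; the tiny exception table below is a hand-written stand-in for a real lemmatizer such as WordNet's:

```python
from functools import lru_cache

# Hand-written exception table standing in for a real lemmatizer;
# purely illustrative.
IRREGULAR = {"better": "good", "ran": "run", "mice": "mouse"}

@lru_cache(maxsize=None)  # cache results, as cache_lemmas would
def lemmatize(word):
    """Reduce a word to a primitive form via lookup, with a crude
    fallback that strips a plural 's'."""
    if word in IRREGULAR:
        return IRREGULAR[word]
    if word.endswith("s") and len(word) > 3:
        return word[:-1]
    return word

print(lemmatize("better"))  # -> good
print(lemmatize("rates"))   # -> rate
```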

For this step, it is important to define both Train and Inference jobs.

  • Train: Generates an initial model. There must therefore be a sufficiently large number of documents to produce significant output from the beginning. This job does not save the term list back on the original items.

  • Inference: Loads the model, updates it with the words in the current document, then produces the list and saves the new model.

Note: Inference must also be applied to the training data to generate the significant term list.
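The tracking facet drives this split: the train job reads all items (query "*"), while inference only visits items not yet tagged with the current version. A hedged sketch of that selection logic, using plain dicts in place of Squirro items and assuming facet values are stored as lists:

```python
def needs_inference(item, version="1", facet="ml_terms_version"):
    """An item keeps matching the infer query
    "NOT ml_terms_version:1" until the saver step tags it."""
    return item.get("keywords", {}).get(facet) != [version]

item = {"id": "doc-1", "keywords": {}}
print(needs_inference(item))  # -> True

# After the saver step writes the terms and the tracking facet:
item["keywords"] = {"terms": ["significant", "terms"],
                    "ml_terms_version": ["1"]}
print(needs_inference(item))  # -> False
```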

Scroll timeout#

For slower servers, the scroll timeout for the queries may need to be lengthened. This is the case if the workflow reports errors containing the string No search context found for id.

This can be done by changing the dataset parameters like this:

{
    "dataset": {
        "infer": {
            "count": 1000,
            "kwargs": {
                "scroll": "30m"
            },
            "query_string": "NOT ml_terms_version:1"
        },
        "train": {
            "count": 2500,
            "kwargs": {
                "scroll": "30m"
            },
            "query_string": "*"
        }
    },
    …
}

libNLP Information#

For information about how the TermsExtractionEmbedder functions as part of the Embedders Package within libNLP, see TermsExtractionEmbedder.