POSBooster

class squirro.lib.nlp.apps.query_processing.POSBooster(config)

Bases: squirro.lib.nlp.steps.batched_step.BatchedStep

Perform term level boosting and cleaning based on detected POS tags.

The following POS tags are known:

  • ADJ: adjectives

  • ADP: adpositions (prepositions and postpositions)

  • ADV: adverbs

  • CONJ: conjunctions

  • DET: determiners

  • INTJ: interjections

  • NOUN: nouns

  • NUM: numerals

  • PART: particles

  • PRON: pronouns

  • PROPN: proper nouns

  • PUNCT: punctuation

  • SPACE: spaces

  • SYM: symbols

  • VERB: verbs (all tenses and modes)

  • X: other (foreign words, typos, abbreviations)

POS Weight Map

Each POS tag can be configured with a specific numeric weight.

{
    "PROPN": 10,
    "NOUN": 10,
    "VERB": 5,
    "ADJ": 2,
    "X": "-",
    "NUM": "-",
    "SYM": "-"
}
  • Boost term relevancy: A higher weight boosts the relevancy of matched terms more strongly.

  • Remove terms: Terms whose POS tag is not listed in the map can be stripped out (useful for removing determiners or stopwords).

  • Ignore mutation: Set the weight of a POS tag to "-" to leave tokens with that tag unmodified.
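
The three behaviours can be illustrated with a minimal sketch. It is not the POSBooster implementation: the helper name `mutate_term` and its plain (text, POS tag) inputs are hypothetical, and the handling of unlisted tags simply follows the description above.

```python
from typing import Dict, Optional, Union

Weight = Union[int, str]

# Weight map as shown above.
POS_WEIGHT_MAP: Dict[str, Weight] = {
    "PROPN": 10, "NOUN": 10, "VERB": 5, "ADJ": 2,
    "X": "-", "NUM": "-", "SYM": "-",
}


def mutate_term(text: str, pos: str, weight_map: Dict[str, Weight]) -> Optional[str]:
    """Hypothetical helper: return the replacement for a term.

    Returns None when the term should be left untouched.
    """
    if pos not in weight_map:
        return ""                  # POS tag not listed: strip the term
    weight = weight_map[pos]
    if weight == "-":
        return None                # "-" means: do not mutate this POS type
    return f"{text}^{weight}"      # numeric weight: boost the term
```

Under these assumptions a proper noun is boosted, a determiner is removed, and a numeral is passed through unchanged.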

Improve retrieval Precision with Phrases & Term Proximity

Note: This is especially useful when searching very large documents that may match all query terms, but with the matches scattered across the whole document rather than within the right context (e.g. within the same section or paragraph).

Relevant chunks (detected via SpacyNormalizer) are converted into loose phrases to improve search precision.

  • population of new york => population of "new york"~5

  • will the european union extend brexit deadline => "the european union"~15 extend^2 "brexit deadline"~15

To limit the impact on recall, the search uses "loosely coupled phrases".

  • All phrase terms must match within a window of N terms; the default is N=15 (the average length of English sentences).
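
As a rough illustration of the loose-phrase syntax, the sketch below quotes a multi-word chunk and appends the proximity window. The helper `to_loose_phrase` is hypothetical and is not part of the documented class; it only mirrors the query strings shown in the examples above.

```python
def to_loose_phrase(chunk: str, distance: int = 15) -> str:
    """Hypothetical helper: turn a multi-word chunk into a proximity phrase."""
    chunk = chunk.strip()
    if " " not in chunk:
        return chunk                      # single terms are not quoted
    return f'"{chunk}"~{distance}'        # all terms must match within `distance` terms
```

For example, `to_loose_phrase("brexit deadline")` yields `"brexit deadline"~15`, while a single term such as `extend` is returned as-is.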

Example:

  • Input query: why is austria again the virus center

  • Output POS mutations:

{
    "why": "",
    "is": "",
    "austria": "austria^10",
    "again": "",
    "the": "",
    "virus": "virus^10",
    "center": "center^10"
}

The annotated mutation dictionary is used by a subsequent step to enrich the search query.
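
How the later step consumes the dictionary is not described here. The following sketch only assumes the simplest possible strategy: each whitespace-separated term is replaced by its mutation, and terms with an empty replacement are dropped. The function `apply_mutations` is a hypothetical name.

```python
from typing import Dict


def apply_mutations(query: str, mutations: Dict[str, str]) -> str:
    """Hypothetical helper: rewrite a query using a term => replacement map."""
    rewritten = []
    for term in query.split():
        replacement = mutations.get(term, term)  # unmapped terms stay unchanged
        if replacement:                          # empty replacement drops the term
            rewritten.append(replacement)
    return " ".join(rewritten)


pos_mutations = {
    "why": "", "is": "", "austria": "austria^10", "again": "",
    "the": "", "virus": "virus^10", "center": "center^10",
}

print(apply_mutations("why is austria again the virus center", pos_mutations))
# prints: austria^10 virus^10 center^10
```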

Parameters
  • step (str, "app") – app

  • type (str, "query_processing") – query_processing

  • name (str, "pos_booster") – pos_booster

  • analyzed_input_field (str, "nlp") – analyzed spacy Doc

  • pos_weight_map (dict, {"PROPN": 10, "NOUN": 10, "VERB": 5, "ADJ": 2, "X": "-", "NUM": "-", "SYM": "-"}) – dictionary mapping each Spacy POS tag to the weight used for term boosting

  • phrase_proximity_distance (int, 15) – Merged tokens (via SpacyNormalizer) are converted into phrases whose terms must match within a window of this many terms.

  • min_query_length (int, 2) – Only queries that contain more tokens than the configured threshold are considered for POS based transformation

  • output_field (str, "pos_mutations") – map of term => replacement

  • path (str, ".") – path
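
Put together, a step configuration might look like the sketch below. The keys and default values are taken from the parameter list above; how the step is embedded in a surrounding pipeline definition is not shown here, and the variable name `pos_booster_step` is only illustrative.

```python
import json

# Hypothetical step configuration built from the documented parameters and defaults.
pos_booster_step = {
    "step": "app",
    "type": "query_processing",
    "name": "pos_booster",
    "analyzed_input_field": "nlp",
    "pos_weight_map": {
        "PROPN": 10, "NOUN": 10, "VERB": 5, "ADJ": 2,
        "X": "-", "NUM": "-", "SYM": "-",
    },
    "phrase_proximity_distance": 15,
    "min_query_length": 2,
    "output_field": "pos_mutations",
}

# Serialise to JSON, e.g. for inspection or storage.
print(json.dumps(pos_booster_step, indent=2))
```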

Attributes Summary

STEP_PREFIX

Methods Summary

apply_term_mutation(term_mutation_map, …)

Specify applied mutation rule on a given token.

process_doc(doc)

Process a document

quote_compound_term(token)

Convert merged-tokens to phrase-term.

should_mutate(token)

Check if incoming Spacy Token should get mutated.

Attributes Documentation

STEP_PREFIX = 'pos_booster'

Methods Documentation

apply_term_mutation(term_mutation_map, token, weight)

Specify applied mutation rule on a given token (e.g. "is" -> "").

process_doc(doc)

Process a document

Parameters

doc (Document) – Document

Returns

Processed document

Return type

Document

classmethod quote_compound_term(token)

Convert merged-tokens to phrase-term.

  • merging_entities results in new york being recognised as one token.

  • Tokens with intra-hyphens would be split by elastic (e.g. covid-19, 2010-10).

Return type

str

should_mutate(token)

Check if incoming Spacy Token should get mutated.

Parameters

token (Token) – incoming Spacy Token