POSBooster

class POSBooster(config)

Bases: BatchedStep

Perform term level boosting and cleaning based on detected POS tags.

The following POS tags are known:

  • ADJ: adjectives

  • ADP: adpositions (prepositions and postpositions)

  • ADV: adverbs

  • CONJ: conjunctions

  • DET: determiners

  • INTJ: interjections

  • NOUN: nouns

  • NUM: numerals

  • PART: particles

  • PRON: pronouns

  • PROPN: proper nouns

  • PUNCT: punctuation

  • SPACE: spaces

  • SYM: symbols

  • VERB: verbs (all tenses and modes)

  • X: other (foreign words, typos, abbreviations)

POS Weight Map

Each POS tag can be configured with a specific numeric weight.

{
    "PROPN": 10,
    "NOUN": 10,
    "VERB": 5,
    "ADJ": 2,
    "X": "-",
    "NUM": "-",
    "SYM": "-"
}
  • Boost term relevancy: A higher number boosts the relevancy of matched terms.

  • Remove terms: Terms whose POS type is NOT specified in the map can be stripped out (useful to remove determiners or stopwords).

  • Ignore mutation: To leave the tokens of a POS type unmutated, set the corresponding weight to "-".
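The three behaviours above can be sketched in a few lines. This is a minimal illustration, not the library's implementation: `mutation_for` is a hypothetical helper, and returning `None` for the "-" case is an assumption about how "no mutation" might be represented.

```python
from typing import Dict, Optional, Union

Weight = Union[int, float, str]

# Weight map copied from the example above.
POS_WEIGHT_MAP: Dict[str, Weight] = {
    "PROPN": 10, "NOUN": 10, "VERB": 5, "ADJ": 2,
    "X": "-", "NUM": "-", "SYM": "-",
}


def mutation_for(term: str, pos: str, weight_map: Dict[str, Weight]) -> Optional[str]:
    """Hypothetical helper: replacement for one term, or None to leave it untouched."""
    if pos not in weight_map:
        return ""  # POS type not specified: strip the term
    weight = weight_map[pos]
    if weight == "-":
        return None  # "-": ignore, no mutation is recorded (assumed representation)
    return f"{term}^{weight}"  # numeric weight: boost the term


boosted = mutation_for("austria", "PROPN", POS_WEIGHT_MAP)  # "austria^10"
removed = mutation_for("the", "DET", POS_WEIGHT_MAP)        # ""
ignored = mutation_for("2020", "NUM", POS_WEIGHT_MAP)       # None
```

The POS tags are passed in by hand here; in the real step they come from the analyzed spacy Doc.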

Improve retrieval Precision with Phrases & Term Proximity

Note: This is especially useful when searching inside very large documents that may match all query terms, but where the matches are scattered across the whole document rather than occurring within the right context (e.g. within the same section or paragraph).

Relevant chunks - detected via SpacyNormalizer - are converted into loose-phrases to improve search precision.

  • population of new york => population of “new york”~5

  • will the european union extend brexit deadline => “the european union”~15 extend^2 “brexit deadline”~15

To limit the impact on recall, we search for “loosely coupled phrases”.

  • Match all phrase terms within a window of N terms, with a default of N=15 (the average length of English sentences).
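As a rough sketch of the rewrite described above, a detected chunk can be turned into a proximity phrase as follows. `to_loose_phrase` is a hypothetical name, the straight double quotes assume that the curly quotes in the examples are only typographic, and leaving single-term chunks unquoted is an assumption.

```python
def to_loose_phrase(chunk: str, distance: int = 15) -> str:
    """Hypothetical helper: wrap a multi-term chunk as a loose phrase "..."~N."""
    terms = chunk.split()
    if len(terms) < 2:
        return chunk  # a single term has no proximity to constrain (assumption)
    phrase = " ".join(terms)
    return f'"{phrase}"~{distance}'


print(to_loose_phrase("new york", 5))     # "new york"~5
print(to_loose_phrase("brexit deadline")) # "brexit deadline"~15
```

A larger distance keeps more of the recall of a plain term query, while a smaller one behaves closer to an exact phrase match.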

Example:

  • Input query: why is austria again the virus center

  • Output POS mutations:

{
    "why": "",
    "is": "",
    "austria": "austria^10",
    "again": "",
    "the": "",
    "virus": "virus^10",
    "center": "center^10"
}

The annotated mutation dictionary is used in a succeeding step to enrich the search query.
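How the succeeding step consumes the dictionary is not described here. One plausible reading, shown as a sketch only, is a term-by-term substitution in which empty replacements drop the term. `apply_mutations` is a hypothetical function and whitespace tokenisation is a simplifying assumption.

```python
from typing import Dict


def apply_mutations(query: str, mutations: Dict[str, str]) -> str:
    """Hypothetical enrichment step: swap each term for its mutation."""
    rewritten = []
    for term in query.split():
        replacement = mutations.get(term, term)  # terms without an entry stay unchanged
        if replacement:  # "" means the term was stripped
            rewritten.append(replacement)
    return " ".join(rewritten)


# Mutation dictionary from the example above.
pos_mutations = {
    "why": "", "is": "", "austria": "austria^10", "again": "",
    "the": "", "virus": "virus^10", "center": "center^10",
}

enriched = apply_mutations("why is austria again the virus center", pos_mutations)
print(enriched)  # austria^10 virus^10 center^10
```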

Parameters:
  • step (str, "app") – app

  • type (str, "query_processing") – query_processing

  • name (str, "pos_booster") – pos_booster

  • analyzed_input_field (str, "nlp") – analyzed spacy Doc

  • pos_weight_map (dict, {"PROPN":10,"NOUN":10,"VERB":5,"ADJ":2,"X":"-","NUM":"-","SYM":"-"}) – dictionary mapping each Spacy POS tag to the weight used for term boosting

  • phrase_proximity_distance (int, 15) – Merged tokens (via SpacyNormalizer) are converted into loose phrases whose terms must match within this many terms of each other.

  • sub_token_aware_quoting (bool, True) – Perform additional PHRASE-TERM matching for tokens that might get split by ES analyzer. (‘wi-fi’ -> ‘wi-fi OR “wi-fi”~0’)

  • min_query_length (int, 2) – Only queries that contain more tokens than the configured threshold are considered for POS based transformation

  • output_field (str, "pos_mutations") – map of term => replacement

  • path (str, ".") – path

Attributes Summary

Methods Summary

apply_term_mutation(term_mutation_map, ...)

Specify applied mutation rule on a given token.

process_doc(doc)

Process a document

should_add_phrase_match(token)

should_mutate(token)

Check if incoming Spacy Token should get mutated.

Attributes Documentation

STEP_PREFIX = 'pos_booster'

Methods Documentation

apply_term_mutation(term_mutation_map, token, weight)

Specify the mutation rule applied to a given token (e.g. “is” -> “”).

process_doc(doc)

Process a document

Parameters:

doc (Document) – Document

Returns:

Processed document

Return type:

Document

should_add_phrase_match(token)

Specifies whether a Spacy Token should be rewritten to perform additional phrase matching:

(token OR “token”)

Applicable when:

  • merging_entities results in new york being recognised as one token.

  • tokens are suitable for sub-word delimiting, like covid-19, 2010-10, email@adress.com

Returns:

Quoted Token.text if the string might be split by the Elasticsearch analyzer

Return type:

str
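The check can be pictured with a small sketch. The heuristic below (a non-alphanumeric character between two alphanumeric ones) is an assumption about which strings an analyzer might split, not the rule the method actually uses, and `phrase_match_for` is a hypothetical name that works on plain strings instead of Spacy Tokens.

```python
import re
from typing import Optional

# Assumed heuristic: an inner space, hyphen, "@", "." etc. may cause a split.
_SPLIT_CANDIDATE = re.compile(r"[A-Za-z0-9][^A-Za-z0-9]+[A-Za-z0-9]")


def phrase_match_for(text: str) -> Optional[str]:
    """Hypothetical helper: return (token OR "token") if the text might be split."""
    if _SPLIT_CANDIDATE.search(text):
        return f'({text} OR "{text}")'
    return None  # plain tokens need no additional phrase match


print(phrase_match_for("covid-19"))  # (covid-19 OR "covid-19")
print(phrase_match_for("new york"))  # (new york OR "new york")
print(phrase_match_for("austria"))   # None
```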

should_mutate(token)

Check if incoming Spacy Token should get mutated.

Parameters:

token (Token) – incoming Spacy Token