POSBooster

class POSBooster(config)

Bases: BatchedStep

Perform term-level boosting and cleaning based on detected POS tags.

The following POS tags are known:

ADJ: adjectives
ADP: adpositions (prepositions and postpositions)
ADV: adverbs
CONJ: conjunctions
DET: determiners
INTJ: interjections
NOUN: nouns
NUM: numerals
PART: particles
PRON: pronouns
PROPN: proper nouns
PUNCT: punctuation
SPACE: spaces
SYM: symbols
VERB: verbs (all tenses and modes)
X: other (foreign words, typos, abbreviations)

POS Weight Map

Each POS tag can be configured with a specific numeric weight.

{
    "PROPN": 10,
    "NOUN": 10,
    "VERB": 5,
    "ADJ": 2,
    "X": "-",
    "NUM": "-",
    "SYM": "-"
}

Boost term relevancy: A higher weight boosts the relevancy of matched terms.
Remove terms: Tokens whose POS type is not specified in the map are stripped out (useful for removing determiners or stopwords).
Ignore mutation: Set the weight of a POS type to "-" to leave its tokens unmutated.
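
The three rules above can be illustrated with a minimal sketch. It does not reproduce the library's implementation: the helper `build_mutations` is hypothetical, and the POS tags are assigned by hand instead of coming from a Spacy pipeline.

```python
from typing import Dict, List, Tuple, Union

Weight = Union[int, str]

# Weight map as documented above.
POS_WEIGHT_MAP: Dict[str, Weight] = {
    "PROPN": 10, "NOUN": 10, "VERB": 5, "ADJ": 2,
    "X": "-", "NUM": "-", "SYM": "-",
}


def build_mutations(
    tagged_terms: List[Tuple[str, str]],
    weight_map: Dict[str, Weight] = POS_WEIGHT_MAP,
) -> Dict[str, str]:
    """Hypothetical helper: map each term to its replacement string."""
    mutations: Dict[str, str] = {}
    for term, pos in tagged_terms:
        weight = weight_map.get(pos)
        if weight is None:
            # POS type not specified in the map: strip the term.
            mutations[term] = ""
        elif weight == "-":
            # Ignore mutation: leave the term untouched (no entry).
            continue
        else:
            # Numeric weight: boost the term.
            mutations[term] = f"{term}^{weight}"
    return mutations


# Hand-tagged input, for illustration only.
tagged = [("the", "DET"), ("fast", "ADJ"), ("train", "NOUN"), ("2024", "NUM")]
print(build_mutations(tagged))
# {'the': '', 'fast': 'fast^2', 'train': 'train^10'}
```

Here `the` is removed because DET is not in the map, `fast` and `train` are boosted, and `2024` gets no entry because NUM is configured with "-".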

Improve retrieval Precision with Phrases & Term Proximity

Note: This is especially useful when searching very large documents that may match all query terms, but with the matches scattered across the whole document rather than within the right context (e.g. within the same section or paragraph).

Relevant chunks, detected via SpacyNormalizer, are converted into loose phrases to improve search precision.

population of new york => population of "new york"~5
will the european union extend brexit deadline => "the european union"~15 extend^2 "brexit deadline"~15

To limit the impact on recall, the search uses "loosely coupled phrases" instead of exact phrases.

All phrase terms must match within a window of N terms. The default is N=15 (the average length of an English sentence).
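
As a rough sketch of the rewrite described above, a multi-term chunk can be wrapped in quotes and given a proximity suffix. The function name `to_loose_phrase` is hypothetical and only illustrates the output format; chunk detection by SpacyNormalizer is not modelled here.

```python
def to_loose_phrase(chunk: str, distance: int = 15) -> str:
    """Hypothetical helper: turn a multi-term chunk into a loose phrase."""
    terms = chunk.split()
    if len(terms) < 2:
        # A single term cannot form a phrase; return it unchanged.
        return chunk
    phrase = " ".join(terms)
    return f'"{phrase}"~{distance}'


print(to_loose_phrase("new york", 5))    # "new york"~5
print(to_loose_phrase("brexit deadline"))  # "brexit deadline"~15
```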

Example:

Input query: why is austria again the virus center
Output POS mutations:

{
    "why": "",
    "is": "",
    "austria": "austria^10",
    "again": "",
    "the": "",
    "virus": "virus^10",
    "center": "center^10"
}

The annotated mutation dictionary is used in a subsequent step to enrich the search query.
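
This page does not describe that later step. As an assumption about how such a dictionary might be consumed, the sketch below substitutes each query term with its replacement and drops terms mapped to an empty string. The function `apply_mutations` is hypothetical.

```python
from typing import Dict


def apply_mutations(query: str, mutations: Dict[str, str]) -> str:
    """Hypothetical consumer of the mutation dictionary."""
    rewritten = []
    for term in query.split():
        # Terms without an entry are kept as they are.
        replacement = mutations.get(term, term)
        if replacement:
            # An empty replacement means the term was removed.
            rewritten.append(replacement)
    return " ".join(rewritten)


mutations = {
    "why": "", "is": "", "austria": "austria^10", "again": "",
    "the": "", "virus": "virus^10", "center": "center^10",
}
print(apply_mutations("why is austria again the virus center", mutations))
# austria^10 virus^10 center^10
```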

Parameters:

step (str, "app") – app
type (str, "query_processing") – query_processing
name (str, "pos_booster") – pos_booster
analyzed_input_field (str, "nlp") – analyzed spacy Doc
pos_weight_map (dict, {"PROPN":10,"NOUN":10,"VERB":5,"ADJ":2,"X":"-","NUM":"-","SYM":"-"}) – dictionary mapping each Spacy POS tag to the weight used for term boosting
phrase_proximity_distance (int, 15) – Merged tokens (via SpacyNormalizer) are converted into phrases whose terms must match within this many terms.
sub_token_aware_quoting (bool, True) – Perform additional phrase-term matching for tokens that might get split by the Elasticsearch analyzer. ('wi-fi' -> 'wi-fi OR "wi-fi"~0')
min_query_length (int, 2) – Only queries that contain more tokens than the configured threshold are considered for POS based transformation
output_field (str, "pos_mutations") – map of term => replacement
path (str, ".") – path
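
For orientation, the documented defaults can be collected into a plain dictionary. This is a sketch only: the exact structure that `POSBooster(config)` expects is not shown on this page, so the flat layout below is an assumption, and `is_eligible` is a hypothetical helper that illustrates the `min_query_length` rule.

```python
from typing import Any, Dict, List

# Keys and defaults mirror the parameter list above; the flat layout is assumed.
DEFAULT_CONFIG: Dict[str, Any] = {
    "step": "app",
    "type": "query_processing",
    "name": "pos_booster",
    "analyzed_input_field": "nlp",
    "pos_weight_map": {
        "PROPN": 10, "NOUN": 10, "VERB": 5, "ADJ": 2,
        "X": "-", "NUM": "-", "SYM": "-",
    },
    "phrase_proximity_distance": 15,
    "sub_token_aware_quoting": True,
    "min_query_length": 2,
    "output_field": "pos_mutations",
    "path": ".",
}


def is_eligible(tokens: List[str], config: Dict[str, Any] = DEFAULT_CONFIG) -> bool:
    """Hypothetical check: only queries with more tokens than the threshold qualify."""
    return len(tokens) > config["min_query_length"]


print(is_eligible(["new", "york"]))                 # False
print(is_eligible(["population", "new", "york"]))   # True
```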

Attributes Summary

STEP_PREFIX

Methods Summary

`apply_term_mutation`(term_mutation_map, ...)	Specify applied mutation rule on a given token.
`process_doc`(doc)	Process a document
`should_add_phrase_match`(token)
`should_mutate`(token)	Check if incoming Spacy Token should get mutated.

Attributes Documentation

STEP_PREFIX = 'pos_booster'

Methods Documentation

apply_term_mutation(term_mutation_map, token, weight)

Specify the mutation rule applied to a given token, e.g. "is" -> "".

process_doc(doc)

Process a document

Parameters: doc (Document) – Document
Returns: Processed document
Return type: Document

should_add_phrase_match(token)

Specifies whether a Spacy Token should be rewritten to perform additional phrase matching (token OR "token").

Applicable when:
- merging_entities results in new york being recognised as one token.
- the token is suitable for sub-word delimiting, like covid-19, 2010-10, email@adress.com

Returns: Quoted Token.text if the string might be split by the Elasticsearch analyzer
Return type: str
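
The rewrite can be sketched as follows. The actual detection logic is not documented here, so the regular expression below is a stand-in heuristic (any non-word character counts as a potential split point), and `phrase_match_variant` is a hypothetical name. The output format follows the `sub_token_aware_quoting` example in the parameter list.

```python
import re

# Assumed heuristic: a non-word character (hyphen, '@', '.', space) may cause a split.
_SPLIT_CHARS = re.compile(r"\W")


def phrase_match_variant(text: str) -> str:
    """Hypothetical helper: add a quoted variant for tokens that may be split."""
    if _SPLIT_CHARS.search(text):
        return f'{text} OR "{text}"~0'
    return text


print(phrase_match_variant("wi-fi"))    # wi-fi OR "wi-fi"~0
print(phrase_match_variant("austria"))  # austria
```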

should_mutate(token)

Check if incoming Spacy Token should get mutated.

Parameters: token (Token)
