POSBooster
- class POSBooster(config)
Bases: BatchedStep
Performs term-level boosting and cleaning based on detected part-of-speech (POS) tags.
The following POS tags are known:
- ADJ: adjectives
- ADP: adpositions (prepositions and postpositions)
- ADV: adverbs
- CONJ: conjunctions
- DET: determiners
- INTJ: interjections
- NOUN: nouns
- NUM: numerals
- PART: particles
- PRON: pronouns
- PROPN: proper nouns
- PUNCT: punctuation
- SPACE: spaces
- SYM: symbols
- VERB: verbs (all tenses and modes)
- X: other (foreign words, typos, abbreviations)
POS Weight Map
Each POS tag can be configured with a specific numeric weight.
{ "PROPN": 10, "NOUN": 10, "VERB": 5, "ADJ": 2, "X": "-", "NUM": "-", "SYM": "-" }
- Boost term relevancy: a higher number boosts the relevancy of matched terms.
- Remove terms: POS types that are NOT specified in the map can be stripped out (useful for removing determiners or stopwords).
- Ignore mutation: to leave tokens of a given POS type unmutated, set the corresponding weight to "-".
Improve retrieval Precision with Phrases & Term Proximity
Note: This is especially useful when searching very large documents that may match all query terms, but where the matches are scattered across the whole document rather than occurring in the right context (e.g. within the same section or paragraph).
Relevant chunks, detected via SpacyNormalizer, are converted into loose phrases to improve search precision.
population of new york => population of "new york"~5
will the european union extend brexit deadline => "the european union"~15 extend^2 "brexit deadline"~15
To limit the impact on recall, the step searches for "loosely coupled phrases": all phrase terms must match within a window of N terms, with a default of N=15 (the average length of an English sentence).
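The loose-phrase rewrite above can be sketched as follows. This is a minimal illustration, not the POSBooster implementation: the helper name `loose_phrase` is hypothetical, and it assumes the chunk has already been detected upstream.

```python
def loose_phrase(chunk: str, distance: int = 15) -> str:
    """Quote a multi-word chunk and append a proximity suffix (hypothetical helper)."""
    words = chunk.split()
    if len(words) < 2:
        # A single term has no internal proximity to constrain, so leave it as-is.
        return chunk
    phrase = " ".join(words)
    return f'"{phrase}"~{distance}'


print(loose_phrase("new york", 5))      # "new york"~5
print(loose_phrase("brexit deadline"))  # "brexit deadline"~15
```

The distance defaults to 15 here to mirror the documented default window.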
Example:
Input query: why is austria again the virus center
Output POS mutations:
{ "why": "", "is": "", "austria": "austria^10", "again": "", "the": "", "virus": "virus^10", "center": "center^10" }
The annotated mutation dictionary is used in a subsequent step to enrich the search query.
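As a rough sketch of how such a mutation dictionary could be derived from the weight map, the snippet below reproduces the example output. It does not call spaCy: the POS tags are hand-assigned for illustration (in particular, "is" is tagged AUX here so that it falls outside the weight map), and `build_mutations` is a hypothetical helper. Leaving tokens with weight "-" out of the map entirely is also an assumption.

```python
DEFAULT_WEIGHTS = {"PROPN": 10, "NOUN": 10, "VERB": 5, "ADJ": 2,
                   "X": "-", "NUM": "-", "SYM": "-"}


def build_mutations(tagged_tokens, weights=DEFAULT_WEIGHTS):
    """Map each token text to its replacement (hypothetical helper)."""
    mutations = {}
    for text, pos in tagged_tokens:
        weight = weights.get(pos)
        if weight == "-":
            # Assumption: "-" means "do not mutate", so the token gets no entry.
            continue
        if weight is None:
            # POS type not in the map: strip the term.
            mutations[text] = ""
        else:
            # POS type with a numeric weight: boost the term.
            mutations[text] = f"{text}^{weight}"
    return mutations


# Hand-assigned tags, not the output of a real tagger.
tagged = [("why", "ADV"), ("is", "AUX"), ("austria", "PROPN"), ("again", "ADV"),
          ("the", "DET"), ("virus", "NOUN"), ("center", "NOUN")]
print(build_mutations(tagged))
```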
- Parameters
step (str, "app") – app
type (str, "query_processing") – query_processing
name (str, "pos_booster") – pos_booster
analyzed_input_field (str, "nlp") – analyzed spacy Doc
pos_weight_map (dict, {"PROPN":10,"NOUN":10,"VERB":5,"ADJ":2,"X":"-","NUM":"-","SYM":"-"}) – dictionary mapping each spaCy POS tag to the weight used for term boosting
phrase_proximity_distance (int, 15) – Tokens merged by SpacyNormalizer are converted into phrases; this value is the proximity window N applied to them ("phrase"~N).
sub_token_aware_quoting (bool, True) – Perform additional phrase-term matching for tokens that might get split by the Elasticsearch analyzer ('wi-fi' -> 'wi-fi OR "wi-fi"~0').
min_query_length (int, 2) – Only queries that contain more tokens than the configured threshold are considered for POS based transformation
output_field (str, "pos_mutations") – map of term => replacement
path (str, ".") – path
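For orientation, the documented defaults can be collected into a single dictionary. This is only a sketch: the exact layout expected by `POSBooster(config)` is not shown in this page, so the flat structure below is an assumption, and `is_eligible` is a hypothetical helper illustrating the `min_query_length` rule.

```python
# Documented defaults gathered in one place; the flat layout is an assumption.
pos_booster_config = {
    "step": "app",
    "type": "query_processing",
    "name": "pos_booster",
    "analyzed_input_field": "nlp",
    "pos_weight_map": {"PROPN": 10, "NOUN": 10, "VERB": 5, "ADJ": 2,
                       "X": "-", "NUM": "-", "SYM": "-"},
    "phrase_proximity_distance": 15,
    "sub_token_aware_quoting": True,
    "min_query_length": 2,
    "output_field": "pos_mutations",
    "path": ".",
}


def is_eligible(tokens, config=pos_booster_config):
    """Hypothetical gate: the query needs more tokens than min_query_length."""
    return len(tokens) > config["min_query_length"]


print(is_eligible("why is austria again the virus center".split()))
print(is_eligible(["new", "york"]))
```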
Attributes Summary
Methods Summary
- apply_term_mutation(term_mutation_map, ...) – Specify the mutation rule applied to a given token.
- process_doc(doc) – Process a document.
- should_add_phrase_match(token) – Specify whether a spaCy Token should be rewritten to perform additional phrase matching.
- should_mutate(token) – Check whether an incoming spaCy Token should be mutated.
Attributes Documentation
- STEP_PREFIX = 'pos_booster'
Methods Documentation
- apply_term_mutation(term_mutation_map, token, weight)
Specify the mutation rule applied to a given token (e.g. "is" -> "").
- process_doc(doc)
Process a document.
- should_add_phrase_match(token)
Specifies whether a spaCy Token should be rewritten to perform additional phrase matching (token OR "token").
Applicable when:
- merging_entities results in new york being recognised as one token
- the token is suitable for sub-word delimiting, like covid-19, 2010-10, email@adress.com
- Returns
Quoted Token.text if the string might be split by the Elasticsearch analyzer
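The sub-word case can be illustrated with a small sketch that works on plain strings rather than spaCy tokens. The delimiter set and the helper name `quote_if_splittable` are assumptions made for illustration; the actual check performed by `should_add_phrase_match` is not documented here, and the merged-entity case is not covered.

```python
import re

# Hypothetical set of characters an analyzer might split on.
SUB_TOKEN_DELIMITERS = re.compile(r"[-@._/]")


def quote_if_splittable(text: str) -> str:
    """Add an exact-phrase alternative when the text may be split (hypothetical helper)."""
    if SUB_TOKEN_DELIMITERS.search(text):
        # Keep the plain term and add a zero-slop phrase match next to it.
        return f'{text} OR "{text}"~0'
    return text


print(quote_if_splittable("wi-fi"))  # wi-fi OR "wi-fi"~0
print(quote_if_splittable("virus"))  # virus
```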
- should_mutate(token)
Check whether an incoming spaCy Token should be mutated.
- Parameters
token (Token)