SpacyNormalizer
- class SpacyNormalizer(config)
Bases: Normalizer
Multi-Lingual Spacy Text Analyzer.
The Spacy Normalizer analyses incoming text and stores the Spacy Document as an enrichment for further usage in subsequent steps.
- Note
The document's language field is used to load and apply the corresponding spacy model (as defined in spacy_model_mapping).
For performance reasons, input_field is expected to be split already on sentence/paragraph boundaries. The step is currently not optimised for high throughput on big documents (that would require batch processing with spaCy's nlp.pipe() instead).
For the usable built-in components, see https://spacy.io/usage/processing-pipelines#built-in
Input - the input field needs to be of type str.
Output - the output field is filled with data of type spacy.Doc.
- Configured components
- Tokenization: parse text and split into unique (meaningful) tokens
(optional) merge phrases into one token, like “covid-19 vaccine”
(optional) merge tokens separated by hyphen into one token, like covid-19 instead of [“covid”,”-“,”19”]
(optional) merge recognized entities into one token, like New York
- Tagging:
Part of Speech: Custom processing of terms based on their tag (e.g. “University:Noun” vs “should:Verb”)
Named Entities: Custom processing of named-entities (Switzerland:Location)
Lemmatization: Reduce variety of vocabulary
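The optional hyphen merge described above can be illustrated without spaCy. The following is a minimal sketch on plain string tokens; the function name merge_hyphenated is hypothetical and this is not the step's actual implementation, which operates on Spacy tokens.

```python
def merge_hyphenated(tokens):
    """Merge runs like ["covid", "-", "19"] into the single token "covid-19".

    Illustrative only: works on plain strings, not on spaCy Token objects.
    """
    merged = []
    i = 0
    while i < len(tokens):
        current = tokens[i]
        # Absorb "-" plus the following token for as long as a hyphen
        # sits between the current token and another non-hyphen token.
        while (
            current != "-"
            and i + 2 < len(tokens)
            and tokens[i + 1] == "-"
            and tokens[i + 2] != "-"
        ):
            current = f"{current}-{tokens[i + 2]}"
            i += 2
        merged.append(current)
        i += 1
    return merged


print(merge_hyphenated(["the", "covid", "-", "19", "vaccine"]))
```

A leading or trailing hyphen is left untouched because there is no token on both sides to join.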
- Parameters
type (str) – spacy
input_fields (list,["body"]) – This step only takes one input field
output_fields (list,["nlp"]) – This step only takes one output field
spacy_model_mapping (dict, {"en": "en_core_web_sm"}) – Map language code to specific spacy language model
fallback_language (str, "en") – Default language to use
exclude_spacy_pipes (list, ["ner"]) – Depending on the use-case exclude predefined spacy steps to reduce computational effort.
infix_split_hyphen (bool, False) – Whether a single token like new-york should be split at the hyphen (currently only for the English language).
merge_phrases (bool, True) – Recognize and merge phrases into one Spacy Token.
merge_entities (bool, True) – Recognize and merge Named Entities into one Spacy Token.
Example
{ "step": "normalizer", "type": "spacy", "input_fields": ["body"], "output_fields": ["nlp"], "exclude_spacy_pipes": [], "spacy_model_mapping": {"en": "en_core_web_sm"} }
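How such a configuration might be interpreted can be sketched in plain Python. The helpers check_single_field and resolve_model_name are hypothetical and only mirror the documented parameters and defaults (one input field, one output field, spacy_model_mapping, fallback_language); the step's real lookup logic may differ.

```python
import json

# The example step configuration from above, parsed from JSON.
step_config = json.loads("""
{ "step": "normalizer", "type": "spacy",
  "input_fields": ["body"], "output_fields": ["nlp"],
  "exclude_spacy_pipes": [],
  "spacy_model_mapping": {"en": "en_core_web_sm"} }
""")


def check_single_field(config):
    """Raise if the step is configured with more than one input/output field."""
    defaults = {"input_fields": ["body"], "output_fields": ["nlp"]}
    for key, default in defaults.items():
        fields = config.get(key, default)
        if len(fields) != 1:
            raise ValueError(f"{key} must contain exactly one field, got {fields}")


def resolve_model_name(config, language):
    """Pick the model for a document language, using fallback_language if unmapped."""
    mapping = config.get("spacy_model_mapping", {"en": "en_core_web_sm"})
    fallback = config.get("fallback_language", "en")
    if language in mapping:
        return mapping[language]
    # Assumption: the fallback language itself has an entry in the mapping.
    return mapping[fallback]


check_single_field(step_config)
print(resolve_model_name(step_config, "en"))
print(resolve_model_name(step_config, "de"))
```

In this sketch a document tagged "de" resolves to the English model, because only "en" is mapped and "en" is the documented default for fallback_language.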
Methods Summary
add_custom_config(name, value) – Add custom config that is used to customise Spacy model.
customise_spacy(nlp) – Customise spacy model.
process_doc(doc) – Process a document
register_pipe(pipe, enable) – Register Spacy pipeline.
Methods Documentation
- add_custom_config(name, value)
Add custom config that is used to customise Spacy model.
Using step-specific model-config options requires storing one model per unique set of configurations, so that the step does not interfere with other pipelines that use the same model with different options. All options (except pipes) that change the model on the fly have to be added using this method so that the model cache-key (hash) is created correctly.
- Return type
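The idea of one cached model per unique set of options can be sketched as follows. This is an assumption-laden illustration, not the library's code: the function model_cache_key, the choice of SHA-256, and the JSON serialisation are all hypothetical; the source only states that a hash-based cache key is built from the custom config.

```python
import hashlib
import json


def model_cache_key(model_name, custom_config):
    """Build a stable key from the model name and its custom options.

    Hypothetical helper: the real key format is not documented here.
    """
    payload = {"model": model_name, "custom": custom_config}
    # sort_keys makes the key independent of the order options were added in.
    serialized = json.dumps(payload, sort_keys=True)
    return hashlib.sha256(serialized.encode("utf-8")).hexdigest()


key_a = model_cache_key("en_core_web_sm", {"infix_split_hyphen": False, "merge_phrases": True})
key_b = model_cache_key("en_core_web_sm", {"merge_phrases": True, "infix_split_hyphen": False})
key_c = model_cache_key("en_core_web_sm", {"infix_split_hyphen": True, "merge_phrases": True})

print(key_a == key_b)  # same options, different insertion order
print(key_a == key_c)  # one option differs
```

Two steps with identical options would share a cached model under this scheme, while a step that changes any option would get its own copy.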
- customise_spacy(nlp)
Customise spacy model.
Allows custom steps that inherit from this step to customise specific spacy components, for example the behaviour of the Tokenizer.
- Parameters
nlp (Language)
- Return type
None
- process_doc(doc)
Process a document