SpacyNormalizer

class SpacyNormalizer(config)

Bases: Normalizer

Multi-Lingual Spacy Text Analyzer.

The Spacy Normalizer analyses incoming text and stores the Spacy Document as an enrichment for further usage in subsequent steps.

Note
  • The document's language field is used to load and use the corresponding spacy model (as defined in the spacy_model_mapping).

  • For performance reasons, the input field is expected to be split already on sentence/paragraph boundaries. The step is currently not optimised for high throughput on big documents (for that it should use spaCy's batch API, nlp.pipe(), instead).

  • For the usable built-in components, see https://spacy.io/usage/processing-pipelines#built-in

Input - the input field needs to be of type str.

Output - the output field is filled with data of type spacy.Doc.

Configured components
  • Tokenization: parse text and split into unique (meaningful) tokens
    • (optional) merge phrases into one token, like “covid-19 vaccine”

    • (optional) merge tokens separated by a hyphen into one token, like covid-19 instead of ["covid", "-", "19"]

    • (optional) merge recognized entities into one token, like New York

  • Tagging:
    • Part of Speech: Custom processing of terms based on their tag (e.g. “University:Noun” vs “should:Verb”)

    • Named Entities: Custom processing of named-entities (Switzerland:Location)

  • Lemmatization: Reduce variety of vocabulary
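
The effect of the hyphen handling described above can be pictured with a toy example. This is only an illustration built on regular expressions, not spaCy's tokenizer and not this step's implementation; the function name toy_tokenize and its split_hyphen flag are hypothetical.

```python
import re
from typing import List


def toy_tokenize(text: str, split_hyphen: bool) -> List[str]:
    """Toy tokenizer (hypothetical) showing hyphen merging vs. splitting."""
    # Words may contain internal hyphens; other punctuation becomes its own token.
    tokens = re.findall(r"\w+(?:-\w+)*|[^\w\s]", text)
    if not split_hyphen:
        return tokens
    result: List[str] = []
    for token in tokens:
        # The capture group keeps each hyphen as a separate token.
        result.extend(part for part in re.split(r"(-)", token) if part)
    return result


print(toy_tokenize("covid-19 vaccine", split_hyphen=False))  # ['covid-19', 'vaccine']
print(toy_tokenize("covid-19 vaccine", split_hyphen=True))   # ['covid', '-', '19', 'vaccine']
```

In the real step the same decision is made inside the spacy model, so downstream components see either one token or three.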

Parameters
  • type (str) – spacy

  • input_fields (list,["body"]) – This step only takes one input field

  • output_fields (list,["nlp"]) – This step only takes one output field

  • spacy_model_mapping (dict, {"en": "en_core_web_sm"}) – Map language code to specific spacy language model

  • fallback_language (str, "en") – Default language to use

  • exclude_spacy_pipes (list, ["ner"]) – Predefined spacy pipeline steps to exclude, depending on the use case, to reduce computational effort.

  • infix_split_hyphen (bool, False) – Whether a single token like new-york is split at the hyphen (currently only for the English language).

  • merge_phrases (bool, True) – Recognize and merge phrases into one Spacy Token.

  • merge_entities (bool, True) – Recognize and merge Named Entities into one Spacy Token.
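
How spacy_model_mapping and fallback_language work together can be sketched as follows. The helper resolve_model_name is hypothetical and not part of the library. It assumes that a missing or unmapped document language falls back to the model of fallback_language, which is a plausible reading of the parameter descriptions rather than confirmed behaviour.

```python
from typing import Dict, Optional


def resolve_model_name(
    language: Optional[str],
    spacy_model_mapping: Dict[str, str],
    fallback_language: str = "en",
) -> str:
    """Hypothetical helper: pick the spacy model name for a document language."""
    # Assumption: unknown or missing languages use the fallback language's model.
    if language in spacy_model_mapping:
        return spacy_model_mapping[language]
    return spacy_model_mapping[fallback_language]


mapping = {"en": "en_core_web_sm", "de": "de_core_news_sm"}
print(resolve_model_name("de", mapping))  # de_core_news_sm
print(resolve_model_name("fr", mapping))  # en_core_web_sm
```

Under this assumption the fallback language itself must have an entry in the mapping, otherwise the lookup raises a KeyError.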

Example

{
    "step": "normalizer",
    "type": "spacy",
    "input_fields": ["body"],
    "output_fields": ["nlp"],
    "exclude_spacy_pipes": [],
    "spacy_model_mapping": {"en": "en_core_web_sm"}
}
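
To see which settings are actually in effect for the example above, the documented defaults can be merged with the step configuration. The sketch below is not the library's loader; effective_config and DEFAULTS are hypothetical names, and the default values are copied from the Parameters list.

```python
import json
from typing import Any, Dict

# Defaults as documented in the Parameters list.
DEFAULTS: Dict[str, Any] = {
    "input_fields": ["body"],
    "output_fields": ["nlp"],
    "spacy_model_mapping": {"en": "en_core_web_sm"},
    "fallback_language": "en",
    "exclude_spacy_pipes": ["ner"],
    "infix_split_hyphen": False,
    "merge_phrases": True,
    "merge_entities": True,
}


def effective_config(raw_json: str) -> Dict[str, Any]:
    """Hypothetical helper: merge a step config with the documented defaults."""
    config = {**DEFAULTS, **json.loads(raw_json)}
    # The step takes exactly one input field and one output field.
    for key in ("input_fields", "output_fields"):
        if len(config[key]) != 1:
            raise ValueError(f"{key} must contain exactly one field")
    return config


step = effective_config("""
{
    "step": "normalizer",
    "type": "spacy",
    "input_fields": ["body"],
    "output_fields": ["nlp"],
    "exclude_spacy_pipes": [],
    "spacy_model_mapping": {"en": "en_core_web_sm"}
}
""")
print(step["exclude_spacy_pipes"])  # [] (overrides the default ["ner"])
print(step["merge_entities"])       # True (documented default)
```

Note that the example sets exclude_spacy_pipes to an empty list, so the ner pipe is not excluded while merge_entities keeps its default of True.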

Attributes Summary

disabled_pipes

Return type: List[str]

Methods Summary

add_custom_config(name, value)

Add custom config that is used to customise Spacy model.

customise_spacy(nlp)

Customise spacy model.

process_doc(doc)

Process a document

register_pipe(pipe, enable)

Register Spacy pipeline.

Attributes Documentation

disabled_pipes
Return type

List[str]

Methods Documentation

add_custom_config(name, value)

Add custom config that is used to customise Spacy model.

Using step-specific model-config options requires storing one model per unique set of configurations (so as not to interfere with other pipelines that use the same model with different options). All options (except pipes) that change the model on the fly have to be added using this method so that the model cache key (hash) is created correctly.

Return type

None
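
Why every model-changing option has to end up in the cache key can be illustrated with a minimal sketch. The actual hashing scheme of the library is not documented here; model_cache_key is a hypothetical function that assumes the key is derived from the model name plus all custom options.

```python
import hashlib
import json
from typing import Any, Dict


def model_cache_key(model_name: str, custom_config: Dict[str, Any]) -> str:
    """Hypothetical sketch of a cache key built from model name and options."""
    # sort_keys makes the key independent of the order in which options were added.
    payload = json.dumps({"model": model_name, "config": custom_config}, sort_keys=True)
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


key_a = model_cache_key("en_core_web_sm", {"infix_split_hyphen": False})
key_b = model_cache_key("en_core_web_sm", {"infix_split_hyphen": True})
key_c = model_cache_key("en_core_web_sm", {"infix_split_hyphen": False})
print(key_a == key_c)  # True: same options share one cached model
print(key_a == key_b)  # False: different options get separate models
```

An option that changes the model but is left out of the key would let two pipelines with different settings share one cached model.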

customise_spacy(nlp)

Customise spacy model.

Custom steps can inherit from this step and use this method to customise specific spacy components, for example the behaviour of the Tokenizer.

Parameters

nlp (Language) – Spacy language model

Return type

None

process_doc(doc)

Process a document

Parameters

doc (Document) – Document

Returns

Processed document

Return type

Document

register_pipe(pipe, enable)

Register Spacy pipeline.

Each pipe has to be registered to the model that is saved to the cache. The enable flag then determines whether the pipe is enabled or disabled during document processing.

Return type

None
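
The relation between register_pipe and the disabled_pipes attribute can be pictured with a small stand-in class. PipeRegistry is hypothetical and only mimics the documented signatures. It assumes that disabled_pipes lists the pipes that were registered with enable set to False, which the documentation above does not state explicitly.

```python
from typing import Dict, List


class PipeRegistry:
    """Hypothetical stand-in mimicking the register_pipe / disabled_pipes pairing."""

    def __init__(self) -> None:
        self._pipes: Dict[str, bool] = {}

    def register_pipe(self, pipe: str, enable: bool) -> None:
        # Every pipe is recorded; the flag only decides whether it runs.
        self._pipes[pipe] = enable

    @property
    def disabled_pipes(self) -> List[str]:
        return [name for name, enabled in self._pipes.items() if not enabled]


registry = PipeRegistry()
registry.register_pipe("merge_entities", True)
registry.register_pipe("merge_noun_chunks", False)
print(registry.disabled_pipes)  # ['merge_noun_chunks']
```

Registering a pipe even when it is disabled keeps the cached model identical across steps, so only the set of disabled pipes differs per step.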