class TermsExtractionEmbedder(config)

Bases: Embedder

The terms extraction Embedder extracts significant terms from the provided textual data.

Note - significant terms are extracted based on tf-idf. Before extraction, the text goes through preprocessing such as text cleaning, HTML removal, stopword removal, noun chunking, and lemmatization.
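To illustrate the idea, here is a minimal pure-Python sketch of tf-idf based term selection. It is not the embedder's actual implementation: the function names `tokenize` and `significant_terms`, the tiny `STOPWORDS` set, and the exact scoring formula are assumptions made for this example, and noun chunking and lemmatization are left out entirely. The keyword arguments only mirror the parameter names documented below.

```python
import math
import re
from collections import Counter

# Hypothetical stand-in stopword list; the embedder's real list is not shown here.
STOPWORDS = {"the", "a", "an", "of", "and", "is", "to", "in", "on", "at"}


def tokenize(text, min_word_len=4, filter_list=()):
    """Strip HTML tags, lowercase, and drop short words and stopwords."""
    text = re.sub(r"<[^>]+>", " ", text)
    words = re.findall(r"[a-z]+", text.lower())
    blocked = STOPWORDS | set(filter_list)
    return [w for w in words if len(w) >= min_word_len and w not in blocked]


def significant_terms(docs, p_significant_terms=0.2, **kwargs):
    """Return, per document, the top share of terms ranked by tf-idf."""
    tokenized = [tokenize(d, **kwargs) for d in docs]
    # Document frequency: in how many documents does each term occur?
    doc_freq = Counter(term for tokens in tokenized for term in set(tokens))
    n_docs = len(docs)
    results = []
    for tokens in tokenized:
        counts = Counter(tokens)
        scores = {
            term: (count / len(tokens)) * math.log(n_docs / doc_freq[term])
            for term, count in counts.items()
        }
        # Keep at least one term for any non-empty document.
        keep = max(1, math.ceil(len(scores) * p_significant_terms)) if scores else 0
        # Sort by descending score, breaking ties alphabetically.
        ranked = sorted(scores, key=lambda t: (-scores[t], t))
        results.append(ranked[:keep])
    return results


docs = [
    "<p>The solar panel converts sunlight</p>",
    "The wind turbine converts wind",
]
print(significant_terms(docs))
```

Terms that occur in every document (here "converts") get an idf of zero and rank last, which is why they are unlikely to be selected as significant.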

Input - all input fields need to be of type str.

Output - the output field is filled with data of type list[str].

Parameters

  • type (str) – terms_extraction

  • input_field (list, []) – list of input fields to extract terms from

  • filter_list (list, []) – extra words to add to the stopword list

  • max_content_length (int, 50000) – maximum document length to process; longer documents are truncated

  • min_word_len (int) – minimum word length

  • max_chunk_len (int) – maximum chunk length

  • max_cores (int, -1) – maximum number of CPU cores to use

  • save_model (bool) – save the tf-idf model

  • cache_lemmas (bool) – cache the lemmatized words

  • p_significant_terms (float) – share of significant terms to extract (the example below uses 0.2)

  • add_lemmization (bool) – lemmatize the input text


    {
        "step": "embedder",
        "type": "terms_extraction",
        "input_field": ["text"],
        "output_field": "significant_terms",
        "max_chunk_len": 2,
        "min_word_len": 4,
        "p_significant_terms": 0.2,
        "add_lemmization": false,
        "cache_lemmas": false,
        "save_model": false
    }
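A step configuration like the one above can also be assembled in Python and serialized to JSON. The following is a minimal sketch under stated assumptions: `build_terms_extraction_step` and `check_input_types` are hypothetical helpers written for this example (they are not part of the library), and documents are modelled as plain dicts.

```python
import json


def build_terms_extraction_step(input_field, output_field, **options):
    """Assemble a terms_extraction step as a plain dict (hypothetical helper)."""
    if not isinstance(input_field, list) or not all(
        isinstance(name, str) for name in input_field
    ):
        raise TypeError("input_field must be a list of field names")
    step = {
        "step": "embedder",
        "type": "terms_extraction",
        "input_field": input_field,
        "output_field": output_field,
    }
    step.update(options)
    return step


def check_input_types(document, input_field):
    """Return the names of input fields whose values are not of type str."""
    return [name for name in input_field if not isinstance(document.get(name), str)]


step = build_terms_extraction_step(
    ["text"],
    "significant_terms",
    max_chunk_len=2,
    min_word_len=4,
    p_significant_terms=0.2,
    add_lemmization=False,
    cache_lemmas=False,
    save_model=False,
)
print(json.dumps(step, indent=4))
```

The `check_input_types` helper reflects the requirement that every input field holds a str; running it over a document before processing would surface fields that violate it.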

Methods Summary


Process a batch of documents.

Methods Documentation


Process a batch of documents. If not defined, it defaults to calling self.process_doc on each document in the batch.
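The fallback described here can be sketched as follows. This is an illustration only: the class names `SketchStep` and `UppercaseStep` are hypothetical, the method name `process_batch` is an assumption (the listing does not show it), and documents are modelled as plain dicts rather than the library's Document type.

```python
class SketchStep:
    """Hypothetical base class; not the library's actual implementation."""

    def process_doc(self, doc):
        # Default per-document behaviour: return the document unchanged.
        return doc

    def process_batch(self, batch):
        # Default batch behaviour: apply process_doc to each document in turn.
        return [self.process_doc(doc) for doc in batch]


class UppercaseStep(SketchStep):
    """Overrides only process_doc and inherits the batch fallback."""

    def process_doc(self, doc):
        doc = dict(doc)  # copy so the input document is not mutated
        doc["text"] = doc["text"].upper()
        return doc


processed = UppercaseStep().process_batch([{"text": "a"}, {"text": "b"}])
print(processed)
```

A subclass that can handle a whole batch more efficiently (for example, by fitting a model over all documents at once) would override the batch method instead of relying on this fallback.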


Parameters

batch (list(Document)) – List of documents


Returns

List of processed documents

Return type