TermsExtractionEmbedder


class TermsExtractionEmbedder(config)

Bases: Embedder

The terms extraction Embedder extracts significant terms from textual data.

Note - significant terms are extracted based on tf-idf. Before extraction the text is preprocessed: text cleaning, HTML removal, stopword removal, noun chunking, lemmatization, etc.
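
To make the tf-idf idea concrete, here is a minimal sketch in plain Python. It is not the embedder's implementation: the helper names `tfidf_scores` and `top_terms`, the unsmoothed `log(N / df)` weighting, and the way a fraction of terms is kept are all assumptions chosen for illustration, and the documents are assumed to be tokenized already.

```python
import math
from collections import Counter


def tfidf_scores(docs):
    """Return one {term: tf-idf score} dict per tokenized document."""
    n_docs = len(docs)
    # Document frequency: number of documents that contain each term.
    df = Counter(term for doc in docs for term in set(doc))
    scores = []
    for doc in docs:
        tf = Counter(doc)
        total = len(doc) or 1  # avoid division by zero for empty documents
        scores.append({
            term: (count / total) * math.log(n_docs / df[term])
            for term, count in tf.items()
        })
    return scores


def top_terms(score_map, fraction):
    """Keep the highest-scoring fraction of terms (hypothetical selection rule)."""
    # Sort by descending score, breaking ties alphabetically for stable output.
    ranked = sorted(score_map, key=lambda t: (-score_map[t], t))
    keep = max(1, math.ceil(len(ranked) * fraction))
    return ranked[:keep]


docs = [
    ["solar", "panel", "energy", "energy"],
    ["wind", "turbine", "energy"],
    ["solar", "cell", "efficiency"],
]
scores = tfidf_scores(docs)
significant = [top_terms(s, 0.5) for s in scores]
```

In this toy corpus a term that appears in only one document (such as `panel`) outranks a term shared across documents (such as `solar`), which is the behaviour tf-idf is meant to produce.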

Input - all input fields need to be of type str.

Output - the output field is filled with data of type list[str].

Parameters:

type (str) – terms_extraction
input_field (list, []) – list of fields to embed from
filter_list (list, []) – extra words to add to the stopword list
max_content_length (int, 50000) – maximum document length to process; longer documents are cut off
min_word_len (int) – minimum word length
max_chunk_len (int) – maximum chunk length
max_cores (int, -1) – maximum number of CPU cores to use
save_model (bool) – save the tf-idf model
cache_lemmas (bool) – cache the lemmatized words
p_significant_terms (float) – proportion of significant terms to extract (the example below uses 0.2)
add_lemmization (bool) – lemmatize input text
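
As a rough illustration of how `max_content_length`, `filter_list` and `min_word_len` could interact during preprocessing, consider the sketch below. The `preprocess` function, the tiny stopword set, the regex-based HTML stripping, the default of 4 for `min_word_len`, and the reading of `min_word_len` as an inclusive lower bound are all assumptions; the embedder's actual pipeline may differ.

```python
import re

# Tiny illustrative stopword list; a real pipeline would use a much larger one.
BASE_STOPWORDS = {"the", "a", "an", "of", "and", "is", "are", "to", "in"}


def preprocess(text, filter_list=(), max_content_length=50000, min_word_len=4):
    """Hypothetical preprocessing: truncate, strip HTML, drop stopwords and short words."""
    # Cut off overly long documents before any further processing.
    text = text[:max_content_length]
    # Crude tag removal with a regex; a real pipeline would use an HTML parser.
    text = re.sub(r"<[^>]+>", " ", text)
    # Extra words from filter_list are treated like ordinary stopwords.
    stopwords = BASE_STOPWORDS | {word.lower() for word in filter_list}
    tokens = re.findall(r"[a-z]+", text.lower())
    return [t for t in tokens if len(t) >= min_word_len and t not in stopwords]


tokens = preprocess(
    "<p>The turbines are spinning in the wind</p>",
    filter_list=["spinning"],
)
```

Here `spinning` is dropped because it was supplied through `filter_list`, and the remaining short or common words are removed by the length and stopword checks.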

Example

{
    "step": "embedder",
    "type": "terms_extraction",
    "input_field": ["text"],
    "output_field": "significant_terms",
    "max_chunk_len": 2,
    "min_word_len": 4,
    "p_significant_terms": 0.2,
    "add_lemmization": false,
    "cache_lemmas": false,
    "save_model": false
}
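
When a step like the one above is generated or edited by hand, a quick type check against the parameter list can catch mistakes early. The following sketch is not part of the library: `check_step` and `EXPECTED_TYPES` are hypothetical names, and the expected types are simply copied from the parameter list above.

```python
import json

RAW_STEP = """
{
    "step": "embedder",
    "type": "terms_extraction",
    "input_field": ["text"],
    "output_field": "significant_terms",
    "max_chunk_len": 2,
    "min_word_len": 4,
    "p_significant_terms": 0.2,
    "add_lemmization": false,
    "cache_lemmas": false,
    "save_model": false
}
"""

# Expected types, taken from the documented parameter list.
EXPECTED_TYPES = {
    "type": str,
    "input_field": list,
    "filter_list": list,
    "max_content_length": int,
    "min_word_len": int,
    "max_chunk_len": int,
    "max_cores": int,
    "save_model": bool,
    "cache_lemmas": bool,
    "p_significant_terms": float,
    "add_lemmization": bool,
}


def matches(value, expected):
    """Type check that keeps bool and numeric fields apart."""
    if isinstance(value, bool):
        # bool is a subclass of int in Python, so only accept it for bool fields.
        return expected is bool
    if expected is float:
        # JSON numbers without a decimal point load as int; accept both.
        return isinstance(value, (int, float))
    return isinstance(value, expected)


def check_step(step):
    """Return a list of problems; an empty list means all present keys match."""
    return [
        f"{key}: expected {expected.__name__}, got {type(step[key]).__name__}"
        for key, expected in EXPECTED_TYPES.items()
        if key in step and not matches(step[key], expected)
    ]


step = json.loads(RAW_STEP)
problems = check_step(step)
```

Keys that are absent are skipped, and keys outside the documented parameter list (such as `step` and `output_field`) are left unchecked.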

Methods Summary

process_batch(batch)

Process a batch of documents.

Methods Documentation

process_batch(batch)

Process a batch of documents. If a subclass does not define this method, it defaults to calling self.process_doc on each document in the batch.

Parameters: batch (list(Document)) – List of documents
Returns: List of processed documents
Return type: list(Document)
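
The fallback from `process_batch` to `process_doc` can be pictured with the sketch below. `Document`, `SketchStep` and `UppercaseTitle` are hypothetical stand-ins written for this example, not the library's classes; only the per-document delegation mirrors the behaviour described above.

```python
class Document(dict):
    """Hypothetical stand-in for the pipeline's document type."""


class SketchStep:
    """Hypothetical base class showing the documented default."""

    def process_doc(self, doc):
        raise NotImplementedError

    def process_batch(self, batch):
        # Default: handle the batch one document at a time.
        return [self.process_doc(doc) for doc in batch]


class UppercaseTitle(SketchStep):
    """Toy step that only defines process_doc and inherits process_batch."""

    def process_doc(self, doc):
        doc["title"] = doc.get("title", "").upper()
        return doc


batch = [Document(title="first"), Document(title="second")]
processed = UppercaseTitle().process_batch(batch)
```

A step that can work on many documents at once, for example to fit a model over the whole batch, would instead override `process_batch` directly.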
