TermsExtractionEmbedder
- class TermsExtractionEmbedder(config)
Bases: Embedder

The term extraction Embedder extracts significant terms from the provided textual data.

Note - the significant terms are extracted based on tf/idf. Some text cleaning, HTML removal, stopword removal, noun chunking, lemmatization, etc. is done prior to the extraction.
Input - all input fields need to be of type str.

Output - the output field is filled with data of type list[str].

- Parameters
type (str) – terms_extraction
input_field (list, []) – list of fields to embed from
filter_list (list, []) – extra words to be added to the stopword list
max_content_length (int, 50000) – maximum document length to be processed; longer documents are truncated
min_word_len (int) – minimal word length
max_chunk_len (int) – max chunk length
max_cores (int, -1) – maximum number of CPU cores to use
save_model (bool) – save the tf-idf model
cache_lemmas (bool) – cache the lemmatized words
p_significant_terms (float) – percentage of significant terms to be extracted
add_lemmization (bool) – lemmatize input text
Example
{ "step": "embedder", "type": "terms_extraction", "input_field": ["text"], "output_field": "significant_terms", "max_chunk_len": 2, "min_word_len": 4, "p_significant_terms": 0.2, "add_lemmization": false, "cache_lemmas": false, "save_model": false }
Methods Summary
process_batch(batch) – Process a batch of documents.
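A hedged sketch of calling process_batch. The batch is assumed to be a list of dict-like documents whose input fields hold strings; per the Input/Output contract above, the configured output field should end up holding a list[str]. Whether process_batch returns the processed documents or modifies them in place is not stated here, so the return handling below is an assumption.

# Sketch only: the document structure and the return value of process_batch are assumptions.
batch = [
    {"text": "Term frequency weights words by how often they appear in a document."},
    {"text": "Inverse document frequency down-weights words that appear in many documents."},
]

processed = embedder.process_batch(batch)  # embedder constructed as in the sketch above

for doc in processed:
    # Expected per the Output contract: a list[str] of significant terms.
    print(doc["significant_terms"])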
Methods Documentation