TermsExtractionEmbedder

class squirro.lib.nlp.steps.embedders.TermsExtractionEmbedder(config)

Bases: squirro.lib.nlp.steps.embedders.Embedder

Terms Extraction Embedder.

Parameters
  • type (str) – terms_extraction

  • input_field (list, []) – terms_extraction

  • filter_list (list, []) – extra words to add to the stopword list

  • max_content_length (int, 50000) – maximum document length to be processed; longer documents are truncated

  • min_word_len (int) – minimum word length

  • max_chunk_len (int) – maximum chunk length

  • max_cores (int, -1) – maximum number of CPU cores to use

  • save_model (bool) – save the tf-idf model

  • cache_lemmas (bool) – cache the lemmatized words

  • p_significant_terms (float) – percentage of significant terms to be extracted

  • add_lemmization (bool) – whether to apply lemmatization (true/false)
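
As a rough illustration of how the parameters above might be combined, the following sketch builds a step configuration as a plain Python dictionary. Only the key names and the two stated defaults (50000 and -1) come from the parameter list. Every other value, including the field name "body" and the scale used for p_significant_terms, is an assumption chosen for the example, and the helper unknown_keys is hypothetical rather than part of the library.

```python
# Keys taken from the documented parameter list of TermsExtractionEmbedder.
DOCUMENTED_KEYS = {
    "type", "input_field", "filter_list", "max_content_length",
    "min_word_len", "max_chunk_len", "max_cores", "save_model",
    "cache_lemmas", "p_significant_terms", "add_lemmization",
}

# Illustrative configuration; values marked "assumed" are placeholders.
terms_extraction_step = {
    "type": "terms_extraction",
    "input_field": ["body"],        # assumed field name
    "filter_list": ["inc", "ltd"],  # assumed extra stopwords
    "max_content_length": 50000,    # documented default
    "min_word_len": 3,              # assumed
    "max_chunk_len": 4,             # assumed
    "max_cores": -1,                # documented default
    "save_model": False,            # assumed
    "cache_lemmas": True,           # assumed
    "p_significant_terms": 0.2,     # assumed; the expected scale is not documented
    "add_lemmization": True,        # assumed
}


def unknown_keys(config):
    """Hypothetical helper: return keys that are not in the documented list."""
    return set(config) - DOCUMENTED_KEYS


# Catch typos in parameter names before handing the config to a pipeline.
typos = unknown_keys(terms_extraction_step)
```

A check like unknown_keys is useful here because several parameter names are easy to mistype (for example add_lemmization), and whether the library itself rejects unrecognized keys is not stated in this reference.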

Methods Summary

process_batch(batch)

Process a batch of documents.

Methods Documentation

process_batch(batch)

Process a batch of documents. If not defined, defaults to calling self.process_doc on each document in the batch.

Parameters

batch (list(Document)) – List of documents

Returns

List of processed documents

Return type

list(Document)
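
The fallback described for process_batch (calling process_doc on each document when no batch implementation is defined) can be sketched as follows. This is a minimal, self-contained illustration and not the library's code: Document, SketchStep, and UppercaseStep are hypothetical stand-ins, and the real Document type may have a different shape. Since this class lists process_batch among its own methods, it may well override the fallback; the sketch only shows the default behavior the description refers to.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class Document:
    """Hypothetical stand-in for the library's Document type."""
    id: str
    fields: Dict[str, str] = field(default_factory=dict)


class SketchStep:
    """Hypothetical base step showing the per-document fallback."""

    def process_doc(self, doc: Document) -> Document:
        # Default: return the document unchanged.
        return doc

    def process_batch(self, batch: List[Document]) -> List[Document]:
        # Fallback: apply process_doc to each document in order.
        return [self.process_doc(doc) for doc in batch]


class UppercaseStep(SketchStep):
    """Toy step that defines only process_doc and inherits process_batch."""

    def process_doc(self, doc: Document) -> Document:
        doc.fields = {name: text.upper() for name, text in doc.fields.items()}
        return doc


docs = [Document("1", {"body": "alpha"}), Document("2", {"body": "beta"})]
processed = UppercaseStep().process_batch(docs)
```

A step that benefits from seeing the whole batch at once (for instance, to compute corpus-level statistics) would override process_batch instead of relying on this per-document loop.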