Tokenizer

class squirro.lib.nlp.steps.tokenizers.Tokenizer(config)

Bases: squirro.lib.nlp.steps.batched_step.BatchedStep

The Tokenizer step splits the specified fields into tokens for use by a downstream step.

Parameters
  • type (str) – Type of Tokenizer (sentences, word_ngrams, or spaces)

  • fields (list, []) – List of fields to tokenize (in place)

  • input_fields (list, None) – List of fields to read the text to tokenize from (defaults to fields)

  • output_fields (list, None) – List of fields to write the resulting tokens to (defaults to fields)
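
Example

The following is a minimal, standalone sketch of how the parameters above could interact. It is not the Squirro implementation and does not import squirro.lib.nlp. The helper names (split_spaces, split_sentences, word_ngrams, tokenize_fields), the n-gram size n, the naive sentence-splitting rule, and the pairing of input_fields with output_fields by position are all assumptions made for illustration. The argument is called kind here to avoid shadowing Python's built-in type.

```python
import re


def split_spaces(text):
    # "spaces": split on runs of whitespace (assumed behaviour).
    return text.split()


def split_sentences(text):
    # "sentences": naive split after ., ! or ? followed by whitespace.
    # A real sentence tokenizer is likely more sophisticated.
    return [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]


def word_ngrams(text, n=2):
    # "word_ngrams": sliding window of n whitespace-separated words.
    # The parameter n is hypothetical; it is not documented above.
    words = text.split()
    return [" ".join(words[i:i + n]) for i in range(len(words) - n + 1)]


TOKENIZERS = {
    "spaces": split_spaces,
    "sentences": split_sentences,
    "word_ngrams": word_ngrams,
}


def tokenize_fields(doc, kind, fields=(), input_fields=None, output_fields=None):
    # input_fields and output_fields both fall back to fields,
    # which gives the "in place" behaviour described above.
    input_fields = input_fields or list(fields)
    output_fields = output_fields or list(fields)
    tokenizer = TOKENIZERS[kind]
    # Assumption: the i-th input field is written to the i-th output field.
    for src, dst in zip(input_fields, output_fields):
        if src in doc:
            doc[dst] = tokenizer(doc[src])
    return doc


# In place: "body" is replaced by its tokens.
in_place = tokenize_fields({"body": "Hello world. Bye now."}, "spaces", fields=["body"])

# Separate output: "body" is kept, sentences are written to "sentences".
separate = tokenize_fields(
    {"body": "Hello world. Bye now."},
    "sentences",
    input_fields=["body"],
    output_fields=["sentences"],
)
```

In this sketch, in_place["body"] becomes ["Hello", "world.", "Bye", "now."], while separate keeps the original "body" string and gains a "sentences" field containing ["Hello world.", "Bye now."]. Use separate input and output fields when a later step still needs the untokenized text.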