Tokenizer

class Tokenizer(config)

Bases: BatchedStep

The Tokenizer step takes specified fields and splits them into tokens to be used by a downstream step.

Parameters:
  • type (str) – Tokenizer type: one of sentences, word_ngrams, or spaces

  • fields (list, []) – List of fields to tokenize (in place)

  • input_fields (list, None) – List of fields to tokenize from (defaults to fields)

  • output_fields (list, None) – List of fields to tokenize to (defaults to fields)
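
The field parameters above can be illustrated with a minimal standalone sketch. This is not the library's implementation: the helper names (resolve_fields, tokenize_spaces, tokenize_record) are hypothetical, only a whitespace split standing in for the spaces type is shown, and the positional pairing of input_fields with output_fields is an assumption.

```python
def resolve_fields(config):
    """Return (source, destination) field pairs for a config dict.

    Assumption: input_fields and output_fields both fall back to
    fields, and are paired by position.
    """
    fields = config.get("fields") or []
    input_fields = config.get("input_fields") or fields
    output_fields = config.get("output_fields") or fields
    if len(input_fields) != len(output_fields):
        raise ValueError("input_fields and output_fields must have the same length")
    return list(zip(input_fields, output_fields))


def tokenize_spaces(text):
    """Split text on runs of whitespace (stand-in for the spaces type)."""
    return text.split()


def tokenize_record(record, config):
    """Return a copy of record with each configured field tokenized."""
    out = dict(record)
    for src, dst in resolve_fields(config):
        out[dst] = tokenize_spaces(record[src])
    return out


record = {"title": "hello big world"}

# With only fields set, the field is tokenized in place.
in_place = tokenize_record(record, {"type": "spaces", "fields": ["title"]})

# With separate input and output fields, the source field is left untouched.
separate = tokenize_record(
    record,
    {"type": "spaces", "input_fields": ["title"], "output_fields": ["title_tokens"]},
)
```

In this sketch, in_place replaces the title string with its token list, while separate keeps the original title and writes the tokens to title_tokens.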