Tokenizers Package

Functions

make_tokenizer(config)

Tokenizer factory: creates a tokenizer from the given config.
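The factory idea can be illustrated with a small registry lookup. This is only a sketch of the general pattern, not the package's implementation: the `type` config key, the registry, and the `SpaceSplitter` and `LineSplitter` classes are hypothetical names introduced for the example.

```python
class SpaceSplitter:
    """Hypothetical tokenizer that splits text on whitespace."""

    def __init__(self, config):
        self.config = config

    def tokenize(self, text):
        return text.split()


class LineSplitter:
    """Hypothetical tokenizer that splits text on line breaks."""

    def __init__(self, config):
        self.config = config

    def tokenize(self, text):
        return text.splitlines()


# Assumed mapping from a config value to a tokenizer class.
_REGISTRY = {"spaces": SpaceSplitter, "lines": LineSplitter}


def make_tokenizer_sketch(config):
    """Pick a tokenizer class from the (assumed) "type" key and instantiate it."""
    kind = config.get("type")
    try:
        cls = _REGISTRY[kind]
    except KeyError:
        raise ValueError(f"unknown tokenizer type: {kind!r}") from None
    return cls(config)


tok = make_tokenizer_sketch({"type": "spaces"})
words = tok.tokenize("a b  c")
```

Keeping the lookup in one place means callers only need a config and never have to import a concrete tokenizer class.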

Classes

HtmlTokenizer(config)

The HTML Tokenizer splits the input fields into sentences at HTML tag boundaries.
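To make tag-based splitting concrete, here is a minimal sketch using only the standard library's `html.parser`. It is not the class's actual logic: the helper names are hypothetical, and it treats every tag as a boundary, which the real tokenizer may or may not do.

```python
from html.parser import HTMLParser


class _TagSplitter(HTMLParser):
    """Collects each run of text found between two tags as one segment."""

    def __init__(self):
        super().__init__()
        self.segments = []

    def handle_data(self, data):
        text = data.strip()
        if text:  # skip whitespace-only runs between tags
            self.segments.append(text)


def split_by_html_tags(html):
    """Return the non-empty text segments of `html`, split at every tag."""
    parser = _TagSplitter()
    parser.feed(html)
    parser.close()  # flush any text still buffered at the end of the input
    return parser.segments


segments = split_by_html_tags(
    "<h1>Title</h1><p>First paragraph.</p><p>Second paragraph.</p>"
)
# segments == ["Title", "First paragraph.", "Second paragraph."]
```

Because every tag ends a segment in this sketch, inline tags such as `<b>` would also split a sentence; a production tokenizer would likely distinguish block-level from inline tags.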

PdfSentencesTokenizer(config)

The PDF Sentences Tokenizer splits PDF files into sentences and keeps track of the positional information of each extracted sentence.

SentencesNLTKTokenizer(config)

The Sentences Tokenizer splits the input fields into sentences using NLTK and additional custom rules.

SpacesTokenizer(config)

The Spaces Tokenizer splits the input fields on spaces.
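As a rough illustration of space-based splitting over selected fields, consider the sketch below. The function name, the dict-shaped item, and the choice to drop empty tokens from runs of spaces are all assumptions made for the example; the class's real behavior and config options are not described here.

```python
def split_fields_on_spaces(item, fields):
    """Return a copy of `item` with each listed string field split on spaces."""
    result = dict(item)
    for field in fields:
        value = item.get(field)
        if isinstance(value, str):
            # Split on single spaces and drop the empty strings that
            # consecutive spaces would otherwise produce (an assumption).
            result[field] = [tok for tok in value.split(" ") if tok]
    return result


doc = {"title": "Hello  tokenizer world", "id": 7}
tokens = split_fields_on_spaces(doc, ["title"])
# tokens["title"] == ["Hello", "tokenizer", "world"]
```

Fields that are not listed, or that are not strings, pass through unchanged, and the input item itself is not modified.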

Tokenizer(config)

The Tokenizer step takes specified fields and splits them into tokens to be used by a downstream step.