tokenizers package

Functions

make_tokenizer(config)

Factory function that builds a tokenizer from the given config.
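
As a rough illustration of the factory pattern, the sketch below picks a class from a registry based on a config key. The `"type"` key, the registry, `build_tokenizer`, and `WhitespaceSplitter` are all hypothetical names chosen for this example; the config schema that `make_tokenizer` actually accepts is not documented on this page.

```python
from typing import Callable, Dict, List


class WhitespaceSplitter:
    """Hypothetical stand-in for a tokenizer class; not the package's own."""

    def __init__(self, config: dict):
        self.config = config

    def tokenize(self, text: str) -> List[str]:
        return text.split()


# Hypothetical registry: maps a config "type" value to a constructor.
_REGISTRY: Dict[str, Callable[[dict], object]] = {
    "spaces": WhitespaceSplitter,
}


def build_tokenizer(config: dict):
    """Look up config["type"] in the registry and construct that class."""
    kind = config.get("type")
    if kind not in _REGISTRY:
        raise ValueError(f"unknown tokenizer type: {kind!r}")
    return _REGISTRY[kind](config)
```

A registry keeps the choice of class in data rather than in an if/else chain, so adding a tokenizer means adding one entry.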

Classes

HtmlTokenizer(config)

The HTML Tokenizer splits the input fields into sentences at HTML tag boundaries.
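
One way to split text at tag boundaries is with the standard library's `html.parser`, as in the minimal sketch below. It is not the `HtmlTokenizer` implementation; `split_by_tags` and `_SegmentCollector` are hypothetical names, and treating every tag (including inline ones) as a boundary is an assumption.

```python
from html.parser import HTMLParser
from typing import List


class _SegmentCollector(HTMLParser):
    """Collects each run of text found between two tags."""

    def __init__(self):
        super().__init__()
        self.segments: List[str] = []

    def handle_data(self, data: str) -> None:
        # Collapse internal whitespace and skip whitespace-only runs.
        text = " ".join(data.split())
        if text:
            self.segments.append(text)


def split_by_tags(html: str) -> List[str]:
    """Return the text segments of `html`, split wherever a tag occurs."""
    collector = _SegmentCollector()
    collector.feed(html)
    collector.close()  # flush any text left in the parser's buffer
    return collector.segments
```

For example, `split_by_tags("<h1>Title</h1><p>Body text.</p>")` yields the two segments `"Title"` and `"Body text."`.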

PdfSentencesTokenizer(config)

PDF Sentences Tokenizer splits PDF files into sentences and keeps track of the positional information of each extracted sentence.
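
This page does not say what form the positional information takes. One plausible reading, sketched below, is a page index plus character offsets into that page's extracted text. The sketch assumes the page text has already been extracted from the PDF; `LocatedSentence`, `locate_sentences`, and the naive punctuation rule are hypothetical and stand in for whatever the class does internally.

```python
import re
from typing import List, NamedTuple


class LocatedSentence(NamedTuple):
    page: int   # zero-based page index
    start: int  # character offset of the sentence within the page text
    end: int    # offset one past the last character
    text: str


# Naive rule: a sentence is a run of text ending in ., ! or ? (or end of page).
_SENTENCE = re.compile(r"[^.!?]+[.!?]*")


def locate_sentences(pages: List[str]) -> List[LocatedSentence]:
    """Split each page's text into sentences and record where each one sits."""
    located = []
    for page_index, page_text in enumerate(pages):
        for match in _SENTENCE.finditer(page_text):
            raw = match.group()
            stripped = raw.strip()
            if not stripped:
                continue
            # Shift the start past leading whitespace so offsets point at the text.
            start = match.start() + (len(raw) - len(raw.lstrip()))
            located.append(
                LocatedSentence(page_index, start, start + len(stripped), stripped)
            )
    return located
```

With offsets stored this way, `pages[s.page][s.start:s.end]` recovers the text of any located sentence `s`.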

SentencesNLTKTokenizer(config)

Sentences Tokenizer splits the input fields into sentences using NLTK plus additional custom rules.
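
The custom rules are not listed here, so the sketch below only illustrates the general shape: run a base sentence splitter, then post-process its output. The abbreviation-merging rule, `ABBREVIATIONS`, `naive_split`, and `split_sentences` are hypothetical. The base splitter is a simple regex so the example runs without extra downloads; NLTK's `sent_tokenize` could be passed as `base` instead.

```python
import re
from typing import Callable, List

# Hypothetical abbreviation list; the package's actual custom rules are not documented here.
ABBREVIATIONS = ("e.g.", "i.e.", "Dr.", "Fig.")


def naive_split(text: str) -> List[str]:
    """Stand-in base splitter: break after ., ! or ? followed by whitespace."""
    return [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]


def split_sentences(
    text: str, base: Callable[[str], List[str]] = naive_split
) -> List[str]:
    """Run the base splitter, then rejoin pieces that end in a known abbreviation."""
    merged: List[str] = []
    for piece in base(text):
        if merged and merged[-1].endswith(ABBREVIATIONS):
            merged[-1] = merged[-1] + " " + piece
        else:
            merged.append(piece)
    return merged
```

Keeping the rules as a separate pass over the splitter's output means they can be tested and changed independently of the splitter.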

SpacesTokenizer(config)

Spaces Tokenizer splits the input fields on spaces.
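
"Splitting on spaces" can mean two slightly different things in Python, and this page does not say which one the class uses. The sketch below shows both; the function names are hypothetical.

```python
from typing import List


def split_on_whitespace(text: str) -> List[str]:
    # str.split() with no argument splits on runs of any whitespace
    # and never produces empty tokens.
    return text.split()


def split_on_single_space(text: str) -> List[str]:
    # str.split(" ") splits on every single space character, so
    # consecutive spaces produce empty strings in the result.
    return text.split(" ")
```

The two agree on text with single spaces between words and differ only when spaces repeat or other whitespace (tabs, newlines) appears.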

Tokenizer(config)

The Tokenizer step takes specified fields and splits them into tokens to be used by a downstream step.
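
A minimal sketch of that idea follows, assuming a record is a plain dict of strings and the fields to process are given by name. `tokenize_fields` and its parameters are hypothetical; the real step's config keys and output format are not described on this page.

```python
from typing import Callable, Dict, List


def tokenize_fields(
    record: Dict[str, str],
    fields: List[str],
    split: Callable[[str], List[str]] = str.split,
) -> Dict[str, List[str]]:
    """Return {field: tokens} for each requested field present in the record."""
    return {name: split(record[name]) for name in fields if name in record}
```

For instance, calling it on `{"title": "Hello world", "id": "42"}` with `fields=["title"]` tokenizes only the title and leaves `id` out of the result, so a downstream step sees tokens only for the fields it asked for.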