HtmlTokenizer (config)
|
The HTML Tokenizer splits the input fields into sentences at HTML tags. |
PdfSentencesTokenizer (config)
|
The PDF Sentences Tokenizer splits PDF files into sentences and keeps track of the positional information of the extracted sentences. |
SentencesNLTKTokenizer (config)
|
The Sentences Tokenizer splits the input fields into sentences using NLTK and additional custom rules. |
SpacesTokenizer (config)
|
The Spaces Tokenizer splits the input fields on spaces. |
Tokenizer (config)
|
The Tokenizer step takes specified fields and splits them into tokens to be used by a downstream step. |
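
The idea behind these steps (take the specified fields of a document and replace their text with tokens for a downstream step) can be illustrated with a minimal, library-independent sketch. This is not the API of the classes listed above: the function names `split_on_spaces`, `split_on_html_tags`, and `tokenize_fields`, the dict-based document, and the field names are all assumptions made for illustration only.

```python
import re


def split_on_spaces(text):
    """Split on runs of whitespace and drop empty tokens (hypothetical helper)."""
    return text.split()


def split_on_html_tags(text):
    """Split at HTML tags and keep the non-empty text segments (hypothetical helper)."""
    parts = re.split(r"<[^>]+>", text)
    return [p.strip() for p in parts if p.strip()]


def tokenize_fields(doc, fields, splitter):
    """Return a copy of doc in which each listed field holds its tokens.

    Fields that are not listed, or not present in doc, are left unchanged.
    """
    out = dict(doc)
    for name in fields:
        if name in doc:
            out[name] = splitter(doc[name])
    return out


# Assumed example document; the field names are illustrative.
doc = {
    "title": "Quarterly report",
    "body": "<p>Revenue rose.</p><p>Costs fell.</p>",
}

by_space = tokenize_fields(doc, ["title"], split_on_spaces)
by_tag = tokenize_fields(doc, ["body"], split_on_html_tags)

print(by_space["title"])
print(by_tag["body"])
```

Passing the splitter as a parameter mirrors the way the table lists one generic step alongside specialised variants; the actual classes may organise this differently.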