Tokenizers Package

Tokenizer factory

`HtmlTokenizer`(config)	The HTML `Tokenizer` splits the input fields into sentences at HTML tag boundaries.
`PdfPagesTokenizer`(config)	PDF Pages Extractor: reads each PDF in your files field and extracts full-page text content using PyMuPDF (fitz).
`PdfSentencesTokenizer`(config)	PDF Sentences `Tokenizer` splits the PDF files by sentences and keeps track of the positional information of the extracted sentences.
`SentencesNLTKTokenizer`(config)	Sentences `Tokenizer` splits the input fields by sentences with NLTK and additional custom rules.
`SpacesTokenizer`(config)	Spaces `Tokenizer` that splits the input fields on spaces.
`Tokenizer`(config)	The `Tokenizer` step takes specified fields and splits them into tokens to be used by a downstream step.
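
To illustrate the two simplest splitting strategies listed above (splitting on spaces and splitting at HTML tags), here is a minimal standalone sketch using only the Python standard library. It does not use this package's classes or its `config` object: the function names `split_on_spaces` and `split_on_html_tags`, the helper class `_TextChunks`, and the dict-based record layout are all hypothetical and chosen for illustration only. The real tokenizers may apply extra rules, such as sentence detection.

```python
from html.parser import HTMLParser


def split_on_spaces(record, fields):
    """Split each named field of a record on runs of whitespace.

    Hypothetical helper: `record` is assumed to be a plain dict of strings.
    """
    return {name: record[name].split() for name in fields}


class _TextChunks(HTMLParser):
    """Collect the non-empty text runs found between HTML tags."""

    def __init__(self):
        super().__init__()
        self.chunks = []

    def handle_data(self, data):
        text = data.strip()
        if text:
            self.chunks.append(text)


def split_on_html_tags(html):
    """Return the text segments of `html`, cut at every tag boundary."""
    parser = _TextChunks()
    parser.feed(html)
    parser.close()  # flush any text still buffered by the parser
    return parser.chunks


if __name__ == "__main__":
    record = {"title": "hello  big world", "id": "7"}
    print(split_on_spaces(record, ["title"]))
    print(split_on_html_tags("<p>First one.</p><p>Second <b>bold</b> bit.</p>"))
```

Note that cutting at every tag also splits inline markup such as `<b>`, so a sentence-oriented tokenizer would likely need to treat inline and block-level tags differently.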