HtmlTokenizer
- class HtmlTokenizer(config)
Bases: Tokenizer
The HTML Tokenizer splits the input fields by HTML tags into sentences.
Input - all input fields need to be of type str.
Output - all output fields are filled with data of type list[str].
- Parameters:
type (str) – html
Example
{
    "name": "html",
    "step": "tokenizer",
    "type": "html",
    "input_fields": ["body"],
    "output_fields": ["tokenized_body"]
}
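To illustrate the input and output types described above, here is a minimal Python sketch of splitting a str field at HTML tags into a list[str]. It is not the HtmlTokenizer implementation: it uses only the standard library's html.parser, and the names split_by_html_tags and process_doc_sketch are hypothetical helpers introduced for this example. It also assumes a document is a plain dict and that input and output fields are paired by position.

```python
from html.parser import HTMLParser


class _TextSegmentCollector(HTMLParser):
    """Collects the text found between HTML tags as separate segments."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.segments = []

    def handle_data(self, data):
        # Collapse runs of whitespace and skip segments that are empty.
        text = " ".join(data.split())
        if text:
            self.segments.append(text)


def split_by_html_tags(html):
    """Hypothetical helper: return the text segments separated by tags."""
    collector = _TextSegmentCollector()
    collector.feed(html)
    collector.close()  # flush any text still buffered by the parser
    return collector.segments


def process_doc_sketch(doc, input_fields, output_fields):
    """Hypothetical stand-in for process_doc: str fields in, list[str] out."""
    for src, dst in zip(input_fields, output_fields):
        doc[dst] = split_by_html_tags(doc[src])
    return doc


doc = {"body": "<h1>Title</h1><p>First paragraph.</p><p>Second one.</p>"}
result = process_doc_sketch(doc, ["body"], ["tokenized_body"])
print(result["tokenized_body"])
# ['Title', 'First paragraph.', 'Second one.']
```

In this sketch every tag acts as a boundary, so inline tags such as <b> also split the text. The real tokenizer may treat such cases differently.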
Methods Summary
process_doc(doc) – Process a document
Methods Documentation