HtmlTokenizer
- class HtmlTokenizer(config)
  Bases: Tokenizer

  The HTML Tokenizer splits the input fields by HTML tags into sentences.

  Input: all input fields need to be of type str.
  Output: all output fields are filled with data of type list[str].

  Parameters:
  - type (str) – html
  Example:

      {
        "name": "html",
        "step": "tokenizer",
        "type": "html",
        "input_fields": ["body"],
        "output_fields": ["tokenized_body"]
      }

  Methods Summary
  - process_doc(doc): Process a document

  Methods Documentation
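To make the documented behavior concrete (each `str` input field is split by HTML tags and written to the matching output field as `list[str]`), here is a minimal sketch in plain Python. It is not the library's implementation: `split_by_html_tags` and `process_doc_sketch` are hypothetical names, the regex-based tag matching is an assumption, and the real `HtmlTokenizer.process_doc(doc)` may split and clean text differently.

```python
from __future__ import annotations

import re

# Assumption: anything of the form <...> counts as an HTML tag.
# The actual tokenizer may use a proper HTML parser instead.
_TAG_RE = re.compile(r"<[^>]+>")


def split_by_html_tags(text: str) -> list[str]:
    """Split text at every HTML tag and keep the non-empty, stripped pieces."""
    pieces = _TAG_RE.split(text)
    return [piece.strip() for piece in pieces if piece.strip()]


def process_doc_sketch(
    doc: dict, input_fields: list[str], output_fields: list[str]
) -> dict:
    """Hypothetical stand-in for process_doc: fill each output field
    with the tag-split pieces of the corresponding input field."""
    for src, dst in zip(input_fields, output_fields):
        value = doc[src]
        if not isinstance(value, str):
            # Mirrors the documented requirement that inputs are of type str.
            raise TypeError(f"input field {src!r} must be of type str")
        doc[dst] = split_by_html_tags(value)
    return doc


doc = {"body": "<p>First sentence.</p><p>Second sentence.</p>"}
result = process_doc_sketch(doc, ["body"], ["tokenized_body"])
print(result["tokenized_body"])  # ['First sentence.', 'Second sentence.']
```

The field names `body` and `tokenized_body` are taken from the configuration example above. Note that this sketch also splits at inline tags such as `<b>`, which may or may not match the real tokenizer.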
