HtmlTokenizer
- class HtmlTokenizer(config)
Bases: Tokenizer
The HTML Tokenizer splits the input fields into sentences at HTML tags.
Input - all input fields need to be of type str.
Output - all output fields are filled with data of type list[str].
- Parameters:
type (str) – html
Example
{
    "name": "html",
    "step": "tokenizer",
    "type": "html",
    "input_fields": ["body"],
    "output_fields": ["tokenized_body"]
}
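To illustrate the kind of transformation this step describes (a str field in, a list[str] field out), here is a minimal sketch using only Python's standard library. It is not the HtmlTokenizer implementation: the helper names `_TagSplitter` and `split_by_tags` are hypothetical, and treating every tag as a sentence boundary is an assumption made for illustration.

```python
from html.parser import HTMLParser


class _TagSplitter(HTMLParser):
    """Hypothetical helper: collects the text found between tags."""

    def __init__(self):
        super().__init__()
        self.segments = []

    def handle_data(self, data):
        # Collapse runs of whitespace and skip segments that are empty.
        text = " ".join(data.split())
        if text:
            self.segments.append(text)


def split_by_tags(html):
    """Return the text segments of an HTML string, split at every tag."""
    parser = _TagSplitter()
    parser.feed(html)
    parser.close()  # flush any text still buffered after the last tag
    return parser.segments


# Mirror the example config: read "body", write "tokenized_body".
doc = {"body": "<h1>Title</h1><p>First paragraph.</p><p>Second one.</p>"}
doc["tokenized_body"] = split_by_tags(doc["body"])
print(doc["tokenized_body"])
```

In this sketch inline tags such as `<b>` also act as boundaries; the actual tokenizer may treat them differently.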
Methods Summary
process_doc(doc) – Process a document.
Methods Documentation