HtmlTokenizer

class HtmlTokenizer(config)

Bases: Tokenizer

The HTML tokenizer splits the text of each input field into sentences, using HTML tags as the split points.

Input: every input field must be of type str.

Output: every output field is filled with data of type list[str].
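
To make the str to list[str] contract concrete, here is a minimal sketch of tag-based splitting. It is not the library's implementation: the helper name split_by_html_tags and the regular expression are assumptions chosen for illustration, and the real tokenizer may treat whitespace, nested tags, or entities differently.

```python
import re

# Hypothetical helper, not part of the library.
# Assumption: anything of the form <...> counts as an HTML tag.
TAG_PATTERN = re.compile(r"<[^>]+>")


def split_by_html_tags(text: str) -> list[str]:
    """Split text at HTML tags and drop empty or whitespace-only pieces."""
    pieces = TAG_PATTERN.split(text)
    return [piece.strip() for piece in pieces if piece.strip()]


sentences = split_by_html_tags("<p>First sentence.</p><p>Second one.<br/>Third.</p>")
print(sentences)  # ['First sentence.', 'Second one.', 'Third.']
```

A regular expression keeps the sketch short; for malformed or heavily nested markup, a real HTML parser would be a safer basis.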

Example

{
    "name": "html",
    "step": "tokenizer",
    "type": "html",
    "input_fields": ["body"],
    "output_fields": ["tokenized_body"]
}
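
The following sketch illustrates how a configuration like the one above might be applied to a document. It rests on two assumptions that the source does not confirm: that a document behaves like a dict, and that input_fields and output_fields are paired by position. The function tokenize_fields is a hypothetical stand-in, not the class's actual process_doc.

```python
import re

config = {
    "name": "html",
    "step": "tokenizer",
    "type": "html",
    "input_fields": ["body"],
    "output_fields": ["tokenized_body"],
}


def tokenize_fields(doc: dict, config: dict) -> dict:
    """Hypothetical stand-in for the tokenizer step, not the library's code."""
    # Assumption: the n-th input field feeds the n-th output field.
    for source, target in zip(config["input_fields"], config["output_fields"]):
        pieces = re.split(r"<[^>]+>", doc[source])
        # Keep only non-empty pieces so the output is a clean list of str.
        doc[target] = [piece.strip() for piece in pieces if piece.strip()]
    return doc


doc = tokenize_fields({"body": "<h1>Title</h1><p>Some text.</p>"}, config)
print(doc["tokenized_body"])  # ['Title', 'Some text.']
```

The input field is left untouched in this sketch, and the result is written to a separate output field, which matches the distinct input_fields and output_fields entries in the example.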

Methods Summary

process_doc(doc)    Process a document

Methods Documentation

process_doc(doc)

Process a document