HTMLMappingNormalizer

class HTMLMappingNormalizer(config)

Bases: Normalizer

The HTML mapping normalizer removes HTML markup from the input fields and writes the mapping index to the output mapping fields.

Input - all input fields need to be of type str.

Output - all output fields are filled with data of type str.
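
As a rough illustration of removing markup while keeping a mapping index, the sketch below strips tags with a simple character scan and records, for every kept character, its position in the original string. It is not the library's implementation: the function name `strip_html_with_mapping` and the reading of the mapping index as a list of original character offsets are assumptions, and the scan ignores comments, entities, and `>` inside attribute values.

```python
from typing import List, Tuple


def strip_html_with_mapping(html: str) -> Tuple[str, List[int]]:
    """Hypothetical helper: drop tags and map kept characters to original offsets.

    Returns (text, mapping) where mapping[i] is the index in `html`
    of the character text[i].
    """
    text_chars: List[str] = []
    mapping: List[int] = []
    in_tag = False
    for index, char in enumerate(html):
        if char == "<":
            in_tag = True          # start of a tag: skip until the closing '>'
        elif char == ">" and in_tag:
            in_tag = False         # end of the tag
        elif not in_tag:
            text_chars.append(char)
            mapping.append(index)  # remember where this character came from
    return "".join(text_chars), mapping


text, mapping = strip_html_with_mapping("<p>Hi <b>you</b></p>")
print(text)     # Hi you
print(mapping)  # [3, 4, 5, 9, 10, 11]
```

With such a mapping, a span found in the normalized text can be traced back to its offsets in the original HTML.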

Parameters:
  • type (str) – html_mapping

  • parse_html5 (bool, False) – If True, parse the HTML document according to the HTML5 standard

  • output_mapping_fields (list, None) – Names of the output fields that will contain the mapping index

Example

{
    "step": "normalizer",
    "type": "html_mapping",
    "input_fields": ["body"],
    "parse_html5": false,
    "output_fields": ["normalized_body"],
    "mapping_index_fields": ["mapping_index_body"]
}
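
When the step configuration is assembled in Python, it can be built as a dictionary and serialized, which avoids hand-written JSON syntax slips such as missing or trailing commas. This is a minimal sketch: the keys are copied from the example above, and the variable names are illustrative only.

```python
import json

# Step configuration mirroring the example above (keys copied as-is).
step_config = {
    "step": "normalizer",
    "type": "html_mapping",
    "input_fields": ["body"],
    "parse_html5": False,
    "output_fields": ["normalized_body"],
    "mapping_index_fields": ["mapping_index_body"],
}

# json.dumps always emits valid JSON; Python's False becomes JSON false.
serialized = json.dumps(step_config, indent=4)
print(serialized)

# Round-trip check: the serialized text parses back to the same structure.
restored = json.loads(serialized)
```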

Methods Summary

process_doc(doc)

Process a document

Methods Documentation

process_doc(doc)

Process a document

Parameters:

doc (Document) – Document to process

Returns:

Processed document

Return type:

Document
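
The field contract described for this class (str inputs, str outputs, one output field per input field) can be sketched with a stand-in step. Everything here is hypothetical: `SketchHTMLNormalizer` is not the library class, the document is modelled as a plain dict rather than a `Document`, and tags are removed with a naive regular expression.

```python
import re
from typing import Any, Dict, List

# Naive tag matcher, for illustration only; it is not a full HTML parser.
TAG_RE = re.compile(r"<[^>]+>")


class SketchHTMLNormalizer:
    """Hypothetical stand-in, not the library's HTMLMappingNormalizer."""

    def __init__(self, config: Dict[str, Any]) -> None:
        self.input_fields: List[str] = config["input_fields"]
        self.output_fields: List[str] = config["output_fields"]

    def process_doc(self, doc: Dict[str, Any]) -> Dict[str, Any]:
        # Pair each input field with the output field at the same position.
        for source, target in zip(self.input_fields, self.output_fields):
            value = doc[source]
            if not isinstance(value, str):
                raise TypeError(f"field {source!r} must be of type str")
            doc[target] = TAG_RE.sub("", value)
        return doc


normalizer = SketchHTMLNormalizer(
    {"input_fields": ["body"], "output_fields": ["normalized_body"]}
)
processed = normalizer.process_doc({"body": "<p>Hello <i>world</i></p>"})
print(processed["normalized_body"])  # Hello world
```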