HTMLNormalizer

class squirro.lib.nlp.steps.normalizers.HTMLNormalizer(config)

Bases: squirro.lib.nlp.steps.normalizers.Normalizer

Normalizer that strips HTML markup from document content.

Parameters
  • type (str) – Must be html

  • encoding (str, 'utf-8') – Content encoding (deprecated)

  • remove_tags (list, []) – HTML tags to remove
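
As a rough illustration of how these parameters might be combined, the sketch below writes a step configuration as a Python dictionary. Only type, encoding, and remove_tags come from the parameter list above; the "step" and "fields" keys and the field name "body" are assumptions made for the example and may not match the actual pipeline schema.

```python
# Hypothetical step configuration for the HTML normalizer.
# Documented above: "type", "encoding" (deprecated), "remove_tags".
# Assumed for illustration only: "step", "fields", and the field name "body".
html_normalizer_config = {
    "step": "normalizer",                # assumption: step category key
    "type": "html",                      # documented: selects this normalizer
    "fields": ["body"],                  # assumption: fields to normalize
    "remove_tags": ["script", "style"],  # documented: tags to remove
}
```

The deprecated encoding option is left out here, so its documented default of 'utf-8' would apply.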

Methods Summary

process_doc(doc)

Process a document, removing HTML markup

Methods Documentation

process_doc(doc)

Process a document, removing HTML markup

Parameters

doc (Document) – Document to process

Returns

Processed document

Return type

Document
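
To make the effect of this kind of normalization concrete, here is a minimal standalone sketch built only on the Python standard library. It is not the HTMLNormalizer implementation and does not use the Document class. The helper name strip_html is hypothetical, and the sketch assumes that an element listed in remove_tags is dropped together with its content, which the reference above does not confirm.

```python
from html.parser import HTMLParser

# Void elements have no closing tag, so they must not open a skipped region.
_VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img",
              "input", "link", "meta", "source", "track", "wbr"}


class _TextExtractor(HTMLParser):
    """Collect text content, skipping elements named in remove_tags."""

    def __init__(self, remove_tags=()):
        super().__init__(convert_charrefs=True)
        self._remove_tags = {t.lower() for t in remove_tags} - _VOID_TAGS
        self._skip_depth = 0  # > 0 while inside a removed element
        self._chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in self._remove_tags:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self._remove_tags and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        if self._skip_depth == 0:
            self._chunks.append(data)

    def text(self):
        # Join the collected text and collapse runs of whitespace.
        return " ".join("".join(self._chunks).split())


def strip_html(markup, remove_tags=()):
    """Hypothetical helper: return markup with tags stripped.

    Assumption: elements in remove_tags are removed with their content.
    """
    parser = _TextExtractor(remove_tags)
    parser.feed(markup)
    parser.close()
    return parser.text()
```

With this sketch, strip_html("<p>Hello <b>world</b></p>") returns "Hello world", and passing remove_tags=["script"] also discards the text inside any script element.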