RegexFilter

class RegexFilter(config)

Bases: Filter

The regex filter filters documents based on supplied lists of blacklist and whitelist regexes.

Input - all input fields need to be of type str.

Output - (optional) the output field is filled with data of type str.

Parameters:
  • type (str) – regex

  • blacklist_regexes (list, []) – List of blacklist regexes to apply

  • fields (list) – Fields to apply regexes

  • output_field (str, None) – Field to record if regex matches

  • matching_label (str, 'match') – Label given if regex matches

  • non_matching_label (str, 'no_match') – Label given if regex does not match

  • whitelist_regexes (list, []) – List of whitelist regexes to apply

  • rule_field (str, None) – Field to record the rule which triggered the match (mainly used in the context of proximity filters)

  • no_rule_matched_label (str, 'NO_RULE_MATCHED') – Rule given if regex does not match (mainly used in the context of proximity filters)

  • default_language (str, 'en') – Default language if language_field is not present.

  • language_field (str, 'language') – Document field that gives the language.
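
The parameter list does not spell out how whitelist and blacklist results are combined, so the following is only a minimal sketch of one plausible reading, not the library's implementation. The function `label_document`, the plain-dict document, and the `length_check` field name are hypothetical. The sketch assumes that a document counts as a match when at least one listed field matches a whitelist regex (or no whitelist is given) and no listed field matches a blacklist regex.

```python
import re


def label_document(doc, fields, whitelist_regexes=(), blacklist_regexes=(),
                   output_field=None, matching_label="match",
                   non_matching_label="no_match", flags=0):
    """Hypothetical stand-in for the labeling step; not the library's code."""
    # Only fields that are present on the document are inspected.
    texts = [doc[name] for name in fields if name in doc]

    def any_hit(patterns):
        # True if any pattern is found in any inspected field.
        return any(re.search(p, text, flags) for p in patterns for text in texts)

    # Assumption: with an empty whitelist, every document passes this stage.
    passes_whitelist = not whitelist_regexes or any_hit(whitelist_regexes)
    # Assumption: a blacklist hit overrides a whitelist hit.
    matched = passes_whitelist and not any_hit(blacklist_regexes)

    # The label is only recorded when an output field is configured.
    if output_field is not None:
        doc[output_field] = matching_label if matched else non_matching_label
    return doc


doc = {"body": "A sentence that is comfortably longer than twenty characters."}
labeled = label_document(doc, ["body"],
                         whitelist_regexes=["^.{20,}$"],
                         output_field="length_check")
```

Under these assumptions, `labeled["length_check"]` holds the matching label, because the body satisfies the whitelist pattern and no blacklist is configured.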

Example

{
    "step": "filter",
    "type": "regex",
    "fields": ["body"],
    "mark_as_skipped": true,
    "whitelist_regexes": ["^.{20,}$"]
}
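
To see what the whitelist pattern in this example accepts, the snippet below checks it against a few sample strings with Python's standard `re` module. It assumes the filter compiles its patterns with `re` and no extra flags, which is consistent with `REG_FLAGS = 0` but is not confirmed by this page. The sample strings are made up for illustration.

```python
import re

# Pattern taken from the example configuration; flags default to 0.
pattern = re.compile(r"^.{20,}$")

short_body = "Too short."
long_body = "This body has more than twenty characters."
# 24 characters in total, but split across two lines.
multiline_body = "Short line\nanother short"

results = {
    "short": bool(pattern.search(short_body)),
    "long": bool(pattern.search(long_body)),
    "multiline": bool(pattern.search(multiline_body)),
}
```

Without the `re.DOTALL` flag, `.` does not match a newline, so a body is accepted only if it is a single line of at least 20 characters. The multi-line sample is rejected even though its total length exceeds 20.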

Attributes Summary

REG_FLAGS

Methods Summary

process_doc(doc)

Process a document

Attributes Documentation

REG_FLAGS = 0
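
A flags value of 0 means that no `re` flags are set, so matching is case-sensitive and `^`/`$` anchor only at the start and end of the string. The sketch below shows that effect, assuming the attribute is forwarded to `re.compile`, which this page does not state. The patterns and the sample text are illustrative only.

```python
import re

REG_FLAGS = 0  # documented value; assumed to be passed on to re.compile

case_sensitive = re.compile(r"confidential", REG_FLAGS)
case_insensitive = re.compile(r"confidential", REG_FLAGS | re.IGNORECASE)
# An inline flag inside the pattern itself works even when flags are 0.
inline_flag = re.compile(r"(?i)confidential", REG_FLAGS)

text = "CONFIDENTIAL: internal use only"
hits = (
    bool(case_sensitive.search(text)),
    bool(case_insensitive.search(text)),
    bool(inline_flag.search(text)),
)
```

If that assumption holds, an inline flag such as `(?i)` is one way to get case-insensitive matching from a regex given in the configuration.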

Methods Documentation

process_doc(doc)

Process a document

Parameters:

doc (Document) – Document

Returns:

Processed document

Return type:

Document