Filter

class Filter(config)

Bases: BatchedStep

Filter steps take the stream of Document objects and either filter out specific entries based on the criteria of the step, or perform reduction/enlargement operations on fields and on the Document itself.

Parameters:
  • step (str) – filter

  • type (str) – Type of Filter

  • mark_as_skipped (bool, False) – Keep track of rejected items by marking them as skipped. By default, documents that are filtered out are discarded completely. When this is set to True, a rejected document stays in the pipeline but is skipped (ignored) by most steps. Only steps that have a handle_skipped setting can be configured to process skipped documents.
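
The difference between discarding and marking as skipped can be sketched in plain Python. This is only an illustration of the behaviour described above, not the library's code: Doc, apply_filter and run_downstream are hypothetical stand-ins, and the way handle_skipped is modelled here (a single boolean on the downstream step) is an assumption.

```python
from dataclasses import dataclass


@dataclass
class Doc:
    """Hypothetical stand-in for the library's Document."""
    id: int
    skipped: bool = False


def apply_filter(docs, keep, mark_as_skipped=False):
    """Mimic a Filter step: drop rejected docs, or keep them flagged as skipped."""
    out = []
    for doc in docs:
        if keep(doc):
            out.append(doc)
        elif mark_as_skipped:
            doc.skipped = True  # rejected, but still travels through the pipeline
            out.append(doc)
        # otherwise the rejected document is discarded
    return out


def run_downstream(docs, handle_skipped=False):
    """Mimic a later step: return the ids it would actually process."""
    return [doc.id for doc in docs if handle_skipped or not doc.skipped]


def is_odd(doc):
    return doc.id % 2 == 1


# Default behaviour: document 2 is gone for good.
discarded = apply_filter([Doc(1), Doc(2), Doc(3)], is_odd)

# mark_as_skipped=True: document 2 is kept, but most steps ignore it.
marked = apply_filter([Doc(1), Doc(2), Doc(3)], is_odd, mark_as_skipped=True)
```

In this sketch, a downstream step sees the same documents in both modes unless it opts in to skipped documents, in which case the marked stream still carries the rejected one.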

Methods Summary

reject_doc(doc)

Helper to reject documents.

Methods Documentation

reject_doc(doc)

Helper to reject documents.

If mark_as_skipped is True, then the document is returned with skipped set to True. Otherwise nothing is returned.

Implementations call this helper from process_doc, returning its result for any document that is to be filtered out:

return self.reject_doc(doc)
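
A minimal, self-contained sketch of this pattern follows. The Document and Filter classes below are simplified stand-ins written for this example (the real base classes are not reproduced here), and MinLengthFilter with its min_length option is a hypothetical filter; only reject_doc, process_doc, mark_as_skipped and skipped come from the documentation above.

```python
from dataclasses import dataclass, field


@dataclass
class Document:
    """Simplified stand-in for the library's Document (assumption)."""
    fields: dict = field(default_factory=dict)
    skipped: bool = False


class Filter:
    """Simplified stand-in that mirrors the documented reject_doc behaviour."""

    def __init__(self, config):
        self.mark_as_skipped = config.get("mark_as_skipped", False)

    def reject_doc(self, doc):
        if self.mark_as_skipped:
            doc.skipped = True
            return doc   # kept in the pipeline, flagged as skipped
        return None      # nothing is returned: the document is discarded


class MinLengthFilter(Filter):
    """Hypothetical filter: rejects documents whose text is too short."""

    def __init__(self, config):
        super().__init__(config)
        self.min_length = config.get("min_length", 1)

    def process_doc(self, doc):
        if len(doc.fields.get("text", "")) < self.min_length:
            return self.reject_doc(doc)
        return doc


discarding = MinLengthFilter({"min_length": 5})
marking = MinLengthFilter({"min_length": 5, "mark_as_skipped": True})

dropped = discarding.process_doc(Document({"text": "hi"}))
flagged = marking.process_doc(Document({"text": "hi"}))
passed = discarding.process_doc(Document({"text": "hello world"}))
```

Routing every rejection through reject_doc keeps the mark_as_skipped logic in one place, so an individual filter only has to decide whether a document passes its criteria.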