SquirroEntityFilter

class squirro.lib.nlp.steps.filters.SquirroEntityFilter(config)

Bases: squirro.lib.nlp.steps.filters.Filter

The Squirro entity filter takes a set of fields and builds the Squirro entities structure from them.

Note - squirro.lib.nlp.steps.tokenizers.PdfSentencesTokenizer needs to be run before this step for PDF support.

Input - There is no specific input field for this step.

Output - The output field is formatted as follows (page_to_rects is only produced for PDF files):

[
  {
    "type": "ENTITY_TYPE",
    "name": "ENTITY_NAME",
    "extracts": [
      {
        "text": "TEXT_FRAGMENT",
        "offset": "OFFSET",
        "length": "LENGTH",
        "page_to_rects": {
          "PAGE_NUM": [{"x": "X", "y": "Y", "height": "H", "width": "W"}]
        }
      }
    ],
    "properties": {"PROPERTY_KEY": ["PROPERTY_VALUE"]}
  }
]
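
To make the output layout concrete, here is a minimal Python sketch that assembles one entity in that shape and serializes it. The helper make_entity and the sample values are hypothetical and not part of the Squirro library, and the use of integers for offset and length is an assumption, since the template above only shows placeholders.

```python
import json


def make_entity(entity_type, name, extracts, properties):
    """Assemble one entity dict in the documented layout (hypothetical helper)."""
    return {
        "type": entity_type,
        "name": name,
        "extracts": extracts,
        "properties": properties,
    }


# Sample extract from a non-PDF source, so "page_to_rects" is left out.
text = "Acme Corp reported strong earnings."
extract = {"text": text, "offset": 0, "length": len(text)}  # int values are an assumption

entities = [
    make_entity("company", "Acme Corp", [extract], {"sentiment": ["positive"]})
]

# The list of entities is what would be written to the output field.
print(json.dumps(entities, indent=2))
```
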
Parameters
  • entity_name (str, None) – Name of entity (defaults to entity_type)

  • entity_name_field (str, None) – Field with entity name (defaults to entity_name if None)

  • entity_type (str) – Type of the Squirro entity; used as the value of the type field in the entity data structure

  • excluded_values (list, []) – Values that will not be added as entity properties

  • extract_field (str) – Field with list of text extracts

  • format_values (bool, False) – Whether or not to format string values as titles

  • global_property_field_map (dict, {}) – Map for fields with properties that are copied in from the item

  • output_field (str, 'entities') – Field to write resulting entities

  • property_field_map (dict, {}) – Map for fields with properties that match the number of extracts

  • property_value_map (dict, {}) – Map for renaming values of fields with properties that match the number of extracts

  • static_properties (dict, {}) – Map of static property values to attach to entities

  • required_properties (list, []) – Properties that must exist (after exclusion) for the entity to be added

  • source_field (str, '') – Field where extracted text originated

  • source_fields (list, []) – List of fields where extracted text originated
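
The interplay between excluded_values and required_properties can be illustrated with a short Python sketch. This is not the library's implementation: the function filter_properties is hypothetical, and treating a property whose values are all excluded as missing is an assumption drawn from the "after exclusion" wording above.

```python
def filter_properties(properties, excluded_values, required_properties):
    """Drop excluded values, then check that every required property survives.

    Returns the filtered properties, or None if the entity should be skipped.
    Hypothetical helper; the real step may behave differently.
    """
    kept = {}
    for key, values in properties.items():
        remaining = [value for value in values if value not in excluded_values]
        if remaining:  # assumption: a property with no values left counts as missing
            kept[key] = remaining

    if any(key not in kept for key in required_properties):
        return None
    return kept


# "unknown" is excluded, but "positive" keeps the required property alive.
print(filter_properties({"sentiment": ["positive", "unknown"]}, ["unknown"], ["sentiment"]))

# All values are excluded, so the required property is gone and the entity is dropped.
print(filter_properties({"sentiment": ["unknown"]}, ["unknown"], ["sentiment"]))
```
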

Example

{
    "step": "filter",
    "type": "squirro_entity",
    "entity_name_field": "prediction",
    "entity_type": "ENTITY_TYPE",
    "excluded_values": [],
    "extract_field": "sentences",
    "format_values": false,
    "global_property_field_map": {},
    "modes": ["process"],
    "property_field_map": {
        "PROPERTY_KEY": ["prediction"]
    },
    "required_properties": ["PROPERTY_KEY"],
    "source_field": "body"
}
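
Before adding such a step to a pipeline, it can help to sanity-check the configuration. The Python sketch below loads the example above and runs two checks. Both are assumptions rather than documented behavior: that the parameters listed without a default (entity_type, extract_field) are mandatory, and that each required property should be defined in one of the property maps. The helper names are hypothetical.

```python
import json

# Parameters listed without a default in the reference (assumed mandatory).
ASSUMED_MANDATORY = ("entity_type", "extract_field")

STEP_JSON = """
{
    "step": "filter",
    "type": "squirro_entity",
    "entity_name_field": "prediction",
    "entity_type": "ENTITY_TYPE",
    "excluded_values": [],
    "extract_field": "sentences",
    "format_values": false,
    "global_property_field_map": {},
    "modes": ["process"],
    "property_field_map": {
        "PROPERTY_KEY": ["prediction"]
    },
    "required_properties": ["PROPERTY_KEY"],
    "source_field": "body"
}
"""


def missing_parameters(step):
    """Return the assumed-mandatory parameters absent from the step config."""
    return [name for name in ASSUMED_MANDATORY if name not in step]


def undefined_required(step):
    """Return required properties that no property map or static value defines."""
    defined = set()
    for key in ("property_field_map", "global_property_field_map", "static_properties"):
        defined.update(step.get(key, {}))
    return [name for name in step.get("required_properties", []) if name not in defined]


step = json.loads(STEP_JSON)
print("missing parameters:", missing_parameters(step))
print("undefined required properties:", undefined_required(step))
```
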

Methods Summary

process_doc(doc)

Process a document

Methods Documentation

process_doc(doc)

Process a document

Parameters

doc (Document) – Document

Returns

Processed document

Return type

Document