Catalyst Data Model

The catalyst data model provides a sub-item model so that when significant events are detected, you can see exactly which sentence or phrase in a document triggered the catalyst, as well as detailed relationships across documents.

Individual labels are stored in the entities item field.

Warning

Labels were previously referred to as facets in the Squirro UI. You will still see references to facets in the code, and in some places within the Squirro UI. All facets can be treated as labels.

Definitions / Vocabulary

Document

Original data as provided by the customer.

Item

A modified version of Document as stored within Squirro.

Label

Metadata assigned to an Item in the form of a key/[list of values] pair. Stored in the keywords attribute of the Item.

Entity

A real-world or higher-level object of a pre-defined type, such as a person, location, organization, product, or event, that can be denoted by a proper name. An Entity holds a list of Extracts covering all of its appearances within one Item. Optionally, it can also hold a set of property values. Properties are pre-defined per Entity type and are either simple values or references to other Entities.

Extract

A single occurrence of a detected Entity within one Item. Keeps track of the location and the original text of the detection.

Catalyst

A mapping between a Query and a set of Actions.

Query

A string conforming to the Squirro query syntax, which is extended to allow searching for Entities. See Query Syntax below.

Action

An operation executed when a Catalyst matches, e.g. sending an email or calling a callback.

Entity Profile

Pre-computed model for each value of an Entity. Used for ranking Recommendations.

Recommendation

Ranked result list of Entities based on a Query (potentially containing Entities).
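
To make the relationship between Catalyst, Query, and Action more concrete, here is a minimal sketch in Python. All names in it (Catalyst, fire, notify) are hypothetical and chosen for illustration only; they are not part of the Squirro API, and an Action is simply assumed to be a callable that receives the id of the matching Item.

```python
from dataclasses import dataclass, field
from typing import Callable, List

# Assumption: an Action is modelled as a callable taking the matching item id.
Action = Callable[[str], None]


@dataclass
class Catalyst:
    """Illustrative mapping between a Query and a set of Actions."""

    query: str
    actions: List[Action] = field(default_factory=list)

    def fire(self, item_id: str) -> int:
        """Run every Action for a matching Item and return how many ran."""
        for action in self.actions:
            action(item_id)
        return len(self.actions)


notified: List[str] = []


def notify(item_id: str) -> None:
    # Stand-in for a real action such as sending an email.
    notified.append(item_id)


catalyst = Catalyst(
    query="entity:{type:deal AND confidence > 0.7}",
    actions=[notify],
)
ran = catalyst.fire("123456")
```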

Models

Item

see Item Model

Facet

see Facets API

Entity

[{
    "id": "1234",  # unique entity id
    "item_id": "123456",  # reference to original item id
    "type": "company",  # type of the entity, e.g. company
    "name": "Thomson Reuters",
    "confidence": 0.8,  # aggregated confidence of all extracts [0-1]
    "relevance": 0.9,  # relevance of this entity for the item [0-1]
    "extracts": [{
        "text": "Thomson Reuters",  # original representation
        "field": "title",  # on which Item field can this extract be found
        "confidence": 0.9,  # confidence level [0-1]
        "offset": 14,  # start offset of text within original item
        "length": 15,  # length of text within original item
    }, {
        "text": "TR",  # original representation
        "field": "body",  # on which Item field can this extract be found
        "confidence": 0.1,  # confidence level [0-1]
        "offset": 0,  # start offset of text within original item
        "length": 2,  # length of text within original item
    }],
    "properties": {
        "stock_symbol": "TR",  # value based property
        "parent_company_ref": "<id of company type entity>"  # reference based property
    },
}, {
    "id": "1237",  # unique entity id
    "item_id": "123456",  # reference to original item id
    "type": "deal",  # type of the entity, e.g. deal
    "name": "Thomson Reuters bought Squirro for 1Mio in the US.",
    "confidence": 0.3,  # confidence level of this entity [0-1]
    "extracts": [{
        "text": "Thomson Reuters bought Squirro for 1Mio in the US.",  # original representation
        "field": "body",  # on which Item field can this extract be found
        "confidence": 0.3,  # confidence level [0-1]
        "offset": 114,  # start offset of text within original item
        "length": 50,  # length of text within original item
    }],
    "properties": {  # variable set of keys depending on the entity type
        "region_ref": "<entity_id_1_of_type_geo>",
        "size": 1000000,
        "industry": null,
        "acquirer": "<entity_id_2_of_type_company>",
        "target": "<entity_id_3_of_type_company>",
    }
}]
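
As a rough illustration of how an Extract locates its text, the following Python sketch slices the referenced Item field using offset and length. It assumes that offset is a character offset into the field named by the extract; the item content and the helper name extract_span are made up for this example.

```python
# Hypothetical item; only the fields referenced by the extracts are shown.
item = {
    "id": "123456",
    "title": "Market update Thomson Reuters expands",
    "body": "TR announced a new acquisition today.",
}

# Trimmed-down entity following the model above.
entity = {
    "id": "1234",
    "item_id": "123456",
    "type": "company",
    "name": "Thomson Reuters",
    "extracts": [
        {"text": "Thomson Reuters", "field": "title", "offset": 14, "length": 15},
        {"text": "TR", "field": "body", "offset": 0, "length": 2},
    ],
}


def extract_span(item, extract):
    """Return the text an extract points at.

    Assumption: offset and length count characters within the named field.
    """
    start = extract["offset"]
    return item[extract["field"]][start:start + extract["length"]]


spans = [extract_span(item, extract) for extract in entity["extracts"]]
```

Under that assumption, each recovered span equals the text value stored on the corresponding extract, which is a convenient consistency check when producing entities.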

Note: Properties come in two types: string (the default) or numeric. Numeric properties, i.e. values of type float or int, are indexed in a separate Elasticsearch field, ‘numeric_properties’, and mapped back into ‘properties’ before results are returned. This allows for proper numeric comparisons and range queries. Unlike for keywords, no database keeps track of property types; the type is inferred solely from the submitted value.
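
The type inference described in the note can be sketched as follows. This is an illustration of the idea only, not Squirro's actual implementation; the function names split_properties and merge_properties are hypothetical, and treating booleans as non-numeric is an assumption of this sketch.

```python
def split_properties(properties):
    """Split properties into string-like and numeric values by inferred type."""
    strings, numerics = {}, {}
    for key, value in properties.items():
        # bool is a subclass of int in Python, so exclude it explicitly
        # (assumption: booleans are not treated as numeric properties).
        if isinstance(value, (int, float)) and not isinstance(value, bool):
            numerics[key] = value
        else:
            strings[key] = value
    return {"properties": strings, "numeric_properties": numerics}


def merge_properties(indexed):
    """Map numeric values back into a single 'properties' dict."""
    merged = dict(indexed.get("properties", {}))
    merged.update(indexed.get("numeric_properties", {}))
    return merged


submitted = {"stock_symbol": "TR", "size": 1000000, "ratio": 0.25}
indexed = split_properties(submitted)
returned = merge_properties(indexed)
```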

Query Syntax

Entities

entity:{< any query to match a single entity document >}

Examples

Search for Items containing a specific Entity of type company:

entity:{type:company AND name:"Thomson Reuters"}

Search for Items containing at least one company-typed Entity “Thomson Reuters” and another company-typed Entity “Squirro”:

entity:{type:company AND name:"Thomson Reuters"} AND entity:{type:company AND name:Squirro}
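
Because each entity:{...} clause is described as matching a single entity document, all conditions inside one clause presumably have to hold for the same Entity, while separate clauses may be satisfied by different Entities of the same Item. The following Python sketch models that reading locally; it is an assumption-based illustration, not the search engine's implementation, and item_matches is a hypothetical helper.

```python
def item_matches(entities, *clauses):
    """Return True if every clause is satisfied by at least one single entity."""
    return all(any(clause(entity) for entity in entities) for clause in clauses)


# Hypothetical entities of one item.
entities = [
    {"type": "company", "name": "Thomson Reuters", "confidence": 0.8},
    {"type": "company", "name": "Squirro", "confidence": 0.6},
]


def company_named(name):
    # Mirrors entity:{type:company AND name:<name>}
    return lambda e: e["type"] == "company" and e["name"] == name


# Two clauses, each satisfied by a different entity of the item.
both_present = item_matches(
    entities, company_named("Thomson Reuters"), company_named("Squirro")
)

# One clause: name and confidence must hold for the same entity.
confident_squirro = item_matches(
    entities,
    lambda e: e["name"] == "Squirro" and e["confidence"] > 0.7,
)
```

In this model the second check fails even though the item contains both a Squirro entity and an entity with a confidence above 0.7, because no single entity satisfies both conditions.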

Search for Items containing a specific Entity of type company with a confidence higher than 80%:

entity:{type:company AND name:"Thomson Reuters" AND confidence > 0.8}

Search for Items containing any Entity of type company with a confidence of at least 70%:

entity:{type:company AND NOT confidence < 0.7}

Search for Items containing an Entity of type company with a confidence lower than 20%:

entity:{type:company AND confidence < 0.2}

Search for Items containing any Entity of type deal with a confidence higher than 70%:

entity:{type:deal AND confidence > 0.7}

Search for Items containing a specific Entity of type deal:

entity:{type:deal AND properties.size:100 AND properties.region:US AND properties.industry:Tech AND properties.target:Whatsapp AND properties.acquirer:Facebook}

Search for Items containing one Entity of type deal with target Squirro and another with target Whatsapp, both in the Tech industry:

entity:{type:deal AND properties.target:Squirro AND properties.industry:Tech} AND entity:{type:deal AND properties.target:Whatsapp AND properties.industry:Tech}

Search for Items containing an Entity of type deal whose size property is greater than 100:

entity:{type:deal AND properties.size > 100}
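
Queries like the ones above can also be assembled programmatically. The helpers below are a small sketch for building the query string only; entity_clause and quoted are hypothetical names, not functions of any Squirro library, and no escaping of special characters is attempted.

```python
def quoted(value):
    """Wrap a multi-word value in straight double quotes."""
    return '"' + value + '"'


def entity_clause(*conditions):
    """Join conditions with AND inside a single entity:{...} clause."""
    if not conditions:
        raise ValueError("at least one condition is required")
    return "entity:{" + " AND ".join(conditions) + "}"


# Two clauses, so each may match a different entity of the same item.
query = " AND ".join([
    entity_clause("type:company", "name:" + quoted("Thomson Reuters")),
    entity_clause("type:company", "name:Squirro"),
])

# A numeric comparison on a property.
size_query = entity_clause("type:deal", "properties.size > 100")
```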