Config Reference

This document details all configuration options for Known Entity Extraction (KEE). These options are stored in the config.json file located in the root folder of the KEE project.

Format

The config.json file is written in JSON or, more precisely, in a more human-friendly superset called Hjson. The main improvements in Hjson are comments and trailing commas, neither of which is allowed in standard JSON. Additionally, you can leave out most of the quotes if desired. Here is an example using Hjson:

{
    sources: {
        demo: {
            // The data is in a CSV file in the current directory
            "source_type": "csv",
            "source_file": "demo.csv",
        },
    },

    /* Just repeating the default settings for testing here */
    testing: {
        fixtures_dir: "fixtures/",
        snapshots_dir: "snapshots/",
    }
}

A full reference of Hjson is available on hjson.org. The same configuration in standard JSON:

{
    "sources": {
        "demo": {
            "source_type": "csv",
            "source_file": "demo.csv"
        }
    },

    "testing": {
        "fixtures_dir": "fixtures/",
        "snapshots_dir": "snapshots/"
    }
}

Structure

In the examples above, the curly brackets at the root of the file open a new dictionary. Within each dictionary there are a number of key / value pairs. For example, the keys of the top level dictionary shown above are sources and testing.
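
As a quick illustration of this structure, the following Python sketch parses the standard JSON example from above with the standard library json module and lists its top-level sections. It assumes the file is written in standard JSON. Hjson features such as comments or trailing commas would need an Hjson parser instead.

```python
import json

# The standard JSON example from the Format section
CONFIG_TEXT = """
{
    "sources": {
        "demo": {
            "source_type": "csv",
            "source_file": "demo.csv"
        }
    },
    "testing": {
        "fixtures_dir": "fixtures/",
        "snapshots_dir": "snapshots/"
    }
}
"""

config = json.loads(CONFIG_TEXT)

# Each key of the top-level dictionary is one section of the configuration
sections = sorted(config)
print(sections)  # ['sources', 'testing']
```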

Each entry in the top-level dictionary indicates a different section of the KEE configuration. This reference describes the usage of each of the different sections:

Reference

KEE

The kee section configures basic parameters for the whole process. The full section is optional.

pipelet

The name of the pipelet when uploading the KEE project to the Squirro server. The name should be unique across your KEE projects to avoid any collision.

Type: str

pipelet_file

Path to a custom KEE pipelet.

Type: str

database

File name, relative to the configuration file directory, where the lookup database is located.

Type: str

Default: "db/lookup.json"

version

A version indicator for the current KEE configuration. This can be modified whenever the strategy or the data changes significantly, thus warranting a re-tagging of previous items.

The kee rerun command makes use of this information to select older items for re-tagging.

Type: str

Default: not set

version_keyword

Name of the keyword to use for version tagging on items. If the version is set, then it is written into a keyword of this name on the item.

Type: str

Default: "KEE Version"
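
To make the interplay of version and version_keyword more concrete, here is a minimal Python sketch that picks out items that were not tagged with the current version. It is an illustration only, not the implementation of kee rerun. The function name needs_retagging is hypothetical, and the sketch assumes that an item is a dictionary whose "keywords" entry maps keyword names to lists of values.

```python
def needs_retagging(item, version, version_keyword="KEE Version"):
    """Return True if the item does not carry the given version keyword value."""
    tagged_versions = item.get("keywords", {}).get(version_keyword, [])
    return version not in tagged_versions


# Hypothetical items: "a" has an older version, "c" was never tagged
items = [
    {"id": "a", "keywords": {"KEE Version": ["1"]}},
    {"id": "b", "keywords": {"KEE Version": ["2"]}},
    {"id": "c", "keywords": {}},
]

outdated = [item["id"] for item in items if needs_retagging(item, "2")]
print(outdated)  # ['a', 'c']
```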

debug

If true, a few debug options are enabled. For example, the lookup database is written in a more human-readable format.

Type: bool

Default: false

Example usage:

{
    "kee": {
        "database": "db/product_list.json",
        "debug": false
    }
}

Squirro

Adding a KEE project to a Squirro project requires the configuration of cluster information in the squirro section. See Connecting to Squirro for how to get this information.

You can omit this section if you are not using the upload or get_fixture commands, which connect to Squirro.

cluster

The endpoint of your Squirro installation.

Type: str

token

The authentication token with which to log into the system. Keep this token confidential. If you share the configuration file, remove the token first. See the environment variables section for an alternative.

Type: str

project_id

The identifier of the Squirro project in which you want to use the KEE.

Type: str

Example usage:

{
    "squirro": {
        "cluster": "https://www.mysquirrocluster.com",
        "token": "MY-ACCESS-TOKEN",
        "project_id": "MY-PROJECT-ID"
    }
}

Sources

The sources section in the configuration contains a list of all the data sources. These data sources are configured to load the structured KEE input data for building the lookup database of the KEE.

Each entry is in itself an object with the required configuration for the source. A partial example, creating the two sources "clients" and "employees":

{
    "sources": {
        "clients": {
            // The keys from the reference below come here
        },
        "employees": {
            // The keys from the reference below come here
        },
        // …
    }
}

Each source takes its own configuration. First, set the configuration for loading the data. You can make use of the Data Loader options, including Data Loader Plugins. Second, configure the KEE behavior for the source.

Data Source

source_type

The type of source the KEE connects to.

Type: str

Valid values: csv, excel, database

source_script

The Data Loader Plugin to use for loading the KEE data. The source_script and source_type keys are mutually exclusive.

Additional connection options

You can specify additional connection options as key/value pairs.

Use the same options as you would use for the data loader, but replace the dashes with underscores. For example, if the data loader is invoked with:

squirro_data_load --source-type csv --source-file test.csv

then the corresponding source configuration in the KEE configuration is:

{
    "source_type": "csv",
    "source_file": "test.csv",
}

Check the Data Loader Reference for all the possible options. Plugin-specific options are documented in each plugin.
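
The renaming rule can be sketched in a few lines of Python. The helper cli_args_to_config is hypothetical and only illustrates the naming convention. It assumes that every option is followed by exactly one value, so options without a value are not handled.

```python
def cli_args_to_config(args):
    """Convert arguments such as ["--source-type", "csv"] into config keys."""
    config = {}
    # Walk over the arguments in (option, value) pairs
    for option, value in zip(args[0::2], args[1::2]):
        if not option.startswith("--"):
            raise ValueError(f"Expected an option starting with '--', got {option!r}")
        # Drop the leading dashes and replace the remaining dashes with underscores
        key = option[2:].replace("-", "_")
        config[key] = value
    return config


source_config = cli_args_to_config(
    ["--source-type", "csv", "--source-file", "test.csv"]
)
print(source_config)  # {'source_type': 'csv', 'source_file': 'test.csv'}
```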

KEE Configuration

strategy

The name of the strategy that is used for matching on this KEE data source. The name needs to reference a strategy key that has been defined in the strategies section of the configuration file.

Type: str

field_id

The name of the column in the KEE input data holding a unique identifier of the row.

Type: str

generate_id

Automatically generate the unique identifier for each row. Only use if field_id is not specified.

Type: bool

field_matching

Names of the columns in the KEE input data that contain the entity names used for the KEE matching. This is commonly the primary name and often an alias column.

Type: list

hierarchy

Specifies a hierarchy in the data. Using a hierarchy, you can configure the KEE to tag items, for example, with the matching company and all of its parent companies.

The format of this configuration is:

{
    "sources": {
        "example": {
            // ...
            "hierarchy": "Parent Column -> Child Column"
            // ...
        }
    }
}

See the hierarchy section for examples.

Type: str

multivalue

A list of column names that can contain multiple values. This is commonly used for the alias column. The default separator for multiple values in the source data is the pipe (|). If your KEE input data uses a different separator, specify the separator after a colon following the column name (see example below).

Type: list

Partial example:

{
    "sources": {
        "companies": {
            // Other keys omitted for clarity
            // …
            "multivalue": [
                "Aliases",  // Aliases holds multiple values, separated by a pipe |
                "Sectors:,",  // Sectors holds multiple values, separated by a comma ,
            ]
        }
    }
}
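
The following Python sketch shows one way to read such multivalue entries. It is not KEE's own parser. The helper names are hypothetical, and the sketch assumes that column names do not themselves contain a colon.

```python
DEFAULT_SEPARATOR = "|"


def parse_multivalue_spec(spec):
    """Split an entry like "Sectors:," into (column, separator)."""
    column, colon, separator = spec.partition(":")
    if colon and separator:
        return column, separator
    return column, DEFAULT_SEPARATOR


def split_values(row, specs):
    """Return a copy of the row with the multivalue columns split into lists."""
    result = dict(row)
    for spec in specs:
        column, separator = parse_multivalue_spec(spec)
        if row.get(column):
            result[column] = [value.strip() for value in row[column].split(separator)]
    return result


row = {"Name": "Acme", "Aliases": "Acme Corp|ACME", "Sectors": "Retail,Energy"}
parsed = split_values(row, ["Aliases", "Sectors:,"])
print(parsed["Aliases"])  # ['Acme Corp', 'ACME']
print(parsed["Sectors"])  # ['Retail', 'Energy']
```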

Testing

The optional testing section allows the specification of the location of test files and snapshots. See the KEE Testing documentation for details on the testing process.

The valid configuration keys in this section are:

fixtures_dir

Name of the folder, relative to the configuration file directory, containing the test fixtures.

Type: str

Default: "fixtures"

snapshots_dir

Name of the folder, relative to the configuration file directory, containing the snapshots.

Type: str

Default: "snapshots"

Again a partial example:

{
    "testing": {
        "fixtures_dir": "fixtures/",
        "snapshots_dir": "snapshots/"
    }
    // …  Other keys omitted
}

Extraction

The extraction section configures how the Squirro items are processed during the KEE process.

item_fields

Which item fields to use for Known Entity Extraction. See item fields for possible values.

Type: list

Default: ["title", "body"]

Matching on a specific keyword requires adding it to item_fields. Example:

{
    "extraction": {
        "item_fields": ["title", "body", "keyword.company"]
    }

    // …  Other keys omitted
}

With the above configuration the KEE looks for matches in the keyword company of an item in addition to the item fields title and body.
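
As a rough illustration, the Python sketch below gathers the values of the configured item fields from an item. The function collect_field_values is hypothetical and does not reflect how KEE reads items internally. It assumes that an item is a dictionary and that its "keywords" entry maps keyword names to lists of values.

```python
def collect_field_values(item, item_fields):
    """Return the text values of the configured item fields."""
    values = {}
    for field in item_fields:
        if field.startswith("keyword."):
            # "keyword.company" refers to the "company" keyword of the item
            name = field[len("keyword."):]
            values[field] = list(item.get("keywords", {}).get(name, []))
        else:
            text = item.get(field)
            values[field] = [text] if text else []
    return values


item = {
    "title": "Quarterly results",
    "body": "Revenue grew in all regions.",
    "keywords": {"company": ["Acme Inc."]},
}
values = collect_field_values(item, ["title", "body", "keyword.company"])
print(values["keyword.company"])  # ['Acme Inc.']
```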

Strategies

A strategy in KEE defines how the mapping is executed and which keywords should be added to Squirro items upon a successful match.

The matching is modified depending on a number of factors, including:

  • How precise should the matching be? In certain cases false positives (matches that shouldn’t have been made) are a smaller problem than false negatives (matches that were missed). In other cases the opposite is true.

  • The format of the incoming data matters. If the input data is all uppercase (as often happens in legacy data) the KEE matching has less precision to work with. If the input data is known to be of high quality, then signals like camel case can be taken into account.

  • The domain of the data to match. For example, company names have many suffixes that are often not spelled out in common language (e.g. Inc., Limited, (Pty) Ltd, etc.).

Each source references its strategy by name, and that name is looked up in the strategies section of the configuration. The following incomplete example references a strategy called "companies" that is defined in the strategies section.

{
    // Most keys omitted for brevity

    "sources": {
        "clients": {
            "source_type": "csv",
            "source_file": "…",
            "strategy": "companies"
        }
    }

    "strategies": {
        "companies": {
            "tokenizer": "default",
            // …
        }
    }
}
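
Because the link between a source and its strategy is only a name, a typo is easy to miss. The following Python sketch checks a parsed configuration for sources that reference an undefined strategy. The function missing_strategies is a hypothetical helper and not part of KEE.

```python
def missing_strategies(config):
    """Return (source, strategy) pairs whose strategy is not defined."""
    defined = config.get("strategies", {})
    missing = []
    for name, source in config.get("sources", {}).items():
        strategy = source.get("strategy")
        if strategy is not None and strategy not in defined:
            missing.append((name, strategy))
    return missing


config = {
    "sources": {
        "clients": {"source_type": "csv", "strategy": "companies"},
        "employees": {"source_type": "csv", "strategy": "people"},
    },
    "strategies": {"companies": {"tokenizer": "default"}},
}

problems = missing_strategies(config)
print(problems)  # [('employees', 'people')]
```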

The configuration keys to tailor a strategy to specific requirements are grouped into Matching and Keywords related configurations.

Matching

tokenizer

For processing the text input, the text is split into individual tokens. The tokenizer and the filters specify how this is done.

Supported tokenizers: default, brackets

Type: str

Please refer to KEE Tokenizers and Filters for details on the tokenizers.

filters

Together with the tokenizer, the filters specify how text is matched. The filters influence how much leniency is applied when matching and make sure that different spellings of a word can still be matched.

Available filters are:

  • camelcase

  • initials

  • lowercase

  • singular

  • accents

  • stem

Type: list

Default: ["lowercase"]

Please refer to KEE Tokenizers and Filters for details on the filters. That section also explains how to create custom filters.

min_score

The minimum score required for a token to match. 1.0 is a perfect match, 0.0 is no match at all.

Use KEE Testing to find the right balance for each use case. Turn on verbose logging or tracing (see the --trace argument of kee test) to see the score that tokens receive.

Type: float

Default: 0.9

spellfix

Allow small spelling mistakes. This allows at most one letter swap, so for example “Apple” and “Appel” match each other.

Type: bool

Default: false
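
The kind of comparison this implies can be sketched as follows. The sketch assumes that a "letter swap" means exchanging two adjacent letters. It illustrates that idea only and is not KEE's matching code.

```python
def is_single_swap(a, b):
    """Return True if a equals b or differs by one swap of adjacent letters."""
    if a == b:
        return True
    if len(a) != len(b):
        return False
    # Positions at which the two words differ
    diffs = [i for i, (x, y) in enumerate(zip(a, b)) if x != y]
    return (
        len(diffs) == 2
        and diffs[1] == diffs[0] + 1
        and a[diffs[0]] == b[diffs[1]]
        and a[diffs[1]] == b[diffs[0]]
    )


print(is_single_swap("Apple", "Appel"))  # True
print(is_single_swap("Apple", "Apply"))  # False
```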

blacklist

A list of entity names to ignore. If any of the field_matching (see sources) column values is contained in this list, the entity is never tagged.

Type: list

suffix_list

The suffix list that is used to remove common suffixes in the entity names. See the section suffix list below for details.

Type: str

geo_strategy

Strategy for handling geographic names in entity names.

Possible values:

  • noop (the default): Don’t handle geographic names at all.

  • ignore: Detect common geographic names and ignore them for matching. This especially affects trailing geographic names and means that a company designation like “Acme Inc” is matched the same way as e.g. “Acme Inc - Switzerland”.

Type: str

Keywords

keywords

The keywords section defines which keywords are added to a Squirro item upon matching any entity. This is a list of keywords that can be added, where each individual entry contains the input file column to write and the keyword name into which to store it.

The target value can make use of simple template substitution to add keyword names based on the data of the matching row. The syntax is a field name surrounded by curly brackets.

Type: list

Partial example:

//…
    "keywords": [
        "Name",  // Takes the "Name" column and writes it into a "Name" keyword
        "Id -> Company ID",  // Writes the "Id" column into a "Company ID" keyword
        // Adds the "Id" a second time. If the "Type" of the match is e.g. "SME" then
        // this adds the keyword "SME ID".
        "Id -> {Type} ID",
    ]
//…
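
The rule syntax can be illustrated with a short Python sketch. The function apply_keyword_rules is hypothetical and is not the KEE implementation. It assumes that the matching row is available as a dictionary keyed by column name.

```python
def apply_keyword_rules(rules, row):
    """Return the keywords that the given rules produce for one matching row."""
    keywords = {}
    for rule in rules:
        # "Column -> Target"; a bare "Column" writes into a keyword of the same name
        column, arrow, target = rule.partition("->")
        column = column.strip()
        target = target.strip() if arrow else column
        # Substitute {Field} placeholders with values from the matching row
        name = target.format_map(row)
        keywords.setdefault(name, []).append(row[column])
    return keywords


row = {"Id": "42", "Name": "Acme Inc.", "Type": "SME"}
keywords = apply_keyword_rules(["Name", "Id -> Company ID", "Id -> {Type} ID"], row)
print(keywords)
# {'Name': ['Acme Inc.'], 'Company ID': ['42'], 'SME ID': ['42']}
```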

parent_keywords

The parent_keywords section follows the same structure as the keywords section but targets all the parent entities. All parent entities (if any) are processed recursively and the item is tagged with the keywords according to the rules specified under parent_keywords.

For the tagging defined under parent_keywords to take effect, a hierarchy must be defined for the KEE data source. See also the example in the hierarchy section for more details.

Type: list

clean_keywords

A list of keywords to be removed from the items before applying the KEE tagging.

Keyword cleaning is useful when rerunning KEE tagging to ensure that old keywords are removed.

Type: list

Partial example:

//…
    "clean_keywords": ["Name", "Company ID"],

    "keywords": [
        "Name", "Id -> Company ID",
    ]
//…
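
In effect, cleaning removes the listed keywords before the new ones are written. A minimal Python sketch of that step follows. The function clean_item_keywords is hypothetical, and the sketch assumes that an item's "keywords" entry maps keyword names to lists of values.

```python
def clean_item_keywords(item, clean_keywords):
    """Remove the listed keywords from the item in place and return the item."""
    keywords = item.setdefault("keywords", {})
    for name in clean_keywords:
        # Keywords that are not present on the item are silently skipped
        keywords.pop(name, None)
    return item


item = {
    "title": "Acme announces merger",
    "keywords": {"Name": ["Old Name"], "Company ID": ["7"], "Author": ["Jane"]},
}
clean_item_keywords(item, ["Name", "Company ID"])
print(item["keywords"])  # {'Author': ['Jane']}
```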

Suffix List

The suffix list is used by the strategy to ignore common suffixes. Examples for such suffixes:

  • Companies: Legal names of enterprises end with, for example, Inc., Pty, AG, a.s., …

  • Stock ticker symbols: These often also include the stock exchange, for example, NASDAQ, NYSE, etc.

These suffixes are often omitted when writing about named entities and thus you can configure the matching strategy to ignore them.

To create a custom suffix list, add it to the suffix_list section and define the various patterns as key/value pairs. The keys are currently ignored by the KEE extraction. You can use the keys to group the tokens into logical groupings like all company suffixes associated with a country.

An example:

{
    // Most keys omitted for clarity

    "strategies": {
        "orgs": {
            "suffix_list": "companies",
        }
    },

    "suffix_list": {
        "companies": {
            "GLOBAL": ["Inc", "Limited"]
            "DEU": ["AG", "GmbH"]
            "ZAF": ["(Pty) Ltd", "LIMITED"]
            "INDUSTRY": ["Bank"]
        }
    },
}
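
To show the effect of such a list, the Python sketch below removes trailing suffixes from an entity name. It is a simplified illustration rather than KEE's matching logic. The function strip_suffixes is hypothetical, and comparing case-insensitively and dropping trailing periods and commas are assumptions.

```python
SUFFIX_GROUPS = {
    "GLOBAL": ["Inc", "Limited"],
    "DEU": ["AG", "GmbH"],
    "ZAF": ["(Pty) Ltd", "LIMITED"],
    "INDUSTRY": ["Bank"],
}


def strip_suffixes(name, suffix_groups):
    """Remove trailing suffixes (case-insensitively) from an entity name."""
    # The group keys only organize the list, so flatten all values.
    # Longer suffixes are tried first.
    suffixes = sorted(
        {suffix.lower() for group in suffix_groups.values() for suffix in group},
        key=len,
        reverse=True,
    )
    stripped = name.strip().rstrip(".,")
    changed = True
    while changed:
        changed = False
        for suffix in suffixes:
            # Require a space before the suffix so the name itself is never emptied
            if stripped.lower().endswith(" " + suffix):
                stripped = stripped[: -len(suffix)].strip().rstrip(".,")
                changed = True
                break
    return stripped


print(strip_suffixes("Acme Inc.", SUFFIX_GROUPS))          # Acme
print(strip_suffixes("Example (Pty) Ltd", SUFFIX_GROUPS))  # Example
```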

Language Handling (ngram)

For improved KEE matching, Squirro can make use of a language model that allows the matching to handle common words in entity names correctly. Two simple examples will show the possibilities:

  • The Capital Group: this is a company that contains only quite common words. So if a text talks about “the capital” it shouldn’t match this company yet. But when the name is fully spelled out, then the entity should match.

  • Carrefour Group: The word “Carrefour” is sufficiently rare in common language usage that just writing “Carrefour” in a text should be enough to match this entity.

The required frequency model is created using the ngram definition. Please contact support to get access to Squirro’s pre-compiled language models.

An example configuration for ngram is as follows:

{
    // Most keys omitted for clarity

    "strategies": {
        "orgs": {
            "ngram": "companies",
        }
    },

    "ngram": {
        "companies": {
            "source": "ngram/",
            "whitelist": ["Apple"],
        }
    },
}

The following is a reference of all of the keys in an individual ngram section.

source

Folder name, relative to the configuration file directory, where the ngram database is located.

Please contact support to get access to Squirro’s pre-compiled language models.

Type: str

default_language

Sets the default language for language model lookups. When the ngram folder does not contain a model for the language of the Squirro item that is being processed, then the default language is read.

Type: str

Default: en

whitelist

A list of entity names (e.g., company names) for which ngram correction is not done. The whitelist is useful to handle corner cases where a lax match is desired, even though a company name is penalized by the language model.

Partial example:

// …
"ngram": {
    "companies": {
        "source": "ngram/",
        "whitelist": ["Apple"],
    }
},

common

A list of prefixes that should be treated as common language terms. Use this configuration to override the language model in order to be more strict about certain terms.

It is sometimes helpful to override terms that would lead to imprecise matching. One example is prefix words from other languages. “Svensk”, for example, is the Swedish word for “Swedish”. In the context of company names, any company that starts with “Svensk” may just be saying “Swedish Acme Corp.”. Matches based only on “Svensk” in any text might lead to unwanted results, and thus “Svensk” should be treated as common language.

The following example snippet defines "Svensk" as common language:

// …
"ngram": {
    "companies": {
        "source": "ngram/",
        "common": ["Svensk"],
    }
},

Environment Variables

You can use environment variables to define the settings of the squirro section. This is especially helpful to avoid writing the token into the config.json file.

This documentation cannot go into detail on how to set environment variables. Please consult the documentation of your system, such as Bash or Windows PowerShell.

The environment variables that are respected are:

  • SQ_CLUSTER

  • SQ_TOKEN

  • SQ_PROJECT_ID
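
The Python sketch below shows how such settings could be combined with the squirro section of the configuration. Two points are assumptions, not documented behavior: the mapping of each variable to a configuration key is inferred from the names, and the environment is given precedence over the file. The function resolve_squirro_settings is hypothetical.

```python
import os

# Assumed mapping from environment variable to key in the squirro section
ENV_TO_KEY = {
    "SQ_CLUSTER": "cluster",
    "SQ_TOKEN": "token",
    "SQ_PROJECT_ID": "project_id",
}


def resolve_squirro_settings(config, environ=os.environ):
    """Merge the squirro section with values from the environment."""
    settings = dict(config.get("squirro", {}))
    for variable, key in ENV_TO_KEY.items():
        if environ.get(variable):
            # Assumption: a value from the environment overrides the file
            settings[key] = environ[variable]
    return settings


settings = resolve_squirro_settings(
    {"squirro": {"cluster": "https://www.mysquirrocluster.com",
                 "project_id": "MY-PROJECT-ID"}},
    environ={"SQ_TOKEN": "MY-ACCESS-TOKEN"},
)
print(sorted(settings))  # ['cluster', 'project_id', 'token']
```

Passing environ explicitly, as in the usage above, keeps the function easy to test without touching the real environment.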

Custom KEE Pipelet

You can extend the KEE pipelet to gain further control over the KEE processing of the items. Use the following template to get started:

import squirro.sdk.kee.pipelet

class CustomKEEPipelet(squirro.sdk.kee.pipelet.KeePipelet):
    def consume(self, item):
        # Pre-processing of item goes here

        # KEE processing
        item = super().consume(item)

        # Post-processing of item goes here

        return item

Examples

The following sections give a few examples of how to achieve common use cases.

Hierarchy

Hierarchies are created using the hierarchy setting on a source. Tagging of hierarchies is achieved using the parent_keywords setting in the strategy.

The input data here is a CSV file with the following contents (top 3 lines only):

Hierarchy Example

Id  Name           Aliases     ParentId
1   Apple Inc.
2   Google Inc.    Googl|Goog  3
3   Alphabet Inc.  abc.xyz

The KEE config.json file that makes full use of this data can look as follows:

{
    "sources": {
        "demo": {
            "source_type": "csv",
            "source_file": "hierarchy.csv",

            "strategy": "demo",
            "multivalue": ["Aliases"],
            "field_id": "Id",
            "field_matching": ["Name", "Aliases"],

            // The data is hierarchical, with the children declaring their
            // parent (ParentId field points to a valid Id from another row).
            "hierarchy": "ParentId->Id",
        }
    },

    "strategies": {
        "demo": {
            // Score at which the hit is a good one
            "min_score": 0.6,

            // Tag the item with the matched entity and with all of its parents
            "keywords": ["Name -> Name"],
            "parent_keywords": ["Name -> Parent Name"],
        },
    },
}
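
To show what this configuration is meant to produce, the following Python sketch resolves the parents of a matched row from the example data and builds the resulting keywords. It mimics the outcome only and is not KEE's implementation. The functions parent_chain and tag_match are hypothetical.

```python
# Rows as they might be read from hierarchy.csv (Aliases column omitted)
ROWS = [
    {"Id": "1", "Name": "Apple Inc.", "ParentId": ""},
    {"Id": "2", "Name": "Google Inc.", "ParentId": "3"},
    {"Id": "3", "Name": "Alphabet Inc.", "ParentId": ""},
]


def parent_chain(rows, entity_id):
    """Return all ancestors of the entity, nearest parent first."""
    by_id = {row["Id"]: row for row in rows}
    parents = []
    seen = {entity_id}  # Guards against cycles in the data
    parent_id = by_id[entity_id]["ParentId"]
    while parent_id and parent_id in by_id and parent_id not in seen:
        seen.add(parent_id)
        parents.append(by_id[parent_id])
        parent_id = by_id[parent_id]["ParentId"]
    return parents


def tag_match(rows, entity_id):
    """Build the keywords for a match, following the demo strategy."""
    by_id = {row["Id"]: row for row in rows}
    keywords = {"Name": [by_id[entity_id]["Name"]]}
    parent_names = [parent["Name"] for parent in parent_chain(rows, entity_id)]
    if parent_names:
        keywords["Parent Name"] = parent_names
    return keywords


google_keywords = tag_match(ROWS, "2")
apple_keywords = tag_match(ROWS, "1")
print(google_keywords)  # {'Name': ['Google Inc.'], 'Parent Name': ['Alphabet Inc.']}
```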

Custom KEE Pipelet

For certain use cases, you might want to exclude part of the item body when running KEE. You can achieve this in a custom KEE pipelet by modifying the item before and after running KEE. This example runs on the first 100 words of the body.

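import squirro.sdk.kee.pipelet

class CustomKEEPipelet(squirro.sdk.kee.pipelet.KeePipelet):
    def consume(self, item):
        body = item.get('body')
        if not body:
            # Nothing to shorten, so run KEE on the item as it is
            return super().consume(item)

        # Run KEE on the first 100 words of the body only
        item['body'] = ' '.join(body.split(' ')[0:100])
        item = super().consume(item)

        # Restore the full body after tagging
        item['body'] = body
        return item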