KEE Studio Plugin
This page explains the configuration of Known Entity Extraction (KEE) through its Studio plugin.
For a step-by-step walkthrough of KEE Studio setup, see the KEE Studio Plugin Tutorial.
Creating a New KEE
To create a new KEE, follow the instructions below:
Open the Setup space in your Squirro project.
Click the AI STUDIO tab.
Click Known Entity Extraction.
Click Plus in the top-right corner to create a new KEE.
Configuration
You can specify the configuration options listed below.
Each option also lists its internal configuration key, which is used in the config.json file and documented in KEE Configuration.
- Name:
Required
The name of the KEE enrichment. It must be unique across the entire server: saving a KEE under an existing name overwrites the enrichment that already has that name.
In config.json corresponds to: kee.pipelet
- Enable community loader:
Only applies to Community Type KEEs.
The community loader is enabled for Community Type KEEs created through the setup of a community type. The data for the lookup database is taken from the community type and kept up to date with it. Submitting a data file is not required for a Community Type KEE.
- Community Type:
Only applies to Community Type KEEs.
Name of the community type to which the Community Type KEE is linked. This is preset for Community Type KEEs.
- KEE data:
CSV or Excel file containing the structured information. The first row must contain column headers. The columns are referred to as fields in the configuration options below. A data file is required if the KEE is not a Community Type KEE.
Example:
sources[<source_name>].source_file (source_name defaults to "upload_source")
- ID field:
Field that is used as the unique ID of each record. IDs are auto-generated when this is left empty. If IDs are provided, the CSV/Excel file must not contain duplicate IDs.
sources[<source_name>].field_id, sources[<source_name>].generate_id
- Matching fields:
Required
Fields from the input KEE data on which the match is executed. Typically the name field, for example the field holding the company name.
sources[<source_name>].field_matching
- Keywords to assign:
Fields for which you want to assign keywords (facets) and tag matched items. Provide each field for which you want to assign a keyword on a separate line. Use the arrow (->) notation to give the keyword a different name than the field.
For example:
Name -> company
industry
This keywords configuration assigns the Name field from the source data to the keyword company. The field industry is assigned to the keyword industry.
Note
The keyword is created automatically if it does not exist yet.
strategies[<strategy_name>].keywords (strategy_name defaults to "basic")
- Minimum score for matches:
The minimum score a candidate must reach to be considered a match. Can be any value between 0 and 1, such as 0.5, 0.9 or 1.0.
strategies[<strategy_name>].min_score
- Enable ngram database:
Enables a default ngram database to improve matching precision for common English terms.
strategies[<strategy_name>].ngram, ngram[default] (the ngram name is always default)
- Enable fuzzy matching:
Allows small spelling mistakes. At most one letter swap is tolerated, so, for example, “Apple” and “Appel” match each other.
strategies[<strategy_name>].spellfix
- Enable company suffix list:
Enables a company-specific suffix list, which strips common company suffixes before company names are matched.
strategies[<strategy_name>].suffix_list
- Config (JSON):
JSON dictionary to customize configuration values. See the Customization section below for an example.
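The rules for KEE data, ID field, and Keywords to assign can be checked locally before a file is uploaded. The following Python sketch is a minimal, hypothetical pre-upload check and is not part of the KEE plugin: `find_duplicate_ids` and `parse_keyword_mapping` are illustrative helper names, and the sample CSV is made up. It assumes a CSV file with a header row, and it only mirrors the arrow notation as described above; the plugin's own parsing may differ in details.

```python
import csv
import io
from collections import Counter


def find_duplicate_ids(csv_text, id_field):
    """Return the sorted ID values that occur more than once (hypothetical helper)."""
    reader = csv.DictReader(io.StringIO(csv_text))
    # The first row must contain the column headers, including the ID field.
    if id_field not in (reader.fieldnames or []):
        raise ValueError(f"Column {id_field!r} is missing from the header row")
    counts = Counter(row[id_field] for row in reader)
    return sorted(value for value, n in counts.items() if n > 1)


def parse_keyword_mapping(text):
    """Map each source field to its keyword name (hypothetical helper).

    One entry per line: "field -> keyword" renames the keyword,
    a bare "field" keeps the field name as the keyword name.
    """
    mapping = {}
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        field, _, keyword = line.partition("->")
        field = field.strip()
        # Fall back to the field name when no keyword name is given.
        mapping[field] = keyword.strip() or field
    return mapping


# Made-up sample data for illustration only.
sample_csv = "id,Name,industry\n1,Apple,Technology\n2,Nestle,Food\n2,Novartis,Pharma\n"
duplicates = find_duplicate_ids(sample_csv, "id")
keywords = parse_keyword_mapping("Name -> company\nindustry")
print(duplicates)
print(keywords)
```

With the sample data, `duplicates` contains the ID `2`, which appears twice, and `keywords` maps `Name` to `company` and `industry` to `industry`.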
Limitations
The following limitations apply to the UI integration of KEE.
Single source only. The advanced version of KEE (on the command line) supports multiple KEE sources in one KEE configuration. The user interface does not, so each KEE configuration can include only one source. To work around this, create multiple KEE configurations, one per source.
Whenever you edit the Known Entity Extraction, the original file must be uploaded again. The file is currently not persisted in its raw form on the server.
The created pipelet cannot be removed. Even when the KEE definition is removed, the pipelet stays around.
Only items indexed after the KEE enrichment has been set up are tagged. This is a general limitation of all enrichments and can be worked around by using the rerun functionality. See Pipeline Reruns to learn more.
Customization
The Config (JSON) field can be filled with a KEE JSON configuration dictionary. If it is defined, then all the configuration values mentioned above in the Configuration section are overwritten, but otherwise the config is used as is. This allows for advanced customisations.
For example, the following configuration provides custom versioning and filters, cleans keywords before rerunning, and specifies the item_fields, including keywords/facets, to run on.
{
    "kee": {
        "version": "2",
        "version_keyword": "kee_companies"
    },
    "strategies": {
        "basic": {
            "filters": ["camelcase", "lowercase", "initials"],
            "clean_keywords": ["company_name", ...]
        }
    },
    "extraction": {
        "item_fields": [
            "title",
            "clean_body",
            "abstract",
            "summary",
            "keywords.your_facet"
        ]
    }
}
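Strict JSON parsers reject trailing commas and placeholder entries, so one way to avoid syntax slips is to build the dictionary in code and serialize it. The following Python sketch assumes the keys from the example above and leaves out its elided entries; the values are illustrative, and whether the Config (JSON) field applies any validation beyond JSON syntax is not covered here.

```python
import json

# Keys mirror the example above; values are illustrative placeholders.
config = {
    "kee": {"version": "2", "version_keyword": "kee_companies"},
    "strategies": {
        "basic": {
            "filters": ["camelcase", "lowercase", "initials"],
            "clean_keywords": ["company_name"],
        }
    },
    "extraction": {
        "item_fields": [
            "title",
            "clean_body",
            "abstract",
            "summary",
            "keywords.your_facet",
        ]
    },
}

# Serialize to text that can be pasted into the Config (JSON) field.
config_json = json.dumps(config, indent=2)

# Round-trip through the parser to confirm the text is strict JSON.
assert json.loads(config_json) == config
print(config_json)
```

Python allows the trailing commas in the dictionary literal, while `json.dumps` never emits them, so the printed text parses with any standard JSON parser.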