KEE CLI Tool Tutorial#

This tutorial guides you through the process of setting up, testing and deploying a KEE using the KEE command line tool.

We will take you through an example that starts simple and grows in complexity as we proceed. While developing the KEE, we will point out the relevant features and parameters that you can use to tune it. The goal is to set up a KEE that uses two sources of structured data and two strategies for matching the names of companies in items. The first strategy will use the company names and suffixes; the second strategy will use the stock ticker symbols of companies.

To follow this tutorial you need

Initial KEE Setup#

For this KEE tutorial, we will do all of our work in a new folder called kee_cli_tutorial. Within this folder we need the following content:

  • The company_list_example.csv (download) file containing the structured company information that we want to match in the text.

  • A directory called fixtures/ where we store the test items (download) that we use during the development of the KEE.

  • The KEE configuration file config.json.

Initial Configuration#

Let’s start by setting up the KEE configuration. In your config.json file, add the sources section. This section configures the list (source) of known entities that we use for the KEE:

{
    "sources": {
        "company_names": {
            "source_type": "csv",
            "source_file": "company_list_example.csv",
            "field_id": "id",
            "field_matching": [
                "name",
                "aliases"
            ],
            "strategy": "companies",
            "hierarchy": "parent_id -> id",
            "multivalue": [
                "aliases:,",
                "industry:,"
            ]
        }
    }
}

The above configuration creates a new source of known entities called "company_names", and loads the CSV file company_list_example.csv which is located in the same folder as the config.json file.

The field_id field points to the column id, being the unique identifier for each entity in the CSV file.

The field_matching field provides a list of all the fields that we want to look for to identify (match) a known entity within an item. In this case, we want to look for references to either the company name, or their aliases in the items in our Squirro project, so we include both fields in a list.

The hierarchy field indicates that there is a hierarchy within the entities in the CSV file, where the value in the parent_id column of one entity points to the id of that entity’s parent entity.
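As a rough illustration of what the hierarchy setting expresses, the following Python sketch resolves the parent of each entity from a parent_id -> id rule. The sample rows and the helper name resolve_parents are made up for this example, and the sketch does not claim to show how the KEE performs the lookup internally.

```python
import csv
import io

# Hypothetical sample rows; the real company_list_example.csv may look different.
SAMPLE_CSV = """id,name,parent_id
1,Alphabet Inc.,
2,Google Inc.,1
"""


def resolve_parents(rows, hierarchy="parent_id -> id"):
    """Map each entity id to its parent row (or None) for a 'child -> parent' rule."""
    child_col, parent_col = [part.strip() for part in hierarchy.split("->")]
    # Index all rows by the column that the child column points to.
    by_key = {row[parent_col]: row for row in rows}
    return {row["id"]: by_key.get(row[child_col]) for row in rows}


rows = list(csv.DictReader(io.StringIO(SAMPLE_CSV)))
parents = resolve_parents(rows)
print(parents["2"]["name"])  # Alphabet Inc.
print(parents["1"])          # None, because the parent_id column is empty
```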

The field "strategy": "companies" refers to the strategy that is used by this source. We will set up the companies strategy next.

Initial Strategy#

Having configured the source of known entities, we want to create our first strategy for matching the known entities within each item.

We define the following strategy under the strategies section in the config.json file:

{
    "sources": {
        "company_names": {
            "source_type": "csv",
            "source_file": "company_list_example.csv",
            "field_id": "id",
            "field_matching": [
                "name",
                "aliases"
            ],
            "strategy": "companies",
            "hierarchy": "parent_id -> id",
            "multivalue": [
                "aliases:,",
                "industry:,"
            ]
        }
    },

    "strategies": {
        "companies": {
            "spellfix": true,
            "tokenizer": "default",
            "min_score": 0.8,
            "keywords": [
                "name -> company",
                "sector",
                "industry",
                "ticker"
            ],
            "parent_keywords": [
                "name -> parent_company"
            ],
            "clean_keywords": [
                "company",
                "sector",
                "industry",
                "parent_company",
                "ticker"
            ]
        }
    }
}

The configuration above creates a strategy called companies for identifying entities and applies this strategy to the source company_names.

We also set a few basic parameters for this new strategy, such as the minimum score (min_score) required to produce a match, which we set to 0.8 (make sure that it’s a number (float) and not in quotes). For more detail and configuration options refer to the Configuration Reference.
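Small mistakes in config.json, such as a trailing comma, a quoted min_score, or a source pointing to a strategy that does not exist, are easy to make. The following Python sketch shows one way to catch them before compiling. The function name check_config and the inline configuration are illustrative assumptions; the KEE tool may perform its own, different validation.

```python
import json

# Inline stand-in for config.json, reduced to the fields checked below.
CONFIG_TEXT = """
{
    "sources": {
        "company_names": {"strategy": "companies"}
    },
    "strategies": {
        "companies": {"min_score": 0.8}
    }
}
"""


def check_config(config):
    """Return a list of human-readable problems found in the configuration."""
    problems = []
    strategies = config.get("strategies", {})
    for name, source in config.get("sources", {}).items():
        if source.get("strategy") not in strategies:
            problems.append(f"source {name!r} uses an undefined strategy")
    for name, strategy in strategies.items():
        score = strategy.get("min_score")
        # bool is a subclass of int in Python, so it is excluded explicitly.
        is_number = isinstance(score, (int, float)) and not isinstance(score, bool)
        if score is not None and not is_number:
            problems.append(f"strategy {name!r}: min_score must be a number")
    return problems


# json.loads raises json.JSONDecodeError on syntax errors such as trailing commas.
config = json.loads(CONFIG_TEXT)
print(check_config(config))  # []
```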

Finally, we set the configurations related to the keywords that items are tagged with when the KEE finds a matching entity. The list under keywords instructs the KEE to tag the item with the keywords

  • company

  • sector

  • industry

  • ticker.

Each keyword must exactly match the name of a header column in the data source (CSV file). If you want to tag items under a different keyword name, use the arrow (->), as done for the name header column, which is assigned to the keyword company (see the Keywords section of the Configuration Reference for details).

In the list parent_keywords we instruct the KEE to tag the item with a keyword parent_company that holds the company name - defined in the column with header name in the CSV file - of the parent entity (following from the hierarchy section of the sources configuration above).
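The mapping from CSV columns to keywords can be pictured with a short Python sketch. The helper names parse_keyword_specs and build_tags, as well as the sample entity values, are hypothetical and only mimic the behavior described above; they are not part of the KEE.

```python
def parse_keyword_specs(specs):
    """Turn entries like 'name -> company' or 'sector' into (column, keyword) pairs."""
    pairs = []
    for spec in specs:
        column, arrow, keyword = spec.partition("->")
        # Without an arrow, the keyword has the same name as the column.
        pairs.append((column.strip(), keyword.strip() if arrow else column.strip()))
    return pairs


def build_tags(entity, specs, parent=None, parent_specs=()):
    """Collect the keywords for a matched entity and, if present, its parent."""
    tags = {}
    for column, keyword in parse_keyword_specs(specs):
        if entity.get(column):
            tags[keyword] = [entity[column]]
    if parent is not None:
        for column, keyword in parse_keyword_specs(parent_specs):
            if parent.get(column):
                tags[keyword] = [parent[column]]
    return tags


# Hypothetical rows; the values in the tutorial CSV may differ.
google = {"name": "Google Inc.", "sector": "Technology", "ticker": "GOOGL"}
alphabet = {"name": "Alphabet Inc."}

tags = build_tags(
    google,
    ["name -> company", "sector", "industry", "ticker"],
    parent=alphabet,
    parent_specs=["name -> parent_company"],
)
print(tags["company"], tags["parent_company"])
```

In this sketch an empty or missing column (industry in the sample row) simply produces no keyword.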

Testing the KEE#

The above configuration is sufficient to run a first test of the KEE. Let’s first compile the lookup database:

squirro_kee compile

Have a look at the newly created folder db/. It contains the lookup database created from the CSV input file.

Note

It is important to remember that any time we make changes to the config.json file, we have to recompile the lookup database for those changes to take effect.

You can rerun the squirro_kee compile command. The test and upload commands also trigger the compile command by default (see the Tool Reference for more details).

For testing our KEE we need test items located in the fixtures/ directory. For this tutorial, the test items are already provided. A test item contains the item field and a keywords field (not to be confused with the keywords field of the item itself). Under keywords we specify the keywords and the respective values that we expect to be tagged by our KEE:

{
    "item": {
        // content omitted for display purposes. See the original .json file for details.
    },
    "keywords": {
        "company": [
            "Google Inc.",
            "Apple Inc.",
            "Exxon Mobil Corporation",
            "Starbucks Corporation",
            "Meta Platforms Inc."
        ],
        "parent_company": [
            "Alphabet Inc."
        ]
    }
}

For this test item, we expect the item to be tagged with six keyword values: five company names and the name of one parent company.

Let’s now run the test command:

squirro_kee -v test

You should see the following output in your console:

- Running fixture fixture-2
-    0 (100%) correct results: []
-    0 (  0%) missed results: []
-    0 (  0%) extra results: []
- Running fixture fixture-1
-    0 (  0%) correct results: []
-    6 (100%) missed results:
    ['Alphabet Inc.',
     'Apple Inc.',
     'Exxon Mobil Corporation',
     'Google Inc.',
     'Meta Platforms Inc.',
     'Starbucks Corporation']
-    0 (  0%) extra results: []
- Processed 2 fixtures
-    0 (  0%) correct results
-    6 (100%) missed results
-    0 (  0%) extra results

100% missed results! What happened here? To investigate the behavior of the KEE in more detail, we can run the test command again with increased verbosity (-vv):

squirro_kee -vv test

and browse through the output in our console. Another option is to specifically trace the output for certain entities using the --trace option of the test command. For example, for the entity "Exxon Mobil Corporation" you can use:

squirro_kee test --trace "Exxon Mobil Corporation"

and find the following lines in the output:

Scoring name 'Exxon Mobil Corporation' from tokens [Token(value='exxon', score=1.0,
stopword=False, tags=set()), Token(value='mobil', score=1.0, stopword=False, tags=set()),
Token(value='corporation', score=1.0, stopword=False, tags=set())]

Overlap score for [Token(value='exxon', score=1.0, stopword=False, tags=set()),
Token(value='mobil', score=1.0, stopword=False, tags=set()), Token(value='corporation',
score=1.0, stopword=False, tags=set())] is 2.00/3.00 = 0.67

Name [Token(value='exxon', score=1.0, stopword=False, tags=set()),
Token(value='mobil', score=1.0, stopword=False, tags=set()), Token(value='corporation',
score=1.0, stopword=False, tags=set())] has 2 exact matches: [Token(value='exxon',
score=1.0, stopword=False, tags=set()), Token(value='mobil', score=1.0, stopword=False,
tags=set())] (0.6666666666666666)

Name score (overlap_score=0.6666666666666666, name_tokens=3.0/3): 0.6666666666666666

From the output we see that the score of 0.67 is below the min_score threshold of 0.8 that we defined in the configuration. Thus, the entity "Exxon Mobil Corporation" is not considered a match.
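The arithmetic behind this trace can be reproduced with a deliberately simplified Python sketch. It only counts exact token matches; the function name overlap_score and the item tokens are assumptions, and the real scoring also takes token scores, stopwords and spelling corrections into account.

```python
def overlap_score(name_tokens, item_tokens):
    """Fraction of the entity's name tokens that occur in the item."""
    matched = [token for token in name_tokens if token in item_tokens]
    return len(matched) / len(name_tokens)


name_tokens = ["exxon", "mobil", "corporation"]
# Hypothetical item text in which the word "corporation" does not appear.
item_tokens = {"shares", "of", "exxon", "mobil", "rose"}

score = overlap_score(name_tokens, item_tokens)
print(round(score, 2))  # 0.67
print(score >= 0.8)     # False: below the configured min_score
print(score >= 0.6)     # True: a lower threshold would accept the match
```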

We can do a quick check by changing the min_score value to 0.6 in our configuration. Save the config.json file and run the test command one more time. The summary at the end of the output should show:

...
1 ( 16%) correct results: ['Exxon Mobil Corporation']
-    1 ( 16%) correct results: ['Exxon Mobil Corporation']
5 ( 83%) missed results:
        ['Alphabet Inc.',
        'Apple Inc.',
        'Google Inc.',
        'Meta Platforms Inc.',
        'Starbucks Corporation']
...
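The counts in the summary follow from comparing the expected keyword values with the values the KEE produced. Here is a minimal Python sketch of that bookkeeping. It assumes, based only on the output shown above, that the counts are reported relative to the number of expected values and that percentages are truncated to whole numbers. The function names are made up for illustration.

```python
def summarize(expected, found):
    """Split results into correct, missed and extra values."""
    expected, found = set(expected), set(found)
    return {
        "correct": expected & found,
        "missed": expected - found,
        "extra": found - expected,
    }


def percent(count, total):
    """Whole-number percentage, truncated; 0 when there is nothing to compare."""
    return 100 * count // total if total else 0


expected = [
    "Alphabet Inc.",
    "Apple Inc.",
    "Exxon Mobil Corporation",
    "Google Inc.",
    "Meta Platforms Inc.",
    "Starbucks Corporation",
]
found = ["Exxon Mobil Corporation"]

summary = summarize(expected, found)
for label, values in summary.items():
    print(f"{len(values)} ({percent(len(values), len(expected))}%) {label} results")
# 1 (16%) correct results
# 5 (83%) missed results
# 0 (0%) extra results
```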

Improving the KEE#

Tuning the min_score value is one approach to improving the matching quality of our KEE. However, there are other approaches that we can explore, such as using a company suffix list.

A suffix list is used by the KEE strategy to ignore common suffixes in the matching process (see Suffix List for more details).
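To see why ignoring suffixes helps, consider a simplified Python sketch in which a name is scored by the fraction of its tokens found in the item. The helper names, the token lists and the single-token treatment of suffixes are assumptions for illustration; the KEE's own normalization and scoring are more involved.

```python
def strip_suffix_tokens(name_tokens, suffix_tokens):
    """Drop tokens that appear in the suffix list (single-token suffixes only)."""
    return [token for token in name_tokens if token not in suffix_tokens]


def overlap(name_tokens, item_tokens):
    """Fraction of name tokens that also occur in the item."""
    if not name_tokens:
        return 0.0
    return sum(token in item_tokens for token in name_tokens) / len(name_tokens)


# Hypothetical tokens, assumed to be normalized in the same way.
suffix_tokens = {"inc", "ltd", "limited"}
item_tokens = {"apple", "announced", "a", "new", "phone"}
name_tokens = ["apple", "inc"]

before = overlap(name_tokens, item_tokens)
after = overlap(strip_suffix_tokens(name_tokens, suffix_tokens), item_tokens)
print(before, after)  # 0.5 1.0
```

With the suffix ignored, an item that mentions only "Apple" can reach a score above the threshold, whereas the full name would stay below it.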

In our configuration file we add a new section suffix_list with the suffixes that should be ignored. Additionally, we need to point to the added suffix list in our companies strategy in the strategies section: "suffix_list": "company". Here is the updated configuration:

{
    "sources": {
        "company_names": {
            "source_type": "csv",
            "source_file": "company_list_example.csv",
            "field_id": "id",
            "field_matching": [
                "name",
                "aliases"
            ],
            "strategy": "companies",
            "hierarchy": "parent_id -> id",
            "multivalue": [
                "aliases:,",
                "industry:,"
            ]
        }
    },

    "strategies": {
        "companies": {
            "spellfix": true,
            "tokenizer": "default",
            "suffix_list": "company",
            "min_score": 0.6,
            "keywords": [
                "name -> company",
                "sector",
                "industry",
                "ticker"
            ],
            "parent_keywords": [
                "name -> parent_company"
            ],
            "clean_keywords": [
                "company",
                "sector",
                "industry",
                "parent_company",
                "ticker"
            ]
        }
    },
    "suffix_list": {
        "company": {
            "GLOBAL": [
                "LTD.",
                "LIMITED",
                "INC.",
                "GMBH.",
                "CO.",
                "AG.",
                "Competitor",
                "Worldwide",
                "International"
            ],
            "GBR": [
                "CIC",
                "CIO",
                "LLP",
                "LP",
                "Ltd.",
                "Cyf",
                "plc",
                "Ccc",
                "LIMITED",
                "CO LTD",
                "L P"
            ],
            "USA": [
                "NA",
                "NT&SA",
                "NCUA",
                "LP",
                "LLP",
                "LLLP",
                "LLC",
                "PLLC",
                "Corp.",
                "Inc.",
                "Co.",
                "Company",
                "S.p.A.",
                "and Company",
                "and Co.",
                "& Company",
                "& Co.",
                "PC",
                "DBA",
                "L.L.C.",
                "L.L.P.",
                "L.P.",
                "CORPORATION",
                "LTD",
                "INCORPORATED",
                "LIMITED",
                "LIMITED PARTNERSHIP",
                "GENERAL PARTNERSHIP"
            ]
        }
    }
}

Save the configuration and run the test command:

squirro_kee -v test

The following output is displayed in your terminal:

- Running fixture fixture-2
Consuming item fixture-2
  -    0 (100%) correct results: []
  -    0 (  0%) missed results: []
  -    1 (100%) extra results: ['Air T Inc.']
- Running fixture fixture-1
Consuming item fixture-1
  -    5 ( 83%) correct results:
        ['Alphabet Inc.',
         'Apple Inc.',
         'Exxon Mobil Corporation',
         'Google Inc.',
         'Meta Platforms Inc.']
  -    1 ( 16%) missed results: ['Starbucks Corporation']
  -    0 (  0%) extra results: []
- Processed 2 fixtures
  -    5 ( 83%) correct results
  -    1 ( 16%) missed results
  -    1 ( 16%) extra results

Adding the company suffix list increased the number of correct matches. However, we still miss one match and produce an extra match in our test item fixture-2. Let us first investigate the extra match.

You can run the test command with increased verbosity (-vv). Searching for the term air in the output in your terminal, you can see that the token air is matched and the KEE tags the item with the company name Air T Inc..

To improve tagging accuracy and prevent the KEE from tagging items solely based on common words, we can make use of a language model (ngram). See the documentation on Language Handling (ngram) for more details. For this tutorial you can use the ngram model provided here. Store it in your KEE project in a folder named ngram. To make use of the language model, we have to define the ngram section in the configuration file and reference that configuration from our strategy with "ngram": "companies":

{
    "sources": {
        // omitted for brevity
    },

    "strategies": {
        "companies": {
            "spellfix": true,
            "tokenizer": "default",
            "suffix_list": "company",
            "ngram": "companies",
            "min_score": 0.6,
            "keywords": [
                "name -> company",
                "sector",
                "industry",
                "ticker"
            ],
            "parent_keywords": [
                "name -> parent_company"
            ],
            "clean_keywords": [
                "company",
                "sector",
                "industry",
                "parent_company",
                "ticker"
            ]
        }
    },
    "ngram": {
        "companies": {
            "source": "ngram/"
        }
    },
    "suffix_list": {
        //omitted for brevity
    }
}

Save the configuration and run the test command. In the output, observe that the extra match has vanished:

- Running fixture fixture-2
Consuming item fixture-2
  -    0 (100%) correct results: []
  -    0 (  0%) missed results: []
  -    0 (  0%) extra results: []
- Running fixture fixture-1
Consuming item fixture-1
  -    4 ( 66%) correct results:
        ['Alphabet Inc.',
         'Exxon Mobil Corporation',
         'Google Inc.',
         'Meta Platforms Inc.']
  -    2 ( 33%) missed results: ['Apple Inc.', 'Starbucks Corporation']
  -    0 (  0%) extra results: []
- Processed 2 fixtures
  -    4 ( 66%) correct results
  -    2 ( 33%) missed results
  -    0 (  0%) extra results

Compared to the test with the configuration before adding the ngram model, our KEE now misses Apple Inc. in addition to the previously missed Starbucks Corporation.

Let’s first check why the KEE did not match Starbucks Corporation. Looking at the suffix list in our configuration we see that "Corporation" is not yet included. Add it to the list and rerun the test command. We now have successfully matched Starbucks Corporation.

The tag for Apple Inc. got lost after including the ngram language model. Let’s investigate the behavior of the KEE with respect to the entity Apple Inc.:

squirro_kee test --trace "Apple Inc."

and find the following line in the output:

ngram: penalizing token Token(value='apple', score=0, stopword=False, tags=set())
('apple') due to percentage 0.0002573755536119994 (old score: 1.0, new score: 0)

Since apple is a common English word, the token got penalized, preventing the KEE from matching the company name. One way to avoid penalizing specific tokens is to put them on a whitelist in the ngram configuration:

{
    // ... omitted for brevity
    "ngram": {
        "companies": {
            "source": "ngram/",
            "whitelist": ["Apple"]
        }
    },
    // ... omitted for brevity
}

Add the whitelist to the configuration and save it. Again, run the test command. Now we have 100% correct results, none missed and no extra matches.
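The effect of the whitelist can be sketched in a few lines of Python. The frequency for apple is taken from the trace output above. The threshold value, the all-or-nothing penalty, the case-insensitive whitelist comparison and the function name ngram_scores are assumptions made for this illustration and may not reflect how the KEE computes the penalty.

```python
def ngram_scores(tokens, frequencies, threshold, whitelist=()):
    """Score a token 0.0 if it is a common word, unless it is whitelisted."""
    allowed = {word.lower() for word in whitelist}
    scores = {}
    for token in tokens:
        is_common = frequencies.get(token, 0.0) > threshold
        scores[token] = 0.0 if is_common and token not in allowed else 1.0
    return scores


# 'apple' uses the percentage from the trace; the 'exxon' value is hypothetical.
frequencies = {"apple": 0.0002573755536119994, "exxon": 0.0}
THRESHOLD = 0.0001  # hypothetical cut-off for "common" words

without = ngram_scores(["apple", "exxon"], frequencies, THRESHOLD)
with_whitelist = ngram_scores(
    ["apple", "exxon"], frequencies, THRESHOLD, whitelist=["Apple"]
)
print(without["apple"], with_whitelist["apple"])  # 0.0 1.0
```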

Deploy the KEE#

Now it is time to deploy the KEE on your server and use it as a pipeline step in your project. If you don’t have a project for testing the KEE yet, set one up now.

To connect to the server, the configuration needs a "squirro" section holding the relevant connection information. Furthermore, we add a "kee" section with the name of the pipelet that is created upon upload and a version number:

"kee": {
    "pipelet": "Company KEE Example (CLI Tutorial)",
    "version": 2
},
"squirro": {
    "cluster": "<YOUR_CLUSTER>",
    "token": "<YOUR_TOKEN>",
    "project_id": "<YOUR_PROJECT_ID>"
}
// ... rest of the configuration omitted for brevity

Insert your cluster, token and project_id in the respective fields. Now you can run the upload command:

squirro_kee -v upload
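A placeholder left in the "squirro" section is an easy mistake to make before uploading. The following Python sketch shows one way to spot it. The function name unfilled_fields and the example values are hypothetical, and the check is not part of the squirro_kee tool.

```python
import re

# Matches placeholder values such as <YOUR_TOKEN>.
PLACEHOLDER = re.compile(r"^<[A-Z_]+>$")


def unfilled_fields(config):
    """Return the connection fields that are missing or still placeholders."""
    squirro = config.get("squirro", {})
    return [
        key
        for key in ("cluster", "token", "project_id")
        if not squirro.get(key) or PLACEHOLDER.match(str(squirro[key]))
    ]


template = {
    "squirro": {
        "cluster": "<YOUR_CLUSTER>",
        "token": "<YOUR_TOKEN>",
        "project_id": "<YOUR_PROJECT_ID>",
    }
}
# Made-up values, only to show a filled-in configuration.
filled = {
    "squirro": {
        "cluster": "https://squirro.example.com",
        "token": "abc123",
        "project_id": "project-1",
    }
}

print(unfilled_fields(template))  # ['cluster', 'token', 'project_id']
print(unfilled_fields(filled))    # []
```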

On the server, navigate to the PIPELINE tab and edit the pipeline to which you want to add the uploaded KEE step (pipelet). You can easily find the step by searching for it in the pipeline editor. Drag the Company KEE Example (CLI Tutorial) step to the Relate field of the pipeline.

Save the pipeline. Now the KEE step is part of the pipeline and is applied to any newly indexed item that runs through this pipeline. You can create a new data source and load the items provided here. Make sure to select the pipeline workflow that contains your KEE when creating the data source.

If you want to apply the KEE step to existing items, you have to rerun the KEE step for those items. See Pipeline Reruns for information on rerunning an individual pipeline step.

Including Additional Strategies, Sources and Custom Filters#

You can have multiple strategies and structured data sources within a single KEE configuration. As a simple example, we will add a CSV source containing the stock tickers and company names and define a new strategy to match known entities on the stock ticker.

Download the CSV file and store it in the root of your KEE project.

In the sources section of your configuration, add:

"company_tickers": {
    "source_type": "csv",
    "source_file": "company_stocktickers_example.csv",
    "field_id": "id",
    "field_matching": [
        "ticker"
    ],
    "strategy": "stock_ticker"
}

Note

The IDs of the known entities (provided under field_id) must be unique across all sources.
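Because duplicate IDs across sources are easy to overlook, a quick check like the following Python sketch can help. The CSV contents are made-up stand-ins for the two tutorial files and the function name find_duplicate_ids is hypothetical. In practice you would read the actual source files instead of inline strings.

```python
import csv
import io
from collections import defaultdict


def find_duplicate_ids(sources, field_id="id"):
    """Return the IDs that occur more than once, with the sources they occur in."""
    seen = defaultdict(list)
    for source_name, text in sources.items():
        for row in csv.DictReader(io.StringIO(text)):
            seen[row[field_id]].append(source_name)
    return {entity_id: names for entity_id, names in seen.items() if len(names) > 1}


# Hypothetical contents; the ID 2 is deliberately reused in both sources.
sources = {
    "company_names": "id,name\n1,Apple Inc.\n2,Google Inc.\n",
    "company_tickers": "id,ticker\n2,AAPL\n3,GOOGL\n",
}

print(find_duplicate_ids(sources))  # {'2': ['company_names', 'company_tickers']}
```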

Looking at the values in the ticker column in the CSV file, we see that these are all uppercase strings. For accurate matching of the stock tickers, we can define a custom filter that converts all tokens in the text content to uppercase strings. See the documentation for more details. To include a custom filter in our KEE strategy, we first have to implement it.

Create the file tokenizers.py in the folder of your KEE project. Add the following content to that file:

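from squirro.sdk.kee.lib.tokenizers import Filter


class CustomuppercaseFilter(Filter):
    """Convert the text of every token to uppercase."""

    def __call__(self, tokens):
        # Tokens are processed lazily: each one is modified in place and passed on.
        for token in tokens:
            token.text = token.text.upper()
            yield token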
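To try out the filter logic without the Squirro SDK installed, you can run the same loop against a stand-in token class. The classes Token and UppercaseFilterSketch below are hypothetical and model only the text attribute used above. The real SDK token type and Filter base class may carry more state.

```python
class Token:
    """Stand-in for the SDK token type; only the text attribute is modelled."""

    def __init__(self, text):
        self.text = text


class UppercaseFilterSketch:
    """Same logic as the custom filter, without the SDK base class."""

    def __call__(self, tokens):
        for token in tokens:
            token.text = token.text.upper()
            yield token


tokens = [Token("aapl"), Token("Shares"), Token("rose")]
result = [token.text for token in UppercaseFilterSketch()(tokens)]
print(result)  # ['AAPL', 'SHARES', 'ROSE']
```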

You can add the custom filter to the strategy as "filters": ["customuppercase"]. Add a new stock_ticker strategy to the strategies in your configuration file:

"stock_ticker": {
    "tokenizer": "default",
    "filters": ["customuppercase"],
    "keywords": [
        "name -> company",
        "ticker"
    ],
    "clean_keywords": [
        "company",
        "ticker"
    ]
}

Let’s test the new strategy on this simple test item. Add it to the fixtures/ directory in your KEE project.

Now run the test command. In the new test item, the company Amazon Inc. is matched based on its ticker AMZN.

If all tests look good, upload the updated KEE to the server. The KEE pipelet in your pipeline is automatically updated with the new configuration.

Try Yourself#

Increase the min_score value back to 0.8 and save the configuration. Can you tune the companies strategy to get all matches correct?

  • Check the ngram settings

  • Check the company suffixes

We can achieve the tagging based on the stock ticker without adding the extra source and strategy. What do we need to change in the company_names source and companies strategy to achieve this?