Tokenizers and Filters#

In a KEE strategy, tokenizers and filters process the entity names and the input text to make the two comparable to each other.

Tokenizers and filters define a set of rules that split up and normalize the entity names and the input text. Each is turned into a sequential stream of tokens, and matching compares the token stream of an entity name with the token stream of the input text.

A few examples:

| Input (entity name or text) | Tokenizer | Filter                        | Resulting Tokens      |
|-----------------------------|-----------|-------------------------------|-----------------------|
| McDonalds corporation       | default   | lowercase                     | mcdonalds corporation |
| McDonalds corporation       | default   | lowercase, initials, singular | mc donald corporation |
| Crédit Agricole             | default   | lowercase, accents            | credit agricole       |
| Credit Agricole             | default   | lowercase, accents            | credit agricole       |
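To make the idea of a token stream concrete, the following is a minimal sketch of a tokenizer followed by a chain of filters, written in plain Python. It is not the KEE implementation: `default_tokenizer`, `lowercase`, `accents` and `analyze` are hypothetical stand-ins, and the tokenizer simply splits on whitespace.

```python
import unicodedata


def default_tokenizer(text):
    """Stand-in tokenizer (assumption): split on whitespace only."""
    return text.split()


def lowercase(tokens):
    """Stand-in for the lowercase filter."""
    for token in tokens:
        yield token.lower()


def accents(tokens):
    """Stand-in for the accents filter: drop combining marks after decomposition."""
    for token in tokens:
        decomposed = unicodedata.normalize("NFKD", token)
        yield "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def analyze(text, filters):
    """Tokenize the text, then pass the token stream through each filter in order."""
    tokens = default_tokenizer(text)
    for apply_filter in filters:
        tokens = apply_filter(tokens)
    return list(tokens)


# The entity name and the input text are processed the same way, then compared.
entity = analyze("Crédit Agricole", [lowercase, accents])
text = analyze("Credit Agricole", [lowercase, accents])
print(entity, text, entity == text)
```

Because both sides go through the same chain, the accented and unaccented spellings end up as identical token lists.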

Built-in Tokenizers#

The following tokenizers are offered out-of-the-box.

  • default: Splits the input on common word boundaries, sentences, etc.

  • brackets: Removes any trailing brackets. This may be useful in a context where the entity names have descriptions in brackets or parentheses that should be ignored, e.g. “Acme Inc. (Parts supplier)”.

Built-in Filters#

The following filters are available out-of-the-box with Squirro Known Entity Extraction:

  • camelcase: Returns one token for each camel case component. Camel case means mixing upper and lower case letters in the same word (e.g. TechCrunch, JPMorgan). With this filter, matching does not distinguish between writing the components together or separately, so “JP Morgan”, “JPMorgan” and “JpMorgan” all correctly match the entity “JPMorgan”. This filter must be listed before the lowercase filter.

  • initials: Combines one-letter initials together. This way, writing “JP Morgan”, “J & P Morgan” or “J.P. Morgan” all have the same effect. This filter must be listed before the lowercase filter.

  • lowercase: Converts the text into lowercase, thus making the matching case insensitive.

  • singular: A very basic singular filter that works by removing trailing s-letters from longer words. When this is used, writing “MacDonald” and “MacDonalds” has the same effect.

  • accents: Normalizes accents and umlauts. With this filter, “Crédit Agricole” and “Credit Agricole” match each other.

  • stem: Uses a Porter stemmer to reduce a word to its stem (e.g. ‘waiting’ → ‘wait’). This is generally not recommended for data sources with proper names, but may be useful for generic language concepts.

Default: by default only the lowercase filter is applied.
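The ordering requirement for camelcase can be illustrated with a small sketch. The code below is an assumption about how such a filter might work, not the actual KEE code: `CAMEL_PARTS`, `camelcase`, `lowercase` and `run` are hypothetical names, and tokens are split on whitespace for simplicity. It shows that once the text is lowercased, the case information needed to split camel case words is gone.

```python
import re

# Hypothetical pattern: an acronym followed by a capitalized word, a
# capitalized or lowercase word, a trailing acronym, or a run of digits.
CAMEL_PARTS = re.compile(r"[A-Z]+(?=[A-Z][a-z])|[A-Z]?[a-z]+|[A-Z]+|\d+")


def camelcase(tokens):
    """Emit one token per camel case component."""
    for token in tokens:
        parts = CAMEL_PARTS.findall(token)
        # Fall back to the original token if the pattern finds nothing.
        yield from parts or [token]


def lowercase(tokens):
    for token in tokens:
        yield token.lower()


def run(text, filters):
    """Split on whitespace, then apply the filters in the given order."""
    tokens = text.split()
    for apply_filter in filters:
        tokens = apply_filter(tokens)
    return list(tokens)


# camelcase before lowercase: all three spellings give the same tokens.
for name in ("JP Morgan", "JPMorgan", "JpMorgan"):
    print(name, run(name, [camelcase, lowercase]))  # ['jp', 'morgan']

# lowercase first: nothing is left to split on.
print(run("JPMorgan", [lowercase, camelcase]))  # ['jpmorgan']
```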

Custom Filters#

You can also define custom filters. Custom tokenizers are not currently supported, but in most cases a custom filter can be used as a workaround.

Filters and tokenizers are implemented in Python using the analysis module of Whoosh. To add custom filters, create a file called tokenizers.py in the same folder as the KEE config.json file.

Any filters declared in that file can then be referenced in the filters setting of a strategy. For a full reference on writing filters, refer to the Whoosh documentation, specifically “About analyzers” and the analysis module.

As an example, the following filter implements a custom stopword filter that omits some tokens from the input stream.

tokenizers.py#
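from squirro.sdk.kee.lib.tokenizers import Filter

# Tokens to drop from the stream. The comparison below is case-sensitive,
# so keep these entries lowercase and list this filter after "lowercase".
STOPLIST = {
    "remove",
    "this",
    "token",
}


class CustomstopwordsFilter(Filter):
    """Omit every token whose text appears in STOPLIST."""

    def __call__(self, tokens):
        for token in tokens:
            if token.text not in STOPLIST:
                yield token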
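The filter logic can be tried out without a KEE installation. The sketch below repeats the filter with two hypothetical stand-ins, a bare `Filter` base class and a `Token` dataclass that only carries a `text` attribute; the real classes provided by Squirro and Whoosh have more to them.

```python
from dataclasses import dataclass


@dataclass
class Token:
    """Hypothetical stand-in for a token; only the text attribute is modeled."""
    text: str


class Filter:
    """Hypothetical stand-in for the Filter base class imported in tokenizers.py."""


STOPLIST = {"remove", "this", "token"}


class CustomstopwordsFilter(Filter):
    def __call__(self, tokens):
        for token in tokens:
            if token.text not in STOPLIST:
                yield token


# Feed a small token stream through the filter and collect what survives.
stream = (Token(word) for word in "please remove this noisy token now".split())
kept = [token.text for token in CustomstopwordsFilter()(stream)]
print(kept)  # ['please', 'noisy', 'now']
```

The filter is a generator, so it consumes and yields tokens lazily, one at a time, like the other filters in the chain.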

To include the custom filter in the KEE, add it to the filters in the strategies section of the config.json file:

{
    //…

    "strategies": {
        "example": {
            "tokenizer": "default",
            "filters": ["lowercase", "customstopwords"]
        }
    }
}
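Since the camelcase and initials filters must be listed before lowercase, it can help to check the filter order of each strategy before deploying a configuration. The following is a rough sketch of such a check; `check_filter_order` and `ORDER_SENSITIVE` are hypothetical helpers that are not part of Squirro, and the "broken" strategy exists only to trigger a finding.

```python
import json

# Filters that, per the descriptions above, must come before "lowercase".
ORDER_SENSITIVE = ("camelcase", "initials")


def check_filter_order(config):
    """Return one message per strategy filter that is listed after lowercase."""
    problems = []
    for name, strategy in config.get("strategies", {}).items():
        filters = strategy.get("filters", [])
        if "lowercase" not in filters:
            continue
        lowercase_at = filters.index("lowercase")
        for filter_name in ORDER_SENSITIVE:
            if filter_name in filters and filters.index(filter_name) > lowercase_at:
                problems.append(
                    f"{name}: '{filter_name}' must be listed before 'lowercase'"
                )
    return problems


# Inline sample instead of reading config.json from disk.
config = json.loads("""
{
    "strategies": {
        "example": {"tokenizer": "default", "filters": ["lowercase", "customstopwords"]},
        "broken": {"tokenizer": "default", "filters": ["lowercase", "camelcase"]}
    }
}
""")

problems = check_filter_order(config)
print(problems)
```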