# Tokenizers and Filters
In a KEE strategy, tokenizers and filters process the entity names and the input text to make the two comparable to each other.
Tokenizers and filters define a set of rules that split up the entity names and input text. The result is a sequential stream of tokens that are compared to each other.
A few examples:
| Input (entity name or text) | Tokenizer | Filter | Resulting Tokens |
|---|---|---|---|
| McDonalds corporation | default | lowercase | mcdonalds corporation |
| McDonalds corporation | default | lowercase, initials, singular | mc donald corporation |
| Crédit Agricole | default | lowercase, accents | credit agricole |
| Credit Agricole | default | lowercase, accents | credit agricole |
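The pipeline in the table can be sketched in plain Python. This is an illustrative approximation only, not the actual Squirro implementation; the real tokenizers and filters are built on Whoosh's analysis module:

```python
import re

def default_tokenizer(text):
    # Rough stand-in for the built-in "default" tokenizer:
    # split on common word boundaries.
    return re.findall(r"\w+", text, re.UNICODE)

def lowercase_filter(tokens):
    # The "lowercase" filter makes matching case-insensitive.
    return [t.lower() for t in tokens]

tokens = lowercase_filter(default_tokenizer("McDonalds corporation"))
print(tokens)  # ['mcdonalds', 'corporation']
```

Because both the entity names and the input text pass through the same pipeline, the resulting token streams can be compared directly.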
## Built-in Tokenizers
The following tokenizers are offered out-of-the-box.
default
: Splits the input on common word boundaries, sentences, etc.

brackets
: Removes any trailing brackets. This may be useful when entity names carry descriptions in brackets or parentheses that should be ignored, e.g. “Acme Inc. (Parts supplier)”.
## Built-in Filters
The following filters are available out-of-the-box with Squirro Known Entity Extraction:
camelcase
: Returns one token for each camel case component. Camel case is the practice of mixing upper and lower case letters in the same word (e.g. TechCrunch, JPMorgan). With this filter, matching does not distinguish between writing those components together or separately, so “JP Morgan”, “JPMorgan”, and “JpMorgan” all correctly match the entity “JPMorgan”. To use this filter, it must be listed before the lowercase filter.

initials
: Combines one-letter initials together, so writing “JP Morgan”, “J & P Morgan”, or “J.P. Morgan” all have the same effect. To use this filter, it must be listed before the lowercase filter.

lowercase
: Converts the text to lowercase, making the matching case-insensitive.

singular
: A very basic singular filter that works by removing trailing “s” letters from longer words. With this filter, writing “MacDonald” and “MacDonalds” has the same effect.

accents
: Normalizes accents and umlauts, so “Crédit Agricole” and “Credit Agricole” match each other.

stem
: Uses a Porter stemmer to reduce each word to its stem (e.g. “waiting” → “wait”). This is generally not recommended for data sources with proper names, but may be useful for generic language concepts.
By default, only the `lowercase` filter is applied.
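The effect of the accents filter can be approximated with Python's standard `unicodedata` module. This is a sketch of the behavior, not the Squirro implementation:

```python
import unicodedata

def accents_filter(tokens):
    # Decompose each character (NFKD), then drop the combining
    # marks, so "crédit" becomes "credit".
    return [
        unicodedata.normalize("NFKD", t).encode("ascii", "ignore").decode("ascii")
        for t in tokens
    ]

print(accents_filter(["crédit", "agricole"]))  # ['credit', 'agricole']
```

Applying the same normalization to both entity names and input text is what lets “Crédit Agricole” and “Credit Agricole” produce identical token streams.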
## Custom Filters
You can also define custom filters. Adding custom tokenizers is not currently possible, but this can usually be worked around by using a filter instead.
Filters and tokenizers are implemented in Python using the analysis module of Whoosh. To add custom filters, create a file called `tokenizers.py` in the same folder as the KEE `config.json` file.
Any filters declared in that file can then be referenced in the `filters` setting of a strategy. For a full reference on how to write filters, refer to the Whoosh documentation, specifically “About analyzers” and the analysis module.
As an example, the following filter implements a custom stopword filter that omits some tokens from the input stream.
```python
from squirro.sdk.kee.lib.tokenizers import Filter

STOPLIST = [
    "remove",
    "this",
    "token",
]

class CustomstopwordsFilter(Filter):
    def __call__(self, tokens):
        for token in tokens:
            if token.text not in STOPLIST:
                yield token
```
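The filter can be exercised in isolation with a minimal stand-in for the token stream. `Token` below is a hypothetical substitute for Whoosh's token objects, and the `Filter` base class is omitted so the sketch runs without the Squirro SDK:

```python
from dataclasses import dataclass

@dataclass
class Token:
    text: str

STOPLIST = ["remove", "this", "token"]

class CustomstopwordsFilter:
    # Same filtering logic as the custom filter above.
    def __call__(self, tokens):
        for token in tokens:
            if token.text not in STOPLIST:
                yield token

stream = (Token(t) for t in ["keep", "remove", "this", "token", "me"])
kept = [tok.text for tok in CustomstopwordsFilter()(stream)]
print(kept)  # ['keep', 'me']
```

Note that the filter both consumes and yields token objects rather than plain strings, so it can be chained with the built-in filters.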
To include the custom filter in the KEE, add it to the `filters` list in the `strategies` section of the `config.json` file:
```json
{
    //…
    "strategies": {
        "example": {
            "tokenizer": "default",
            "filters": ["lowercase", "customstopwords"]
        }
    }
}
```