Smart Filters#

Profiles: Project Creator, Search User

Smart Filters make it possible to search for concepts, not just keywords.

They are configured by the project creator as outlined below.

Overview#

Smart Filters are trained with text documents. Training documents can be paragraphs of text or entire documents. Various formats from plain text to PDF and Microsoft Word are supported.

The Smart Filter algorithm looks at the frequency of terms in the training documents and compares it with the normal occurrence of these terms in the language (see Global Document Frequencies - GDFS). If a word occurs more often in the training text than expected from the GDFS model, it receives a higher score.
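As a purely illustrative sketch (not Squirro's actual formula), the idea can be pictured as scoring each term by how over-represented it is relative to a GDFS baseline; the `gdfs` dictionary and the scoring ratio below are assumptions for illustration only:

```python
from collections import Counter

def term_scores(training_text, gdfs, floor=1e-6):
    # Score each term by how over-represented it is in the training
    # text compared to its expected GDFS frequency. Terms missing from
    # the language model fall back to a tiny floor frequency, which is
    # why unknown names and misspellings receive very high scores.
    tokens = training_text.lower().split()
    total = len(tokens)
    counts = Counter(tokens)
    return {
        term: (count / total) / max(gdfs.get(term, 0.0), floor)
        for term, count in counts.items()
    }

gdfs = {"the": 0.05, "merger": 0.0001}  # assumed per-language frequencies
scores = term_scores("the merger closed after the merger review", gdfs)
# "merger" scores far higher than common words like "the"
```

Note how a term absent from the model entirely (e.g. "closed" here) scores even higher, which mirrors the behavior described under Improving Relevance below.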

When using a Smart Filter, the index is searched using all the entities that the Smart Filter was trained with. Results are ranked higher based on the number of matching entities and the scores of those entities.

Strictness#

The strictness of a Smart Filter is controlled by the noise level: at 1.0 (the highest noise level), all results that match at least one entity are returned. At a lower level (e.g. 0.2), the matching results are ordered by relevance and low-relevance matches are eliminated from the result set.

Note: The result elimination is not linear; noise level 0.1 is much stricter than 0.2, for example. This is easy to inspect by adjusting the noise level in the Squirro UI.

Searching#

Noise Level#

Concept#

When matching documents against a Smart Filter, each document is compared to the trained concept. The more closely the document matches the concept, the higher its score.

The noise level determines which documents are returned based on their score. The lower the noise level, the more precise the match has to be. When the noise level is set to 1.0 (the highest), all results that match at least one entity are returned. At lower levels, the matching results are ordered by relevance and low-relevance matches are eliminated from the result set. This elimination is not linear: noise level 0.1 is much stricter than 0.2, for example.
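As a purely illustrative sketch (the actual cutoff formula is internal to Squirro), this behavior can be pictured as a threshold on normalized relevance scores that rises non-linearly as the noise level drops; the threshold curve below is an assumption, not the real implementation:

```python
def filter_by_noise(scored_docs, noise_level):
    # scored_docs: list of (doc_id, score) with scores normalized to 0..1.
    # At noise level 1.0 everything that matches at least one entity
    # (score > 0) is kept; lower levels apply an increasingly strict,
    # non-linear cutoff.
    threshold = (1.0 - noise_level) ** 0.5  # illustrative curve only
    return [(doc, s) for doc, s in scored_docs if s > 0 and s >= threshold]

docs = [("a", 0.95), ("b", 0.5), ("c", 0.05)]
# noise level 1.0 keeps all three; 0.2 keeps only the closest match
```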

Finding the Right Noise Level#

Which noise level is right for a given Smart Filter and use case depends on the trade-off between precision and recall. Is it more important that all returned results are relevant, or is it preferable to include everything potentially relevant?

As the default sorting is by relevance, it will often not be readily apparent how a different noise level changes the result set. To understand the relationship between the result set and the noise level for a given Smart Filter, sort the results by date; it then becomes much easier to judge the result relevance.

Setting the Noise Level#

The noise level can be set in three ways:

  • Use the noise level slider in the Smart Filter drop-down

  • Use the same slider in the Smart Filter edit view header

  • Change it directly in the query - the query syntax is smartfilter:SMARTFILTER_NAME:NOISE_LEVEL
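For example, to apply a hypothetical Smart Filter named Mergers with a noise level of 0.2, the query would read:

```
smartfilter:Mergers:0.2
```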

Multiple Smart Filters#

Multiple Smart Filters can be combined in a single query, in which case the filters are combined with an AND operator: only results that match all of the filters are returned. Other Boolean operators are not supported for Smart Filters, so OR and NOT will not work.

To work around this, search tagging can be used with Smart Filters: assign keywords by creating a search tagging rule for each Smart Filter. Those keywords then support Boolean search as described in the Query Syntax documentation.
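For example, assuming two hypothetical Smart Filters whose search tagging rules assign the keyword values topic:mergers and topic:litigation (placeholder names), a Boolean query then becomes possible:

```
topic:mergers OR (topic:litigation AND NOT china)
```

The keyword names here are illustrative; use whatever keywords your search tagging rules actually assign.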

Explain#

Explain mode can be enabled in the Smart Filter drop-down.

With explain mode enabled, each matching document contains details on why it matched the Smart Filter.

Limitations#

At maximum, the most recent 10,000 results are returned for a Smart Filter query. This is due to performance considerations, as the noise level calculations need to be performed for every matching document.

If a project reliably needs more than 10,000 results, combine Smart Filters with search tagging rules.

Training#

Editing Workflow#

When editing a Smart Filter in the user interface, changes are not written to the system until the Save button is pressed. This makes it possible to experiment with adding training documents, excluding tokens, and so on, without affecting the Smart Filter for its users.

Automated Training#

Improving Relevance#

When training Smart Filters from documents, terms that do not appear often (or at all) in the language model gain a lot of importance. This can often happen with names (people, companies, places) or even misspelled words. There are a few strategies to improve the situation when that happens:

  • Increase the min_feature_count configuration setting (see fingerprint.ini). A term then needs to appear more often in the training documents to be considered. The default value of min_feature_count is 1, which means a term is potentially included in the Smart Filter even if it appears only once across all training documents combined.

  • Exclude irrelevant terms. Note that this may simply promote the next-worst term; reducing the number of entities should also be considered.

  • Remove irrelevant content, such as headers or footers, from the training documents. This can often be achieved with enrichments, e.g. a pipelet.

Negative Training#

When training a Smart Filter, items can be added as either positive or negative examples. The negative content trains the Smart Filter to detect content that should be excluded.

The considerations under Improving Relevance above apply even more when using negative training; otherwise you quickly end up with nonsensical concepts.

Max Number of Entities#

By default 30 terms are extracted from the training content. This can be modified in the advanced screen.

When a Smart Filter doesn’t have many relevant terms, excluding terms simply promotes the next-worst term. In those cases (when the concept is smaller than the default), it makes sense to reduce the number of terms.

In other cases, where the top 30 terms are all highly relevant, it makes sense to increase the number and see whether there are further relevant terms to be displayed.

Manual Smart Filters#

The taxonomy for Manual Smart Filters uses a CSV format. The syntax for each entry is:

query, weight, language, label
  • query: the search term. This is the only mandatory value.

  • weight: the importance of the entry amongst all the other terms. The default is 1.

  • language: the language for this entry. If left empty, this uses the default language - which can be defined in the taxonomy screen as well.

  • label: the title of this entry, as displayed to the user by the system. If left empty, the query is displayed.
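For example, a small taxonomy for a hypothetical compliance Smart Filter could look like this (the entries are placeholders for illustration):

```
insider trading,2,en,Insider Trading
money laundering,2,en,Money Laundering
sanction,1,en,Sanctions
délit d'initié,2,fr,Insider Trading
```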

Fingerprint Stemming#

The query must be stemmed based on the rules of the Squirro index.

For example, the following manual Smart Filter does not return any results:

"annual returns",1,en,"annual returns"

Instead, this has to be converted to this entry:

"annual return",1,en,"annual returns"

Having to do this manually is quite cumbersome, and the next iteration of Smart Filters will handle it automatically. In the meantime, the stem_fingerprint.py script can be used to avoid the manual work. Download the script to the Squirro server and execute it on the command line as follows:

python stem_fingerprint.py fingerprint.csv

This assumes that the fingerprint file has been downloaded from the web interface and stored as fingerprint.csv. The script outputs the stemmed queries, and the result can then be pasted into the taxonomy window.
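The real script relies on the stemming rules of the Squirro index; as a rough illustration of the transformation it performs, the sketch below applies a deliberately naive plural stemmer to the query column of each taxonomy row (the stemmer is a placeholder, not Squirro's actual one):

```python
import csv
import io

def naive_stem(term):
    # Deliberately naive English stemmer for illustration only:
    # strip a trailing plural "s" from each word. The Squirro index
    # applies full language-specific stemming rules instead.
    return " ".join(
        w[:-1] if w.endswith("s") and len(w) > 3 else w
        for w in term.split()
    )

def stem_fingerprint_rows(csv_text):
    # Stem the query column (first field) of each taxonomy row,
    # leaving weight, language and label untouched.
    stemmed = []
    for row in csv.reader(io.StringIO(csv_text)):
        if row:
            stemmed.append([naive_stem(row[0])] + row[1:])
    return stemmed

rows = stem_fingerprint_rows('"annual returns",1,en,"annual returns"\n')
# rows[0] -> ["annual return", "1", "en", "annual returns"]
```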

Locking#

Smart Filters can be locked in the advanced screen. Once locked, no changes can be made to the Smart Filter without unlocking it first. Additionally, only project administrators can change the lock status of a Smart Filter.

This is a good way of ensuring that Smart Filters that are important to a project can’t be changed inadvertently. Consider locking Smart Filters that are used for dashboards or search taggings.

Tags#

In the advanced properties of a Smart Filter, tags can be assigned. Tags are used by the bulk scorer to limit which Smart Filters are exported. They are also exposed in the API, so this property can be used to organize Smart Filters into categories for API usage.

Smart Filter Configuration#

The behavior of Smart Filters can be changed in the fingerprint service (the name is a legacy term for Smart Filters). See the fingerprint.ini file for the available options.

GDFS Files#

When training a Smart Filter, Squirro compares the terms in the training documents to the expected term frequency in the given language. For example, if the English sentence “the annual results are here” is used, then the terms “the”, “are” and “here” should probably not be considered interesting terms for the Smart Filter. The language model behind this is called Global Document Frequencies (GDFS). Squirro comes with pre-built GDFS files for the supported languages.

In some use cases it may make sense to build the GDFS files based on the data seen in a specific project. This allows Squirro to normalize for the usage of industry terms or company-internal jargon. To create and use such custom GDFS files, Squirro provides the Global Document Frequency Tool. See that page for instructions on how to use this tool.

Bulk Scoring#

The score of any given document in Squirro against each of the Smart Filters can be calculated and exported using the bulk scoring command-line tool, then used for further analysis in third-party tools such as Business Intelligence solutions.

The Squirro Bulk Scoring tool is provided with the Squirro Toolbox. It generates a CSV file with a cross-product of each document in the index and the score it has with each Smart Filter.
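The exact column layout of the exported CSV depends on the tool version; assuming a cross-product export with document_id, smartfilter and score columns (an assumption for illustration, not the documented format), the file can be pivoted for downstream analysis like this:

```python
import csv
import io
from collections import defaultdict

# Hypothetical export shape; the real Bulk Scoring tool's column
# names may differ.
sample = """document_id,smartfilter,score
doc-1,Mergers,0.82
doc-1,Litigation,0.10
doc-2,Mergers,0.45
"""

def pivot_scores(csv_text):
    # Turn the (document, Smart Filter, score) cross-product rows
    # into one score dictionary per document.
    table = defaultdict(dict)
    for row in csv.DictReader(io.StringIO(csv_text)):
        table[row["document_id"]][row["smartfilter"]] = float(row["score"])
    return dict(table)

scores = pivot_scores(sample)
# scores["doc-1"] -> {"Mergers": 0.82, "Litigation": 0.1}
```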

Languages#

A Smart Filter can be trained with documents of multiple languages. Squirro detects the language of each document and will create a cluster of the top entities for each language.

During a query, the entities for each language will be used to filter only documents from the corresponding language.

Out of the box, Squirro Smart Filters support the following languages:

  • Chinese

  • Dutch

  • English

  • French

  • German

  • Italian

  • Portuguese

  • Russian

  • Spanish

However, the Smart Filter concept works for almost any language. Squirro needs to be trained once to understand the word frequencies of a new language. This is done by creating a GDFS database; the following languages are supported out of the box for this process (in addition to the ones already listed above):

  • Arabic

  • Armenian

  • Basque

  • Bengali

  • Bulgarian

  • Catalan

  • Czech

  • Finnish

  • Galician

  • Hindi

  • Hungarian

  • Indonesian

  • Irish

  • Latvian

  • Lithuanian

  • Norwegian

  • Romanian

  • Sorani

  • Swedish

  • Turkish

Language Codes#

The language is the two-letter ISO 639-1 language code, as shown in the following table:

| Language   | Syntax |
| ---------- | ------ |
| English    | en     |
| German     | de     |
| Italian    | it     |
| French     | fr     |
| Spanish    | es     |
| Russian    | ru     |
| Portuguese | pt     |
| Chinese    | zh     |

Global Document Frequency Tool#

Warning

This tool is deprecated and is no longer actively supported.

Intro#

For Squirro Smart Filters to produce accurate results, a list of the term frequencies across all documents in your indexes has to be maintained.

This is called the Global Document Frequency (GDF), also referred to as GDFS.

Squirro offers a good starting GDFS set for all supported languages.

If you’re indexing generic news items, the starting GDFS will yield great results.

If you are indexing very specific content, however, it is highly recommended to recalculate the GDFS frequently.

Download#

| Squirro Version | Download                      |
| --------------- | ----------------------------- |
| Squirro 2.4.3   | squirro_gdfs_util_2.4.3-1.zip |
| Squirro 2.5.3   | squirro_gdfs_util_2.5.3.zip   |

Usage Example#

The utility is configured through an INI file.

It needs to run on a Squirro storage node (this is where the Elasticsearch process is running).

Tip: If you have multiple storage nodes, you only need to run it on one.

Here is an example:

#location of the es data folder
elasticsearch_data_folder = /var/lib/elasticsearch

#space separated list of indexes, or all
indexes = all

#where the data will be saved
target_folder = /tmp

#how many files per language should be created
files_per_language = 8

#should numbers and floats be removed?
remove_numbers = true

#terms appearing in fewer documents than this are deleted from the gdfs list
frequency_lower_limit = 10

#languages to extract
languages = en

Once this is set up, invoke the utility like so:

./create_gdfs.py

Sample Output#

2016-09-14 09:08:17,476 create_gdfs.py[22561] INFO     Starting process (version 2.4.3).
2016-09-14 09:08:17,476 create_gdfs.py[22561] INFO     Looking for shards in '/apps/squirro/elasticsearch/'
2016-09-14 09:08:17,476 create_gdfs.py[22561] INFO     Blacklisted: ['squirro_v7_fp', 'squirro_v7_filter']
2016-09-14 09:08:17,495 create_gdfs.py[22561] INFO     Using all indexes to create gfds files
2016-09-14 09:08:17,495 create_gdfs.py[22561] INFO     Using these shard folders:
2016-09-14 09:08:17,495 create_gdfs.py[22561] INFO     -> /apps/squirro/elasticsearch/data/squirro-prod2/nodes/0/indices/squirro_v7/0/index
2016-09-14 09:08:17,495 create_gdfs.py[22561] INFO     -> /apps/squirro/elasticsearch/data/squirro-prod2/nodes/0/indices/squirro_v7_6c3apqtdt1eg64djsbi3nw/2/index
2016-09-14 09:08:17,495 create_gdfs.py[22561] INFO     -> /apps/squirro/elasticsearch/data/squirro-prod2/nodes/0/indices/squirro_v7_6c3apqtdt1eg64djsbi3nw/5/index
2016-09-14 09:08:17,496 create_gdfs.py[22561] INFO     -> /apps/squirro/elasticsearch/data/squirro-prod2/nodes/0/indices/squirro_v7_df-wuwgpqu-cnbxope4gca/0/index
2016-09-14 09:08:17,496 create_gdfs.py[22561] INFO     -> /apps/squirro/elasticsearch/data/squirro-prod2/nodes/0/indices/squirro_v7_df-wuwgpqu-cnbxope4gca/1/index
2016-09-14 09:08:17,496 create_gdfs.py[22561] INFO     -> /apps/squirro/elasticsearch/data/squirro-prod2/nodes/0/indices/squirro_v7_df-wuwgpqu-cnbxope4gca/2/index
2016-09-14 09:08:17,496 create_gdfs.py[22561] INFO     -> /apps/squirro/elasticsearch/data/squirro-prod2/nodes/0/indices/squirro_v7_df-wuwgpqu-cnbxope4gca/3/index
2016-09-14 09:08:17,496 create_gdfs.py[22561] INFO     -> /apps/squirro/elasticsearch/data/squirro-prod2/nodes/0/indices/squirro_v7_df-wuwgpqu-cnbxope4gca/4/index
2016-09-14 09:08:17,496 create_gdfs.py[22561] INFO     -> /apps/squirro/elasticsearch/data/squirro-prod2/nodes/0/indices/squirro_v7_df-wuwgpqu-cnbxope4gca/5/index
2016-09-14 09:08:17,496 create_gdfs.py[22561] INFO     -> /apps/squirro/elasticsearch/data/squirro-prod2/nodes/0/indices/squirro_v7_df-wuwgpqu-cnbxope4gcabla/2/index
...
2016-09-14 09:08:17,496 create_gdfs.py[22561] INFO     Using these languages:
2016-09-14 09:08:17,497 create_gdfs.py[22561] INFO     -> en
2016-09-14 09:08:17,497 create_gdfs.py[22561] INFO     Executing: java -jar global_dfs.jar /apps/squirro/elasticsearch/data/squirro-prod2/nodes/0/indices/squirro_v7/0/index en -o /tmp/en > /dev/null
2016-09-14 09:08:18,042 create_gdfs.py[22561] INFO     Processed 42696 documents
2016-09-14 09:08:18,042 create_gdfs.py[22561] INFO     Found 1046 terms
2016-09-14 09:08:18,042 create_gdfs.py[22561] INFO     Executing: java -jar global_dfs.jar /apps/squirro/elasticsearch/data/squirro-prod2/nodes/0/indices/squirro_v7_6c3apqtdt1eg64djsbi3nw/2/index en -o /tmp/en > /dev/null
2016-09-14 09:08:18,680 create_gdfs.py[22561] INFO     Processed 42696 documents
2016-09-14 09:08:18,680 create_gdfs.py[22561] INFO     Found 1046 terms
2016-09-14 09:08:18,681 create_gdfs.py[22561] INFO     Executing: java -jar global_dfs.jar /apps/squirro/elasticsearch/data/squirro-prod2/nodes/0/indices/squirro_v7_6c3apqtdt1eg64djsbi3nw/5/index en -o /tmp/en > /dev/null
...
2016-09-14 09:08:39,535 create_gdfs.py[22561] INFO     Processed 1769 documents
2016-09-14 09:08:39,536 create_gdfs.py[22561] INFO     Found 19733 terms
2016-09-14 09:08:39,549 create_gdfs.py[22561] INFO     Found 168327 unfiltered terms
2016-09-14 09:08:39,597 create_gdfs.py[22561] INFO     8368 terms left after filtering
2016-09-14 09:08:39,597 create_gdfs.py[22561] INFO      - 155060 got removed due to too low frequenzy
2016-09-14 09:08:39,597 create_gdfs.py[22561] INFO      - 4899 got removed due to being numbers
2016-09-14 09:08:39,597 create_gdfs.py[22561] INFO     Creating 8 files per language
2016-09-14 09:08:39,597 create_gdfs.py[22561] INFO     Up to 1046 terms per file
2016-09-14 09:08:39,599 create_gdfs.py[22561] INFO     Writing 1046 terms into /tmp/en0.json
2016-09-14 09:08:39,601 create_gdfs.py[22561] INFO     Writing 1046 terms into /tmp/en1.json
2016-09-14 09:08:39,604 create_gdfs.py[22561] INFO     Writing 1046 terms into /tmp/en2.json
2016-09-14 09:08:39,607 create_gdfs.py[22561] INFO     Writing 1046 terms into /tmp/en3.json
2016-09-14 09:08:39,610 create_gdfs.py[22561] INFO     Writing 1046 terms into /tmp/en4.json
2016-09-14 09:08:39,612 create_gdfs.py[22561] INFO     Writing 1046 terms into /tmp/en5.json
2016-09-14 09:08:39,615 create_gdfs.py[22561] INFO     Writing 1046 terms into /tmp/en6.json
2016-09-14 09:08:39,618 create_gdfs.py[22561] INFO     Writing 1046 terms into /tmp/en7.json
2016-09-14 09:08:39,620 create_gdfs.py[22561] INFO     All done!

Update the SmartFilter (aka Fingerprint) Service#

The final step is to update the fingerprint service on all Squirro cluster nodes.

The default location for the GDFS files is:

/var/lib/squirro/fingerprint/gdfs

Always back up the existing files before overwriting them.

Once you’ve updated the files, you need to restart the fingerprint service:

service sqfingerprintd restart