Profiles: Project Creator, Search User
Smart Filters make it possible to search for concepts, not just keywords.
They are configured by the project creator as outlined below.
Smart Filters are trained with text documents. Training documents can be paragraphs of text or entire documents. Various formats from plain text to PDF and Microsoft Word are supported.
The Smart Filter algorithm looks at the frequency of terms in the training documents and correlates it with the normal occurrence of these terms (see Global Document Frequencies, GDFS). If a word occurs more often in the training text than the GDFS definition predicts, it receives a higher score.
When using a Smart Filter, the index is searched using all the entities that the Smart Filter was trained with. Results get ranked higher based on the number of matching entities and the score of the matching entity.
The strictness of a Smart Filter is controlled by the noise level. Set to 1.0 (the highest noise level), all results that match at least one entity are returned. At a lower level (e.g. 0.2), the matching results are ordered by relevance and low-relevance matches are eliminated from the result set.
Note: The result elimination is not linear; Noise level 0.1 is much stricter than 0.2 for example - this is easy to inspect by adjusting the Noise level in the Squirro UI.
When matching documents against a Smart Filter, the more closely a document matches the concept, the higher its score.
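The scoring and noise-level behavior described above can be sketched as follows. This is a deliberately simplified model, not Squirro's actual implementation: the entity weights and the mapping from noise level to cutoff are made-up assumptions chosen only to illustrate the principle (including the non-linear strictness).

```python
# Simplified illustration of Smart Filter matching and noise-level filtering.
# NOT Squirro's actual scoring code; weights and cutoff formula are invented.

def match_score(doc_terms, entities):
    """Sum the weights of the Smart Filter entities found in a document."""
    return sum(weight for term, weight in entities.items() if term in doc_terms)

def filter_results(docs, entities, noise_level):
    """Keep documents whose score clears a noise-dependent cutoff."""
    scored = [(doc, match_score(set(doc.split()), entities)) for doc in docs]
    # At least one entity must match.
    scored = [(d, s) for d, s in scored if s > 0]
    if not scored:
        return []
    best = max(s for _, s in scored)
    # Assumption: a lower noise level raises the cutoff non-linearly.
    cutoff = best * (1.0 - noise_level) ** 2
    return [d for d, s in sorted(scored, key=lambda x: -x[1]) if s >= cutoff]

entities = {"merger": 2.0, "acquisition": 2.0, "synergy": 1.0}
docs = ["merger and acquisition synergy", "synergy workshop", "quarterly report"]
print(filter_results(docs, entities, 1.0))  # every doc with a matching entity
print(filter_results(docs, entities, 0.2))  # only high-relevance matches remain
```

At noise level 1.0 both matching documents are returned; at 0.2 only the document that matches most entities survives the cutoff.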
Finding the Right Noise Level#
Which noise level is right for a given Smart Filter and use case depends on the requirements of precision vs. recall. Is it more important that the returned results are all relevant or is it preferable to have everything potentially relevant included?
As the default sorting is by relevance, it will often not be readily apparent how a different noise level changes the result set. To understand the relationship between the result set and the noise level in a given Smart Filter, sort the results by date. Then it will be much easier to judge the result relevance.
Setting the Noise Level#
The noise level can be set in three ways:
Use the noise level slider in the Smart Filter drop-down
Use the same slider in the Smart Filter edit view header
Change it directly in the query (see the Query Syntax reference)
Multiple Smart Filters#
Multiple Smart Filters can be combined in a single query, in which case they are combined with an AND operator: only results that match all of the filters are returned. Other boolean operators are not supported for Smart Filters, so OR and NOT will not work.
To work around this, search tagging can be used with Smart Filters. For this, assign keywords by creating a search tagging rule for each Smart Filter. Those keywords then support boolean search as described in the Query Syntax.
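As a sketch of this workaround, a boolean query can then be built over the assigned keywords. The keyword name `topic` and its values below are hypothetical; consult the Query Syntax documentation for the exact field syntax in your deployment.

```python
# Sketch only: "topic" and its values are hypothetical keywords assigned by
# search-tagging rules backed by Smart Filters.
def or_query(field, values):
    """Combine keyword values with OR (not possible with Smart Filters directly)."""
    return " OR ".join('{}:"{}"'.format(field, v) for v in values)

q = or_query("topic", ["energy", "mining"])
print(q)  # topic:"energy" OR topic:"mining"
```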
Explain mode can be enabled in the Smart Filter dropdown.
With explain mode enabled each matching document contains some details on why the document matched the Smart Filter.
At maximum, the most recent 10,000 results are returned as the result of a Smart Filter query. This is due to performance considerations, because the noise level calculations need to be done for every matching document.
To reliably return more than 10,000 results in a project, combine Smart Filters with search tagging rules.
When editing a Smart Filter in the user interface, it is not written to the system unless the Save button is pressed. This way it is possible to experiment with adding training, excluding tokens, etc. without fear of messing up the Smart Filter for the users.
When training Smart Filters from documents, terms that do not appear often (or at all) in the language model gain a lot of importance. This often happens with names (people, companies, places) or misspelled words. There are a few strategies to improve the situation when that happens:
Increase the min_feature_count configuration setting (see fingerprint.ini). This raises how often a term needs to appear in the training documents to be considered. The default value of min_feature_count is 1, which means a term is potentially included in the Smart Filter even if it appears only once across all training documents. This affects especially terms that are not present in the language model and are therefore calculated to have a high weight, such as names or spelling mistakes.
Exclude irrelevant terms. Note that this may simply promote the next worst term, so reducing the number of entities should also be considered.
Remove irrelevant content, such as headers or footers, from the training documents. This can often be achieved with enrichments, e.g. Pipelets.
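The first strategy could look like the fragment below. The setting name min_feature_count comes from this page; the section name is an assumption, so check the shipped fingerprint.ini for the actual file structure.

```ini
; Sketch of a fingerprint.ini entry (section name is an assumption).
[fingerprint]
; Require a term to appear at least 3 times across all training
; documents before it can become a Smart Filter entity (default: 1).
min_feature_count = 3
```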
When training a Smart Filter, items can be added as either positive or negative examples. Negative content trains the Smart Filter to detect content that should be excluded.
The considerations on improving relevance above apply even more when using negative training; otherwise you quickly end up with very nonsensical concepts.
Max Number of Entities#
By default 30 terms are extracted from the training content. This can be modified in the advanced screen.
When a Smart Filter doesn’t have a lot of relevant terms, then excluding terms will simply promote the next worst term. In those cases (when the concept is smaller than the default) it makes sense to reduce the number of terms.
In other cases where the top 30 terms are all highly relevant, it makes sense to increase the number and see if there are more relevant terms to be displayed.
Manual Smart Filters#
The taxonomy for Manual Smart Filters uses a CSV format. The syntax for each entry is:
query, weight, language, label
query: the searched term. This is the only mandatory value.
weight: the importance of the entry amongst all the other terms. The default is 1.
language: the language for this entry. If left empty, this uses the default language - which can be defined in the taxonomy screen as well.
label: the title of this entry, as displayed to the user by the system. If left empty, the query is displayed.
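Putting the four fields together, a manual Smart Filter taxonomy could look like the hypothetical entries below. The third entry leaves language and label empty, so the default language is used and the query itself is displayed.

```csv
"merger",2,en,"Mergers"
"acquisition",1,en,"Acquisitions"
"takeover",1,,
```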
The query must be stemmed based on the rules of the Squirro index.
For example, the following manual Smart Filter does not return any results:
"annual returns",1,en,"annual returns"
Instead, this has to be converted to this entry:
"annual return",1,en,"annual returns"
Having to do this manually is quite cumbersome, and a future iteration of Smart Filters will handle it automatically. In the meantime, the stem_fingerprint.py script can be used. Download the script to the Squirro server and execute it on the command line as follows:
python stem_fingerprint.py fingerprint.csv
This assumes that the fingerprint file has been downloaded from the web interface and is stored in the "fingerprint.csv" file. The script outputs the stemmed queries, and the result can then be pasted into the taxonomy window.
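The transformation the script performs can be sketched as below. This is not the actual stem_fingerprint.py (which stems according to the rules of the Squirro index); the crude plural-stripping stemmer here is a stand-in for illustration only.

```python
# Sketch of stemming the query column of a manual Smart Filter CSV.
# NOT the real stem_fingerprint.py; naive_stem is a crude stand-in.
import csv
import io

def naive_stem(word):
    """Very crude English stemmer for illustration: strip a plural 's'."""
    return word[:-1] if word.endswith("s") and len(word) > 3 else word

def stem_queries(csv_text):
    out = io.StringIO()
    writer = csv.writer(out)
    for query, weight, lang, label in csv.reader(io.StringIO(csv_text)):
        # Only the query is stemmed; weight, language and label pass through.
        stemmed = " ".join(naive_stem(t) for t in query.split())
        writer.writerow([stemmed, weight, lang, label])
    return out.getvalue()

print(stem_queries('"annual returns",1,en,"annual returns"'))
# annual return,1,en,annual returns
```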
Smart Filters can be locked in the advanced screen. At that point no changes can be made to the Smart Filter without unlocking it first. Additionally, only project administrators can change the lock status of a Smart Filter.
This is a good way of ensuring that Smart Filters that are important to a project can’t be changed without thinking about it. Consider locking Smart Filters that are used for dashboards or search taggings.
Smart Filter Configuration#
The behavior of Smart Filters can be changed in the fingerprint service (the name is a legacy term for Smart Filters). See the fingerprint.ini file for the available options.
When training a Smart Filter, Squirro compares the terms in the training documents to the expected term frequency in the given language. For example if the English sentence “the annual results are here” is used, then the terms “the”, “are” and “here” should probably not be considered to be interesting terms for the Smart Filter. The language model behind this is called Global Document Frequencies (GDFS). Squirro comes with pre-built GDFS files for the supported languages.
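A toy version of this weighting, using the example sentence above, might look like the following. The formula and all frequency numbers are invented for illustration and are not Squirro's actual GDFS model; note how a term missing from the model (the floor value) would get an extreme weight, which is the names-and-typos problem described earlier.

```python
# Toy illustration of GDFS-based weighting (not Squirro's actual formula).
from collections import Counter

# Expected relative document frequency per term (made-up "GDFS" for English).
gdfs = {"the": 0.9, "are": 0.6, "here": 0.3, "annual": 0.05, "results": 0.04}

def term_weights(training_terms):
    """Weight = observed frequency / expected frequency (unseen terms get a floor)."""
    counts = Counter(training_terms)
    total = sum(counts.values())
    return {t: (c / total) / gdfs.get(t, 0.001) for t, c in counts.items()}

weights = term_weights("the annual results are here".split())
# "annual" and "results" far outweigh "the", "are" and "here".
top = max(weights, key=weights.get)
print(top)  # results
```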
In some use cases it may make sense to build the GDFS files based on the data seen in a specific project. This allows Squirro to normalize for the usage of industry terms or company-internal jargon. To create and use such custom GDFS files, Squirro provides the Global Document Frequency Tool. See that page for instructions on how to use this tool.
The score of any given document in Squirro against any of the Smart Filters can be calculated and exported using the bulk scoring command-line tool, and used for further analysis in third-party tools such as Business Intelligence solutions.
The Squirro Bulk Scoring tool is provided with the Squirro Toolbox. It generates a CSV file with a cross-product of each document in the index and the score it has with each Smart Filter.
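The shape of such a cross-product export can be sketched as below. The column names, identifiers and scores are invented for illustration; the real tool's CSV layout may differ.

```python
# Sketch of a document x Smart Filter cross-product CSV (invented layout).
import csv
import io
import itertools

docs = ["doc-1", "doc-2"]
smartfilters = ["Energy", "Healthcare"]
scores = {("doc-1", "Energy"): 0.82, ("doc-1", "Healthcare"): 0.10,
          ("doc-2", "Energy"): 0.05, ("doc-2", "Healthcare"): 0.67}

buf = io.StringIO()
writer = csv.writer(buf)
writer.writerow(["item_id", "smartfilter", "score"])
# One row per (document, Smart Filter) pair.
for doc, sf in itertools.product(docs, smartfilters):
    writer.writerow([doc, sf, scores[(doc, sf)]])
print(buf.getvalue())
```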
A Smart Filter can be trained with documents of multiple languages. Squirro detects the language of each document and will create a cluster of the top entities for each language.
During a query, the entities for each language will be used to filter only documents from the corresponding language.
Out of the box, Squirro Smart Filters support the following languages:
However, the Smart Filter concept works for almost any language. Squirro only needs to be trained once to understand the word frequencies of a new language. This is done by creating a GDFS database; the following languages are supported out of the box for this process (in addition to the ones already listed above):
The language is the two-letter language code, as seen in the following table:
Global Document Frequency Tool#
This tool is deprecated and is no longer actively supported.
For Squirro Smart Filters to produce accurate results, a list of the term frequencies across all documents in your indexes has to be maintained. This is called the Global Document Frequency (GDF or GDFS).
Squirro offers a good starting GDF set for all supported languages. If you are indexing generic news items, the starting GDF will yield great results. If you are however indexing very specific content, it is highly recommended to recalculate the GDFS frequently.
The utility is configured through an INI file.
It needs to run on a Squirro storage node (the node where the Elasticsearch process is running).
Tip: if you have multiple storage nodes, you only need to run it on one.
Here is an example:
#location of the es data folder
elasticsearch_data_folder = /var/lib/elasticsearch
#space separated list of indexes, or all
indexes = all
#where the data will be saved
target_folder = /tmp
#how many files per language should be created
files_per_language = 8
#should numbers and floats be removed?
remove_numbers = true
#terms with less than this amount of documents will be deleted from the gdfs list
frequency_lower_limit = 10
#languages to extract
languages = en
Once this is set up, invoke the utility like so:
2016-09-14 09:08:17,476 create_gdfs.py INFO Starting process (version 2.4.3).
2016-09-14 09:08:17,476 create_gdfs.py INFO Looking for shards in '/apps/squirro/elasticsearch/'
2016-09-14 09:08:17,476 create_gdfs.py INFO Blacklisted: ['squirro_v7_fp', 'squirro_v7_filter']
2016-09-14 09:08:17,495 create_gdfs.py INFO Using all indexes to create gfds files
2016-09-14 09:08:17,495 create_gdfs.py INFO Using these shard folders:
2016-09-14 09:08:17,495 create_gdfs.py INFO -> /apps/squirro/elasticsearch/data/squirro-prod2/nodes/0/indices/squirro_v7/0/index
2016-09-14 09:08:17,495 create_gdfs.py INFO -> /apps/squirro/elasticsearch/data/squirro-prod2/nodes/0/indices/squirro_v7_6c3apqtdt1eg64djsbi3nw/2/index
2016-09-14 09:08:17,495 create_gdfs.py INFO -> /apps/squirro/elasticsearch/data/squirro-prod2/nodes/0/indices/squirro_v7_6c3apqtdt1eg64djsbi3nw/5/index
2016-09-14 09:08:17,496 create_gdfs.py INFO -> /apps/squirro/elasticsearch/data/squirro-prod2/nodes/0/indices/squirro_v7_df-wuwgpqu-cnbxope4gca/0/index
2016-09-14 09:08:17,496 create_gdfs.py INFO -> /apps/squirro/elasticsearch/data/squirro-prod2/nodes/0/indices/squirro_v7_df-wuwgpqu-cnbxope4gca/1/index
2016-09-14 09:08:17,496 create_gdfs.py INFO -> /apps/squirro/elasticsearch/data/squirro-prod2/nodes/0/indices/squirro_v7_df-wuwgpqu-cnbxope4gca/2/index
2016-09-14 09:08:17,496 create_gdfs.py INFO -> /apps/squirro/elasticsearch/data/squirro-prod2/nodes/0/indices/squirro_v7_df-wuwgpqu-cnbxope4gca/3/index
2016-09-14 09:08:17,496 create_gdfs.py INFO -> /apps/squirro/elasticsearch/data/squirro-prod2/nodes/0/indices/squirro_v7_df-wuwgpqu-cnbxope4gca/4/index
2016-09-14 09:08:17,496 create_gdfs.py INFO -> /apps/squirro/elasticsearch/data/squirro-prod2/nodes/0/indices/squirro_v7_df-wuwgpqu-cnbxope4gca/5/index
2016-09-14 09:08:17,496 create_gdfs.py INFO -> /apps/squirro/elasticsearch/data/squirro-prod2/nodes/0/indices/squirro_v7_df-wuwgpqu-cnbxope4gcabla/2/index
...
2016-09-14 09:08:17,496 create_gdfs.py INFO Using these languages:
2016-09-14 09:08:17,497 create_gdfs.py INFO -> en
2016-09-14 09:08:17,497 create_gdfs.py INFO Executing: java -jar global_dfs.jar /apps/squirro/elasticsearch/data/squirro-prod2/nodes/0/indices/squirro_v7/0/index en -o /tmp/en > /dev/null
2016-09-14 09:08:18,042 create_gdfs.py INFO Processed 42696 documents
2016-09-14 09:08:18,042 create_gdfs.py INFO Found 1046 terms
2016-09-14 09:08:18,042 create_gdfs.py INFO Executing: java -jar global_dfs.jar /apps/squirro/elasticsearch/data/squirro-prod2/nodes/0/indices/squirro_v7_6c3apqtdt1eg64djsbi3nw/2/index en -o /tmp/en > /dev/null
2016-09-14 09:08:18,680 create_gdfs.py INFO Processed 42696 documents
2016-09-14 09:08:18,680 create_gdfs.py INFO Found 1046 terms
2016-09-14 09:08:18,681 create_gdfs.py INFO Executing: java -jar global_dfs.jar /apps/squirro/elasticsearch/data/squirro-prod2/nodes/0/indices/squirro_v7_6c3apqtdt1eg64djsbi3nw/5/index en -o /tmp/en > /dev/null
...
2016-09-14 09:08:39,535 create_gdfs.py INFO Processed 1769 documents
2016-09-14 09:08:39,536 create_gdfs.py INFO Found 19733 terms
2016-09-14 09:08:39,549 create_gdfs.py INFO Found 168327 unfiltered terms
2016-09-14 09:08:39,597 create_gdfs.py INFO 8368 terms left after filtering
2016-09-14 09:08:39,597 create_gdfs.py INFO - 155060 got removed due to too low frequenzy
2016-09-14 09:08:39,597 create_gdfs.py INFO - 4899 got removed due to being numbers
2016-09-14 09:08:39,597 create_gdfs.py INFO Creating 8 files per language
2016-09-14 09:08:39,597 create_gdfs.py INFO Up to 1046 terms per file
2016-09-14 09:08:39,599 create_gdfs.py INFO Writing 1046 terms into /tmp/en0.json
2016-09-14 09:08:39,601 create_gdfs.py INFO Writing 1046 terms into /tmp/en1.json
2016-09-14 09:08:39,604 create_gdfs.py INFO Writing 1046 terms into /tmp/en2.json
2016-09-14 09:08:39,607 create_gdfs.py INFO Writing 1046 terms into /tmp/en3.json
2016-09-14 09:08:39,610 create_gdfs.py INFO Writing 1046 terms into /tmp/en4.json
2016-09-14 09:08:39,612 create_gdfs.py INFO Writing 1046 terms into /tmp/en5.json
2016-09-14 09:08:39,615 create_gdfs.py INFO Writing 1046 terms into /tmp/en6.json
2016-09-14 09:08:39,618 create_gdfs.py INFO Writing 1046 terms into /tmp/en7.json
2016-09-14 09:08:39,620 create_gdfs.py INFO All done!
Update the SmartFilter (aka Fingerprint) Service#
Always back up the existing files before overwriting them.
Once you’ve updated the files, you need to restart the fingerprint service:
service sqfingerprintd restart