Document Relevancy#

Profiles: Project Creator, Search Engineer

This page explains Document Relevancy ranking within Squirro Cognitive Search.

If you’re looking for step-by-step guides to refine relevancy in different ways, see the how-to guides linked in the sections below.

Search engineers can tune the relevancy ranking of documents in Squirro Cognitive Search and project creators can implement those changes in the Squirro UI.

Overview#

The baseline full-text search provides a good document relevancy score (BM25) out of the box.
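To make the baseline concrete, the following is a minimal sketch of the textbook Okapi BM25 formula with a Lucene-style IDF. It is not Squirro's implementation: the function name `bm25_score`, the toy corpus, and the parameter values `k1=1.2` and `b=0.75` are assumptions chosen for illustration.

```python
import math
from collections import Counter


def bm25_score(query_terms, doc_tokens, corpus, k1=1.2, b=0.75):
    """Score one tokenized document against the query using Okapi BM25."""
    n_docs = len(corpus)
    avg_len = sum(len(d) for d in corpus) / n_docs
    tf = Counter(doc_tokens)
    score = 0.0
    for term in query_terms:
        # Document frequency: number of documents containing the term.
        df = sum(1 for d in corpus if term in d)
        # Lucene-style IDF, which is never negative.
        idf = math.log(1 + (n_docs - df + 0.5) / (df + 0.5))
        # Term frequency saturates (k1) and is length-normalized (b).
        norm = tf[term] + k1 * (1 - b + b * len(doc_tokens) / avg_len)
        score += idf * tf[term] * (k1 + 1) / norm
    return score


# Toy corpus (illustrative only).
corpus = [
    "global warming affects the global climate".split(),
    "the office printer is warming up".split(),
    "quarterly sales report".split(),
]
scores = [bm25_score(["global", "warming"], doc, corpus) for doc in corpus]
```

The first document matches both terms and scores highest, the second matches only one term, and the third matches none. Note that BM25 only looks at term statistics; it knows nothing about the user, the data source, or the business context.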

With different domains, users, and preferences, however, it is important to consider the overall context and tune the search scoring so that it returns the information the user is actually looking for. Typical tuning goals include:

  • Personalization of search results

    • by promoting information from a specific datasource to certain users (e.g. department specific)

    • by promoting previously updated or visited documents (e.g. last read item, popular among other users)

  • Helping non-expert users find the relevant information, even if they do not know the technical domain language

    • by performing query expansion with domain specific synonyms

    • by finding semantically similar documents without the requirement of matching technical keywords

Relevancy can be tuned in two areas: at data loading time and at query time.

../../../_images/squirro_search_pipeline_components.png

Relevance Tuning: Document Features#

Enrich documents with relevant metadata (labels) at data loading time. Those labels can then be used to influence document relevancy scoring.

Source systems may already maintain important metadata, such as:

  • Who is the document author / contributor?

  • When was the document created? Any updates?

  • Is the document already classified into predefined categories / ontologies?

  • Is it an official, important FAQ document or an internal call note?

Usually, however, the majority of relevant insights are hidden in unstructured text, e.g. large PDFs, and first need to be uncovered. This can be done by adding enrichment components to the data loading pipeline:

  • AI Studio Model: Train and apply your own ML classifier, e.g. to detect overall sentiment or document category

  • Custom Pipelets: Write your own plugin to enrich documents during data loading
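As a rough illustration of the enrichment step, the sketch below derives a label from an item's title. The function name `enrich_item`, the label name `doc_type`, and the classification rule are hypothetical, and the item layout (a `keywords` dictionary mapping label names to lists of values) is an assumption. In an actual pipelet, this logic would sit in the method that processes each item.

```python
def enrich_item(item):
    """Add a hypothetical `doc_type` label derived from the item title."""
    title = item.get("title", "").lower()
    # Naive rule for illustration; a trained classifier could replace it.
    doc_type = "faq" if "faq" in title or title.endswith("?") else "note"
    # Assumed layout: labels live in a dict of label name -> list of values.
    keywords = item.setdefault("keywords", {})
    keywords.setdefault("doc_type", []).append(doc_type)
    return item


item = enrich_item({"title": "FAQ: How do I reset my password?", "body": "..."})
```

Once the label is attached at load time, it is available as a ranking signal at query time.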

Scoring Profiles#

Document metadata can be used as additional ranking signals to return the most relevant documents on top.

../../../_images/search-pipeline_scoring_profile_overview.drawio.png

For example, you might want to promote documents coming from a specific datasource - like internal FAQ material - over generic office documents. This can be achieved by defining a Scoring Profile that promotes documents tagged with a Label source:faq.

See How to Use Scoring Profiles to Customize Document Relevancy Scoring
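Conceptually, the FAQ example above combines the text match score with a label-based boost. The following sketch shows that idea in plain Python; it does not reflect the actual Scoring Profile configuration format, and the function name `apply_label_boost`, the hit structure, and the boost factor of 2.0 are assumptions.

```python
def apply_label_boost(hits, label, value, boost=2.0):
    """Multiply the text score of hits carrying label:value, then re-sort."""
    rescored = []
    for hit in hits:
        has_label = value in hit["keywords"].get(label, [])
        factor = boost if has_label else 1.0
        rescored.append({**hit, "score": hit["score"] * factor})
    return sorted(rescored, key=lambda h: h["score"], reverse=True)


# Illustrative hits: the office document has the better text match.
hits = [
    {"id": "office-doc", "score": 3.0, "keywords": {"source": ["sharepoint"]}},
    {"id": "faq-entry", "score": 2.0, "keywords": {"source": ["faq"]}},
]
ranked = apply_label_boost(hits, "source", "faq")
```

After boosting, the FAQ entry outranks the office document even though its raw text score was lower.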

Scoring Plugins#

Scoring plugins are specific implementations of scoring profiles used to inject custom business logic.

Each plugin is configured with a set of parameters that can be used to customize the plugin’s behavior.

Tip

You can write and upload your own Scoring Plugins using the PluginProfile.

Reference: To learn more, see Scoring Plugins.

Searchable Labels#

Apply full-text search to additional textual metadata to refine the text match relevancy score.

See How To Use Best-Bets Labels to Map Query Terms

Relevance Tuning: Query Enrichment#

Note

Analyzing the query to understand what a user is actually looking for is as important as analyzing documents during data loading.

Query Understanding & Rewriting#

Cognitive Search offers a customizable query processing workflow that parses, enriches, and expands queries before performing the actual search. The built-in features include language detection, named entity recognition, and semantic synonym expansion.

For more information see Query Processing.

../../../_images/QueryProcessing.default_workflow.drawio.png
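To illustrate one of these steps, here is a minimal sketch of synonym-based query expansion. It is not the built-in workflow step: the synonym table, the function name `expand_query`, and the `OR` query syntax in the output are assumptions for illustration.

```python
import re

# Hypothetical domain synonym table.
SYNONYMS = {
    "heart attack": ["myocardial infarction"],
    "car": ["automobile", "vehicle"],
}


def expand_query(query, synonyms=SYNONYMS):
    """Return the query plus OR-ed synonyms for each known phrase it contains."""
    lowered = query.lower()
    expansions = []
    for phrase, alternatives in synonyms.items():
        # Match whole words only, so "car" does not fire on "scar".
        if re.search(rf"\b{re.escape(phrase)}\b", lowered):
            expansions.extend(alternatives)
    if not expansions:
        return query
    quoted = " OR ".join(f'"{alt}"' for alt in expansions)
    return f"({query}) OR {quoted}"


expanded = expand_query("heart attack symptoms")
```

A user who types the everyday phrase can then also retrieve documents that only use the technical term.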

Query Term Matching Strategy#

Tune document relevancy scoring by configuring how the user’s Search Query Terms are matched.

Example: Multi-Word Matching

When a user searches for a simple keyword query like global warming, they expect the highest-ranked documents to contain both terms, global and warming, in close proximity, e.g. within the same sentence. Documents that contain global on page 1 but warming on page 10 are unlikely to be relevant.

  • Promote documents that match the query terms in close proximity. This can be done using a rescore function that applies loose phrase matching to the top N ranked documents. The more expensive scoring is applied only to the top N to keep the response time low.

See How To Handle User Query Terms Correctly
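The rescoring idea can be sketched as follows. This is an illustrative stand-in rather than the product's rescore function: the names `min_span` and `rescore_top_n`, the whitespace tokenization, and the `slop` and `boost` values are all assumptions.

```python
def min_span(tokens, terms):
    """Smallest token window containing all terms, or None if one is missing."""
    last_seen = {}
    best = None
    for i, token in enumerate(tokens):
        if token not in terms:
            continue
        last_seen[token] = i
        if len(last_seen) == len(terms):
            # Window ends at i and starts at the oldest last-seen term.
            span = i - min(last_seen.values()) + 1
            best = span if best is None else min(best, span)
    return best


def rescore_top_n(hits, query, top_n=10, slop=5, boost=2.0):
    """Boost top-N hits whose query terms occur within a loose phrase window."""
    terms = set(query.lower().split())
    head, tail = hits[:top_n], hits[top_n:]
    rescored = []
    for hit in head:
        span = min_span(hit["text"].lower().split(), terms)
        close = span is not None and span <= len(terms) + slop
        rescored.append({**hit, "score": hit["score"] * (boost if close else 1.0)})
    rescored.sort(key=lambda h: h["score"], reverse=True)
    # Hits beyond the window keep their original order and scores.
    return rescored + tail


hits = [
    {"id": "far", "score": 2.0,
     "text": "global markets rallied " + "filler " * 50 + "warming trend"},
    {"id": "near", "score": 1.5, "text": "effects of global warming on crops"},
]
ranked = rescore_top_n(hits, "global warming")
```

Only the first `top_n` hits pay the cost of the positional check, which is why this kind of scoring is applied as a second pass rather than to every matching document.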

Changelog#

Squirro 3.6.1: Initial Release of Scoring Profiles