Other People Ask#

Profiles: Project Creator, Search User

This page provides an overview of the Other People Ask feature in Squirro.

Project creators can configure the feature and choose where it’s displayed. Search users can use the feature to discover new search suggestions.

Reference: For project creator configuration instructions, see How to Use The Other People Ask Feature.

Overview#

Other People Ask is a feature that presents search suggestions to users via a selectable widget.

In the example screenshot below, the widget is accessed by clicking the person icon (highlighted in orange when in use), which opens a list of relevant search suggestions based on similar queries by other project users:

Other People Ask

When the user searches for water arsenic and clicks the Other People Ask icon, the widget opens and suggests queries such as arsenic in drinking water and how much arsenic in water is dangerous, both potentially useful refinements of the original search.

Backend Operations#

When a user enters search terms in Squirro’s search bar, these search terms are logged and stored via Activity Tracking.

The platform then continuously analyzes these logs and extracts popular search queries, including search terms that are frequently used.

The search bar offers a typeahead feature that suggests search terms based on popular queries and other sources, such as facets, saved searches, and the user’s search history. The common factor among all these suggestions is that they are based either on completing the already entered search terms (i.e. prefix search) or on fuzzy-matching similar combinations of characters.

Other People Ask approaches the problem of finding interesting suggestions from a different angle. Instead of looking only at superficial, character-level similarity between words and sentences (lexical matches), it seeks to understand the content of the user-typed query and suggests popular queries that are semantically similar.

How Does It Help?#

Other People Ask can help a user to find relevant documents by suggesting other search terms that are semantically close to the entered search terms. In comparison to a simple synonym expansion or graph-based approach, this technique can suggest entire phrases of terms or sentences that have a similar meaning but different wording.

Using such suggestions can help find documents that would otherwise not be found.

How Does It Work?#

Finding Similar Documents#

To find semantically similar terms or phrases for a user’s search terms, we leverage the NLP technique of document embedding.

A document embedding is a representation of a document in a multi-dimensional vector space, where a document is a natural language text formed of constituent words and sentences. For the interested reader, many resources on the Internet give a more in-depth background on document embeddings or on word embeddings, which are conceptually similar but pre-date document embeddings.

Document embeddings make it possible to formulate questions quantitatively, using mathematical concepts, for example:

How similar are documents \(A\) and \(B\)?

To get a notion of similarity between two documents \(A\) and \(B\), we apply the cosine similarity. Cosine similarity assigns a value between -1 and 1 to the two embeddings \(\vec{v}_A\) and \(\vec{v}_B\) of documents \(A\) and \(B\),

\[\mathrm{sim}(A, B) = \cos \sphericalangle(\vec{v}_A, \vec{v}_B) := \frac{\vec{v}_A \cdot \vec{v}_B}{|\vec{v}_A| |\vec{v}_B|},\]

where a \(\mathrm{sim}(A, B)\) value of 1 means that the two documents \(A\) and \(B\) are similar, i.e., they are aligned in parallel in the embedding space. A value of 0 means that the two documents are dissimilar, i.e., they are aligned orthogonally with respect to each other in the vector space. A value of -1 means that the two documents are antagonists, i.e., they are aligned in anti-parallel orientation.
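As a minimal sketch of the formula above, the following Python snippet computes the cosine similarity of two embedding vectors with NumPy. The toy two-dimensional vectors are illustrative only; real document embeddings have hundreds of dimensions.

    import numpy as np

    def cosine_similarity(v_a: np.ndarray, v_b: np.ndarray) -> float:
        """Cosine of the angle between two embedding vectors, in [-1, 1]."""
        return float(np.dot(v_a, v_b) / (np.linalg.norm(v_a) * np.linalg.norm(v_b)))

    # Toy vectors for demonstration only.
    cosine_similarity(np.array([1.0, 2.0]), np.array([2.0, 4.0]))    # 1.0: parallel, i.e. similar
    cosine_similarity(np.array([1.0, 0.0]), np.array([0.0, 1.0]))    # 0.0: orthogonal, i.e. dissimilar
    cosine_similarity(np.array([1.0, 2.0]), np.array([-1.0, -2.0]))  # -1.0: anti-parallel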

We are now in a position to search for similar search queries that relate to a user’s query. To do this, we simply need to embed the user’s search terms and all the texts of the popular queries into our vector space. Technically, we realize this by using a pre-trained document embedding model trained on a very large set of pairs of input documents (sentences). To handle multiple languages with the same technique, we use a multilingual model.
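As a sketch of this step, the snippet below embeds the user’s query and a set of popular queries with a multilingual sentence-embedding model. The library (sentence-transformers) and the model name are illustrative assumptions, not necessarily what Squirro uses internally.

    from sentence_transformers import SentenceTransformer

    # Illustrative multilingual embedding model (assumption, not Squirro's actual model).
    model = SentenceTransformer("paraphrase-multilingual-MiniLM-L12-v2")

    user_query = "water arsenic"
    popular_queries = [
        "arsenic in drinking water",
        "how much arsenic in water is dangerous",
        "twitter largest shareholder",
    ]

    # Each text is mapped to a dense vector in the shared multilingual embedding space.
    query_embedding = model.encode(user_query)
    candidate_embeddings = model.encode(popular_queries)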

After doing this, we can use the user’s query as reference document \(R\) to find the nearest neighbours, i.e. the documents \(D_i\) that show the largest cosine similarity \(\mathrm{sim}(R, D_i)\) with respect to the reference document. We use Elasticsearch, the search engine that powers our search, to store, compare, and retrieve the document embeddings quickly and accurately.
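The sketch below, continuing from the embedding snippet above, indicates how such a nearest-neighbour lookup could be expressed with Elasticsearch’s dense-vector kNN search (Elasticsearch 8.x) via the official Python client. The index name, field names, and vector dimension are assumptions for illustration; the index layout Squirro uses internally may differ.

    from elasticsearch import Elasticsearch

    es = Elasticsearch("http://localhost:9200")  # placeholder connection details

    # Assumed index storing popular queries together with their embeddings,
    # ranked by cosine similarity (dims matches the model sketched above).
    es.indices.create(
        index="popular_queries",
        mappings={
            "properties": {
                "text": {"type": "keyword"},
                "embedding": {
                    "type": "dense_vector",
                    "dims": 384,
                    "index": True,
                    "similarity": "cosine",
                },
            }
        },
    )

    # Retrieve the candidates closest to the user's query embedding.
    response = es.search(
        index="popular_queries",
        knn={
            "field": "embedding",
            "query_vector": query_embedding.tolist(),
            "k": 5,                 # number of nearest neighbours to return
            "num_candidates": 100,  # candidate pool considered per shard
        },
        source=["text"],
    )
    similar_queries = [hit["_source"]["text"] for hit in response["hits"]["hits"]]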

Selecting Interesting Suggestions#

One issue with finding semantically similar queries is that many of the retrieved texts are indeed very similar, not only with respect to the original query, but also with respect to other queries from the popular query index, e.g.:

Query:

    "Who wants to become a shareholder at twitter?"

Similar Queries:

    "Who becomes a shareholder at twitter?"
    "Who is becoming shareholder at twitter?"
    "Who bought shares of twitter"
    "Who buys shares in twitter?"
    "who joined twitter board"
    "who joined twitter as what"
    "twitter largest shareholder"
    "twitter stakeholders"
    "twitter largest shareholder"
    "twitter stakeholders"
    ...

We do not want to suggest all of these texts to the user, since many of them have almost identical meaning. Instead, we want to make a selection of documents that are related to the original search query, but otherwise different from each other. To achieve this we apply a number of different steps, including fuzzy-matching based near-duplicate removal. The final and most distinct step is based on a clustering algorithm.
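The exact filtering pipeline is not detailed here, but a minimal sketch of fuzzy-matching based near-duplicate removal could look as follows, using Python’s standard-library difflib ratio as the fuzzy-match score; the 0.9 threshold is an arbitrary illustrative value.

    from difflib import SequenceMatcher

    def is_near_duplicate(a: str, b: str, threshold: float = 0.9) -> bool:
        """Treat two queries as near-duplicates if their character-level similarity is high."""
        return SequenceMatcher(None, a.lower(), b.lower()).ratio() >= threshold

    def remove_near_duplicates(queries: list[str]) -> list[str]:
        kept: list[str] = []
        for query in queries:
            if not any(is_near_duplicate(query, existing) for existing in kept):
                kept.append(query)
        return kept

    candidates = [
        "twitter largest shareholder",
        "twitter largest shareholder",    # exact duplicate, dropped
        "Twitter's largest shareholder",  # near-duplicate, dropped
        "who joined twitter board",       # kept
    ]
    deduplicated = remove_near_duplicates(candidates)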

Clustering is an unsupervised method that divides a number of data points from a population into groups such that the points in one group are more similar to each other than to data points from any other group. In simpler terms, the goal is to assign data points to clusters based on some notion of similarity between the data points.

The careful reader realizes that this echoes aspects from the previous section, namely using document embeddings and the cosine similarity to measure the similarity between pairs of documents. The basic idea is to interpret the similar queries as a list of \(N\) candidate documents where our goal is to select the most distinct or most interesting documents as a subset of the candidate documents.

In order to do this, we cluster the data points, now represented by the document embeddings of the candidate texts, into groups of similar topics. We do this in a completely unsupervised fashion, which has the additional benefit of not interfering with the multilingual nature of the document embedding space. Put simply, documents with semantically similar meaning are grouped into the same cluster or topic, even across different languages.

This now allows us to make an informed selection of documents from the list of candidates, as sketched below. For example, one can divide the candidates into two topics, i.e. an “inlier” topic and an “outlier” topic, and only select the documents from the “inlier” topic. Or one can assign the documents to ten topics and select exactly one document from each topic to get a maximally diverse selection of documents. In addition to these two extremes, one can also decide on a moderate number of topics, e.g. three, and then select three documents from each topic to get a good balance between diversity and relevance.
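As a sketch of this clustering and selection step, the snippet below groups candidate query embeddings into a small number of topics with k-means and keeps a limited number of queries per topic. The use of scikit-learn’s KMeans and the parameter defaults are illustrative assumptions that mirror, but do not reproduce, the max_num_topics and max_num_items_per_topic settings documented below; it assumes the candidate queries are already ordered by similarity to the user’s query.

    import numpy as np
    from sklearn.cluster import KMeans

    def select_diverse_queries(
        queries: list[str],
        embeddings: np.ndarray,
        max_num_topics: int = 3,
        max_num_items_per_topic: int = 3,
    ) -> list[str]:
        """Cluster candidate queries into topics and keep a few queries per topic."""
        n_topics = min(max_num_topics, len(queries))
        labels = KMeans(n_clusters=n_topics, n_init=10, random_state=0).fit_predict(embeddings)

        selected: list[str] = []
        for topic in range(n_topics):
            # Candidates are assumed pre-sorted by similarity, so the first
            # members of each topic are also the most relevant ones.
            members = [q for q, label in zip(queries, labels) if label == topic]
            selected.extend(members[:max_num_items_per_topic])
        return selected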

The clustered set of similar queries then produces a much more diversified view, e.g.:

Top N queries per cluster:

    "Who buys shares in twitter?"
    "who joined twitter board"
    "twitter largest shareholder"
    "twitter's profits"
    "committee of investors"

All these options can be tuned by changing the default configuration of “Other People Ask”. This is documented in Advanced Configuration.

Advanced Configuration#

The Other People Ask feature can be configured via the Squirro Project Configuration.

Default Configuration#

Project Configuration key: topic.search.similar-searches-configuration

{
    "max_num_results": 5,
    "max_num_topics": 3,
    "max_num_candidates": 100,
    "max_num_items_per_topic": 3,
    "similarity_threshold": 0.6,
    "timeout_ms": 1500,
    "return_embedding": false,
    "return_popularity": false,
    "return_score": false
}

Configuration Reference#

max_num_results

Specifies the maximum number of results, i.e. Similar Queries, that are returned by the API.

Type: number (int)

max_num_candidates

Specifies how many similar queries are considered for finding interesting suggestions.

Type: number (int)

max_num_topics

Specifies the maximum number of topics detected within the remaining candidate queries after filtering. Topic detection can be disabled by setting this option to null.

Type: number (int)

max_num_items_per_topic

Specifies the maximum number of items retained for each detected topic. Limiting items per topic can be disabled by setting this option to null.

Type: number (int)

similarity_threshold

Specifies a minimal similarity score for a Similar Query item to be considered as a suggestion.

Type: number (float)

timeout_ms

Specifies a timeout (measured in milliseconds) after which the API no longer waits for the machine learning model to return a document embedding for the query.

Type: number (float)

return_embedding

Specifies if the embedding of a Similar Query result item should be returned.

Type: boolean

return_popularity

Specifies if the popularity score of a Similar Query result item should be returned.

Type: boolean

return_score

Specifies if the similarity score of a Similar Query result item should be returned.

Type: boolean