Discover (NLP Tagger) Pipeline Step#

Discover includes steps around topic modelling and clustering, as well as analysis for Typeahead Suggestions.

The NLP Tagger is the exposed Discover step.

NLP Keyphrase Tagger#

The built-in “Nlp Keyphrase Tagger” passes items through a configurable spaCy pipeline to perform key-phrase extraction, named entity recognition, and rule-based sentiment analysis.


General Configuration#

The Pipelet is configurable within the pipeline Editor.

Input Fields#

The item fields to consider for analysis.

  • fields_to_consider : Comma separated list of fields (default: title,body)

Reduce Processing Time for Large Documents#

To reduce the processing time of large PDFs, consider only a subset of pages.

  • process_pages :

    • dynamic : Chosen relative to document size (default)
      Take at least 10 pages, but at most √total_pages
    • all : Take all pages.

    • int : Take first N pages.
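
The three process_pages modes can be sketched as below. This is a minimal illustration, not the tagger's implementation: the function name is hypothetical, and the dynamic rule is read here as "the larger of 10 and √total_pages, capped at the document length", which is only one possible interpretation of the description above.

```python
import math


def pages_to_process(total_pages, process_pages="dynamic"):
    """Return how many leading pages to analyse (hypothetical helper)."""
    if total_pages <= 0:
        return 0
    if process_pages == "all":
        return total_pages
    if process_pages == "dynamic":
        # Assumed reading: at least 10 pages, growing with sqrt(total_pages)
        # for very large documents, never more than the document has.
        wanted = max(10, math.isqrt(total_pages))
        return min(total_pages, wanted)
    # Otherwise treat the value as an integer N: take the first N pages.
    return min(total_pages, max(0, int(process_pages)))
```

Under this reading, a 400-page PDF would have 20 pages analysed in dynamic mode, while a 4-page PDF would be analysed in full.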

Additionally, you can set a hard limit on the number of characters to process. This reduces processing time, especially for large non-binary documents such as HTML pages or emails (“flat items”).

  • max_characters_to_process :

    • all : Analyse full content

    • int : Take the first N characters (default: 50000)
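
The effect of max_characters_to_process amounts to a simple truncation. The helper below is a hypothetical sketch of that behaviour; how the tagger applies the limit internally is not described here.

```python
def limit_characters(text, max_characters_to_process=50000):
    """Truncate text to the configured character limit (hypothetical helper)."""
    if max_characters_to_process == "all":
        # Analyse the full content.
        return text
    # Otherwise keep only the first N characters.
    return text[: int(max_characters_to_process)]
```

For example, `limit_characters(body, 1000)` would keep only the first 1000 characters of `body`.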

Language Support#

By default, English (en_core_web_sm) and German (de_core_news_sm) models are installed on Squirro instances.

  • Install additional language models, for example Japanese (see the available spaCy models):
    python -m spacy download ja_core_news_sm
  • language_models : Updates the spaCy language model mapping (the language code is expected in the language facet, see Language Detection).
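
Conceptually, the mapping resolves an item's language code to an installed model name. The sketch below assumes the mapping is a plain dictionary and uses a hypothetical function name; the exact format expected by the language_models setting is not shown in this section.

```python
# Models installed by default, as listed above.
DEFAULT_LANGUAGE_MODELS = {
    "en": "en_core_web_sm",
    "de": "de_core_news_sm",
}


def resolve_model(language_code, language_models=None):
    """Return the spaCy model name for a language code, or None if unmapped."""
    mapping = dict(DEFAULT_LANGUAGE_MODELS)
    # Entries from the (assumed dict-shaped) configuration extend or override the defaults.
    mapping.update(language_models or {})
    return mapping.get(language_code)


# The resolved name could then be passed to spacy.load(), for example:
#   nlp = spacy.load(resolve_model("ja", {"ja": "ja_core_news_sm"}))
```

An item whose language has no mapped model returns None in this sketch; what the tagger does with such items is not covered here.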


Key Phrase Extraction#

Extract the highest-ranked key phrases based on the TextRank algorithm.

Key phrases are selected and ranked from a pool of recognized Noun Chunks and recognized Named Entities per item.
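
To illustrate the idea behind TextRank-style ranking, the following sketch ranks a sequence of candidate phrases with a PageRank-like iteration over a co-occurrence graph. It is a simplified stand-in, not the tagger's code: the candidates are plain strings rather than spaCy noun chunks and entities, and the graph construction (linking phrases that appear close together), the window size, and the function name are all assumptions.

```python
from collections import defaultdict


def textrank(phrases, window=3, damping=0.85, iterations=50):
    """Rank candidate phrases (given in document order), best first."""
    # Undirected co-occurrence graph: each phrase is linked to the
    # (window - 1) phrases that follow it in the document.
    neighbours = defaultdict(set)
    for i, phrase in enumerate(phrases):
        for other in phrases[i + 1 : i + window]:
            if other != phrase:
                neighbours[phrase].add(other)
                neighbours[other].add(phrase)

    nodes = list(dict.fromkeys(phrases))  # unique phrases, first-seen order
    scores = {node: 1.0 for node in nodes}
    for _ in range(iterations):
        # PageRank update: each neighbour shares its score equally
        # among its own neighbours.
        scores = {
            node: (1 - damping)
            + damping * sum(scores[n] / len(neighbours[n]) for n in neighbours[node])
            for node in nodes
        }
    return sorted(nodes, key=lambda node: scores[node], reverse=True)


candidates = ["email content", "customers", "email content", "data analysis", "email content", "insight"]
top_phrases = textrank(candidates, window=2)[:2]
```

Phrases that co-occur with many other phrases accumulate the highest scores, so "email content" ranks first in the usage example.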


  • tag_phrases: Enable / Disable key-phrase tagging

  • tag_top_k_phrases: Number of phrases to tag

    • dynamic : The number of phrases is chosen relative to document size (between 20 and 70)

    • int : Take the N highest-ranked phrases, for example 10

  • tag_topics: Enable simple topic-tagging based on key-phrases


Key phrases are stored within the nlp_tag__phrases facet.
The item’s Title is also added to the nlp_tag__phrases facet (as-is, without processing).


The key phrases are used for:

  • Content-based autocompletion, as part of Typeahead Suggestions.

  • Significant-terms aggregation on search results.

Simple Topic Detection#

With configuration tag_topics:True, the pool of ranked key-phrases is used to extract cleaned, deduplicated phrases referred to as “topics” (stored in the nlp_tag__topics facet).


1) Cleaning steps:
  - Remove terms with specific Part-of-Speech (POS) tags, such as `adjectives`, `determiners` or `punctuation`.
  - Remove terms consisting (almost) entirely of digits, like `33120x`
  - De-duplicate:
      - Do not use phrases that belong to specific Named Entity types, like ["PRODUCT", "EVENT", "PERSON"] (configurable)
      - Do not use phrases that share terms with already stored "topics"
2) Select 20 phrases evenly across all ranks (as determined via TextRank)
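
The cleaning and selection steps above can be sketched as follows. This is an illustrative approximation under stated assumptions: the input structure (phrases as lists of `(text, pos)` pairs with an optional entity label), the 50% digit threshold for "(almost) only number characters", and the function names are hypothetical and not taken from the tagger.

```python
EXCLUDED_POS = {"ADJ", "DET", "PUNCT"}
EXCLUDED_ENTITY_LABELS = {"PRODUCT", "EVENT", "PERSON"}


def clean_phrase(tokens):
    """Drop tokens by POS tag and mostly-numeric tokens; return the cleaned phrase."""
    kept = []
    for text, pos in tokens:
        if not text or pos in EXCLUDED_POS:
            continue
        digits = sum(ch.isdigit() for ch in text)
        if digits / len(text) > 0.5:  # assumed threshold for "almost only digits"
            continue
        kept.append(text)
    return " ".join(kept)


def select_topics(ranked_phrases, max_topics=20):
    """ranked_phrases: dicts with "tokens" and "entity_label", best rank first."""
    candidates, seen_terms = [], set()
    for phrase in ranked_phrases:
        if phrase.get("entity_label") in EXCLUDED_ENTITY_LABELS:
            continue
        topic = clean_phrase(phrase["tokens"])
        terms = set(topic.lower().split())
        # Skip empty phrases and phrases sharing a term with an earlier topic.
        if not terms or terms & seen_terms:
            continue
        seen_terms |= terms
        candidates.append(topic)
    if len(candidates) <= max_topics:
        return candidates
    # Pick topics at evenly spaced positions across the whole ranking.
    step = len(candidates) / max_topics
    return [candidates[int(i * step)] for i in range(max_topics)]
```

Sampling at evenly spaced ranks, rather than taking only the top 20, keeps lower-ranked but distinct themes represented among the topics.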

Named Entity Recognition#

Stores recognised entities in their corresponding facets.


  • tag_entities : Enable entity (NER) tagging.

  • collect_entities : Specify NER tags to be added. (Check support on installed Label Scheme).

  • tag_entities_per_type : Number of entities (per type) to add to their corresponding facet.


One facet is created per entity type, for example Location = [Europe, London]
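
Grouping recognised entities into one facet per type could look like the sketch below. The input format (pairs of entity text and NER label), the function name, and the choice to keep the most frequent entities per type are assumptions made for illustration; the tagger's actual selection criterion is not stated here.

```python
from collections import Counter, defaultdict


def entities_to_facets(entities, collect_entities, tag_entities_per_type):
    """Group (text, label) pairs into {label: [entity texts]}."""
    counts = defaultdict(Counter)
    for text, label in entities:
        # Only keep the NER labels that were configured for collection.
        if label in collect_entities:
            counts[label][text] += 1
    # Assumed criterion: keep the most frequent entities of each type.
    return {
        label: [text for text, _ in counter.most_common(tag_entities_per_type)]
        for label, counter in counts.items()
    }


found = [("London", "GPE"), ("Europe", "LOC"), ("London", "GPE"), ("Paris", "GPE")]
facets = entities_to_facets(found, collect_entities={"GPE", "LOC"}, tag_entities_per_type=1)
```

With a limit of one entity per type, `facets` contains "London" for GPE and "Europe" for LOC.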

Sentiment Analysis#

Applies rule-based sentiment analysis (vaderSentiment) that is specifically attuned to sentiments expressed in social media or domains like NY Times editorials, movie reviews, and product reviews.

It doesn’t require any training data but is constructed from a generalizable, valence-based, human-curated gold standard sentiment lexicon.


  • tag_sentiment : Enable rule-based sentiment tagging (English only)


  • Overall Sentiment Label
    One sentiment label (neutral, positive, negative) per document.
    • Sentiment analysis is applied per sentence

    • Sentences with neutral sentiment are skipped

  • Overall Sentiment Score
    Float value within [-1,+1]
  • Sentiment Assessment
    facet:positive_terms, facet:negative_terms
    A sentiment phrase consists of the valence term and its context.
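
The document-level label and score can be thought of as an aggregation of per-sentence scores. The sketch below takes per-sentence compound scores in [-1, +1] as input, such as those returned by vaderSentiment's `polarity_scores(sentence)["compound"]`. The ±0.05 cut-off follows the thresholds conventionally recommended for VADER compound scores; averaging the non-neutral sentences is an assumption, as is the function name.

```python
def overall_sentiment(sentence_scores, threshold=0.05):
    """Aggregate per-sentence compound scores into (label, score)."""
    # Sentences with neutral sentiment are skipped.
    relevant = [score for score in sentence_scores if abs(score) >= threshold]
    if not relevant:
        return "neutral", 0.0
    # Assumed aggregation: mean of the remaining sentence scores.
    score = sum(relevant) / len(relevant)
    if score >= threshold:
        label = "positive"
    elif score <= -threshold:
        label = "negative"
    else:
        label = "neutral"
    return label, score


label, score = overall_sentiment([0.0, 0.6, 0.4])
```

Because neutral sentences are excluded before averaging, a long factual document with a few strongly worded sentences still receives a clear label.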


Positive Product Feedback#

  • Input

“The tech provides insight into unstructured email content, it allows me to truly understand the conversation between the business and our customers. The insight gained from this analysis is significantly deeper than can be achieved from structured data analysis.”

  • Output

  'sentiment_pretrained': ['positive'],
  'positive_terms': ['truly understand', 'insight gained'],
  'negative_terms': [],
  'nlp_tag__phrases': ['structured data analysis', 'unstructured email content']

→ That review showcases the combined insights gained through sentiment-assessment and key-phrase extraction.

Negative Feedback#

  • Input

“This was not a good experience”

  • Output

  'sentiment_pretrained': ['negative'],
  'positive_terms': [],
  'negative_terms': ['not a good experience']