Overview

Download it here and install it using pip:

$ pip install squirro.lib.nlp-SQUIRRO_VERSION-py2.py3-none-any.whl

libNLP is structured as a pipeline in which the user specifies a sequence of steps that load and transform unstructured data, which can then be classified (or otherwise processed) and ultimately saved either to Squirro or to disk (in CSV or JSON format).

The pipeline configuration is specified in JSON format. For example:

    {
    "dataset": {
        "items": [{
          "id": "0",
          "label": ["fake"],
          "body": "<html><body><p>This is a fake Squirro Item. It is composed of a couple fake sentences.</p></body></html>"
        },{
          "id": "1",
          "label": ["not fake"],
          "body": "<html><body><p>This is not a fake Squirro Item. It is composed of a couple not fake sentences.</p></body></html>"
        },{
          "id": "2",
          "label": ["fake"],
          "body": "<html><body><p>This is a fake Squirro Item. It is composed of a couple fake sentences.</p></body></html>"
        },{
          "id": "3",
          "label": ["not fake"],
          "body": "<html><body><p>This is not a fake Squirro Item. It is composed of a couple not fake sentences.</p></body></html>"
        },{
          "id": "4",
          "label": ["fake"],
          "body": "<html><body><p>This is a fake Squirro Item. It is composed of a couple fake sentences.</p></body></html>"
        }]
    },
    "pipeline": [
        {
          "fields": [
            "body",
            "label"
          ],
          "step": "loader",
          "type": "squirro_item"
        },
        {
          "fields": [
            "body"
          ],
          "step": "filter",
          "type": "empty"
        },
        {
          "input_fields": [
            "extract_sentences"
          ],
          "output_fields": [
            "normalized_extract"
          ],
          "step": "normalizer",
          "type": "html"
        },
        {
          "fields": [
            "normalized_extract"
          ],
          "step": "normalizer",
          "type": "punctuation"
        },
        {
          "fields": [
            "normalized_extract"
          ],
          "mark_as_skipped": true,
          "step": "filter",
          "type": "regex",
          "whitelist_regexes": [
            "^.{20,}$"
          ]
        },
        {
          "step": "embedder",
          "type": "transformers",
          "transformer": "huggingface",
          "model_name": "https://tfhub.dev/google/universal-sentence-encoder/4",
          "input_field": "body",
          "output_field": "embedded_extract"
        },
        {
          "step": "randomizer",
          "type": "randomizer"
        },
        {
          "input_field": "embedded_extract",
          "label_field": "label",
          "output_field": "prediction",
          "step": "classifier",
          "type": "cosine_similarity"
        },
        {
          "step": "debugger",
          "type": "log_fields",
          "fields": [
            "extract_sentences",
            "prediction"
          ],
          "log_level": "warning"
        }
    ]
    }