Base Types#

Introduces the fundamental types of libNLP.

Document#

Document class

class Document(doc_id, fields=None, skipped=False, copy_fields=True, fully_processed=True)#

A Document is the internal representation of data inside libNLP. Documents are streamed through a Pipeline, with each Step acting on them in memory.

Parameters
  • doc_id (str or int) – Given Document id

  • fields (dict, None) – Dictionary of fields

  • skipped (bool, False) – Skip document during processing

  • copy_fields (bool, True) – Deep copy of fields

  • fully_processed (bool, True) – Indicates whether the document has been fully processed. A step can set this to mark that it was not prepared to process the document.

Example:

from squirro.lib.nlp.document import Document

document = Document(0, {"a_field": "this is a field"})
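
The optional flags can be combined at construction time. A minimal sketch using only the documented parameters:

# skipped=True marks the document to be passed over during processing
skipped_doc = Document(1, {"a_field": "ignore me"}, skipped=True)

# copy_fields=False skips the deep copy of the fields dictionary,
# which can matter when streaming many large documents
shared_fields = {"a_field": "shared payload"}
shared_doc = Document(2, shared_fields, copy_fields=False)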
abort_processing()#

Handle document state for steps that stopped execution prematurely. Set metadata that is used downstream, e.g. for pipeline-level caching.

Return type

Document
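
A hedged sketch of calling it; only the no-argument signature and the Document return type above are documented, the surrounding context is illustrative:

# e.g. inside a step that stopped processing a document prematurely
document = document.abort_processing()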

Pipeline#

Pipeline class

class Pipeline(configs, path=None, cache_client=None, ml_workflow_id=None, project_id=None)#

The Pipeline class is defined by a sequential list of Steps. It handles Document streaming through each Step, as well as loading and saving Step configurations.

Parameters
  • configs (list) – Sorted list of Step configs

  • path (str, '.') – Path to Step storage

  • cache_client (CacheWithExpiration, None) – Cache client

  • ml_workflow_id (str, None) – Machine learning workflow ID

  • project_id (str, None) – Project ID

Example:

from squirro.lib.nlp.pipeline import Pipeline
from squirro.lib.nlp.document import Document

steps = [{
    "step": "normalizer",
    "type": "punctuation",
    "fields": ["body"]
}]

pipeline = Pipeline(steps, path='.')

documents = [Document(0, {"body": "this is a field!"})]
documents = pipeline.process(documents)  # returns a generator
documents = list(documents)  # consume the generator to run the pipeline
print(documents)
save()#

Save the steps

Return type

None

load()#

Load the steps

Return type

None
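
save() and load() round-trip the Step configurations under the Pipeline's path. A minimal sketch continuing the example above:

pipeline.save()  # persist Step configurations under `path`
pipeline.load()  # restore them later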

clean()#

Clean up after steps

Return type

None

terminate()#

Terminate any extra running processes in the steps

process(x)#

Process the steps

Parameters

x – Input for the first Step of the pipeline

Returns

Generator of Documents from the last Step of the pipeline

Return type

generator(Document)

train(x)#

Train the steps

Parameters

x – Input for the first Step of the pipeline

Returns

Generator of Documents from the last Step of the pipeline

Return type

generator(Document)
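
train() mirrors process(): the returned generator must be consumed for the Steps to run. A hedged sketch reusing the pipeline from the example above:

documents = [Document(1, {"body": "another field!"})]
trained_documents = list(pipeline.train(documents))  # consume the generator to train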

Runner#

Runner class

class Runner(config, ml_workflow_id=None, project_id=None)#

The Runner controls libNLP runs. It provides train, test, infer, and clean functions.

Parameters
  • analyzer (dict) –

    Dictionary defining the Analyzer for the Pipeline

    Deprecated since version 3.6.3.

  • dataset (dict) – Dictionary defining train, test, and infer datasets

  • path (str) – Path to model storage

  • pipeline (dict) – Dictionary defining the Pipeline

Example:

from squirro.lib.nlp.runner import Runner

config = {
    "dataset": {
        "items": [{
          "id": "0",
          "label": ["fake"],
          "body": "<html><body><p>This is a fake Squirro Item. It is composed of a couple fake sentences.</p></body></html>"
        },{
          "id": "1",
          "label": ["not fake"],
          "body": "<html><body><p>This is not a fake Squirro Item. It is composed of a couple not fake sentences.</p></body></html>"
        },{
          "id": "2",
          "label": ["fake"],
          "body": "<html><body><p>This is a fake Squirro Item. It is composed of a couple fake sentences.</p></body></html>"
        },{
          "id": "3",
          "label": ["not fake"],
          "body": "<html><body><p>This is not a fake Squirro Item. It is composed of a couple not fake sentences.</p></body></html>"
        },{
          "id": "4",
          "label": ["fake"],
          "body": "<html><body><p>This is a fake Squirro Item. It is composed of a couple fake sentences.</p></body></html>"
        }]
      },
      "pipeline": [
        {
          "fields": [
            "body",
            "label"
          ],
          "step": "loader",
          "type": "squirro_item"
        },
        {
          "fields": [
            "body"
          ],
          "step": "filter",
          "type": "empty"
        },
        {
          "input_fields": [
            "extract_sentences"
          ],
          "output_fields": [
            "normalized_extract"
          ],
          "step": "normalizer",
          "type": "html"
        },
        {
          "fields": [
            "normalized_extract"
          ],
          "step": "normalizer",
          "type": "punctuation"
        },
        {
          "fields": [
            "normalized_extract"
          ],
          "mark_as_skipped": true,
          "step": "filter",
          "type": "regex",
          "whitelist_regexes": [
            "^.{20,}$"
          ]
        },
        {
          "step": "embedder",
          "type": "transformers",
          "transformer": "huggingface",
          "model_name": "https://tfhub.dev/google/universal-sentence-encoder/4",
          "input_field": "body",
          "output_field": "embedded_extract"
        },
        {
          "step": "randomizer",
          "type": "randomizer"
        },
        {
          "input_field": "embedded_extract",
          "label_field": "label",
          "output_field": "prediction",
          "step": "classifier",
          "type": "cosine_similarity"
        },
        {
          "step": "debugger",
          "type": "log_fields",
          "fields": [
            "extract_sentences",
            "prediction"
          ],
          "log_level": "warning"
        }
    ]
}

runner = Runner(config)
try:
    for doc in runner.train():
        print(doc)
    for doc in runner.infer():
        print(doc)
finally:
    runner.clean()
property cache_client: Optional[squirro.lib.nlp.utils.cache.base.CacheWithExpiration]#

Placeholder for the Runner’s cache client.

Subclasses can provide their own cache client by overriding this property.

Return type

Optional[CacheWithExpiration]
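
A minimal sketch of a subclass supplying its own cache client; the constructor wiring is an assumption, only the overridden property is documented API:

from typing import Optional

from squirro.lib.nlp.runner import Runner
from squirro.lib.nlp.utils.cache.base import CacheWithExpiration

class CachingRunner(Runner):
    def __init__(self, config, cache_client=None, **kwargs):
        # cache_client wiring is hypothetical; pass any CacheWithExpiration instance
        super().__init__(config, **kwargs)
        self._cache_client = cache_client

    @property
    def cache_client(self) -> Optional[CacheWithExpiration]:
        return self._cache_client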

train()#

Train pipeline

Returns

Generator of processed documents

Return type

generator(Document)

test()#

Validate trained pipeline

Returns

Test summary from analyzer

Return type

dict
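
Unlike train() and infer(), test() returns a plain dictionary rather than a generator. A hedged sketch reusing the runner from the example above:

summary = runner.test()  # dict with the test summary from the analyzer
print(summary)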

infer()#

Infer with pipeline

Returns

Generator of processed documents

Return type

generator(Document)

clean()#

Clean the pipeline

Return type

None

terminate()#

Terminate the pipeline

Return type

None