Base Types#

Introduces the fundamental types of libNLP.

Document#

Document class

class Document(doc_id, fields=None, skipped=False, copy_fields=True, fully_processed=True)#

A Document is the internal representation of data inside libNLP. Documents are streamed through a Pipeline, with each Step acting on them in memory.

Parameters:

doc_id (str or int) – Given Document id
fields (dict, None) – Dictionary of fields
skipped (bool, False) – Skip document during processing
copy_fields (bool, True) – Deep copy of fields
fully_processed (bool, True) – Indicates whether the document has been fully processed. It allows marking a document that a step was not prepared to process.

Example:

from squirro.lib.nlp.document import Document

document = Document(0, {"a_field": "this is a field"})
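
The copy_fields flag controls whether the Document works on its own deep copy of the fields dictionary. The sketch below illustrates the difference with a hypothetical stand-in class, `ToyDocument`, not the real `Document`. It assumes that `copy_fields=False` keeps a reference to the caller's dictionary, which the reference above does not state explicitly.

```python
import copy


class ToyDocument:
    """Hypothetical stand-in for illustration only; not the libNLP Document."""

    def __init__(self, doc_id, fields=None, copy_fields=True):
        self.id = doc_id
        fields = {} if fields is None else fields
        # Assumption: with copy_fields=True the fields are deep-copied,
        # otherwise the caller's dictionary is shared.
        self.fields = copy.deepcopy(fields) if copy_fields else fields


source = {"tags": ["a"]}

copied = ToyDocument(0, source, copy_fields=True)
copied.fields["tags"].append("b")  # only changes the document's own copy

shared = ToyDocument(1, source, copy_fields=False)
shared.fields["tags"].append("c")  # also changes the caller's dictionary

print(source)         # {'tags': ['a', 'c']}
print(copied.fields)  # {'tags': ['a', 'b']}
```

A deep copy costs time and memory for large fields, so skipping it is only safe when the caller no longer needs the original dictionary unchanged.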

abort_processing()#

Handle document state for steps that stopped execution prematurely. Sets metadata that is used elsewhere, e.g. for pipeline-level caching.

Return type: Document

Pipeline#

Pipeline class

class Pipeline(configs, path=None, cache_client=None, ml_workflow_id=None, project_id=None)#

The Pipeline class is defined by a sequential list of Steps. It handles Document streaming through each Step, as well as loading and saving Step configurations.

Parameters:

configs (list) – sorted list of Step configs
path (str, '.') – Path to Step storage
cache_client (CacheWithExpiration, None) – Cache client
ml_workflow_id (str, None) – Machine learning workflow ID
project_id (str, None) – Project ID

Example:

from squirro.lib.nlp.pipeline import Pipeline
from squirro.lib.nlp.document import Document

steps = [{
    "step": "normalizer",
    "type": "punctuation",
    "fields": ["body"]
}]

pipeline = Pipeline(steps, path='.')

documents = [Document(0, {"body": "this is a field!"})]
documents = pipeline.process(documents)  # returns a generator
documents = list(documents)  # consume the generator to run the pipeline
print(documents)

save()#

Save the steps

Return type: None

load()#

Load the steps

Return type: None

clean()#

Clean up after steps

Return type: None

terminate()#

Terminate any extra running processes in the steps

process(x)#

Process the steps

Parameters: x – input for the first Step of the steps
Returns: generator of Documents from the last Step of the steps
Return type: generator(Document)
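
Because process returns a generator, no Step does any work until the result is consumed. The following is a minimal sketch of that lazy streaming in plain Python. `toy_process`, `strip_punctuation`, and `lowercase` are hypothetical stand-ins, and plain dicts take the place of Documents; whether the real Pipeline passes documents one at a time or in batches is not stated above.

```python
import string

calls = []  # records the order in which the steps actually run


def strip_punctuation(docs):
    """Hypothetical step: remove ASCII punctuation from the 'body' field."""
    table = str.maketrans("", "", string.punctuation)
    for doc in docs:
        calls.append(("strip", doc["id"]))
        yield {**doc, "body": doc["body"].translate(table)}


def lowercase(docs):
    """Hypothetical step: lowercase the 'body' field."""
    for doc in docs:
        calls.append(("lower", doc["id"]))
        yield {**doc, "body": doc["body"].lower()}


def toy_process(docs, steps):
    """Chain the steps; each one consumes the previous step's generator."""
    for step in steps:
        docs = step(docs)
    return docs


docs = [{"id": 0, "body": "This is a field!"}, {"id": 1, "body": "Another one?"}]
stream = toy_process(docs, [strip_punctuation, lowercase])

print(calls)                # [] because nothing has run yet
results = list(stream)      # consuming the generator runs the steps
print(results[0]["body"])   # this is a field
print(calls)  # [('strip', 0), ('lower', 0), ('strip', 1), ('lower', 1)]
```

In this sketch each document passes through every step before the next document starts, which keeps memory use flat for long streams.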

train(x)#

Train the steps

Parameters: x – input for the first Step of the steps
Returns: generator of Documents from the last Step of the steps
Return type: generator(Document)

Runner#

Runner class

class Runner(config, ml_workflow_id=None, project_id=None)#

The Runner controls libNLP runs. It provides train, test, infer, and clean functions.

Parameters:

analyzer (dict) – Dictionary defining the Analyzer for the Pipeline. Deprecated since version 3.6.3.
dataset (dict) – Dictionary defining train, test, and infer datasets
path (str) – Path to model storage
pipeline (dict) – Dictionary defining the Pipeline

Example:

from squirro.lib.nlp.runner import Runner

config = {
    "dataset": {
        "items": [{
          "id": "0",
          "label": ["fake"],
          "body": "<html><body><p>This is a fake Squirro Item. It is composed of a couple fake sentences.</p></body></html>"
        },{
          "id": "1",
          "label": ["not fake"],
          "body": "<html><body><p>This is not a fake Squirro Item. It is composed of a couple not fake sentences.</p></body></html>"
        },{
          "id": "2",
          "label": ["fake"],
          "body": "<html><body><p>This is a fake Squirro Item. It is composed of a couple fake sentences.</p></body></html>"
        },{
          "id": "3",
          "label": ["not fake"],
          "body": "<html><body><p>This is not a fake Squirro Item. It is composed of a couple not fake sentences.</p></body></html>"
        },{
          "id": "4",
          "label": ["fake"],
          "body": "<html><body><p>This is a fake Squirro Item. It is composed of a couple fake sentences.</p></body></html>"
        }]
      },
      "pipeline": [
        {
          "fields": [
            "body",
            "label"
          ],
          "step": "loader",
          "type": "squirro_item"
        },
        {
          "fields": [
            "body"
          ],
          "step": "filter",
          "type": "empty"
        },
        {
          "input_fields": [
            "extract_sentences"
          ],
          "output_fields": [
            "normalized_extract"
          ],
          "step": "normalizer",
          "type": "html"
        },
        {
          "fields": [
            "normalized_extract"
          ],
          "step": "normalizer",
          "type": "punctuation"
        },
        {
          "fields": [
            "normalized_extract"
          ],
          "mark_as_skipped": True,
          "step": "filter",
          "type": "regex",
          "whitelist_regexes": [
            "^.{20,}$"
          ]
        },
        {
          "step": "embedder",
          "type": "transformers",
          "transformer": "huggingface",
          "model_name": "https://tfhub.dev/google/universal-sentence-encoder/4",
          "input_field": "body",
          "output_field": "embedded_extract"
        },
        {
          "step": "randomizer",
          "type": "randomizer"
        },
        {
          "input_field": "embedded_extract",
          "label_field": "label",
          "output_field": "prediction",
          "step": "classifier",
          "type": "cosine_similarity"
        },
        {
          "step": "debugger",
          "type": "log_fields",
          "fields": [
            "extract_sentences",
            "prediction"
          ],
          "log_level": "warning"
        }
    ]
}

runner = Runner(config)
try:
    for doc in runner.train():
        print(doc)
    for doc in runner.infer():
        print(doc)
finally:
    runner.clean()
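
The regex filter in the configuration above whitelists `^.{20,}$`, that is, text of at least 20 characters on a single line. The sketch below checks that pattern with Python's `re` module. How the filter step applies the pattern internally (for example `match` versus `search`, or any flags) is an assumption here.

```python
import re

# Pattern taken from the "whitelist_regexes" entry in the config above.
pattern = re.compile(r"^.{20,}$")

samples = [
    "Too short.",                                # 10 characters
    "Exactly twenty chars",                      # 20 characters
    "This sentence is long enough to be kept.",  # 40 characters
]

for text in samples:
    # Assumption: a document is kept when the whole field matches.
    kept = pattern.match(text) is not None
    print(f"{len(text):>2} chars, kept={kept}: {text}")
```

Only the first sample fails the check, since it is shorter than 20 characters.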

property cache_client: CacheWithExpiration | None#

Placeholder for the Runner’s cache client.

Subclasses can implement their own cache client using the property.
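
Overriding a property in a subclass is ordinary Python. The sketch below shows the pattern with hypothetical stand-ins (`BaseRunner`, `DictCache`, `CachingRunner`). The real `Runner` and `CacheWithExpiration` interfaces are not reproduced here, and the `get` and `set` methods are assumptions made for illustration.

```python
from typing import Optional


class DictCache:
    """Hypothetical in-memory cache; stands in for a real cache client."""

    def __init__(self):
        self._store = {}

    def get(self, key):
        return self._store.get(key)

    def set(self, key, value):
        self._store[key] = value


class BaseRunner:
    """Hypothetical base class whose cache_client is only a placeholder."""

    @property
    def cache_client(self) -> Optional[DictCache]:
        return None  # no caching by default


class CachingRunner(BaseRunner):
    """Subclass that supplies its own cache client through the property."""

    def __init__(self):
        self._cache = DictCache()

    @property
    def cache_client(self) -> Optional[DictCache]:
        return self._cache


runner = CachingRunner()
runner.cache_client.set("doc-0", "processed")
print(BaseRunner().cache_client)         # None
print(runner.cache_client.get("doc-0"))  # processed
```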

train()#

Train pipeline

Returns: Generator of processed documents
Return type: generator(Document)

test()#

Validate trained pipeline

Returns: Test summary from analyzer
Return type: dict

infer()#

Infer with pipeline

Returns: Generator of processed documents
Return type: generator(Document)

clean()#

Clean the pipeline

Return type: None

terminate()#

Terminate the pipeline

Return type: None
