Base types

Introduces the fundamental types of libNLP.

Document

Document class

class squirro.lib.nlp.document.Document(doc_id, fields=None, skipped=False, copy_fields=True, fully_processed=True)

A Document is the internal representation of data inside libNLP. Documents are streamed through a Pipeline, with each Step acting on them in memory.

Parameters
  • doc_id (str or int) – Given Document id

  • fields (dict, None) – Dictionary of fields

  • skipped (bool, False) – Skip document during processing

  • copy_fields (bool, True) – Deep copy of fields

  • fully_processed (bool, True) – Indicates whether the document has been fully processed. It allows marking that a step was not prepared to process the document.
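
The effect of copy_fields can be illustrated with a small stand-in class. This is a sketch under stated assumptions, not libNLP's implementation: SketchDocument is a hypothetical name, and it assumes that copy_fields=True deep-copies the dictionary passed in, as the parameter description suggests.

```python
import copy


class SketchDocument:
    """Hypothetical stand-in for Document; not the libNLP class."""

    def __init__(self, doc_id, fields=None, copy_fields=True):
        self.id = doc_id
        if fields is None:
            fields = {}
        # Assumption: copy_fields=True isolates the document from the caller's dict.
        self.fields = copy.deepcopy(fields) if copy_fields else fields


source = {"tags": ["a"]}
copied = SketchDocument(0, source, copy_fields=True)
shared = SketchDocument(1, source, copy_fields=False)

# Mutating the caller's dictionary only affects the document that shares it.
source["tags"].append("b")
print(copied.fields["tags"])  # ['a']
print(shared.fields["tags"])  # ['a', 'b']
```

Under this assumption, disabling the copy saves memory for large field dictionaries but leaves the document exposed to later changes made by the caller.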

Example:

from squirro.lib.nlp.document import Document

document = Document(0, {"a_field": "this is a field"})

abort_processing()

Handle document state for steps that stopped execution prematurely. Set metadata that is used externally, e.g. for pipeline-level caching.

Return type

Document

Pipeline

Pipeline class

class squirro.lib.nlp.pipeline.Pipeline(configs, path=None, cache_client=None, ml_workflow_id=None)

The Pipeline class is defined by a sequential list of Steps. It handles Document streaming through each Step, as well as loading and saving Step configurations.
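
The idea of streaming documents through a sequential list of steps can be sketched with plain generator functions. This is an illustration only and does not reflect how libNLP implements Pipeline or Step: the step functions, the run_steps helper, and the use of plain dictionaries as documents are all hypothetical.

```python
def strip_punctuation(docs):
    """Hypothetical step: remove '!' and '.' from the 'text' field."""
    for doc in docs:
        doc["text"] = doc["text"].replace("!", "").replace(".", "")
        yield doc


def lowercase(docs):
    """Hypothetical step: lowercase the 'text' field."""
    for doc in docs:
        doc["text"] = doc["text"].lower()
        yield doc


def run_steps(steps, docs):
    """Chain steps so each one consumes the previous step's output stream."""
    stream = iter(docs)
    for step in steps:
        stream = step(stream)
    return stream


result = list(run_steps([strip_punctuation, lowercase], [{"text": "This is a field!"}]))
print(result)  # [{'text': 'this is a field'}]
```

Because every step is a generator, each document passes through all steps one at a time rather than the whole batch being held at every stage.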

Parameters

Example:

from squirro.lib.nlp.document import Document
from squirro.lib.nlp.pipeline import Pipeline

steps = [{
  "step": "normalizer",
  "type": "punctuation",
  "fields": ["a_field"]
}]

pipeline = Pipeline(steps, path='.')

documents = [Document(0, {"a_field": "this is a field!"})]
# process() returns a generator, so iterate over it to run the steps
for document in pipeline.process(documents):
    print(document)

save()

Save the steps

Return type

None

load()

Load the steps

Return type

None

clean()

Clean up after steps

Return type

None

terminate()

Terminate any extra running processes in the steps

process(x)

Process the steps

Parameters

x – Input for the first Step of the pipeline

Returns

Generator of Documents from the last Step of the pipeline

Return type

generator(Document)
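
Since the documented return type is a generator, the result can presumably be iterated only once. The sketch below shows that general Python behaviour with a stand-in function; fake_process is hypothetical and is not part of libNLP.

```python
def fake_process(docs):
    """Hypothetical stand-in for a call that returns a generator of documents."""
    for doc in docs:
        yield {**doc, "processed": True}


stream = fake_process([{"id": 0}, {"id": 1}])
first_pass = list(stream)   # consumes the generator
second_pass = list(stream)  # the generator is exhausted, so this is empty

print(len(first_pass), len(second_pass))  # 2 0
```

If the processed documents are needed more than once, materializing them with list() on the first pass avoids rerunning the steps.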

train(x)

Train the steps

Parameters

x – Input for the first Step of the pipeline

Returns

Generator of Documents from the last Step of the pipeline

Return type

generator(Document)

Runner

Runner class

class squirro.lib.nlp.runner.Runner(config, ml_workflow_id=None)

The Runner controls libNLP runs. It provides train, test, infer, and clean functions.

Parameters
  • analyzer (dict) – Dictionary defining the Analyzer for the Pipeline

  • dataset (dict) – Dictionary defining train, test, and infer datasets

  • path (str) – Path to model storage

  • pipeline (dict) – Dictionary defining the Pipeline
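
Before constructing a Runner, it can help to check which of the keys listed above are present in a config dictionary. The helper below is a hypothetical convenience, not part of libNLP, and this page does not state which keys are mandatory, so the function only reports absent keys without treating them as errors.

```python
# Keys taken from the parameter list above.
DOCUMENTED_KEYS = ("analyzer", "dataset", "path", "pipeline")


def missing_config_keys(config):
    """Return the documented keys that are absent from config (hypothetical helper)."""
    return [key for key in DOCUMENTED_KEYS if key not in config]


config = {
    "dataset": {"train": "data/train"},
    "analyzer": {"type": "classification"},
    "pipeline": [],
}
print(missing_config_keys(config))  # ['path']
```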

Example:

from squirro.lib.nlp.runner import Runner

config = {
    "dataset": {
        "train": "data/train",
        "test": "data/test",
        "infer": "data/infer"
    },
    "analyzer": {
        "type": "classification",
        "tag_field": "pred_class",
        "label_field": "class"
    },
    "pipeline": [
        {
            "step": "loader",
            "type": "csv",
            "fields": ["sepal length", "sepal width", "petal length", "petal width", "class"]
        },
        {
            "step": "classifier",
            "type": "sklearn",
            "input_fields": ["sepal length", "sepal width", "petal length", "petal width"],
            "label_field": "class",
            "model_type": "SVC",
            "model_kwargs": {"probability": True},
            "output_field": "pred_class",
            "explanation_field": "explanation"
        }
    ]
}

runner = Runner(config)
try:
    for doc in runner.train():
        print(doc)
    print(runner.test())
    for doc in runner.infer():
        print(doc)
finally:
    runner.clean()

property cache_client

Placeholder for the Runner’s cache client.

Subclasses can implement their own cache client using the property.

Return type

Optional[CacheWithExpiration]

train()

Train pipeline

Returns

Generator of processed documents

Return type

generator(Document)

test()

Validate trained pipeline

Returns

Test summary from analyzer

Return type

dict

infer()

Infer with pipeline

Returns

Generator of processed documents

Return type

generator(Document)

clean()

Clean the pipeline

Return type

None

terminate()

Terminate the pipeline

Return type

None