Base Types
Introduces the fundamental types of libNLP.
Document
Document class
- class Document(doc_id, fields=None, skipped=False, copy_fields=True, fully_processed=True)
A Document is the internal representation of data inside libNLP. Documents are streamed through a Pipeline, with each Step acting on them in memory.
- Parameters:
fields (dict, None) – Dictionary of fields
skipped (bool, False) – Skip document during processing
copy_fields (bool, True) – Deep copy of fields
fully_processed (bool, True) – Indicates whether the document has been fully processed. It allows marking that a step was not prepared to process the document.
Example:
from squirro.lib.nlp.document import Document

document = Document(0, {"a_field": "this is a field"})
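The effect of the copy_fields parameter can be illustrated without libNLP installed. The following is a minimal sketch, assuming that copy_fields=True behaves like Python's copy.deepcopy applied to the fields dictionary. SketchDocument is a hypothetical stand-in, not the real Document class.

```python
import copy


class SketchDocument:
    """Hypothetical stand-in for Document; not the libNLP implementation."""

    def __init__(self, doc_id, fields=None, copy_fields=True):
        self.id = doc_id
        fields = fields if fields is not None else {}
        # Assumption: copy_fields=True deep-copies the dictionary, so nested
        # values are no longer shared with the caller's object.
        self.fields = copy.deepcopy(fields) if copy_fields else fields


source = {"tags": ["a"]}
copied = SketchDocument(0, source, copy_fields=True)
shared = SketchDocument(1, source, copy_fields=False)

source["tags"].append("b")  # mutate the caller's dictionary afterwards

print(copied.fields["tags"])  # the deep copy is unaffected by the mutation
print(shared.fields["tags"])  # the shared dictionary sees the mutation
```

Under this assumption, disabling the copy would save memory for large documents, at the cost of sharing mutable state with the caller.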
Pipeline
Pipeline class
- class Pipeline(configs, path=None, cache_client=None, ml_workflow_id=None, project_id=None)
The Pipeline class is defined by a sequential list of Steps. It handles Document streaming through each Step, as well as loading and saving Step configurations.
- Parameters:
Example:
from squirro.lib.nlp.pipeline import Pipeline
from squirro.lib.nlp.document import Document

steps = [{
    "step": "normalizer",
    "type": "punctuation",
    "fields": ["body"]
}]

pipeline = Pipeline(steps, path='.')
documents = [Document(0, {"body": "this is a field!"})]
documents = pipeline.process(documents)  # returns a generator
documents = list(documents)  # consume the generator to run the pipeline
print(documents)
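Because process returns a generator in the example above, no step runs until the result is consumed. The following is a minimal sketch of that lazy streaming pattern in plain Python. The process function, the dictionary-based documents, and strip_punctuation are hypothetical stand-ins and do not reproduce the libNLP implementation or its punctuation normalizer.

```python
calls = []  # records which documents a step has actually touched


def strip_punctuation(doc):
    """Hypothetical step: keep only alphanumeric and whitespace characters."""
    calls.append(doc["id"])
    cleaned = "".join(ch for ch in doc["body"] if ch.isalnum() or ch.isspace())
    return {**doc, "body": cleaned}


def process(documents, steps):
    """Hypothetical stand-in for a pipeline: stream each document through all steps."""
    for doc in documents:
        for step in steps:
            doc = step(doc)
        yield doc


docs = [{"id": 0, "body": "this is a field!"}]
stream = process(docs, [strip_punctuation])

ran_before = list(calls)  # snapshot: still empty, the generator has not started
results = list(stream)    # consuming the generator runs the steps

print(ran_before)
print(calls)
print(results[0]["body"])
```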
- terminate()
Terminate any extra running processes in the steps.
Runner
Runner class
- class Runner(config, ml_workflow_id=None, project_id=None)
The Runner controls libNLP runs. It provides train, test, infer, and clean functions.
- Parameters:
Example:
from squirro.lib.nlp.runner import Runner

config = {
    "dataset": {
        "items": [{
            "id": "0",
            "label": ["fake"],
            "body": "<html><body><p>This is a fake Squirro Item. It is composed of a couple fake sentences.</p></body></html>"
        }, {
            "id": "1",
            "label": ["not fake"],
            "body": "<html><body><p>This is not a fake Squirro Item. It is composed of a couple not fake sentences.</p></body></html>"
        }, {
            "id": "2",
            "label": ["fake"],
            "body": "<html><body><p>This is a fake Squirro Item. It is composed of a couple fake sentences.</p></body></html>"
        }, {
            "id": "3",
            "label": ["not fake"],
            "body": "<html><body><p>This is not a fake Squirro Item. It is composed of a couple not fake sentences.</p></body></html>"
        }, {
            "id": "4",
            "label": ["fake"],
            "body": "<html><body><p>This is a fake Squirro Item. It is composed of a couple fake sentences.</p></body></html>"
        }]
    },
    "pipeline": [
        {
            "fields": ["body", "label"],
            "step": "loader",
            "type": "squirro_item"
        },
        {
            "fields": ["body"],
            "step": "filter",
            "type": "empty"
        },
        {
            "input_fields": ["extract_sentences"],
            "output_fields": ["normalized_extract"],
            "step": "normalizer",
            "type": "html"
        },
        {
            "fields": ["normalized_extract"],
            "step": "normalizer",
            "type": "punctuation"
        },
        {
            "fields": ["normalized_extract"],
            "mark_as_skipped": True,  # Python literal, not JSON's true
            "step": "filter",
            "type": "regex",
            "whitelist_regexes": ["^.{20,}$"]
        },
        {
            "step": "embedder",
            "type": "transformers",
            "transformer": "huggingface",
            "model_name": "https://tfhub.dev/google/universal-sentence-encoder/4",
            "input_field": "body",
            "output_field": "embedded_extract"
        },
        {
            "step": "randomizer",
            "type": "randomizer"
        },
        {
            "input_field": "embedded_extract",
            "label_field": "label",
            "output_field": "prediction",
            "step": "classifier",
            "type": "cosine_similarity"
        },
        {
            "step": "debugger",
            "type": "log_fields",
            "fields": ["extract_sentences", "prediction"],
            "log_level": "warning"
        }
    ]
}

runner = Runner(config)
try:
    # train() and infer() yield the processed documents
    for doc in runner.train():
        print(doc)
    for doc in runner.infer():
        print(doc)
finally:
    # always release resources, even if training or inference fails
    runner.clean()
- property cache_client: CacheWithExpiration | None
Placeholder for the Runner’s cache client.
Subclasses can implement their own cache client using the property.
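The subclassing pattern described above might look like the following. This is a minimal sketch using hypothetical stand-ins: SketchRunner models only the placeholder property, and SketchCache with its get and set methods is invented for illustration, since the interface of CacheWithExpiration is not documented here.

```python
class SketchCache:
    """Hypothetical in-memory cache; stands in for CacheWithExpiration."""

    def __init__(self):
        self._store = {}

    def get(self, key):
        return self._store.get(key)

    def set(self, key, value):
        self._store[key] = value


class SketchRunner:
    """Hypothetical stand-in for Runner; only models the placeholder property."""

    @property
    def cache_client(self):
        # Placeholder: the base class provides no cache client.
        return None


class CachedRunner(SketchRunner):
    """Subclass that supplies its own cache client by overriding the property."""

    def __init__(self):
        self._cache = SketchCache()

    @property
    def cache_client(self):
        return self._cache


base = SketchRunner()
cached = CachedRunner()
cached.cache_client.set("embedding:0", [0.1, 0.2])

print(base.cache_client)
print(cached.cache_client.get("embedding:0"))
```

Callers of such a property should handle the None case, since the declared type is CacheWithExpiration | None.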