Semantic Tagger Pipelet#

The LLM Semantic Tagger pipelet uses an LLM to annotate documents in your corpus with concepts from an existing Synaptica taxonomy. Unlike NLP-based tagging, which relies on string matching, the LLM Semantic Tagger leverages taxonomy semantics such as concept definitions and hierarchies to identify relevant concepts even when the document text does not contain an exact term match. Tags are written as entities in Squirro and are not written back to Synaptica. If you are new to pipelets, see the Pipelets page for an overview of what pipelets are, how they fit into the Squirro pipeline, and how to configure them using the Pipeline Editor.

An Azure OpenAI variant of this pipelet is also available for organizations that use Azure OpenAI Service instead of the standard OpenAI API. Visit the Squirro Support website and submit a technical support request for access to the Azure OpenAI variant.

How It Works#

1. You define one or more schemes of interest from your Synaptica project. The pipelet fetches all concepts from those schemes via the Synaptica API. To limit token usage, the taxonomy is reduced to labels, synonyms, and definitions and converted into a Markdown-like text representation.
2. The document text and the fetched taxonomy are provided to the LLM, which identifies concepts that apply to the document on both a lexical and a semantic basis.
3. The identified concepts are written as entities to Squirro.
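
The taxonomy reduction described above can be pictured with a minimal sketch. It is not the pipelet's actual implementation: the `Concept` record, the `render_concept` and `render_scheme` helpers, and the exact text layout are hypothetical and only illustrate how labels, synonyms, and definitions might be flattened into a compact Markdown-like outline, with hierarchy expressed through indentation.

```python
from dataclasses import dataclass, field


@dataclass
class Concept:
    """Hypothetical, simplified concept record (not the Synaptica API schema)."""
    label: str
    uri: str  # kept on the record for later lookup; not rendered, to save tokens
    synonyms: list[str] = field(default_factory=list)
    definition: str = ""
    children: list["Concept"] = field(default_factory=list)


def render_concept(concept: Concept, depth: int = 0) -> list[str]:
    """Render a concept and its narrower concepts as indented bullet lines."""
    line = f"{'  ' * depth}- {concept.label}"
    if concept.synonyms:
        line += f" (synonyms: {', '.join(concept.synonyms)})"
    if concept.definition:
        line += f": {concept.definition}"
    lines = [line]
    for child in concept.children:
        lines.extend(render_concept(child, depth + 1))
    return lines


def render_scheme(scheme_label: str, concepts: list[Concept]) -> str:
    """Render one scheme as a heading followed by its concept outline."""
    lines = [f"# {scheme_label}"]
    for concept in concepts:
        lines.extend(render_concept(concept))
    return "\n".join(lines)
```

Calling `render_scheme("Food Ingredients", top_level_concepts)` returns a single string that could be placed in the prompt alongside the document text. Leaving URIs and other metadata out of the rendered text is one plausible way to keep token usage low.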

Prerequisites#

Before setting up this pipelet, ensure you have:

- An existing taxonomy project in Synaptica. If you do not have one, established public domain vocabularies are a good starting point. Visit the Squirro Support website and submit a technical support request for assistance with setting one up.
- An OpenAI API key. The pipelet accesses the LLM programmatically through the OpenAI API and requires valid API credentials. Customers and partners are expected to provide their own LLM access and are responsible for any associated usage costs.
- A representative sample of your documents and taxonomy to validate results before running the pipelet at scale. Scalability depends on several factors: the choice of LLM, the size and complexity of the taxonomy, and the volume of documents in your data sources. Squirro recommends starting with sample data to assess quality and performance before scaling up. For guidance on throughput planning and production sizing, visit the Squirro Support website and submit a technical support request for guidance from subject matter experts.
- The Squirro server IP address added to the allowlist for your Synaptica instance. To arrange this, visit the Squirro Support website and submit a technical support request. This is a one-time task per Synaptica instance and Squirro server pair.
- Server-level administrator access to the Squirro server to deploy the pipelet.
- A valid Synaptica license and the appropriate Squirro add-on entitlement. Visit the Squirro Support website and submit a technical support request for licensing information.

Setup#

Synaptica Setup#

1. Make sure the taxonomy project you want to use exists in Synaptica.
2. Create a dedicated Synaptica user account for the pipelet. The account requires at least viewer permissions on the taxonomy schemes of interest.

Squirro Setup#

Visit the Squirro Support website and submit a technical support request to obtain the pipelet file (a .py file). Once you receive it, upload it to the Squirro server using the pipelet upload command from the Squirro Toolbox. This is a one-time task per server. For full upload instructions, see the Development Workflow page.
To display entity properties in the Squirro UI, navigate to Setup → Settings → Project Configuration and add the following to the frontend.userapp.entities-configuration field:
```
{
  "entitiesToShow": [
    {"name": "Concept"},
    {"name": "Concept URI"}
  ],
  "relations": {}
}
```
To add the pipelet to a pipeline:
1. Navigate to Setup → Pipeline and click the red Edit button to open the Pipeline Editor.
2. Search for LLM Semantic Tagger in the pipelet list on the left.
3. Drag the pipelet into the Relate section of your pipeline.
4. Click the pencil icon on the pipelet to open its configuration panel. Fill in all required fields and click Save, then Exit.

The pipelet processes only documents ingested after it is added to the pipeline. Documents already indexed before that point are not automatically reprocessed.

Verifying Results#

After the pipelet runs on ingested documents, verify the results in Squirro:

Open a processed document and check in the Labels tab that entities with the Concept and Concept URI values are present. The pipelet does not write anything back to Synaptica.

If no results appear, use the Squirro Monitoring page to inspect pipeline execution logs and identify errors. For server-level troubleshooting, see the Troubleshooting and FAQ page.

Configuration Reference#

Tip

To find the URI of any item in Synaptica (project, scheme, collection, or template), open the item in Graphite and look in the Metadata section of the right-hand panel. The value is listed under Resource URI.

Name

A display name for the pipelet instance.
LLM API key

The API key used to authenticate requests to the OpenAI API. The pipelets were originally designed and validated using GPT-4o, which offered the best cost-to-quality ratio at the time. Newer model versions may perform better depending on the type and language of content in your data sources. Squirro recommends testing several models against a representative sample of your documents and taxonomy before running the pipelet at scale. Other models, including cloud-based and on-premises alternatives, may also work but require testing and validation. If you need assistance with benchmarking or evaluating alternative models, visit the Squirro Support website and submit a technical support request for guidance from subject matter experts.
Highlight entities

When turned on (default), matched concepts are highlighted in the document text in the Squirro UI. When turned off, concepts are tagged but not visually highlighted.
Number of pages per batch

The number of document pages processed together as a batch. Lower values (for example, 1) increase processing time and token usage but return more thorough results.
Max Concurrent API Requests

The maximum number of processes that send requests to the LLM API concurrently. Higher values (for example, 20) reduce overall processing time but increase system resource requirements (for example, RAM). LLM API rate limits may also apply.
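
To illustrate how the batch size and the concurrency limit interact, here is a minimal sketch. It is an assumption-laden illustration, not the pipelet's code: the `batch_pages` and `tag_document` helpers and the `tag_batch` callback are hypothetical, and the sketch uses `asyncio` tasks with a semaphore where the pipelet itself may use a different concurrency model.

```python
import asyncio
from typing import Awaitable, Callable


def batch_pages(pages: list[str], pages_per_batch: int) -> list[list[str]]:
    """Split a document's pages into consecutive batches of a fixed size."""
    if pages_per_batch < 1:
        raise ValueError("pages_per_batch must be at least 1")
    return [
        pages[i:i + pages_per_batch]
        for i in range(0, len(pages), pages_per_batch)
    ]


async def tag_document(
    pages: list[str],
    pages_per_batch: int,
    max_concurrent: int,
    tag_batch: Callable[[list[str]], Awaitable[list[str]]],
) -> list[str]:
    """Tag every batch, never running more than max_concurrent calls at once."""
    semaphore = asyncio.Semaphore(max_concurrent)

    async def run(batch: list[str]) -> list[str]:
        async with semaphore:  # blocks while max_concurrent calls are in flight
            return await tag_batch(batch)

    results = await asyncio.gather(
        *(run(batch) for batch in batch_pages(pages, pages_per_batch))
    )
    # Merge per-batch tags, dropping duplicates while preserving first-seen order.
    merged: dict[str, None] = {}
    for tags in results:
        for tag in tags:
            merged.setdefault(tag, None)
    return list(merged)
```

In this sketch, a 10-page document with one page per batch produces 10 LLM calls, while five pages per batch produces 2. Smaller batches mean more calls, which is consistent with the higher processing time and token usage noted above. The semaphore only caps how many of those calls are in flight at the same time.
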
Schemes of interest

One or more taxonomy schemes from which concepts are fetched for tagging. Provide as a JSON array:
```
[
    {
        "label": "Food Ingredients",
        "uri": "https://your-instance.synaptica.net/concept_scheme/your-scheme-id"
    }
]
```
Each entry accepts the following fields:
- label
  
  The label of the scheme.
- uri
  
  The URI of the scheme.
Extraction prompt

The LLM prompt used to extract tags. Pre-filled with a tested default, but editable.
Graphite base URL

The base URL of your Synaptica instance, for example https://your-instance.synaptica.net. Do not include a trailing /.
Graphite username

The username of the Synaptica API user account.
Graphite password

The password of the Synaptica API user account.
Graphite project URI

The URI of the project in Synaptica.
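
Before saving the configuration, it can help to check the two values that are easiest to get wrong: the Schemes of interest JSON and the Graphite base URL. The following is a minimal sketch of such a check. The `parse_schemes` and `check_base_url` helpers are hypothetical and not part of the pipelet, and treating both `label` and `uri` as required, non-empty strings is an assumption based on the field descriptions above.

```python
import json
from urllib.parse import urlparse


def parse_schemes(raw: str) -> list[dict[str, str]]:
    """Parse the Schemes of interest value and check each entry's fields."""
    schemes = json.loads(raw)
    if not isinstance(schemes, list) or not schemes:
        raise ValueError("Schemes of interest must be a non-empty JSON array")
    for index, entry in enumerate(schemes):
        if not isinstance(entry, dict):
            raise ValueError(f"Entry {index} must be a JSON object")
        for key in ("label", "uri"):
            value = entry.get(key)
            if not isinstance(value, str) or not value.strip():
                raise ValueError(f"Entry {index} needs a non-empty '{key}'")
    return schemes


def check_base_url(url: str) -> str:
    """Check that the base URL is absolute and has no trailing slash."""
    parsed = urlparse(url)
    if parsed.scheme not in ("http", "https") or not parsed.netloc:
        raise ValueError("Base URL must be an absolute http(s) URL")
    if url.endswith("/"):
        raise ValueError("Base URL must not end with a trailing /")
    return url
```

Running both helpers on the values you intend to paste into the configuration panel surfaces malformed JSON or a stray trailing slash before any document is processed.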