Concept Extraction Pipelet#

The Concept Extraction pipelet is a Squirro pipeline plugin that uses a large language model (LLM) to enrich an existing taxonomy in Synaptica by proposing new candidate concepts extracted from your document corpus. Candidate concepts are always narrower concepts of existing taxonomy concepts, and come with candidate labels, definitions, and synonyms. The pipelet also records provenance data as mentions and documents in dedicated Synaptica schemes. If you are new to pipelets, see the Pipelets page for an overview of what pipelets are, how they fit into the Squirro pipeline, and how to configure them using the Pipeline Editor.

How It Works#

The pipelet processes each document through two LLM prompts:

1. The pipelet fetches the concepts in your Enrichment Targets collection from Synaptica, including their labels, synonyms, and definitions. It then provides the document text and those enrichment targets to the LLM, which identifies subject terms that could be sub-concepts of the targets.
2. Because LLMs tend to over-generate in creative tasks, the results of the first prompt are passed to a second, filtering prompt. That prompt discards semantically irrelevant suggestions without re-reading the document text, which reduces noise.
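
The two-stage flow above can be sketched as follows. This is a minimal illustration, not the pipelet's actual implementation: the function names, the prompt wording, the JSON response format, and the `llm` callable (any function that takes a prompt string and returns the model's text reply) are all assumptions made for the example.

```python
import json
from typing import Callable, Dict, List

# Hypothetical type: any function that sends a prompt to an LLM and returns its text reply.
LLM = Callable[[str], str]


def extract_candidates(llm: LLM, document_text: str, targets: List[Dict]) -> List[Dict]:
    """Stage 1: ask for terms in the document that could be sub-concepts of the targets."""
    prompt = (
        "Enrichment targets (label, synonyms, definition):\n"
        + json.dumps(targets)
        + "\n\nDocument:\n"
        + document_text
        + "\n\nReturn a JSON list of objects with keys 'label' and 'broader'."
    )
    return json.loads(llm(prompt))


def filter_candidates(llm: LLM, candidates: List[Dict], targets: List[Dict]) -> List[Dict]:
    """Stage 2: drop irrelevant suggestions. The document text is not sent again."""
    prompt = (
        "Enrichment targets:\n"
        + json.dumps(targets)
        + "\n\nCandidates:\n"
        + json.dumps(candidates)
        + "\n\nReturn a JSON list of the labels that are plausible narrower concepts."
    )
    keep = set(json.loads(llm(prompt)))
    return [c for c in candidates if c["label"] in keep]


def propose_concepts(llm: LLM, document_text: str, targets: List[Dict]) -> List[Dict]:
    """Run extraction, then filtering, and return the surviving candidates."""
    candidates = extract_candidates(llm, document_text, targets)
    return filter_candidates(llm, candidates, targets)
```

Passing the LLM in as a plain callable keeps the sketch independent of any particular client library and makes it easy to try the flow with a stub before spending API credits.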

The final set of candidate concepts is written back to a separate Synaptica scheme for human review. Each candidate concept is linked to its enrichment target through two predicates: gca:hasOriginalBroader, set on the candidate concept, and gca:hasCandidateNarrower, set on the enrichment target.
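
As an illustration of how a candidate and its enrichment target relate, the pair of statements can be pictured as subject-predicate-object triples. The sketch below assumes that gca:hasOriginalBroader points from the candidate to the target and gca:hasCandidateNarrower points back from the target to the candidate. The `link_candidate` helper and the example URIs are hypothetical and do not reflect how the pipelet or Synaptica store the data internally.

```python
from typing import List, Tuple

# A statement as (subject, predicate, object).
Triple = Tuple[str, str, str]


def link_candidate(candidate_uri: str, target_uri: str) -> List[Triple]:
    """Return the two statements that tie a candidate to its enrichment target."""
    return [
        # Assumed direction: candidate -> existing broader concept.
        (candidate_uri, "gca:hasOriginalBroader", target_uri),
        # Assumed direction: existing concept -> proposed narrower concept.
        (target_uri, "gca:hasCandidateNarrower", candidate_uri),
    ]


for triple in link_candidate(
    "http://example.org/food/pink-salt",  # hypothetical candidate URI
    "http://example.org/food/salt",       # hypothetical enrichment target URI
):
    print(triple)
```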

A taxonomy editor then reviews the candidate concepts directly in Synaptica. Approved concepts are manually moved to the original taxonomy scheme, and the necessary relationships to existing concepts are created by hand. Concepts that are not approved remain in the candidate scheme and are not promoted.

Standalone enrichment target concepts lack hierarchical context and may produce weaker results. Target a meaningful subtree of your taxonomy rather than individual concepts. For example, instead of targeting only Spices and herbs, provide the full subtree that includes Salt, Oregano, and similar concepts. With that context, the LLM can avoid proposing concepts you already have and instead propose new sub-concepts, such as Pink Salt as a narrower concept of Salt.
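
When deciding what to put in the Enrichment Targets collection, it can help to list a concept together with all of its descendants. The following sketch does this for a small in-memory taxonomy. The `NARROWER` dictionary and the `subtree` function are hypothetical and only illustrate the idea. In practice you select the concepts in Synaptica.

```python
from typing import Dict, List

# Hypothetical in-memory taxonomy: each concept maps to its direct narrower concepts.
NARROWER: Dict[str, List[str]] = {
    "Spices and herbs": ["Salt", "Oregano"],
    "Salt": [],
    "Oregano": [],
}


def subtree(root: str, narrower: Dict[str, List[str]]) -> List[str]:
    """Return the root concept followed by all of its descendants, depth-first."""
    result: List[str] = []
    seen = set()
    stack = [root]
    while stack:
        concept = stack.pop()
        if concept in seen:
            continue  # guard against cycles or concepts with several parents
        seen.add(concept)
        result.append(concept)
        # Reverse so that children are visited in their listed order.
        stack.extend(reversed(narrower.get(concept, [])))
    return result


print(subtree("Spices and herbs", NARROWER))
# ['Spices and herbs', 'Salt', 'Oregano']
```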

Prerequisites#

Before setting up this pipelet, ensure you have:

An existing taxonomy project in Synaptica. If you do not have one, established public domain vocabularies are a good starting point. Visit the Squirro Support website and submit a technical support request for assistance with setting one up.
An OpenAI API key. The pipelet accesses the LLM programmatically through the OpenAI API and requires valid API credentials. Customers and partners are expected to provide their own LLM access and are responsible for any associated usage costs.
A representative sample of your documents and taxonomy to validate results before running the pipelet at scale. Scalability depends on several factors: the choice of LLM, the size and complexity of the taxonomy, and the volume of documents in your data sources. Squirro recommends starting with sample data to assess quality and performance before scaling up. For guidance on throughput planning and production sizing, visit the Squirro Support website and submit a technical support request for guidance from subject matter experts.
Server-level administrator access to the Squirro server to deploy the pipelet.
A valid Synaptica license and the appropriate Squirro add-on entitlement. Visit the Squirro Support website and submit a technical support request for licensing information.

Setup#

Synaptica Setup#

In your Synaptica project, create three new schemes: CandidateConcepts, Mentions, and Documents. These names are indicative and can differ if needed. Each scheme requires a specific configuration of namespaces, templates, properties, and relationships. Visit the Squirro Support website and submit a technical support request for assistance with the required configuration.
Create a collection named Enrichment Targets and add the concepts for which the pipelet should propose sub-concepts.
Create a dedicated Synaptica user account for the pipelet. That account requires Editor II permissions on both the original taxonomy schemes and the three new schemes.

Squirro Setup#

Visit the Squirro Support website and submit a technical support request to obtain the pipelet file (a .py file). Once you receive it, upload it to the Squirro server using the pipelet upload command from the Squirro Toolbox. This is a one-time task per server. For full upload instructions, see the Development Workflow page.

To display entity properties in the Squirro UI, navigate to Setup → Settings → Project Configuration and add the following to the frontend.userapp.entities-configuration field:

{
  "entitiesToShow": [
    {"name": "subject"},
    {"name": "subject_term"},
    {"name": "definition"},
    {"name": "alternative_labels"}
  ],
  "relations": {}
}
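
If you prefer to generate this value rather than type it by hand, a short script can produce syntactically valid JSON to paste into the field. This is only a convenience sketch: the `entities_configuration` helper is hypothetical, and the entity names are the ones shown above.

```python
import json
from typing import List


def entities_configuration(names: List[str]) -> str:
    """Build the JSON text for the entities configuration field."""
    config = {
        "entitiesToShow": [{"name": name} for name in names],
        "relations": {},
    }
    return json.dumps(config, indent=2)


print(entities_configuration(
    ["subject", "subject_term", "definition", "alternative_labels"]
))
```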

To add the pipelet to a pipeline:
1. Navigate to Setup → Pipeline and click the red Edit button to open the Pipeline Editor.
2. Search for LLM Concept Extraction in the pipelet list on the left.
3. Drag the pipelet into the Relate section of your pipeline.
4. Click the pencil icon on the pipelet to open its configuration panel. Fill in all required fields and click Save, then Exit.

The pipelet processes only documents ingested after it is added to the pipeline. Documents already indexed before that point are not automatically reprocessed.

Verifying Results#

After the pipelet runs on ingested documents, verify the results in both Synaptica and Squirro:

In Synaptica, open the CandidateConcepts scheme. New candidate concepts extracted from your documents should appear there, associated with their enrichment targets via the gca:hasOriginalBroader predicate. Alternatively, visit an enrichment target concept and review the extractions under the property gca:hasCandidateNarrower. The Mentions and Documents schemes should also contain provenance entries.
In Squirro, open a processed document and check that entities with subject, alternative_labels, and definition fields are present.

If no results appear, use the Squirro Monitoring page to inspect pipeline execution logs and identify errors. For server-level troubleshooting, see the Troubleshooting and FAQ page.

Configuration Reference#

Tip

To find the URI of any item in Synaptica (project, scheme, collection, or template), open the item in Graphite and look in the Metadata section of the right-hand panel. The value is listed under Resource URI.

Name

A display name for the pipelet instance.

LLM API key

The API key used to authenticate requests to the OpenAI API. The pipelet was originally designed and validated using GPT-4o, which offered the best cost-to-quality ratio at the time. Newer model versions may perform better depending on the type and language of content in your data sources. Squirro recommends testing several models against a representative sample of your documents and taxonomy before running the pipelet at scale. Other models, including cloud-based and on-premises alternatives, may also work but require testing and validation. If you need assistance with benchmarking or evaluating alternative models, visit the Squirro Support website and submit a technical support request for guidance from subject matter experts.

Extraction prompt

The first LLM prompt used to extract candidate concepts. Pre-filled with a tested default, but editable.

Filter prompt

The second LLM prompt used to validate and filter the extraction results. Pre-filled with a tested default, but editable.

Graphite base URL

The base URL of your Synaptica instance, for example https://your-instance.synaptica.net.

Graphite username

The username of the Synaptica API user account.

Graphite password

The password of the Synaptica API user account.

Graphite project URI

The URI of the project in Synaptica.

Graphite CandidateConcepts scheme URI

The URI of the scheme where candidate concepts are written.

Graphite Documents scheme URI

The URI of the scheme where documents are written.

Graphite Mentions scheme URI

The URI of the scheme where mentions are written.

Graphite template URI

The URI of the SKOS (or Taxonomy) template in Synaptica. Applied to all new concepts written to Synaptica.

Graphite template URI 2

The URI of the Content Annotation template in Synaptica. Applied to all new concepts written to Synaptica.

Graphite template URI 3

Optional. The URI of a third template to apply to all new concepts.

Graphite Enrichment Targets collection URI

The URI of the Synaptica collection containing the enrichment target concepts.

Namespace for new concepts

The namespace prefix used for concepts written to Synaptica, for example http://example.org/food/. The value must end with either / or #.
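
The trailing-character rule for the namespace can be checked before you save the configuration. The sketch below validates a namespace and shows how a concept URI could be formed from it. The `validate_namespace` and `mint_uri` helpers are hypothetical, and the slug format (lowercase, hyphen-separated) is an assumption for illustration only. The pipelet may build URIs differently.

```python
import re


def validate_namespace(namespace: str) -> str:
    """Return the namespace unchanged, or raise ValueError if it does not end with '/' or '#'."""
    if not namespace.endswith(("/", "#")):
        raise ValueError(f"Namespace must end with '/' or '#': {namespace!r}")
    return namespace


def mint_uri(namespace: str, label: str) -> str:
    """Assumed URI scheme: namespace followed by a lowercase, hyphen-separated label."""
    slug = re.sub(r"[^a-z0-9]+", "-", label.lower()).strip("-")
    return validate_namespace(namespace) + slug


print(mint_uri("http://example.org/food/", "Pink Salt"))
# http://example.org/food/pink-salt
```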