Relationship Extraction Pipelet

The LLM Relationship Extraction pipelet uses an LLM to enrich an existing taxonomy in Synaptica by proposing new relationships between known concepts, based on evidence found in your document corpus. Candidate relationships follow a set of predicates you define. The pipelet also records provenance data as mentions and documents in dedicated Synaptica schemes. If you are new to pipelets, see the Pipelets page for an overview of what pipelets are, how they fit into the Squirro pipeline, and how to configure them using the Pipeline Editor.

How It Works

The pipelet processes each document as follows:

1. You define one or more predicates of interest. For each predicate, you specify a domain and a range, which constrain the compatible concept pools for the left-side and right-side participants of the relationship.
2. The pipelet fetches the entire taxonomy (all schemes in the project) from Synaptica via the API. To limit token usage, the taxonomy is reduced to labels, synonyms, and definitions and converted into a Markdown-like text representation.
3. The document text, the taxonomy, and the predicates are provided to the LLM, which identifies candidate relationships expressed in the source text. Each relationship is a triple: subject, predicate, and object, along with the text span that indicates it.
4. The pipelet then validates the results against the fetched taxonomy, enforcing the defined domains and ranges and discarding relationships that do not comply.
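
The deterministic parts of this flow (reducing the taxonomy to text and validating the LLM output) can be sketched roughly as follows. This is a minimal illustration and not the pipelet's actual code: the `Concept` class, the function names, and the sample data are hypothetical, text spans are omitted, and the sketch assumes that "descendant" excludes the concept named in the domain or range itself.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Concept:
    """Hypothetical reduced view of a taxonomy concept (not the Synaptica API shape)."""
    label: str
    scheme: str
    broader: Optional[str] = None  # label of the parent concept, if any
    synonyms: list = field(default_factory=list)
    definition: str = ""


def render_taxonomy(concepts):
    """Reduce the taxonomy to a compact Markdown-like text, grouped by scheme."""
    lines = []
    for scheme in sorted({c.scheme for c in concepts}):
        lines.append(f"# {scheme}")
        for c in concepts:
            if c.scheme != scheme:
                continue
            entry = f"- {c.label}"
            if c.synonyms:
                entry += f" (synonyms: {', '.join(c.synonyms)})"
            if c.definition:
                entry += f": {c.definition}"
            lines.append(entry)
    return "\n".join(lines)


def in_pool(label, pool, by_label):
    """True if `label` is in the scheme named `pool` or descends from the concept `pool`."""
    concept = by_label.get(label)
    if concept is None:
        return False  # the LLM proposed something that is not a known concept
    if concept.scheme == pool:
        return True
    seen = set()
    parent = concept.broader
    while parent is not None and parent not in seen:  # `seen` guards against cycles
        if parent == pool:
            return True
        seen.add(parent)
        parent_concept = by_label.get(parent)
        parent = parent_concept.broader if parent_concept else None
    return False


def validate(triples, predicates, concepts):
    """Keep only triples whose subject and object respect the predicate's domain and range."""
    by_label = {c.label: c for c in concepts}
    by_predicate = {p["label"]: p for p in predicates}
    kept = []
    for t in triples:
        p = by_predicate.get(t["predicate"])
        if (p and in_pool(t["subject"], p["rdfs:domain"], by_label)
                and in_pool(t["object"], p["rdfs:range"], by_label)):
            kept.append(t)
    return kept


# Illustrative sample data only.
concepts = [
    Concept("Carrot", "Vegetables", definition="An orange root vegetable."),
    Concept("Olive oil", "Oils and Fats", synonyms=["EVOO"]),
]
predicates = [{"label": "isCookedWith", "rdfs:domain": "Vegetables", "rdfs:range": "Oils and Fats"}]
triples = [
    {"subject": "Carrot", "predicate": "isCookedWith", "object": "Olive oil"},  # complies
    {"subject": "Olive oil", "predicate": "isCookedWith", "object": "Carrot"},  # wrong direction
    {"subject": "Carrot", "predicate": "isCookedWith", "object": "Ghee"},       # unknown concept
]
taxonomy_text = render_taxonomy(concepts)
kept = validate(triples, predicates, concepts)
```

In this sketch only the first triple survives validation: the second reverses domain and range, and the third refers to a label that is not in the taxonomy.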

The final set of candidate relationships is written back to a separate Synaptica scheme for human review. A taxonomy editor then reviews the candidate relationships directly in Synaptica. Approved relationships are manually moved to the original taxonomy scheme, and the necessary connections to existing concepts are created by hand. Relationships that are not approved remain in the candidate scheme and are not promoted.

Prerequisites

Before setting up this pipelet, ensure you have:

- An existing taxonomy project in Synaptica. If you do not have one, established public domain vocabularies are a good starting point. Visit the Squirro Support website and submit a technical support request for assistance with setting one up.
- An OpenAI API key. The pipelet accesses the LLM programmatically through the OpenAI API and requires valid API credentials. Customers and partners are expected to provide their own LLM access and are responsible for any associated usage costs.
- A representative sample of your documents and taxonomy to validate results before running the pipelet at scale. Scalability depends on several factors: the choice of LLM, the size and complexity of the taxonomy, and the volume of documents in your data sources. Squirro recommends starting with sample data to assess quality and performance before scaling up. For guidance on throughput planning and production sizing, visit the Squirro Support website and submit a technical support request for guidance from subject matter experts.
- Server-level administrator access to the Squirro server to deploy the pipelet.
- A valid Synaptica license and the appropriate Squirro add-on entitlement. Visit the Squirro Support website and submit a technical support request for licensing information.

Setup

Synaptica Setup

1. In your Synaptica project, create three new schemes: Candidate Relationships, Mentions, and Documents. These names are indicative and can differ if needed. Each scheme requires a specific configuration of namespaces, templates, properties, and relationships. Visit the Squirro Support website and submit a technical support request for assistance with the required configuration.
2. Create a dedicated Synaptica user account for the pipelet. That account requires Editor II permissions on both the original taxonomy schemes and the three new schemes.

Squirro Setup

Visit the Squirro Support website and submit a technical support request to obtain the pipelet file (a .py file). Once received, upload it to the Squirro server using the pipelet upload command from the Squirro Toolbox. That is a one-time task per server. For full upload instructions, see the Development Workflow page.
To add the pipelet to a pipeline:
1. Navigate to Setup → Pipeline and click the red Edit button to open the Pipeline Editor.
2. Search for LLM Relationship Extraction in the pipelet list on the left.
3. Drag the pipelet into the Relate section of your pipeline.
4. Click the pencil icon on the pipelet to open its configuration panel. Fill in all required fields and click Save, then Exit.

The pipelet processes only documents ingested after it is added to the pipeline. Documents already indexed before that point are not automatically reprocessed.

Verifying Results

After the pipelet runs on ingested documents, verify the results in Synaptica:

In Synaptica, open the Candidate Relationships scheme. New candidate relationships extracted from your documents should appear there as triples, associated with their source concepts via rdf:subject and rdf:object. The Mentions and Documents schemes should also contain provenance entries.

If no results appear, use the Squirro Monitoring page to inspect pipeline execution logs and identify errors. For server-level troubleshooting, see the Troubleshooting and FAQ page.

Configuration Reference

Tip

To find the URI of any item in Synaptica (project, scheme, collection, or template), open the item in Graphite and look in the Metadata section of the right-hand panel. The value is listed under Resource URI.

Name

A display name for the pipelet instance.

LLM API key

The API key used to authenticate requests to the OpenAI API. The pipelet was originally designed and validated using GPT-4o, which offered the best cost-to-quality ratio at the time. Newer model versions may perform better, depending on the type and language of the content in your data sources. Squirro recommends testing several models against a representative sample of your documents and taxonomy before running the pipelet at scale. Other models, including cloud-based and on-premises alternatives, may also work but require testing and validation. For assistance with benchmarking or evaluating alternative models, visit the Squirro Support website and submit a technical support request for guidance from subject matter experts.

Predicates of interest

One or more predicates that define the relationships to extract. Provide as a JSON array:
```
[
    {
        "uri": "http://example.org/food/isCookedWith",
        "label": "isCookedWith",
        "definition": "Associates a food ingredient with some type of oil or fat, based on their co-existence in recipes.",
        "rdfs:domain": "Vegetables",
        "rdfs:range": "Oils and Fats"
    }
]
```
Each predicate accepts the following fields:
- uri
  
  The URI of the predicate.
- label
  
  The label of the predicate.
- definition
  
  A short definition. Use this field to provide specific guidance to the LLM on what to look for in the source text.
- rdfs:domain
  
  The label of a taxonomy scheme or concept. Only concepts from that scheme, or descendants of that concept, are proposed as left-side participants in candidate relationships.
- rdfs:range
  
  The label of a taxonomy scheme or concept. Only concepts from that scheme, or descendants of that concept, are proposed as right-side participants in candidate relationships.
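
Before pasting a predicate list into the configuration panel, it can help to check it locally. The following is a minimal sketch of such a check, not part of the pipelet: the function name `check_predicates` is hypothetical, and the sketch assumes that all five fields listed above are required and must be non-empty strings, which this page does not state explicitly.

```python
import json

# Assumption: every field listed in this reference is treated as required.
REQUIRED_FIELDS = ("uri", "label", "definition", "rdfs:domain", "rdfs:range")


def check_predicates(raw):
    """Parse the JSON array and return a list of human-readable problems."""
    try:
        predicates = json.loads(raw)
    except json.JSONDecodeError as exc:
        return [f"not valid JSON: {exc}"]
    if not isinstance(predicates, list) or not predicates:
        return ["expected a non-empty JSON array of predicate objects"]
    problems = []
    for i, pred in enumerate(predicates):
        if not isinstance(pred, dict):
            problems.append(f"entry {i}: expected an object")
            continue
        for name in REQUIRED_FIELDS:
            value = pred.get(name)
            if not isinstance(value, str) or not value.strip():
                problems.append(f"entry {i}: missing or empty field '{name}'")
    return problems


example = """
[
    {
        "uri": "http://example.org/food/isCookedWith",
        "label": "isCookedWith",
        "definition": "Associates a food ingredient with some type of oil or fat, based on their co-existence in recipes.",
        "rdfs:domain": "Vegetables",
        "rdfs:range": "Oils and Fats"
    }
]
"""
problems = check_predicates(example)  # an empty list means no problems were found
```

A check like this only catches structural mistakes. Whether the `rdfs:domain` and `rdfs:range` labels actually match a scheme or concept in your taxonomy still has to be verified against Synaptica.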
Extraction prompt

The LLM prompt used to extract candidate relationships. Pre-filled with a tested default, but editable.

Graphite base URL

The base URL of your Synaptica instance, for example https://your-instance.synaptica.net. Do not include a trailing /.

Graphite username

The username of the Synaptica API user account.

Graphite password

The password of the Synaptica API user account.

Graphite project URI

The URI of the project in Synaptica.

Graphite CandidateRelationships scheme URI

The URI of the scheme where candidate relationships are written.

Graphite Documents scheme URI

The URI of the scheme where documents are written.

Graphite Mentions scheme URI

The URI of the scheme where mentions are written.

Graphite template URI

The URI of the SKOS (or Taxonomy) template in Synaptica. Applied to all new concepts written to Synaptica.

Graphite template URI 2

The URI of the Content Annotation template in Synaptica. Applied to all new concepts written to Synaptica.

Graphite template URI 3

Optional. The URI of a third template to apply to all new concepts.

Namespace for new concepts

The namespace prefix used for concepts written to Synaptica, for example http://example.org/food/. The value must end with either / or #.
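
Two of the fields above have simple formatting rules that are easy to get wrong: the Graphite base URL must not end with `/`, and the namespace must end with `/` or `#`. A minimal sketch of a local pre-check is shown below. The function name `check_connection_settings` is hypothetical and is not part of the pipelet.

```python
def check_connection_settings(base_url, namespace):
    """Return a list of problems for the two format-sensitive fields."""
    problems = []
    if base_url.endswith("/"):
        problems.append("Graphite base URL must not end with '/'")
    if not namespace.endswith(("/", "#")):
        problems.append("Namespace for new concepts must end with '/' or '#'")
    return problems


ok = check_connection_settings(
    "https://your-instance.synaptica.net", "http://example.org/food/"
)
bad = check_connection_settings(
    "https://your-instance.synaptica.net/", "http://example.org/food"
)
```

Here `ok` is empty, while `bad` reports both problems.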