Privacy Layer#
The widespread adoption of conversational AI systems raises concerns about data privacy. To mitigate the risks, it is essential to prevent the transfer of personally identifiable information (PII) to external IT infrastructures. The Squirro Privacy Layer add-on protects organizations by filtering out PII from the content sent to third-party language model providers.
Understanding the Privacy Layer#
The privacy layer is a set of technologies and protocols designed to protect PII by removing it from the content transferred to the large language model (LLM), with minimal impact. The layer takes the user's input, together with the context and background information supplied by the retrieval-augmented generation (RAG) mechanism, and transforms it into content in which temporary placeholders replace PII occurrences. This process, often called de-identification or data masking, is reversible, so the system can restore the original PII values in the output from the LLM.

The add-on provides a wrapper class, chat_wrapper.PIIMaskingChatWrapper, that encapsulates the standard LLM chat class derived from langchain_openai.chat_models.base.BaseChatModel.
Notes
Data transmitted to Tools is excluded from masking.
Integration#
The squirro.service.genai.components.builtins.chat_model.ChatModelFactory mechanism automatically detects and uses the PIIMaskingChatWrapper class to encapsulate the LLM chat class whenever the masking_config parameter is set in the LLM configuration. This means that for every API endpoint, if the masking_config parameter is set in the llm section of the configuration file src/squirro/service/genai/deployments/default.yaml, the wrapper is automatically used to encapsulate the LLM chat class. If the masking_config parameter is not set, the standard LLM chat class is used (no PII masking).
Notes
The masking_config must be specified for each API endpoint, and for each language model individually.
The configuration must match the schema defined by the squirro.extensions.langchain.pii.config.PIIConfig class.
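The conditional wrapping described above can be pictured with a minimal sketch. The classes PlainChat and MaskingWrapper and the function build_chat_model are hypothetical stand-ins, not the actual ChatModelFactory or PIIMaskingChatWrapper code; the only behavior taken from the text is that the wrapper is applied when masking_config is present.

```python
from dataclasses import dataclass
from typing import Any, Optional


@dataclass
class PlainChat:
    """Hypothetical stand-in for a standard LLM chat class."""
    name: str


@dataclass
class MaskingWrapper:
    """Hypothetical stand-in for PIIMaskingChatWrapper."""
    inner: PlainChat
    masking_config: dict


def build_chat_model(llm_config: dict) -> Any:
    """Wrap the chat model only when the llm section has masking_config."""
    chat = PlainChat(name=llm_config["$runnable"])
    masking_config: Optional[dict] = llm_config.get("masking_config")
    if masking_config is None:
        return chat  # standard chat class, no PII masking
    return MaskingWrapper(inner=chat, masking_config=masking_config)


plain = build_chat_model({"$runnable": "generic_chat_model"})
masked = build_chat_model(
    {"$runnable": "generic_chat_model",
     "masking_config": {"detector_name": "squirro"}}
)
```

Because the decision depends only on the llm section of one endpoint, two endpoints that share the same model can still differ in whether masking is applied.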
Data Flow#
The data masking and unmasking process involves the following steps:
- Masking: each message (except SystemMessage) is masked before being transmitted to the LLM.
  - Detect the message type.
  - Parse the message and extract its text.
  - Detect PII using the detector defined in the configuration.
  - Create tokens for the PII text (tokens are stored only in memory).
  - Replace the sensitive text with the tokens.
- Unmasking: each message returned by the LLM is unmasked.
  - Replace the tokens with the original text.
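The round trip above can be sketched in a few lines. This is an illustration under stated assumptions, not the add-on's implementation: the email regex stands in for the configured detector, the EMAIL_n token format is invented for the example, and the in-memory dictionary vault plays the role of the token store.

```python
import re
from typing import Dict

# Assumption: a single email regex stands in for the configured detector.
EMAIL_RE = re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+")


def mask(text: str, vault: Dict[str, str]) -> str:
    """Replace each detected email with a token; tokens live only in `vault`."""
    def _swap(match: "re.Match") -> str:
        original = match.group(0)
        # Reuse the token when the same value was already masked.
        for token, value in vault.items():
            if value == original:
                return token
        token = f"EMAIL_{len(vault) + 1}"  # hypothetical token format
        vault[token] = original
        return token

    return EMAIL_RE.sub(_swap, text)


def unmask(text: str, vault: Dict[str, str]) -> str:
    """Replace tokens with the original text, longest tokens first."""
    for token in sorted(vault, key=len, reverse=True):
        text = text.replace(token, vault[token])
    return text


vault: Dict[str, str] = {}
prompt = "Mail jane.doe@example.com or bob@example.org, then jane.doe@example.com again"
masked_prompt = mask(prompt, vault)
# masked_prompt == "Mail EMAIL_1 or EMAIL_2, then EMAIL_1 again"
restored = unmask(masked_prompt, vault)
```

Replacing longer tokens first avoids a shorter token such as EMAIL_1 being substituted inside a longer one such as EMAIL_10.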
Configuration#
The configuration must match the schema defined by the squirro.extensions.langchain.pii.config.PIIConfig class.
Place the configuration in the llm section of the src/squirro/service/genai/deployments/default.yaml configuration file, separately for each API endpoint and each language model (llm) for which masking is to be enabled.
Example configuration for endpoint /v0/arbitrary_prompt:
prefix: "/v0"
description: "Endpoint for arbitrary prompts"
stream_endpoint: null
invoke_endpoint: "arbitrary_prompt"
extra_input_mapping:
  ...
chain_config:
  $runnable: squirro/chains/arbitrary_prompt
  llm:
    $runnable: generic_chat_models/chat_models/generic_chat_model
    masking_config:
      detector_name: "squirro"
      unmasking_case_sensitive: false
      sanitizer:
        name: "sanitizer"
        replace_map:
          - string: "\\"
            replacement: " "
          - string: ";"
            replacement: " "
          - string: ":"
            replacement: " "
          - string: ","
            replacement: " "
      extra:
        nlp_urls:
          en: "https://nlp-testing.squirro.com/spacy/en/fast/v0/_invoke"
          fr: "https://nlp-testing.squirro.com/spacy/fr/accurate/v0/_invoke"
          de: "https://nlp-testing.squirro.com/spacy/de/accurate/v0/_invoke"
        masking_allow_types:
          - PERSON # Names of people.
          - LOC # Locations (geographical and other).
          - EMAIL # Email addresses (depending on the model).
          - PHONE_NUMBER # Phone numbers (in some models).
        custom_regex_patterns:
          - pattern: "([0-9]{3}-[0-9]{2}-[0-9]{4})"
            type: "SSN"
PIIConfig#
- detector_name: (str) - PII detector name, default is squirro. Available detectors: presidio, squirro.
- masking_min_score: (float) - minimum score threshold for PII detection, default is 0.0. Not all detectors use this parameter.
- unmasking_case_sensitive: (bool) - flag specifying whether token unmasking is case-sensitive, default is False. False means that unmasking is case-insensitive, True means that unmasking is case-sensitive.
- sanitizer: (SanitizerConfig) - configuration of the text sanitizer. The sanitizer is responsible for replacing characters in the text before PII detection. The sanitizer configuration must match the schema defined by the squirro.extensions.langchain.pii.sanitizer.SanitizerConfig class.
- extra: (dict) - additional configuration parameters required by the PII detector, default is {}. Each PII detector may require different configuration parameters. See the Detectors section for more details.
SanitizerConfig#
- name: (str) - name of the sanitizer, default is sanitizer
- replace_map: (list[ReplaceMapItem]) - list of dictionaries with the following keys:
  - string: (str) - string to be replaced
  - replacement: (str) - replacement string
- extra: (dict) - additional configuration parameters required by the sanitizer, default is {}
Example of simple configuration:
name: "sanitizer"
replace_map:
- {string: ";", replacement: " "}
- {string: ":", replacement: " "}
- {string: ",", replacement: " "}
extra: {}
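To illustrate what a replace_map does to text before detection, here is a minimal sketch. The function sanitize is hypothetical and assumes plain, ordered string replacement; the actual sanitizer may behave differently.

```python
from typing import Dict, List


def sanitize(text: str, replace_map: List[Dict[str, str]]) -> str:
    """Apply each replace_map entry in order as a literal string replacement."""
    for item in replace_map:
        text = text.replace(item["string"], item["replacement"])
    return text


replace_map = [
    {"string": ";", "replacement": " "},
    {"string": ":", "replacement": " "},
    {"string": ",", "replacement": " "},
]

cleaned = sanitize("Name:John Doe;Zurich", replace_map)
# cleaned == "Name John Doe Zurich"
```

Replacing one character with one space, as in the example configuration, keeps the text length unchanged, so character offsets found in the sanitized text still line up with the original text.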
Detectors#
PII detectors are responsible for detecting personally identifiable information (PII) in text. It is possible to add your own PII detectors, which will be used by the PII Masking extension. By default, the PresidioPIIDetector detector, based on the Presidio library, is available.
SquirroPiiDetector#
Detector name: squirro
Detector class: src.squirro.extensions.langchain.pii.detectors.squirro_pii_detector.SquirroPiiDetector
Flow of detection:
- Sanitization of the text.
- Language detection.
- For each detected language, detection of PII data using the external NLP SpaCy service (if an nlp_url for the language is given in the configuration).
- Detection of PII data using regular expressions.
The SquirroPiiDetector detector is a hybrid detector that uses an external NLP SpaCy service (via HTTP) and a set of regular expressions to detect personally identifiable information (PII) in text. The detector automatically detects the language of the text and uses the appropriate language model for PII detection.
The langdetect library is used to detect the language of the text.
Warning
If nlp_url is not defined for the detected language, detection via the external NLP service is not performed. In this case, only the regular expressions are used for PII detection.
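The fallback described in the warning can be sketched as follows. This is an assumption-laden illustration: detect_pii, the call_nlp callback, and the example URL are hypothetical, the language is passed in rather than detected, and no real HTTP request is made.

```python
import re
from typing import Callable, Dict, List, Tuple

Entity = Tuple[str, str]  # (PII type, matched text)


def detect_pii(
    text: str,
    language: str,
    nlp_urls: Dict[str, str],
    regex_patterns: Dict[str, str],
    call_nlp: Callable[[str, str], List[Entity]],
) -> List[Entity]:
    """Combine NLP-service entities (when a URL exists) with regex matches."""
    entities: List[Entity] = []
    url = nlp_urls.get(language)
    if url is not None:
        # Only languages with a configured URL reach the NLP service.
        entities.extend(call_nlp(url, text))
    # Regular expressions always run, whatever the language.
    for pii_type, pattern in regex_patterns.items():
        for match in re.finditer(pattern, text):
            entities.append((pii_type, match.group(0)))
    return entities


def fake_nlp(url: str, text: str) -> List[Entity]:
    """Hypothetical stand-in for the external SpaCy service."""
    return [("PERSON", "John Doe")] if "John Doe" in text else []


text = "John Doe, SSN 123-45-6789"
patterns = {"SSN": r"[0-9]{3}-[0-9]{2}-[0-9]{4}"}
urls = {"en": "https://nlp.example.invalid/en"}  # placeholder URL

with_nlp = detect_pii(text, "en", urls, patterns, fake_nlp)
regex_only = detect_pii(text, "it", urls, patterns, fake_nlp)
```

With no URL configured for "it", the name goes undetected and only the regex match is returned, which is the situation the warning describes.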
Notes
Currently, SquirroPiiDetector contains regular expressions for the following countries:
- Great Britain
- Germany
- Switzerland
- USA
The quality of detection depends mostly on the quality of the chosen NLP model. The fast models are faster but less accurate; the accurate models are slower but more accurate.
More about SpaCy: https://spacy.io/models
List of all SpaCy types for EN: https://spacy.io/models/en
List of all SpaCy types for DE: https://spacy.io/models/de
List of all SpaCy types for FR: https://spacy.io/models/fr
More about langdetect: Mimino666/langdetect
Configuration#
The squirro detector requires additional configuration parameters in masking_config.extra:
- nlp_urls: Dictionary of NLP URLs for each language. Example: {"en": "https://nlp-testing.squirro.com/spacy/en/fast/v0/_invoke"}
- masking_allow_types: List of PII types to be detected and masked. Example types: ["PERSON", "EMAIL_ADDRESS", "PHONE_NUMBER"]. For more types, see the Presidio documentation.
- custom_regex_patterns: Optional dictionary of custom regex patterns for PII detection. Example: {"PESEL": "\\d{11}", "PHONE_NUMBER": "\\d{9}", "AGE": "\\d{1,3}"}
Note
The general configuration parameter masking_min_score is not used by the squirro detector.
Example of simple configuration:
masking_config:
  detector_name: "squirro"
  unmasking_case_sensitive: false
  sanitizer:
    name: "simple_replacer"
    replace_map:
      - string: "\\"
        replacement: " "
      - string: ","
        replacement: " "
      - string: '"'
        replacement: " "
      - string: "'"
        replacement: " "
      - string: ";"
        replacement: " "
      - string: "`"
        replacement: " "
  extra:
    nlp_urls:
      en: "https://nlp-testing.squirro.com/spacy/en/fast/"
      fr: "https://nlp-testing.squirro.com/spacy/fr/accurate/"
      de: "https://nlp-testing.squirro.com/spacy/de/accurate/"
    masking_allow_types:
      - "PERSON"
      - "EMAIL_ADDRESS"
    custom_regex_patterns:
      PESEL: "\\d{11}"
      PHONE_NUMBER: "\\d{9}"
      AGE: "\\d{1,3}"
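Short, broad patterns such as those in the example can overlap. The sketch below applies them with Python's re module to show the effect. Whether the detector adds word boundaries or resolves overlaps itself is not stated here, so the anchored option is only an assumption about how patterns could be tightened, and match_types is a hypothetical helper.

```python
import re
from typing import Dict, List

custom_regex_patterns = {
    "PESEL": r"\d{11}",
    "PHONE_NUMBER": r"\d{9}",
    "AGE": r"\d{1,3}",
}


def match_types(
    text: str, patterns: Dict[str, str], anchored: bool = False
) -> Dict[str, List[str]]:
    """Return the matches of every pattern, optionally wrapped in \\b...\\b."""
    hits: Dict[str, List[str]] = {}
    for pii_type, pattern in patterns.items():
        if anchored:
            pattern = rf"\b(?:{pattern})\b"
        hits[pii_type] = [m.group(0) for m in re.finditer(pattern, text)]
    return hits


text = "ID 44051401359, age 42"
loose = match_types(text, custom_regex_patterns)
strict = match_types(text, custom_regex_patterns, anchored=True)
# loose["PHONE_NUMBER"] == ["440514013"]  (a fragment of the 11-digit ID)
# strict["PHONE_NUMBER"] == []
```

Without boundaries, the 11-digit identifier also matches as a phone number and as several ages. With boundaries, each value matches only the pattern of its own length.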
Limitations and Edge Cases#
Users and developers must be aware of the limitations of the squirro detector:
- The external SpaCy service may not detect all PII types. It is highly recommended to configure the sanitizer properly so that the text is cleaned before PII detection.
- Supported languages and models depend on the external NLP service used.
- PII data that starts with special characters may not be detected, for example **John Doe**. This case is related to specific document formats such as markdown and html.
- When content is generated based on context, the LLM may generate a new token based on a masked value. Example: the token in the context is PERSON_XYZ, and an LLM asked to generate an example email address may produce the new token EMAIL_XYZ@example.com (the expected value is PERSON_XYZ@example.com). In this case, the new token EMAIL_XYZ@example.com will not be unmasked.
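The last limitation, together with the unmasking_case_sensitive flag, can be illustrated with a small sketch. The unmask function and the vault dictionary are hypothetical and only assume that unmasking is a token-for-value substitution.

```python
import re
from typing import Dict


def unmask(text: str, vault: Dict[str, str], case_sensitive: bool = False) -> str:
    """Replace known tokens with their original values."""
    flags = 0 if case_sensitive else re.IGNORECASE
    for token in sorted(vault, key=len, reverse=True):
        original = vault[token]
        # A function replacement keeps backslashes in `original` literal.
        text = re.sub(re.escape(token), lambda m, v=original: v, text, flags=flags)
    return text


vault = {"PERSON_XYZ": "John Doe"}
llm_output = "Contact person_xyz or EMAIL_XYZ@example.com"

insensitive = unmask(llm_output, vault)
sensitive = unmask(llm_output, vault, case_sensitive=True)
# insensitive == "Contact John Doe or EMAIL_XYZ@example.com"
```

The token invented by the LLM, EMAIL_XYZ, has no entry in the vault and therefore survives in both modes. The lowercased person_xyz is restored only when unmasking is case-insensitive.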
Built-in Regular Expressions#
Below are the types you can use in masking_config.extra.masking_allow_types.
Example:
extra:
  masking_allow_types:
    - SQ_EMAIL_ADDRESS
    - SQ_CREDIT_CARD
    - SQ_DE_STEUER_ID
Global#
Warning: The regular expressions are not perfect; they may miss some PII and may produce false positives. The regular expressions are based on the most common formats of PII data. You can find more information about the regular expressions in the src/squirro/extensions/langchain/pii/detectors/squirro_pii_detector.regex_patterns.py file.
Type | Description
---|---
SQ_EMAIL_ADDRESS | Email address
SQ_CREDIT_CARD | Credit card number; supported: VISA, MasterCard, American Express, Diners Club, Discover, JCB, UnionPay, Maestro, UATP
SQ_PHONE_NUMBER | Phone number
SQ_IPV4 | IPv4 address
SQ_IPV6 | IPv6 address; does not cover all edge cases, valid for the most common formats
SQ_MAC_ADDRESS | MAC address
Great Britain#
Type | Description
---|---
SQ_UK_IBAN | UK IBAN number
SQ_UK_NHS_NUMBER | NHS number
SQ_UK_NINO | National Insurance Number
SQ_UK_PASSPORT | UK Passport number
Germany#
Type | Description
---|---
SQ_DE_STEUER_ID | Steuer-ID (German Tax ID)
SQ_DE_SOZIALVERSICHERUNGSNUMMER | German social insurance number
SQ_DE_ID | German ID identification number
SQ_DE_PASSPORT | German Passport number
SQ_GERMAN_IBAN | German IBAN number
Switzerland#
Type | Description
---|---
SQ_SWISS_PASSPORT | Swiss Passport
SQ_SWISS_ID | Swiss ID, the same for AVS and AHV
SQ_SWISS_IBAN | Swiss IBAN
USA#
Type | Description
---|---
SQ_US_SSN | USA Social Security Number
SQ_US_PASSPORT | USA Passport Number
SQ_US_MBI | USA Medicare Beneficiary Identifier
SQ_US_ITIN | USA Individual Taxpayer Identification Number
Getting Started#
To activate the Privacy Layer add-on for your Squirro instance, contact Squirro Support and submit a technical support request. Once your system is ready, our solutions engineers will assist with the setup process to ensure a smooth integration and optimal configuration.