Privacy Layer

Privacy Layer#

The widespread adoption of conversational AI systems raises concerns about data privacy. To mitigate the risks, it is essential to prevent the transfer of personally identifiable information (PII) to external IT infrastructures. The Squirro Privacy Layer add-on protects organizations by filtering out PII from the content sent to third-party language model providers.

Understanding the Privacy Layer#

The Privacy Layer is a set of technologies and protocols that protects PII by removing it from the content sent to the large language model (LLM), with minimal impact. It takes the user input, combined with the context and background information supplied by the retrieval-augmented generation (RAG) mechanism, and transforms it into content in which temporary placeholders replace each PII occurrence. This process, often called de-identification or data masking, is reversible: the system restores the original PII values in the output returned by the LLM.
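The reversible placeholder idea can be illustrated with a minimal sketch. It is not the add-on's implementation: the single email regex stands in for a real PII detector, and the `mask` and `unmask` helpers and the `EMAIL_n` placeholder format are assumptions made for illustration.

```python
import re

# Hypothetical detector: one email regex stands in for the real PII detection.
EMAIL_RE = re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+")


def mask(text: str) -> tuple[str, dict[str, str]]:
    """Replace each detected value with a placeholder and remember the mapping."""
    mapping: dict[str, str] = {}  # placeholder -> original value

    def substitute(match: re.Match) -> str:
        original = match.group(0)
        # Reuse the placeholder when the same value appears again.
        for token, value in mapping.items():
            if value == original:
                return token
        token = f"EMAIL_{len(mapping) + 1}"
        mapping[token] = original
        return token

    return EMAIL_RE.sub(substitute, text), mapping


def unmask(text: str, mapping: dict[str, str]) -> str:
    """Restore original values; longest placeholders first to avoid prefix clashes."""
    for token in sorted(mapping, key=len, reverse=True):
        text = text.replace(token, mapping[token])
    return text
```

Only the masked text would leave the system; the mapping stays local and is applied to the LLM output afterwards.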

The add-on provides a wrapper class, chat_wrapper.PIIMaskingChatWrapper, that encapsulates the standard LLM chat class derived from langchain_openai.chat_models.base.BaseChatModel.

Notes

Data transmitted to Tools is excluded from masking.

Integration#

The squirro.service.genai.components.builtins.chat_model.ChatModelFactory mechanism automatically detects and uses the PIIMaskingChatWrapper class to encapsulate the LLM chat class whenever the masking_config parameter is set in the LLM configuration. This means that for every API endpoint, if the masking_config parameter is set in the llm section of the configuration file src/squirro/service/genai/deployments/default.yaml, the wrapper is automatically used to encapsulate the LLM chat class. If the masking_config parameter is not set, the standard LLM chat class is used (no PII masking).

Notes

The masking_config must be specified for each API endpoint, and for each language model individually.
The configuration must match the schema defined by the squirro.extensions.langchain.pii.config.PIIConfig class.
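The selection rule described above (wrap the chat class only when masking_config is present) can be sketched as follows. This is an illustration, not the ChatModelFactory code: `build_chat_model`, `tagging_wrapper`, and `echo_model` are hypothetical names, and a chat model is reduced to a plain function.

```python
from typing import Any, Callable

# Stand-in for a chat model: takes a prompt, returns an answer.
ChatModel = Callable[[str], str]


def build_chat_model(
    llm_config: dict[str, Any],
    base_model: ChatModel,
    make_wrapper: Callable[[ChatModel, dict[str, Any]], ChatModel],
) -> ChatModel:
    """Wrap the base model only when masking_config is set (hypothetical helper)."""
    masking_config = llm_config.get("masking_config")
    if masking_config is None:
        return base_model  # standard chat class, no PII masking
    return make_wrapper(base_model, masking_config)


def tagging_wrapper(model: ChatModel, masking_config: dict[str, Any]) -> ChatModel:
    """Toy wrapper that only marks the prompt, to make the wrapping visible."""
    detector = masking_config.get("detector_name", "squirro")
    return lambda prompt: model(f"[masked by {detector}] {prompt}")


def echo_model(prompt: str) -> str:
    """Toy model that returns its prompt unchanged."""
    return prompt
```

Because the decision is made per llm section, two endpoints that share a model can still differ in whether masking is applied.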

Data Flow#

The data masking and unmasking process involves the following steps:

Each message (excluding SystemMessage) is masked before being transmitted to the LLM.
Masking
1. Detect the message type.
2. Parse the message and extract the text.
3. Detect PII using the detector defined in the configuration.
4. Create tokens for the PII text (tokens are stored in memory only).
5. Replace the sensitive text with the tokens.
Each message returned by the LLM is unmasked.
Unmasking
1. Replace the tokens with the original text.
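The steps above can be sketched at message level. Everything here is a simplified assumption: `Message`, `MaskingSession`, and `fake_llm` are hypothetical stand-ins, and a fixed name regex replaces the configured detector.

```python
import re
from dataclasses import dataclass


@dataclass
class Message:
    role: str  # "system", "human" or "ai"
    content: str


# Toy detector: two fixed names stand in for the configured PII detector.
NAME_RE = re.compile(r"\b(?:Alice|Bob)\b")


class MaskingSession:
    def __init__(self) -> None:
        self.tokens: dict[str, str] = {}  # token -> original, kept in memory only

    def _token_for(self, value: str) -> str:
        for token, original in self.tokens.items():
            if original == value:
                return token
        token = f"PERSON_{len(self.tokens) + 1}"
        self.tokens[token] = value
        return token

    def mask_message(self, message: Message) -> Message:
        if message.role == "system":  # system messages are not masked
            return message
        masked = NAME_RE.sub(lambda m: self._token_for(m.group(0)), message.content)
        return Message(message.role, masked)

    def unmask_message(self, message: Message) -> Message:
        content = message.content
        for token in sorted(self.tokens, key=len, reverse=True):
            content = content.replace(token, self.tokens[token])
        return Message(message.role, content)


def fake_llm(messages: list[Message]) -> Message:
    """Stand-in for the LLM call: echoes the last message it received."""
    return Message("ai", f"Echo: {messages[-1].content}")
```

The token map lives only on the session object, which mirrors the note that tokens are kept in memory.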

Configuration#

The configuration must match the schema defined by the squirro.extensions.langchain.pii.config.PIIConfig class.
The configuration must be placed in the src/squirro/service/genai/deployments/default.yaml configuration file, in the llm section of each API endpoint and each language model (llm) for which masking is to be enabled.

Example configuration for endpoint /v0/arbitrary_prompt:

prefix: "/v0"
description: "Endpoint for arbitrary prompts"
stream_endpoint: null
invoke_endpoint: "arbitrary_prompt"
extra_input_mapping:
    ...
chain_config:
    $runnable: squirro/chains/arbitrary_prompt
    llm:
        $runnable: generic_chat_models/chat_models/generic_chat_model
        masking_config:
            detector_name: "squirro"
            unmasking_case_sensitive: false
            sanitizer:
                name: "sanitizer"
                replace_map:
                    - string: "\\"
                      replacement: " "
                    - string: ";"
                      replacement: " "
                    - string: ":"
                      replacement: " "
                    - string: ","
                      replacement: " "
            extra:
                nlp_urls:
                    en: "https://nlp-testing.squirro.com/spacy/en/fast/v0/_invoke"
                    fr: "https://nlp-testing.squirro.com/spacy/fr/accurate/v0/_invoke"
                    de: "https://nlp-testing.squirro.com/spacy/de/accurate/v0/_invoke"
                masking_allow_types:
                    - PERSON # Names of people.
                    - LOC # Locations (geographical and other).
                    - EMAIL # Email addresses (depending on the model).
                    - PHONE_NUMBER # Phone numbers (in some models).
                custom_regex_patterns:
                    - pattern: "([0-9]{3}-[0-9]{2}-[0-9]{4})"
                      type: "SSN"

PIIConfig#

detector_name: (str) - PII detector name, default squirro, available detectors:
- presidio
- squirro
masking_min_score: (float) - minimum score a PII detection must reach to be masked, default 0.0. Not all detectors use this parameter.
unmasking_case_sensitive: (bool) - flag specifying whether token unmasking is case-sensitive, default False (unmasking is case-insensitive).
sanitizer: (SanitizerConfig) - configuration of the text sanitizer. The sanitizer is responsible for replacing characters in the text before PII detection. The sanitizer configuration must match the schema defined by the squirro.extensions.langchain.pii.sanitizer.SanitizerConfig class.
extra: (dict) - additional configuration parameters required by the PII detector, default is {}. Each PII detector may require different configuration parameters. See the Detectors section for more details.
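The effect of unmasking_case_sensitive can be illustrated with a short sketch. The `unmask` function and the `PERSON_1` token format are assumptions made for illustration; the add-on's actual matching logic is not documented here.

```python
import re


def unmask(text: str, tokens: dict[str, str], case_sensitive: bool = False) -> str:
    """Replace tokens with their original values, optionally ignoring case."""
    flags = 0 if case_sensitive else re.IGNORECASE
    # Longest tokens first, so PERSON_10 is not consumed as PERSON_1.
    for token in sorted(tokens, key=len, reverse=True):
        original = tokens[token]
        # A function replacement avoids backslash handling in the original value.
        text = re.sub(re.escape(token), lambda _m, o=original: o, text, flags=flags)
    return text
```

Case-insensitive unmasking is useful when the LLM changes the casing of a token in its answer, for example when it lowercases it inside a sentence.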

SanitizerConfig#

name: (str) - name of the sanitizer, default is sanitizer
replace_map: (list[ReplaceMapItem]) - list of dictionaries with the following keys:
- string: (str) - string to be replaced
- replacement: (str) - replacement string
extra: (dict) - additional configuration parameters required by the sanitizer, default is {}

Example of simple configuration:

name: "sanitizer"
replace_map:
    - {string: ";", replacement: " "}
    - {string: ":", replacement: " "}
    - {string: ",", replacement: " "}
extra: {}
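How a replace_map of this shape could be applied is sketched below. The `sanitize` function is a hypothetical illustration, not the add-on's sanitizer; it assumes plain string replacement applied in list order.

```python
def sanitize(text: str, replace_map: list[dict[str, str]]) -> str:
    """Apply each replacement in order, using plain (non-regex) string replacement."""
    for item in replace_map:
        text = text.replace(item["string"], item["replacement"])
    return text


# Mirrors the simple configuration shown above.
replace_map = [
    {"string": ";", "replacement": " "},
    {"string": ":", "replacement": " "},
    {"string": ",", "replacement": " "},
]
```

In the configurations shown, each single character is replaced by a single space, so the sanitized text has the same length as the input. Whether the add-on relies on that property is not stated.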

Detectors#

PII detectors are responsible for detecting personally identifiable information (PII) in text. It is possible to add your own PII detectors, which will be used by the PII Masking extension. By default, the PresidioPIIDetector detector based on the Presidio library is available.

SquirroPiiDetector#

Detector name: squirro
Detector class: src.squirro.extensions.langchain.pii.detectors.squirro_pii_detector.SquirroPiiDetector

Flow of detection:

1. Sanitize the text.
2. Detect the languages of the text.
3. For each detected language, detect PII using the external SpaCy NLP service (if a URL for that language is given in nlp_urls).
4. Detect PII using regular expressions.

The SquirroPiiDetector is a hybrid detector that uses an external SpaCy NLP service (over HTTP) and a set of regular expressions to detect personally identifiable information (PII) in text. It automatically detects the language of the text with the langdetect library and uses the appropriate language model for PII detection.

Warning

If no URL is defined in nlp_urls for a detected language, detection via the external NLP service is not performed for that language. In this case, only the regular expressions are used for PII detection.
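The detection flow, including the regex-only fallback from the warning, can be sketched as below. This is an assumption-laden outline: `detect_pii` is a hypothetical function, and the sanitizer, language detector, and NLP service call are injected as plain callables instead of using langdetect or HTTP.

```python
import re
from typing import Callable, Iterable

Entity = tuple[str, str]  # (PII type, matched text)


def detect_pii(
    text: str,
    nlp_urls: dict[str, str],
    regex_patterns: dict[str, str],
    sanitize: Callable[[str], str],
    detect_languages: Callable[[str], Iterable[str]],
    call_nlp_service: Callable[[str, str], list[Entity]],
) -> list[Entity]:
    clean = sanitize(text)  # 1. sanitize the text
    entities: list[Entity] = []
    for lang in detect_languages(clean):  # 2. detect the languages
        url = nlp_urls.get(lang)
        if url is None:
            continue  # no URL for this language: regex detection only
        entities.extend(call_nlp_service(url, clean))  # 3. external NLP service
    for pii_type, pattern in regex_patterns.items():  # 4. regular expressions
        entities.extend((pii_type, m.group(0)) for m in re.finditer(pattern, clean))
    return entities
```

Injecting the language detector and the service call keeps the sketch runnable without network access and makes the fallback path easy to exercise.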

Notes

Currently, SquirroPiiDetector contains regular expressions for the following countries:
- Great Britain
- Germany
- Switzerland
- USA
Detection quality depends mostly on the quality of the chosen NLP model. The fast models are faster but less accurate; the accurate models are slower but more accurate.

More about spacy: https://spacy.io/models
List of all SpaCy types for EN: https://spacy.io/models/en
List of all SpaCy types for DE: https://spacy.io/models/de
List of all SpaCy types for FR: https://spacy.io/models/fr
More about langdetect: Mimino666/langdetect

Configuration#

The squirro detector requires additional configuration parameters, in masking_config.extra:

nlp_urls: Dictionary of NLP URLs for each language. Example: {"en": "https://nlp-testing.squirro.com/spacy/en/fast/v0/_invoke"}
masking_allow_types: List of PII types to be detected and masked. Example types: ["PERSON", "EMAIL_ADDRESS", "PHONE_NUMBER"]. More types are listed in the Presidio documentation.
custom_regex_patterns: Optional dictionary of custom regex patterns for PII detection. Example: {"PESEL": "\\d{11}", "PHONE_NUMBER": "\\d{9}", "AGE": "\\d{1,3}"}

Note

General configuration parameter masking_min_score is not used by the squirro detector.

Example of simple configuration:

masking_config:
    detector_name: "squirro"
    unmasking_case_sensitive: false
    sanitizer:
        name: "simple_replacer"
        replace_map:
            - string: "\\"
              replacement: " "
            - string: ","
              replacement: " "
            - string: '"'
              replacement: " "
            - string: "'"
              replacement: " "
            - string: ";"
              replacement: " "
            - string: "`"
              replacement: " "
    extra:
        nlp_urls:
            en: "https://nlp-testing.squirro.com/spacy/en/fast/"
            fr: "https://nlp-testing.squirro.com/spacy/fr/accurate/"
            de: "https://nlp-testing.squirro.com/spacy/de/accurate/"
        masking_allow_types:
            - "PERSON"
            - "EMAIL_ADDRESS"
        custom_regex_patterns:
            PESEL: "\\d{11}"
            PHONE_NUMBER: "\\d{9}"
            AGE: "\\d{1,3}"
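The custom patterns in the example above are deliberately short, and unanchored digit patterns can match inside longer numbers. The sketch below shows this with Python's `re` module; `find_matches` and `with_word_boundaries` are hypothetical helpers, and how the detector itself resolves overlapping matches is not documented here.

```python
import re

# Same patterns as in the example configuration, written as Python raw strings.
custom_regex_patterns = {
    "PESEL": r"\d{11}",
    "PHONE_NUMBER": r"\d{9}",
    "AGE": r"\d{1,3}",
}


def find_matches(text: str, patterns: dict[str, str]) -> dict[str, list[str]]:
    """Return every match of every pattern, keyed by PII type."""
    return {pii_type: re.findall(pattern, text) for pii_type, pattern in patterns.items()}


def with_word_boundaries(patterns: dict[str, str]) -> dict[str, str]:
    """Wrap each pattern in \\b...\\b so it cannot match inside a longer number."""
    return {pii_type: rf"\b(?:{pattern})\b" for pii_type, pattern in patterns.items()}
```

With the unanchored patterns, an 11-digit number also yields a 9-digit PHONE_NUMBER match and several AGE matches; adding word boundaries to the pattern strings is one way to reduce such false positives.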

Limitations and Edge Cases#

Users and developers must be aware of the limitations of the squirro detector. The external SpaCy service may not detect all PII types. It is highly recommended to configure the sanitizer properly so that the text is cleaned before PII detection.

Supported languages and models depend on the external NLP service used.
PII that starts with special characters may not be detected, for example **John Doe**. This case arises with specific document formats such as Markdown and HTML.
When content is generated from context, the LLM may derive a new token from a masked value. Example: the token in the context is PERSON_XYZ, and an LLM asked to generate an example email address produces the new token EMAIL_XYZ@example.com (the expected value would be PERSON_XYZ@example.com). In this case, the new token EMAIL_XYZ@example.com is not unmasked.
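The derived-token case can be reproduced with a small sketch. The token shape (uppercase type, underscore, uppercase or digit identifier) and both helper functions are assumptions for illustration; the real token format may differ.

```python
import re

# Assumed token shape for illustration: TYPE + "_" + uppercase/digit identifier.
TOKEN_RE = re.compile(r"\b[A-Z]+_[A-Z0-9]+\b")


def unmask(text: str, tokens: dict[str, str]) -> str:
    """Replace only tokens that were created during masking; leave others as-is."""
    return TOKEN_RE.sub(lambda m: tokens.get(m.group(0), m.group(0)), text)


def leftover_tokens(text: str, tokens: dict[str, str]) -> list[str]:
    """Token-like strings in the text that have no entry in the token map."""
    return [token for token in TOKEN_RE.findall(text) if token not in tokens]
```

A check like `leftover_tokens` could be run on LLM output to flag answers in which the model invented a token, so that they can be reviewed or regenerated.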

Built-in Regular Expressions#

Below are types you can use in masking_config.extra.masking_allow_types.

Example:

extra:
    masking_allow_types:
        - SQ_EMAIL_ADDRESS
        - SQ_CREDIT_CARD
        - SQ_DE_STEUER_ID
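Conceptually, masking_allow_types acts as an allow list over detected entity types. The sketch below assumes entities are (type, text) pairs and that types outside the list are left unmasked; `filter_allowed` is a hypothetical helper, not part of the add-on.

```python
def filter_allowed(
    entities: list[tuple[str, str]], allow_types: list[str]
) -> list[tuple[str, str]]:
    """Keep only the detected entities whose type is in the allow list."""
    allowed = set(allow_types)
    return [(pii_type, value) for pii_type, value in entities if pii_type in allowed]
```

For example, with the list above, a detected SQ_IPV4 entity would be dropped while SQ_EMAIL_ADDRESS would be kept for masking.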

Global#

Warning: The regular expressions are not perfect: they may miss some PII and may produce false positives. They are based on the most common formats of PII data. You can find more information about the regular expressions in the src/squirro/extensions/langchain/pii/detectors/squirro_pii_detector.regex_patterns.py file.

Type	Description
SQ_EMAIL_ADDRESS	Email address
SQ_CREDIT_CARD	Credit card number; supported: VISA, MasterCard, American Express, Diners Club, Discover, JCB, UnionPay, Maestro, UATP
SQ_PHONE_NUMBER	Phone number
SQ_IPV4	IPv4 address
SQ_IPV6	IPv6 address; valid for the most common formats, does not cover all edge cases
SQ_MAC_ADDRESS	MAC address

Great Britain#

Type	Description
SQ_UK_IBAN	UK IBAN number
SQ_UK_NHS_NUMBER	NHS number
SQ_UK_NINO	National Insurance Number
SQ_UK_PASSPORT	UK Passport number

Germany#

Type	Description
SQ_DE_STEUER_ID	Steuer-ID (German Tax ID)
SQ_DE_SOZIALVERSICHERUNGSNUMMER	German social insurance number
SQ_DE_ID	German ID identification number
SQ_DE_PASSPORT	German Passport number
SQ_GERMAN_IBAN	German IBAN number

Switzerland#

Type	Description
SQ_SWISS_PASSPORT	Swiss Passport
SQ_SWISS_ID	Swiss ID, the same for AVS and AHV
SQ_SWISS_IBAN	Swiss IBAN

USA#

Type	Description
SQ_US_SSN	USA Social Security Number
SQ_US_PASSPORT	USA Passport Number
SQ_US_MBI	USA Medicare Beneficiary Identifier
SQ_US_ITIN	USA Individual Taxpayer Identification Number

Getting Started#

To activate the Privacy Layer add-on for your Squirro instance, contact Squirro Support and submit a technical support request. Once your system is ready, our solutions engineers will assist with the setup process to ensure a smooth integration and optimal configuration.
