Privacy Layer#

The widespread adoption of conversational AI systems raises concerns about data privacy. To mitigate the risks, it is essential to prevent the transfer of personally identifiable information (PII) to external IT infrastructures. The Squirro Privacy Layer add-on protects organizations by filtering out PII from the content sent to third-party language model providers.

Understanding the Privacy Layer#

The privacy layer is a set of technologies and protocols designed to protect PII by removing it from the content transferred to the large language model (LLM), with minimal impact. The layer takes the user's input, combined with the context and background information passed by the retrieval augmented generation (RAG) mechanism, and transforms it into content in which temporary placeholders replace the PII occurrences. This process, often called de-identification or data masking, is reversible: the system restores the original PII values in the output returned by the LLM.
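As an illustration of reversible masking, the following minimal sketch replaces known PII values with placeholders and restores them afterwards. The function names and the PII_<n> placeholder format are assumptions made for this example; they are not the add-on's actual API or token format.

```python
import re

# Hypothetical placeholder format used only in this sketch.
TOKEN_RE = re.compile(r"PII_\d+")


def mask(text, pii_values):
    """Replace each PII value with a placeholder and return (masked_text, mapping)."""
    mapping = {}
    # Deduplicate, then handle longer values first so that a value which is
    # a substring of another one does not break the longer match.
    ordered = sorted(dict.fromkeys(pii_values), key=len, reverse=True)
    for index, value in enumerate(ordered, start=1):
        token = f"PII_{index}"
        mapping[token] = value
        text = text.replace(value, token)
    return text, mapping


def unmask(text, mapping):
    """Restore original values; unknown placeholders are left untouched."""
    return TOKEN_RE.sub(lambda m: mapping.get(m.group(0), m.group(0)), text)


masked, mapping = mask("Email Jane Roe about the Zurich offsite.", ["Jane Roe", "Zurich"])
# The LLM only ever sees `masked`; its answer may reuse the placeholders.
restored = unmask("PII_1 confirmed the PII_2 venue.", mapping)
```

The mapping lives only in the calling process, which is what makes the transformation reversible without the original values ever leaving it.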

Squirro Chat Data Processing Overview

The add-on provides a wrapper class, chat_wrapper.PIIMaskingChatWrapper, which encapsulates the standard LLM chat class derived from langchain_openai.chat_models.base.BaseChatModel.

Notes

Data transmitted to Tools is excluded from masking.

Integration#

The squirro.service.genai.components.builtins.chat_model.ChatModelFactory mechanism automatically detects and uses the PIIMaskingChatWrapper class to encapsulate the LLM chat class whenever the masking_config parameter is set in the LLM configuration. This means that for every API endpoint, if the masking_config parameter is set in the llm section of the configuration file src/squirro/service/genai/deployments/default.yaml, the wrapper is automatically used to encapsulate the LLM chat class. If the masking_config parameter is not set, the standard LLM chat class is used (no PII masking).

Notes

  • The masking_config must be specified for each API endpoint, and for each language model individually.

  • The configuration must match the schema defined by the squirro.extensions.langchain.pii.config.PIIConfig class.
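The selection rule described above can be sketched as follows. PlainChat, MaskingChat, and build_chat_model are hypothetical stand-ins written for this example; they do not reproduce ChatModelFactory or PIIMaskingChatWrapper.

```python
class PlainChat:
    """Hypothetical stand-in for the standard LLM chat class."""

    def invoke(self, prompt):
        return f"echo: {prompt}"


class MaskingChat:
    """Hypothetical stand-in for a wrapper that masks PII around the inner model."""

    def __init__(self, inner, masking_config):
        self.inner = inner
        self.masking_config = masking_config


def build_chat_model(llm_config):
    """Wrap the chat model only when masking_config is present in the llm section."""
    model = PlainChat()
    masking_config = llm_config.get("masking_config")
    if masking_config is not None:
        return MaskingChat(model, masking_config)
    return model  # no masking_config: the standard class is used, no PII masking


plain = build_chat_model({})
wrapped = build_chat_model({"masking_config": {"detector_name": "squirro"}})
```

Because the decision is made per llm section, two endpoints that share a model can still differ in whether masking is applied.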

Data Flow#

The data masking and unmasking process involves the following steps:

  1. Each message (except SystemMessage) is masked before being transmitted to the LLM.

  2. Masking

    1. Detect the message type.

    2. Parse the message and extract its text.

    3. Detect PII using the detector defined in the configuration.

    4. Create tokens for the PII text (tokens are stored only in memory).

    5. Replace the sensitive text with the tokens.

  3. Each message returned by the LLM is unmasked.

  4. Unmasking

    1. Replace the tokens with the original text.
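The steps above can be sketched end to end with plain tuples standing in for messages. Everything here is an assumption for illustration: the (role, text) message shape, the email-only detector, the EMAIL_<n> token format, and the fake_llm function.

```python
import re
from itertools import count

EMAIL_RE = re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+")  # toy detector: emails only
TOKEN_RE = re.compile(r"EMAIL_\d+")  # hypothetical token format


class TokenStore:
    """Keeps the token/value mapping in memory only."""

    def __init__(self):
        self._by_token = {}
        self._by_value = {}
        self._ids = count(1)

    def token_for(self, value):
        # Reuse the same token when the same value appears again.
        if value not in self._by_value:
            token = f"EMAIL_{next(self._ids)}"
            self._by_value[value] = token
            self._by_token[token] = value
        return self._by_value[value]

    def value_for(self, token):
        return self._by_token.get(token, token)


def mask_messages(messages, store):
    """Mask every message except system messages."""
    masked = []
    for role, text in messages:
        if role != "system":
            text = EMAIL_RE.sub(lambda m: store.token_for(m.group(0)), text)
        masked.append((role, text))
    return masked


def unmask_text(text, store):
    return TOKEN_RE.sub(lambda m: store.value_for(m.group(0)), text)


def fake_llm(messages):
    """Stand-in for the remote model: refers back to the last word it received."""
    return "Reply sent to " + messages[-1][1].split()[-1]


store = TokenStore()
messages = [
    ("system", "Escalate to admin@corp.example"),
    ("human", "Draft a note to ann@example.com"),
]
masked = mask_messages(messages, store)
answer = unmask_text(fake_llm(masked), store)
```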

Configuration#

  1. The configuration must match the schema defined by the squirro.extensions.langchain.pii.config.PIIConfig class.

  2. The configuration must be placed in the src/squirro/service/genai/deployments/default.yaml configuration file in the llm section for each API endpoint for each language model (llm) for which masking is to be enabled.

Example configuration for endpoint /v0/arbitrary_prompt:

prefix: "/v0"
description: "Endpoint for arbitrary prompts"
stream_endpoint: null
invoke_endpoint: "arbitrary_prompt"
extra_input_mapping:
    ...
chain_config:
    $runnable: squirro/chains/arbitrary_prompt
    llm:
        $runnable: generic_chat_models/chat_models/generic_chat_model
        masking_config:
            detector_name: "squirro"
            unmasking_case_sensitive: false
            sanitizer:
                name: "sanitizer"
                replace_map:
                    - string: "\\"
                      replacement: " "
                    - string: ";"
                      replacement: " "
                    - string: ":"
                      replacement: " "
                    - string: ","
                      replacement: " "
            extra:
                nlp_urls:
                    en: "https://nlp-testing.squirro.com/spacy/en/fast/v0/_invoke"
                    fr: "https://nlp-testing.squirro.com/spacy/fr/accurate/v0/_invoke"
                    de: "https://nlp-testing.squirro.com/spacy/de/accurate/v0/_invoke"
                masking_allow_types:
                    - PERSON # Names of people.
                    - LOC # Locations (geographical and other).
                    - EMAIL # Email addresses (depending on the model).
                    - PHONE_NUMBER # Phone numbers (in some models).
                custom_regex_patterns:
                    - pattern: "([0-9]{3}-[0-9]{2}-[0-9]{4})"
                      type: "SSN"

PIIConfig#

  • detector_name: (str) - PII detector name, default squirro, available detectors:

    • presidio

    • squirro

  • masking_min_score: (float) - minimum detection score a PII match must reach to be masked, default 0.0. Not all detectors use this parameter.

  • unmasking_case_sensitive: (bool) - whether token unmasking is case-sensitive, default is False (case-insensitive).

  • sanitizer: (SanitizerConfig) - configuration of the text sanitizer. The sanitizer is responsible for replacing characters in the text before PII detection. The sanitizer configuration must match the schema defined by the squirro.extensions.langchain.pii.sanitizer.SanitizerConfig class.

  • extra: (dict) - additional configuration parameters required by the PII detector, default is {}. Each PII detector may require different configuration parameters. See the Detectors section for more details.
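The effect of unmasking_case_sensitive can be illustrated with a small sketch. The unmask function and the PERSON_A1 token are assumptions for this example; the point is only that a model may change the casing of a token it echoes back.

```python
import re


def unmask(text, tokens, case_sensitive=False):
    """Replace tokens with their original values, optionally ignoring case."""
    flags = 0 if case_sensitive else re.IGNORECASE
    for token, original in tokens.items():
        # A function replacement avoids backslash interpretation in `original`.
        text = re.sub(re.escape(token), lambda _m, v=original: v, text, flags=flags)
    return text


tokens = {"PERSON_A1": "Jane Roe"}  # hypothetical token
lenient = unmask("Dear person_a1,", tokens)
strict = unmask("Dear person_a1,", tokens, case_sensitive=True)
```

With the case-insensitive default, the lower-cased token is still restored; with case-sensitive unmasking it would be left in the output as is.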

SanitizerConfig#

  • name: (str) - name of the sanitizer, default is sanitizer

  • replace_map: (list[ReplaceMapItem]) - list of dictionaries with the following keys:

    • string: (str) - string to be replaced

    • replacement: (str) - replacement string

  • extra: (dict) - additional configuration parameters required by the sanitizer, default is {}

Example of simple configuration:

name: "sanitizer"
replace_map:
    - {string: ";", replacement: " "}
    - {string: ":", replacement: " "}
    - {string: ",", replacement: " "}
extra: {}
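A minimal sketch of how such a replace_map could be applied, assuming plain substring replacement in list order. The sanitize function is hypothetical; the item keys follow the example configuration above.

```python
def sanitize(text, replace_map):
    """Apply each replace_map item in order using plain substring replacement."""
    for item in replace_map:
        text = text.replace(item["string"], item["replacement"])
    return text


replace_map = [
    {"string": ";", "replacement": " "},
    {"string": ":", "replacement": " "},
    {"string": ",", "replacement": " "},
]
raw = "Name:Jane Roe;City:Bern"
clean = sanitize(raw, replace_map)
```

In this sketch every character is replaced by a single space, so the sanitized text keeps the same length and character offsets as the input. Whether the add-on relies on that property is not stated here.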

Detectors#

PII detectors are responsible for detecting personally identifiable information (PII) in text. It is possible to add your own PII detectors, which will be used by the PII Masking extension. By default, the PresidioPIIDetector detector based on the Presidio library is available.

SquirroPiiDetector#

  • Detector name: squirro

  • Detector class: src.squirro.extensions.langchain.pii.detectors.squirro_pii_detector.SquirroPiiDetector

Flow of detection:

  1. Sanitization of the text.

  2. Language detection.

  3. For each detected language, detection of PII using the external spaCy NLP service (only if a URL for that language is given in nlp_urls).

  4. Detection of PII using regular expressions.

The SquirroPiiDetector is a hybrid detector that uses an external spaCy NLP service (via HTTP) and a set of regular expressions to detect personally identifiable information (PII) in text. The detector automatically detects the language of the text, using the langdetect library, and uses the matching language model for PII detection.

Warning

If no URL is defined in nlp_urls for a detected language, detection via the external NLP service is not performed for that language. In this case, only the regular expressions are used for PII detection.
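The control flow of the hybrid detection, including the regex-only fallback, can be sketched as below. Language detection and the NLP call are injected as plain functions so the example runs offline; detect_pii, the stubs, and the example URL are assumptions and do not reflect the detector's real interface. Sanitization is assumed to have happened already.

```python
import re


def detect_pii(text, nlp_urls, regex_patterns, detect_languages, call_nlp):
    """Return (type, text) pairs from the NLP service and from regular expressions."""
    entities = []
    for lang in detect_languages(text):
        url = nlp_urls.get(lang)
        if url is None:
            continue  # no service configured for this language: regex only
        entities.extend(call_nlp(url, text))
    for pii_type, pattern in regex_patterns.items():
        entities.extend((pii_type, m.group(0)) for m in re.finditer(pattern, text))
    return entities


def stub_languages(text):
    return ["en", "pl"]  # stand-in for a language detection library


def stub_nlp(url, text):
    return [("PERSON", "Jane Roe")]  # stand-in for the HTTP call to the NLP service


patterns = {"SSN": r"[0-9]{3}-[0-9]{2}-[0-9]{4}"}
text = "Jane Roe SSN 123-45-6789"

with_nlp = detect_pii(text, {"en": "https://nlp.example.invalid/en"}, patterns, stub_languages, stub_nlp)
regex_only = detect_pii(text, {}, patterns, stub_languages, stub_nlp)
```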

Notes

  • Currently, SquirroPiiDetector contains regular expressions for the following countries:

    • Great Britain

    • Germany

    • Switzerland

    • USA

  • Detection quality depends mostly on the chosen NLP model: the fast models respond more quickly but are less accurate, while the accurate models are slower but detect more reliably.

Configuration#

The squirro detector requires additional configuration parameters, in masking_config.extra:

  • nlp_urls: Dictionary of NLP URLs for each language. Example: {"en": "https://nlp-testing.squirro.com/spacy/en/fast/v0/_invoke"}

  • masking_allow_types: List of PII types to be detected and masked. Example types: ["PERSON", "EMAIL_ADDRESS", "PHONE_NUMBER"]. For more types, see the list of entities supported by Presidio.

  • custom_regex_patterns: Optional dictionary of custom regex patterns for PII detection. Example: {"PESEL": "\\d{11}", "PHONE_NUMBER": "\\d{9}", "AGE": "\\d{1,3}"}

Note

General configuration parameter masking_min_score is not used by the squirro detector.

Example of simple configuration:

masking_config:
    detector_name: "squirro"
    unmasking_case_sensitive: false
    sanitizer:
        name: "simple_replacer"
        replace_map:
            - string: "\\"
              replacement: " "
            - string: ","
              replacement: " "
            - string: '"'
              replacement: " "
            - string: "'"
              replacement: " "
            - string: ";"
              replacement: " "
            - string: "`"
              replacement: " "
    extra:
        nlp_urls:
            en: "https://nlp-testing.squirro.com/spacy/en/fast/"
            fr: "https://nlp-testing.squirro.com/spacy/fr/accurate/"
            de: "https://nlp-testing.squirro.com/spacy/de/accurate/"
        masking_allow_types:
            - "PERSON"
            - "EMAIL_ADDRESS"
        custom_regex_patterns:
            PESEL: "\\d{11}"
            PHONE_NUMBER: "\\d{9}"
            AGE: "\\d{1,3}"
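Short, unanchored patterns such as those in custom_regex_patterns above can overlap with each other. The sketch below runs the three example patterns with Python's re module, first as written and then wrapped in word boundaries. How the detector itself resolves overlapping matches is not described here, so this only shows what the raw patterns match.

```python
import re

patterns = {"PESEL": r"\d{11}", "PHONE_NUMBER": r"\d{9}", "AGE": r"\d{1,3}"}


def find_matches(text, patterns):
    """Return every match of every pattern, keyed by PII type."""
    return {name: re.findall(pattern, text) for name, pattern in patterns.items()}


text = "PESEL 44051401359, age 42"

# As written, the shorter patterns also match inside the 11-digit number.
loose = find_matches(text, patterns)

# Word boundaries restrict each pattern to whole digit runs.
bounded = find_matches(text, {name: rf"\b{p}\b" for name, p in patterns.items()})
```

In this sketch the bounded variant reports only the 11-digit number as PESEL and only 42 as AGE, while the loose variant also reports fragments of the PESEL as PHONE_NUMBER and AGE.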

Limitations and Edge Cases#

Users and developers must be aware of the limitations of the squirro detector. The external spaCy service may not detect all PII types. It is highly recommended to configure the sanitizer properly so that the text is cleaned before PII detection.

  • Supported languages and models depend on the external NLP service used.

  • PII that starts with special characters may not be detected, for example **John Doe**. This case typically occurs with specific document formats such as Markdown or HTML.

  • When content is generated based on context, the LLM may derive a new token from a masked value. Example: the token in the context is PERSON_XYZ, and the LLM, asked to generate an example email address, produces EMAIL_XYZ@example.com (the expected output is PERSON_XYZ@example.com). In this case, the new token EMAIL_XYZ@example.com is not unmasked.
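The last limitation can be reproduced, and checked for, with a small sketch. The TYPE_SUFFIX token shape, the unmask function, and the leftover_tokens helper are assumptions for this example; the add-on's real token format may differ.

```python
import re


def unmask(text, tokens):
    """Replace known tokens with their original values."""
    for token, original in tokens.items():
        text = text.replace(token, original)
    return text


def leftover_tokens(text, token_types):
    """Find strings that still look like tokens (assumed TYPE_SUFFIX shape)."""
    types = "|".join(re.escape(t) for t in token_types)
    return re.findall(rf"\b(?:{types})_[A-Z0-9]+\b", text)


tokens = {"PERSON_XYZ": "Jane Roe"}
llm_output = "Contact PERSON_XYZ at EMAIL_XYZ@example.com"

restored = unmask(llm_output, tokens)
suspicious = leftover_tokens(restored, ["PERSON", "EMAIL"])
```

A check like this cannot recover the intended value, but it can flag responses in which the model invented a token that was never issued.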

Built-in Regular Expressions#

The following types can be used in masking_config.extra.masking_allow_types.

Example:

extra:
    masking_allow_types:
        - SQ_EMAIL_ADDRESS
        - SQ_CREDIT_CARD
        - SQ_DE_STEUER_ID

Global#

Warning: The regular expressions are not perfect: they may miss some PII or produce false positives. They are based on the most common formats of PII data. You can find more information about the regular expressions in the src/squirro/extensions/langchain/pii/detectors/squirro_pii_detector.regex_patterns.py file.
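To illustrate why format-based patterns can produce false positives, here is a generic IPv4 regular expression applied to a sentence that also contains a four-part build number. The pattern is written for this example and is not the add-on's SQ_IPV4 pattern.

```python
import re

# One octet: 0-255 without leading zeros.
OCTET = r"(?:25[0-5]|2[0-4]\d|1\d\d|[1-9]?\d)"
IPV4_RE = re.compile(rf"\b{OCTET}(?:\.{OCTET}){{3}}\b")

matches = IPV4_RE.findall("Server 10.0.0.12 runs build 2.14.0.7")
```

Both strings have a valid IPv4 shape, so a pattern alone cannot tell the address from the build number; only the surrounding context can.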

| Type | Description |
|---|---|
| SQ_EMAIL_ADDRESS | Email address |
| SQ_CREDIT_CARD | Credit card number; supported: VISA, MasterCard, American Express, Diners Club, Discover, JCB, UnionPay, Maestro, UATP |
| SQ_PHONE_NUMBER | Phone number |
| SQ_IPV4 | IPv4 address |
| SQ_IPV6 | IPv6 address; does not cover all edge cases, valid for the most common formats |
| SQ_MAC_ADDRESS | MAC address |

Great Britain#

| Type | Description |
|---|---|
| SQ_UK_IBAN | UK IBAN number |
| SQ_UK_NHS_NUMBER | NHS number |
| SQ_UK_NINO | National Insurance Number |
| SQ_UK_PASSPORT | UK Passport number |

Germany#

| Type | Description |
|---|---|
| SQ_DE_STEUER_ID | Steuer-ID (German Tax ID) |
| SQ_DE_SOZIALVERSICHERUNGSNUMMER | German social insurance number |
| SQ_DE_ID | German ID identification number |
| SQ_DE_PASSPORT | German Passport number |
| SQ_GERMAN_IBAN | German IBAN number |

Switzerland#

| Type | Description |
|---|---|
| SQ_SWISS_PASSPORT | Swiss Passport |
| SQ_SWISS_ID | Swiss ID, the same for AVS and AHV |
| SQ_SWISS_IBAN | Swiss IBAN |

USA#

| Type | Description |
|---|---|
| SQ_US_SSN | USA Social Security Number |
| SQ_US_PASSPORT | USA Passport Number |
| SQ_US_MBI | USA Medicare Beneficiary Identifier |
| SQ_US_ITIN | USA Individual Taxpayer Identification Number |

Getting Started#

To activate the Privacy Layer add-on for your Squirro instance, contact Squirro Support and submit a technical support request. Once your system is ready, our solutions engineers will assist with the setup process to ensure a smooth integration and optimal configuration.