Configure PII Detection and Redaction

ZettelForge uses Microsoft Presidio (open-source, MIT license) to detect and optionally redact PII (Personally Identifiable Information) from content before it is stored in the vector database and knowledge graph.

PII detection is disabled by default and adds no new core dependencies. It is fully optional: pip install zettelforge[pii] installs the required packages, and setting governance.pii.enabled: true turns detection on.

Prerequisites

ZettelForge installed
pip install zettelforge[pii] to install presidio-analyzer, presidio-anonymizer, and spaCy
Roughly 12-50 MB of disk space for the small or medium spaCy model (auto-downloads on first use); the large and transformer models need several hundred MB

How It Works

Presidio runs in-process as a validation step inside GovernanceValidator, invoked before every remember() operation:

remember(content)
  -> GovernanceValidator.validate_remember(content)
    -> PIIValidator.validate(content)
      -> presidio-analyzer scans for 20+ PII types
    -> Returns (passed, processed_content, detections)
  -> Returns processed_content (possibly redacted)
  -> MemoryStore.save(processed_content)

Three actions control what happens when PII is detected:

Action	Behavior	Use Case
`log`	Detect, log a warning, pass content through unchanged	Discovery -- see what PII is in your pipeline
`redact`	Replace PII with `[REDACTED]` before storage	Compliance -- prevent PII persistence
`block`	Raise an exception and cancel storage	Strict environments -- no PII allowed through
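
The three actions can be illustrated with a minimal, self-contained sketch. It is not ZettelForge's implementation: a single email regex stands in for the Presidio analyzer, and `apply_pii_action` and the locally defined `PIIBlockedError` are hypothetical names used only for illustration.

```python
import logging
import re

logger = logging.getLogger("pii_sketch")

# Stand-in detector: one email regex instead of Presidio's recognizers.
EMAIL_RE = re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+")


class PIIBlockedError(Exception):
    """Local stand-in for the exception raised when action is 'block'."""


def apply_pii_action(content: str, action: str = "log",
                     placeholder: str = "[REDACTED]") -> str:
    """Return the content that would be stored, or raise if it is blocked."""
    if action not in {"log", "redact", "block"}:
        raise ValueError(f"unknown action: {action!r}")
    matches = EMAIL_RE.findall(content)
    if not matches:
        return content  # nothing detected, store as-is
    if action == "block":
        raise PIIBlockedError(f"{len(matches)} PII match(es) found")
    if action == "redact":
        return EMAIL_RE.sub(placeholder, content)
    # action == "log": warn and pass the content through unchanged
    logger.warning("pii_detected count=%d", len(matches))
    return content
```

Only `redact` changes the stored text; `log` and `block` differ in whether the write proceeds at all.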

Configuration

Add a pii: section under governance: in your config.yaml:

governance:
  enabled: true
  pii:
    enabled: true       # enable PII detection
    action: log          # log | redact | block
    redact_placeholder: "[REDACTED]"
    entities: []         # empty = all PII types
    language: en
    nlp_model: en_core_web_sm
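
One way to reason about these keys is as a small typed settings object. The sketch below mirrors the keys from the YAML above and takes its defaults from this page; the `PIISettings` class and the `pii_settings_from_config` helper are hypothetical and are not part of ZettelForge's API.

```python
from dataclasses import dataclass, field

VALID_ACTIONS = ("log", "redact", "block")


@dataclass
class PIISettings:
    """Hypothetical mirror of the governance.pii keys shown above."""
    enabled: bool = False                     # detection is off unless enabled
    action: str = "log"
    redact_placeholder: str = "[REDACTED]"
    entities: list[str] = field(default_factory=list)  # empty = all PII types
    language: str = "en"
    nlp_model: str = "en_core_web_sm"

    def __post_init__(self):
        # Reject typos such as "redacted" early instead of at the first write.
        if self.action not in VALID_ACTIONS:
            raise ValueError(
                f"action must be one of {VALID_ACTIONS}, got {self.action!r}"
            )


def pii_settings_from_config(config: dict) -> PIISettings:
    """Build settings from an already-parsed config.yaml dictionary."""
    section = (config.get("governance") or {}).get("pii") or {}
    return PIISettings(**section)
```

Validating `action` up front means a misspelled value fails at startup rather than silently falling back to another behavior.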

Entity Filtering

The entities list lets you scope detection to specific PII types. When it is empty (the default), all supported types are detected except the CTI indicator types noted below the table.

Common entity types:

Entity	Example	Notes
`EMAIL_ADDRESS`	`user@example.com`	Frequently appears in phishing reports
`PHONE_NUMBER`	`(555) 123-4567`
`PERSON`	`John Smith`
`CREDIT_CARD`	`4111-1111-1111-1111`
`SSN`	`123-45-6789`	Presidio's own identifier for this type is `US_SSN`
`CRYPTO`	`1A1zP1eP5QGefi2DMP`	Bitcoin addresses
`LOCATION`	`New York City`
`ORGANIZATION`	`Microsoft Corp`

IP addresses, URLs, and domain names are exempt from detection by default -- these are legitimate CTI indicators (IOCs), not PII in the threat intelligence context. To include them, set entities explicitly:

governance:
  pii:
    enabled: true
    entities: ["IP_ADDRESS", "EMAIL_ADDRESS"]   # only IPs and email addresses are detected
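
The filtering rule described above can be sketched as a small function. The entity names in `SUPPORTED` and `CTI_EXEMPT` are illustrative assumptions (in particular `URL` and `DOMAIN_NAME`), and `effective_entities` is a hypothetical helper, not ZettelForge's actual code.

```python
# Illustrative subset of entity types; the real list is longer.
SUPPORTED = {
    "EMAIL_ADDRESS", "PHONE_NUMBER", "PERSON", "CREDIT_CARD",
    "LOCATION", "IP_ADDRESS", "URL", "DOMAIN_NAME",
}
# Assumed names for the CTI indicators that are skipped by default.
CTI_EXEMPT = {"IP_ADDRESS", "URL", "DOMAIN_NAME"}


def effective_entities(configured: list[str]) -> set[str]:
    """Resolve the entity types to scan for from the `entities` setting."""
    if not configured:
        # Empty list: every supported type except the exempt CTI indicators.
        return SUPPORTED - CTI_EXEMPT
    # Explicit list: scan exactly what was requested, exempt types included.
    return set(configured)
```

Note that an explicit list replaces the default set rather than extending it, so listing `IP_ADDRESS` alone would stop every other type from being detected.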

Example Configurations

1. Log-Only (Discovery Mode)

Use this first to understand what PII flows through your pipeline without changing any data:

governance:
  pii:
    enabled: true
    action: log

Every PII detection is logged as a structured pii_detected log event with count, entity types, and scores. Content is stored unchanged.
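
To give a rough idea of what such an event might contain, the sketch below builds a JSON record from a list of detections. The field names (`event`, `count`, `entity_types`, `scores`) and the `Detection` tuple are assumptions for illustration; the actual log schema may differ.

```python
import json
from typing import NamedTuple


class Detection(NamedTuple):
    """Hypothetical detection record: entity type plus confidence score."""
    entity_type: str
    score: float


def build_pii_event(detections: list[Detection]) -> str:
    """Serialize detections as a one-line JSON record (no PII values)."""
    record = {
        "event": "pii_detected",
        "count": len(detections),
        "entity_types": sorted({d.entity_type for d in detections}),
        "scores": [round(d.score, 2) for d in detections],
    }
    return json.dumps(record, sort_keys=True)
```

The record carries types and scores only, never the matched text, so the log itself does not become a second place where PII is stored.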

2. Redact (Compliance Mode)

Automatically replace PII with placeholders before storage:

governance:
  pii:
    enabled: true
    action: redact
    redact_placeholder: "[PII REMOVED]"

The redacted content is what gets stored and indexed. The original content with PII is never persisted.
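
Redaction by character span can be sketched as follows. Presidio reports each detection with start and end offsets; the hypothetical `redact_spans` helper below, written with the standard library only, merges overlapping spans and then replaces them from right to left so that earlier offsets stay valid. It illustrates the idea and is not the code that ZettelForge or presidio-anonymizer runs.

```python
def redact_spans(text: str, spans: list[tuple[int, int]],
                 placeholder: str = "[PII REMOVED]") -> str:
    """Replace each (start, end) span in `text` with `placeholder`."""
    # Merge overlapping or touching spans so nothing is replaced twice.
    merged: list[list[int]] = []
    for start, end in sorted(spans):
        if merged and start <= merged[-1][1]:
            merged[-1][1] = max(merged[-1][1], end)
        else:
            merged.append([start, end])
    # Work right to left so replacements do not shift pending offsets.
    for start, end in reversed(merged):
        text = text[:start] + placeholder + text[end:]
    return text
```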

3. Block (Strict Mode)

Reject any content containing PII entirely:

governance:
  pii:
    enabled: true
    action: block

If PII is detected, remember() raises a PIIBlockedError and the operation is cancelled. The calling code receives the exception and can handle it (e.g., ask the user to retry without PII).
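
Calling code might wrap the call as sketched below. The import path of `PIIBlockedError` is not shown on this page, so the sketch defines a local stand-in class and takes the `remember` callable as a parameter; `remember_or_report` is a hypothetical helper.

```python
from typing import Callable, Optional, Tuple


class PIIBlockedError(Exception):
    """Local stand-in; import the real exception from ZettelForge instead."""


def remember_or_report(remember: Callable[[str], object],
                       content: str) -> Tuple[Optional[object], Optional[str]]:
    """Call remember(content); return (result, None) or (None, user message)."""
    try:
        return remember(content), None
    except PIIBlockedError as exc:
        # Nothing was stored; tell the user why and let them retry.
        return None, f"Not stored: {exc}. Remove personal data and try again."
```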

4. Targeted Detection (Only Emails and Phones)

Scope detection to specific entity types to reduce noise:

governance:
  pii:
    enabled: true
    action: redact
    entities: ["EMAIL_ADDRESS", "PHONE_NUMBER"]

5. Complete Compliance Setup (FedRAMP-aligned)

governance:
  enabled: true
  min_content_length: 1
  pii:
    enabled: true
    action: redact
    redact_placeholder: "[REDACTED]"
    entities: []
    language: en
    nlp_model: en_core_web_sm

Environment Variables

Variable	Maps To	Default
`ZETTELFORGE_PII_ENABLED`	`governance.pii.enabled`	`false`
`ZETTELFORGE_PII_ACTION`	`governance.pii.action`	`log`
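
The mapping can be pictured as a thin override layer on top of the YAML values. The sketch below is an assumption about how such overrides could be applied: it assumes environment variables take precedence over config.yaml and guesses which strings count as true. `apply_env_overrides` is a hypothetical helper.

```python
import os

TRUE_VALUES = {"1", "true", "yes", "on"}  # assumed truthy spellings


def apply_env_overrides(pii: dict, environ=None) -> dict:
    """Return a copy of the pii settings with environment overrides applied."""
    env = os.environ if environ is None else environ
    merged = {"enabled": False, "action": "log", **pii}  # documented defaults
    if "ZETTELFORGE_PII_ENABLED" in env:
        raw = env["ZETTELFORGE_PII_ENABLED"].strip().lower()
        merged["enabled"] = raw in TRUE_VALUES
    if "ZETTELFORGE_PII_ACTION" in env:
        merged["action"] = env["ZETTELFORGE_PII_ACTION"].strip().lower()
    return merged
```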

spaCy Model Download

The spaCy NLP model downloads automatically on the first remember() call after PII is enabled. The download is a one-time cost:

Model	Size	Speed	Notes
`en_core_web_sm`	~12 MB	Fast	Default. Good accuracy for standard PII
`en_core_web_md`	~40 MB	Medium	Better person/location disambiguation
`en_core_web_lg`	~560 MB	Slow	Best accuracy, word vectors
`en_core_web_trf`	~400 MB	Slowest	Transformer-based, best for context

To download the model ahead of time (recommended, and necessary for air-gapped deployments where the automatic download cannot reach the network):

python -m spacy download en_core_web_sm
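
spaCy models are installed as ordinary Python packages, so a deployment script can check for one without loading it. A minimal sketch using only the standard library follows; `spacy_model_installed` is a hypothetical helper name.

```python
import importlib.util


def spacy_model_installed(model_name: str = "en_core_web_sm") -> bool:
    """Return True if the model package is importable in this environment."""
    return importlib.util.find_spec(model_name) is not None


if __name__ == "__main__":
    if not spacy_model_installed():
        print("Model missing. Run: python -m spacy download en_core_web_sm")
```

Running this check during image build or provisioning surfaces a missing model before the first remember() call tries to fetch it.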

Verification

After configuration, test that PII detection is working:

from zettelforge import MemoryManager

mm = MemoryManager()

# This should trigger a PII warning if action=log
note, status = mm.remember(
    "Contact analyst John Smith at john@example.com or 555-1234 for details."
)

With action=log, you will see a pii_detected structured log event.

With action=redact, the stored content will have PII replaced:

print(note.content.raw)
# "Contact analyst [REDACTED] at [REDACTED] or [REDACTED] for details."

With action=block, remember() will raise PIIBlockedError.

Performance Impact

First call: ~2-3 seconds (spaCy model loading). Subsequent calls skip the model load.
Detection latency: ~50-200 ms per remember() call, depending on content length and model size.
No network calls (all detection is local).
No impact when governance.pii.enabled: false (disabled by default).
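
To check these figures in your own environment, timing can be measured around any callable. The sketch below is generic: pass it a function that calls remember() on a representative note. `time_calls` is a hypothetical helper.

```python
import time
from typing import Callable, List


def time_calls(fn: Callable[[], object], repeats: int = 5) -> List[float]:
    """Run fn `repeats` times and return each call's duration in milliseconds."""
    durations = []
    for _ in range(repeats):
        started = time.perf_counter()
        fn()
        durations.append((time.perf_counter() - started) * 1000.0)
    return durations
```

In a fresh process, the first element includes the one-time model load, so compare it separately from the remaining values.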

Related

Configuration Reference -- all config.yaml keys
Governance Controls -- GOV-013 PII enforcement
Microsoft Presidio -- upstream project
RFC-013: PII Detection and Redaction via Microsoft Presidio