Configure PII Detection and Redaction¶
ZettelForge uses Microsoft Presidio (open-source, MIT license) to detect and optionally redact PII (Personally Identifiable Information) from content before it is stored in the vector database and knowledge graph.
PII detection is disabled by default with no new core dependencies. It is a fully optional feature -- pip install zettelforge[pii] activates it.
Prerequisites¶
- ZettelForge installed
pip install zettelforge[pii]to install presidio-analyzer, presidio-anonymizer, and spaCy- About ~12-50 MB of disk space for the spaCy model (auto-downloads on first use)
How It Works¶
Presidio runs in-process as a validation step inside GovernanceValidator, invoked before every remember() operation:
remember(content)
-> GovernanceValidator.validate_remember(content)
-> PIIValidator.validate(content)
-> presidio-analyzer scans for 20+ PII types
-> Returns (passed, processed_content, detections)
-> Returns processed_content (possibly redacted)
-> MemoryStore.save(processed_content)
Three actions control what happens when PII is detected:
| Action | Behavior | Use Case |
|---|---|---|
log |
Detect, log a warning, pass content through unchanged | Discovery -- see what PII is in your pipeline |
redact |
Replace PII with [REDACTED] before storage |
Compliance -- prevent PII persistence |
block |
Raise an exception, storage is cancelled | Strict environments -- no PII allowed through |
Configuration¶
Add a pii: section under governance: in your config.yaml:
governance:
enabled: true
pii:
enabled: true # enable PII detection
action: log # log | redact | block
redact_placeholder: "[REDACTED]"
entities: [] # empty = all PII types
language: en
nlp_model: en_core_web_sm
Entity Filtering¶
The entities list lets you scope detection to specific PII types. When empty (default), all supported types are detected.
Common entity types:
| Entity | Example | Notes |
|---|---|---|
EMAIL_ADDRESS |
user@example.com |
Enables spam from phishing reports |
PHONE_NUMBER |
(555) 123-4567 |
|
PERSON |
John Smith |
|
CREDIT_CARD |
4111-1111-1111-1111 |
|
SSN |
123-45-6789 |
|
CRYPTO |
1A1zP1eP5QGefi2DMP |
Bitcoin addresses |
LOCATION |
New York City |
|
ORGANIZATION |
Microsoft Corp |
IP addresses, URLs, and domain names are exempt from detection by default -- these are legitimate CTI indicators (IOCs), not PII in the threat intelligence context. To include them, set entities explicitly:
Example Configurations¶
1. Log-Only (Discovery Mode)¶
Use this first to understand what PII flows through your pipeline without changing any data:
Every PII detection is logged as a structured pii_detected log event with count, entity types, and scores. Content is stored unchanged.
2. Redact (Compliance Mode)¶
Automatically replace PII with placeholders before storage:
The redacted content is what gets stored and indexed. The original content with PII is never persisted.
3. Block (Strict Mode)¶
Reject any content containing PII entirely:
If PII is detected, remember() raises a PIIBlockedError and the operation is cancelled. The calling code receives the exception and can handle it (e.g., ask the user to retry without PII).
4. Targeted Detection (Only Emails and Phones)¶
Scope detection to specific entity types to reduce noise:
5. Complete Compliance Setup (FedRAMP-aligned)¶
governance:
enabled: true
min_content_length: 1
pii:
enabled: true
action: redact
redact_placeholder: "[REDACTED]"
entities: []
language: en
nlp_model: en_core_web_sm
Environment Variables¶
| Variable | Maps To | Default |
|---|---|---|
ZETTELFORGE_PII_ENABLED |
governance.pii.enabled |
false |
ZETTELFORGE_PII_ACTION |
governance.pii.action |
log |
spaCy Model Download¶
The spaCy NLP model downloads automatically on the first remember() call after PII is enabled. The download is a one-time cost:
| Model | Size | Speed | Notes |
|---|---|---|---|
en_core_web_sm |
~12 MB | Fast | Default. Good accuracy for standard PII |
en_core_web_md |
~40 MB | Medium | Better person/location disambiguation |
en_core_web_lg |
~560 MB | Slow | Best accuracy, word vectors |
en_core_web_trf |
~400 MB | Slowest | Transformer-based, best for context |
To pre-download (recommended for air-gapped deployments):
Verification¶
After configuration, test that PII detection is working:
from zettelforge import MemoryManager
mm = MemoryManager()
# This should trigger a PII warning if action=log
note, status = mm.remember(
"Contact analyst John Smith at john@example.com or 555-1234 for details."
)
With action=log, you will see a pii_detected structured log event.
With action=redact, the stored content will have PII replaced:
With action=block, remember() will raise PIIBlockedError.
Performance Impact¶
- First call: ~2-3 seconds (spaCy model loading). Subsequent calls are fast.
- Detection latency: ~50-200ms per
remember()depending on content length and model size. - No network calls (all detection is local).
- No impact when
governance.pii.enabled: false(disabled by default).
Related¶
- Configuration Reference -- all
config.yamlkeys - Governance Controls -- GOV-013 PII enforcement
- Microsoft Presidio -- upstream project
- RFC-013: PII Detection and Redaction via Microsoft Presidio