RFC-001: Conversational Entity Extractor

Field	Value
Author	Patrick Roland
Status	Partially Implemented
Created	2026-04-09
Last updated	2026-04-16
ZettelForge version	v2.7.0
Related RFCs	RFC-002 Universal LLM Provider (pending)

Context

ZettelForge's entity extractor recognizes CTI entities: CVEs, APT groups, tools, campaigns, IOCs. The LOCOMO benchmark tests conversational memory — person names, locations, hobbies, life events, and temporal references. Those entity types were invisible to the regex-only extractor.

At v1.5.0, ZettelForge scored 15% on LOCOMO. Root cause analysis identified entity extraction mismatch as the primary blocker: retrieved context contained the answer (75.9% keyword presence), but graph traversal returned nothing because no recognized entities matched LOCOMO queries. Multi-hop and temporal accuracy were 0%.

Proposal

Replace regex-only entity extraction with a hybrid pipeline:

Regex fast-path for CTI and IOC types (reliable, deterministic, no LLM required)
LLM NER for conversational types (person, location, organization, event, activity, temporal) where regex fails

This gives 19 entity types total: 6 CTI, 7 IOC, 6 conversational.
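The hybrid pipeline proposed above can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the EntityExtractor implementation: the two regex patterns, the function names `extract_regex` and `extract_hybrid`, and the `llm_ner` callable are all hypothetical stand-ins.

```python
import re
from typing import Callable, Dict, Optional, Set

Entities = Dict[str, Set[str]]

# Illustrative patterns only (assumption); the real CTI/IOC patterns are broader.
REGEX_PATTERNS = {
    "cve": re.compile(r"\bCVE-\d{4}-\d{4,}\b"),
    "ipv4": re.compile(r"\b(?:\d{1,3}\.){3}\d{1,3}\b"),
}


def extract_regex(text: str) -> Entities:
    """Deterministic fast path: no LLM involved."""
    found: Entities = {}
    for entity_type, pattern in REGEX_PATTERNS.items():
        matches = {m.group(0) for m in pattern.finditer(text)}
        if matches:
            found[entity_type] = matches
    return found


def extract_hybrid(
    text: str, llm_ner: Optional[Callable[[str], Entities]] = None
) -> Entities:
    """Run the regex path first, then merge optional LLM NER results."""
    entities = extract_regex(text)
    if llm_ner is not None:
        for entity_type, values in llm_ner(text).items():
            # Union rather than assignment, so neither path discards the other.
            entities.setdefault(entity_type, set()).update(values)
    return entities


def fake_ner(text: str) -> Entities:
    """Hypothetical stub standing in for a real LLM NER call."""
    return {"person": {"Alice"}}


result = extract_hybrid("Alice patched CVE-2021-44228 on 10.0.0.5", fake_ner)
```

Merging by set union is one way to avoid the overwrite behaviour listed as finding 7 in the adversarial review.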

Target: 80%+ LOCOMO accuracy.

What shipped

Entity types (Steps 1–3: complete)

EntityExtractor in entity_indexer.py is the single source of truth for all extraction. NoteConstructor.extract_entities() delegates to it. The 19-type schema is live:

Category	Entity types	Extraction method
CTI	`cve`, `intrusion_set`, `actor`, `tool`, `campaign`, `attack_pattern`	Regex
IOC (STIX cyber observables)	`ipv4`, `domain`, `url`, `md5`, `sha1`, `sha256`, `email`	Regex
Conversational	`person`, `location`, `organization`, `event`, `activity`, `temporal`	LLM NER (built; see note below)

The IOC types were not in the original proposal — they were added during implementation to support CTIBench and real-world CTI workflows.

For entity type reference and the get_entity_relationships() / traverse_graph() API, see KG Edge Schema and Memory Manager API.

Hash false-positive filter (complete)

Hash IOC patterns (md5/sha1/sha256) match any 32/40/64-character hex string, producing false positives on git commit SHAs and code assignments. EntityExtractor._CODE_CONTEXT_PATTERN detects code or VCS context per line and excludes all hex strings on matching lines from final results. This is a compiled re.VERBOSE pattern that handles assignments, commit entries, merge lines, code fences, and function definitions.
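To make the per-line filtering concrete, here is a simplified sketch. The pattern below is a hypothetical, much-reduced stand-in for `_CODE_CONTEXT_PATTERN`, and the names `CODE_CONTEXT`, `HEX_HASH`, and `extract_hashes` are assumptions made for illustration.

```python
import re

# Hypothetical, reduced stand-in for _CODE_CONTEXT_PATTERN (assumption).
CODE_CONTEXT = re.compile(
    r"""
    ^\s*
    (?:
        commit\s+[0-9a-f]{7,40}\b     # git log entry
      | Merge:\s                      # git merge line
      | `{3}                          # code fence marker
      | def\s+\w+\s*\(                # function definition
      | [A-Za-z_][\w.]*\s*=\s*\S      # assignment
    )
    """,
    re.VERBOSE,
)

# Hex strings of exactly 32, 40, or 64 characters.
HEX_HASH = re.compile(r"\b(?:[0-9a-fA-F]{64}|[0-9a-fA-F]{40}|[0-9a-fA-F]{32})\b")
HASH_TYPE_BY_LENGTH = {32: "md5", 40: "sha1", 64: "sha256"}


def extract_hashes(text: str) -> dict:
    """Collect hash IOCs, skipping every line that looks like code or VCS output."""
    found = {}
    for line in text.splitlines():
        if CODE_CONTEXT.match(line):
            continue  # drop all hex strings on this line
        for match in HEX_HASH.finditer(line):
            value = match.group(0).lower()
            found.setdefault(HASH_TYPE_BY_LENGTH[len(value)], set()).add(value)
    return found


sample = "\n".join(
    [
        "commit " + "a" * 40,          # git SHA: filtered out
        "Dropper hash: " + "b" * 32,   # prose mention: kept as md5
        "digest = '" + "c" * 64 + "'", # assignment: filtered out
    ]
)
hashes = extract_hashes(sample)
```

Filtering whole lines is coarse: a real hash quoted on an assignment line would also be dropped, which is the trade-off the per-line approach accepts.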

`_PERSON_PATTERN` and `_LOCATION_PATTERN` (complete, limited)

Two regex patterns handle conversational entities without an LLM call:

_PERSON_PATTERN matches the dialogue format "Name: text" (e.g., transcript speaker labels). A stopword filter excludes 40+ common words as well as day and month names.
_LOCATION_PATTERN matches 40+ major world cities by name.

These are supplementary. They do not cover implicit references ("my friend from college") or locations not in the hardcoded list.
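A rough sketch of how two such patterns might behave, including the limitation just described. The stopword set and city list here are tiny hypothetical samples, not the real 40+ entry lists, and the function names are assumptions.

```python
import re

# Tiny hypothetical samples (assumption); the real lists have 40+ entries each.
STOPWORDS = {"note", "warning", "summary", "monday", "january"}
CITIES = {"Lisbon", "Tokyo", "New York"}

# Speaker label at the start of a line: "Name: text".
SPEAKER = re.compile(r"^([A-Z][a-z]+):\s+\S", re.MULTILINE)
LOCATION = re.compile(
    r"\b(?:" + "|".join(re.escape(city) for city in sorted(CITIES)) + r")\b"
)


def extract_speakers(text: str) -> set:
    """Names used as dialogue labels, minus stopwords such as 'Note'."""
    return {
        m.group(1)
        for m in SPEAKER.finditer(text)
        if m.group(1).lower() not in STOPWORDS
    }


def extract_locations(text: str) -> set:
    """Only cities present in the hardcoded list are recognized."""
    return {m.group(0) for m in LOCATION.finditer(text)}


transcript = (
    "Caroline: I moved to Lisbon last year.\n"
    "Note: call back on Monday.\n"
    "Melanie: My friend from college lives near Porto.\n"
)
speakers = extract_speakers(transcript)
locations = extract_locations(transcript)
```

In this sample, "my friend from college" yields no person and "Porto" yields no location, which mirrors the coverage gap described above.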

`extract_llm()` method (complete; not activated)

EntityExtractor.extract_llm() is implemented and works when called directly. It calls llm_client.generate() with a structured JSON prompt, truncating the input to 2,000 characters and setting max_tokens=300 and temperature=0.0. If the response fails to parse as JSON, it retries once with json_mode=True and temperature=0.3. On any exception, it logs the error with exc_info=True and returns an empty dict.
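The control flow described above might look roughly like this. It is a sketch under assumptions: the prompt wording, the log event name, the `generate` callable's signature, and reusing max_tokens=300 on the retry are all hypothetical; only the truncation length, sampling parameters, single retry, and empty-dict fallback come from the description.

```python
import json
import logging
from typing import Callable, Dict, List

logger = logging.getLogger(__name__)

# Hypothetical prompt wording (assumption).
PROMPT = (
    "Return a JSON object with keys person, location, organization, "
    "event, activity, temporal.\n\nText:\n"
)


def extract_llm(text: str, generate: Callable[..., str]) -> Dict[str, List[str]]:
    """Ask the model for entities; retry once on bad JSON; never raise."""
    prompt = PROMPT + text[:2000]  # truncate input to 2,000 characters
    try:
        raw = generate(prompt, max_tokens=300, temperature=0.0)
        try:
            return json.loads(raw)
        except json.JSONDecodeError:
            # Single retry with JSON mode and a slightly higher temperature.
            raw = generate(prompt, max_tokens=300, temperature=0.3, json_mode=True)
            return json.loads(raw)
    except Exception:
        # Covers model errors and a second parse failure alike.
        logger.warning("llm_ner_failed", exc_info=True)
        return {}
```

Because every failure path returns an empty dict, a caller can treat LLM NER as best-effort and fall back to regex results alone.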

As of v2.7.0, no production code path passes use_llm=True. remember(), build(), and NoteConstructor all call extract_all(use_llm=False) (the default), so the LLM NER path is implemented but dormant. Conversational entity types (person, location, organization, event, activity, temporal) are populated only by the limited regex fallbacks above.

EntityIndexer persistence (complete)

EntityIndexer maintains an inverted index: {entity_type -> {entity_value -> Set[note_id]}}. Writes are batched via a 5-second threading.Timer (deferred flush). An atexit handler flushes on process exit. build() reindexes all notes synchronously with regex-only extraction.
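The inverted index and deferred flush can be illustrated with a small sketch. The class name `DeferredIndex`, its methods, and the JSON layout are assumptions; the lock is an addition of this sketch and is not claimed to exist in EntityIndexer.

```python
import atexit
import json
import threading
from collections import defaultdict
from pathlib import Path


class DeferredIndex:
    """Hypothetical sketch: {entity_type -> {entity_value -> set(note_id)}}."""

    def __init__(self, path, delay: float = 5.0):
        self.path = Path(path)
        self.delay = delay
        self.index = defaultdict(lambda: defaultdict(set))
        self._lock = threading.Lock()  # added in this sketch for safety
        self._timer = None
        atexit.register(self.flush)    # final flush on process exit

    def add(self, entity_type: str, value: str, note_id: str) -> None:
        with self._lock:
            self.index[entity_type][value].add(note_id)
            if self._timer is None:
                # Batch writes: one flush fires `delay` seconds after the first add.
                self._timer = threading.Timer(self.delay, self.flush)
                self._timer.daemon = True
                self._timer.start()

    def lookup(self, entity_type: str, value: str) -> set:
        with self._lock:
            return set(self.index.get(entity_type, {}).get(value, set()))

    def flush(self) -> None:
        with self._lock:
            if self._timer is not None:
                self._timer.cancel()
                self._timer = None
            # Sets are not JSON-serializable, so store sorted lists.
            data = {
                etype: {value: sorted(ids) for value, ids in values.items()}
                for etype, values in self.index.items()
            }
            self.path.write_text(json.dumps(data), encoding="utf-8")
```

Several `add()` calls within the delay window result in a single write, which is the point of deferring the flush.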

As of v2.2.0, entity data exists in two stores: entity_index.json (the primary runtime store) and a SQLite entity_index table. The two are not kept in sync by a single authoritative writer. This is a known inconsistency.

Tests

tests/test_conversational_entities.py has 27 tests across 8 classes (5 more than the 22 named at RFC write time — TestFreeTextPersonNotExtracted was added after the adversarial review):

Class	Tests	Notes
`TestRegexExtraction`	5	CVE, actor, tool, campaign, no-match
`TestLLMExtraction`	3	CI-skipped (llama-cpp segfaults in CI)
`TestHybridExtraction`	2	CI-skipped
`TestNERParsing`	3	JSON parse logic with mocked output
`TestNoteConstructorDelegation`	2	Delegation wiring
`TestInferEntityType`	5	CVE, APT, tool, type hints, unknown
`TestEntityIndexerConversational`	3	All types present, add/lookup, cross-type search
`TestFreeTextPersonNotExtracted`	4	Stopword filter regression tests (added post-RFC)

LLM extraction tests remain CI-skipped. There are no automated regression tests for IOC extraction, the hash false-positive filter, remove_note(), or search_entities().

What did not ship

Step 4: SynthesisGenerator in benchmark. locomo_benchmark.py still returns raw context as the answer; the step that distills a focused answer from that context has not been implemented. This is the remaining item with the highest expected impact.

LLM NER activation. use_llm=True has never been passed by any code path. The conversational entity types remain effectively regex-only in production.

Async batching. At 2–3 seconds per LLM NER call, batch ingestion with LLM NER enabled would be slow. No async batching implementation has been scheduled.

EntityIndexer backend unification. The JSON/SQLite dual-write inconsistency has not been resolved.

LOCOMO benchmark results

The RFC targeted 80%+ LOCOMO accuracy. Actual results:

Version	Score	Judge	Primary driver
v1.5.0	15%	Keyword overlap	Baseline (regex-only CTI)
v2.0.0	18%	Keyword overlap	Entity schema expansion
v2.1.1	22%	Ollama cloud model	Supersession perf fix, file locking, cloud judge
v2.7.0	7%	Keyword judge	Regression vs v2.1.1 Ollama judge; methodology change

The 80% target was not achieved. Analysis:

Entity extraction alone is insufficient. Without Step 4 (SynthesisGenerator distilling focused answers), the judge evaluates a wall of context and scores low even when the answer is present.
The 3B model ceiling. Qwen2.5-3B handles common entities adequately but struggles with implicit references and abstract events. LOCOMO relies heavily on these.
LLM NER was never activated. Conversational entity extraction has not been deployed, so the core RFC value proposition (LLM-powered NER improving multi-hop and temporal accuracy) remains untested in production.
Most of the v2.1.1 gain came from performance fixes and a better judge, not from entity extraction.

The v2.7.0 7% keyword-judge baseline is not directly comparable to the v2.1.1 22% Ollama-judge result. The methodology change (keyword judge vs LLM judge) accounts for most of the apparent regression. See Reproduce Benchmarks for how to replicate either result.

Decision

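Accepted (2026-04-09). Adversarial review completed 2026-04-16. Two blockers were identified: the load() log event name, which has been corrected, and the never-passed use_llm=True flag, which remains open. One warning, the missing attack_pattern regex capture group, was also fixed. Remaining findings are documented as open issues.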

Status as of v2.7.0: Partially Implemented. Steps 1–3 are complete. Step 4 (SynthesisGenerator in benchmark) is not started. LLM NER activation is deferred.

Adversarial review findings (2026-04-16)

#	Severity	Finding	Status
1	BLOCKER	`use_llm=True` never called — LLM NER is dead code	Open — fix pending
2	BLOCKER	`load()` logs `entity_index_save_failed` instead of `load_failed`	Fixed
3	WARNING	`attack_pattern` regex lacked capture group	Fixed
4	WARNING	`save()` truncates before flock — race condition	Open
5	WARNING	`remove_note()` deletes entity type keys	Open
6	WARNING	`_flush_sync()` not thread-safe on `self.index`	Open
7	WARNING	`extract_all(use_llm=True)` overwrites regex results	Open
8	WARNING	No tests for IOCs, hash filter, `remove_note`, `search_entities`	Open
9	WARNING	Root cause analysis omitted LLM NER not activated	Fixed in RFC
10	NIT	Test count stated as 18, actual was 23 (now 27 in v2.7.0)	Fixed in RFC
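Finding 4 describes a common ordering problem: if a file is truncated on open and the lock is taken only afterwards, a concurrent reader can observe an empty file. One conventional remedy, sketched here with a hypothetical `save_locked()` helper (the actual `save()` code is not shown in this RFC), is to open without truncating, acquire the lock, and only then truncate. The sketch is POSIX-only because it relies on `fcntl`.

```python
import fcntl
import json
import os


def save_locked(path: str, data: dict) -> None:
    """Write JSON so that truncation happens only while the lock is held."""
    # O_CREAT without O_TRUNC: existing content stays intact until we hold the lock.
    fd = os.open(path, os.O_RDWR | os.O_CREAT, 0o644)
    with os.fdopen(fd, "r+", encoding="utf-8") as fh:
        fcntl.flock(fh, fcntl.LOCK_EX)
        try:
            fh.seek(0)
            fh.truncate()          # safe now: other lock holders are excluded
            json.dump(data, fh)
            fh.flush()
            os.fsync(fh.fileno())  # push the bytes to disk before unlocking
        finally:
            fcntl.flock(fh, fcntl.LOCK_UN)
```

This only helps if readers also take the lock (for example a shared `LOCK_SH`) before reading; advisory locks do not constrain processes that ignore them.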

Recommendations for future work

Wire use_llm=True into the remember() ingest path (highest-priority fix — unlocks the core RFC value proposition).
Implement Step 4 (SynthesisGenerator in locomo_benchmark.py) — highest expected LOCOMO impact.
Add mock-based NER regression tests that cover extract_llm() without a live model. TestNERParsing already covers parse logic; extend to the full extraction path.
Unify EntityIndexer storage to write through StorageBackend and eliminate the JSON/SQLite dual-write inconsistency.
Evaluate RFC-002 (Universal LLM Provider) impact: once a larger model is accessible through a stable provider interface, re-run LOCOMO to measure the NER quality ceiling independent of the 3B model constraint.
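As an illustration of the mock-based regression tests recommended above, the sketch below exercises an extraction path without a live model. The `extract_llm()` function here is a simplified hypothetical stand-in so that the example is self-contained; real tests would target EntityExtractor.extract_llm() and its actual signature.

```python
import json
import unittest
from unittest.mock import MagicMock


def extract_llm(text, llm_client):
    """Hypothetical stand-in for the method under test (assumption)."""
    try:
        raw = llm_client.generate(text[:2000], max_tokens=300, temperature=0.0)
        return json.loads(raw)
    except Exception:
        return {}


class TestExtractLLMWithMock(unittest.TestCase):
    def test_parses_model_json(self):
        client = MagicMock()
        client.generate.return_value = '{"person": ["Caroline"], "temporal": ["last May"]}'
        result = extract_llm("Caroline moved last May.", client)
        self.assertEqual(result, {"person": ["Caroline"], "temporal": ["last May"]})
        client.generate.assert_called_once()

    def test_truncates_long_input(self):
        client = MagicMock()
        client.generate.return_value = "{}"
        extract_llm("x" * 5000, client)
        prompt = client.generate.call_args.args[0]
        self.assertEqual(len(prompt), 2000)

    def test_model_failure_returns_empty_dict(self):
        client = MagicMock()
        client.generate.side_effect = RuntimeError("model unavailable")
        self.assertEqual(extract_llm("any text", client), {})
```

Because the client is mocked, tests of this shape would not depend on llama-cpp and so would not need to be skipped in CI.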