Entity indexer concurrency¶

Module: zettelforge.entity_indexer
ZettelForge 2.7.0

The EntityIndexer maintains an in-memory index mapping entity values to note IDs, persisted as a JSON file on disk. It is thread-safe for concurrent reads and writes, with the exception of build(), which must run without concurrent callers. This page documents the locking model, atomicity guarantees, and the regression history (RFC-001 Warnings 4, 5, and 6) that drove the current implementation.

from zettelforge.entity_indexer import EntityIndexer, EntityExtractor

Index structure¶

The index is a three-level dict, serialized to entity_index.json in the ZettelForge data directory:

entity_type  →  entity_value  →  set[note_id]

Example in-memory state after indexing a note about APT28 using Cobalt Strike (on disk, save() writes each set as a JSON list):

{
    "actor":  {"apt28": {"note_abc123"}},
    "tool":   {"cobalt-strike": {"note_abc123"}},
    "cve":    {},
    # ... 16 more entity-type buckets (all 19 ENTITY_TYPES present)
}

All 19 entity-type buckets are present from __init__ onward, even when empty. See 19-type invariant.

Concurrency guarantees¶

All mutations and persistence operations are serialized through a single threading.RLock (_flush_lock). Read-only lookup methods are not locked and rely on the CPython GIL for safety.

Operation	Locked	Notes
`add_note(note_id, entities)`	Yes (RLock)	Full mutation + schedules flush
`remove_note(note_id)`	Yes (RLock)	Full mutation + schedules flush
`save()`	Yes (RLock + `fcntl.flock`)	Snapshot under RLock; atomic rename
`load()`	Yes (RLock)	Called once in `__init__`
`_flush_sync()`	Yes (RLock)	Dirty-check + save + clear as one atomic block
`get_note_ids(entity_type, entity_value)`	No	Read-only; safe under CPython GIL
`search_entities(query, limit)`	No	Read-only; safe under CPython GIL
`stats()`	No	Read-only dict comprehension; safe under CPython GIL
`build()`	No	Sequential rebuild; not concurrent-safe

Why RLock and not Lock?
add_note() and remove_note() call _schedule_flush() while already holding _flush_lock. _schedule_flush() also acquires _flush_lock to check timer state. A plain threading.Lock would deadlock on this re-entrance. RLock allows the same thread to re-acquire it.
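
The re-entrance difference can be observed in isolation with the standard library alone. The sketch below is illustrative only: the helper name can_reacquire is hypothetical and not part of ZettelForge. It probes whether the thread that already holds a lock can take it a second time, using a non-blocking acquire so the plain Lock case reports failure instead of hanging.

```python
import threading


def can_reacquire(lock) -> bool:
    """Return True if the thread holding `lock` can acquire it a second time."""
    with lock:
        # Non-blocking attempt: a plain Lock returns False here instead of deadlocking.
        acquired = lock.acquire(blocking=False)
        if acquired:
            lock.release()
        return acquired


rlock_result = can_reacquire(threading.RLock())  # re-entrant lock
lock_result = can_reacquire(threading.Lock())    # non-re-entrant lock
```

With a blocking acquire, the second case is exactly the deadlock that the nested _schedule_flush() call would hit under a plain Lock.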

Cross-process safety
File writes use fcntl.flock(LOCK_EX) on the temp file to prevent two processes from clobbering each other's serialized index. The temp file is then moved into place with os.replace(), which is atomic on POSIX.

Locking flow¶

add_note() / remove_note()
  │
  ├─ acquire _flush_lock (RLock)
  │    mutate self.index
  │    set self._dirty = True
  │    _schedule_flush()
  │      acquire _flush_lock (re-entrant, no deadlock)
  │        start Timer(5.0, _flush_sync) if not already running
  │      release _flush_lock
  └─ release _flush_lock

_flush_sync()  ← called by Timer or atexit
  │
  ├─ acquire _flush_lock
  │    if self._dirty:
  │      self.save()    ← snapshot + atomic rename
  │      self._dirty = False
  └─ release _flush_lock

save()
  │
  ├─ acquire _flush_lock
  │    snapshot: {k: {kk: list(vv) for ...} for ...}
  └─ release _flush_lock
       │
       ├─ tempfile.mkstemp(prefix=".entity_index.", dir=...)
       ├─ fcntl.flock(fd, LOCK_EX)
       │    json.dump(data, f)
       │    os.fsync(f.fileno())
       │    fcntl.flock(fd, LOCK_UN)
       └─ os.replace(tmp_path, index_path)   ← atomic on POSIX

Atomicity¶

In-process¶

The dict comprehension in save() runs inside _flush_lock:

with self._flush_lock:
    data = {k: {kk: list(vv) for kk, vv in v.items()} for k, v in self.index.items()}

A concurrent add_note() blocks on the same lock and cannot modify self.index while this snapshot is in progress.
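
A minimal sketch of the snapshot-under-lock idea, independent of ZettelForge. The function name snapshot and the module-level lock are hypothetical, and sorted() is used in place of the source's list() only to make the output order stable. The point is that the copy holds plain lists, so it can be serialized after the lock is released while the live index keeps changing.

```python
import json
import threading


def snapshot(index: dict, lock) -> dict:
    """Copy a type -> value -> set[note_id] index into JSON-friendly lists."""
    with lock:
        # Assumption: sorted() for deterministic output; the source uses list().
        return {
            etype: {value: sorted(ids) for value, ids in values.items()}
            for etype, values in index.items()
        }


lock = threading.RLock()
index = {"actor": {"apt28": {"note_abc123"}}, "cve": {}}
data = snapshot(index, lock)

# A later mutation of the live index does not leak into the snapshot.
with lock:
    index["actor"]["apt28"].add("note_def456")

serialized = json.dumps(data)
```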

File write¶

save() uses write-to-temp then atomic rename:

1. tempfile.mkstemp(prefix=".entity_index.", suffix=".json.tmp", dir=...): the temp file is created in the same directory as index_path, so the rename stays on one filesystem.
2. fcntl.flock(fd, LOCK_EX) on the temp file fd.
3. json.dump(data, f), then os.fsync(f.fileno()).
4. fcntl.flock(fd, LOCK_UN).
5. os.replace(tmp_path, index_path): atomic rename on POSIX.

A crash between steps 3 and 5 leaves index_path untouched. The temp file remains as an orphan, which is harmless to subsequent loads.

Prior behavior (RFC-001 Warning 4): the old save() opened index_path directly in "w" mode, which truncates the file before acquiring flock. A crash mid-write left index_path empty or partial. The current implementation never opens index_path for writing; it only swaps the finished temp file into place with os.replace().

If json.dump raises an exception, the temp file is deleted (os.unlink) and the exception propagates. No partial data is left on disk.
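
The write-to-temp, fsync, and rename sequence described above can be sketched as a standalone function. This is an illustration under assumptions, not the ZettelForge source: the name atomic_write_json is hypothetical, the explicit f.flush() is added so buffered data reaches the file descriptor before os.fsync(), and fcntl limits the sketch to POSIX platforms.

```python
import fcntl
import json
import os
import tempfile


def atomic_write_json(index_path: str, data: dict) -> None:
    """Write `data` as JSON to `index_path` via a temp file and atomic rename."""
    directory = os.path.dirname(os.path.abspath(index_path))
    # Same directory as the target, so os.replace() stays on one filesystem.
    fd, tmp_path = tempfile.mkstemp(
        prefix=".entity_index.", suffix=".json.tmp", dir=directory
    )
    try:
        with os.fdopen(fd, "w", encoding="utf-8") as f:
            fcntl.flock(f.fileno(), fcntl.LOCK_EX)
            try:
                json.dump(data, f)
                f.flush()              # push Python's buffer to the OS
                os.fsync(f.fileno())   # push the OS cache to disk
            finally:
                fcntl.flock(f.fileno(), fcntl.LOCK_UN)
        os.replace(tmp_path, index_path)  # readers see old or new, never partial
    except BaseException:
        # On any failure, remove the temp file and let the error propagate.
        try:
            os.unlink(tmp_path)
        except FileNotFoundError:
            pass
        raise
```

Because the target path is only ever touched by os.replace(), a failed serialization leaves the previous index file byte-for-byte intact.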

Background flush timer¶

The indexer batches writes to avoid thrashing disk during burst indexing:

def _schedule_flush(self) -> None:
    with self._flush_lock:
        if self._flush_timer is None or not self._flush_timer.is_alive():
            self._flush_timer = threading.Timer(5.0, self._flush_sync)
            self._flush_timer.daemon = True
            self._flush_timer.start()

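Batch window: a flush runs at most 5 seconds after the first unflushed mutation. Mutations that arrive while a timer is pending do not restart it; they are written by the already scheduled flush.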
The timer is a daemon thread; it does not block process exit.
atexit.register(self._flush_sync) ensures a final flush on clean shutdown even when the timer has not fired.
build() cancels any pending timer and calls save() synchronously before returning, so an index rebuild is always fully persisted when build() returns.
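
The batching behavior can be reproduced with a small stand-in class. Everything here is a sketch under assumptions: BatchedFlusher, mark_dirty, and flush_count are hypothetical names, the delay is shortened so the example finishes quickly, and the actual disk write is replaced by a counter. It follows the same shape as the _schedule_flush() excerpt above: one timer per burst, with the dirty check and clear performed under the lock.

```python
import threading


class BatchedFlusher:
    """Coalesce many mutations into one deferred flush (illustrative only)."""

    def __init__(self, delay: float = 5.0) -> None:
        self._lock = threading.RLock()
        self._timer = None
        self._dirty = False
        self._delay = delay
        self.flush_count = 0  # stands in for the number of disk writes

    def mark_dirty(self) -> None:
        with self._lock:
            self._dirty = True
            self._schedule_flush()  # re-enters the RLock held above

    def _schedule_flush(self) -> None:
        with self._lock:
            # Start a timer only if none is pending; later calls ride along.
            if self._timer is None or not self._timer.is_alive():
                self._timer = threading.Timer(self._delay, self._flush_sync)
                self._timer.daemon = True
                self._timer.start()

    def _flush_sync(self) -> None:
        with self._lock:
            if self._dirty:
                self.flush_count += 1
                self._dirty = False


flusher = BatchedFlusher(delay=0.2)
for _ in range(100):
    flusher.mark_dirty()          # burst of mutations inside one window
flusher._timer.join(timeout=2.0)  # wait for the single scheduled flush
```

A second call to _flush_sync() after the timer has fired is a no-op, because the dirty flag was cleared under the same lock that guards the write.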

19-type invariant¶

The constructor initializes all 19 entity-type buckets as empty dicts:

self.index: dict[str, dict[str, set[str]]] = {
    etype: {} for etype in EntityExtractor.ENTITY_TYPES
}

The invariant: set(self.index.keys()) == set(EntityExtractor.ENTITY_TYPES) holds for the lifetime of the object.

RFC-001 Warning 5: the previous remove_note() deleted the entity-type key when its value dict emptied. add_note() relied on the invariant to detect unknown entity types (entity_type not in self.index). Deleting the key broke that check and allowed ghost types to appear in later writes.

Current behavior: remove_note() prunes empty per-value sets but preserves the parent bucket:

def remove_note(self, note_id: str) -> None:
    with self._flush_lock:
        for entity_type in list(self.index.keys()):
            for entity_value in list(self.index[entity_type].keys()):
                self.index[entity_type][entity_value].discard(note_id)
                if not self.index[entity_type][entity_value]:
                    del self.index[entity_type][entity_value]   # prune value set
            # parent entity_type dict is NOT deleted even if empty
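
The invariant can be exercised with a reduced model of the index. This sketch is not the ZettelForge class: MiniIndex is a hypothetical name, ENTITY_TYPES is copied from the type tables on this page (the real ordering may differ), locking and persistence are left out, entity values are assumed to be already normalized, and silently skipping unknown types is an assumption about add_note().

```python
ENTITY_TYPES = (
    "cve", "intrusion_set", "actor", "tool", "campaign", "attack_pattern",
    "ipv4", "domain", "url", "md5", "sha1", "sha256", "email",
    "person", "location", "organization", "event", "activity", "temporal",
)


class MiniIndex:
    """Reduced model: entity_type -> entity_value -> set[note_id]."""

    def __init__(self) -> None:
        # Every bucket exists from construction onward, even when empty.
        self.index = {etype: {} for etype in ENTITY_TYPES}

    def add_note(self, note_id: str, entities: dict) -> None:
        for etype, values in entities.items():
            if etype not in self.index:
                continue  # assumption: unknown types are skipped
            for value in values:
                self.index[etype].setdefault(value, set()).add(note_id)

    def remove_note(self, note_id: str) -> None:
        for values in self.index.values():
            for value in list(values):
                values[value].discard(note_id)
                if not values[value]:
                    del values[value]  # prune the empty per-value set only


mini = MiniIndex()
mini.add_note("note_a", {"actor": ["apt28"], "ghost_type": ["x"]})
mini.add_note("note_b", {"actor": ["apt28"]})
mini.remove_note("note_a")
mini.remove_note("note_b")
```

After both removals the actor bucket is an empty dict rather than a missing key, so the membership check on entity types keeps working.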

EntityExtractor thread safety¶

EntityExtractor is stateless. Every method takes text as input and returns a new dict. It holds no mutable instance state.

Method	Thread-safe	Notes
`extract_regex(text)`	Yes	Pure regex, no side effects
`extract_llm(text)`	Yes	Calls LLM client; no shared state beyond the call
`extract_all(text, use_llm)`	Yes	Orchestrates the above

A single EntityExtractor instance is safe to share across threads without locking.

False-positive hash filtering¶

_filter_false_positive_hashes removes MD5/SHA1/SHA256 candidates that appear in code or VCS contexts. It is a deterministic, single-threaded operation with no concurrency implications.

Strategy: for each line of the input text, if the line matches _CODE_CONTEXT_PATTERN, every hex string (32-64 chars) on that line is excluded from results. Patterns covered:

Variable assignment: var = "hexstring"
Git log lines: commit, merge, tree, parent, Author:
Code fences: ```
Function definitions: def func_name
Function calls with hash arguments
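
A line-based filter of this kind might look like the following. The regular expressions are hypothetical approximations of the listed contexts, not the actual _CODE_CONTEXT_PATTERN, and filter_false_positive_hashes here is a standalone stand-in for the private method.

```python
import re

# Hex runs of 32 to 64 characters cover MD5 (32), SHA1 (40), and SHA256 (64).
HEX_RE = re.compile(r"\b[0-9a-fA-F]{32,64}\b")

# Assumed approximations of the code/VCS contexts listed above.
_CODE_CONTEXT_PARTS = [
    r"""\w+\s*=\s*["']""",              # variable assignment to a string literal
    r"(?:commit|merge|tree|parent)\b",  # git log lines
    r"Author:",                         # git log author line
    r"`{3}",                            # code fence
    r"def\s+\w+",                       # function definition
    r"\w+\(.*[0-9a-fA-F]{32,64}",       # function call with a hash argument
]
CODE_CONTEXT_RE = re.compile(r"^\s*(?:" + "|".join(_CODE_CONTEXT_PARTS) + ")")


def filter_false_positive_hashes(text: str) -> list[str]:
    """Return hex candidates, skipping every line that looks like code or VCS output."""
    kept = []
    for line in text.splitlines():
        if CODE_CONTEXT_RE.match(line):
            continue  # every hex string on this line is excluded
        kept.extend(HEX_RE.findall(line))
    return kept


sample = "\n".join([
    "Dropper hash: d41d8cd98f00b204e9800998ecf8427e",
    "commit a9993e364706816aba3e25717850c26c9cd0d89d",
    'checksum = "e3b0c44298fc1c149afbf4c8996fb92427ae41e4649b934ca495991b7852b855"',
])
kept_hashes = filter_false_positive_hashes(sample)
```

Only the hash on the prose line survives; the commit line and the assignment line are dropped as a whole.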

Entity types¶

The 19 recognized types fall into three categories:

CTI entities (regex fast-path)¶

Type	Example	Pattern notes
`cve`	`CVE-2024-3094`	`CVE-\d{4}-\d{4,}` case-insensitive
`intrusion_set`	`APT28`, `UNC2452`, `TA505`	Prefixes: apt/unc/ta/fin/temp
`actor`	`lazarus`, `sandworm`, `volt-typhoon`	Named match list
`tool`	`cobalt-strike`, `mimikatz`, `bloodhound`	Named match list
`campaign`	`operation-midnight`	`operation \w+`
`attack_pattern`	`T1059`, `T1059.001`	`T\d{4}(\.\d{3})?`
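
Two of the patterns in the table can be tried directly with the re module. The sketch assumes the patterns exactly as written in the table; the word boundaries, the non-capturing group, and the function name extract_cti are additions for illustration, and matches are returned as found, without whatever normalization the real extractor applies.

```python
import re

# Patterns from the table, wrapped in \b so they do not match inside longer tokens.
CVE_RE = re.compile(r"\bCVE-\d{4}-\d{4,}\b", re.IGNORECASE)
ATTACK_PATTERN_RE = re.compile(r"\bT\d{4}(?:\.\d{3})?\b")


def extract_cti(text: str) -> dict[str, list[str]]:
    """Return raw regex matches for two CTI entity types."""
    return {
        "cve": [m.group(0) for m in CVE_RE.finditer(text)],
        "attack_pattern": [m.group(0) for m in ATTACK_PATTERN_RE.finditer(text)],
    }


found = extract_cti("Exploits cve-2024-3094, then runs T1059.001 and T1059.")
```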

IOC / STIX Cyber Observables (regex fast-path)¶

Type	Example
`ipv4`	`192.168.1.1`
`domain`	`evil.example.com`
`url`	`https://malware.example/payload`
`md5`	`d41d8cd98f00b204e9800998ecf8427e`
`sha1`	`a9993e364706816aba3e25717850c26c9cd0d89d`
`sha256`	`e3b0c44298fc1c149afbf4c8996fb92427ae41e4649b934ca495991b7852b855`
`email`	`analyst@example.com`

Conversational entities (LLM NER, optional)¶

Active when extract_all(text, use_llm=True) is called. Person names are also extracted from dialogue format (Name: text) via regex, even without the LLM.

Type	Description
`person`	Named individuals
`location`	Cities, regions, countries
`organization`	Company, agency, or group names
`event`	Named events
`activity`	Named activities
`temporal`	Time expressions

Public API summary¶

EntityIndexer¶

Method	Signature	Thread-safe
`__init__`	`(index_path: str | None = None)`	No (constructor)
`load`	`() -> bool`	Yes (RLock; called once in `__init__`)
`save`	`() -> None`	Yes
`add_note`	`(note_id: str, entities: dict[str, list[str]]) -> None`	Yes
`remove_note`	`(note_id: str) -> None`	Yes
`get_note_ids`	`(entity_type: str, entity_value: str) -> list[str]`	Yes (GIL)
`search_entities`	`(query: str, limit: int = 10) -> dict[str, list[str]]`	Yes (GIL)
`stats`	`() -> dict`	Yes (GIL)
`build`	`() -> dict`	No (sequential)

search_entities returns a dict mapping entity types to lists of entity values that start with query (case-insensitive prefix match, up to limit results per type).
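
The described matching rule can be sketched over a plain dict. This is not the library method: prefix_search is a hypothetical name, and both the alphabetical ordering of results and the omission of types with no matches are assumptions the description above does not settle.

```python
def prefix_search(index: dict, query: str, limit: int = 10) -> dict[str, list[str]]:
    """Case-insensitive prefix match over entity values, capped per type."""
    needle = query.lower()
    results = {}
    for etype, values in index.items():
        # Assumption: alphabetical order, then truncate to `limit` per type.
        matches = [v for v in sorted(values) if v.lower().startswith(needle)][:limit]
        if matches:
            results[etype] = matches
    return results


index = {
    "actor": {"apt28": {"n1"}, "apt29": {"n2"}, "lazarus": {"n3"}},
    "tool": {"apktool": {"n4"}},
    "cve": {},
}
hits = prefix_search(index, "AP", limit=1)
```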

stats returns per-type counts:

{
    "actor": {"unique_entities": 3, "total_mappings": 5},
    "cve":   {"unique_entities": 1, "total_mappings": 2},
    ...
}
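
One plausible way to derive these counts from the index is shown below. The interpretation is an assumption: unique_entities is taken as the number of distinct entity values in a bucket, and total_mappings as the number of (entity value, note ID) pairs. The function name index_stats is hypothetical.

```python
def index_stats(index: dict) -> dict[str, dict[str, int]]:
    """Per-type counts under the assumed meaning of the two fields."""
    return {
        etype: {
            "unique_entities": len(values),
            "total_mappings": sum(len(note_ids) for note_ids in values.values()),
        }
        for etype, values in index.items()
    }


example_index = {
    "actor": {"apt28": {"n1", "n2"}, "lazarus": {"n1"}},
    "cve": {},
}
counts = index_stats(example_index)
```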

EntityExtractor¶

Method	Signature	LLM required
`extract_regex`	`(text: str) -> dict[str, list[str]]`	No
`extract_llm`	`(text: str) -> dict[str, list[str]]`	Yes
`extract_all`	`(text: str, use_llm: bool = False) -> dict[str, list[str]]`	Optional

Regression tests¶

tests/test_entity_indexer_races.py — 8 tests covering RFC-001 Warnings 4, 5, and 6. All 8 passed on ZettelForge 2.7.0 (run time: 0.09s).

Warning 4: atomic save¶

Verifies that save() uses a temp-then-rename pattern and leaves no orphan files on success or on serialization failure:

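# Excerpts from the test class; module-level imports shown for completeness.
import os

import pytest

def test_save_uses_atomic_rename_pattern(self, indexer, monkeypatch):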
    observed_replaces = []
    real_replace = os.replace

    def _spy(src, dst):
        observed_replaces.append((str(src), str(dst)))
        return real_replace(src, dst)

    monkeypatch.setattr("zettelforge.entity_indexer.os.replace", _spy)
    indexer.add_note("note_a", {"actor": ["APT28"]})
    indexer.save()

    assert any(str(indexer.index_path) == dst for _, dst in observed_replaces)

def test_save_cleans_up_temp_file_on_serialize_failure(self, indexer, tmp_path, monkeypatch):
    indexer.add_note("note_a", {"actor": ["APT28"]})

    def _boom(*_args, **_kwargs):
        raise RuntimeError("simulated FS error")

    monkeypatch.setattr("zettelforge.entity_indexer.json.dump", _boom)
    with pytest.raises(RuntimeError):
        indexer.save()

    leftovers = [p for p in tmp_path.iterdir() if p.name.startswith(".entity_index.")]
    assert leftovers == []

Warning 5: 19-type invariant¶

def test_remove_note_preserves_empty_type_bucket(self, indexer):
    indexer.add_note("note_a", {"actor": ["APT28"], "tool": ["Cobalt Strike"]})
    indexer.remove_note("note_a")

    assert "actor" in indexer.index          # bucket preserved
    assert indexer.index["actor"] == {}      # empty but present
    assert set(indexer.index.keys()) == set(EntityExtractor.ENTITY_TYPES)

def test_remove_note_prunes_empty_per_value_sets(self, indexer):
    indexer.add_note("note_a", {"actor": ["APT28"]})
    indexer.remove_note("note_a")
    assert "apt28" not in indexer.index["actor"]   # per-value set pruned

Warning 6: thread-safe save + concurrent add¶

import threading

def test_save_during_concurrent_add_does_not_raise(self, indexer):
    errors = []
    stop = threading.Event()

    def writer():
        try:
            i = 0
            while not stop.is_set() and i < 500:
                indexer.add_note(f"note_{i}", {"actor": [f"APT{i % 5}"], "tool": [f"tool{i % 7}"]})
                i += 1
        except Exception as exc:  # surface thread failures to the main thread
            errors.append(exc)

    def saver():
        try:
            j = 0
            while not stop.is_set() and j < 50:
                indexer.save()
                j += 1
        except Exception as exc:  # e.g. "dictionary changed size during iteration"
            errors.append(exc)

    t1 = threading.Thread(target=writer)
    t2 = threading.Thread(target=saver)
    t1.start()
    t2.start()
    t1.join(timeout=10)
    t2.join(timeout=10)
    stop.set()  # tell any thread still running after the timeout to exit

    assert errors == []
    assert "actor" in indexer.index
    assert all(isinstance(v, dict) for v in indexer.index.values())