Reproduce benchmarks

ZettelForge ships benchmark scripts you can run against a fresh install and compare to the reference numbers below. All benchmarks use the same package you installed via pip install zettelforge; no special build is required.

These reference numbers are a snapshot from prior ZettelForge versions (v2.0.0–v2.2.0 for most suites, plus a v2.7.0 performance session in June 2026). Your numbers will vary with hardware, LLM provider, and embedding model.

Prerequisites

Install ZettelForge and clone the source repository (the benchmark scripts live in benchmarks/):

pip install zettelforge
git clone https://github.com/rolandpg/zettelforge.git
cd zettelforge

Set the deterministic-ingestion flag before running any benchmark. The flag suppresses background LLM enrichment so results are reproducible across runs:

export ZETTELFORGE_ENRICHMENT_ENABLED=false

The benchmark scripts also set this flag programmatically, but setting it in the shell avoids subtle timing differences when the environment is not inherited.
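One way to make sure the flag reaches the benchmark process is to pass the environment explicitly when launching it. The sketch below is a minimal illustration and not part of ZettelForge: `benchmark_env` and `run_benchmark` are hypothetical helper names, and only the variable name and the script path come from this page.

```python
import os
import subprocess
import sys

FLAG = "ZETTELFORGE_ENRICHMENT_ENABLED"


def benchmark_env(base=None):
    """Return a copy of the environment with background enrichment disabled."""
    env = dict(os.environ if base is None else base)
    env[FLAG] = "false"
    return env


def run_benchmark(script, *args):
    """Run one benchmark script in a child process that receives the flag."""
    cmd = [sys.executable, script, *args]
    return subprocess.run(cmd, env=benchmark_env(), check=True)


# Example (run from the repository root):
# run_benchmark("benchmarks/cti_retrieval_benchmark.py")
```

Passing `env=` explicitly means the child process sees the flag even if the calling shell never exported it.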

1. CTI retrieval benchmark

This is the primary domain benchmark. It tests the queries an analyst would actually ask: threat actor attribution, CVE linkage, tool mapping, campaign tracking, and temporal reasoning across 8 real-world-style CTI reports and 20 queries.

No external data is required. The CTI corpus is embedded in the script.

python benchmarks/cti_retrieval_benchmark.py

Reference result (v2.0.0, 2026-04-10):

Category	Queries	Accuracy
Attribution	5	100%
Multi-hop	3	100%
CVE linkage	4	75%
Temporal	3	66.7%
Tool attribution	5	40%
Overall	20	75.0%

p50 latency: 620 ms — 8 notes stored
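The overall figure is the query-weighted mean of the per-category accuracies. A minimal sketch of that arithmetic, using only the numbers in the table above (the variable names are illustrative):

```python
# (queries, accuracy %) per category, copied from the reference table.
categories = {
    "Attribution": (5, 100.0),
    "Multi-hop": (3, 100.0),
    "CVE linkage": (4, 75.0),
    "Temporal": (3, 66.7),
    "Tool attribution": (5, 40.0),
}

total_queries = sum(n for n, _ in categories.values())
weighted_sum = sum(n * acc for n, acc in categories.values())
overall = weighted_sum / total_queries

print(f"{total_queries} queries, overall accuracy {overall:.1f}%")
# prints: 20 queries, overall accuracy 75.0%
```

Running the same calculation on your own per-category output is a quick check that an overall number was aggregated the same way.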

Tool attribution scores 40% because queries that expect multiple tools ("What tools does APT28 use?") require all tool names to appear in the retrieved context. When a report mentions tools across multiple sentences, the keyword judge scores partial matches as 0.5 rather than 1.0.
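The scoring rule described above can be pictured as a three-level keyword judge. This is a hedged sketch and not the judge shipped in benchmarks/: the function name `keyword_score`, the case-insensitive substring matching, and the 0.0 score for no matches are assumptions made for illustration, and the context sentence is a made-up example.

```python
def keyword_score(expected_keywords, retrieved_context):
    """Score 1.0 if every expected keyword appears, 0.5 if only some do, else 0.0."""
    context = retrieved_context.lower()
    hits = sum(1 for kw in expected_keywords if kw.lower() in context)
    if hits == len(expected_keywords):
        return 1.0
    if hits > 0:
        return 0.5
    return 0.0


# Illustrative context only; not taken from the benchmark corpus.
context = "APT28 deployed X-Agent against the target network."
print(keyword_score(["X-Agent"], context))             # 1.0
print(keyword_score(["X-Agent", "Zebrocy"], context))  # 0.5
print(keyword_score(["Mimikatz"], context))            # 0.0
```

Under a rule like this, a query that expects several tools drops to 0.5 as soon as one tool name is missing from the retrieved text.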

CTI p50 with v2.7.0 optimizations

A June 2026 performance session on DGX Spark GB10 (keyword judge, no LLM) measured v2.7.0 at 79 ms p50 on the CTI corpus; an optimized build reached 39 ms p50. Accuracy held at 75.0% in both cases.

2. LOCOMO benchmark

LOCOMO (ACL 2024) is a conversational memory benchmark — not ZettelForge's design domain. It tests recall over personal dialogue: person names, hobbies, life events. ZettelForge's entity extractor is tuned for CTI entities (CVEs, APT groups, tools), so graph traversal rarely fires on conversational queries.

Requires the LOCOMO dataset at:

~/.openclaw/workspace-nexus/Locomo-Plus/data/locomo10.json

Download the locomo10.json file from the LoCoMo project and place it at that path before running.
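Before a long run, it can help to confirm that the file is where the scripts expect it. A minimal sketch, assuming the path shown above; `dataset_ready` is a hypothetical helper and not part of the benchmark scripts.

```python
from pathlib import Path

# Path taken from this page; expanduser() resolves the leading "~".
LOCOMO_PATH = Path(
    "~/.openclaw/workspace-nexus/Locomo-Plus/data/locomo10.json"
).expanduser()


def dataset_ready(path=LOCOMO_PATH):
    """Return True when the dataset file exists and is not empty."""
    return path.is_file() and path.stat().st_size > 0


if not dataset_ready():
    print(f"LOCOMO dataset not found at {LOCOMO_PATH}")
```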

# Quick run (20 samples per category, keyword judge)
python benchmarks/locomo_benchmark.py

# Full dataset (100 QA pairs)
python benchmarks/locomo_benchmark.py --full

# Use Ollama as judge for more accurate scoring
python benchmarks/locomo_benchmark.py --judge ollama

Reference results:

Version	Judge	Accuracy	p50 latency
v1.3.0	keyword	14.0%	238 ms
v1.5.0	keyword	15.0%	344 ms
v2.0.0	keyword	18.0%	1,240 ms
v2.1.1	Ollama	22.0%	—

Why LOCOMO scores are low

Conversational entities (person names, hobbies) are not in ZettelForge's extraction vocabulary. Additionally, the supersession logic marks 264 of 272 LOCOMO sessions as superseded because sessions share speaker names. The benchmark runs with exclude_superseded=False to work around this. ZettelForge is built for CTI, not chatbot memory — the 75% CTI accuracy reflects the intended use case.
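To see why the workaround matters, consider a toy model of the filter. The note structure, the `retrievable` helper, and the default value below are hypothetical and are not ZettelForge's API; only the `exclude_superseded` name and the 264-of-272 count come from the text above.

```python
# Toy stand-in for the 272 LOCOMO sessions, 264 of them flagged as superseded.
sessions = [{"id": i, "superseded": i < 264} for i in range(272)]


def retrievable(notes, exclude_superseded=True):
    """Return the notes a retriever would be allowed to search."""
    if exclude_superseded:
        return [n for n in notes if not n["superseded"]]
    return list(notes)


print(len(retrievable(sessions)))                            # 8
print(len(retrievable(sessions, exclude_superseded=False)))  # 272
```

With the filter on, only 8 sessions would remain searchable, which is why the benchmark turns it off.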

v2.7.0 baseline (keyword judge)

A June 2026 performance session on DGX Spark measured v2.7.0 at 7.0% accuracy with the keyword judge (no LLM) at 336 ms p50, and 11.0% after optimization at 170 ms p50. The 22% figure (v2.1.1) used an Ollama cloud judge, which is a different scoring path, so the two are not directly comparable.

3. MemPalace comparison

Runs LOCOMO against MemPalace for a head-to-head comparison on the same dataset and scoring.

Requires the LOCOMO dataset (same path as above) and:

pip install chromadb mempalace

python benchmarks/mempalace_benchmark.py --samples 20

Reference result (v2.0.0 snapshot, 2026-04-10):

System	Accuracy	p50 latency
MemPalace	26%	130 ms
ZettelForge	18%	1,240 ms

MemPalace wins on LOCOMO because it chunks at 800 characters and uses pure ChromaDB vector search with no intent classification, graph traversal, or blending overhead. On conversational data without CTI entities, ZettelForge's extra stages add latency without improving accuracy.

ZettelForge wins on CTI queries where graph traversal, entity indexing, and typed relationships are load-bearing. MemPalace has no knowledge graph and cannot answer multi-hop attribution queries.
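The 800-character chunking mentioned above amounts to fixed-size splitting. The sketch below illustrates the idea only; it is not MemPalace's code, and `chunk_text` is a hypothetical name.

```python
def chunk_text(text, size=800):
    """Split text into consecutive chunks of at most `size` characters."""
    if size <= 0:
        raise ValueError("size must be positive")
    return [text[i:i + size] for i in range(0, len(text), size)]


dialogue = "x" * 2000  # placeholder for a conversation transcript
chunks = chunk_text(dialogue)
print([len(c) for c in chunks])  # [800, 800, 400]
```

Each chunk is then embedded and searched on its own, with no entity extraction or graph step in between.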

4. RAGAS retrieval quality

RAGAS measures retrieval quality, not answer quality. It computes keyword presence and string similarity between retrieved context and expected answers. No LLM judge is required.

Requires the LOCOMO dataset.

python benchmarks/ragas_benchmark.py

Reference result (v2.0.0, 2026-04-10):

Metric	v1.5.0	v2.0.0
Keyword presence	75.9%	78.1%
String similarity	17.7%	18.2%

The high keyword presence (78.1%) indicates that the retrieved context usually contains the relevant information. The LOCOMO accuracy gap is in answer extraction (the keyword judge scoring), not retrieval. ZettelForge retrieves the right passages; the local 3B model struggles to extract precise answers from them.
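The two metrics can be approximated with the standard library. This is a sketch under assumptions: the exact formulas in benchmarks/ragas_benchmark.py are not shown on this page, so the word-overlap definition, the use of `difflib.SequenceMatcher`, and the function names are illustrative, and the example strings are made up.

```python
from difflib import SequenceMatcher


def keyword_presence(expected_answer, context):
    """Fraction of distinct answer words that occur in the retrieved context."""
    words = set(expected_answer.lower().split())
    if not words:
        return 0.0
    context_words = set(context.lower().split())
    return len(words & context_words) / len(words)


def string_similarity(expected_answer, context):
    """Character-level similarity ratio in [0, 1] from difflib."""
    return SequenceMatcher(None, expected_answer.lower(), context.lower()).ratio()


answer = "hiking in Colorado"
context = "Last summer we went hiking near Boulder in Colorado with friends"
print(keyword_presence(answer, context))
print(round(string_similarity(answer, context), 3))
```

A short answer embedded in a longer passage scores high on keyword presence but low on string similarity, which is the same pattern as the reference numbers.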

5. CTIBench ATE

CTIBench (NeurIPS 2024) tests ATT&CK Technique Extraction: given a natural-language description of an adversarial technique, retrieve the correct MITRE ATT&CK T-code.

Requires the datasets Python package:

pip install datasets

The script downloads the AI4Sec/cti-bench dataset from HuggingFace automatically on first run. The ATT&CK matrix JSON files (enterprise-attack.json, mobile-attack.json) are included in the benchmarks/ directory.

python benchmarks/ctibench_benchmark.py --task ate

Reference result (v2.2.0, 2026-04-16):

Version	F1	Notes
v2.0.0	0.000	Adapter bug: T-code regex only
v2.2.0	0.146	Fixed ingestion; ICS noise removed

F1 = 0.146 should be read as a lower bound, driven by low recall: many ATT&CK techniques can be described by multiple paraphrases, and the scoring function rewards only exact T-code matches. A semantic matcher over technique descriptions would likely raise it.
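Exact-match scoring over T-codes can be sketched as set-based precision, recall, and F1. The regex, the function names, and the sample IDs below are assumptions for illustration and are not the CTIBench scorer itself.

```python
import re

# Matches technique IDs such as T1059 and sub-techniques such as T1059.001.
T_CODE = re.compile(r"\bT\d{4}(?:\.\d{3})?\b")


def extract_t_codes(text):
    """Return the set of ATT&CK technique IDs found in text."""
    return set(T_CODE.findall(text))


def f1_score(predicted, expected):
    """Exact-match F1 between two sets of technique IDs."""
    true_positives = len(predicted & expected)
    if true_positives == 0:
        return 0.0
    precision = true_positives / len(predicted)
    recall = true_positives / len(expected)
    return 2 * precision * recall / (precision + recall)


predicted = extract_t_codes("Retrieved: T1059.001, T1566")
expected = {"T1059.001", "T1566", "T1071", "T1105"}
print(round(f1_score(predicted, expected), 3))  # 0.667
```

In this example precision is perfect, yet two missed techniques halve the recall and pull F1 down to 0.667, which is the kind of effect described above.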

6. MemoryAgentBench (optional)

MemoryAgentBench (ICLR 2026) tests memory systems on four splits: Accurate Retrieval (AR), Conflict Resolution (CR), Test-Time Learning (TTL), and Long-Range Understanding (LRU).

Requires the datasets package and a capable LLM provider:

pip install datasets

# AR + CR splits (fast)
python benchmarks/memoryagentbench.py

# All splits (significantly longer)
python benchmarks/memoryagentbench.py --split all

Reference result (2026-04-10):

Split	Qwen2.5-3B (local)	nemotron-3-super (Ollama)
Accurate Retrieval F1	0.012	0.328
Conflict Resolution F1	0.012	0.032
Overall F1	0.007	0.180

Retrieval latency was 128–333 ms for both models, so the improvement comes entirely from answer generation quality, not retrieval. This indicates that the retrieval pipeline works and that the local 3B model is the bottleneck on this benchmark.

CR scores remain low because Conflict Resolution requires multi-hop chain reasoning (e.g., "country of citizenship of the spouse of the author of...") that needs explicit graph traversal over entity relationships.
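A multi-hop chain like the one quoted above can be answered by following one relation per hop. The sketch below uses a hypothetical in-memory triple store with placeholder names; it does not reflect ZettelForge's graph API.

```python
# Hypothetical triples: (subject, relation) -> object. All names are placeholders.
graph = {
    ("Book A", "author"): "Person B",
    ("Person B", "spouse"): "Person C",
    ("Person C", "country_of_citizenship"): "Country D",
}


def follow_chain(graph, start, relations):
    """Walk the relations in order; return None if any hop is missing."""
    node = start
    for relation in relations:
        node = graph.get((node, relation))
        if node is None:
            return None
    return node


chain = ["author", "spouse", "country_of_citizenship"]
print(follow_chain(graph, "Book A", chain))  # Country D
```

If any single relation is missing from the graph, the whole chain fails, which is why these questions are hard to answer from retrieved text alone.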

Reference numbers at a glance

Benchmark	Version	Key result
CTI Retrieval	v2.0.0	75.0% accuracy, 620 ms p50
LOCOMO	v2.0.0, keyword	18.0% accuracy
LOCOMO	v2.1.1, Ollama	22.0% accuracy
MemPalace vs ZettelForge	v2.0.0	MemPalace 26%, ZettelForge 18%
RAGAS keyword presence	v2.0.0	78.1%
CTIBench ATE	v2.2.0	F1 = 0.146
MemoryAgentBench	2026-04-10 run	Overall F1 = 0.180 (nemotron-3-super via Ollama)
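When comparing your own run to these numbers, a relative tolerance is easier to reason about than eyeballing. A minimal sketch: the metric keys, the `compare` helper, and the 10% default tolerance are arbitrary choices for illustration, and the reference values are copied from the table above.

```python
# Reference values copied from the table above.
REFERENCE = {
    "cti_accuracy_pct": 75.0,
    "locomo_keyword_accuracy_pct": 18.0,
    "ragas_keyword_presence_pct": 78.1,
    "ctibench_ate_f1": 0.146,
}


def compare(metric, measured, rel_tol=0.10):
    """Return (relative_difference, within_tolerance) against the reference value."""
    reference = REFERENCE[metric]
    diff = (measured - reference) / reference
    return diff, abs(diff) <= rel_tol


# Hypothetical measured value of 70.0% on the CTI benchmark.
diff, ok = compare("cti_accuracy_pct", 70.0)
print(f"{diff:+.1%} vs reference, within tolerance: {ok}")
```

Latency is left out on purpose, since it depends heavily on hardware and is better compared against your own earlier runs.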

What affects your results

Factor	Effect
LLM provider (local vs cloud judge)	LOCOMO accuracy varies widely; keyword judge is the reproducible baseline
`ZETTELFORGE_ENRICHMENT_ENABLED`	Must be `false` for deterministic ingestion
Embedding model	Results assume `nomic-ai/nomic-embed-text-v1.5-Q` (768-dim, ONNX)
Storage backend	SQLite is the default; all published results use SQLite
Hardware (thread count)	ONNX thread pinning affects latency; see BENCHMARK_REPORT.md section 0 for thread-tuning details

For ThreatRecall.ai SaaS deployments, benchmark numbers will differ because the SaaS uses managed infrastructure and an optional TypeDB graph backend. Contact support for SaaS-specific benchmarking guidance.