Reproduce the Published Benchmarks
Every benchmark referenced in BENCHMARK_REPORT.md
ships as an adapter script in benchmarks/. This guide shows how to
run each one against a fresh install and compare your results.
Prerequisites
- A clone of the repo (the benchmarks live under benchmarks/, not in the installed package).
- pip install -e ".[dev]" from the repo root.
- Optional: Ollama running locally if you want to reproduce the v2.1.1 LOCOMO 22 % result with a cloud judge.
Shared setup
git clone https://github.com/rolandpg/zettelforge
cd zettelforge
python -m venv .venv && source .venv/bin/activate
# Quote the extras so shells that glob on brackets (e.g. zsh) pass them through unchanged
pip install -e ".[dev]"
Each benchmark writes a JSON results file alongside its script so your output can be diffed against the reference files already in the repo.
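One way to diff a fresh results file against a reference is to flatten both JSON documents and report only the leaves that differ. The sketch below does not assume any particular schema for the results files. The function names and the `<missing>` placeholder are illustrative and are not part of the repo.

```python
import json
import math
from pathlib import Path

MISSING = "<missing>"  # placeholder for a key present on only one side


def flatten(obj, prefix=""):
    """Flatten nested dicts/lists into {path: leaf_value}."""
    if isinstance(obj, dict):
        items = [(f"{prefix}.{k}" if prefix else str(k), v) for k, v in obj.items()]
    elif isinstance(obj, list):
        items = [(f"{prefix}[{i}]", v) for i, v in enumerate(obj)]
    else:
        return {prefix: obj}
    flat = {}
    for path, value in items:
        flat.update(flatten(value, path))
    return flat


def _is_number(value):
    # bool is a subclass of int in Python, so exclude it explicitly
    return isinstance(value, (int, float)) and not isinstance(value, bool)


def diff_results(reference, candidate, rel_tol=0.0):
    """Return {path: (reference_value, candidate_value)} for leaves that differ.

    Numeric leaves are compared with math.isclose, so rel_tol=0.01 ignores
    differences of up to 1 %. Everything else is compared with ==.
    """
    ref, cand = flatten(reference), flatten(candidate)
    changed = {}
    for path in sorted(ref.keys() | cand.keys()):
        a, b = ref.get(path, MISSING), cand.get(path, MISSING)
        if _is_number(a) and _is_number(b):
            if math.isclose(a, b, rel_tol=rel_tol):
                continue
        elif a == b:
            continue
        changed[path] = (a, b)
    return changed


def diff_files(reference_path, candidate_path, rel_tol=0.0):
    """Load two JSON files and diff them."""
    reference = json.loads(Path(reference_path).read_text())
    candidate = json.loads(Path(candidate_path).read_text())
    return diff_results(reference, candidate, rel_tol=rel_tol)
```

To compare a new run with the committed reference, call `diff_files` with the two paths. A non-zero `rel_tol` helps with fields such as latency, which rarely match exactly between machines.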
1. CTI Retrieval (domain benchmark)
Published reference: 75.0 % accuracy, p50 620 ms over 8 reports / 20 queries.
2. LOCOMO (ACL 2024)
# Baseline with the local llama-cpp judge
python benchmarks/locomo_benchmark.py
# v2.1.1 cloud-judge run (requires Ollama + a suitable model pulled)
LLM_JUDGE=ollama python benchmarks/locomo_benchmark.py
Published references: 18.0 % with the local judge, 22.0 % with
the Ollama cloud judge. Both runs emit benchmarks/locomo_results.json.
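Because both runs write to the same path, the second run presumably replaces the output of the first. This is an inference from the shared filename, not something confirmed against the script. A small helper like the one below, called between the two runs, keeps a labelled copy of each result. The function name and the labels are hypothetical.

```python
import shutil
from pathlib import Path


def archive_results(results_path, label):
    """Copy e.g. locomo_results.json to locomo_results.<label>.json and return the new path."""
    src = Path(results_path)
    dest = src.with_name(f"{src.stem}.{label}{src.suffix}")
    shutil.copy2(src, dest)  # copy2 also preserves the file's modification time
    return dest
```

For example, `archive_results("benchmarks/locomo_results.json", "local")` after the baseline run and `archive_results("benchmarks/locomo_results.json", "ollama")` after the cloud-judge run leave two files that can be compared side by side.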
3. MemPalace comparison
Published reference: MemPalace 26.0 % vs ZettelForge 18.0 % on LOCOMO.
4. RAGAS retrieval quality
Published reference: 78.1 % keyword presence.
5. CTIBench ATE (NeurIPS 2024)
Published reference: F1 = 0.146 (v2.2.0, after fixing the ingestion pipeline and dropping ICS matrix noise). The v2.0.0 result was F1 = 0.000 — see BENCHMARK_REPORT §5 for the methodology change.
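For readers checking their own output by hand, here is a minimal sketch of F1 for a single example. It assumes a set-based comparison of predicted and gold labels. The benchmark's actual matching and averaging rules are described in BENCHMARK_REPORT and may differ from this. The function name and the sample label IDs are illustrative only.

```python
def f1_score(predicted, gold):
    """Set-based F1: harmonic mean of precision and recall over two label sets."""
    predicted, gold = set(predicted), set(gold)
    true_positives = len(predicted & gold)
    if true_positives == 0:
        # No overlap (or an empty set on either side) scores 0.0 by convention here
        return 0.0
    precision = true_positives / len(predicted)
    recall = true_positives / len(gold)
    return 2 * precision * recall / (precision + recall)


# One of two predictions is correct (precision 0.5) and one of three gold
# labels is found (recall 1/3), which gives an F1 of 0.4.
example = f1_score({"T1059", "T1566"}, {"T1059", "T1003", "T1071"})
```

Under this definition, a score of 0.000 means that essentially none of the predicted labels matched the gold labels.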
6. MemoryAgentBench (ICLR 2026, optional)
Requires a cloud-grade judge (we used nemotron-3-super:cloud via
Ollama). Expect runs on the larger splits to take on the order of a day.
Interpreting differences
If your numbers differ from the published ones, check:
- LLM backend: local llama-cpp runs will underperform cloud judges on LOCOMO (18.0 % vs 22.0 % in the published runs; see BENCHMARK_REPORT §6).
- Embedding dimensions: all published runs use 768-dim nomic-embed-text-v1.5-Q. Switching to another embedding changes the vector store and invalidates results until you run python scripts/rebuild_index.py.
- Backend: SQLite vs a lingering JSONL data directory changes entity-index behaviour. Ensure ZETTELFORGE_BACKEND=sqlite (the v2.2.0 default).
- Governance: set governance.enabled: false in config.yaml for benchmark runs; otherwise some permissive inputs will raise GovernanceViolationError.
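The backend and governance items in the checklist can be turned into a small pre-flight check. This is a sketch, not something shipped in the repo. It assumes config.yaml has already been parsed into a dict by the caller (for instance with PyYAML's `yaml.safe_load`) and that the governance flag sits at `governance.enabled`, as written above. The function name is hypothetical.

```python
import os


def preflight(env, config):
    """Return a list of warnings; an empty list means neither check found a problem."""
    warnings = []

    # An unset variable is treated as fine because sqlite is the v2.2.0 default.
    backend = env.get("ZETTELFORGE_BACKEND")
    if backend is not None and backend != "sqlite":
        warnings.append(f"ZETTELFORGE_BACKEND is {backend!r}, expected 'sqlite'")

    # Warn unless governance is explicitly disabled; the default is not assumed.
    governance = config.get("governance") or {}
    if governance.get("enabled") is not False:
        warnings.append("governance.enabled is not false in config.yaml")

    return warnings


if __name__ == "__main__":
    # Assumption: the parsed config would come from config.yaml; an empty
    # dict is used here only so the sketch runs without extra dependencies.
    for message in preflight(os.environ, {}):
        print("warning:", message)
```

Passing the environment and config in as arguments keeps the check easy to run against saved settings from an earlier benchmark run.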
Related
- BENCHMARK_REPORT.md — full methodology and published numbers
- LOCOMO_BENCHMARK_COMPARISON.md — head-to-head vs MemPalace / Mem0 / LangMem