LLM budgets, timeouts, and what they cost you¶

ZettelForge makes five distinct LLM calls: causal triple extraction, synthesis, fact extraction, conversational NER, and neighbor evolution. Each call site reads its own token budget from get_config().llm, and the Ollama provider sends that value as num_predict. The same llm.timeout value flows into ollama.Client(..., timeout=...).

Use this page to understand why the defaults are large, what the reasoning_model flag changes, and how to tune the numbers without editing source code. For the exact config keys, see the llm section of the Configuration Reference.

Set llm.model explicitly before you tune budgets. The upstream source default currently names an unresolved model tag, so this page does not treat it as a working reference model.

The token-starvation problem¶

ZettelForge cannot assume that every generated token becomes useful JSON. Reasoning-style models can spend generated tokens before the final answer, and long prompts can also consume the budget before a valid response is complete.

The observable failure is simple: the provider returns an empty or truncated response. The Ollama provider logs an empty completion as llm_call_empty_response and records the eval_count, done_reason, max_tokens, response_chars, and timing fields. When eval_count reaches max_tokens and done_reason is length, the generation hit the configured cap.

The pre-v2.5.2 defaults were smaller and used a 60 second timeout. The current defaults give generation more room and extend the read deadline so long causal extraction calls can finish instead of timing out.

Per-call-site budgets¶

Each call site reads a different config key, so you can raise the expensive path without increasing every LLM request.

Call site	Config key	Default	Env var	Module
Causal triple extraction	`max_tokens_causal`	8000	`ZETTELFORGE_LLM_MAX_TOKENS_CAUSAL`	`note_constructor.py`
Synthesis	`max_tokens_synthesis`	2500	`ZETTELFORGE_LLM_MAX_TOKENS_SYNTHESIS`	`synthesis_generator.py`
Fact extraction	`max_tokens_extraction`	2500	`ZETTELFORGE_LLM_MAX_TOKENS_EXTRACTION`	`fact_extractor.py`
Conversational NER	`max_tokens_ner`	2500	`ZETTELFORGE_LLM_MAX_TOKENS_NER`	`entity_indexer.py`
Neighbor evolution	`max_tokens_evolve`	2500	`ZETTELFORGE_LLM_MAX_TOKENS_EVOLVE`	`memory_evolver.py`
General fallback	`max_tokens`	400	`ZETTELFORGE_LLM_MAX_TOKENS`	`llm_client.py`

Why causal extraction gets the largest budget¶

Causal extraction asks the model to return a JSON array of subject, relation, and object triples from up to 2000 characters of note text. It also validates relation names against an allowlist. That is the broadest generation task in the current pipeline, so it gets the 8000 token default.

Why synthesis, extraction, NER, and evolution get 2500 tokens¶

These calls produce smaller outputs than causal extraction, but they still need enough budget for structured JSON. Conversational NER and neighbor evolution also retry parsing failures with the same per-call-site budget, so a too-small value can fail both the first attempt and the retry.

The shared timeout¶

llm.timeout defaults to 180.0 seconds. The Ollama provider passes it into ollama.Client(host=..., timeout=...), so it bounds the HTTP read for each Ollama generation call.

Raise token budgets and timeouts together. A larger token budget with a short timeout changes a token cap failure into a read timeout. A longer timeout with the same small token budget still leaves the model capped.

Override the timeout in config.yaml:

llm:
  timeout: 60.0

Or set it with an environment variable:

export ZETTELFORGE_LLM_TIMEOUT=60

The `reasoning_model` flag¶

When you set llm.reasoning_model: true, ZettelForge runs _apply_reasoning_model_scaling() after it loads config files and environment variables. The scaling step raises these values to minimum floors:

Field	Minimum when `reasoning_model` is true
`timeout`	180.0
`max_tokens_causal`	8000
`max_tokens_synthesis`	2500
`max_tokens_extraction`	2500
`max_tokens_ner`	2500
`max_tokens_evolve`	2500

Use it when your configured model needs reasoning headroom and you want to prevent a later config override from dropping the budgets below the verified floors.

llm:
  reasoning_model: true

The flag defaults to false. ZettelForge does not infer it from the model name.

When to override¶

Your configured model returns compact answers¶

If your model does not need reasoning headroom, start with smaller budgets and raise them only when the logs show cap exhaustion.

llm:
  timeout: 60.0
  max_tokens_causal: 2000
  max_tokens_synthesis: 600
  max_tokens_extraction: 600
  max_tokens_ner: 600
  max_tokens_evolve: 600

The matching environment variables let you test a change without editing config.yaml:

export ZETTELFORGE_LLM_TIMEOUT=60
export ZETTELFORGE_LLM_MAX_TOKENS_CAUSAL=2000
export ZETTELFORGE_LLM_MAX_TOKENS_SYNTHESIS=600

Your configured model still hits the cap¶

Raise only the call site that fails. For causal extraction, increase max_tokens_causal and give the request enough time to finish:

llm:
  max_tokens_causal: 12000
  timeout: 300.0

Verifying your budgets¶

The OCSF log at ~/.amem/logs/zettelforge.log records LLM call outcomes. Set AMEM_DATA_DIR to move the log directory.

At the default INFO log level, inspect empty completions:

grep '"event":"llm_call_empty_response"' ~/.amem/logs/zettelforge.log \
  | jq -r '"\(.model) eval=\(.eval_count) of max=\(.max_tokens) dur_ms=\(.duration_ms)"' \
  | tail -20

At DEBUG level, llm_call_complete records successful calls with the same token and timing fields. If successful calls use far fewer tokens than the configured maximum, you can lower the relevant budget for that call site.

What changed from v2.5.2 to v2.7.0¶

Release	Change
v2.5.2	`llm.timeout` increased from 60 seconds to 180 seconds. Per-call-site budgets increased to the current defaults.
v2.7.0	Per-call-site budgets became config-overridable through `LLMConfig` fields and environment variables. The `reasoning_model` flag now applies floor values after config load.
v2.7.0	Structured JSON parsing strips `<think>` and `<thinking>` blocks through `json_parse.strip_thinking_tags()`. Providers still return response text before JSON extraction, so non-JSON callers handle whatever text the provider returns.