Durable Execution
A headless agent can run for minutes and call an LLM dozens of times. Nodes get preempted, pods get evicted, processes crash. Durable execution is the guarantee that a crash mid-run does not lose work, re-spend tokens, or fire a side effect twice. A step may be attempted more than once internally, but its observable effects are exactly-once, not at-least-once.
Event-sourced journal
Work is decomposed into steps, and every step transition is appended to an event-sourced journal before the next step begins. The journal, not in-memory state, is authoritative. Anything that can take longer than ~100 ms or that calls an LLM is a step, and each step is idempotent and replayable.
run     step     journal events
────────────────────────────────────────────────
run_1   step_a   step_started → step_completed
run_1   step_b   step_started → step_completed
run_1   step_c   step_started   ◀── crash here
                 (no step_completed written)
Resume from the last completed step
On recovery the engine replays the journal and resumes from the last step_completed. In the trace above, step_a and step_b are not re-executed — their results are read back from the journal — and execution restarts at step_c. Completed LLM calls are not re-issued, so their tokens are not re-spent.
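The resume behavior described above can be sketched in a few lines. This is a minimal illustration, not the engine's actual implementation: `Journal`, `execute`, and `make_step` are hypothetical names, and an in-memory list stands in for durable storage.

```python
from typing import Any, Callable


class Journal:
    """Append-only event log (hypothetical; a list stands in for durable storage)."""

    def __init__(self) -> None:
        self.events: list[dict[str, Any]] = []

    def append(self, run_id: str, step_id: str, kind: str, result: Any = None) -> None:
        self.events.append({"run": run_id, "step": step_id, "kind": kind, "result": result})

    def completed(self, run_id: str) -> dict[str, Any]:
        """Map step_id -> recorded result for every step_completed event of a run."""
        return {
            e["step"]: e["result"]
            for e in self.events
            if e["run"] == run_id and e["kind"] == "step_completed"
        }


def execute(journal: Journal, run_id: str, steps: list[tuple[str, Callable[[], Any]]]) -> dict[str, Any]:
    """Run steps in order, reading completed results back from the journal."""
    done = journal.completed(run_id)
    results: dict[str, Any] = {}
    for step_id, fn in steps:
        if step_id in done:
            results[step_id] = done[step_id]  # replayed, not re-executed
            continue
        journal.append(run_id, step_id, "step_started")
        result = fn()  # a crash here leaves no step_completed behind
        journal.append(run_id, step_id, "step_completed", result)
        results[step_id] = result
    return results


# Simulate the trace above: step_c crashes on its first execution.
calls = {"step_a": 0, "step_b": 0, "step_c": 0}
crash_once = {"armed": True}


def make_step(name: str) -> Callable[[], str]:
    def step() -> str:
        calls[name] += 1
        if name == "step_c" and crash_once["armed"]:
            crash_once["armed"] = False
            raise RuntimeError("simulated crash")
        return f"{name}-result"
    return step


steps = [(n, make_step(n)) for n in ("step_a", "step_b", "step_c")]
journal = Journal()
try:
    execute(journal, "run_1", steps)
except RuntimeError:
    pass  # the worker died mid-step
results = execute(journal, "run_1", steps)  # recovery: replay, then resume at step_c
```

After recovery, `step_a` and `step_b` have each run once and only `step_c` has run twice, which is the property that keeps completed LLM calls from being re-issued.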
Side-effect dedup via idempotency keys
Replaying a step that performs an external side effect — a model API call, a webhook delivery, a Kubernetes create — must not duplicate it. Every external side effect carries an idempotency key derived from the tuple:
idempotency_key = (run_id, step_id, attempt)
Because the key is derived deterministically from these identifiers, a replay of the same step attempt regenerates the same key, so a retried delivery is recognized and de-duplicated downstream rather than sent twice. This is what makes "resume from last step" safe even when a step had already reached out to the outside world before crashing.
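As a rough sketch of how such a key could be built and honored downstream: the tuple is serialized unambiguously and hashed, and the receiver remembers keys it has already processed. The hashing scheme and the `WebhookReceiver` class are assumptions for illustration; the source only specifies the tuple itself.

```python
import hashlib
import json


def idempotency_key(run_id: str, step_id: str, attempt: int) -> str:
    """Derive a stable key from the (run_id, step_id, attempt) tuple.

    JSON-encoding the tuple avoids ambiguity if an id contains a separator.
    The use of SHA-256 here is an assumption, not the documented scheme.
    """
    raw = json.dumps([run_id, step_id, attempt]).encode("utf-8")
    return hashlib.sha256(raw).hexdigest()


class WebhookReceiver:
    """Hypothetical downstream service that de-duplicates by idempotency key."""

    def __init__(self) -> None:
        self.seen: dict[str, str] = {}
        self.deliveries = 0

    def deliver(self, key: str, payload: str) -> str:
        if key in self.seen:
            return self.seen[key]  # duplicate: return the original response
        self.deliveries += 1
        self.seen[key] = f"accepted:{payload}"
        return self.seen[key]


receiver = WebhookReceiver()
key = idempotency_key("run_1", "step_c", 1)

first = receiver.deliver(key, "hello")
# The step crashed after sending; the replay regenerates the same key.
replayed = receiver.deliver(idempotency_key("run_1", "step_c", 1), "hello")
```

Here the replayed delivery returns the original response and the side effect happens once. Note that this relies on a crash-recovery replay reusing the same `attempt` value; a changed attempt number would yield a different key.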
Recovery watchdog
A run is leased to a worker. If that worker dies without releasing the lease, a recovery watchdog detects the expired lease and re-schedules the run onto a healthy node, where it replays the journal and continues. No human intervention, no lost run.
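The lease check can be illustrated with a small sweep function. This is a sketch under assumptions: `Lease`, `Watchdog`, the TTL, and the "first healthy worker" choice are all hypothetical, and time is passed in explicitly so the example is deterministic.

```python
from dataclasses import dataclass


@dataclass
class Lease:
    run_id: str
    worker: str
    expires_at: float  # seconds on some shared clock


class Watchdog:
    """Hypothetical recovery watchdog: re-leases runs whose lease has expired."""

    def __init__(self, healthy_workers: list[str]) -> None:
        self.healthy_workers = healthy_workers

    def sweep(self, leases: dict[str, Lease], now: float, ttl: float) -> list[str]:
        recovered = []
        for lease in leases.values():
            if lease.expires_at > now:
                continue  # the owning worker is still renewing its lease
            candidates = [w for w in self.healthy_workers if w != lease.worker]
            if not candidates:
                continue  # nowhere to move the run yet; retry on the next sweep
            lease.worker = candidates[0]
            lease.expires_at = now + ttl
            recovered.append(lease.run_id)
        return recovered


leases = {
    "run_1": Lease("run_1", "worker-a", expires_at=100.0),  # worker-a died
    "run_2": Lease("run_2", "worker-b", expires_at=500.0),  # still healthy
}
watchdog = Watchdog(healthy_workers=["worker-b", "worker-c"])
recovered = watchdog.sweep(leases, now=200.0, ttl=30.0)
```

Only `run_1` is re-leased; the new owner would then replay the journal and continue from the last completed step.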
Scheduler HA
The scheduler itself is not a single point of failure. Placement state is durable, and a replacement scheduler instance picks up pending and in-flight work without re-placing what is already running. Combined with the watchdog, a node loss degrades to a brief resume rather than a failed run.
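A minimal sketch of the takeover, assuming placement state lives in a store shared by scheduler instances. `PlacementStore`, `Scheduler`, and the round-robin node choice are hypothetical stand-ins, not the actual scheduler design.

```python
class PlacementStore:
    """Stand-in for durable placement state that survives a scheduler crash."""

    def __init__(self) -> None:
        self.placements: dict[str, str] = {}  # run_id -> node
        self.pending: list[str] = []          # run_ids awaiting placement


class Scheduler:
    def __init__(self, name: str, store: PlacementStore, nodes: list[str]) -> None:
        self.name = name
        self.store = store
        self.nodes = nodes
        self.placed_by_me: list[str] = []

    def reconcile(self) -> None:
        """Place only what is still pending; leave existing placements alone."""
        for run_id in self.store.pending:
            if run_id in self.store.placements:
                continue  # already running somewhere: do not re-place
            node = self.nodes[len(self.store.placements) % len(self.nodes)]
            self.store.placements[run_id] = node
            self.placed_by_me.append(run_id)
        self.store.pending = []


store = PlacementStore()
store.placements["run_1"] = "node-a"  # placed by the scheduler that later died
store.pending.append("run_2")         # queued but never placed

replacement = Scheduler("sched-2", store, nodes=["node-a", "node-b"])
replacement.reconcile()
```

The replacement instance leaves `run_1` where it is and places only `run_2`, so a scheduler failover does not disturb in-flight work.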
Why it matters
- No double-spend. Completed LLM steps are replayed from the journal, never re-billed.
- No double-send. Idempotency keys de-dup external side effects across retries.
- No babysitting. The watchdog and the HA scheduler recover crashed runs automatically.
A resumed run still produces one trace per spawn (see Observability), and a completed run's journal is what the verifiable receipt is signed over.