Durable Execution

A headless agent can run for minutes and call an LLM dozens of times. Nodes get preempted, pods get evicted, processes crash. Durable execution is the guarantee that a crash mid-run does not lose completed work, re-spend tokens on completed steps, or fire a side effect twice. A step may be attempted more than once, but its effect is applied exactly once.

Event-sourced journal

Work is decomposed into steps, and every step transition is appended to an event-sourced journal before the next step begins. The journal — not in-memory state — is authoritative. Anything that can take longer than ~100ms or calls an LLM is a step, and each step is idempotent and replayable.

run        step           journal events
────────────────────────────────────────────────
run_1      step_a         step_started → step_completed
run_1      step_b         step_started → step_completed
run_1      step_c         step_started   ◀── crash here
                          (no step_completed written)
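
The trace above can be reproduced with a minimal append-only journal. This is only a sketch: the `Journal` class, the one-JSON-object-per-line file layout, and the field names are assumptions made for illustration, not the engine's actual storage format.

```python
import json
import os
import tempfile


class Journal:
    """Hypothetical append-only journal: one JSON event per line."""

    def __init__(self, path):
        self.path = path

    def append(self, run_id, step_id, event, result=None):
        record = {"run": run_id, "step": step_id, "event": event, "result": result}
        with open(self.path, "a", encoding="utf-8") as f:
            f.write(json.dumps(record) + "\n")
            f.flush()
            os.fsync(f.fileno())  # force the event to disk before returning

    def events(self):
        if not os.path.exists(self.path):
            return []
        with open(self.path, encoding="utf-8") as f:
            return [json.loads(line) for line in f if line.strip()]


def demo_trace():
    """Write the run_1 trace, stopping where the crash happens."""
    path = os.path.join(tempfile.mkdtemp(), "run_1.journal")
    journal = Journal(path)
    for step in ("step_a", "step_b"):
        journal.append("run_1", step, "step_started")
        journal.append("run_1", step, "step_completed", result=f"{step} output")
    journal.append("run_1", "step_c", "step_started")  # crash: no step_completed
    return journal.events()
```

Because each event is flushed before `append` returns, a process that dies after `step_started` leaves exactly the dangling record shown for step_c.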

Resume from the last completed step

On recovery the engine replays the journal and resumes after the last step_completed event. In the trace above, step_a and step_b are not re-executed (their results are read back from the journal), and execution restarts at step_c, whose step_started has no matching step_completed. Completed LLM calls are not re-issued, so their tokens are not re-spent.
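
A minimal sketch of that resume logic follows. It assumes the journal is a list of event dictionaries and that each step is a `(step_id, callable)` pair; `run_with_resume` and both shapes are hypothetical, chosen only to make the replay behavior concrete.

```python
def run_with_resume(run_id, steps, journal):
    """Run steps in order, skipping any the journal shows as completed."""
    completed = {
        e["step"]: e["result"]
        for e in journal
        if e["run"] == run_id and e["event"] == "step_completed"
    }
    results = {}
    for step_id, body in steps:
        if step_id in completed:
            results[step_id] = completed[step_id]  # read back, do not re-execute
            continue
        journal.append({"run": run_id, "step": step_id, "event": "step_started"})
        result = body(results)
        journal.append({"run": run_id, "step": step_id,
                        "event": "step_completed", "result": result})
        results[step_id] = result
    return results


# Journal as it stood at the crash in the trace above.
journal = [
    {"run": "run_1", "step": "step_a", "event": "step_started"},
    {"run": "run_1", "step": "step_a", "event": "step_completed", "result": "a-out"},
    {"run": "run_1", "step": "step_b", "event": "step_started"},
    {"run": "run_1", "step": "step_b", "event": "step_completed", "result": "b-out"},
    {"run": "run_1", "step": "step_c", "event": "step_started"},
]

executed = []  # records which step bodies actually ran during recovery


def make_step(name):
    def body(prior_results):
        executed.append(name)
        return name[-1] + "-out"
    return body


steps = [(s, make_step(s)) for s in ("step_a", "step_b", "step_c")]
results = run_with_resume("run_1", steps, journal)
```

After recovery, `executed` contains only step_c, while `results` still holds the outputs of all three steps.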

Side-effect dedup via idempotency keys

Replaying a step that performs an external side effect — a model API call, a webhook delivery, a Kubernetes create — must not duplicate it. Every external side effect carries an idempotency key derived from the tuple:

idempotency_key = (run_id, step_id, attempt)

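Because the key is stable across replays of the same step attempt, a re-sent delivery is recognized and de-duplicated downstream rather than applied twice. This holds as long as a replay reuses the attempt number already recorded in the journal; a counter that advanced on every recovery would mint a new key and defeat the dedup. This is what makes "resume from last step" safe even when a step had already reached out to the outside world before crashing.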

Note: Steps must be written to be idempotent — same inputs, same effect. The idempotency key protects the delivery; authoring the step to tolerate replay is the other half of the contract.
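
One way to picture the dedup is sketched below. The source specifies only the tuple, so the JSON-plus-SHA-256 encoding, the `idempotency_key` helper, and the `WebhookReceiver` stand-in for a downstream system are all assumptions for illustration.

```python
import hashlib
import json


def idempotency_key(run_id, step_id, attempt):
    """Derive a stable key from the (run_id, step_id, attempt) tuple."""
    # JSON-encoding the list keeps field boundaries unambiguous.
    raw = json.dumps([run_id, step_id, attempt]).encode("utf-8")
    return hashlib.sha256(raw).hexdigest()


class WebhookReceiver:
    """Hypothetical downstream system that de-duplicates on the key."""

    def __init__(self):
        self.seen = {}
        self.deliveries = 0

    def deliver(self, key, payload):
        if key in self.seen:
            return self.seen[key]  # duplicate: return the stored response
        self.deliveries += 1
        response = {"status": "accepted", "payload": payload}
        self.seen[key] = response
        return response


receiver = WebhookReceiver()
key = idempotency_key("run_1", "step_c", 1)
first = receiver.deliver(key, {"msg": "hello"})
# Replay after a crash: same run, step, and journaled attempt, so same key.
replayed = receiver.deliver(idempotency_key("run_1", "step_c", 1), {"msg": "hello"})
```

The second `deliver` call returns the stored response without counting as a new delivery, which is the behavior the key is meant to buy from any downstream that honors it.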

Recovery watchdog

A run is leased to a worker. If that worker dies without releasing the lease, a recovery watchdog detects the expired lease and re-schedules the run onto a healthy node, where it replays the journal and continues. No human intervention, no lost run.
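
A single watchdog pass might look like the sketch below. The `Lease` record, the `reschedule_expired` function, the 30-second TTL, and the round-robin choice of node are assumptions for illustration; the source does not describe how leases are stored or how a target node is picked.

```python
from dataclasses import dataclass


@dataclass
class Lease:
    run_id: str
    worker: str
    expires_at: float  # seconds, on the same clock the caller passes as `now`


def reschedule_expired(leases, healthy_workers, now, ttl=30.0):
    """Reassign every run whose lease has expired; return the moved run_ids."""
    if not healthy_workers:
        return []
    moved = []
    for lease in leases:
        if lease.expires_at > now:
            continue  # lease still live: the worker is presumed healthy
        # Round-robin over healthy workers (an arbitrary policy for this sketch).
        lease.worker = healthy_workers[len(moved) % len(healthy_workers)]
        lease.expires_at = now + ttl
        moved.append(lease.run_id)
    return moved


leases = [
    Lease("run_1", "node-a", expires_at=100.0),  # node-a died, lease lapsed
    Lease("run_2", "node-b", expires_at=200.0),  # still live
]
moved = reschedule_expired(leases, ["node-c"], now=150.0)
```

Only run_1 moves, since its lease lapsed before `now`. The new worker would then replay the journal and continue from the last completed step.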

Scheduler HA

The scheduler itself is not a single point of failure. Placement state is durable, and a replacement scheduler instance picks up pending and in-flight work without re-placing what is already running. Combined with the watchdog, a node loss degrades to a brief resume rather than a failed run.
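
The takeover behavior can be sketched as follows, assuming placement state is a durable mapping from run ID to node name, with `None` marking a pending run. The `recover_placements` function and that mapping shape are hypothetical.

```python
def recover_placements(placements, free_nodes):
    """Place only pending runs; leave in-flight runs on their current nodes."""
    free = list(free_nodes)
    recovered = dict(placements)  # do not mutate the durable record in place
    for run_id in sorted(recovered):
        if recovered[run_id] is None and free:
            recovered[run_id] = free.pop(0)
    return recovered


# State read back by a replacement scheduler after the previous one failed.
durable_state = {"run_1": "node-a", "run_2": None, "run_3": "node-b"}
after_takeover = recover_placements(durable_state, ["node-c"])
```

Here run_1 and run_3 keep their nodes, and only the pending run_2 is placed.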

Why it matters

No double-spend. Completed LLM steps are replayed from the journal, never re-billed.
No double-send. Idempotency keys de-dup external side effects across retries.
No babysitting. The watchdog + HA scheduler recover crashed runs automatically.

A resumed run still produces one trace per spawn (see Observability), and a completed run's journal is what the verifiable receipt is signed over.