Observability

One OTel trace per spawn, GenAI token telemetry, and real-time anomaly detection, all wired through standard OpenTelemetry.

One trace per spawn

Every spawn opens a single trace correlated under:

(tenant_id, run_id, step_id, agent_instance_id, trace_id)

agent_instance_id is the per-spawn identity (see Identity & secrets), so two runs of the same agent never collide. A durable resume after a crash re-joins the same trace_id, so the full lifecycle reads as one coherent timeline.
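As a rough illustration of the correlation tuple, the Python sketch below models it as an immutable record. The class and function names (SpawnCorrelation, new_spawn, resume) are hypothetical, as is the choice of random hex identifiers. The source only states that trace_id survives a durable resume; keeping agent_instance_id across the resume is an assumption made here.

```python
import uuid
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class SpawnCorrelation:
    """Hypothetical record holding the five correlation fields."""
    tenant_id: str
    run_id: str
    step_id: str
    agent_instance_id: str
    trace_id: str


def new_spawn(tenant_id: str, run_id: str, step_id: str) -> SpawnCorrelation:
    # A fresh agent_instance_id per spawn keeps two runs of the same agent apart.
    # uuid4().hex is 32 hex characters, the same length as a W3C trace-id.
    return SpawnCorrelation(
        tenant_id=tenant_id,
        run_id=run_id,
        step_id=step_id,
        agent_instance_id=uuid.uuid4().hex,
        trace_id=uuid.uuid4().hex,
    )


def resume(previous: SpawnCorrelation, step_id: str) -> SpawnCorrelation:
    # Durable resume: only the step changes, trace_id is reused.
    # Reusing agent_instance_id as well is an assumption of this sketch.
    return replace(previous, step_id=step_id)
```

Two calls to new_spawn with identical arguments yield different agent_instance_id values, while resume returns a record with the original trace_id.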

Trace spine

gateway.request [tenant_id]

control-plane: run dispatch [run_id]

model-router: route [step_id · model_used · tokens · cost_usd]

runtime-manager: spawn [vm_id · image · isolation_class]

harness: step loop [step_id · tool_calls · reasoning_tokens · cache_tokens]

W3C traceparent propagated at every boundary · durable resume re-joins the same trace_id
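To make the propagation rule concrete, here is a minimal Python sketch of handling a W3C traceparent header at a service boundary: parse the inbound value, keep its trace-id, and mint a new span-id for the outgoing hop. The function names are hypothetical and this is not the product's implementation. It only handles the four-field layout defined for version 00 of the header.

```python
import re
import secrets
from typing import Optional

# version - trace-id (32 hex) - parent-id (16 hex) - flags (2 hex)
_TRACEPARENT = re.compile(
    r"^([0-9a-f]{2})-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})$"
)


def parse_traceparent(header: str) -> Optional[dict]:
    """Return the header fields, or None if the header is invalid."""
    match = _TRACEPARENT.match(header.strip())
    if match is None:
        return None
    version, trace_id, parent_id, flags = match.groups()
    # The spec forbids version ff and all-zero trace or parent ids.
    if version == "ff" or trace_id == "0" * 32 or parent_id == "0" * 16:
        return None
    return {"trace_id": trace_id, "parent_id": parent_id, "flags": flags}


def child_traceparent(inbound: Optional[str]) -> str:
    """Build the header for the next hop, re-using the inbound trace-id."""
    ctx = parse_traceparent(inbound) if inbound else None
    if ctx is None:
        # No usable inbound context: start a new, sampled trace.
        trace_id, flags = secrets.token_hex(16), "01"
    else:
        trace_id, flags = ctx["trace_id"], ctx["flags"]
    span_id = secrets.token_hex(8)  # new span-id for this hop
    return f"00-{trace_id}-{span_id}-{flags}"
```

A resumed run that stores the original header can pass it to child_traceparent again, which is one way the same trace_id could be re-joined after a crash.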

Enabling OTel

Export is env-gated. Set either variable below and traces flow to the collector; leave both unset and span export is a no-op (zero overhead):

OTEL_EXPORTER_OTLP_ENDPOINT=http://otel-collector:4317
# or
LANTERN_OTEL_ENABLED=1   # uses default localhost endpoint

W3C traceparent is always active. Inbound trace context is forwarded correctly even without an exporter configured.
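The gating rule can be sketched in Python as a small resolver. The function name and the default endpoint value http://localhost:4317 are assumptions (the source only says "default localhost endpoint"), as are the precedence when both variables are set and the choice to accept only the literal value 1.

```python
import os
from typing import Mapping, Optional

# Assumed default; the source does not spell out the exact URL.
DEFAULT_ENDPOINT = "http://localhost:4317"


def resolve_otlp_endpoint(env: Optional[Mapping[str, str]] = None) -> Optional[str]:
    """Return the OTLP endpoint to export to, or None when export is off."""
    env = os.environ if env is None else env
    endpoint = env.get("OTEL_EXPORTER_OTLP_ENDPOINT", "").strip()
    if endpoint:
        # An explicit endpoint wins (assumed precedence).
        return endpoint
    if env.get("LANTERN_OTEL_ENABLED", "").strip() == "1":
        return DEFAULT_ENDPOINT
    # Neither variable set: the caller should skip creating an exporter.
    return None
```

A caller would create an exporter only when the result is not None, while still propagating traceparent headers in either case.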

GenAI semantic conventions

LLM steps are annotated with OTel GenAI semantic-convention attributes, including reasoning tokens and cache tokens rather than just plain input/output counts. Per-step cost attribution and model-usage breakdowns work out of the box with any OTel-compatible backend.
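The Python sketch below shows how such attributes might be laid out and rolled up per model. The gen_ai.* keys are taken from the OTel GenAI semantic conventions. The example.* keys for reasoning tokens, cache tokens, and cost are placeholders, since the exact attribute names the runtime emits are not given here. The function names and model names are also hypothetical.

```python
from collections import defaultdict
from typing import Dict, Iterable


def llm_step_attributes(model: str, input_tokens: int, output_tokens: int,
                        reasoning_tokens: int = 0, cache_tokens: int = 0,
                        cost_usd: float = 0.0) -> Dict[str, object]:
    """Build a span-attribute dict for one LLM step."""
    return {
        "gen_ai.operation.name": "chat",
        "gen_ai.request.model": model,
        "gen_ai.usage.input_tokens": input_tokens,
        "gen_ai.usage.output_tokens": output_tokens,
        # Placeholder keys: the real attribute names are an assumption.
        "example.usage.reasoning_tokens": reasoning_tokens,
        "example.usage.cache_tokens": cache_tokens,
        "example.cost_usd": cost_usd,
    }


def usage_by_model(steps: Iterable[Dict[str, object]]) -> Dict[str, Dict[str, float]]:
    """Sum tokens and cost per model across step attribute dicts."""
    totals = defaultdict(lambda: {"input_tokens": 0, "output_tokens": 0, "cost_usd": 0.0})
    for attrs in steps:
        bucket = totals[attrs["gen_ai.request.model"]]
        bucket["input_tokens"] += attrs["gen_ai.usage.input_tokens"]
        bucket["output_tokens"] += attrs["gen_ai.usage.output_tokens"]
        bucket["cost_usd"] += attrs["example.cost_usd"]
    return dict(totals)
```

In practice a tracing backend would run this aggregation as a query over span attributes; the function only illustrates the shape of the breakdown.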

Real-time anomaly detection

The runtime watches the live event stream for pathological shapes, such as a tool-call loop or a step retrying without progress, and surfaces them in real time. This is the early-warning layer for runaway runs.
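One simple way to spot a tool-call loop is to count identical calls inside a sliding window. The Python sketch below does that. It is an illustrative heuristic with hypothetical names and thresholds, not the detection logic the runtime actually uses.

```python
from collections import deque


class ToolLoopDetector:
    """Flags a run when the same tool call repeats within a recent window."""

    def __init__(self, window: int = 6, threshold: int = 3) -> None:
        # Only the most recent `window` calls are remembered.
        self._recent = deque(maxlen=window)
        self._threshold = threshold

    def observe(self, tool_name: str, arguments: str) -> bool:
        """Record one tool call; return True if it looks like a loop.

        `arguments` is expected to be a stable string form of the call
        arguments, for example canonical JSON.
        """
        key = (tool_name, arguments)
        self._recent.append(key)
        repeats = sum(1 for seen in self._recent if seen == key)
        return repeats >= self._threshold
```

Feeding each tool-call event from the stream into observe gives a per-event boolean that can drive an alert or a run abort. A step retrying without progress could be handled by a similar counter keyed on step_id.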

Metrics endpoint

Per-VM live stats for the caller's tenant:

GET /v1/runtime/metrics

Returns a vmMetricsDTO array; each entry carries vmId, state, node, az, isolationClass, promMetrics (raw Prometheus text from the harness), and timestamps.

Per-instance detail: GET /v1/runtime/vms/{id}

Live log stream (SSE): GET /v1/runtime/vms/{id}/logs

The dashboard runtime page renders all three.
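Because promMetrics arrives as raw Prometheus text, a client has to parse it before use. The Python sketch below decodes a response body and turns each VM's promMetrics into a dict of samples. The field names vmId and promMetrics come from the description above. The function names and the metric names in the usage example are hypothetical, and the parser handles only simple sample lines, not the full exposition format.

```python
import json
from typing import Dict


def parse_prom_text(text: str) -> Dict[str, float]:
    """Parse simple Prometheus text lines into {series: value}."""
    samples: Dict[str, float] = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and HELP/TYPE comments
        if "}" in line:
            # Series with labels: everything up to the last "}" is the key.
            end = line.rfind("}") + 1
            series, rest = line[:end], line[end:].split()
        else:
            parts = line.split()
            series, rest = parts[0], parts[1:]
        if rest:
            samples[series] = float(rest[0])  # any trailing timestamp is ignored
    return samples


def metrics_by_vm(body: str) -> Dict[str, Dict[str, float]]:
    """Map each vmId in a metrics response body to its parsed samples."""
    return {vm["vmId"]: parse_prom_text(vm.get("promMetrics", ""))
            for vm in json.loads(body)}
```

For example, a body containing one VM whose promMetrics is "harness_steps_total 42" would map that vmId to a dict with a single harness_steps_total entry.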

Gateway and model-router traces

The gateway emits one span per HTTP request (gateway.request, tagged with tenant_id) via OTLP/HTTP. The model-router emits one span per routing call via OTLP/gRPC, tagged with tenant_id, run_id, step_id, model_used, tokens_in/out, cost_usd, and escalated. Both honour inbound W3C traceparent, so spans join the caller's distributed trace automatically.

The gateway and model-router do not expose Prometheus histograms yet, so latency SLOs live in your tracing backend (Tempo / Jaeger / Honeycomb). Alert rules that would cover p99 latency are parked in the lantern-TODO-needs-instrumentation group in infra/monitoring/prometheus/alerts.yml until the histogram metric ships.
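Until histograms exist, a p99 figure has to be derived from span durations. Tracing backends normally do this with their own query language. The Python sketch below only shows the underlying nearest-rank calculation over a list of durations, with a hypothetical function name and input format.

```python
import math
from typing import Sequence


def percentile_nearest_rank(durations_ms: Sequence[float], p: int) -> float:
    """Return the p-th percentile (1 <= p <= 100) by the nearest-rank method."""
    if not durations_ms:
        raise ValueError("no span durations supplied")
    if not 1 <= p <= 100:
        raise ValueError("p must be between 1 and 100")
    ordered = sorted(durations_ms)
    # Nearest rank: the smallest value with at least p% of samples at or below it.
    rank = math.ceil(p * len(ordered) / 100)
    return ordered[rank - 1]
```

Comparing percentile_nearest_rank(durations, 99) against an SLO threshold gives the same kind of check the parked alert rules would perform once the histogram metric ships.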

Prometheus alerts, dashboards, runbooks

Production monitoring artifacts live in infra/monitoring/:

Group	Alerts	Source
`lantern-scheduler`	SchedulerDown · SchedulerNoLeader · SchedulerScheduleErrorRateHigh · SchedulerQuotaRejectionSurge · SchedulerNoRegisteredNodes	runtime-scheduler `:8085/metrics`
`lantern-liveness`	ControlPlaneDown · ControlPlaneNotReady · GatewayDown · ModelRouterDown	`up` scrape + blackbox `/readyz`
`lantern-postgres`	PostgresExporterDown · PostgresConnectionSaturation · DataPlaneHeartbeatStale · CronScheduleOverdue	postgres_exporter + custom queries

Eight operator runbooks cover every active alert plus the DB restore procedure, linked from each alert's runbook: annotation in alerts.yml. Grafana dashboards: grafana/platform-overview.json and grafana/data-plane-runtime.json.