Observability
One OTel trace per spawn, GenAI token telemetry, real-time anomaly detection — wired through standard OpenTelemetry.
One trace per spawn
Every spawn opens a single trace correlated under:
(tenant_id, run_id, step_id, agent_instance_id, trace_id)
agent_instance_id is the per-spawn identity (see Identity & secrets), so two runs of the same agent never collide. A durable resume after a crash re-joins the same trace_id, so the full lifecycle reads as one coherent timeline.
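As a rough illustration of the resume behaviour described above, the sketch below keeps a map from the spawn's correlation key to its trace_id, so a resumed spawn looks up the existing ID instead of minting a new one. The names SpawnKey and TraceRegistry are hypothetical, and the in-memory dict only stands in for whatever durable store the runtime actually uses.

```python
import secrets
from dataclasses import dataclass


@dataclass(frozen=True)
class SpawnKey:
    """Hypothetical correlation key; field names follow the tuple above."""
    tenant_id: str
    run_id: str
    step_id: str
    agent_instance_id: str


class TraceRegistry:
    """Hypothetical registry. A dict stands in for durable storage."""

    def __init__(self):
        self._trace_ids = {}

    def trace_id_for(self, key: SpawnKey) -> str:
        # First call for a spawn mints a 16-byte (32 hex char) trace ID;
        # later calls, such as a resume after a crash, reuse it.
        if key not in self._trace_ids:
            self._trace_ids[key] = secrets.token_hex(16)
        return self._trace_ids[key]


registry = TraceRegistry()
first_spawn = SpawnKey("tenant-1", "run-1", "step-1", "inst-a")
second_spawn = SpawnKey("tenant-1", "run-1", "step-1", "inst-b")

original = registry.trace_id_for(first_spawn)
resumed = registry.trace_id_for(first_spawn)   # same spawn, after a restart
other = registry.trace_id_for(second_spawn)    # a different spawn of the same agent
```

Because agent_instance_id is part of the key, two spawns of the same agent get distinct traces, while a resume of one spawn keeps its original trace_id.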
Enabling OTel
Export is env-gated. Set either variable below and traces flow; leave both unset and tracing is a no-op (zero overhead):
OTEL_EXPORTER_OTLP_ENDPOINT=http://otel-collector:4317
# or
LANTERN_OTEL_ENABLED=1 # uses default localhost endpoint
GenAI semantic conventions
LLM steps are annotated with OTel GenAI semantic-convention attributes — including reasoning tokens and cache tokens, not just plain input/output counts. Per-step cost attribution and model-usage breakdowns work out of the box with any OTel-compatible backend.
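To make the cost-attribution idea concrete, here is a minimal sketch that rolls span attributes up into a per-model cost total. It reads the OTel GenAI attributes gen_ai.request.model, gen_ai.usage.input_tokens, and gen_ai.usage.output_tokens. The price table and model names are invented for illustration, and reasoning and cache token attributes are left out because this page does not spell out their names.

```python
from collections import defaultdict

# Hypothetical USD prices per 1,000 tokens; illustrative values only.
PRICES = {
    "model-a": {"input": 1.0, "output": 2.0},
    "model-b": {"input": 4.0, "output": 8.0},
}


def cost_by_model(spans, prices):
    """Sum estimated cost per model from a list of span attribute dicts."""
    totals = defaultdict(float)
    for attrs in spans:
        model = attrs["gen_ai.request.model"]
        rate = prices[model]
        tokens_in = attrs.get("gen_ai.usage.input_tokens", 0)
        tokens_out = attrs.get("gen_ai.usage.output_tokens", 0)
        totals[model] += (tokens_in * rate["input"] + tokens_out * rate["output"]) / 1000
    return dict(totals)


# Span attributes as they might appear after export (made-up sample data).
sample_spans = [
    {"gen_ai.request.model": "model-a",
     "gen_ai.usage.input_tokens": 1000, "gen_ai.usage.output_tokens": 2000},
    {"gen_ai.request.model": "model-a",
     "gen_ai.usage.input_tokens": 1000, "gen_ai.usage.output_tokens": 0},
    {"gen_ai.request.model": "model-b",
     "gen_ai.usage.input_tokens": 500, "gen_ai.usage.output_tokens": 500},
]

breakdown = cost_by_model(sample_spans, PRICES)
```

In practice this aggregation would usually be a query in the tracing backend rather than application code; the sketch only shows which attributes the breakdown depends on.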
Real-time anomaly detection
The runtime watches the live event stream for pathological shapes — a tool-call loop, a step retrying without progress — and surfaces them in real time. This is the early-warning layer for runaway runs.
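One of the shapes mentioned above, a tool-call loop, could be detected along the following lines. This is a sketch under assumptions, not the runtime's actual detector: the event fields (type, tool, args) and the threshold are hypothetical, and a loop is taken to mean the same tool called with identical arguments several times in a row.

```python
import json


def find_tool_call_loops(events, threshold=3):
    """Return indexes at which an identical tool call has repeated
    `threshold` times in a row. Events that are not tool calls are
    skipped and do not break a streak."""
    alerts = []
    last_signature = None
    streak = 0
    for index, event in enumerate(events):
        if event.get("type") != "tool_call":
            continue
        # Serialise args deterministically so equal calls compare equal.
        signature = (event["tool"], json.dumps(event.get("args", {}), sort_keys=True))
        streak = streak + 1 if signature == last_signature else 1
        last_signature = signature
        if streak == threshold:  # fire once per streak, not on every repeat
            alerts.append(index)
    return alerts


looping = [
    {"type": "tool_call", "tool": "search", "args": {"q": "a"}},
    {"type": "log", "message": "thinking"},
    {"type": "tool_call", "tool": "search", "args": {"q": "a"}},
    {"type": "tool_call", "tool": "search", "args": {"q": "a"}},
    {"type": "tool_call", "tool": "search", "args": {"q": "a"}},
    {"type": "tool_call", "tool": "read", "args": {"path": "f"}},
]
healthy = [
    {"type": "tool_call", "tool": "search", "args": {"q": "a"}},
    {"type": "tool_call", "tool": "read", "args": {"path": "f"}},
    {"type": "tool_call", "tool": "search", "args": {"q": "a"}},
]

loop_alerts = find_tool_call_loops(looping)
healthy_alerts = find_tool_call_loops(healthy)
```

Processing events one at a time with constant state is what lets a check like this run against a live stream rather than a completed run.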
Metrics endpoint
Per-VM live stats for the caller's tenant:
GET /v1/runtime/metrics
Returns a vmMetricsDTO array with vmId, state, node, az, isolationClass, promMetrics (raw Prometheus text from the harness), and timestamps. Per-instance detail: GET /v1/runtime/vms/{id}. Live log stream: GET /v1/runtime/vms/{id}/logs (SSE). The dashboard runtime page renders all three.
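Since promMetrics arrives as raw Prometheus text, a client has to parse it before charting or alerting on it. The sketch below handles the simple "series value" line shape and skips comment lines. The sample DTO and the metric names in it are made up, and lines carrying an optional trailing timestamp are not handled.

```python
def parse_prom_text(text):
    """Parse Prometheus text exposition lines of the form
    `name{labels} value` into a dict of series -> float.
    Assumes no trailing timestamp on sample lines."""
    samples = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):  # skip blanks, HELP and TYPE lines
            continue
        # The value is the last space-separated token; label values may
        # themselves contain spaces, so split from the right.
        series, _, value = line.rpartition(" ")
        samples[series] = float(value)
    return samples


# Hypothetical element of the vmMetricsDTO array (metric names invented).
sample_vm = {
    "vmId": "vm-1",
    "state": "running",
    "promMetrics": (
        "# HELP harness_cpu_seconds_total CPU time used.\n"
        "# TYPE harness_cpu_seconds_total counter\n"
        "harness_cpu_seconds_total 12.5\n"
        'harness_mem_bytes{kind="rss"} 1048576\n'
    ),
}

vm_samples = parse_prom_text(sample_vm["promMetrics"])
```

For anything beyond a quick script, a maintained Prometheus client library's text parser is a safer choice than hand-rolled parsing.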
Gateway and model-router traces
The gateway emits one span per HTTP request (gateway.request, tagged with tenant_id) via OTLP/HTTP. The model-router emits one span per routing call tagged with tenant_id, run_id, step_id, model_used, tokens_in/out, cost_usd, and escalated via OTLP/gRPC. Both honour inbound W3C traceparent, so spans join the caller's distributed trace automatically.
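For readers unfamiliar with the traceparent header, the sketch below parses the version-00 form defined by W3C Trace Context: version, 32-hex trace ID, 16-hex parent span ID, and flags, separated by hyphens. A service that honours the header reuses the trace ID and sets the parent ID as its span's parent. The function name is hypothetical, and real services would normally rely on an OpenTelemetry propagator instead.

```python
import re

# version - trace-id (32 hex) - parent-id (16 hex) - flags (2 hex)
_TRACEPARENT = re.compile(r"([0-9a-f]{2})-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})")


def parse_traceparent(header):
    """Return trace context from a version-00 style traceparent header,
    or None if the header is malformed or carries invalid IDs."""
    match = _TRACEPARENT.fullmatch(header.strip())
    if match is None:
        return None
    version, trace_id, parent_id, flags = match.groups()
    # The spec forbids version ff and all-zero trace or parent IDs.
    if version == "ff" or trace_id == "0" * 32 or parent_id == "0" * 16:
        return None
    return {
        "trace_id": trace_id,
        "parent_id": parent_id,
        "sampled": bool(int(flags, 16) & 0x01),
    }


incoming = "00-4bf92f3577b34da6a3ce929d0e0e4736-00f067aa0ba902b7-01"
context = parse_traceparent(incoming)
```

A span created with this trace_id and parent_id appears in the caller's trace as a child of the calling span, which is the joining behaviour described above.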
lantern-TODO-needs-instrumentation group in infra/monitoring/prometheus/alerts.yml until the histogram metric ships.
Prometheus alerts, dashboards, runbooks
Production monitoring artifacts live in infra/monitoring/:
| Group | Alerts | Source |
|---|---|---|
| lantern-scheduler | SchedulerDown · SchedulerNoLeader · SchedulerScheduleErrorRateHigh · SchedulerQuotaRejectionSurge · SchedulerNoRegisteredNodes | runtime-scheduler :8085/metrics |
| lantern-liveness | ControlPlaneDown · ControlPlaneNotReady · GatewayDown · ModelRouterDown | up scrape + blackbox /readyz |
| lantern-postgres | PostgresExporterDown · PostgresConnectionSaturation · DataPlaneHeartbeatStale · CronScheduleOverdue | postgres_exporter + custom queries |
Eight operator runbooks cover every active alert plus the DB restore procedure, linked from each alert's runbook: annotation in alerts.yml. Grafana dashboards: grafana/platform-overview.json and grafana/data-plane-runtime.json.