Evaluations & Observability

The Evaluations dashboard shows how your agents are performing in production. Track success rates, cost attribution, latency, and model usage from a single page.

Agent performance metrics

The Evaluations page (accessible from the dashboard sidebar under Evaluations) displays key metrics for each agent:

Success rate -- percentage of runs that completed without errors, tracked over time with trend indicators
Latency percentiles -- p50, p95, and p99 response times for agent runs, broken down by step type (LLM call, tool execution, connector request)
Throughput -- runs per hour/day/week with historical comparison
Error breakdown -- categorized failure reasons (model timeout, connector failure, guardrail block, user abort)

Note: Metrics are computed from the runs and journal_events tables. Every run automatically records cost, token counts, and timing -- no extra instrumentation required.
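To make the definitions above concrete, here is a minimal sketch of how success rate and latency percentiles could be derived from run records. The `Run` fields (`status`, `latency_ms`) and the `"succeeded"` status value are assumptions for illustration; the actual schema of the runs table is not documented here and may differ.

```python
from dataclasses import dataclass
from statistics import quantiles


@dataclass
class Run:
    # Hypothetical fields; the real runs table may use different names.
    status: str        # "succeeded", or a failure category such as "model_timeout"
    latency_ms: float  # wall-clock duration of the run


def success_rate(runs):
    """Percentage of runs that completed without errors."""
    if not runs:
        return 0.0
    ok = sum(1 for r in runs if r.status == "succeeded")
    return 100.0 * ok / len(runs)


def latency_percentiles(runs):
    """Return p50, p95, and p99 latency in milliseconds.

    Needs at least two runs; statistics.quantiles raises StatisticsError otherwise.
    """
    # n=100 yields 99 cut points; cut point i (1-based) is the i-th percentile.
    cuts = quantiles([r.latency_ms for r in runs], n=100, method="inclusive")
    return {"p50": cuts[49], "p95": cuts[94], "p99": cuts[98]}
```

The dashboard computes these for you; a sketch like this is mainly useful for sanity-checking exported data against what the dashboard reports.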

Cost attribution

Lantern tracks cost at four levels of granularity:

Per-run cost -- total USD spent on LLM tokens, connector API calls, and compute for each run
Per-agent cost -- aggregated spend across all runs for an agent, with daily/weekly/monthly rollups
Per-model cost -- breakdown of spend by model provider and model tier (e.g., how much went to Claude Opus vs. GPT-4o-mini)
Per-tenant cost -- total platform spend for billing and chargeback

Cost data flows from the model router (which records token counts and pricing) and the billing service (which aggregates and attributes).
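The rollups above are all aggregations of the same per-run cost records along different keys. The sketch below illustrates that idea; the record fields (`agent_id`, `model`, `tenant_id`, `cost_usd`) are hypothetical and do not describe the billing service's actual data model.

```python
from collections import defaultdict
from decimal import Decimal


def rollup(records, key):
    """Sum cost_usd across records, grouped by the given key.

    `records` is an iterable of dicts; field names are illustrative assumptions.
    Decimal avoids floating-point drift when summing many small amounts.
    """
    totals = defaultdict(Decimal)
    for rec in records:
        totals[rec[key]] += rec["cost_usd"]
    return dict(totals)


# Example records (made-up values, for illustration only).
records = [
    {"agent_id": "support-bot", "model": "model-a", "tenant_id": "t1", "cost_usd": Decimal("0.12")},
    {"agent_id": "support-bot", "model": "model-b", "tenant_id": "t1", "cost_usd": Decimal("0.03")},
    {"agent_id": "triage-bot", "model": "model-a", "tenant_id": "t2", "cost_usd": Decimal("0.40")},
]

per_agent = rollup(records, "agent_id")
per_model = rollup(records, "model")
per_tenant = rollup(records, "tenant_id")
```

Grouping by a different key is the only change between the per-agent, per-model, and per-tenant views.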

Model usage tracking

The model usage panel shows which models your agents are actually using after routing:

Model distribution -- pie chart of requests by concrete model (Claude 3.5 Sonnet, GPT-4o, Gemini Pro, etc.)
Routing decisions -- how the model router resolved capability requests (e.g., reasoning-large mapped to Claude Opus 72% of the time and GPT-4o 28% of the time)
Token consumption -- input and output tokens per model, per agent, per time period
Strategy effectiveness -- compare outcomes across the four routing strategies (balanced, cheap, quality, fast) to find the optimal setting for each agent
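The routing-decision percentages can be read as a simple share calculation: for each capability request, count how often each concrete model was chosen. A minimal sketch, assuming routing decisions are available as `(capability, model)` pairs (a hypothetical shape, not a documented export format):

```python
from collections import Counter, defaultdict


def routing_shares(decisions):
    """Return, per capability, the percentage of requests routed to each model.

    `decisions` is an iterable of (capability, concrete_model) pairs.
    """
    counts = defaultdict(Counter)
    for capability, model in decisions:
        counts[capability][model] += 1

    shares = {}
    for capability, per_model in counts.items():
        total = sum(per_model.values())
        shares[capability] = {
            model: 100.0 * n / total for model, n in per_model.items()
        }
    return shares
```

For example, 72 decisions mapping `reasoning-large` to one model and 28 to another would produce the 72% / 28% split described above.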

Quality signals

Beyond raw metrics, Lantern surfaces quality signals that help you understand whether your agents are doing the right thing:

Session satisfaction -- for interactive sessions, track whether users continue the conversation (engaged) or abandon it (dissatisfied)
Guardrail triggers -- how often guardrails fire, which rules trigger most, and whether blocked outputs indicate a prompt issue
Retry rate -- how often steps need to be retried due to transient failures, and which connectors or models are least reliable
Version comparison -- compare metrics between agent versions to validate that changes improve (or at least do not degrade) performance

Tip: Use version comparison before promoting a new agent version to production. Deploy the new version to a staging environment, run a batch of test inputs, and compare the evaluation metrics side by side.
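A side-by-side comparison boils down to computing a delta per metric and deciding which direction counts as a regression. The sketch below shows one way to do that by hand; the metric names and the `higher_is_better` default are assumptions, not identifiers used by the dashboard.

```python
def compare_versions(baseline, candidate, higher_is_better=("success_rate",)):
    """Compare two metric dicts and flag regressions.

    Metrics named in `higher_is_better` regress when they drop; all others
    (latency, cost, retry rate) regress when they rise.
    """
    report = {}
    for name, base in baseline.items():
        cand = candidate[name]
        delta = cand - base
        regressed = delta < 0 if name in higher_is_better else delta > 0
        report[name] = {
            "baseline": base,
            "candidate": cand,
            "delta": delta,
            "regressed": regressed,
        }
    return report
```

A candidate that improves success rate but raises p95 latency would show one metric flagged as regressed, which is the kind of trade-off worth reviewing before promotion.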

Setting up alerts (future)

Alert configuration is on the roadmap. Planned capabilities include:

Success rate threshold -- alert when an agent's success rate drops below a configurable percentage
Cost spike -- alert when per-run or per-day cost exceeds a threshold
Latency degradation -- alert when p95 latency exceeds a target for a sustained period
Delivery channels -- email, Slack, PagerDuty, and webhook

Coming soon: Alerts are not available in the current release. Until then, monitor manually in the Evaluations dashboard, or export metrics to your existing observability stack via the OTel exporter.
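Until alerts ship, a check like the planned "latency degradation" rule can be approximated in your own tooling over exported metrics. The following is a sketch of one possible interpretation of "exceeds a target for a sustained period" (a run of consecutive samples above the target); it is not Lantern's alerting logic, and the function and parameter names are hypothetical.

```python
def sustained_breach(samples, target, min_consecutive):
    """Return True if at least `min_consecutive` samples in a row exceed `target`.

    `samples` is a time-ordered sequence of metric values, e.g. p95 latency
    in milliseconds measured once per interval.
    """
    streak = 0
    for value in samples:
        # Extend the streak on a breach, reset it on any sample at or below target.
        streak = streak + 1 if value > target else 0
        if streak >= min_consecutive:
            return True
    return False
```

Requiring consecutive breaches rather than a single one avoids firing on isolated spikes.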

OTel integration

Every service in Lantern emits OpenTelemetry traces with a common set of attributes: tenant_id, run_id, step_id, and agent_version. You can export these traces to any OTel-compatible backend (Jaeger, Datadog, Grafana Tempo, Honeycomb) for deep-dive debugging alongside the built-in Evaluations dashboard.
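Because every span carries run_id and step_id, spans from different services can be stitched into a single timeline for one run. The sketch below assumes spans have already been exported and loaded as plain dicts with `name`, `start_ns`, and `attributes` keys; that shape is an illustrative assumption, since the actual format depends on your exporter and backend.

```python
def spans_for_run(spans, run_id):
    """Return the spans belonging to one run, ordered by start time.

    Each span is a dict with "name", "start_ns", and an "attributes" dict
    (hypothetical shape for exported span data).
    """
    matching = [s for s in spans if s["attributes"].get("run_id") == run_id]
    return sorted(matching, key=lambda s: s["start_ns"])


def timeline(spans, run_id):
    """Summarize a run as (step_id, span name) pairs in execution order."""
    return [
        (s["attributes"].get("step_id"), s["name"])
        for s in spans_for_run(spans, run_id)
    ]
```

Most tracing backends offer the same filter through their query UI by searching on the run_id attribute.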