Evaluations & Observability
The Evaluations dashboard shows how your agents are performing in production. Track success rates, cost attribution, latency, and model usage from a single page.
Agent performance metrics
The Evaluations page (accessible from the dashboard sidebar under Evaluations) displays key metrics for each agent:
- Success rate -- percentage of runs that completed without errors, tracked over time with trend indicators
- Latency percentiles -- p50, p95, and p99 response times for agent runs, broken down by step type (LLM call, tool execution, connector request)
- Throughput -- runs per hour/day/week with historical comparison
- Error breakdown -- categorized failure reasons (model timeout, connector failure, guardrail block, user abort)
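As a rough illustration of what these metrics mean, the sketch below computes a success rate, nearest-rank latency percentiles, and an error breakdown from a list of run records. The `Run` record and its field names are assumptions made for this example and do not reflect Lantern's actual schema. The dashboard may also use a different percentile method.

```python
from collections import Counter
from dataclasses import dataclass
from math import ceil
from typing import List, Optional


@dataclass
class Run:
    # Hypothetical run record; field names are illustrative only.
    agent: str
    latency_ms: float
    error: Optional[str] = None  # None means the run completed without errors


def success_rate(runs: List[Run]) -> float:
    """Percentage of runs that completed without errors."""
    if not runs:
        return 0.0
    ok = sum(1 for r in runs if r.error is None)
    return 100.0 * ok / len(runs)


def percentile(values: List[float], p: float) -> float:
    """Nearest-rank percentile: the smallest value with at least p% of the data at or below it."""
    if not values:
        raise ValueError("percentile of an empty list is undefined")
    ordered = sorted(values)
    rank = max(1, ceil(p / 100 * len(ordered)))
    return ordered[rank - 1]


def error_breakdown(runs: List[Run]) -> Counter:
    """Count failed runs by failure reason."""
    return Counter(r.error for r in runs if r.error is not None)
```

With ten runs whose latencies are 100 ms through 1000 ms, `percentile(latencies, 50)` returns 500 and `percentile(latencies, 95)` returns 1000, which shows how tail percentiles are dominated by the slowest runs.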
Metrics are computed from the runs and journal_events tables. Every run automatically records cost, token counts, and timing -- no extra instrumentation required.
Cost attribution
Lantern tracks cost at four levels of granularity:
- Per-run cost -- total USD spent on LLM tokens, connector API calls, and compute for each run
- Per-agent cost -- aggregated spend across all runs for an agent, with daily/weekly/monthly rollups
- Per-model cost -- breakdown of spend by model provider and model tier (e.g., how much went to Claude Opus vs. GPT-4o-mini)
- Per-tenant cost -- total platform spend for billing and chargeback
Cost data flows from the model router (which records token counts and pricing) and the billing service (which aggregates and attributes).
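The sketch below illustrates the general idea of turning token counts and pricing into per-run cost and then rolling it up by agent, model, or tenant. It covers only the LLM token portion of cost (not connector API calls or compute). The price table, model names, and record fields are hypothetical placeholders, not Lantern's billing schema or real provider pricing.

```python
from collections import defaultdict
from typing import Dict, Iterable

# Hypothetical USD prices per 1M tokens as (input, output). Not real pricing.
PRICES = {
    "model-a": (3.00, 15.00),
    "model-b": (0.15, 0.60),
}


def llm_cost(model: str, input_tokens: int, output_tokens: int) -> float:
    """USD cost of one LLM call, given token counts and per-million-token prices."""
    in_price, out_price = PRICES[model]
    return (input_tokens * in_price + output_tokens * out_price) / 1_000_000


def rollup(runs: Iterable[dict], key: str) -> Dict[str, float]:
    """Sum LLM cost across runs, grouped by one field (e.g. 'agent', 'model', 'tenant')."""
    totals: Dict[str, float] = defaultdict(float)
    for run in runs:
        totals[run[key]] += llm_cost(
            run["model"], run["input_tokens"], run["output_tokens"]
        )
    return dict(totals)
```

Calling `rollup(runs, "agent")` and `rollup(runs, "model")` on the same records yields the per-agent and per-model views. The same attribution data is grouped along a different dimension each time.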
Model usage tracking
The model usage panel shows which models your agents are actually using after routing:
- Model distribution -- pie chart of requests by concrete model (Claude 3.5 Sonnet, GPT-4o, Gemini Pro, etc.)
- Routing decisions -- how the model router resolved capability requests (e.g., reasoning-large mapped to Claude Opus 72% of the time and GPT-4o 28% of the time)
- Token consumption -- input and output tokens per model, per agent, per time period
- Strategy effectiveness -- compare outcomes across the four routing strategies (balanced, cheap, quality, fast) to find the optimal setting for each agent
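To make the routing-decision percentages concrete, here is a minimal sketch that derives the share of each concrete model per capability request from a log of routing decisions. The `(capability, model)` pair format and the model identifiers are assumptions for illustration. The model router's actual records may look different.

```python
from collections import Counter, defaultdict
from typing import Dict, Iterable, Tuple


def routing_shares(
    decisions: Iterable[Tuple[str, str]]
) -> Dict[str, Dict[str, float]]:
    """For each capability, return the percentage of requests resolved to each model."""
    counts: Dict[str, Counter] = defaultdict(Counter)
    for capability, model in decisions:
        counts[capability][model] += 1

    shares: Dict[str, Dict[str, float]] = {}
    for capability, by_model in counts.items():
        total = sum(by_model.values())
        shares[capability] = {m: 100.0 * n / total for m, n in by_model.items()}
    return shares
```

For example, 18 requests resolved to one model and 7 to another under the same capability give shares of 72% and 28%.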
Quality signals
Beyond raw metrics, Lantern surfaces quality signals that help you understand whether your agents are doing the right thing:
- Session satisfaction -- for interactive sessions, track whether users continue the conversation (engaged) or abandon it (dissatisfied)
- Guardrail triggers -- how often guardrails fire, which rules trigger most, and whether blocked outputs indicate a prompt issue
- Retry rate -- how often steps need to be retried due to transient failures, and which connectors or models are least reliable
- Version comparison -- compare metrics between agent versions to validate that changes improve (or at least do not degrade) performance
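Two of these signals reduce to simple arithmetic, sketched below: a retry rate over step records, and a version comparison that flags a metric as regressed when it moves in the wrong direction. The record shape (`attempts`), the metric names, and the `higher_is_better` map are hypothetical and are not part of Lantern's API.

```python
from typing import Dict, List


def retry_rate(steps: List[dict]) -> float:
    """Fraction of steps that needed more than one attempt."""
    if not steps:
        return 0.0
    retried = sum(1 for s in steps if s["attempts"] > 1)
    return retried / len(steps)


def compare_versions(
    baseline: Dict[str, float],
    candidate: Dict[str, float],
    higher_is_better: Dict[str, bool],
) -> Dict[str, dict]:
    """Per-metric delta (candidate minus baseline) and whether the candidate regressed."""
    report = {}
    for name, base in baseline.items():
        delta = candidate[name] - base
        # A drop is a regression for "higher is better" metrics; a rise is one otherwise.
        regressed = delta < 0 if higher_is_better[name] else delta > 0
        report[name] = {"delta": delta, "regressed": regressed}
    return report
```

A comparison like this treats any movement in the wrong direction as a regression. In practice you would likely add a tolerance so that normal run-to-run noise does not flag a new version.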
Setting up alerts (future)
Alert configuration is on the roadmap. Planned capabilities include:
- Success rate threshold -- alert when an agent's success rate drops below a configurable percentage
- Cost spike -- alert when per-run or per-day cost exceeds a threshold
- Latency degradation -- alert when p95 latency exceeds a target for a sustained period
- Delivery channels -- email, Slack, PagerDuty, and webhook
OTel integration
Every service in Lantern emits OpenTelemetry traces with standard attributes: tenant_id, run_id, step_id, and agent_version. You can export these to any OTel-compatible backend (Jaeger, Datadog, Grafana Tempo, Honeycomb) for deep-dive debugging alongside the built-in evaluations dashboard.
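Because every span carries the same attributes, spans emitted by different services can be stitched back into a per-run timeline once they are exported. The sketch below shows that idea on plain dictionaries. The span shape (`name`, `start_ns`, `attributes`) is an assumption for illustration, not the format of any particular backend or exporter.

```python
from collections import defaultdict
from typing import Dict, List


def spans_by_run(spans: List[dict], tenant_id: str) -> Dict[str, List[dict]]:
    """Group one tenant's spans by run_id, ordered by start time within each run."""
    grouped: Dict[str, List[dict]] = defaultdict(list)
    for span in spans:
        attrs = span["attributes"]
        if attrs.get("tenant_id") == tenant_id:
            grouped[attrs["run_id"]].append(span)
    for run_spans in grouped.values():
        run_spans.sort(key=lambda s: s["start_ns"])
    return dict(grouped)
```

Most OTel-compatible backends offer the same grouping through their own query language, typically by filtering on the run_id attribute.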