Service Cards

Service Cards summarize operational health for each critical service. Read them first during incidents, rollout checks, and daily reliability reviews.

Card Anatomy

A service card should give a fast answer to one question: can this service safely support current traffic and release activity?

Each card should include: service name, current state, key metrics, last update timestamp, impact scope, and clear ownership.

Service Card Shape

{
  "service": "opta-daemon",
  "state": "degraded",
  "lastUpdated": "2026-03-04T10:58:00+11:00",
  "signals": {
    "errorRatePct": 3.1,
    "p95LatencyMs": 790,
    "availabilityPct": 99.2
  },
  "impact": "CLI sessions intermittently fail to start",
  "owner": "runtime-platform"
}
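As one way to consume this shape, the sketch below parses a card and rejects values that would make it unsafe to act on. The `ServiceCard` class, `parse_card` function, and `VALID_STATES` set are hypothetical names chosen for illustration; only the field names come from the example above.

```python
import json
from dataclasses import dataclass
from datetime import datetime

# The four states this page describes, lowercased as in the example card.
VALID_STATES = {"healthy", "degraded", "down", "unknown"}

EXAMPLE = """
{
  "service": "opta-daemon",
  "state": "degraded",
  "lastUpdated": "2026-03-04T10:58:00+11:00",
  "signals": {"errorRatePct": 3.1, "p95LatencyMs": 790, "availabilityPct": 99.2},
  "impact": "CLI sessions intermittently fail to start",
  "owner": "runtime-platform"
}
"""


@dataclass
class ServiceCard:
    service: str
    state: str
    last_updated: datetime
    signals: dict
    impact: str
    owner: str


def parse_card(raw: str) -> ServiceCard:
    """Parse a JSON card; a missing field raises KeyError."""
    data = json.loads(raw)
    state = str(data["state"]).lower()
    if state not in VALID_STATES:
        raise ValueError(f"unrecognised state: {data['state']!r}")
    last_updated = datetime.fromisoformat(data["lastUpdated"])
    # Assumption: a timestamp without a UTC offset is ambiguous, so reject it.
    if last_updated.tzinfo is None:
        raise ValueError("lastUpdated must carry a UTC offset")
    return ServiceCard(
        service=data["service"],
        state=state,
        last_updated=last_updated,
        signals=dict(data["signals"]),
        impact=data["impact"],
        owner=data["owner"],
    )
```

Calling `parse_card(EXAMPLE)` returns a `ServiceCard` whose `last_updated` is a timezone-aware `datetime`, which makes later freshness checks straightforward.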

Health States

Healthy -- metrics within normal thresholds, no active user impact.
Degraded -- service is running but with measurable reliability or performance risk.
Down -- unavailable or non-functional for normal workflows.
Unknown -- status source missing, stale, or inconsistent.
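
The state definitions above could be turned into a small classifier along the following lines. This is a sketch only: the threshold constants, the `reachable` flag, and the `classify` function are assumptions made for illustration, and real thresholds should come from each service's own baseline.

```python
from enum import Enum
from typing import Optional


class HealthState(Enum):
    HEALTHY = "healthy"
    DEGRADED = "degraded"
    DOWN = "down"
    UNKNOWN = "unknown"


# Hypothetical thresholds; this page does not define concrete values.
MAX_ERROR_RATE_PCT = 1.0
MAX_P95_LATENCY_MS = 500
MIN_AVAILABILITY_PCT = 99.9

REQUIRED_SIGNALS = ("errorRatePct", "p95LatencyMs", "availabilityPct")


def classify(signals: Optional[dict], reachable: bool = True) -> HealthState:
    """Map raw signals to one of the four health states."""
    # Missing or incomplete status source: do not guess.
    if signals is None or any(key not in signals for key in REQUIRED_SIGNALS):
        return HealthState.UNKNOWN
    # Assumption: `reachable` is the result of a direct health check.
    if not reachable:
        return HealthState.DOWN
    within_thresholds = (
        signals["errorRatePct"] <= MAX_ERROR_RATE_PCT
        and signals["p95LatencyMs"] <= MAX_P95_LATENCY_MS
        and signals["availabilityPct"] >= MIN_AVAILABILITY_PCT
    )
    return HealthState.HEALTHY if within_thresholds else HealthState.DEGRADED
```

With these assumed thresholds, the signals from the example card (3.1% errors, 790 ms p95, 99.2% availability) classify as degraded.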

Triage Flow

Check freshness first. If the card is stale, re-validate before acting.
Confirm whether impact is internal-only or user-facing.
Compare the metric trend against a known baseline, not just the current value.
Escalate if a Degraded or Down state persists beyond your service's SLO window.
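
The freshness, baseline, and escalation checks in this flow can be sketched as three small helpers. The function names, the five-minute default for staleness, and the idea of passing the SLO window as a `timedelta` are all assumptions for illustration; the page itself does not prescribe values.

```python
from datetime import datetime, timedelta


def is_stale(last_updated: datetime, now: datetime,
             max_age: timedelta = timedelta(minutes=5)) -> bool:
    """True if the card is older than max_age (hypothetical default: 5 min)."""
    return now - last_updated > max_age


def relative_change(current: float, baseline: float) -> float:
    """Fractional deviation from a known baseline, e.g. 0.25 means +25%."""
    if baseline == 0:
        raise ValueError("baseline must be non-zero")
    return (current - baseline) / baseline


def should_escalate(state: str, state_since: datetime, now: datetime,
                    slo_window: timedelta) -> bool:
    """True if a degraded or down state has outlasted the SLO window."""
    return state in {"degraded", "down"} and now - state_since > slo_window
```

Both datetimes should be timezone-aware so that subtraction works across offsets. For example, a card last updated at 10:58 and read at 11:10 is stale under the assumed five-minute limit, so it should be re-validated before any of the later checks are trusted.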

Recommended Actions

State	Do Now	Do Next
Healthy	Continue planned rollout	Monitor trend drift
Degraded	Pause risky deploys	Assign owner and mitigation ETA
Down	Declare incident, route traffic if possible	Begin restore + postmortem trail
Unknown	Collect direct health checks	Fix observability gap
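
If the table is needed in tooling (for example, to annotate a card with its next steps), it can be encoded as a plain lookup. The `RECOMMENDED_ACTIONS` mapping and `recommended_actions` function are hypothetical names; treating an unrecognised state as Unknown is a design assumption, not something this page specifies.

```python
# Each state maps to a (do_now, do_next) pair, copied from the table above.
RECOMMENDED_ACTIONS = {
    "healthy": ("Continue planned rollout", "Monitor trend drift"),
    "degraded": ("Pause risky deploys", "Assign owner and mitigation ETA"),
    "down": ("Declare incident, route traffic if possible",
             "Begin restore + postmortem trail"),
    "unknown": ("Collect direct health checks", "Fix observability gap"),
}


def recommended_actions(state: str) -> tuple:
    """Return (do_now, do_next) for a state, case-insensitively."""
    # Assumption: an unrecognised state is handled like Unknown.
    key = state.strip().lower()
    return RECOMMENDED_ACTIONS.get(key, RECOMMENDED_ACTIONS["unknown"])
```

Falling back to the Unknown row keeps the caller on the cautious path: collect direct health checks before acting on a state the tooling does not understand.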

Update Discipline

Keep updates short and timestamped, and name the owner explicitly. When a card changes state, include one concrete next step so readers can continue work without re-triaging the same issue.

Use explicit time windows

Prefer exact timestamps and expected re-check windows over words like "soon" or "later". Actionable status requires precise timing.
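
A status update that follows these rules might be generated as in the sketch below. The `format_update` function, its parameters, and the message layout are illustrative assumptions, not a prescribed format.

```python
from datetime import datetime, timedelta


def format_update(service: str, state: str, owner: str, next_step: str,
                  at: datetime, recheck_in: timedelta) -> str:
    """Build a one-line update with an exact timestamp and re-check time."""
    recheck_at = at + recheck_in
    stamp = at.isoformat(timespec="minutes")
    recheck = recheck_at.isoformat(timespec="minutes")
    return (f"[{stamp}] {service} is {state} (owner: {owner}). "
            f"Next: {next_step}. Re-check by {recheck}.")


update = format_update(
    service="opta-daemon",
    state="degraded",
    owner="runtime-platform",
    next_step="pause risky deploys",  # hypothetical next step
    at=datetime.fromisoformat("2026-03-04T10:58:00+11:00"),
    recheck_in=timedelta(minutes=30),  # hypothetical re-check window
)
```

Computing the re-check time from the update timestamp, rather than typing it by hand, keeps the two values consistent and avoids wording like "soon" creeping in.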
