Service Cards
Service cards summarize the operational health of each critical service. Read them first during incidents, rollout checks, and daily reliability reviews.
Card Anatomy
A service card should give a fast answer to one question: can this service safely support current traffic and release activity?
Each card should include: service name, current state, key metrics, last update timestamp, impact scope, and clear ownership.
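
As a minimal sketch, the fields listed above could be modeled as a small record type. The class name, field names, and the `missing_fields` helper are illustrative assumptions, not part of any existing tooling.

```python
from dataclasses import dataclass
from datetime import datetime, timezone
from typing import Dict, List


@dataclass
class ServiceCard:
    """One card per critical service. All names here are hypothetical."""
    service_name: str
    state: str                     # "Healthy", "Degraded", "Down", or "Unknown"
    key_metrics: Dict[str, float]  # e.g. {"error_rate": 0.002}
    last_updated: datetime         # timezone-aware timestamp of the last update
    impact_scope: str              # e.g. "internal-only" or "user-facing"
    owner: str                     # team or on-call responsible for the service

    def missing_fields(self) -> List[str]:
        """Return the names of required text fields that are blank."""
        required = {
            "service_name": self.service_name,
            "state": self.state,
            "impact_scope": self.impact_scope,
            "owner": self.owner,
        }
        return [name for name, value in required.items() if not value.strip()]


# Example with made-up values: a card published without an owner.
card = ServiceCard(
    service_name="checkout-api",
    state="Healthy",
    key_metrics={"error_rate": 0.002},
    last_updated=datetime(2024, 1, 1, tzinfo=timezone.utc),
    impact_scope="user-facing",
    owner="",
)
```

Calling `card.missing_fields()` on the example flags the blank `owner`, which is the kind of gap a card should not ship with.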
Health States
- Healthy -- metrics within normal thresholds, no active user impact.
- Degraded -- service is running but with measurable reliability or performance risk.
- Down -- unavailable or non-functional for normal workflows.
- Unknown -- status source missing, stale, or inconsistent.
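
One way to keep the four states mutually exclusive is to evaluate them in a fixed order. The sketch below assumes three boolean inputs (`source_ok`, `available`, `within_thresholds`) and assumes that an untrustworthy status source overrides everything else; both the inputs and the precedence are illustrative choices rather than rules stated here.

```python
from enum import Enum


class HealthState(Enum):
    HEALTHY = "Healthy"
    DEGRADED = "Degraded"
    DOWN = "Down"
    UNKNOWN = "Unknown"


def classify(source_ok: bool, available: bool, within_thresholds: bool) -> HealthState:
    """Map three hypothetical signals onto one of the four health states."""
    # Assumed precedence: a missing, stale, or inconsistent source cannot be trusted,
    # so it is reported as Unknown regardless of the other signals.
    if not source_ok:
        return HealthState.UNKNOWN
    # Unavailable or non-functional for normal workflows.
    if not available:
        return HealthState.DOWN
    # Running, but with measurable reliability or performance risk.
    if not within_thresholds:
        return HealthState.DEGRADED
    return HealthState.HEALTHY
```
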
Triage Flow
- Check freshness first. If the last update timestamp is stale, re-validate the card before acting.
- Confirm whether impact is internal-only or user-facing.
- Compare metric trend with known baseline, not just current value.
- Escalate if a Degraded or Down state persists beyond your service's SLO window.
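
The checks above can be sketched as a single function that runs them in order. Everything in it is an assumption made for illustration: the parameter names, the use of a simple mean over recent samples as the "trend", and the fixed tolerance around the baseline.

```python
from datetime import timedelta
from typing import List


def triage(
    card_age: timedelta,
    max_card_age: timedelta,
    user_facing: bool,
    recent_values: List[float],
    baseline: float,
    tolerance: float,
    unhealthy_for: timedelta,
    slo_window: timedelta,
) -> List[str]:
    """Return triage findings in the same order as the checks listed above."""
    if not recent_values:
        raise ValueError("recent_values must contain at least one sample")

    # 1. Freshness first: a stale card is re-validated before anything else.
    if card_age > max_card_age:
        return ["re-validate: card is stale"]

    findings = []

    # 2. Impact scope.
    findings.append("impact: user-facing" if user_facing else "impact: internal-only")

    # 3. Trend against baseline: average the recent window instead of
    #    trusting the latest sample alone.
    recent_mean = sum(recent_values) / len(recent_values)
    if abs(recent_mean - baseline) > tolerance:
        findings.append("trend: outside baseline")
    else:
        findings.append("trend: within baseline")

    # 4. Escalate once the unhealthy state has outlasted the SLO window.
    if unhealthy_for > slo_window:
        findings.append("escalate")
    return findings
```

Returning early on a stale card is deliberate in this sketch: the later checks read data from the card, so they are skipped until the card has been re-validated.
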
Recommended Actions
| State | Do Now | Do Next |
|---|---|---|
| Healthy | Continue planned rollout | Monitor trend drift |
| Degraded | Pause risky deploys | Assign owner and mitigation ETA |
| Down | Declare incident, route traffic if possible | Begin restore + postmortem trail |
| Unknown | Collect direct health checks | Fix observability gap |
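
If the table needs to drive tooling (for example, a chat bot that replies with the next step), it can be encoded as a plain lookup. The mapping below copies the table verbatim; the function name and the fallback to the Unknown row for unrecognized states are assumptions.

```python
from typing import Dict, Tuple

# state -> (do now, do next), copied from the table above.
RECOMMENDED_ACTIONS: Dict[str, Tuple[str, str]] = {
    "Healthy": ("Continue planned rollout", "Monitor trend drift"),
    "Degraded": ("Pause risky deploys", "Assign owner and mitigation ETA"),
    "Down": ("Declare incident, route traffic if possible", "Begin restore + postmortem trail"),
    "Unknown": ("Collect direct health checks", "Fix observability gap"),
}


def recommended_actions(state: str) -> Tuple[str, str]:
    """Return (do_now, do_next) for a state.

    Assumption: a state that is not in the table is treated like Unknown,
    since an unrecognized status is itself an observability gap.
    """
    return RECOMMENDED_ACTIONS.get(state, RECOMMENDED_ACTIONS["Unknown"])
```
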
Update Discipline
Keep updates short, timestamped, and ownership-explicit. If a card changes state, include one concrete next step so readers can immediately continue work without re-triaging the same issue.
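
A small formatter can enforce this discipline by refusing to produce an update that lacks an owner or a next step. This is a sketch under assumed conventions: the function name, the one-line layout, and the UTC timestamp format are all illustrative.

```python
from datetime import datetime, timezone
from typing import Optional


def format_update(
    service: str,
    old_state: str,
    new_state: str,
    owner: str,
    next_step: str,
    now: Optional[datetime] = None,
) -> str:
    """Build a short, timestamped, ownership-explicit state-change update."""
    if not owner.strip():
        raise ValueError("an update must name an owner")
    if not next_step.strip():
        raise ValueError("a state change needs one concrete next step")
    # Expect a timezone-aware datetime; normalize to UTC so the "Z" suffix is accurate.
    moment = (now or datetime.now(timezone.utc)).astimezone(timezone.utc)
    stamp = moment.strftime("%Y-%m-%d %H:%MZ")
    return (
        f"[{stamp}] {service}: {old_state} -> {new_state} "
        f"| owner: {owner} | next: {next_step}"
    )


# Example with made-up service, owner, and next step.
message = format_update(
    "checkout-api",
    "Healthy",
    "Degraded",
    "payments-oncall",
    "pause canary rollout",
    now=datetime(2024, 1, 1, 12, 30, tzinfo=timezone.utc),
)
```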