Why Observability Is a Platform Product
Golden paths are incomplete without baked-in visibility. Treat logging, metrics, tracing, and alerting as core platform features so every workload ships with a consistent, diagnosable footprint. Platform teams own the defaults and guardrails; product teams extend them only when necessary.
Principles
- Convention over configuration: Every golden path emits the same baseline signals (HTTP metrics, error counts, latency percentiles, key business events).
- SLO-first thinking: Define SLO templates with suggested SLIs (availability, latency, freshness) per workload type (API, batch, event).
- Self-service with safety: Teams can add custom metrics and alerts, but platform policies prevent noisy or runaway configurations.
- Actionable by design: Alerts map to runbooks and dashboards; paging is rare and targeted.
GCP Building Blocks
| Capability | Service | Platform stance |
|---|---|---|
| Metrics & dashboards | Cloud Monitoring | Ship baseline dashboards per golden path; version via IaC. |
| Logs | Cloud Logging | Standardize labels (service, env, team, cost-center); set retention defaults. |
| Traces | Cloud Trace | Auto-enable for supported runtimes; require trace context propagation. |
| Errors | Error Reporting | Default alerting on new error types; tie to SLO burn alerts where relevant. |
| Alerting | Cloud Monitoring alerting | Opinionated policies: SLO burn-rate, high error rate, saturation signals. |
| SLOs | SLO + burn-rate alerts | Templates per workload archetype; codified in Terraform/Config Controller. |
Implementation Pattern
- Codify the contract
- Baseline telemetry schema: required labels (
service,env,team,owner,cost_center). - Default retention and export paths (e.g., logs to BigQuery or Pub/Sub for SIEM).
- SLO templates: API latency/availability, async job success/freshness, event consumer lag.
- Baseline telemetry schema: required labels (
- Ship with the golden paths
- Terraform modules create Cloud Monitoring dashboards, SLOs, and alert policies per service instance.
- Sidecars/SDK config bake in trace propagation and log correlation IDs.
- CI enforces required labels and rejects unstructured or oversized logs.
- Automate alert quality
- Use multi-window, multi-burn SLO alerts to reduce flapping.
- Require ownership tags and runbook links on every alert policy.
- Rate-limit non-page alerts to inbox/slack; page only on customer-impacting burn.
- Make it extensible
- Allow teams to add custom metrics and dashboards via the same Terraform modules.
- Provide patterns for business events (e.g., order_created) to be logged consistently and surfaced on team dashboards.
Success Metrics
- Coverage: % of services on golden paths with platform dashboards and SLOs.
- Signal quality: alert-to-incident ratio, paging noise reduction over time.
- MTTR/MTTA: improvement after adopting platform observability.
- Adoption: number of teams extending platform dashboards via modules, not ad-hoc configs.
- Compliance: % of logs with required labels; % of alerts with owners/runbooks.
Closing Thought
Observability is not a bolt-on—it is part of the platform’s product surface. By shipping curated telemetry, SLOs, and alerts with every golden path, you give teams fast diagnostics, consistent reliability signals, and fewer pages.