Observability (the ability to understand what your system is doing from its external outputs) has evolved from an afterthought to a core part of production engineering. Here is the current state of the practice and what actually matters.
The Three Pillars
Logs: structured records of individual events (what happened, when, and what data was associated with it). Key principle: structured logging (JSON output) rather than plain text is now the default for production systems. Structured logs can be filtered and aggregated by field; plain text has to be parsed with brittle pattern matching first. A log entry should contain: timestamp, log level (DEBUG/INFO/WARN/ERROR), service name, request ID, user ID if applicable, and the event data. The anti-patterns run in both directions: overly verbose logging (logging every SQL query in production) creates cost without insight, while logging too little means you can't debug production issues. The practical level: INFO for significant events (request received, order placed), WARN for recoverable issues (retry succeeded), ERROR for unrecoverable failures (payment failed).
Metrics: numerical measurements over time, such as request rates, error rates, latency percentiles, and resource usage. The key insight: metrics are cheap to store and process at scale, and they answer "is something wrong?" quickly. Standard metrics: request rate (requests per second), error rate (% of requests returning 5xx), latency (p50, p95, p99, not the average, which hides outliers), and saturation (are you near capacity?). These four correspond to the "four golden signals" (latency, traffic, errors, saturation) and cover the basics for any service. Two related checklists overlap with them: RED (Rate, Errors, Duration) for request-driven services, and USE (Utilisation, Saturation, Errors) for resources such as CPU, memory, and disks.
Traces: a record of a request's path through a distributed system, showing which services it passed through, how long each took, and where errors occurred. Traces are the key tool for microservices: when a request fails after passing through 8 services, the trace tells you which service caused the failure and how long each step took. OpenTelemetry is now the standard: a vendor-neutral set of APIs, SDKs, and a wire protocol (OTLP) for collecting and exporting traces, metrics, and logs from applications.
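To make the structured-logging advice concrete, here is a minimal sketch using only Python's standard library. It emits one JSON object per log line with the fields listed above. The names `JsonFormatter`, `build_logger`, the `checkout` service, and the field layout are illustrative assumptions, not a standard schema; in a real service you would more likely use an established JSON logging library or the OpenTelemetry logging integration.

```python
import json
import logging
from datetime import datetime, timezone


class JsonFormatter(logging.Formatter):
    """Render each log record as a single JSON object (hypothetical schema)."""

    def __init__(self, service: str) -> None:
        super().__init__()
        self.service = service

    def format(self, record: logging.LogRecord) -> str:
        entry = {
            "timestamp": datetime.fromtimestamp(
                record.created, tz=timezone.utc
            ).isoformat(),
            "level": record.levelname,
            "service": self.service,
            "message": record.getMessage(),
            # Values passed via `extra=` become attributes on the record.
            "request_id": getattr(record, "request_id", None),
            "user_id": getattr(record, "user_id", None),
            "event": getattr(record, "event", {}),
        }
        return json.dumps(entry)


def build_logger(service: str, stream=None) -> logging.Logger:
    """Return an INFO-level logger writing JSON lines (stderr by default)."""
    logger = logging.getLogger(service)
    logger.setLevel(logging.INFO)  # DEBUG is dropped, as suggested for production
    logger.handlers.clear()
    handler = logging.StreamHandler(stream)
    handler.setFormatter(JsonFormatter(service))
    logger.addHandler(handler)
    logger.propagate = False
    return logger


if __name__ == "__main__":
    log = build_logger("checkout")
    log.info(
        "order placed",
        extra={
            "request_id": "req-123",
            "user_id": "u-42",
            "event": {"order_id": "o-9", "items": 3},
        },
    )
```

Because every entry carries `request_id` as its own field, a log backend can filter on it directly instead of searching free text. Note that Python's standard library labels the WARN level as `WARNING`.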
The Tools Stack in 2026
For small-to-medium organisations, the most practical stack is Grafana (dashboards and visualisation, free and open source) + Prometheus (metrics collection and storage, open source) + Loki (log aggregation, Grafana's log product) + Tempo (distributed tracing, Grafana's trace product). This "Grafana stack" is production-grade, has an active open source community, and avoids proprietary lock-in. Cloud-native alternatives: AWS CloudWatch + X-Ray, Google Cloud Monitoring + Trace, Azure Monitor + Application Insights. These are integrated with their respective cloud providers and take less effort to set up, but they are proprietary and potentially expensive at scale. The commercial observability platforms (Datadog, New Relic, Honeycomb, Dynatrace) add capabilities like automatic anomaly detection, AI-assisted troubleshooting, and better user interfaces, but at significant cost. Datadog costs approximately $23/host/month at standard pricing, so a 50-host fleet comes to $1,150/month. At scale, observability cost management becomes a discipline in itself.
What Actually Matters in Practice
Instrumentation at the start: adding observability to an existing system is harder than building it in from the beginning. Add the OpenTelemetry SDK at project start and use structured logging from day one.
SLOs before dashboards: a Service Level Objective (SLO) is a target for your service's reliability ("99.9% of requests succeed within 200ms"). Defining SLOs before building dashboards focuses you on what matters, not what's interesting.
The on-call playbook: every alert should have a runbook, a document describing what the alert means and what steps to take to resolve it. Alerts without runbooks are noise.
Alert fatigue: when there are too many alerts, engineers learn to ignore them. Alert only on things that require human action, and use SLO-based alerting (alert when your error budget is burning down fast) rather than simple threshold alerts (alert when error rate > 1%), which are too noisy.
Correlation: the value of observability lies in being able to correlate a metric anomaly with the trace and logs that explain it. Request IDs linking logs and traces are the basic mechanism.
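The SLO-based alerting mentioned above can be sketched as a burn-rate check. The burn rate is the observed error rate divided by the error budget (1 minus the SLO target); a burn rate of 1 spends the budget exactly over the SLO period. The sketch below pages only when both a long and a short window burn fast, so a brief spike does not wake anyone. The names `WindowCounts`, `burn_rate`, and `should_page` are hypothetical, and the 14.4 threshold (about 2% of a 30-day budget spent in one hour) is an assumption borrowed from commonly cited multi-window practice, not a value from this article.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class WindowCounts:
    """Request counts observed in one time window (e.g. last hour)."""
    total: int
    failed: int


def burn_rate(counts: WindowCounts, slo_target: float) -> float:
    """Observed error rate as a multiple of the allowed error rate."""
    if counts.total == 0:
        return 0.0  # no traffic, nothing is being burned
    error_budget = 1.0 - slo_target  # e.g. 0.001 for a 99.9% SLO
    return (counts.failed / counts.total) / error_budget


def should_page(
    long_window: WindowCounts,
    short_window: WindowCounts,
    slo_target: float = 0.999,
    threshold: float = 14.4,  # assumed value, tune per service
) -> bool:
    """Page only if both windows exceed the burn-rate threshold.

    The long window shows the problem is sustained; the short window
    shows it is still happening right now.
    """
    return (
        burn_rate(long_window, slo_target) >= threshold
        and burn_rate(short_window, slo_target) >= threshold
    )


if __name__ == "__main__":
    # Sustained 2% errors over the hour, 2.5% in the last five minutes.
    sustained = should_page(WindowCounts(100_000, 2_000), WindowCounts(8_000, 200))
    # Five-minute spike at 5% errors, but only 0.2% over the hour.
    spike = should_page(WindowCounts(100_000, 200), WindowCounts(8_000, 400))
    print(sustained, spike)
```

A plain "error rate > 1%" threshold would fire on the short spike in the second case; the burn-rate check stays quiet because the hourly window shows the budget is not being spent at a dangerous pace.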



