Monitoring and observability are often conflated, but they are distinct: monitoring tells you when something is wrong; observability helps you work out why. Both are essential for production systems. Observability is usually described in terms of three pillars: logs, metrics, and traces.
Logs
Logs are immutable, timestamped records of events that happened in your system. Web requests, errors, user actions, and internal state changes can all produce log entries. The challenge is volume: busy systems generate millions of log lines per hour, and finding the relevant ones is only practical with structured logging (JSON format with consistent field names such as user_id, request_id, and action) rather than free-form text strings. Tools: centralized logging with Elasticsearch, Logstash, and Kibana (the ELK stack), Loki + Grafana, or managed services like Datadog Logs, Sumo Logic, and Papertrail.
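As a minimal sketch of structured logging, the following uses only Python's standard logging and json modules to emit one JSON object per log line. The class name JsonFormatter, the helper build_logger, and the sample values are illustrative assumptions rather than part of any particular logging stack; the context fields are the ones named above.

```python
import json
import logging
import sys
from datetime import datetime, timezone


class JsonFormatter(logging.Formatter):
    """Render each log record as a single JSON object on one line."""

    # Context fields copied into the output when present (names from the text above).
    CONTEXT_FIELDS = ("user_id", "request_id", "action")

    def format(self, record):
        entry = {
            "timestamp": datetime.fromtimestamp(
                record.created, tz=timezone.utc
            ).isoformat(),
            "level": record.levelname,
            "message": record.getMessage(),
        }
        # Values passed via `extra=` become attributes on the record.
        for name in self.CONTEXT_FIELDS:
            if hasattr(record, name):
                entry[name] = getattr(record, name)
        return json.dumps(entry)


def build_logger(stream, name="app"):
    """Hypothetical helper: a logger that writes JSON lines to `stream`."""
    logger = logging.getLogger(name)
    logger.setLevel(logging.INFO)
    logger.handlers.clear()
    handler = logging.StreamHandler(stream)
    handler.setFormatter(JsonFormatter())
    logger.addHandler(handler)
    logger.propagate = False
    return logger


if __name__ == "__main__":
    log = build_logger(sys.stdout)
    log.info(
        "checkout started",
        extra={"user_id": 42, "request_id": "req-123", "action": "checkout"},
    )
```

Because every line is a JSON object with the same field names, a central log store can filter on request_id or user_id directly instead of relying on text search.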
Metrics
Metrics are numerical measurements over time: request rate, error rate, latency percentiles (p50, p95, p99), queue depth, CPU/memory usage. Unlike logs, which record individual events, metrics aggregate many events into a numeric series that is cheap to store and to alert on. Two frameworks help decide what to measure: the USE method (Utilisation, Saturation, Errors) for resources and the RED method (Rate, Errors, Duration) for services. Prometheus (open source) with Grafana dashboards is the dominant stack for metrics; alternatives include Datadog, New Relic, and InfluxDB.
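To make the RED method concrete, here is a small sketch that computes rate, error ratio, and latency percentiles from a window of request samples. The Request type, the red_summary function, the choice to count only 5xx statuses as errors, and the nearest-rank percentile definition are all assumptions made for illustration; a metrics system such as Prometheus typically estimates percentiles from histogram buckets rather than from raw samples.

```python
import math
from dataclasses import dataclass


@dataclass
class Request:
    duration_ms: float
    status: int  # HTTP status code


def percentile(values, pct):
    """Nearest-rank percentile: the smallest sample such that at least
    pct percent of all samples are less than or equal to it."""
    if not values:
        raise ValueError("no samples")
    ordered = sorted(values)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


def red_summary(requests, window_seconds):
    """Rate, Errors, Duration for one observation window."""
    if not requests:
        raise ValueError("no requests in window")
    durations = [r.duration_ms for r in requests]
    # Assumption: only server-side failures (5xx) count as errors.
    errors = sum(1 for r in requests if r.status >= 500)
    return {
        "rate_per_s": len(requests) / window_seconds,
        "error_ratio": errors / len(requests),
        "p50_ms": percentile(durations, 50),
        "p95_ms": percentile(durations, 95),
        "p99_ms": percentile(durations, 99),
    }


if __name__ == "__main__":
    # 100 synthetic requests over a 10-second window; the 5 slowest fail.
    sample = [
        Request(duration_ms=float(i), status=500 if i > 95 else 200)
        for i in range(1, 101)
    ]
    print(red_summary(sample, window_seconds=10))
```

Percentiles are reported instead of an average because a handful of very slow requests can hide behind a healthy-looking mean.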
Traces
Distributed tracing follows a request across multiple services as it moves through your system: from the front-end HTTP request through the API gateway, to service A, a database query, service B, and back. Each step is a “span” with its own timing; the collection of spans for one request is a “trace.” Traces are essential for microservices, where a user complaint (“my checkout is slow”) may need to be debugged across 5–10 services. OpenTelemetry (an open standard) for instrumentation, with Jaeger or Tempo as the backend, is the current best practice; Datadog APM and New Relic are managed alternatives.
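The span and trace relationship can be illustrated with a toy in-process model. This is deliberately not the OpenTelemetry API: the Tracer and Span classes, the handle_checkout function, and the span names are hypothetical, and a real tracer would also propagate the trace ID across network calls and export spans to a backend.

```python
import time
import uuid
from contextlib import contextmanager
from dataclasses import dataclass
from typing import Optional


@dataclass
class Span:
    trace_id: str             # shared by every span in the same request
    span_id: str
    parent_id: Optional[str]  # None for the root span
    name: str
    start: float
    end: Optional[float] = None

    @property
    def duration_ms(self):
        return (self.end - self.start) * 1000


class Tracer:
    """Toy tracer that records the spans of a single trace in memory."""

    def __init__(self):
        self.trace_id = uuid.uuid4().hex
        self.spans = []   # finished spans, in completion order
        self._stack = []  # currently open spans, innermost last

    @contextmanager
    def span(self, name):
        parent_id = self._stack[-1].span_id if self._stack else None
        current = Span(
            self.trace_id, uuid.uuid4().hex[:16], parent_id, name,
            time.perf_counter(),
        )
        self._stack.append(current)
        try:
            yield current
        finally:
            current.end = time.perf_counter()
            self._stack.pop()
            self.spans.append(current)


def handle_checkout(tracer):
    """Simulated request path: gateway -> service A -> database, then service B."""
    with tracer.span("api-gateway"):
        with tracer.span("service-a"):
            with tracer.span("db-query"):
                time.sleep(0.01)   # stand-in for a slow query
        with tracer.span("service-b"):
            time.sleep(0.005)


if __name__ == "__main__":
    tracer = Tracer()
    handle_checkout(tracer)
    for s in sorted(tracer.spans, key=lambda s: s.start):
        print(f"{s.name:12s} parent={s.parent_id} {s.duration_ms:.1f} ms")
```

Reading the output top-down shows which child span accounts for most of the parent's duration, which is the question a trace viewer answers for a slow request.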
The Practical Minimum
For small teams, three things cover most needs: structured logging to a central store (Papertrail free tier, or Loki if self-hosting), request rate and error rate metrics from your application framework (most expose these automatically), and uptime monitoring (UptimeRobot has a free tier) that alerts you when your service is unreachable. Add distributed tracing only when debugging cross-service latency becomes a recurring problem.




