Advanced Distributed Systems Concepts
Lesson 6.4
How to design for observability — metrics, logging, and tracing
three pillars of observability, structured logging, distributed tracing, RED metrics, SLO vs SLA, alerting on symptoms not causes
Three Pillars of Observability
You can't fix what you can't see. Observability rests on three pillars: metrics, logs, and traces. Together they tell you what your system is doing and why it is misbehaving, ideally before customers report it.
Metrics — The RED Method
For every service, track:
- Rate: requests per second
- Errors: share of requests that fail (4xx and 5xx responses)
- Duration: request latency, reported as percentiles (p50, p95, p99) rather than an average
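As a rough illustration of the three RED numbers, the sketch below computes them from an in-memory list of request records for one time window. The record shape (`status`, `durationMs`), the function names, and the nearest-rank percentile method are assumptions made for this example; a production service would normally rely on a metrics library and histograms instead.

```javascript
// Nearest-rank percentile over an ascending-sorted array of numbers.
function percentile(sortedValues, p) {
  if (sortedValues.length === 0) return NaN;
  const rank = Math.ceil((p * sortedValues.length) / 100);
  return sortedValues[Math.max(rank, 1) - 1];
}

// Computes RED metrics for one window of requests.
// `isError` defaults to the definition in the list above (any 4xx or 5xx);
// some teams prefer to count only 5xx responses.
function redMetrics(requests, windowSeconds, isError = (r) => r.status >= 400) {
  const durations = requests.map((r) => r.durationMs).sort((a, b) => a - b);
  const errorCount = requests.filter(isError).length;
  return {
    rate: requests.length / windowSeconds, // requests per second
    errorRate: requests.length === 0 ? 0 : errorCount / requests.length,
    p50: percentile(durations, 50),
    p95: percentile(durations, 95),
    p99: percentile(durations, 99),
  };
}

// Usage with hypothetical data: four requests seen in a 60-second window.
const windowRequests = [
  { status: 200, durationMs: 12 },
  { status: 200, durationMs: 48 },
  { status: 404, durationMs: 9 },
  { status: 500, durationMs: 730 },
];
console.log(redMetrics(windowRequests, 60));
```

Passing a different `isError` function, such as `(r) => r.status >= 500`, lets you exclude client mistakes from the error rate without changing anything else.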
Structured Logging
// Bad: unstructured, hard to search or aggregate
console.log('User 123 failed to login')
// Good: structured JSON
// Assumes a JSON logger such as pino: const logger = require('pino')()
logger.error({
  event: 'login_failed',
  user_id: 123,
  reason: 'invalid_password',
  ip: '1.2.3.4',
  timestamp: new Date().toISOString()
})

Distributed Tracing
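A trace follows a single request across multiple services. Each service adds a span recording its start time, duration, and metadata. Common tools include Jaeger, Zipkin, and OpenTelemetry. The trace ID propagates between services in HTTP headers, for example a custom X-Trace-ID header or the W3C Trace Context traceparent header.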
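To make the propagation idea concrete, here is a minimal hand-rolled sketch of spans that share one trace ID across two services. It is not the API of Jaeger, Zipkin, or OpenTelemetry. The helper names and the `x-parent-span-id` header are hypothetical and exist only for this example; only the `X-Trace-ID` header comes from the text above.

```javascript
const TRACE_HEADER = 'x-trace-id';        // header from the text, lowercased
const PARENT_HEADER = 'x-parent-span-id'; // hypothetical header for this sketch

// Random hex ID; not cryptographically strong, which is fine for a demo.
function randomId(bytes) {
  let id = '';
  for (let i = 0; i < bytes; i++) {
    id += Math.floor(Math.random() * 256).toString(16).padStart(2, '0');
  }
  return id;
}

// Starts a span, reusing the caller's trace ID when one arrives in the headers.
function startSpan(name, incomingHeaders = {}) {
  return {
    traceId: incomingHeaders[TRACE_HEADER] || randomId(16),
    spanId: randomId(8),
    parentSpanId: incomingHeaders[PARENT_HEADER] || null,
    name: name,
    startTime: Date.now(),
    durationMs: null,
    metadata: {},
  };
}

// Records the span's duration once the work is done.
function endSpan(span) {
  span.durationMs = Date.now() - span.startTime;
  return span;
}

// Headers to attach to any downstream call made while this span is active.
function outgoingHeaders(span) {
  return { [TRACE_HEADER]: span.traceId, [PARENT_HEADER]: span.spanId };
}

// Usage: a gateway calls an orders service; both spans share one trace ID.
const gatewaySpan = startSpan('gateway');
const ordersSpan = startSpan('orders', outgoingHeaders(gatewaySpan));
endSpan(ordersSpan);
endSpan(gatewaySpan);
console.log(gatewaySpan.traceId === ordersSpan.traceId);
```

In practice a tracing SDK handles ID generation, header formats, and export to a backend; the sketch only shows why every hop must forward the trace ID it received.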
Alerting Philosophy
Alert on symptoms (error rate > 1%, p99 latency > 500ms), not causes (CPU > 80%). A CPU spike that doesn't affect user experience isn't an incident. An error rate spike that users experience immediately is.
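The distinction can be sketched as a small rule evaluator. The thresholds (error rate above 1%, p99 latency above 500 ms) come from the paragraph above; the rule structure, metric field names, and function name are assumptions made for illustration, not the format of any particular alerting tool.

```javascript
// Symptom-based rules: each one describes something users actually feel.
const SYMPTOM_RULES = [
  { name: 'high_error_rate', fires: (m) => m.errorRate > 0.01 },
  { name: 'high_p99_latency', fires: (m) => m.p99Ms > 500 },
];

// Returns the names of all rules that fire for the given metrics snapshot.
// Cause-level signals such as CPU usage are deliberately not consulted.
function evaluateAlerts(metrics, rules = SYMPTOM_RULES) {
  return rules.filter((rule) => rule.fires(metrics)).map((rule) => rule.name);
}

// Usage with hypothetical snapshots.
// CPU is busy but users are unaffected, so no rule fires:
console.log(evaluateAlerts({ errorRate: 0.002, p99Ms: 180, cpu: 0.93 }));
// CPU is idle but 3% of requests fail, so the error-rate rule fires:
console.log(evaluateAlerts({ errorRate: 0.03, p99Ms: 180, cpu: 0.2 }));
```

Cause-level metrics such as CPU still belong on dashboards, where they help explain an alert once a symptom has fired.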
