Debugging in Production and Distributed Systems - Lesson 6.2

How distributed tracing works and how to use it for debugging

distributed tracing, trace ID, span, OpenTelemetry, trace visualization, latency analysis across services

Logs Alone Fail in Distributed Systems

In a system with multiple services, a single user request creates log entries across several services. Correlation IDs help you stitch those entries together, but distributed tracing goes further: it produces a visual timeline of every service call involved in a request, with durations and dependencies. You can see which service and which call account for the time in a slow request, or where a failing request broke.

Key Concepts

A trace represents one complete user request end-to-end. A span represents one unit of work within that request - one service call, one DB query. Spans nest (a parent span contains child spans): every span carries the trace ID shared by the whole request, and each child span also records its parent's span ID. OpenTelemetry is a vendor-neutral open standard for instrumentation, with API and SDK implementations for most major languages.
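To make the trace/span relationship concrete, here is a minimal sketch that rebuilds one trace from a flat list of span records. The record shape (`traceId`, `spanId`, `parentId`, `durationMs`) and the sample data are illustrative assumptions, not an OpenTelemetry export format.

```javascript
// Hypothetical flat span records, roughly what a tracing backend stores.
// Field names and values are illustrative assumptions.
const spans = [
  { traceId: 't1', spanId: 'a', parentId: null, name: 'POST /checkout', durationMs: 480 },
  { traceId: 't1', spanId: 'b', parentId: 'a', name: 'processPayment', durationMs: 350 },
  { traceId: 't1', spanId: 'c', parentId: 'b', name: 'chargeCard', durationMs: 310 },
  { traceId: 't1', spanId: 'd', parentId: 'a', name: 'SELECT orders', durationMs: 40 },
  { traceId: 't2', spanId: 'e', parentId: null, name: 'GET /health', durationMs: 2 },
];

// Rebuild the tree for one trace: select spans by trace ID,
// then attach each span to the span named by its parentId.
function buildTrace(allSpans, traceId) {
  const nodes = new Map();
  for (const s of allSpans) {
    if (s.traceId === traceId) nodes.set(s.spanId, { ...s, children: [] });
  }
  let root = null;
  for (const node of nodes.values()) {
    const parent = node.parentId ? nodes.get(node.parentId) : null;
    if (parent) parent.children.push(node);
    else root = node; // no parent within this trace: treat as the root span
  }
  return root;
}

// Render the tree as an indented outline, one line per span.
function render(node, depth = 0) {
  const line = `${'  '.repeat(depth)}${node.name} (${node.durationMs} ms)`;
  return [line, ...node.children.flatMap((c) => render(c, depth + 1))];
}

console.log(render(buildTrace(spans, 't1')).join('\n'));
```

The span from trace `t2` is ignored when rebuilding `t1`: the trace ID groups spans into a request, and the parent ID gives the nesting.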

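// Node.js with OpenTelemetry
// Note: @opentelemetry/api is a no-op on its own. An SDK (for example
// @opentelemetry/sdk-node) must be registered at startup to record spans.
const { trace, SpanStatusCode } = require('@opentelemetry/api');
const tracer = trace.getTracer('payment-service');

// chargeCard is assumed to be defined elsewhere in the service.
async function processPayment(order) {
  // startActiveSpan makes this span the current one, so spans created
  // inside chargeCard are recorded as its children.
  return tracer.startActiveSpan('processPayment', async (span) => {
    span.setAttribute('order.id', order.id);
    span.setAttribute('order.amount', order.amount);

    try {
      const result = await chargeCard(order);
      span.setStatus({ code: SpanStatusCode.OK });
      return result;
    } catch (err) {
      // Attach the exception to the span and mark the span as failed.
      span.recordException(err);
      span.setStatus({ code: SpanStatusCode.ERROR, message: err.message });
      throw err;
    } finally {
      // Always end the span, or it is never exported.
      span.end();
    }
  });
}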
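For spans from different services to share a trace ID, the ID has to travel with the request. OpenTelemetry's default propagation format is the W3C Trace Context `traceparent` HTTP header, which instrumentation libraries normally read and write for you. The sketch below parses that header by hand only to show what it carries; `parseTraceparent` is a hypothetical helper, and it accepts only the four-field layout.

```javascript
// Parse a W3C Trace Context `traceparent` header value.
// Layout: version "-" trace-id (32 hex) "-" parent-id (16 hex) "-" flags (2 hex)
// Hypothetical helper for illustration; real services rely on the SDK's propagator.
function parseTraceparent(header) {
  const match = /^([0-9a-f]{2})-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})$/.exec(header);
  if (!match) return null;

  const [, version, traceId, parentSpanId, flags] = match;
  if (version === 'ff') return null; // version ff is reserved as invalid
  // All-zero trace or span IDs are invalid.
  if (/^0+$/.test(traceId) || /^0+$/.test(parentSpanId)) return null;

  return {
    version,
    traceId,      // shared by every span in the request
    parentSpanId, // the caller's span, which becomes the parent of the next span
    sampled: (parseInt(flags, 16) & 1) === 1, // lowest flag bit: caller recorded this trace
  };
}

// Example header value taken from the W3C Trace Context specification.
console.log(parseTraceparent('00-4bf92f3577b34da6a3ce929d0e0e4736-00f067aa0ba902b7-01'));
```

If a service in the call chain drops this header, the downstream spans start a new trace, which shows up as a request whose trace ends abruptly.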

Trace data is exported to a tracing backend (Jaeger, Zipkin, or a managed service), often by way of the OpenTelemetry Collector, and visualized as a timeline of nested spans, similar to a flame graph. Look for spans that are unexpectedly long, missing, or producing errors.
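One way to find the span that is "unexpectedly long" is to rank spans by self time: a span's duration minus the time spent in its children. The sketch below does this on a hypothetical span tree. It assumes child spans run one after another; with parallel children, summing their durations overstates the time spent, so the result is clamped at zero.

```javascript
// Hypothetical span tree for one trace; durations are in milliseconds.
const checkoutTrace = {
  name: 'POST /checkout', durationMs: 480, children: [
    { name: 'processPayment', durationMs: 350, children: [
      { name: 'chargeCard', durationMs: 310, children: [] },
    ] },
    { name: 'SELECT orders', durationMs: 40, children: [] },
  ],
};

// Collect the self time of every span in the tree.
// Assumes children do not overlap in time (sequential calls).
function selfTimes(node, out = []) {
  const childTotal = node.children.reduce((sum, c) => sum + c.durationMs, 0);
  out.push({ name: node.name, selfMs: Math.max(0, node.durationMs - childTotal) });
  node.children.forEach((c) => selfTimes(c, out));
  return out;
}

// Largest self time first: the top entry is where the request actually spent its time.
const ranked = selfTimes(checkoutTrace).sort((a, b) => b.selfMs - a.selfMs);
console.log(ranked);
```

Here the root span is the longest at 480 ms, but most of that time belongs to `chargeCard`, which ranks first by self time. That is the span to investigate.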