How distributed tracing works and how to use it for debugging
distributed tracing, trace ID, span, OpenTelemetry, trace visualization, latency analysis across services
Logs Alone Fail in Distributed Systems
In a system with multiple services, a single user request creates log entries across several services. Correlation IDs help, but distributed tracing goes further: it produces a visual timeline of every service call involved in a request, with durations and dependencies. You see exactly which service, which call, and which milliseconds are responsible for a slow or failing request.
Key Concepts
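A trace represents one complete user request end to end. A span represents one unit of work within that request, such as one service call or one database query. Spans nest (a parent span contains child spans): every span in a trace carries the same trace ID, and each child span also records its parent's span ID, which is how the call tree is reconstructed. OpenTelemetry is the vendor-neutral open standard for instrumentation, with APIs and SDKs for most major languages.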
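// Node.js with OpenTelemetry
// Assumes an OpenTelemetry SDK is registered at startup; without one,
// the API hands back no-op spans and nothing is exported.
const { trace, SpanStatusCode } = require('@opentelemetry/api');

const tracer = trace.getTracer('payment-service');

async function processPayment(order) {
  // startActiveSpan makes this span the active one, so spans created
  // inside chargeCard are recorded as its children.
  return tracer.startActiveSpan('processPayment', async (span) => {
    span.setAttribute('order.id', order.id);
    span.setAttribute('order.amount', order.amount);
    try {
      const result = await chargeCard(order);
      span.setStatus({ code: SpanStatusCode.OK });
      return result;
    } catch (err) {
      // Attach the exception as a span event and mark the span as failed.
      span.recordException(err);
      span.setStatus({ code: SpanStatusCode.ERROR, message: err.message });
      throw err;
    } finally {
      // Always end the span, or it is never exported.
      span.end();
    }
  });
}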
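To make the "shared trace ID" idea concrete, here is a minimal sketch of how the ID can travel from one service to the next in the W3C Trace Context `traceparent` header, which is the format OpenTelemetry propagates by default. It uses only Node's built-in `crypto` module. The helper names (`randomHex`, `buildTraceparent`, `parseTraceparent`) and the checkout/payment scenario are assumptions made for illustration; in practice the OpenTelemetry SDK and its instrumentation libraries normally do this for you.

```javascript
// Sketch of W3C Trace Context propagation with no OpenTelemetry dependency.
const crypto = require('crypto');

// Hypothetical helper: random lowercase hex string of the given byte length.
function randomHex(bytes) {
  return crypto.randomBytes(bytes).toString('hex');
}

// Build a traceparent header: version-traceId-parentSpanId-flags.
function buildTraceparent(traceId, spanId, sampled = true) {
  return `00-${traceId}-${spanId}-${sampled ? '01' : '00'}`;
}

// Parse a traceparent header; returns null when it is malformed.
function parseTraceparent(header) {
  const match = /^([0-9a-f]{2})-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})$/.exec(header);
  if (!match) return null;
  const [, version, traceId, parentSpanId, flags] = match;
  // All-zero IDs are invalid.
  if (/^0+$/.test(traceId) || /^0+$/.test(parentSpanId)) return null;
  return { version, traceId, parentSpanId, sampled: (parseInt(flags, 16) & 1) === 1 };
}

// Caller side: a checkout service starts a trace and calls the payment service.
const traceId = randomHex(16);        // 32 hex chars, shared by every span in the trace
const checkoutSpanId = randomHex(8);  // 16 hex chars, unique to this span
const outgoingHeaders = { traceparent: buildTraceparent(traceId, checkoutSpanId) };

// Callee side: the payment service continues the same trace with a new child span.
const incoming = parseTraceparent(outgoingHeaders.traceparent);
const paymentSpan = {
  traceId: incoming.traceId,            // same trace ID as the caller
  spanId: randomHex(8),                 // new ID for this unit of work
  parentSpanId: incoming.parentSpanId,  // links the child to the caller's span
  name: 'processPayment',
};
```

The callee never generates its own trace ID when a valid header arrives; it only generates a new span ID. If the header is dropped somewhere along the way (a proxy strips it, or a queue consumer ignores it), the downstream service starts a fresh trace and the request shows up as two unrelated traces.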
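Trace data is exported to a tracing backend (Jaeger, Zipkin, or a managed service), often by way of the OpenTelemetry Collector, and visualized as a waterfall timeline or flame graph. Look for spans that are unexpectedly long, spans that are missing (often a sign of uninstrumented code or lost context propagation), and spans that record errors.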
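The same checks can be run programmatically once span data is available as plain records. The following is a rough sketch, not any backend's real export format: the record shape (`spanId`, `parentSpanId`, `startMs`, `endMs`, `error`) and the sample values are invented for illustration. It ranks spans by self time (duration minus time spent in direct children), lists spans with errors, and flags orphans whose parent span never arrived.

```javascript
// Hypothetical span records; real exports use different field names.
const spans = [
  { spanId: 'a', parentSpanId: null, name: 'POST /checkout', startMs: 0, endMs: 480, error: false },
  { spanId: 'b', parentSpanId: 'a', name: 'processPayment', startMs: 20, endMs: 450, error: false },
  { spanId: 'c', parentSpanId: 'b', name: 'chargeCard', startMs: 30, endMs: 430, error: true },
  { spanId: 'd', parentSpanId: 'a', name: 'sendReceipt', startMs: 455, endMs: 475, error: false },
];

// Self time = span duration minus the time covered by its direct children.
// Assumes a span's children run sequentially and do not overlap each other.
function rankBySelfTime(records) {
  const childTime = new Map();
  for (const s of records) {
    if (s.parentSpanId !== null) {
      const soFar = childTime.get(s.parentSpanId) || 0;
      childTime.set(s.parentSpanId, soFar + (s.endMs - s.startMs));
    }
  }
  return records
    .map((s) => {
      const durationMs = s.endMs - s.startMs;
      return { name: s.name, durationMs, selfMs: durationMs - (childTime.get(s.spanId) || 0), error: s.error };
    })
    .sort((x, y) => y.selfMs - x.selfMs);
}

// Orphans reference a parent that is not in the trace: a hint that a span
// is missing or that context propagation broke somewhere upstream.
function findOrphans(records) {
  const ids = new Set(records.map((s) => s.spanId));
  return records.filter((s) => s.parentSpanId !== null && !ids.has(s.parentSpanId));
}

const ranked = rankBySelfTime(spans);
const failing = ranked.filter((s) => s.error).map((s) => s.name);
const orphans = findOrphans(spans);

console.log(ranked[0]);  // the span with the most self time
console.log(failing);    // names of spans that recorded an error
console.log(orphans);    // spans whose parent is missing
```

Ranking by self time rather than total duration matters because a parent span is always at least as long as its slowest child. In this sample data the root span lasts 480 ms, but almost all of that time is spent inside `chargeCard`, which is also the span that failed.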
