Practice & Assessment
Test your understanding of Debugging in Production and Distributed Systems
Multiple Choice Questions
A production incident starts. Error rates are spiking and you do not yet know the cause. What is the correct first action?
In a distributed trace, what does a span represent?
You deploy a bug fix behind a feature flag at 5% rollout. Error rates increase for those 5% of users. What should you do?
What is the most common cause of a memory leak in a Node.js event emitter pattern?
The Five Whys technique in a postmortem ends when:
Two heap snapshots show Closure object count grew from 1,200 to 8,500 after processing 1,000 requests. What does this indicate?
Coding Challenges
Instrument a Service with Distributed Tracing
A provided Node.js service calls three functions (fetchUser, fetchOrders, and buildResponse) in sequence for each request. Instrument the service using the OpenTelemetry API: create a root span per request with the user ID as an attribute, create a child span for each of the three functions with relevant attributes, record any exception on the span where it occurs, and end every span in a finally block. Input: the provided service file. Output: an instrumented service file in which each function creates and closes a properly structured span. Time estimate: 25 minutes.
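As a starting point, the span lifecycle the challenge asks for (set attributes, record exceptions, end in a finally block) can be sketched without any dependencies. This is only an illustration under stated assumptions: `FakeSpan`, `withSpan`, `finishedSpans`, and the body of `fetchUser` are hypothetical stand-ins, not the OpenTelemetry SDK. The stand-in only borrows the method names `setAttribute`, `recordException`, and `end` from the OpenTelemetry Span interface, and a real solution would obtain spans from a tracer instead.

```javascript
// Stand-in for illustration only: NOT the OpenTelemetry SDK. It mirrors the
// method names of the OpenTelemetry Span interface so the control flow can
// be run without installing anything.
class FakeSpan {
  constructor(name, parent = null) {
    this.name = name;
    this.parent = parent; // parent span, or null for a root span
    this.attributes = {};
    this.exceptions = [];
    this.ended = false;
  }
  setAttribute(key, value) {
    this.attributes[key] = value;
    return this;
  }
  recordException(err) {
    this.exceptions.push(err.message);
  }
  end() {
    this.ended = true;
  }
}

// Hypothetical in-memory "exporter": spans are appended here when they end.
const finishedSpans = [];

// Hypothetical helper: runs fn inside a span, records any exception on that
// span, rethrows it, and always ends the span in the finally block.
async function withSpan(name, parent, attributes, fn) {
  const span = new FakeSpan(name, parent);
  for (const [key, value] of Object.entries(attributes)) {
    span.setAttribute(key, value);
  }
  try {
    return await fn(span);
  } catch (err) {
    span.recordException(err);
    throw err;
  } finally {
    span.end();
    finishedSpans.push(span);
  }
}

// Placeholder for the provided service function (assumed behavior).
async function fetchUser(userId) {
  if (!userId) throw new Error("missing user id");
  return { id: userId };
}

// Root span per request, with one child span shown; fetchOrders and
// buildResponse would be wrapped the same way.
async function handleRequest(userId) {
  return withSpan("handleRequest", null, { "user.id": userId }, (root) =>
    withSpan("fetchUser", root, { "user.id": userId }, () => fetchUser(userId))
  );
}
```

Calling `handleRequest("u1")` finishes the child span before the root span, and a failing call still ends both spans because `end()` runs in the finally block. Wrapping the lifecycle in one helper is a design choice that keeps the try/catch/finally logic in a single place rather than repeating it in each function.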
Mini Project
Production Incident Postmortem
Using a provided incident scenario (an e-commerce site checkout was broken for 22 minutes due to a null reference error introduced in a deployment), produce a complete postmortem document containing: an incident timeline (minute-by-minute from detection to resolution), a Five Whys analysis reaching a process-level root cause, a list of three contributing factors beyond the direct technical cause, five concrete action items with mock owners and due dates, and a description of what monitoring or testing would have caught this bug before it reached production. The postmortem must be written in a blameless tone throughout.
