Debugging in Production and Distributed SystemsLesson 6.5

How to write a postmortem that prevents the bug from recurring

postmortem, blameless culture, five whys, contributing factors, action items, incident timeline

The Postmortem Is the Final Debug Step

Fixing a bug in code prevents that exact bug from recurring. A postmortem prevents the class of bugs it represents from recurring. The goal is not to assign blame but to understand the system conditions that allowed the bug to reach production and to change those conditions.

The Five Whys

Start with the symptom and ask why five times. Each answer becomes the next question. This chain leads from the immediate technical cause to the underlying process failure - which is where the durable fix lives.

Symptom: Payment service returned 500 errors for 12 minutes on 2024-03-15

Why 1: Payment processing crashed
  - the discount calculation threw on string input
Why 2: String input was passed
  - the new API endpoint did not validate request types
Why 3: No validation
  - the PR review checklist had no input validation requirement
Why 4: No requirement
  - we had no standard for input validation in new endpoints
Why 5: No standard
  - we never formally documented expected practices after previous type bugs

Root cause: No enforced input validation standard
Fix: Add input validation lint rule + PR checklist item + shared validation utility

Action Items

Every postmortem must produce concrete action items with owners and deadlines - not vague intentions. Add more tests is not an action item. Add an ESLint rule for input validation on all new API handlers, assigned to specific developer, due Friday, is. Track completion. Postmortems whose action items are never completed are documentation, not prevention.

Ready to practice?

MCQs · Coding challenges · Mini project