Incident Postmortem

Blameless postmortem template: timeline, root cause, contributing factors, action items.

# Incident Postmortem: [Brief incident name]

**Incident ID:** INC-[YYYY-MM-DD-NN]
**Severity:** [SEV-1 | SEV-2 | SEV-3]
**Date of incident:** [YYYY-MM-DD]
**Detection time:** [HH:MM UTC]
**Resolution time:** [HH:MM UTC]
**Duration:** [HH:MM]
**Customer impact:** [What customers saw and how many]
**Author:** [Name]
**Reviewers:** [Names]

## Summary

One paragraph that a busy executive can read in 30 seconds and walk away knowing: what broke, who was affected, what we changed, and whether it could happen again.

## Timeline

All timestamps in UTC. Include the moment of detection, the moment of escalation, every meaningful action taken, and the moment of confirmed resolution. Be specific: "10:14 - paged on-call", not "around 10am".

| Time (UTC) | Event |
|---|---|
| 10:00 | Deploy of [version] begins |
| 10:07 | Error rate alert fires (>2% of requests) |
| 10:08 | On-call paged |
| 10:14 | Incident channel opened, IC assigned |
| 10:28 | Rollback decision made |
| 10:35 | Rollback complete, error rate recovering |
| 10:42 | Confirmed clean, incident closed |

## Root cause

The technical chain of events. Lead with the proximate cause (what broke) then walk back to the contributing causes (what allowed the proximate cause to ship). Avoid "human error": if a human action caused the break, the question is what system allowed the action to be consequential without protection.

## What went well

- Alerting fired quickly
- Rollback procedure was rehearsed
- Comms to customers were timely
- [other]

## What didn't go well

- Detection took N minutes longer than it should
- Runbook for this class of failure was stale
- Customer comms went out after [external party] noticed
- [other]

## Action items

Each item: title, owner, due date, severity (P0 = blocks similar incidents, P1 = reduces blast radius, P2 = quality-of-life). Track in the engineering tracker, not buried in this doc.

| Action | Owner | Due | Severity | Status |
|---|---|---|---|---|
| Add canary deploy gate | [name] | [date] | P0 | open |
| Update runbook for [service] | [name] | [date] | P1 | open |
| Refactor [code path] | [name] | [date] | P2 | open |

## Lessons for the broader org

What does the team need to internalise as a result of this? What pattern in the codebase or org would prevent similar incidents in a different service?

## References

- Incident channel: [link]
- Related alerts: [links]
- Code at time of incident: [commit SHA]
- Customer status page updates: [link]
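
## Worked example: deriving durations from the timeline

The **Duration** field and the gaps between timeline rows can be computed rather than counted by hand. The sketch below is one possible way to do that in Python, using the example timeline rows from this template. The helper names (`elapsed`, `format_hhmm`, `time_of`) and the choice to measure duration from the first alert to confirmed resolution are assumptions made for illustration, not part of the template.

```python
from datetime import datetime, timedelta

# Rows taken from the example timeline in this template: (time in UTC, event).
TIMELINE = [
    ("10:00", "Deploy of [version] begins"),
    ("10:07", "Error rate alert fires (>2% of requests)"),
    ("10:08", "On-call paged"),
    ("10:14", "Incident channel opened, IC assigned"),
    ("10:28", "Rollback decision made"),
    ("10:35", "Rollback complete, error rate recovering"),
    ("10:42", "Confirmed clean, incident closed"),
]


def elapsed(start: str, end: str) -> timedelta:
    """Return the time between two HH:MM stamps, allowing one midnight rollover."""
    delta = datetime.strptime(end, "%H:%M") - datetime.strptime(start, "%H:%M")
    if delta < timedelta(0):
        # The end stamp is on the following day (assumes incidents under 24 hours).
        delta += timedelta(days=1)
    return delta


def format_hhmm(delta: timedelta) -> str:
    """Format a timedelta as HH:MM, matching the Duration field."""
    total_minutes = int(delta.total_seconds() // 60)
    return f"{total_minutes // 60:02d}:{total_minutes % 60:02d}"


def time_of(keyword: str) -> str:
    """Return the timestamp of the first timeline row whose event mentions keyword."""
    for stamp, event in TIMELINE:
        if keyword.lower() in event.lower():
            return stamp
    raise KeyError(keyword)


# Time from the triggering change to the first alert.
time_to_detect = format_hhmm(elapsed(time_of("deploy"), time_of("alert fires")))
# Time from detection to confirmed resolution (assumed definition of Duration).
duration = format_hhmm(elapsed(time_of("alert fires"), time_of("confirmed clean")))

print(f"Time to detect: {time_to_detect}")
print(f"Duration: {duration}")
```

If your team measures duration from the start of customer impact rather than from detection, swap the first keyword accordingly and state the definition in the Summary.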
