Incidents
What is an incident?
Definitions vary, but it’s important to agree on one. For example:
“An incident is anything that takes you away from planned work with a degree of urgency.”
I definitely don’t like definitions that are too narrow, e.g. “an outage of our service” or similar. There are lots of situations outside such a definition where you’d still want a coordinated response, communication, and a post-hoc analysis and writeup.
Process
My two go-to references for good incident response process are:
Other process guides:
Post-Incident
- Incident Write-ups They Want to Read
“Focus on narrative, not metadata”
“Support your readers”
“Be visual”
“Don’t be afraid of analysis”
Some Observations On the Messy Realities of Incident Reviews – Adaptive Capacity Labs
Incident Review and Postmortem Best Practices - by Gergely Orosz - The Pragmatic Engineer
Assorted Notes
Against Incident Severities and in Favor of Incident Types | Honeycomb
Assembly time is where you have the most control of an incident
3 mistakes I’ve made at the beginning of an incident (and how not to make them) | FireHydrant
Running More Low-Severity Incidents Is Improving Our Culture – The New Stack
There Is No Shame in Customer-Reported Incidents - The New Stack
Keep Calm and Respond: A Beginner’s Heuristic to Incident Response - DZone DevOps
How We Manage Incident Response at Honeycomb - The New Stack
Align platform and product engineering teams over incidents | FireHydrant
Incident Writeups
Some of the below are here because the incident itself was interesting, others because the writeup is particularly insightful.