Incidents

What is an incident?

Definitions vary, but it’s important to agree on one. Examples include:

“Any unplanned disruption or degradation of service that is actively affecting customers’ ability to use [us].”

“An incident is anything that takes you away from planned work with a degree of urgency.”

I don’t like definitions that are too narrow, e.g. “an outage of our service” or similar. Many situations short of an outage still call for a coordinated response, clear communication, and post-hoc analysis and writeup.

Process

My two go-to references for good incident response process are:

Other process guides:

Post-Incident

It’s not a Post Mortem if nobody died
Incident Write-ups They Want to Read
- “Focus on narrative, not metadata”
- “Support your readers”
- “Be visual”
- “Don’t be afraid of analysis”
Some Observations On the Messy Realities of Incident Reviews – Adaptive Capacity Labs
Incident Review and Postmortem Best Practices - by Gergely Orosz - The Pragmatic Engineer
Howie: The Post-Incident Guide

Assorted Notes

Incident Writeups

Some of the writeups below are here because the incident itself was interesting, others because the writeup is particularly insightful.