Incidents

What is an incident?

Definitions vary, but it’s important to agree on one. Examples include:

“Any unplanned disruption or degradation of service that is actively affecting customers’ ability to use [us].”

“An incident is anything that takes you away from planned work with a degree of urgency.”

I definitely don’t like definitions that are too narrow, e.g. “an outage of our service” or similar. There are lots of other situations where you might want a coordinated response, communication, post-hoc analysis, and a writeup.

Process

My two go-to references for good incident response process are:

Other process guides:

Post-Incident

Assorted Notes

Incident Writeups

Some of the below are here because the incident itself was interesting, others because the writeup is particularly insightful.

Misc