Incidents
=========

What is an incident?
--------------------

Definitions vary, but it's important to agree on one.
Examples include:

`"Any unplanned disruption or degradation of service that is actively affecting customers' ability to use [us]." <https://response.pagerduty.com/before/what_is_an_incident/>`_

`"An incident is anything that takes you away from planned work with a degree of urgency." <https://incident.io/guide/foundations/defining-an-incident>`_

I don't like definitions that are too narrow, e.g. "an outage of our service".
Plenty of situations that fall short of an outage still warrant a coordinated response, communication, post-hoc analysis, and a writeup.


Process
-------

My two go-to references for good incident response process are:

* `The PagerDuty Incident Response Documentation <https://response.pagerduty.com/>`_
* `The incident.io guide to incident management <https://incident.io/guide>`_

Other process guides:

* `Incident Response at Heroku <https://blog.heroku.com/archives/2014/5/9/incident-response-at-heroku>`_
* `Atlassian Incident Management Handbook <https://www.atlassian.com/software/jira/ops/handbook>`_
* `How To Establish a High Severity Incident Management Program <https://www.gremlin.com/how-to-establish-a-high-severity-incident-management-program/>`_


Post-Incident
-------------

* `It's not a Post Mortem if nobody died <https://doismellburning.tumblr.com/post/662733655595155456/its-not-a-post-mortem-if-nobody-died>`_
* `Incident Write-ups They Want to Read <https://blog.container-solutions.com/incident-write-ups-they-want-to-read>`_

  * "Focus on narrative, not metadata"
  * "Support your readers"
  * "Be visual"
  * "Don't be afraid of analysis"

* `Some Observations On the Messy Realities of Incident Reviews – Adaptive Capacity Labs <https://www.adaptivecapacitylabs.com/blog/2019/06/17/some-observations-on-the-messy-realities-of-incident-reviews/>`_
* `Incident Review and Postmortem Best Practices - by Gergely Orosz - The Pragmatic Engineer <https://newsletter.pragmaticengineer.com/p/incident-review-best-practices>`_
* `Howie: The Post-Incident Guide <https://www.jeli.io/howie/welcome>`_


Assorted Notes
--------------

* `Against Incident Severities and in Favor of Incident Types | Honeycomb <https://www.honeycomb.io/blog/against-incident-severities-favor-incident-types>`_
* `Assembly time is where you have the most control of an incident <https://firehydrant.com/blog/assembly-time-is-where-you-have-the-most-control-of-an-incident/>`_
* `3 mistakes I’ve made at the beginning of an incident (and how not to make them) | FireHydrant <https://firehydrant.com/blog/3-mistakes-ive-made-at-the-beginning-of-an-incident-and-how-not-to-make-them/>`_
* `Running More Low-Severity Incidents Is Improving Our Culture – The New Stack <https://thenewstack.io/running-more-low-severity-incidents-is-improving-our-culture/>`_
* `There Is No Shame in Customer-Reported Incidents - The New Stack <https://thenewstack.io/there-is-no-shame-in-customer-reported-incidents/>`_
* `Keep Calm and Respond: A Beginner's Heuristic to Incident Response - DZone DevOps <https://dzone.com/articles/keep-calm-and-respond-a-beginners-heuristic-to-inc-1>`_
* `Incident benchmark report | FireHydrant <https://firehydrant.com/reports/incident-benchmarks/>`_
* `Incident travel time - by Robert Ross - The Thought Drop <https://www.bobbytables.io/p/incident-travel-time>`_
* `How We Manage Incident Response at Honeycomb - The New Stack <https://thenewstack.io/how-we-manage-incident-response-at-honeycomb/>`_
* `Align platform and product engineering teams over incidents | FireHydrant <https://firehydrant.com/blog/align-platform-and-product-engineering-teams-over-incidents/>`_
* `A guide to Incident Command | by Jonathan Word | Medium <https://argoday.medium.com/incident-command-guide-9872b51d7c94>`_
* `Your guide to better incident status pages | FireHydrant <https://firehydrant.com/blog/your-guide-to-better-incident-status-pages/>`_


Incident Writeups
-----------------

Some of the below are here because the incident itself was interesting,
some because the writeup is particularly insightful.

* `High Scalability - Troubles with Sharding - What can we learn from the Foursquare Incident? <http://highscalability.com/blog/2010/10/15/troubles-with-sharding-what-can-we-learn-from-the-foursquare.html>`_
* `Twilio incident and Redis <http://antirez.com/news/60>`_
* `Incident report on memory leak caused by Cloudflare parser bug <https://blog.cloudflare.com/incident-report-on-memory-leak-caused-by-cloudflare-parser-bug/>`_


Misc
----

* `Wolf Incident Postmortem - LessWrong <https://www.lesswrong.com/posts/aRxDLju75KXD6PCpB/wolf-incident-postmortem>`_ - `I don't like how "post mortem" gets commonly used for non-fatal incident reviews/reports in the tech industry <https://doismellburning.tumblr.com/post/662733655595155456/its-not-a-post-mortem-if-nobody-died>`_, but it makes sense here(!)
* `Ransomware incidents now make up majority of British government’s crisis management COBRA meetings - The Record by Recorded Future <https://therecord.media/ransomware-incidents-now-make-up-majority-of-british-governments-crisis-management-cobra-meetings/>`_