Blameless Postmortems

Hunter Fernandes

Software Engineer


I am fortunate enough to be able to shape (in some small part) the engineering culture at my company. Part of that comes from realizing that engineering culture, more than other kinds of culture, comes from consensus, which means discussing topics with fellow engineers and advocating for your position. “Culture” here is not dictated but mutually agreed upon by the engineers.1

One of the things I have advocated is postmortem culture. Occasionally, something will happen that makes someone say, “that was bad enough that we don’t want it ever to happen again.” This is a key phrase indicating that you should write a postmortem!

A postmortem memorializes what went wrong and the actions and circumstances that led up to it, describes how the situation was defused, and creates specific action items to prevent it from happening again.

One cannot fully resolve an incident until a postmortem is written. Postmortems are core to knowledge transfer and creating best practices. “Best practices” can seem like an amorphous abstraction, but seeing a specific case where they help or would have prevented an incident is a compelling demonstration of the practice’s value.

However, writing a postmortem can look like a lot of work coming off an incident (even a minor one!). We’ve made a few improvements to the postmortem process to remove some roadblocks. Writing postmortems should not be that hard!

Four Things to Improve Postmortems

We’ve taken four specific steps to improve and encourage postmortem culture:

  1. Postmortem Guidelines. Engineers used to reference older postmortems and make new ones resembling those. So the quality varied and trended down over time. Now, we have a document that provides strong guidance on writing postmortems, from start to end. We explicitly outline the goals of a postmortem, the process, the feedback loop, and the dos and don’ts.

  2. Postmortem Templates. Beginning a postmortem is difficult. Having a base template that you can copy and paste makes creating one a lot easier: you start by just filling out fields! Starting with some boxes is much easier than starting with a blank page, and you add more detail as you explore further.

  3. Wide distribution. We distribute our postmortem documents to the entire engineering organization by default and optionally even wider to the rest of the company. Postmortems are essential documents, and an emphasis on distribution is a clear indicator of value. We don’t write these just for fun!

  4. Blameless postmortems. When writing a postmortem, don’t name anyone — this removes blame and unlocks the ability to dive deeper. More on this in a bit.

Writing postmortems is easy and impactful when you fuse these items. Postmortems no longer look like a massive effort to new engineers, and we encourage writing a postmortem for less-than-outage level events.

Blameless Postmortems

The most significant improvement to the postmortem process we’ve made is officially moving to blameless postmortems. Blameless postmortems involve… not blaming anyone. You don’t name names. At most, you name roles. You realize that a person did not cause an incident, but rather a process did.

Obviously, git blame is fine for code archaeology. But blaming engineers when failures happen is not.

Blameless postmortems turn introspection from “a time to play the musical-chair blame game” into an opportunity for learning and growth. By removing blame, you also remove politics. There are plenty of positive knock-on effects that come from dampening the effects of politics, including the ability to look under otherwise “politically active” rocks during your root cause analysis.

Moreover, I don’t want engineers to feel bad. That is not the goal. Their name should not be tied permanently to an outage in some official document. The goal is for Engineering as an organization to avoid repeating that mistake and to remove the conditions that made it possible in the first place.

I occasionally get pushback from other engineers about not including names in postmortems. Truthfully, I still do not fully understand why. My one theory is that omitting names is at odds with an engineer’s initial instinct to include as much detail as possible; I think the argument comes from wanting a complete record. However, I don’t believe names add meaningful detail to the record. It could have been anyone in that role, and the postmortem should reflect that.

Large companies get their postmortems wrong all the time. Here’s a good postmortem and a bad postmortem:

Great postmortem culture: Amazon’s S3 outage, where the engineer who took down S3 was not named or shamed and not disciplined. Amazon correctly makes it clear that it was the process that failed and focuses on how they will technically prevent this in the future.

Bad postmortem culture: Salesforce DNS outage, where they blame an engineer for the outage. They focus on the engineer instead of the process. Beyond showing a lack of introspection from the engineering organization, a postmortem like this telegraphs a toxic engineering culture. “More training” is such a weaselly and meaningless statement. If their engineers regularly circumvent their process to perform their job, the process is wrong. Not the engineers. Salesforce’s postmortem reads like an executive needed someone to blame.2

All companies should, at a minimum, adopt a blameless postmortem process. It will help your organization be honest with itself and fix the underlying root cause instead of sweeping issues under the rug.

Footnotes

  1. This does not mean that stalemates are allowed to develop. If no clear progress is being made, it’s always an option to “disagree and commit.”

  2. Things like this are what cause engineers to think twice before working for a company like Salesforce. I don’t want to work at a place where I am afraid that a mistake could cost me my job and management would single me out as the reason an outage happened.