Creating a company culture that can weather failure

Don’t call it a post-mortem

Although the term “blameless post-mortem” is common — popularized by companies like Etsy, whose tracker for the process is called Morgue — Nather suggests picking a friendlier phrase. “If you call it a post-mortem that sounds so terribly morbid! The term we use is an after action report. We try to make it a very positive thing, rather than thinking of it as ‘having survived the battle we will now count our wounded and dead’.”

Don’t call it human error

When British Airways had to cancel all flights from Gatwick and Heathrow airports over a bank holiday weekend this May, it blamed the IT failure that stranded some 75,000 travellers on human error. A contractor appears to have turned off the uninterruptible power supply, and the power surge when it was turned back on damaged systems in BA’s data center. BA promised an independent investigation, but its initial explanation raised questions about the design of both the power and backup systems.

By contrast, when an engineer mistyped a command that took down the AWS S3 service — and many other services that depended on it, like Quora and file sharing in Slack — for several hours, Amazon’s explanation avoided the phrase “human error” and concentrated on explaining the flaws in the tools and process that allowed the mistake to be made.

Lambert maintains that “human error doesn’t really exist. Providing that you hire good people who want to do the right thing, they will usually do the right thing. It’s rare that you can say a person discarded all the good information they had and just did what they wanted and that’s why we had this issue.”

The real problem is tools and processes that don’t prevent (or at least warn about) the inevitable mistakes people make, or a lack of automation that means someone is typing commands by hand in the first place.

“It’s a lazy approach to say people did the wrong thing,” said Lambert. “A better approach is to assume that everyone did the right thing with the information they had, so you need to take away the blame and look at what information they had at each stage and what was missing, and what additional tools or processes you need to get better next time.”