Creating a company culture that can weather failure

Do reward reporting

The tale of a developer fired for confessing to deleting the production database on day one at a new job may be apocryphal, but the account on Reddit was certainly plausible and led many to point out that the fault lay not with the new developer, but with the documentation that included the details of the production database in a training exercise.

In contrast, when a chemistry student at the University of Bristol in the UK accidentally made an explosive and reported it, even though the emergency services had to carry out a controlled detonation, the dean of the Faculty of Science Timothy C. Gallagher praised the student for acting responsibly. He pointed out “the value of investing in developing and fostering a culture in which colleagues recognise errors and misjudgements, and they are supported to report near misses.”

In the airline industry, the International Confidential Aviation Safety Systems Group collects confidential, anonymous reports of near misses, cabin fires, maintenance and air traffic control problems to encourage full disclosure of problems. Similarly, when the US Forest Service conducts Learning Reviews after serious fires the results can be used only for preventing accidents, not legal or disciplinary action.

You want your team to feel safe enough to report the problems that haven’t yet led to a failure.

“Whether formalized in a policy or not, the team must be well aware that mistakes are tolerated, but concealment and cover-up are not,” said Burgess. “Personnel must clearly understand they will never be penalized for volunteering any and all information regarding any failure.”

“Part of your responsibility as a CIO is to build these relationships,” explained Nather. “The system admins should be your eyes and ears. You want to have the culture where someone will come into your office and close the door and say, ‘There’s something I think you ought to know.’ If you can get that, you can build a resilient organization.”

Treating IT and security as being a business service rather than a point of control helps create that kind of culture. “If you take the attitude that you’re there to help everyone else with their business, that’s very different from sitting in an ivory tower and saying, ‘Ooh you did something wrong, you missed a spot’,” she said.

Do learn from others' mistakes

Thinking about what you’d do differently the next time a problem occurs is useful, but you can also think about how you’d tackle problems you haven’t run into yet.

“What I see in in very mature organizations is that they also try to learn from other people's incidents,” says Nather. “Ask, ‘If that were to happen to us, what would it look like, how could we detect it and how could we respond to it?’”

Your competitors might or might not share the details of incidents they’ve faced and fixed (formally or informally), but you can also watch organizations with a similar technology setup and risk profile in other industries. Security vendors often blog step-by-step analyses of incidents. Nather also recommends a Twitter account @badthingsdaily that comes up with scenarios regularly:

“Your partner database just went down, a tornado just destroyed your backup data centers. You can take them and talk them through. You can even go through the exercise of building the tool or doing the scripting to be able to automate the detection so that’s one less thing your people have to worry about doing manually.”

These tabletop exercises can be more palatable than the ‘chaos monkey’ approach pioneered by Netflix to simulate failure by deliberately shutting down some systems. “For less mature organizations, actually breaking something is a real concern, which is why even talking through it without actually doing anything can be very useful.”