Creating a company culture that can weather failure

Do follow up

Adding this step to your playbook for when things go wrong may seem obvious, but you have to follow up on an incident in order to learn from it.

“Schedule a formal review of the incident and identify next steps,” said Stephen Burgess, consultant at the Uptime Institute. He suggests holding regular meetings designed to track incidents to a final resolution, to make sure the longer-term changes actually happen.

“From the root cause should come any formalized lessons learned, which in turn must clearly identify whether there are any final corrective actions. Maintain scrutiny and open status of the failure incident until there is managerial confirmation that final corrective actions have been performed.” That might mean training, changing policies, processes and procedures, or making proactive repairs and infrastructure upgrades.
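Burgess's advice amounts to a simple invariant: an incident stays open until every corrective action has been confirmed complete. As a rough sketch, here is what that might look like as a data structure; the `Incident` and `CorrectiveAction` types and the example root cause are hypothetical, not drawn from any specific tracking tool.

```python
from dataclasses import dataclass, field

@dataclass
class CorrectiveAction:
    description: str
    confirmed: bool = False  # set True only after managerial sign-off

@dataclass
class Incident:
    root_cause: str
    actions: list = field(default_factory=list)

    def is_closed(self) -> bool:
        # The incident stays open until every action is confirmed done.
        return all(a.confirmed for a in self.actions)

# Hypothetical example incident
incident = Incident(root_cause="expired TLS certificate")
incident.actions.append(CorrectiveAction("automate certificate renewal"))
incident.actions.append(CorrectiveAction("add expiry monitoring alert"))

incident.actions[0].confirmed = True
print(incident.is_closed())  # prints False: second action is unconfirmed
```

The point is less the code than the rule it encodes: closure is a function of confirmed corrective actions, not of the outage ending.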

Sam Lambert, senior director of infrastructure at GitHub, suggests that IT could learn from other disciplines. “Other industries that build things and build things to last and want to learn from failures in things they build, carry out investigations as standard operating procedure. Look at flight investigations and how useful they’ve been for aviation safety.”

View failures as a chance to get ahead of similar potential problems, Lambert said. “If a failure case comes up and we recognize that failure case could be systemic in some other system, analyzing it gives us an opportunity to look at what may go wrong in the future.”

He points to several areas where GitHub has been able to go beyond fixing the immediate problem to improving their systems generally. “We’ve learned about cause and effect: one service going wrong can affect other services even when they're not the cause of the problem. We’ve learned ways to build in safeguards and do checking in our development process. We’ve learned to respect the time necessary to make systems resilient the first time. We’ve also learned that some things can't be prevented and you’ve just got to accept that and understand that you have to learn from them each time.”
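One common safeguard against the cause-and-effect problem Lambert describes, a failing dependency dragging down the services that call it, is a circuit breaker. The minimal sketch below is an illustration of that general pattern, not GitHub's implementation; the class name, threshold, and `flaky` example are assumptions.

```python
class CircuitBreaker:
    """After repeated failures, stop calling the dependency and use a fallback."""

    def __init__(self, max_failures=3):
        self.max_failures = max_failures
        self.failures = 0

    def call(self, func, fallback):
        if self.failures >= self.max_failures:
            return fallback()  # circuit open: skip the failing dependency
        try:
            result = func()
            self.failures = 0  # a success resets the count
            return result
        except Exception:
            self.failures += 1
            return fallback()

# Hypothetical usage: a dependency that is currently down
breaker = CircuitBreaker(max_failures=2)

def flaky():
    raise RuntimeError("dependency down")

for _ in range(3):
    print(breaker.call(flaky, fallback=lambda: "cached response"))
```

Once the failure threshold is reached, callers get the fallback immediately instead of queuing up on the broken service, so the failure stays contained in one system rather than spreading to the services that depend on it.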

Don’t play the blame game

Whether because of external problems or an increasing willingness to try “more risky fare like fail-fast experimentation, open hackathons, and citizen developer programs,” CIOs are more likely than ever to face major IT failures, said Constellation Research VP and principal analyst Dion Hinchcliffe.

“The first step is to prepare for failures with solid contingency plans, but it’s also key to learn from failure through an honest and open, blame-free process.”

He admits that “this can be hard for IT for practical reasons — given the already maximized work schedules — as well as human ones: A hit to morale can occur when really digging into the root cause of failures and observing dysfunction.”

If the investigation focuses on assigning blame rather than understanding the systemic failures that led to the incident, you won’t make staff feel safe enough to share information, suggest solutions, warn you about possible issues or absorb the lessons of the incident.

To help avoid blame, Nather suggests “not looking backwards and rehashing it and saying, ‘If only this had happened…’ It’s better to say, ‘If we assume this could happen again, how could we respond better this time?’” Not only does that remove the notion of finding fault, but it’s also more realistic. “Everyone would like to look at an incident and say, ‘We’ll never have that happen again,’ but you can’t really say that!”

Rather than assigning blame, Lambert recommends understanding the reasoning behind decisions. “Often, doing dumb stuff is about not having time to do good stuff. People make trade-offs that they’re not necessarily happy with, but sometimes you just have to do that. Sit down with the person who made those trade-offs and ask them why: What were the pressures? What was the information they had that made these trade-offs make sense?”