
Creating a company culture that can weather failure

Do you change processes after handling an incident, or do you just carry on and wait for the next problem? Instead of dealing with individual failures, think about creating a culture in your IT department that can not only handle problems but truly learn from them.

Cloud providers are routinely better at learning from failure than most enterprises — because they have to be. It’s critical that they are transparent about failures to keep the trust of their customers, but it also hits the bottom line if they take too long to solve problems. When AWS, Google, Azure or GitHub has a major outage, you’ll see regular updates, and once the problem has been fixed, a public incident response will cover what changes are being made to make sure the same thing doesn’t happen again.

For example, when an engineer at GitLab accidentally deleted the production database earlier this year (while trying to recover from load issues caused by spammers), the service was down for several hours. Worse still, nearly all of the backup tools GitLab was using turned out not to have been creating backups, and six hours of production data across some 5,000 projects was lost.

The engineers documented what was happening in real time (on Twitter, YouTube and in a shared public document), then followed up with a blog post covering the key details and a full post-mortem. This explained not just the sequence of what went wrong but also the misconfigurations and other complications that resulted in having no up-to-date backups, giving them a clear list of the ongoing changes that needed to be made.

Or consider US retailer Target’s data breach, which did a lot more damage to the company. After discovering at the peak of the 2013 Christmas shopping rush that hackers had installed malware in their credit card terminals, the retailer found that the details of around 40 million debit and credit cards had been stolen, as well as names, addresses and phone numbers for up to 70 million customers. The data breach cost the business over $100 million in settlements with banks, Visa and a federal class action suit, and Target CEO Gregg Steinhafel resigned in 2014.

Fast forward three years, though, and “Target has become a role model for other retailers,” Wendy Nather, former CISO and principal security strategist at security firm Duo, told CIO.

“They made a huge turnaround after their breach; they really built up their security program to the point where they really have a lot of transparency. They host security events. They were one of the organizations that helped found R-CISC, the Retail Cyber Intelligence Sharing Center. They really have led the charge to start exchanging threat intelligence amongst retailers.”

That puts Target in a much better place than if it had only fixed the immediate problems and then stopped. “Other organizations that have been breached have circled the wagons. Their attorneys didn’t let them say anything, they’re not learning from the breach, they’re not changing their spending on security and it’s very clear they will fall to the same kind of breach again later.”

The difference is as much culture as technology, Nather said. “It’s all in how they responded and made something positive come out of what was a terrible situation.”

If you want to turn problems into learning experiences, there are some key do’s and don’ts.


