More than 70% of all data center outages are caused by human error and not by a fault in the infrastructure design. Furthermore, “mistakes” that led to an outage can often be traced to a poor decision by senior management.
“These decisions are seemingly disconnected in time and space from the site of the incident. It could be design compromises, budget cuts, staff reduction and vendor selection to name a few,” said Philip Hu, managing director of North Asia at Uptime Institute.
He added: “More importantly, human error speaks to management decision regarding staffing levels, training, maintenance and overall rigor of the data center operations.”
Hu was one of the keynote speakers during the recent Data Center Summit 2016 organized by Computerworld Hong Kong.
Uptime Institute is an advisory organization and is recognized globally for the creation and administration of the tier standards & certifications for data center design, construction, and operational sustainability.
Hu pointed out that while tier certification verifies a data center’s full compliance in terms of design, installed infrastructure and ongoing operations, there is no existing standard to help data center executives assess operations.
“There is a lack of appropriate procedures to address the largest risk to data center availability. Data center operators do not have a means to conduct risk analysis at a portfolio level and provide their senior management with the information needed to make calculated decisions on whether to accept the risk identified in the report or take the corrective actions required to mitigate risks,” Hu said.
Staff training is the biggest oversight in data centers
Uptime Institute has identified five areas of management and operations deficiency in today’s data centers: staffing; maintenance; training; planning, coordination & management; and operating conditions.
The biggest oversights are being committed in training (over 35%) and in operating conditions (over 33%), with data centers exhibiting ineffective behaviors in these areas.
“Many facilities do not have a formal program with lesson plans. Their on-the-job programs are not documented and there is no list of training required by position,” Hu said.
The enterprises’ neglect in providing sufficient training exacerbates the staffing deficiencies in many of these organizations.
Being understaffed and overworked with no plans to add headcount is the least of the enterprise’s problems where data center staffing is concerned.
“Many DC staff have no experience with looking after data center-specific equipment. They are brought onboard without being vetted against a list of required qualifications – because companies do not keep such a list. Roles and responsibilities are not documented,” Hu said.
Slack record-keeping in the data center
Hu also pointed out that senior management is often lax in demanding proper documentation of data center operations and maintenance activities.
“Many enterprises cannot perform failure analysis of their data center because there are no records of outages or near misses,” he said.
The lack of documentation also hampers preventive maintenance (PM). Many data centers have no list of required PM activities, and where one exists, the activities are not fully scripted.
“There is no quality control in the process,” Hu observed. “Also, the Maintenance Management System (MMS) is missing critical data such as warranty info, maintenance history and performance data, among others. Hence, it is not surprising that the MMS is unable to produce a deferred maintenance report.”