[M5Hosting] Post Incident Update - Facility Power

Michael J McCafferty mike at m5computersecurity.com
Thu Mar 19 21:10:20 PDT 2009


Dear M5Hosting Customer,

    I have included the post-incident report from the data center
facility provider below. Because American Internet Services (AIS) is
one of our most important vendors, we stay engaged with their plans,
upgrades, and operational changes. Naturally, with such a critical
supplier, we scrutinize their people's actions, technology, and
facility, especially when it comes to service-impacting events. It is
our job to provide the best service possible to our customers, and
that requires us to be a tough and determined customer of AIS, on your
behalf.
    While power has been restored and the facility has resumed normal  
operating status, this event is not "over" for us. M5Hosting is  
evaluating our own response to the event, and how our processes,  
systems and technologies can be improved to mitigate the impact of  
another service-affecting event. AIS will certainly be doing the
same... and we will follow their actions closely and remain engaged  
and involved with them.
    With this said, and with due respect that we are talking about an  
unplanned loss of power in an Internet Data Center, I am pleased with  
AIS's response to the incident once it happened. All of M5Hosting's  
technical staff and I were on site for up to 18 hours after the event and
observed their response to it first hand. It is clear that AIS's  
response was well directed and planned.
    Please find their Post Incident Report attached below. A diagram  
of the AIS power infrastructure for the affected facility can be found  
at:
http://www.m5hosting.com/AIS_SDTC_Power_Diagram_l.jpg

    As always, I'd like to hear from you about this email, and the  
events and actions described in it... or anything at all.

Sincerely,

Michael J. McCafferty
Principal Engineer
M5 Hosting
mike[at]m5hosting.com
877-344-4678 x501

[quote]
Dear Valued Customer:

As a follow up to the power event which occurred on the morning of  
March 18th, 2009 at the 9725 Scranton Data Center (SDTC), American  
Internet Services has compiled the following post incident report for  
our customer base.  As always, our Account Relations and Management  
team members are available to discuss specific customer issues or  
concerns, while this report is intended to provide a comprehensive
overview of the event itself.

At approximately 08:15AM PDT, March 18th, the SDTC datacenter suffered  
a complete power failure for approximately 30 seconds while conducting  
routine maintenance to the critical datacenter systems. The work that  
was being performed is part of AIS's Standard Operating Procedure. This
procedure is in alignment with industry guidelines, and our commitment  
to provide customers with the highest availability in data center  
solutions. As we have informed our customers in the past, all critical  
systems are tested bi-monthly by our team of mechanical engineers in  
conjunction with our outside contractors under service agreements.  
Standard maintenance is performed during normal business hours and is  
carefully planned to incorporate the strictest test procedures to  
ensure the success of the work performed. Our SOP incorporates  
escalation processes and back out procedures in the unlikely event of  
an alert or anomaly during the standard maintenance.

Regretfully, during our maintenance yesterday, we encountered a  
mechanical failure. The Powerware 9515 UPS plant failed during the  
transition of building load from street power to generator power.  
Approximately 30 seconds after the UPS plant failed, our CTO,
Richard Sears, who was present for the maintenance, restored power to
the data center by manually moving the building to generator, quickly
isolated the failure to the UPS plant, reset all four UPS modules, and
brought them back online. He then moved the UPS plant from bypass mode
to normal operational mode.

At that time, senior management made the call to initiate the Emergency
Response Plan (ERP) and decided not to move the data center back to
street power until our mechanical engineers and external contractors
had an opportunity to perform diagnostics on all datacenter systems, to
determine what caused the failure of the UPS plant, and to test the
general state of health of all critical systems.

Within approximately 15 minutes of initiating ERP, we had mobilized 18  
Customer Service Engineers, 5 Networking Engineers and Facilities and  
HVAC teams to the datacenter, in an effort to assist our customers
with recovery. We also had UPS, battery and power experts from Eaton  
Powerware, CPD and Emerson there to assist in the investigation of the  
issue. As part of our emergency communication plan, all customers were  
proactively contacted and informed of the situation and were provided  
multiple progress updates throughout the day.

Upon review of the findings, it was determined that one of our
battery strings had failed, leaving the batteries unable to hold
system load once the UPS plant went fully to battery. This caused a
critically low battery voltage condition to the entire UPS plant and  
the plant protected itself by bypassing its system load to the main  
bus. This was during the time the building was being transferred from  
street power to generator power, so the main busses were both dead. In  
order to prevent a dead-head of the generator and utility systems, the  
SEL electrical system has a failsafe that prevents the main breakers  
from closing after the emergency breakers have been commanded to  
close, and there is power on the emergency bus. This condition  
prevented us from closing the main breakers, while we were still able  
to close the emergency breakers.

As with all of our critical datacenter systems, we have external  
contractors under maintenance agreements to provide system  
maintenance.  JT Packard is responsible for system maintenance on our  
entire UPS and battery plant at SDTC datacenter. We rely on our vendor  
to test each battery at specific intervals to determine if and when  
our batteries are approaching the threshold that requires replacement.  
JT Packard has been performing this system maintenance on a regular
basis for several years now, and its most recent report indicated 100%
System Health.

The result of the investigation, in the opinion of both Eaton and CPD,
who each conducted their investigation separately, is that the
battery string in question failed due to bad batteries that were not
identified during the latest battery tests by JT Packard.

Upon validation of the findings, we mobilized 160 replacement  
batteries from Orange County to our datacenter and proceeded to  
schedule a three-hour Emergency Maintenance window to start at 7:15PM
PDT in order to replace the batteries and perform the load transfer  
back to street power. The evening's emergency maintenance window was  
completed successfully at approximately 10:00PM PDT and all critical  
systems were again checked and diagnosed to be operating at 100%.

We sincerely apologize for the inconvenience yesterday's event caused
you. We want to assure you that we spare no expense when it comes to
designing, deploying and maintaining our datacenter systems in order
to meet the industry's highest levels of reliability which our
customers have come to expect. If you would like to receive any more  
detailed information regarding this matter, or would like a detailed  
layout of our power infrastructure, please let us know. We are here to  
be of assistance.

We want to thank all of our customers for their continued support  
while we worked together to mitigate this critical event.

Sincerely,

Alessandra M. Carrasco
Chief Executive Officer
American Internet Services

[/quote]



