Tuesday, June 03, 2008

Data Center Outage, and the Contingency Plan Cannot Be Implemented!

In the late afternoon of 31st May (16:55 CDT), The Planet's H1 Data Center in Houston, Texas went offline due to a power failure.
Not, I hasten to add, any old power failure, as The Planet System and Network Status Forum explained:
Quote:
This evening at 4:55pm CDT in our H1 data center, electrical gear shorted, creating an explosion and fire that knocked down three walls surrounding our electrical equipment room. Thankfully, no one was injured. In addition, no customer servers were damaged or lost.
Unquote:

For any company the above would be a challenge to both Contingency and Business Continuity handling, but 'The Planet' is not just any business and Data Center H1 is not just any data center. 'The Planet' is a major Web Hosting Company, and Data Center H1 was 'inherited' when it merged with Everyones Internet (EV1), formerly RackShack. H1 was the original RackShack Data Center, which means its server farms include many legacy systems.

So this situation was not just an internal problem for 'The Planet' but a very serious Business Continuity problem for the 7,500 Customers with a total of 9,000 Servers depending on this Data Center.

Could things get any worse? Oh yes, they could!
The building was evacuated and the Fire Department took charge; even with the fire out they would not, for safety reasons, allow the Back-up Power Plan to be implemented, and they would not let staff back into the building.

The company did what it could in the first few Hours after the explosion:
  • It started to alert Internal, Contractor and Supplier teams and began moving them to site
  • It started to alert Customers via the Forum, an Automated Voice Service and one Customer Portal (the other Portal was hosted in the affected center and was down).
Because of the confused situation, and because staff were not allowed into the building to assess the damage, the first meaningful information could not be posted until four hours after the incident.

From then on the company posted updates on a one-hour base schedule, or more often whenever there was 'breaking news', even if the news was bad!
For many updates there was little to report, but the temptation to make rash promises was resisted.

This policy of NOT raising Customer expectations too high proved to be the correct course when it was discovered that the damage to the power cabling was even greater than suspected. Through lateral thinking, additional generators and plenty of 'blood, sweat and tears', recovery moved forward step by step.

You can look at the Bulletins to Customers, from current to oldest, here:
The Planet H1 Maintenance Updates
or from oldest to current here:
The Planet System & Network Forum

Whilst 56 hours later recovery was not fully complete, the CEO felt able to issue this audio statement: Doug Erwin, Chairman and CEO of The Planet

Fifty-six hours without service is, for a company dependent on internet sales, a disaster, and a lot of customers are obviously angry, concerned and worried sick about their business; but 56 hours is a lot better than some original estimates of five days.

Whilst The Planet is going to have to deal with the aftermath of this incident in terms of Customer claims and possibly litigation, I believe that, operationally and from a Customer advisory viewpoint, the Company did everything it should and could.

It is rightly proud of the 24-hour-a-day efforts of its staff, many brought in from Dallas, and of the speed and effectiveness of the Contractor and Supplier teams, also working 24 hours a day.

Whilst operationally I admire the company's management of the incident, I am concerned that there may be some questions over Contingency Planning and Risk Management, including Safety Inspection Policies.

Whilst web site and server hosting is lucrative, it is also extremely competitive. The possibility of mass migration of Customers to a fall-back site, or of dispersing them across other data centers, was therefore never part of the equation for handling this incident, and I suspect it would not be an economic proposition (although offering it as a premium option could possibly make sense).

However, I do hope that during the 'lessons learned' studies some consideration is given to how all the 'single point of failure' situations (not just electrical) can be mitigated, not only in H1 but across the Company's other Data Centers.

I also hope that the legal eagles allow The Planet to share the results of the investigations and studies industry-wide, as this is NOT about a marketing war; it is about safety, and about customer confidence that their web site, which for many is the core of their business, is well protected.

Initially, some 'The Planet' customers may feel that moving to another host is the answer, but would that host be able to handle a similar problem any better than 'The Planet', and would the Customer be willing to pay the premium required for mirroring across Data Centers to provide guaranteed fallback? In most cases, probably not.
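
For readers wondering what 'mirroring across Data Centers to provide guaranteed fallback' involves at its very simplest, here is a minimal, purely illustrative sketch of my own (not anything The Planet offers): a health check that decides whether customer traffic should point at the primary site or at a replica in a second facility. All names and URLs below are assumptions for illustration only.

# Minimal, purely illustrative sketch of the simplest form of cross-data-center
# fallback: poll the primary site and, if it stops answering, point traffic at
# a mirror hosted in a second facility. All names and URLs are hypothetical.
import urllib.request

PRIMARY = "http://shop.example.com/health"          # served from data center A
MIRROR = "http://shop-mirror.example.net/health"    # replica in data center B

def is_up(url, timeout=5):
    """Return True if the site answers an HTTP request within the timeout."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            return response.status == 200
    except OSError:
        return False

def choose_target():
    """Decide where customer traffic (for example, a DNS record) should point."""
    if is_up(PRIMARY):
        return PRIMARY
    if is_up(MIRROR):
        return MIRROR   # fail over to the second data center
    return None         # both sites down: the single-site customer's worst case

if __name__ == "__main__":
    print("Serve traffic from:", choose_target())

A real arrangement adds DNS or load-balancer automation and continuous data replication between the two sites, which is exactly where the premium cost comes from.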

Throughout the industry worldwide today, there is almost certainly the feeling of 'thank God it wasn't us', followed by the depressing thought: 'this time'.
There is probably one group of people seeing a 'silver lining' in all this: Insurers, who I suspect are already drafting new, improved Business Continuity Insurance for Internet-Dependent SME Businesses.

The Idle Man.
