[M5Hosting] Explanation of 6/17 and 6/18 Network Events
Michael J McCafferty
mike at m5computersecurity.com
Thu Jun 19 05:14:26 PDT 2008
Dear Valued M5Hosting Customer,
We have had some challenges during the past few days. I am writing to
you to explain the high packet loss events on Tuesday and Wednesday and
what we are doing to mitigate the risk and impact of the same events
happening again.
Technical Explanation:
Both the Tuesday and Wednesday events were caused by a layer 2
broadcast storm that propagated through a large VLAN spanning multiple
switches. The number of devices and interfaces involved in the storm
generated enough traffic to exhaust bandwidth on some uplink
connections. There was no hardware failure, and no fault-tolerant
system failed to work as designed or implemented. No amount of
redundancy or hardware sparing would have reduced the impact of the
storms.
Ultimately, the final mitigation and resolution will involve a
multi-pronged approach: A) Segmenting the network so that, in the event
of a storm, the number of devices and interfaces involved will be
smaller. B) Configuration changes to dampen or shunt such storms where
possible. This may require some network hardware upgrades. C) Preventing
other VLANs from reaching the same state, as a matter of resource
allocation policy.
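To illustrate why segmentation (item A) helps, here is a rough sketch
of the arithmetic. It is a simplified model only, not a description of
our actual network: the function names, the uplink capacity, the
per-interface storm rate and the interface counts are all hypothetical
values chosen for illustration.

```python
# Simplified model: during a storm, each participating interface is
# assumed to contribute a fixed amount of broadcast traffic that must
# cross the same uplink. All figures below are hypothetical.

def storm_load_mbps(interfaces: int, per_interface_mbps: float) -> float:
    """Aggregate broadcast traffic offered to the uplink by one VLAN."""
    return interfaces * per_interface_mbps


def uplink_saturated(interfaces: int,
                     per_interface_mbps: float,
                     uplink_mbps: float) -> bool:
    """True if the storm traffic meets or exceeds the uplink capacity."""
    return storm_load_mbps(interfaces, per_interface_mbps) >= uplink_mbps


UPLINK_MBPS = 1000.0        # assumed uplink capacity
PER_INTERFACE_MBPS = 5.0    # assumed storm traffic per interface

LARGE_VLAN_INTERFACES = 400   # one large VLAN (assumed size)
SEGMENT_INTERFACES = 50       # one segment after splitting (assumed size)

large = uplink_saturated(LARGE_VLAN_INTERFACES, PER_INTERFACE_MBPS, UPLINK_MBPS)
small = uplink_saturated(SEGMENT_INTERFACES, PER_INTERFACE_MBPS, UPLINK_MBPS)

print("Large VLAN saturates uplink:", large)
print("Smaller segment saturates uplink:", small)
```

Under these assumed numbers the large VLAN offers twice the uplink
capacity, while a single smaller segment offers a quarter of it. The
point is only that the storm's load scales with the number of
interfaces in the affected VLAN.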
What We Are Doing:
Network Improvements:
I started M5Hosting just as most businesses get started, as a good
technician struck by entrepreneurial inspiration. Even with 13 solid
years of experience building and operating large data center
environments, NOCs and leading systems integration projects, I am
willing to admit that I am no longer the best network engineer for
M5Hosting.
I have contacted two San Diego-based network engineering firms that
each have their own 24/7 NOC and experienced certified expert network
engineers. I know both of them to have excellent reputations and
exceptionally skilled engineers. Hiring a professional firm will give us
deeper experience, knowledge and coverage than would be possible
otherwise. My first conversations have been very encouraging.
The first order of business for the chosen firm will be to review the
data regarding the recent issues and make any changes that are needed to
mitigate the risk of the same event re-occurring (this is actually
already under way). Then, they will evaluate the design and
implementation of what is in operation now. Finally, they will plan for
and implement network systems for the next order of scale with regard to
capacity, performance, availability, monitoring and resilience to
issues similar to these recent ones.
Communication Improvements:
We don't have major outages very often. The last time we had one of
similar impact, we had about 1/5th as many customers as we have now. At that
time, we improved our communication processes and improved our phone
capacity. It is time to make additional improvements in how we
communicate with you during and after an event.
During Wednesday's event we experienced about 250 times the normal
phone and email volume for the duration of the event. Needless to say,
it was not possible to answer each call with a live human expert as
we normally do. Even just pasting a pre-written yet specific response
into a support ticket at the rate of one per minute took hours. When your
server is unreachable, you want and definitely deserve more immediate
information.
We have an outline of how we plan to do this. Exact technical details
are being investigated now. It will at least involve a "system status"
option via the main telephone number, a system status web page that is
available even during a catastrophic event, and a useful
auto-reply to support tickets. The system status web page may be
formatted as a Blog, to enable frequent and speedy updates.
We are also asking the network engineering firms mentioned above for
proposals to handle a roll-over of calls we are unable to answer
ourselves, so that you are still greeted by a knowledgeable live human.
I will update you with more information on the above improvements as it
becomes available.
I take these events very seriously. I am keenly aware that not
only my own business, but hundreds, if not thousands, of other
businesses with millions of customers or users, rely on M5Hosting. I will
make certain the source of the issues is fully resolved, mitigate the
risk of these issues recurring, and restore your confidence in
M5Hosting. It is my only job.
Please reply directly to me with your feedback on any of the above.
Sincerely,
Mike
--
************************************************************
Michael J. McCafferty
Principal, Security Engineer
M5 Hosting
http://www.m5hosting.com
You can have your own custom Dedicated Server up and running today !
RedHat Enterprise, CentOS, Ubuntu, Debian, OpenBSD, FreeBSD, and more
************************************************************