[M5Hosting] Explanation of recent interruptions
Michael J McCafferty
mike at m5computersecurity.com
Sun Jan 29 15:09:35 PST 2006
Dear Happy M5 Hosting Customer,
Today I am writing to explain some recent interruptions in
our service. Every growing business has its challenges. This month,
we had a few challenges which affected the service you trust us to provide.
When I started M5 Hosting, one of the first things I decided
was that communication, honesty and integrity would be core
values in how we do business. I know many businesses brag about
these very same principles. I myself have been a customer of other
hosting companies in the past. I was almost always disappointed by
them in terms of those three things: communication, honesty and integrity.
In keeping with those core values, I am writing to you today
to honestly communicate a few challenges we have had this month, how
they may have affected us all, and what is being done to address them.
1/10/06 - after 4:30pm PST - Intermittent high packet loss and lost
TCP sessions to all customers
Generally, what any of our customers do with their server is
entirely up to them, except when it affects other customers or is
illegal. In this case, what a relatively new customer was doing was
both illegal and disruptive. It took some time to diagnose what was
going on. It turned out to be resource exhaustion on the firewall.
Specifically, the state table had reached a configured hard limit.
The firewall is capable of a far higher limit than the default
value, so we raised the limit. It wasn't until later that it was
determined that the traffic was due to illegal actions of one
customer. This customer has been removed from the network.
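The mechanism described above can be pictured with a toy model. This is only a sketch: the class name, the flow tuples and the limits are hypothetical, and it is not the firewall's actual implementation. It shows why a full state table refuses new connections while tracked ones keep passing, and why raising the limit restores service.

```python
class StateTable:
    """Toy model of a stateful firewall's connection table (hypothetical)."""

    def __init__(self, max_states):
        self.max_states = max_states   # the configured hard limit
        self.states = set()            # flows currently being tracked

    def admit(self, flow):
        """Return True if the flow is already tracked or can be added."""
        if flow in self.states:
            return True                # existing connection passes
        if len(self.states) >= self.max_states:
            return False               # table full: new connection is dropped
        self.states.add(flow)
        return True


table = StateTable(max_states=3)

# One host opens enough flows to fill the whole table.
for port in range(3):
    table.admit(("abuser", port))

# A legitimate new connection is now refused.
blocked = table.admit(("customer", 80))

# Raising the limit lets the same connection through.
table.max_states = 15
allowed = table.admit(("customer", 80))

print(blocked, allowed)
```

In this model the legitimate connection fails only while the table is full, which matches the intermittent nature of the symptoms described above.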
What have we done to mitigate the risk of this happening
again? We have increased the capacity of the firewall to 5x what
it was. We have optimized the rules so that the current network
load uses about 10% of the system resources it did before this
incident. So, overall we can handle about 50x more traffic before
this will be a problem again. Additionally, we have more clearly
defined our anti-fraud policy. If we had followed our policy, this
new customer would not have been accepted.
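The 50x figure follows from combining the two changes. A minimal sketch of the arithmetic, assuming the two improvements multiply and that resource use scales linearly with traffic (the function and parameter names are made up for illustration):

```python
def headroom_multiplier(capacity_multiplier, load_percent):
    """How many times more traffic fits before the limit is reached again.

    capacity_multiplier: new firewall capacity relative to the old one.
    load_percent: resources the same traffic uses now, as a percentage
                  of what it used before the rules were optimized.
    """
    return capacity_multiplier * 100 / load_percent


# Figures from the paragraph above: 5x capacity, about 10% of the old load.
print(headroom_multiplier(5, 10))
```

With 5x the capacity and each unit of traffic costing about a tenth of what it did, the result is roughly 50 times the previous headroom.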
Evening of 1/28/06 - Network outage for most customers.
They say that human error accounts for 70% of all computer
downtime. I'll bet it's even higher than that. This outage was human
error. While optimizing the firewall to mitigate the risk described
above, a simple typographical error rendered the firewall impassable
to almost all traffic. Unfortunately this also locked us out of the
firewall. Generally the 24hr NOC staff at the Data Center facility
are pretty responsive. In keeping with Murphy's Law, right when we
needed them most, there were some issues with their land phones
which delayed recovery of the network. By the time we got through to
them on the phone, we were halfway to the data center (the data
center is only 10 to 15 min away). Rather than walk them through the
procedure to recover, we had them open the door to the rack and
connect a crash cart to the firewall, in preparation for our arrival.
We arrived about 2 minutes later and remedied the problem quickly.
It was my own fingers that caused this outage. I
apologize for the mistake.
Late evening on 1/28/06 and early afternoon on 1/29/06 - Shared
Hosting server outage
The server named "Witt" suffered a very rare kernel panic
last night and again this afternoon. The system came back up without
issue once it was rebooted. However, the fact that it has happened
twice in 18hrs is not a good thing.
What are we doing to resolve it? We have reviewed the log
files on the system but have found no useful information relating to
these two incidents. We have upgraded the kernel to the latest
available from RedHat, to ensure that all fixes and patches which may
relate to this issue are applied. This evening we will take the
system down and physically inspect the hardware to ensure that the
fans are all working properly, loose wires are covered, cards are
seated properly, etc. At that time we will replace the RAM entirely
(since memory problems are hard to find). We will take further action
if required.
I hope you find value in this email. I hope that you
appreciate the communication and the honesty. It would be so much
easier to not have to write this email, and pretend none of it ever
happened. But, that would make us like your last provider. We want to
remain your provider. We want you to brag about us, not complain about us.
Your feedback on this email, the effects of the incidents
described above, or the steps we are taking to mitigate them is
always welcome. I'd really like to hear from you.
Thank you for your trust and your business!
Mike
************************************************************
Michael J. McCafferty
Principal, Security Engineer
M5 Hosting
http://www.m5hosting.com
You can have your own custom Dedicated Server up and running today!
RedHat Enterprise, CentOS, Fedora, Debian, OpenBSD, FreeBSD, and more
************************************************************