Family Medicine Notes

Web Host Trouble

The Web host that provides services for Docnotes.net and medlogs.com has had some difficulties recently.

About 10 days ago (as you may have noticed) the server was down for about two days.  This was because one of their servers (srv2) needed to be replaced. They moved all of the clients off that machine and onto a new machine (srv3).

Today's difficulties are described in their response to the trouble ticket pasted below.

Unfortunately, they made no mention of this on their web site - and the whole thing bothers me enough that I think it might be time to move to a new Web host that offers some redundancy, or to move the sites to one of the servers that we host ourselves.  Originally, we were on a very inexpensive plan with this host, and "you get what you pay for" was understood and accepted.  At this point, we are paying $30 per month in light of the bandwidth required to support medlogs.com.

I think for that much, a little redundancy would be in order. 

----------------------------

Here are the details for anyone bored enough to be interested:

 

 ---11/26/2005 4:29:59 PM---
(Apologies for the cut/paste response, but we are attempting to keep everyone up to date on this issue as quickly as possible) Earlier this morning srv3 stopped responding to requests. After attempts to gain access to the server remotely were ineffective, a reboot of the server was attempted via our APC remote switch, but the server still would not come back online. A tech at the datacenter was dispatched to investigate the issue and we are currently awaiting word back from them as to the cause of the issue and a solution. As soon as we have more information we will continue to update all open tickets on the issue.

---11/26/2005 6:55:10 PM : CLOSED ---
Apparently the cause of this morning's srv3 failure turned out to be a faulty RAM chip, which caused a kernel panic and the machine to freeze up.  When we attempted a remote reboot and the system did not come back online immediately, the tech who was sent to investigate discovered that the machine had encountered another kernel panic during the boot-up process, and began the process of diagnosing the root of the problem.

After both sticks of RAM in the server were replaced with new ones, the machine booted successfully, but wanted to perform a full fsck of the drives to ensure no data corruption had occurred.  The technician allowed this process to run its course (which took a couple of hours) and made note of any files that appeared to have been corrupted by the unclean shutdown of the server.  The bulk of the corrupt files were not of consequence (temporary files in /var/tmp mostly), but a few were MySQL datafiles belonging to client websites.

The tech then proceeded to bring the machine up fully after the fsck was complete, but did not bring up the multi-user processes (apache, MySQL, etc.) until after he could run mysql rebuilds on the affected database files, in order to ensure that no client data was lost from the databases.

At this time the mysql repairs are complete, and all services (web, mysql, email, etc) on srv3 are back up and running.

As a safety precaution the tech is still running some diagnostics and digging through some logs on the machine, but nothing that remains to be done should require taking the server offline.  This process may cause the machine to be slightly slower than usual for the next hour or three, but should not have any other side effects.

---

As a few clients who experienced last week's srv2 failure have already asked, let me take a moment and assure everyone that we do not believe srv3 is "on its last legs" or anything of that nature.  Srv2 was a much older piece of equipment than srv3, and prior to today srv3 had an uptime of roughly 180 days, its last reboot having been during a kernel upgrade six months ago. So no, I don't believe that srv3 is "dying" on us; I simply think it's a combination of a dying RAM chip (which we have seen before) and incredibly bad timing considering the events of last week. :(


Comments

That's bad. I'm always surprised to see a hosting company giving me the exact details of my server outage. Not that I don't care, but... no, in fact, I don't care. The only thing that I want is to see my servers up and running because I don't want to lose my members. If you pay $30 per month for your bandwidth, they should have written to you, IMHO.


This weblog is licensed under a Creative Commons License.
Powered by
Movable Type 3.2