
The outage of 2013-01-16

Posted: Thu Jan 17, 2013 6:07 pm
by crfriend
The "regulars" here must certainly have noticed that SkirtCafe was down for almost the entire day of 2013-01-16. This was due to a catastrophic hardware failure of the database server and a subsequent comedy of errors and dead parts with the replacement iron.

Here's what the hosting company posted about it, verbatim. All times quoted are Pacific Standard Time, which is UTC - 8 hours.
Hello,

Update - 10:00 PM -- Data migration is still completing. We hope to be able to call this finished very soon!

Update - 6:54 PM -- Data migration to the new hardware is nearly complete. Most customers should be seeing their sites loading, but a few may still not be seeing their content yet. I am very sorry for the delay, and thank you for your continued patience!

Update - 12:45pm PST -- The replacement hardware had some bad disks which needed to be replaced. Now that the drives are all in good health, we are rebuilding the RAID array to finalize the preparation of the server. Once that is complete, we will be able to migrate data back to the new server and bring everything back online. Thanks for your patience!

Update - 10:00am PST -- The replacement server is almost done being set up and once it is ready we will be migrating everything to the new hardware. This process is taking longer than expected, but should be underway shortly. Please check back here for more updates!

Update! We are still in the process of replacing this hardware and getting everything back online. We are actively moving data and taking a different approach so that we can get this server back online as quickly as possible. Please check back here for more updates throughout the morning!

As a reminder...

You are receiving this message because your web hosting services are being affected by hardware maintenance due to issues with your shared MySQL server perch.

Due to issues with the RAID controller, we are replacing the hardware for this server and will be restoring all data from our backup server to the new hardware. Unfortunately, this will cause your databases to be offline until they are restored. We sincerely apologize for the inconvenience and will work to bring them back online as quickly as possible.

Once the 'all clear' is given by us, if you're still having trouble loading your sites or access your server, please let us know the specifics, so we can take a look for you. You may contact us either by replying to this message from your inbox, or by submitting a support ticket in the DreamHost panel (https://panel.dreamhost.com/), under Support/Contact Support.

This notice will also appear on that page in your DreamHost Panel and will be updated throughout the process with the latest details we can provide.

This incident is only related to the shared MySQL server perch, and no other servers or services (such as web or email) are affected.

Thank you for your patience!
The Happy DreamHost Server Fixing Team
The outage seems to have begun at about 22:56 on 2013-01-15 UTC and ended at around 02:00 on 2013-01-17 UTC.
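
For anyone who wants to check the arithmetic, here is a minimal Python sketch that computes the outage length from the two approximate UTC timestamps quoted above and converts them to PST using the fixed UTC - 8 offset mentioned earlier. The variable names are made up for illustration, and the result is only as accurate as those estimated times.

```python
from datetime import datetime, timedelta, timezone

# Approximate start and end of the outage as quoted in the post (UTC).
outage_start = datetime(2013, 1, 15, 22, 56, tzinfo=timezone.utc)
outage_end = datetime(2013, 1, 17, 2, 0, tzinfo=timezone.utc)

# PST modelled as a fixed offset of UTC - 8 hours (no daylight-saving handling).
PST = timezone(timedelta(hours=-8), "PST")

# Split the elapsed time into whole hours and leftover minutes.
duration = outage_end - outage_start
hours, remainder = divmod(int(duration.total_seconds()), 3600)
minutes = remainder // 60

start_pst = outage_start.astimezone(PST)
end_pst = outage_end.astimezone(PST)

print(f"Outage length: {hours} h {minutes} min")
print("Start (PST):", start_pst.strftime("%Y-%m-%d %H:%M"))
print("End   (PST):", end_pst.strftime("%Y-%m-%d %H:%M"))
```

On those figures the outage comes to roughly 27 hours, running from mid-afternoon on the 15th to early evening on the 16th, Pacific time.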

Being a computing professional, and having worked on hardware in a field-engineering gig, I feel for the guys in the trenches and machine-rooms; it's no fun when you're dealing with a dead system and likely have everybody from your immediate boss to most of the way up the food chain harassing you for current status (and, no, they do not like the very accurate answer of, "It'll be done when it's done."). So this one, at around 24 hours, must've been excruciating.

Re: The outage of 2013-01-16

Posted: Fri Jan 18, 2013 3:51 am
by skirted_in_SF
I was on about this time (8:00PM PST) yesterday (1/16/13) and I thought something was strange since the board seemed to have forgotten what I had read the day before. Now I understand. :)

Re: The outage of 2013-01-16

Posted: Fri Jan 18, 2013 1:09 pm
by skirtingtoday
And there was me thinking it was your problems wrestling with Windows 8... ;)

Good to have the site back up and running again!

Re: The outage of 2013-01-16

Posted: Fri Jan 18, 2013 1:37 pm
by crfriend
skirtingtoday wrote:And there was me thinking it was your problems wrestling with Windows 8... ;)
That angst has been passed to my wife. I got the base infrastructure working and the files from her old laptop (carefully scanned for viruses, &c.) on her new one and she seems happy.

Once everything has calmed down, I'll see if I can get the old laptop capable of running Linux in a more or less stable way and turn it into a general-purpose compute-server, as it has several times more horsepower than all the classic gear I have combined.

Re: The outage of 2013-01-16

Posted: Fri Jan 18, 2013 9:58 pm
by crfriend
Just to let everybody know that it didn't go unnoticed, there was another short outage earlier today (2013-01-18) where the database engine was unreachable. If memory serves (and it may be serving liver-and-onions), this one lasted somewhere between a half hour and 45 minutes. There is another migration in progress to another set of hardware that will, hopefully, prove up to the task at hand.

Re: The outage of 2013-01-16

Posted: Fri Jan 18, 2013 10:10 pm
by Brad
Now is a good time to show my appreciation to Carl and the others who make this board possible. When I type in Skirt Cafe, I magically expect it to appear on the screen and I was lost without it. But the infrastructure, both human and electronic, is so invisible to us that it appears not to exist.

Re: The outage of 2013-01-16

Posted: Fri Jan 18, 2013 10:17 pm
by Sarongman
Brad wrote:Now is a good time to show my appreciation to Carl and the others who make this board possible
I second that motion--- all those in favour say aye---motion carried :thumleft: :thumright: :thumleft: :thumright:

Re: The outage of 2013-01-16

Posted: Fri Jan 18, 2013 10:21 pm
by crfriend
Brad wrote:[... T]he infrastructure, both human and electronic, is so invisible to us that it appears not to exist.
Those are words of the very highest praise, Brad, and I, on behalf of my team, thank you for them.

The sad thing about being very proficient in computing -- and especially infrastructure-computing -- is that success by its very nature means being invisible, for when something goes wrong everybody notices. It's like when the lights go out at home and it's actually the utility that's at fault, not a blown bulb or a popped fuse. This is why I refuse to slag off on the guys (and likely gals, too) "in the trenches" who deal with the hardware.

Organisationally, the hosting company takes care of the hardware and the OS side of things, I deal with software upgrades and technical tweaks to keep junk to a minimum, and the moderation team (and myself) deal with the "human interaction" layer. At home and at work, I deal with all of those, so I really do feel the pain of others.

As the saying goes, "This too shall pass"; let's just hope it passes quickly, not unlike a bad case of gas.

The main players in this are Bob (who still, very generously, foots the bill), Milfmog, Uncle Al, and myself. However, let's not forget the real focus of this -- the community! Without you -- all of you -- this place would wither and die, and I, for one, think that would be sad.

Re: The outage of 2013-01-16

Posted: Sat Jan 19, 2013 12:36 pm
by crfriend
Here's the latest news, as of about 06:00 UTC:
The migration to the new server has started and should be complete by tomorrow evening. All databases should be accessible at this time and we will update via email when the transition to the new machine completes.

We will update this post again tomorrow morning unless there is a change in status which necessitates additional notification.
We're online as of this writing (no kidding) so hopefully things will go better this time than last.

Re: The outage of 2013-01-16

Posted: Sat Jan 19, 2013 2:42 pm
by ChrisM
Hear, hear - three cheers of appreciation to Carl!

...On invisibility: Yes Carl, I do front of house sound (mixer board guy) for live performances, and sound is exactly like that: If it sounds great, the band gets compliments. If it sounds bad, the sound guy gets complaints. Our job is to be invisible, like a window, and a window only gets noticed when it's dirty.

Ah well, c'est la vie.

Chris

Re: The outage of 2013-01-16

Posted: Sat Jan 19, 2013 5:37 pm
by skirtyscot
crfriend wrote:The main players in this are Bob (who still, very generously, foots the bill) ...
How much does it cost to run the site? Purely in cash outlay terms, I mean, and valuing your labour at $0 per hour (sorry!)

If it is a hefty sum, have you considered making it possible for members to donate towards the running costs?

Re: The outage of 2013-01-16

Posted: Sat Jan 19, 2013 5:47 pm
by crfriend
skirtyscot wrote:How much does it cost to run the site?
The last time I spoke with Bob, I offered to kick a few quid at the issue and he declined the offer. I suspect it's something below his "noise level" so he doesn't worry about it.

On the matter of my time being worth $0 per hour, I'll note that I donate my time and draw my compensation in seeing the forum continue to run smoothly and membership gradually grow and folks get confident enough to challenge societal norms and swap trousers for skirts! As the Moderator of our local Town Meeting comments about his $1 salary per year, "You got the best Moderator you can buy for a dollar." (I think he does it for the same reasons that I do things here.)