Arrrggggg…..

Ok... so here's the deal. Friday was a great start to a day. Light work load, Christmas party, got an award at the party as well as a bonus, which was greatly appreciated. Then all hell broke loose.

I returned to work after the party. Everyone else went home. I figured I could change out some switches or do some of the many things I needed to get done with the network down. Shortly after entering the server room I noticed a beeping sound. I thought it was a UPS that has had some known issues. After I tracked it down a bit more, I found the sound, a really sick-sounding beep, was coming from inside the Exchange server (email). I then checked my email and noticed it was very slow. I mean, slower than the normal slow. I figured I'd just reboot it and then all would be well. Not so. I rebooted this bastard and all I got was "No boot disk found, Press F1 to continue". Ouch, that's not any fun. After further inspection it appeared the RAID 5 array had failed. It was showing 1 drive missing. No biggie there; with 1 drive out of a RAID 5 it should still work fine. But it didn't. The flakiness continued from there. I immediately got on the phone and ordered a server since we did not have a backup. I ordered it outta Cali so I would have a chance at getting it shipped for a Saturday delivery. Evidently I didn't have much luck there, since on Saturday all the pieces and parts arrived except the server itself. :(

Well, after working late Friday, Saturday was a nice change of scenery, driving Brandt up to meet Rachel. But as soon as I got back in town, I headed back to work. I pretty much abandoned the old Exchange server and began working on getting Exchange up on another server. We had installed Exchange on a domain controller just for testing purposes. It wasn't updated, but it did have users on it and a little old email from a migration test we had done. It's VERY underpowered and slow. I turned off all non-essential services and got it running. By midnight or one we had email flowing again.

Sunday started as a full day, but not in any kind of crisis mode. I was basically setting up users on the new mail server to take email and making sure they could log in and get to their email. It takes a lot of time but isn't too technical in nature. This lasted most of the day. Right about 7pm I was getting ready to head home, the plan being to get after Dell about their server in the morning and then recover all the old email. I figured I'd double and triple check all of our servers and systems before heading home, so I could be sure everything would work Monday morning. That's when I noticed a second server was dead. How was it possible to lose two servers at once? We are extremely limited on resources, so no backup servers are available here. To make things worse, this was a production server. Losing email is an inconvenience, but losing part of production costs us money. So this now became top priority.

Monday morning I started attempting to recover this one production server. I was able to restore some static data to enable our main production software to function mostly normally. That was my one saving grace. The problem with this server is that nobody at this company knew anything about the software or the technology it was built on. It's a fairly obscure technology called FoxWeb, which is closely related to FoxPro but for the web. Very little to NO support is available for this product. It's pretty much a time bomb waiting to go off. Well, in this case it did.
I restored the data files to a second IIS server that we obtained through the purchase of another company. After the data was back on, we started getting the SSL certificate replaced through Verisign. They are such a pain in the ass; basically it's 24 hours to get a replacement cert. During that time I got FoxWeb up and running. Tuesday I got the SSL cert, but for some reason the server wouldn't take it. Well, it would take it, but it wouldn't serve up SSL on the site. After much cussing and kicking, I found that when the guys from the company we bought removed their SSL cert, they never released port 443, which SSL runs on. Since that port wasn't released, our site couldn't grab it. As soon as I forcefully removed that binding and put it on our site, that part all worked. A couple of hours of trial and error later, the production site came up and actually worked. Maybe even a little better, and for sure a whole lot cleaner, than before.

At this point I switched my priority back to recovery of the mail server. I had Mike, one of my co-workers here, coordinate with Dell to get someone out to replace hardware in our Exchange server. The guy came out and replaced just about every piece of hardware in the server except the hard drives. So now we had the mail server back… but the RAID 5 was lost, so in theory all the data was lost as well. Well, I remembered reading someplace about an emergency recovery utility for Dell RAID arrays. I figured, at this point, what do we have to lose? I ran the utility, and of course it warns you about ten times that most likely it is going to erase your entire drive and anything on it will be lost. But like I said, I had nothing on it to lose. I ran it, and in about 2 seconds it popped the two missing drives back in and said thank you very much and have a nice day. I rebooted and didn't even notice when the damn thing booted right into Windows and sat waiting at the logon screen. I thought I was in the clear at this point. I logged in, and although the OS was pretty chewed up, it looked like everything was there. All the Exchange data looked intact. Yeah, it's time to party. I left early, 7:30, and took Teri and her cousin out to dinner for her birthday.

Well, later that night I found that Exchange wouldn't start up cuz there were many, many, many file corruption errors and such. I ran a chkdsk on it and went to bed. This morning I found the chkdsk had fixed some of the errors, but the data was still hosed beyond recovery. The server itself seemed mostly solid, or at least solid enough to recover from, and the domain was 100% intact, so that was a big deal. I restored the OS and the Exchange data from Thursday night's backup. We lost Friday's email in the process, but I guess it's for the good of the rest of the data. Even after the restore, the Exchange information store would not start. It kept bitching about errors in the database. I started to clean the database, but the errors kept coming. I was getting to a point where I was running outta options, again! I ran a couple of patches and cleaners against it, and magically, when I attempted to start the information store, it somehow started. This is where I'm at now. I'm logging in as each user and pulling their email out. I don't really trust a migration, especially on a database that appears to have errors. So far I've gotten about half the email out with no problems. All the important emails have been retrieved, so at this point I have to go ahead and declare victory.
So this process took almost 60 hours over the last 5 days (and that doesn't include the time to take Brandt to his mom's), many thousands of dollars for a new server, and a few gray hairs, I'm sure. BTW, I also lost like 6 pounds in the process. It's amazing what stress will do for you. People here at work have really been great. I don't know what they have been saying behind my back, but to my face they have all been very calm, collected, and supportive. Now it's time to go home… maybe do some laundry or something. So how was your weekend?