The first call came at 8:11. The server that parcels out all the data our weather computers use had locked up. Rebooting it brought the infamous Blue Screen of Death. The system just wouldn’t start properly.
By 8:30 I knew the problem was serious. I hopped in the car and headed toward New Haven.
Earlier today a co-worker asked, “Shouldn’t that have been an engineering phone call?”
Yes and no. An engineer was working on the problem when I got in. He’s a smart guy, but this is an esoteric setup with the aforementioned weird hardware. He was incredibly helpful as we worked the problem together with a tech back in Madison, Wisconsin.
Beyond that, he would soon have run into a roadblock. The WebEx remote access software the vendor uses on our equipment spontaneously freezes on our network! No one knows why. It just does.
I drove into New Haven carrying a spare hard drive. Our first try was to replace the über-fast SAS drive with my plain-vanilla IDE drive. It didn’t work. No one thought it would, but we were grasping at straws.
The system came up so sluggishly it was as if we were writing on the screen with a crayon! Back to the drawing board, but we’d wasted over an hour.
We took a chance and plugged the dead drive back in. Then we attempted to restore a ‘ghosted’ version of the system. It sounds simple, but it was another 40 minutes before the computer came back. Even then what we had was unusable!
This server is divided into two drives. One contains the system files, things like Windows. The other drive has the data, organized into a complex structure of directories and subdirectories. The system half was back, but the data structure, the data and any customizations we’d created over the last five or six years were gone!
At least we now had enough of the computer running to allow Billy in Madison remote access. I configured the network access (this system is decidedly not plug-and-play) and got online. As predicted, the WebEx access crashed pretty quickly. I fired up TeamViewer (my current go-to remote access software). We were golden.
He was in Wisconsin. I was in Connecticut. We were working in parallel, fixing separate but equally pressing problems. The phone was on speaker, but we were mainly silent until our paths crossed or one of us (usually me) needed additional guidance. After we finished I noticed this single call had run nearly four hours!
Helaine called my cell around 3:00 AM. She was having trouble sleeping. I told her I wouldn’t be home much later than 4:00 AM… and I wasn’t.
The system isn’t 100% restored yet, but it’s mostly there. I assume the last pieces can be put in the puzzle today. God, I hope so.
This will sound very strange, but the whole process was satisfying even though it took almost a full work day out of the middle of my weekend. We solved a huge problem that didn’t seem solvable. It was painful and tedious, but it needed to be done and it was.
You know that scene in Apollo 13 where Gene Kranz talks to the crew in Mission Control?
“We’ve never lost an American in space, we’re sure as hell not gonna lose one on my watch! Failure is not an option.”
I’d like to think I’ve got that same work ethic.