I’ve been working with computers most of my life. My first/only computer course was in 1968. For the past 25+ years they have been an integral part of my work life.
Nowadays I wrangle around a dozen machines (see photo) at work which let me produce a forecast and feed it to a bunch of different (buzzword coming) platforms.
Mostly, I get it. I understand how computers work. That gives me a leg up. Often it’s necessary to think along with the programmer to effect a fix.
There are two things which always surprise me.
1) There’s always something that’s not working!
It might be hardware or software or even a bad piece of data which should be a temperature or cloud but ends up being interpreted as a command. The computer stops what it’s doing. There’s never a time when I can depend on everything!
Google is well known for designing its software on the assumption that hardware will always fail. Those Google guys are right.
2) Computers often continue to work when something’s wrong–though it turns out they’re really waiting to fail at a time less convenient to me!
That’s part of what’s happened today and it’s causing me to tear my hair out.
A hardware failure late last week took out a two-drive RAID array (two disks which act as one to provide constant backup or, in this case, additional speed). This particular piece of equipment was down for a day while we waited for FedEx to deliver the replacement. No problem. Like Google, we understand working around bad hardware.
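For anyone curious about the "backup or speed" distinction, here is a toy sketch. It is only an illustration: the post does not name the RAID level, so this assumes "additional speed" means striping (commonly called RAID 0) and "constant backup" means mirroring (RAID 1). The function names are made up for the example, and each "disk" is just a Python list.

```python
# Toy model only: each "disk" is a Python list of blocks.
# All names here are hypothetical, chosen for illustration.

def stripe(blocks, n_disks=2):
    """Striping: deal blocks out round-robin across disks (speed, no redundancy)."""
    disks = [[] for _ in range(n_disks)]
    for i, block in enumerate(blocks):
        disks[i % n_disks].append(block)
    return disks

def mirror(blocks, n_disks=2):
    """Mirroring: every disk holds a full copy (redundancy, no extra speed)."""
    return [list(blocks) for _ in range(n_disks)]

def recoverable(disks, failed, original):
    """True if any single surviving disk still holds the complete original data."""
    survivors = [d for i, d in enumerate(disks) if i != failed]
    return any(d == list(original) for d in survivors)

data = ["a", "b", "c", "d"]
print(recoverable(stripe(data), failed=0, original=data))  # False
print(recoverable(mirror(data), failed=0, original=data))  # True
```

Under that assumption, a striped pair that loses a disk has nothing left to rebuild from, which would fit with having to reload all the data from scratch.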
Once we replaced the drives we had to repopulate them with data. In this case it was an accurate rendition of the Earth’s surface–really. That meant nearly 200 GB of data had to move across our network. It took hours.
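A rough back-of-the-envelope check on why that takes hours. The link speeds below are assumptions for illustration, not the actual speed of our network, and real transfers lose some throughput to overhead (the hypothetical `efficiency` parameter).

```python
def transfer_hours(size_gb, link_mbps, efficiency=1.0):
    """Estimate hours to move size_gb gigabytes over a link_mbps megabit/s link.

    efficiency is an assumed fraction of the link actually usable (1.0 = ideal).
    Uses decimal units: 1 GB = 10^9 bytes, 1 Mbps = 10^6 bits per second.
    """
    bits = size_gb * 8 * 1000**3
    seconds = bits / (link_mbps * 1000**2 * efficiency)
    return seconds / 3600

# Assumed link speeds, for comparison only.
for mbps in (100, 1000):
    print(f"{mbps} Mbps: {transfer_hours(200, mbps):.1f} hours")
# 100 Mbps: 4.4 hours
# 1000 Mbps: 0.4 hours
```

Even in the ideal case, 200 GB over a 100 Mbps link works out to more than four hours.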
By late last Thursday evening we were up and running perfectly. We’d made some accommodations for the new hardware. No sweat.
Saturday was rainy and heavily tested this new configuration which worked nearly perfectly.
It failed this morning!
What was different between Saturday and today? As far as I can tell nothing!
The point is the computer was working just fine, though it obviously wasn’t. There was something still wrong that needed just the right moment… the right set of circumstances… to fail.
For whatever reason I was always under the (false) assumption that you needed perfection within these complex systems for things to work. Obviously not. And, of course, it makes you wonder what’s next… or if you really can ever fix all the problems.
I’ve still got over two hours of data transfer to go this second time. Time to think about what might be next.
6 thoughts on “When PCs Fail”
Welcome to the computer system administration world.
The more complex the system, the harder it is to:
a) Comprehend at all, and
b) Be really sure it’s working properly.
You are only at a dozen systems–I’m dealing with a stack of servers and a ton of clients, and most days I have a heck of a time figuring out what is going right or wrong…
You have enough computers on that wall to heat a 4-bedroom home for the winter. Are they all liquid cooled? How old are the processors? Motherboard? SCSI controller board? It could be as simple as changing out a power supply, or bad memory, or adding a fan. I do know that in my experience (I never went to a computer school), in the beginning there was an awful lot of hit & miss. But I used to love it when a friend would state the obvious to me and be right, so in that case: I DUNNO.
Problems like that are just so frustrating and, as gaboik said, sometimes it just takes either the stupidest thing or someone well removed from the situation to propose a solution which works.
I hope you do get it solved Geoff.
When I first saw that picture I was frankly surprised that’s what WTNH weather runs off of. Wow!
I see you have mostly Dell Precision Workstations and a few white boxes. What drives the decision to go with workstations instead of 1U PowerEdge servers? I don’t think you’d lose anything processing-wise and you would gain a ton of space and generate less heat.
Also, have you considered a SAN (Storage Area Network) instead of all the distributed storage? It would better protect your data and make your storage much more flexible.
What brought down the array? Bad RAID controller?
The reason these are off-the-shelf boxes is that the systems are really made of individual subsystems marketed so stations buy just what they need, or add on over time.
There are three servers in this operation. One gets data via satellite (the vendor uses a satellite to broadcast the same continuous datastream to many clients), the other two over the Internet. They in turn provide the data and imagery which populate the maps and graphics shown on TV. The amount of data ingested and discarded around-the-clock is massive.
Some of the boxes talk to each other, others chat grudgingly or just provide video that’s passed through still other machines in analog form.
The racks sit behind the news set in the studio which is normally kept reasonably cool.
I see the WeatherBug Dell P360 in there… I think you guys are due for an upgrade soon.