I like my job, we have a great team of folks to work with, and my boss is pretty great.
I think he micromanages certain things a bit much, but for him, that means he does those things himself, which isn't all bad.
Now on to the customers. That is what drives me nuts. It's a delicate balance, because if some of the customers weren't as messed up as a soup sandwich, then I'd probably be out of a job.
However, it is frustrating to give recommendations to customers only to have them totally ignored. We've gone into customers with horrible environmentals. (Hey, just because the vendor says the equipment will operate from 40-90 degrees F and from 20-80 percent relative humidity, non-condensing, doesn't mean you can run your data center as a near sauna and expect uptime near 100%. Around 70 degrees and 50% humidity is optimum for maximum uptime.) We recommend NO-COST or LOW-COST changes to the data center to improve uptime, and they blow it off. I see spikes in failures every spring and fall, when we go from typically low temperatures and low humidity outdoors (and ultimately in the data center as well) to the opposite, and then back again.
Do they change things? Nope.
Same with patches, etc. I beg, plead, cajole, and warn customers: if I cannot get configuration data from their systems, I cannot run that information through our system that checks for potential "gotchas" that will bring their computers down.
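And it's not like gathering that data is hard. A minimal sketch, assuming the Sun Explorer package (SUNWexplo) is installed in its usual spot; the hostid and hostname in the sample filename are made up, and where you send the file depends on your support arrangement:

# Collect the configuration snapshot (run as root).
/opt/SUNWexplo/bin/explorer

# The result is a single compressed tarball, typically a few MB:
ls /opt/SUNWexplo/output
# explorer.80a1b2c3.somehost-2003.06.25.14.30.tar.gz

# Get that one small file back to Sun -- ftp it, email it, whatever
# your support engineer set up. That's the whole job.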
But as soon as one goes down, they can't call me fast enough to ask me 20 questions about what happened, when will you be here, why isn't it fixed yet, etc.
Give me a break. If you weren't worried enough about it to be proactive before it went down, why should I drop everything now to bail out your sorry butt? Oh yeah, you are the customer.
Then there are those customers who spend MILLIONS on computer gear but won't pony up money in the tens of thousands to actually train their administrators and operators on how it works. Of course, training is no guarantee that someone will become familiar and competent, but it's probably the first step.
So I like my job; I get great satisfaction out of installing large systems and solving hard problems.
However, I dread visiting some customers because there is no give and take. Well, I suppose there is: they want to give all the responsibility for failures to the vendor and take credit for all of the successes.
That wears me thin from time to time.
One final stupid customer story, and I'll close this out. About a month ago now, I had a customer with an SF12K (basically a machine that can be domained into up to nine separate computers, or one really big one with over 50 microprocessors). Anyway, I get a call to gather the Explorer data from this host because one of the domains is down. The only message in the case is "Domain X panics, customer cannot send explorer data."
Well, one of these things runs about $1 million, so you'd think an outfit with that kind of money to spend could figure out how to transfer this data over a network back to Sun for analysis. Nope, they can't, so I drive to downtown St. Louis at about 3:30 in the afternoon to pick up a tape and transport it back to my office to upload. I get the tape and have to travel back to the office in rush-hour traffic.
So it's now close to 5 PM when I get back to the office; the host has been down for several hours, and I'm just now getting the data. I upload the tape (which contains all of about 10 MB of data, at most) and send it to the backline engineers, and I begin looking at it myself with one of the local engineers who has "after-hours" duty (basically the on-call guy), so he is familiar with the case if there are any real field tasks.
One nice thing about these SF12Ks, and the even bigger SF15K, is that the console of each machine is logged, so every keystroke and response is visible.
Well, I look at the console log to get the PANIC message and see whether it is a hardware issue, and it isn't; the system is complaining about missing files.
(For the geeks out there, it was giving an errno 8, ENOEXEC, on /etc/init, which is symbolically linked to /sbin/init, the mother of all Unix/Solaris processes.)
So I call the customer up and tell them it appears /sbin/init is missing or corrupt, and they will probably need to restore from a backup tape. The response: this is the backup server. DOH!!!
So I ask them if they made any backups with the native tools that come with Solaris. Even if you are using some slick tool from another vendor to manage a complex backup scheme, it is always a good idea to have some sort of backup that you can restore quickly in what many call a "bare metal" restore. If your slick tool is trashed, you can't use all that slickness to restore it, but if you have a ufsdump (the built-in utility in Solaris) containing a copy of your O/S and that slick tool with all of its indexes, your local Sun SSE can bail you out.
Nope, no such ufsdump...
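For the record, here's how little effort that would have taken. A minimal sketch of a bare-metal-style ufsdump and restore; the disk and tape device names are just examples, so substitute your own:

# Level 0 (full) dump of the root slice to the first tape drive.
ufsdump 0uf /dev/rmt/0 /dev/rdsk/c0t0d0s0

# Bare-metal recovery: boot single-user from Solaris media, rebuild the
# filesystem, pull the dump back, and reinstall the boot block.
newfs /dev/rdsk/c0t0d0s0
mount /dev/dsk/c0t0d0s0 /mnt
cd /mnt && ufsrestore rf /dev/rmt/0
installboot /usr/platform/`uname -i`/lib/fs/ufs/bootblk /dev/rdsk/c0t0d0s0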
So anyway, the after-hours engineer ends up spending a good portion of the night helping them recover, basically doing a task that the admins should be able to do in his or her (yes, we had both men and women on the customer's team) sleep.
The next day, I start digging into why it happened. It seems the person on the console at 10:29:41 typed about 500 lines of commands in a matter of seconds. The only way that happens is an errant cut and paste.
It seems someone at the console, using the root account, accidentally pasted some screen output at the prompt, and of course the shell interpreted that information as commands to execute.
Some of that output included a directory listing, basically the output of ls -l on the /etc directory.
This output looks like:
lrwxrwxrwx 1 root root 12 Feb 19 16:59 hosts -> ./inet/hosts
drwxr-xr-x 2 adm adm 512 Feb 19 16:59 http
drwxr-xr-x 2 root sys 512 Feb 20 23:46 imq
drwxr-xr-x 4 root sys 1024 Mar 18 18:31 inet
lrwxrwxrwx 1 root other 17 Feb 20 22:08 inetd.conf -> ./inet/inetd.conf
lrwxrwxrwx 1 root root 12 Feb 19 16:59 init -> ../sbin/init
drwxr-xr-x 2 root bin 3072 May 30 16:57 init.d
prw------- 1 root root 0 Jun 25 11:30 initpipe
-rw-r--r-- 1 root sys 1360 May 30 16:58 inittab
lrwxrwxrwx 1 root root 19 Feb 19 16:59 install -> ../usr/sbin/install
I didn't give you all of it, but I did include the /etc/init line I mentioned above.
I figured out what happened to the customer's machine: the shell interpreted the line
lrwxrwxrwx 1 root root 12 Feb 19 16:59 init -> ../sbin/init
as "execute the command lrwxrwxrwx with some arguments and redirect the output (that's what the > means) to the file ../sbin/init."
Well, lrwxrwxrwx is not a valid Solaris command, so the file ../sbin/init was replaced with the output of that command: ZERO bytes. Basically, this file (and many others) was zeroed out by operator or administrator error.
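You can see the mechanism for yourself without hurting anything. A quick sketch with a throwaway file (the exact "not found" message depends on your shell):

# Make a scratch file with some contents in it.
echo "important data" > scratch

# Now paste that ls -l line at the prompt, with scratch as the target:
lrwxrwxrwx 1 root root 12 Feb 19 16:59 init -> scratch
# sh: lrwxrwxrwx: not found

# Too late. The shell opened and truncated the redirection target
# BEFORE it went looking for a command named lrwxrwxrwx:
ls -l scratch
# scratch is now ZERO bytes

The redirection is set up before the command search ever runs, so by the time you see "not found," the file is already empty.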
But the hard part for me and our team was that this was now OUR emergency. These folks had millions to spend on hardware (they have one SF12K and three SF15Ks, plus a pretty large amount of storage), but they couldn't be bothered with training the admins or putting in the proper safeguards to prevent direct root logins.
(You should not allow a person to log in as root directly; instead, you should require normal user logins and have those users take the root role with the su command. Then you can check the su log to see who was root when the damage was done.)
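On Solaris, both of those safeguards are one-liners in the stock config files. A minimal sketch; the defaults vary by release, so check your own /etc/default files, and note the username in the sample log line is made up:

# /etc/default/login -- limit direct root logins to the physical console.
CONSOLE=/dev/console

# /etc/default/su -- record every su attempt.
SULOG=/var/adm/sulog

# Later, when something breaks, see who had root and when:
grep root /var/adm/sulog
# SU 06/25 10:28 + pts/3 jsmith-root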
But even after we came out and fixed what they broke, and even after we POLITELY demonstrated how the human error occurred, and even demonstrated WHEN it happened so they could figure out who did it (blame is very important in corporate America), I don't think I got so much as a thank-you, nor an apology from the guilty party.
I'm pretty sure the guilty party was on our case for a faster resolution, but he or she certainly wasn't quick to admit, or even recognize, having trashed the machine. Instead, they wanted to quietly open a service call and hope we'd fix it without finding out how the system went down.
No wonder I've lost my hair; my customers make me pull it out.
It's great that I have a good team to work with in the local office, otherwise I really would hate my job.
Thanks for reading.
TB
(Misspellings and improper word usages will just stay in this post.)