2013-09-07

really ... don't do it!

Tempting. Very tempting. But first ...

Nice eye-catching advert, as seen at LinkedIn.

Anxious to quickly solve an issue? Go ahead, reboot that box. The phone is ringing with plenty of other attention time sinks just waiting to ... sink your time. So reboot, and move on to the next issue.

But first ...

Putting out fires can be quite fun. For a little while, at least. Nothing makes the day go by faster than staying busy.

But if one simply spends all their time and energy on fire fighting, there is no time left over for ... actual fire prevention. Prevent the fire from getting started in the first place, and amazingly, actual productive work time begins.

First things first, kudos and props are due to Steve Litt and his book Troubleshooting Techniques of the Successful Technologist. Anyone thinking of a career in programming or engineering (and no, that does not include civil engineers, that is slightly different ... maybe another topic for later ... but first ...) can glean much benefit and wisdom from Mr. Litt's book.

The ...Technologist book is some pretty heavy stuff. If you are looking at earning an EE sheepskin, you're one of the target audiences. If that isn't the case, his other troubleshooting titles, with not quite as much depth in engineering level subject matter, might be a better fit.

Or, one can wing it. And hope to get lucky. I've had the privilege and honor of working with many different technicians, electrical engineers, programmers, developers ... and cannot count myself among the elite crowd of perhaps the one person out of twenty, or more like the one out of one thousand, with a natural, instinctive ability for troubleshooting. The rest of us have to work at it.

Every time I find myself in a rush, forgetting the training, experience ... I've taken a wrong turn or three, more times than I would care to admit. And while a wrong turn can eventually get you to a right solution, with plenty of other learning experiences along the way, time spent is ... time spent. It's gone, it is not coming back, and life does not present any do-over opportunities when it comes to how one spends their time.

So now it's time for some specifics. There was the mention of privilege, honor, and work mates. But unfortunately, for a few of them, the "privilege" part was ... not so much. More like a challenge. One in particular, when getting reports of "clients experiencing issues", would pop into the application server host and try this. Try that. Didn't help. Change something. Change something else. Add a thing or two. Maybe three, or even five. Delete a couple things. Tweak a configuration file, or twenty of them.

Eventually, after a couple of hours, or could be a couple of days later, Mr. Trouble Shooter (Spreader?) smugly pops into the team office and "mentions" trouble with something, and it just has to be the database. Pressed for more information, Mr. Tee does not offer any helpful details at all. Even though he's the actual source of three or four other new problems. "I'm not sure what is going on. Nothing has changed. The clients just started getting errors."

Really. Yep. Nothing at all changed. Eventually, a replacement team mate for Mr. Tee gets stuck trying to undo all the recent damage on the application server host. And there is a reason they don't allow me to access that particular server with admin rights: the less I work on a Windows server, generally, the healthier it stays. But there are still lots of servers where I could achieve similar damage results.

Problems don't just show up all by themselves. Maybe a server crashes; a full swap partition is one possible cause for an operating system lockup. Or perhaps a misplaced foot trips and yanks out a power cord. Or a programmer decides a change is needed, and does some kind of data update. Or a usage tipping point is reached: we didn't buy enough CPU to handle 400 concurrent client sessions, or didn't buy enough memory, or an unknown software bug causes a program to hang onto a limited resource well after it's no longer needed, and it just plain demands more.

Cause and effect. It's not effect and cause, like saying the temperature increase in the morning causes the sun to appear above the horizon.

So, to properly troubleshoot, what do you DO? The immediate first answer: DO NOTHING. That's right, NOTHING is the correct answer. Of all the tools available for diagnosing troubles, signal meters, packet sniffers, blinky lights, blinky lights that don't seem to be blinking right (yes ... more on that one later ...) the most important item in the troubleshooter's toolbox is ... the troubleshooter's brain.

Diving right into the try this, try that, what happens when we do so and so ... will usually lead to more trouble. Take a step back, a deep breath, and think. Not do.

You might have a manager, or worse, a director, or some other individual way higher up the food chain dancing around and pounding on your desk, if you still have one, demanding a fix yesterday. Better yet, last week. And you best be sure to attend the "how could we have done better" meeting set for tomorrow, and be prepared to fall upon your sword, or toss a team mate under the bus.

Learn proper troubleshooting technique. It requires clear logic, your basic Yes/No, or Go/No Go check; there is no grey area. At least that is how it should be. Trying to find a bottleneck, or put a finger on which overused resource is causing "the system is slooooow" complaints, is one of the more challenging scenarios.

Verify the problem. If you can't verify, you can't be sure if a "fix" is an actual fix. Find and correct the problem; fixing symptoms is not fixing the root cause of the trouble.

For a specific example: at community college, in a professor's Java class, students were running the lab programs. Everything worked fine, for a while, and then ... not. Turns out a close handle call was absent from the code, so everyone that ran a servlet was ... surprise, holding onto a limited resource well after it was no longer needed. Reboot. Problem solved.

Or not. The reboot frees up those handles, but the underlying problem, the cause, is still present. Maybe there's a JSP setting to increase file handles, or something similar, but that is a symptom fix, not a problem fix. Double the handles, maybe double the working lifetime. Something like that.
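The lab code itself is long gone, and it was Java, so what follows is only a Python analogue of the same mistake. HandlePool, Handle, leaky_request and fixed_request are all made-up names for illustration. The pool plays the part of the limited resource, and the only difference between the two request functions is whether the handle gets closed.

```python
class HandlePool:
    """Hypothetical stand-in for a limited resource such as file handles."""

    def __init__(self, limit):
        self.limit = limit
        self.in_use = 0

    def acquire(self):
        if self.in_use >= self.limit:
            raise RuntimeError("out of handles")
        self.in_use += 1
        return Handle(self)


class Handle:
    def __init__(self, pool):
        self.pool = pool
        self.closed = False

    def close(self):
        # Give the handle back to the pool exactly once.
        if not self.closed:
            self.closed = True
            self.pool.in_use -= 1

    def __enter__(self):
        return self

    def __exit__(self, exc_type, exc, tb):
        self.close()
        return False


def leaky_request(pool):
    handle = pool.acquire()
    return "ok"  # the bug: close() is never called


def fixed_request(pool):
    # The with block closes the handle even if the body raises.
    with pool.acquire():
        return "ok"


def requests_until_failure(request, limit, attempts):
    """Count how many requests succeed before the pool runs dry."""
    pool = HandlePool(limit)
    for served in range(attempts):
        try:
            request(pool)
        except RuntimeError:
            return served
    return attempts
```

With a limit of 10 the leaky version serves 10 requests and then fails. Raise the limit to 20 and it serves 20: double the handles, double the working lifetime, same problem. The fixed version keeps going for as many attempts as you give it.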

In a House episode, *maybe* it was Season 8, Dead & Buried, Hugh Laurie's character says, "Death is a consequence, not a symptom. If it's not a symptom it's not relevant". Or something very close to that, and oh how true it is. Brilliant. Cause and effect.

So gather up the info. Write down the symptoms. Elaborate on the symptoms. List out where the trouble might be. As Mr. Litt often states, "Narrow it down". And don't neglect the logic part: does it make sense? If unit X really is the source of the trouble, is it used by another system? And are symptoms present in the other system(s) as well? And verify the problem. If you can't reliably and predictably reproduce the problem, there is no way to ensure that a fix is indeed an actual problem fix.
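One way to picture "narrow it down" with a Go/No Go check is a plain binary search. This is my own sketch, not something lifted from Mr. Litt's book, and it leans on a big assumption: the stages form an ordered chain, every stage before the fault checks out good, and every stage from the fault onward checks out bad. The stage names and the first_failing function are made up for the example.

```python
def first_failing(stages, passes):
    """Return the index of the first stage where passes(stage) is False.

    Returns None when every stage passes. Assumes that once one stage
    fails, every later stage fails too.
    """
    lo, hi = 0, len(stages)
    while lo < hi:
        mid = (lo + hi) // 2
        if passes(stages[mid]):
            lo = mid + 1  # Go: the fault is further down the chain
        else:
            hi = mid      # No Go: the fault is here or earlier
    return lo if lo < len(stages) else None


# Hypothetical chain, listed from the client end to the database end.
stages = ["browser", "proxy", "web server", "app server",
          "connection pool", "database", "storage", "backup"]
fault = stages.index("database")  # pretend this is where things break

checks = []

def passes(stage):
    checks.append(stage)  # keep a written record of what was checked
    return stages.index(stage) < fault
```

Calling first_failing(stages, passes) here lands on "database" after a handful of checks instead of walking all eight stages. Real systems are rarely this tidy, so treat it as a way of thinking, not a tool.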

Yes, a reboot is a quick solution. But it also clears the system memory, running processes, and other bits of information. Which just might hold important clues (symptoms) of the root cause of the problem.
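If the reboot really can't wait, grab the clues first. Below is a minimal sketch that runs a few commands and saves their output into a timestamped folder. The function name capture_snapshot and the default command list are assumptions on my part: ps, free and df are common on Linux boxes but may be missing or take different options elsewhere, so swap in whatever your host actually provides.

```python
import subprocess
import time
from pathlib import Path

# Assumed Linux-style commands; replace with what the host really has.
DEFAULT_COMMANDS = {
    "processes": ["ps", "aux"],
    "memory": ["free", "-m"],
    "disk": ["df", "-h"],
}


def capture_snapshot(out_dir, commands=DEFAULT_COMMANDS, timeout=30):
    """Run each command and save its output under a timestamped folder."""
    target = Path(out_dir) / time.strftime("snapshot-%Y%m%d-%H%M%S")
    target.mkdir(parents=True, exist_ok=True)
    for name, argv in commands.items():
        try:
            result = subprocess.run(
                argv, capture_output=True, text=True, timeout=timeout
            )
            text = result.stdout + result.stderr
        except (OSError, subprocess.TimeoutExpired) as err:
            # A missing or hung command is itself a clue, so record it.
            text = f"could not run {argv}: {err}\n"
        (target / f"{name}.txt").write_text(text)
    return target
```

Something like capture_snapshot("/var/tmp") right before the reboot leaves a folder of text files to read once the phone stops ringing. The path is just an example.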

And yes, can't forget about the Blinky Lights. I do have the privilege of counting that one out of one thousand elite troubleshooters as a coworker, and a good friend. One night, a good while back, perhaps 02:00 (that is 2AM for everyone else) the phone rings. This is a true story.

  • Hello?
  • The janitor says the computer lights are blinking funny in the training room.
  • Ummm OK then. Well, since the janitor has obviously diagnosed a system problem, then he is certainly welcome to fix it.
  • zzzzzzz .....

I would have soooo wanted my two minutes back. Plus the fifteen or twenty minutes it takes to get back to dream state sleep.

Also have to share the trouble ticket we got one day, "Line printer. Printing one line at a time." Hmmmm.

Sounds like it's functioning as designed.

Also can't count the number of times we get an email, "I'm getting an error. Do you see anything going on in the system?" Really? You have an error. And my attention.

But would the additional ten seconds needed to paste the error text, or twenty seconds to grab a screen shot of the error dialog box (and save it in high colour bitmap format, what does a jpeg mean anyway, can we also test the mail server under a heavier workload) ... is that expecting too much? I can help you, but you have to help me to help you too. Something like that.

Buzzing all over the farm looking for a needle in a haystack is not a productive use of time. In fact, the farm might not even be in the right county. Or village. Or it's the wrong haystack, that might be a challenge too.

But first ...