How I Learned to Start Loving Implementation Plans

February 11, 2009

Hal Pomeranz, Deer Run Associates

Lately I’ve been seeing multiple articles pop up arguing for the importance of plans and checklists in many walks of life, but especially in IT process.  I couldn’t agree more.  In fact, when an IT person believes something as strongly as I believe in the power of implementation plans and checklists, you know there must be a story involved.  Let me tell you mine.

Back in the mid-’90s my good friend Jim Hickstein and I were working for the Internet skunkworks of a huge East Coast direct mail marketing company.  At the time I think I was the Operations Manager and Jim was my tech lead in the Sys Admin group, but since we were a small operation both Jim and I were still deeply involved in the tactical implementation and roll-out of new technologies.

In particular, there was one instance where Jim and I were scheduled to roll out a new storage array into our production ecommerce infrastructure.  Our production environment was hosted at company HQ down in Connecticut (our offices were in the Boston area), and in some sense this IT activity was part of an ongoing evangelism on our part of new “open” technology in what was largely a mainframe shop up until our arrival.  We wanted to show that we were better, faster, cheaper, and just as reliable as the platforms that they were used to.

I remember talking to Jim and suggesting that we come up with an implementation plan prior to the installation activity.  But clearly neither one of us had gotten the religion yet, because after talking about it for a while we decided that we “knew what we were doing” (despite never having deployed a unit of this type before) and didn’t need a plan.  So with our figurative cowboy hats firmly in place, we had the equipment shipped and a couple of weeks later we hopped in Jim’s car and followed it down to wildest Connecticut.

Upon arrival, we met up with a couple of the IT folks from HQ who were curious about what we were doing and wanted the chance to observe how we operated.  That was cool with us, because we wanted to evangelize as much as possible, remember?  Off we all went to the data center, unpacked and inspected the equipment, and generally killed time waiting for our three-hour outage window to commence.  Once the outage window arrived, Jim and I began our work.

Crisis set in almost immediately.  I can’t even remember at this point what started going wrong, but I think almost every person who’s done production Operations recognizes that “bottom drops out” feeling you get in your stomach when you suddenly realize what you thought was routine work has gone horribly awry.  As I recall, I was working the back of the machine hooking up the storage array and Jim was on the system console.  I clearly remember staring across the rack at Jim and seeing the same horrified look reflected in his eyes.

I don’t remember much of the next two and a half hours.  I know we didn’t panic, but started working through the problems methodically.  And we did end up regaining positive control and actually got the work completed, just barely, within our outage window.  But frankly we had forgotten anything else in the world existed besides our troubled system.

When the crisis was resolved and we returned to something resembling the real world, we remembered that our actions were being scrutinized by our colleagues from HQ.  I turned to acknowledge their presence, expecting to see (deservedly) self-satisfied smiles on their faces from seeing us nearly fall on our faces in a very public fashion.  Well, boy, was I surprised: they were actually standing there in slack-jawed amazement.  “That was the most incredible thing I’ve ever seen!  You guys are awesome!” one of them said to me.

What could we do?  Jim and I thanked them very much for their kind words, used some humor to blow off some of our excess adrenaline, and extracted ourselves from the situation as gracefully as possible.  After leaving the data center, Jim and I walked silently back to his car, climbed in, and just sat there quietly for a minute.  Finally, Jim turned to me and said, “Let’s never do that again.”

And that was it.  From that point on our operations policy playbook, enforced religiously by both Jim and me, included the requirement that all planned work include a detailed implementation and rollback plan that must be reviewed by at least one other member of the operations staff before any work would be approved.  And to those who are thinking that this policy must have slowed down our ability to roll out changes in our production infrastructure, you couldn’t be more wrong.  In fact, our ability to make changes improved dramatically because our percentage of “successful” change activity (defined as not requiring any unplanned work or firefighting as a result of the change) got to better than 99%.  We simply weren’t wasting time on unplanned work, so as a consequence we had more time to make effective changes to our infrastructure.
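
For readers who like to see a rule written down as code, here is a minimal Python sketch of that approval policy.  It is an illustration only, not anything from the original playbook: the ChangePlan class, its field names, the success_rate helper, and the sample steps are all hypothetical.  The sketch assumes a plan is approvable only when it has implementation steps, rollback steps, and at least one reviewer other than its author.

```python
from dataclasses import dataclass, field


@dataclass
class ChangePlan:
    """Hypothetical record for one planned change (all names are illustrative)."""
    title: str
    author: str
    implementation_steps: list[str] = field(default_factory=list)
    rollback_steps: list[str] = field(default_factory=list)
    reviewers: set[str] = field(default_factory=set)

    def is_approvable(self) -> bool:
        # A review only counts if it comes from someone other than the author.
        independent_reviewers = self.reviewers - {self.author}
        return (
            bool(self.implementation_steps)
            and bool(self.rollback_steps)
            and bool(independent_reviewers)
        )


def success_rate(outcomes: list[bool]) -> float:
    """Fraction of changes that needed no unplanned work (True = clean change)."""
    if not outcomes:
        raise ValueError("no changes recorded")
    return sum(outcomes) / len(outcomes)


# Made-up example plan; the steps are placeholders, not a real procedure.
plan = ChangePlan(
    title="Add storage array",
    author="hal",
    implementation_steps=["Halt host", "Cable array", "Boot and verify volumes"],
    rollback_steps=["Halt host", "Disconnect array", "Boot on original disks"],
)
print(plan.is_approvable())  # False: nobody else has reviewed it yet
plan.reviewers.add("jim")
print(plan.is_approvable())  # True: steps, rollback, and an independent review
```

The check deliberately ignores a self-review, since the point of the policy is a second pair of eyes.  The success_rate helper mirrors the definition of a “successful” change given above: a change counts only if it required no unplanned work or firefighting.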

That’s it.  Something as humble as a checklist can make your life immeasurably better.  But you have to be willing to admit that you need the help of a checklist in the first place, which compromises our need for the “expert audacity” that so many of us in IT feed on.  It’s a tough change to make in oneself.  I only hope you won’t have to learn this lesson the hard way, like Jim and I did.


4 Responses to “How I Learned to Start Loving Implementation Plans”

  1. tixrus said

    heh heh. You & Jim went down there pretty knowledgeable with a fairly detailed idea of what you needed to do and yet it was a crisis because there are always some little gotchas that you didn’t count on or think about ahead of time. I would say more power to the person who can THINK UP all the potential gotchas ahead of time and make a checklist of them, so when you have a timed situation like that you just press all the right buttons bing bang boom and you have a contingency for everything. Kinda like the perfect software plans for every possible exception.

  2. I have told the up-side of this story a number of times since, chiefly in job interviews. 🙂 I now use the term “time machine” to describe the detailed written plan. With the detailed written plan, if you forget a step further up the page, you can go back and fix it, before it happens. You get to GO BACK IN TIME.

    So don’t do: Think, Type. Think, Type. Think, Type. “Shit”.

    Do: Think, Think, Think. Type, Type, Type. Beer.

    If you can’t literally cut and paste from the plan document into the root shell, you haven’t got enough detail in your plan, and “variables” are coming from your head. They won’t be right. On a subsequent trip to Connecticut, we needed to move all the drives around in the storage array, because someone had set them up so that both drives in a given RAID-1 pair, for instance, were in the same tray. You could not remove just one drive to replace it. The detailed written plan that I wrote for this trip had two branches, Plan A, and Plan B, depending on whether the slot numbers were row-major or row-minor in the trays. (Without an example of the box, and not trusting anyone to “verify” this remotely, it was impossible to know ahead of time.) I cut and pasted the test, selected Plan B, pasted that, and got to bed before dawn.

  3. Tixrus: You don’t have to be an expert to get value out of this, though. Merely thinking about it ahead of time will get you 80% of the benefit. You don’t paste it all in one shot, either: It goes in chunks, with tests interspersed. And you do still have to think if it doesn’t go according to plan. But it’s still much more controllable.

    I have sometimes forgotten to _read_ my carefully prepared plan, and on at least one occasion it cost me an extra trip to Toronto, because I plumb forgot to do the next step, and glibly skipped over it, doing the next thing from my head. (Their change-control regime was a nightmare.) So that takes practice, too.

  4. Awesome story! I too was a renegade IT guy herding the proverbial cats we dealt with every day. The expert audacity was a big factor, but so was the they-can’t-fire-me-if-I-don’t-write-it-down mentality that many of us had. About 10 years ago, I figured it out. Now knowledge sharing allows me to be wildly more successful than I used to be because I focus on predictable outcomes. Two things led me to that: shooting film photography semi-pro, and becoming a pilot. I’ve flown the exact same plane dozens of times, but I still get out that checklist to make sure I don’t miss a step.

    In fact, I now have checklists for packing when I go out of town. The only times that I forget things are when I go off the checklist!

