Lately I’ve been seeing multiple articles pop up arguing for the importance of plans and checklists in many walks of life, but especially in IT process. I couldn’t agree more. In fact, when an IT person believes something as strongly as I believe in the power of implementation plans and checklists, you know there must be a story involved. Let me tell you mine.
Back in the mid-’90s my good friend Jim Hickstein and I were working for the Internet skunkworks of a huge East Coast direct mail marketing company. At the time I think I was the Operations Manager and Jim was my tech lead in the Sys Admin group, but since we were a small operation both of us were still deeply involved in the tactical implementation and roll-out of new technologies.
In particular, there was one instance where Jim and I were scheduled to roll out a new storage array into our production ecommerce infrastructure. Our production environment was hosted at company HQ down in Connecticut (our offices were in the Boston area), and in some sense this IT activity was part of an ongoing evangelism on our part of new “open” technology in what was largely a mainframe shop up until our arrival. We wanted to show that we were better, faster, cheaper, and just as reliable as the platforms that they were used to.
I remember talking to Jim and suggesting that we come up with an implementation plan prior to the installation activity. But clearly neither one of us had gotten the religion yet, because after talking about it for a while we decided that we “knew what we were doing” (despite never having deployed a unit of this type before) and didn’t need a plan. So with our figurative cowboy hats firmly in place, we had the equipment shipped and a couple of weeks later we hopped in Jim’s car and followed it down to wildest Connecticut.
Upon arrival, we met up with a couple of the IT folks from HQ who were curious about what we were doing and wanted the chance to observe how we operated. That was cool with us, because we wanted to evangelize as much as possible, remember? Off we all went to the data center, unpacked and inspected the equipment, and generally killed time waiting for our three-hour outage window to commence. Once the outage window arrived, Jim and I began our work.
Crisis set in almost immediately. I can’t even remember at this point what started going wrong, but I think almost every person who’s done production Operations recognizes that “bottom drops out” feeling you get in your stomach when you suddenly realize what you thought was routine work has gone horribly awry. As I recall, I was working the back of the machine hooking up the storage array and Jim was on the system console and I clearly remember staring across the rack at Jim and seeing the same horrified look reflected in his eyes.
I don’t remember much of the next two and a half hours. I know we didn’t panic, but started working through the problems methodically. And we did end up regaining positive control and actually got the work completed, just barely, within our outage window. But frankly we had forgotten anything else in the world existed besides our troubled system.
When the crisis was resolved and we returned to something resembling the real world, we remembered that our actions were being scrutinized by our colleagues from HQ. I turned to acknowledge their presence, expecting to see (deservedly) self-satisfied smiles on their faces from watching us nearly fall flat in a very public fashion. Well, boy, was I surprised: they were actually standing there in slack-jawed amazement. “That was the most incredible thing I’ve ever seen! You guys are awesome!” one of them said to me.
What could we do? Jim and I thanked them very much for their kind words, used some humor to blow off some of our excess adrenaline, and extracted ourselves from the situation as gracefully as possible. After leaving the data center, Jim and I walked silently back to his car, climbed in, and just sat there quietly for a minute. Finally, Jim turned to me and said, “Let’s never do that again.”
And that was it. From that point on our operations policy playbook, enforced religiously by both Jim and me, included the requirement that all planned work include a detailed implementation and rollback plan that must be reviewed by at least one other member of the operations staff before any work would be approved. And to those who are thinking that this policy must have slowed down our ability to roll out changes in our production infrastructure, you couldn’t be more wrong. In fact, our ability to make changes improved dramatically, because our percentage of “successful” change activity (defined as not requiring any unplanned work or firefighting as a result of the change) got to better than 99%. We simply weren’t wasting time on unplanned work, so as a consequence we had more time to make effective changes to our infrastructure.
That’s it. Something as humble as a checklist can make your life immeasurably better. But you have to be willing to admit that you need the help of a checklist in the first place, which runs counter to the “expert audacity” that so many of us in IT feed on. It’s a tough change to make in oneself. I only hope you won’t have to learn this lesson the hard way, like Jim and I did.