Good show!

February 13, 2009

Hal Pomeranz, Deer Run Associates

Gene Kim, Kevin Behr, and I have been having some fun on Twitter (where we’re @RealGeneKim, @kevinbehr, and @hal_pomeranz respectively) with the #itfail thread.  Every time we come across an example of poor IT practice, one of us throws out a tweet.  It’s driven some entertaining discussions, and generally been a good way to blow off steam.

But I also think folks with good IT practices should be recognized in some way.  So I just wanted to send out this quick post to recognize the proactive IT process demonstrated by our hosts here at WordPress.com.  When I logged into my dashboard today, I found the following message waiting for me:

We will be making some code changes in about 42 hours which will log you out of your account. They should only take a few seconds and you should be able to log in afterwards without any problems. (more info)

It seems like such a simple thing, but to me it demonstrates some excellent IT process competencies:

  • Planned changes: they know well in advance when change is occurring and how long they expect it to take
  • Situational awareness: they know who is going to be impacted and in what fashion
  • Communication: they have a mechanism for alerting the affected parties in a reasonable time frame
  • Transparency:  they’re willing to alert their user community that they will be inconvenienced rather than just let it happen and hope nobody notices

While these all may seem like trivial or “of course” items to many of you reading this blog, let me tell you that many of the IT shops that I visit as part of my consulting practice regularly #itfail in some or all of the above areas.

So kudos to the folks at WordPress!  Good show!


Queue Inversion Week

February 12, 2009

Hal Pomeranz, Deer Run Associates

Reliving the last story from my days at the mid-90’s Internet skunkworks reminded me of another bit of tactical IT advice I learned on that job, one that has since become a proven strategy I’ve used on other engagements.  I call it “Queue Inversion Week”.

One aspect of our operations religion at the skunkworks was, “All work must be ticketed” (there’s another blog post behind that mantra, which I’ll get to at some point).  We lived and died by our trouble-ticketing system, and ticket priority values generally drove the order of our work-flow in the group.

The problem that often arises for organizations in this situation, however, is what I refer to as the “tyranny of the queue”.  Everybody on the team is legitimately working on the highest-priority items.  However, due to limited resources in the Operations group, lower-priority items tend to collect at the bottom of the queue and never rise to a level of severity that would get them attention.  The users who submitted these low-priority tickets tend to be very understanding (at least they were at the skunkworks) and would wait weeks or months for somebody in my group to get around to resolving their minor issues.  I suspect that during those weeks or months the organization was actually losing a noticeable amount of worker productivity to these “minor” issues, but we never quantified how much.

What did finally penetrate was a growing rumble of unhappiness from our internal customers.  “We realize you guys are working on bigger issues,” they’d tell me in staff meetings, “but after a few months even a minor issue becomes really irritating to the person affected.”  The logic was undeniable.

I took the feedback back to my team and we started kicking around ideas.  One solution that had a lot of support was to simply include time as a factor in the priority of the item: after a ticket had sat in the queue for some period of time, it would automatically be bumped up one priority level.  The problem was that when we started modeling the idea, we realized it wouldn’t work.  All of the “noise” from the bottom of the queue would eventually get promoted to the point where it interfered with critical work.
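To see why, here’s a minimal sketch of the kind of modeling we did.  The ticket data and the aging rule below are made up for illustration (our real ticketing system was different); the point is simply that with a fixed “bump after N days” rule, every piece of low-priority noise eventually reaches the top priority:

    # Minimal sketch of priority aging (hypothetical data and aging rule).
    # Tickets are bumped up one priority level after every AGE_DAYS in the
    # queue, until they reach priority 1 (the highest).

    AGE_DAYS = 14            # bump a ticket after two weeks of waiting
    HIGHEST = 1

    def aged_priority(original_priority, days_in_queue):
        """Priority after aging: one level higher per AGE_DAYS elapsed."""
        bumps = days_in_queue // AGE_DAYS
        return max(HIGHEST, original_priority - bumps)

    # A backlog of minor (priority-5) tickets nobody has touched:
    backlog = [("replace squeaky chair", 70),
               ("fix printer banner page", 56),
               ("alias for old hostname", 42)]

    for summary, days in backlog:
        print(f"{summary:28s} started at P5, now P{aged_priority(5, days)}")
    # Given enough weeks, every one of these reaches P1, right alongside
    # real outages; that is exactly the interference we wanted to avoid.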

Then my guy Josh Smift, who basically “owned” the trouble-ticketing system as far as customization and updates were concerned, had the critical insight: let’s just “invert” the queue for a week.  In other words, the entire Operations crew would simply work items from the bottom of the queue for a week rather than the top.  It was simple and it was brilliant.
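In scheduling terms the change is trivial, which was part of its appeal.  Here’s a minimal sketch of the idea (again with hypothetical ticket tuples rather than our actual ticketing system):

    # Queue inversion: for one week, hand out work from the *bottom* of the
    # priority-sorted queue instead of the top.  Tickets are
    # (priority, age_in_days, summary) tuples; a lower priority number
    # means more important.

    def next_ticket(queue, inversion_week=False):
        """Normally pick the most important (and oldest) ticket; during
        Queue Inversion Week pick the least important (and oldest) one."""
        if inversion_week:
            return max(queue, key=lambda t: (t[0], t[1]))
        return min(queue, key=lambda t: (t[0], -t[1]))

    queue = [(1, 2, "web cluster failing over"),
             (5, 70, "replace squeaky chair"),
             (4, 42, "alias for old hostname")]

    print(next_ticket(queue))                       # (1, 2, 'web cluster failing over')
    print(next_ticket(queue, inversion_week=True))  # (5, 70, 'replace squeaky chair')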

So we looked at the project schedule and identified what looked like a “slack” week and declared it to be “Queue Inversion Week.”  We notified our user community and encouraged them to submit tickets for any minor annoyances that they’d been reluctant to bring up for whatever reason.

To say that “Queue Inversion Week” was a raging success would be to put it mildly indeed.  Frankly, all I wanted out of the week was to clear our ticket backlog and get our customers off our backs, but the whole experience was a revelation.  First, the morale of my Operations team went through the roof.  Analyzing the reasons why, I came to several conclusions:

  • It got my folks out among the user community and back in touch with the rest of the company, rather than being locked up in the data center all day long.  The people my folks helped were grateful and expressed it to my team, which made a nice change from the usual, mostly negative feedback IT Ops people tend to get.
  • The tickets from the bottom of the queue generally required only the simplest tactical resolutions.  Each member of my team could resolve dozens of these items during the week (in fact, a friendly competition arose to see who could close the most tickets) and feel terrific afterwards, because they could see so much concrete good they’d done.
  • Regardless of what outsiders think, I believe most people in IT Operations really want to help the people who are their customers.  It’s depressing to know that there are items languishing at the bottom of the queue that will never get worked on.  This week gave my team an excuse to work on these issues.

I think I can reasonably claim that Queue Inversion Week also had a noticeable impact on the morale of the skunkworks as a whole.  After all, many of the annoying problems our users had simply been working around were now removed as obstacles.  Like a good spring cleaning, everybody could breathe a little easier and enjoy the extra sunshine coming through the newly cleaned windows.

We repeated Queue Inversion Week periodically during my tenure at the skunkworks, and every time it was a positive experience that everybody looked forward to and got a great deal of benefit from.  You can’t necessarily run it on a rigid schedule, because other operational priorities interfere.  But any time it looks like you have a little “slack” coming up in the project schedule and the bottom of your queue is full of annoying little tasks, consider declaring your own “Queue Inversion Week” and see if it doesn’t do you and your organization a world of good.

Hal Pomeranz, Deer Run Associates

Lately I’ve been seeing multiple articles pop up arguing for the importance of plans and checklists in many walks of life, but especially in IT process.  I couldn’t agree more.  In fact, when an IT person believes something as strongly as I believe in the power of implementation plans and checklists, you know there must be a story involved.  Let me tell you mine.

Back in the mid-90’s my good friend Jim Hickstein and I were working for the Internet skunkworks of a huge East Coast direct mail marketing company.  At the time I think I was the Operations Manager and Jim was my tech lead in the Sys Admin group, but since we were a small operation both Jim and I were still deeply involved in the tactical implementation and roll-out of new technologies.

In particular, there was one instance where Jim and I were scheduled to roll out a new storage array into our production ecommerce infrastructure.  Our production environment was hosted at company HQ down in Connecticut (our offices were in the Boston area), and in some sense this activity was part of our ongoing evangelism for new “open” technology in what was largely a mainframe shop up until our arrival.  We wanted to show that we were better, faster, cheaper, and just as reliable as the platforms they were used to.

I remember talking to Jim and suggesting that we come up with an implementation plan prior to the installation activity.  But clearly neither one of us had gotten the religion yet, because after talking about it for a while we decided that we “knew what we were doing” (despite never having deployed a unit of this type before) and didn’t need a plan.  So with our figurative cowboy hats firmly in place, we had the equipment shipped and a couple of weeks later we hopped in Jim’s car and followed it down to wildest Connecticut.

Upon arrival, we met up with a couple of the IT folks from HQ who were curious about what we were doing and wanted the chance to observe how we operated.  That was cool with us, because we wanted to evangelize as much as possible, remember?  Off we all went to the data center, unpacked and inspected the equipment, and generally killed time waiting for our three-hour outage window to commence.  Once the outage window arrived, Jim and I began our work.

Crisis set in almost immediately.  I can’t even remember at this point what started going wrong, but I think almost every person who’s done production Operations recognizes that “bottom drops out” feeling you get in your stomach when you suddenly realize what you thought was routine work has gone horribly awry.  As I recall, I was working the back of the machine hooking up the storage array and Jim was on the system console and I clearly remember staring across the rack at Jim and seeing the same horrified look reflected in his eyes.

I don’t remember much of the next two and a half hours.  I know we didn’t panic, but started working through the problems methodically.  And we did end up regaining positive control and actually got the work completed, just barely, within our outage window.  But frankly we had forgotten anything else in the world existed besides our troubled system.

When the crisis was resolved and we returned to something resembling the real world, we remembered that our actions were being scrutinized by our colleagues from HQ.  I turned to acknowledge their presence, expecting to see (deservedly) self-satisfied smiles after they had watched us nearly fall on our faces in a very public fashion.  Well, boy, was I surprised: they were actually standing there in slack-jawed amazement.  “That was the most incredible thing I’ve ever seen!  You guys are awesome!” one of them said to me.

What could we do?  Jim and I thanked them very much for their kind words, used some humor to blow off some of our excess adrenaline, and extracted ourselves from the situation as gracefully as possible.  After leaving the data center, Jim and I walked silently back to his car, climbed in, and just sat there quietly for a minute.  Finally, Jim turned to me and said, “Let’s never do that again.”

And that was it.  From that point on our operations policy playbook, enforced religiously by both Jim and me, included the requirement that all planned work include a detailed implementation and rollback plan, reviewed by at least one other member of the operations staff before the work would be approved.  And to those who are thinking that this policy must have slowed down our ability to roll out changes in our production infrastructure: you couldn’t be more wrong.  In fact, our ability to make changes improved dramatically, because our percentage of “successful” change activity (defined as not requiring any unplanned work or firefighting as a result of the change) got to better than 99%.  We simply weren’t wasting time on unplanned work, so as a consequence we had more time to make effective changes to our infrastructure.
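For what it’s worth, the gate itself takes only a few lines to express.  This is a hypothetical sketch (our actual workflow was different, and the field names here are invented), just to show how little machinery the policy really requires:

    # Hypothetical pre-approval gate: a planned change is approved only if it
    # carries an implementation plan, a rollback plan, and at least one
    # reviewer who is not the person making the change.

    def change_approved(change):
        has_plan = bool(change.get("implementation_plan"))
        has_rollback = bool(change.get("rollback_plan"))
        reviewers = set(change.get("reviewers", [])) - {change.get("owner")}
        return has_plan and has_rollback and len(reviewers) >= 1

    change = {
        "owner": "hal",
        "implementation_plan": ["pre-stage cables", "attach array", "verify mounts"],
        "rollback_plan": ["detach array", "restore previous config", "reboot"],
        "reviewers": ["jim"],
    }
    print(change_approved(change))   # True

The point, of course, isn’t the code; it’s the discipline of writing the plan down and having somebody else read it before you touch production.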

That’s it.  Something as humble as a checklist can make your life immeasurably better.  But you have to be willing to admit that you need the help of a checklist in the first place, which cuts against the “expert audacity” that so many of us in IT feed on.  It’s a tough change to make in oneself.  I only hope you won’t have to learn this lesson the hard way, like Jim and I did.