The Blame Game

 

“A strange game. The only winning move is not to play.”

from the movie “WarGames” (1983)

One classic pathology of low-performing IT organizations is that when an outage occurs they spend an inordinate amount of time trying to figure out whose fault it is rather than working the problem.  Dr. Jim Metzler has even coined a metric for this activity: Mean Time To Innocence (MTTI), defined as the average time it takes each part of the IT Operations organization to demonstrate that it’s not responsible for the outage.  Whether you call it a “Witch Hunt” or “The Blame Game” or identify it with some other term, it’s a huge waste of time and ends up making everybody involved look like a complete ignoramus.  It’s also one of the classic signs to the rest of the business that IT Operations is completely out of touch, because otherwise they’d be trying to solve the problem rather than working so hard to find out whose fault it is.

I’m so intolerant of this kind of activity that I will often accept the blame for things I’m not responsible for just so we can move out of the “blame” phase into the “resolution” phase.  As the CIO in Metzler’s article so eloquently put it, “I don’t care where the fault is, I just want them to fix it.”   At the end of the day, nobody will remember whose fault it is, because once the problem is addressed they’ll forget all about it in the rush of all the other things they have to do.  At most they’ll remember the “go-to guy/gal” who made the problem go away.

To illustrate, let me tell you another story from my term as Director of IT for the Internet skunkworks of a big Direct Mail Marketing company.  We were rolling out a new membership program, and as an incentive we were offering the choice of one of three worthless items of Chinese-made junk with each new membership.  I’m talking about the kind of stuff you see as freebie throw-ins on those late-night infomercials: book lights, pocket AM/FM radios, “inside the shell” egg scramblers, etc.  The way new members got their stuff is that we passed a fulfillment code to the back-end database at corporate, which triggered the warehouse to mail the right piece of junk to the new member’s address.

About a week and a half into the campaign our customer support center started getting lots of angry phone calls: “Hey! I requested the egg scrambler and got this crappy book light instead.”  This provoked a very urgent call from one of the supervisors at the call center.  I said it sounded like there was a problem somewhere in the chain from our web site into fulfillment and I’d get to the bottom of it, and in the meantime we agreed that the best policy was to tell the customer to keep the incorrectly shipped junk as our gift and we’d also send them the junk they requested.

Once we started our investigation, the problem was immediately obvious.  We had an email from the fulfillment folks with the code numbers for the various items, and those were the code numbers programmed into our application.  However, when we checked that list of fulfillment codes against the back-end data dictionary, we realized they’d transposed the numbers for the various items when they sent us the email.  A classic snafu and an honest mistake.  Once we figured out the problem, it took seconds to fix the codes and only a few minutes to run off a report listing all of the new members who would very shortly be receiving the wrong items.

So the question then became how to communicate the problem and the resolution to the rest of the business.  I settled for simplicity over blame:

“We have recently been made aware of a problem with the fulfillment process in the new rollout of member service XYZ, resulting in new members receiving the wrong promotional items.  This was due to fulfillment codes being incorrectly entered into the web application.  We have corrected the problem and have provided Customer Service and Fulfillment with the list of affected members so that they can ship the appropriate items immediately.”

You will note that I carefully didn’t specify whose “fault” it was that the incorrect codes were inserted into the application.  I’m sure the rest of the business assumed it was my team’s fault.  I’m sure of this because the Product Manager in charge of the campaign called me less than fifteen minutes after I sent out the email and literally screamed at me– I was holding the phone away from my ear– that he knew it wasn’t our fault (he’d seen all the email traffic during the investigation) and how could I let the rest of the company assume we were at fault?

And I told him what I’m going to tell you now: nobody else cared whose fault it was.  Fulfillment was grateful that I’d jumped on this particular hand grenade and saved them from the shrapnel.  My management was impressed that we’d resolved the problem less than two hours after the initial reports and had further produced a list of the affected members so that Customer Service could get out ahead of the problem rather than waiting for irate customers to call.  Total cost was minimal because we caught it early and addressed it promptly.

And that’s the bottom line: all that throwing blame around would have done was make people angry and lengthen our time to resolution.  Finding somebody to blame doesn’t make you feel justified or more fulfilled somehow; it just makes you tired and frustrated.  So always try to short-circuit the blame loop and move straight into making things better.

Communicating Success (and Failure)

At the end of our recent SANS webcast, Mike Poor closed by emphasizing how important it was for IT and Information Security groups to advertise their operational successes to the rest of the organization (and also to their own people).  Too often these functions are seen as pure cost centers, and in these difficult economic times it’s up to these organizations to demonstrate return value or face severe cutbacks.

The question is: what are the right metrics to publish to indicate success?  All too often I see organizations publishing meaningless metrics, or even metrics that create negative cultural incentives and damage the rest of the company’s perception of the organization:

  • It seems like a lot of IT Ops groups like to publish their “look how much stuff we operate” metrics: so many thousand machines, so many petabytes of disk, terabytes of backup data per week, etc.  The biggest problem with these metrics is that they can be used to justify massive process inefficiencies.  Maybe you have thousands of machines because every IT project buys its own hardware and you’re actually wasting money and resources that could be saved by consolidating.  Besides, nobody else in the company cares how big your… er, server farm is.
  • Then there are the dreaded help desk ticket metrics: tickets closed per week, average time to close tickets, percentage of open tickets, etc.  The only thing these metrics do is incentivize your help desk to do a slapdash job and thereby annoy your customers.  There’s only one help desk metric that matters: customer satisfaction.  If you’re not doing customer satisfaction surveys on EVERY TICKET and/or you’re not getting good results then you fail.

So what are some good metrics?  Well, I’m a Visible Ops kind of guy, so the metrics that matter to me are things like the amount of unplanned downtime (drive to zero), the number of successful changes requiring no unplanned work or firefighting (more is better), the number of unplanned or unauthorized changes (drive to zero), and the number of projects completed on time and on budget (more is better).  Of course, if your IT organization is struggling, you might be tempted NOT to publish these metrics because they show that you’re not performing well.  In these cases, accentuate the positive by publishing your improvement numbers rather than the raw data: “This month we had 33% less unplanned downtime than last month.”  This makes your organization look proactive and creates the right cultural imperatives without airing your dirty laundry.
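
To make that improvement arithmetic concrete, here’s a minimal sketch in Python (the function name and the downtime figures are mine, purely for illustration):

    def downtime_improvement(last_month_minutes, this_month_minutes):
        """Percent reduction in unplanned downtime, month over month."""
        if last_month_minutes == 0:
            return 0.0  # no downtime last month means nothing to improve on
        return 100.0 * (last_month_minutes - this_month_minutes) / last_month_minutes

    # Example: 90 minutes of unplanned downtime last month, 60 minutes this month
    pct = downtime_improvement(90, 60)
    print(f"This month we had {pct:.0f}% less unplanned downtime than last month.")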

There are a couple of other places where I never fail to toot my own horn:

  • If my organization makes life substantially better for another part of the company, then you’d better believe I’m going to advertise that fact.  For example, when my IT group put together a distributed build system that cut product compiles from over eight hours to less than one hour, it not only went into our regular status roll-ups, but I also got the head of the Release Engineering group to give us some testimonials.
  • Whenever a significant new security vulnerability comes out that is not an issue for us because of our standard builds and/or operations environment, I make sure the people who provide my budget know about it.  It also helps if you can point to “horror story” articles about the amount of money other organizations have had to pay to clean up after incidents related to the vulnerability.  This is one of the few times that Information Security can demonstrate direct value to the organization, and you must never miss out on these chances.

What’s That Smell?

If communicating your successes builds a corporate perception of your organization’s value, being transparent about your failures builds trust with the rest of the business.  If you present nothing but relentlessly positive marketing spin on your accomplishments, your “customers” elsewhere in the company will become suspicious.  Plus, you’ll never bamboozle them with your wins so thoroughly that they won’t notice the elephant in the room when you fall on your face.

The important things to communicate when you fail are that you understand what led to the failure, that you have the situational awareness to understand the impact of the failure on the business, and that you are taking steps to make sure the same failure never happens again (the only real organizational failure is allowing the same failure to happen twice).  Here’s a simple checklist of items you should have in your disclosure statement:

  • Analysis of the process(es) that led to the failure
  • The duration of the outage
  • How the outage was detected
  • The systems and services impacted
  • Which business units were impacted and in what way
  • Actions taken to end the outage
  • Corrective processes to make sure it never happens again
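
If it helps, the checklist can be captured as a simple data structure so that nothing gets dropped when the post-mortem is written up.  This is just a sketch; the field names are mine, not any kind of standard:

    from dataclasses import dataclass, field
    from typing import List

    @dataclass
    class OutageDisclosure:
        """One record per incident; fill in every field before the report goes out."""
        process_analysis: str                   # process failure(s) that led to the outage
        outage_duration: str                    # e.g. "14:10 to 16:05 Eastern, roughly 2 hours"
        detection_method: str                   # how the outage was detected
        systems_impacted: List[str] = field(default_factory=list)
        business_units_impacted: List[str] = field(default_factory=list)
        actions_taken: str = ""                 # what was done to end the outage
        corrective_processes: str = ""          # what prevents the same failure from recurring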

Note that in some cases it’s necessary to split the disclosure across 2-3 messages.  One is sent during the incident telling your constituents, “Yes, there’s a problem and we’re working it.”  The next is the “services restored at time X, more information forthcoming” message.  And then finally your complete post-mortem report.  Try to avoid partial or incomplete disclosure or idle speculation without all of the facts– you’ll almost always end up with egg on your face.

Conclusion

If you don’t communicate what’s happening in your IT and/or InfoSec organization, then the other business units are basically going to assume you’re not doing anything during the time when you’re not directly working on their requests.  This leads to the perception of IT as nothing more than “revenue-sucking pigs”.

However, you also have to communicate in the right way.  This means communicating metrics that are worthwhile and that don’t create bad cultural imperatives for your organization.  And it also means being transparent and communicating your failures, in the most proactive way possible, to the rest of the organization.

Calabrese’s Razor

I’ve long held the opinion that the community of “Information Security Experts” agrees with each other 90% of the time, but wastes 90% of its time arguing to the death with other InfoSec Experts about the remaining 10%.  This was painfully brought home to me several years ago as I was facilitating the consensus process around the Solaris Security document published by the Center for Internet Security.  You wouldn’t believe the amount of time we spent arguing about seemingly trivial things like, “Should the system respond to echo broadcast?”  And as the consensus circle widened, we wasted more and more time on these issues, repeating the same debates over and over again as new people joined the discussion.  In short, it was killing us.  People were burning out and failing to provide constructive feedback, and we were failing to deliver updates in a timely fashion.

I see these kinds of debates causing similar mayhem in the IT Ops and InfoSec groups at many organizations.  The problem is that these organizations are not simply debating the content of a document full of security recommendations; they’re arguing about matters of operational policy.  That seems to provoke even more irrational passions, and it raises the stakes for failing to come to consensus and actually move forward.

At the low point of our crisis at the Center for Internet Security, the person most responsible for finding the solution was Chris Calabrese, who was facilitating the HP-UX benchmark for the Center.  At roughly the same time as our troubles at the Center, the IT Ops and InfoSec teams at Chris’ employer had gotten bogged down over similar issues and had decided to come up with an objective metric for deciding which information security controls were important and which ones were just not worth arguing about.  Suddenly the discussion of these issues was transformed from a matter of opinion to a matter of fact.  Consensus arrived quickly and nobody’s feelings got hurt.

Overview of the Metric

So we decided to adapt the metric that Chris had used to our work at the Center.  After some discussion, we decided that the metric had to account for two major factors: how important the security control was and how much negative operational impact the security control would impose.  Each of the two primary factors was made up of other components.

For example, the factors relating to the relative importance of a security control include:

  • Impact (I): Is the attack just a denial-of-service condition, or does it allow the attacker to actually gain access to the system? Does the attack allow privileged access?
  • Radius (R): Does the attack require local access or can it be conducted in an unauthenticated fashion over the network?
  • Effectiveness (E): Does the attack work against the system’s standard configuration, or is the control in question merely a backup in case of common misconfiguration, or even just a “defense in depth” measure that only comes into play after the failure of multiple controls?

Similarly, the administrative impact of a control was assessed based on two factors:

  • Administrative Impact (A): Would the change require significant changes to current administrative practice?
  • Frequency of Impact (F): How regularly would this impact be felt by the Operations teams?

The equation for deciding which controls were important turned out to be simple: (I * R * E) - (A * F).  In other words, multiply the terms related to the importance of the control to establish a positive value, then subtract the cost due to the administrative impact of the control.

The only thing missing was the actual numbers.  It turns out a very simple weighting scheme is sufficient:

  • Impact (I): Score 1 if attack is a denial-of-service, 2 if the attack allows unprivileged access, and 3 if the attack allows administrative access (or access to an admin-equivalent account like “oracle”, etc)
  • Radius (R): Score 1 for attacks that require physical access or post-authenticated unprivileged access, and 2 for remote attacks that can be conducted by unauthenticated users
  • Effectiveness (E): Score 1 if the control requires multiple configuration failures to be relevant, 2 if the control is a standard second-order defense for common misconfiguration, and 3 if the attack would succeed against standard configurations without the control in question
  • Administrative Impact (A): Score 1 if the administrative impact is insignificant or none, 2 if the control requires modifications to existing administrative practice, and 3 if the control would completely disable standard administrative practices in some way
  • Frequency of Impact (F): Score 1 if the administrative impact is to a non-standard process or arises less than once per month, 2 if the administrative impact is to a standard but infrequent process that occurs about once per month, and 3 if the impact is to a regular or frequent administrative practice

In the case where a single control can have different levels of impact in different scenarios, what turned out best for us (and avoided the most arguments) was to simply choose the highest justifiable value for each term, even if that value was not the most common or likely impact.
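
Here’s a minimal sketch of the scoring scheme in Python.  The function name is mine; the weights and the formula are as described above:

    def razor_score(impact, radius, effectiveness, admin_impact, frequency):
        """Calabrese's Razor: (I * R * E) - (A * F).

        impact (I):        1 = denial of service, 2 = unprivileged access, 3 = admin access
        radius (R):        1 = physical or authenticated access required, 2 = remote, unauthenticated
        effectiveness (E): 1 = needs multiple configuration failures, 2 = backup for common
                           misconfiguration, 3 = attack works against the standard configuration
        admin_impact (A):  1 = insignificant, 2 = modifies existing practice, 3 = disables a practice
        frequency (F):     1 = non-standard or rare, 2 = roughly monthly, 3 = regular or frequent
        """
        return (impact * radius * effectiveness) - (admin_impact * frequency)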

Applying the Metric

Let’s run the numbers on a couple of controls and see how this works out.  First we’ll try a “motherhood and apple pie” kind of control– disabling unencrypted administrative access like telnet:

  • Impact (I): The worst-case scenario here is that an attacker hijacks an administrative session and gains control of the remote system.  That’s administrative-level access, meaning a score of 3 for this term.
  • Radius (R): Anybody on the network could potentially perform this attack, so this term is set to 2.
  • Effectiveness (E): Again you have to go with the maximal rating here, because the session hijacking threat is a standard “feature” of clear-text protocols– score 3.
  • Administrative Impact (A): Remember, we’re not discussing replacing clear-text administrative protocols with encrypted protocols at this point (justifying encrypted access is a separate conversation).  We’re discussing disabling unencrypted access, so the score here is 3 because we’re planning on completely disabling this administrative practice.
  • Frequency of Impact (F): If telnet is your regular router access scheme, then this change is going to impact you every day.  Again, the score is then 3.

So what’s the final calculation?  Easy: (3 * 2 * 3) - (3 * 3) = 9.  What does that number mean?  Before I answer that question, let’s get another point of comparison by looking at a more controversial control.

We’ll try my own personal nemesis, the dreaded question of whether the system should respond to echo broadcast packets:

  • Impact (I): Worst case scenario here ends up being a denial of service attack (e.g. “smurf” type attack), so score 1.
  • Radius (R): Depends on whether or not your gateways are configured to pass directed broadcast traffic (hint: they shouldn’t be), but let’s assume the worst case and score this one a 2.
  • Effectiveness (E): Again, being as pessimistic as possible, let’s assume no other compensating controls in the environment and score this one a 3.
  • Administrative Impact (A): The broadcast ping supporters claim that disabling broadcast pings makes it more difficult to assess claimed IP addresses on a network and capture MAC addresses from systems (the so-called “ARP shotgun” approach).  Work-arounds are available, however, so let’s score this one a 2.
  • Frequency of Impact (F): In this case, we have what essentially becomes a site-specific answer.  But let’s assume that your network admins use broadcast pings regularly and score this one a 3.

So the final answer for disabling broadcast pings is: (1 * 2 * 3) - (2 * 3) = 0.  You could quibble about some of the terms, but I doubt you’re going to be able to make a case for this one scoring any higher than a 2 or so.
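
Plugging the two worked examples into the razor_score() sketch from earlier reproduces the same numbers:

    telnet = razor_score(impact=3, radius=2, effectiveness=3, admin_impact=3, frequency=3)
    broadcast_ping = razor_score(impact=1, radius=2, effectiveness=3, admin_impact=2, frequency=3)
    print(telnet)          # (3 * 2 * 3) - (3 * 3) = 9
    print(broadcast_ping)  # (1 * 2 * 3) - (2 * 3) = 0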

Interpreting the Scores

Once we followed this process and produced scores for all of the various controls in our document, a dominant pattern emerged.  The controls that everybody agreed with had scores of 3 or better.  The obviously ineffective controls were scoring 0 or less.  That left items with scores in the 1-2 range as being “on the bubble”, and indeed many of these items were generating our most enduring arguments.

What was also clear was that it wasn’t worth arguing about the items that only came in at 1 or 2.  Most of these ended up being “second-order” type controls for issues that could be mitigated in other ways much more effectively and with much less operational impact.  So we made an organizational decision to simply ignore any items that failed to score at least 3.
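
In code terms, that decision rule amounts to a simple filter over the scored controls (again using the hypothetical razor_score() sketch from earlier; the control names and term values here are just examples):

    controls = [
        {"name": "Disable unencrypted admin access (telnet)",
         "terms": dict(impact=3, radius=2, effectiveness=3, admin_impact=3, frequency=3)},
        {"name": "Ignore echo broadcast packets",
         "terms": dict(impact=1, radius=2, effectiveness=3, admin_impact=2, frequency=3)},
    ]
    kept = [c["name"] for c in controls if razor_score(**c["terms"]) >= 3]
    print(kept)  # ['Disable unencrypted admin access (telnet)']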

As for arguments about the weighting of the individual terms, these were few and far between.  Part of that was our adoption of a “when in doubt, use the maximum justifiable value” stance, and part of it was due to choosing a simple weighting scheme that didn’t leave much room for debate.  Also, once you start plugging in the numbers, it becomes obvious that arguing over a one-point change in a single term usually isn’t enough to counteract the other factors and push a given control up to the overall qualifying score of 3.

Further Conclusions

What was also interesting about this process is that it gave us an objective measure for challenging the “conventional wisdom” about various security controls.  It’s one thing to say, “We should always do control X”, and quite another to have to plug numbers for the various terms related to “control X” into a spreadsheet.  It quickly becomes obvious when a control has minimal security impact in the real world.

This metric also channelled our discussion into much more productive and much less emotional avenues.  Even the relatively coarse granularity of our instrument was sufficient to break our “squishy” matters of personal opinion into discrete, measurable chunks.  And once you get engineers talking numbers, you know a solution is going to emerge eventually.

So when your organization finds itself in endless, time-wasting discussions regarding operational controls, try applying Chris’ little metric and see if you don’t rapidly approach something resembling clarity.  Your peers will thank you for injecting a little sanity into the proceedings.

Chris Calabrese passed away a little more than a year ago from a sudden and massive heart attack, leaving behind a wife and children.  His insight and quiet leadership are missed by all who knew him.  While Chris developed this metric in concert with his co-workers and later with the input of the participants in the Center for Internet Security’s consensus process, I have chosen to name the metric “Calabrese’s Razor” in his memory.

Good show!

Gene Kim, Kevin Behr, and I have been having some fun on Twitter (where we’re @RealGeneKim, @kevinbehr, and @hal_pomeranz respectively) with the #itfail thread.  Every time we come across an example of poor IT practice, one of us throws out a tweet.  It’s driven some entertaining discussions, and generally been a good way to blow off steam.

But I also think folks with good IT practices should be recognized in some way.  So I just wanted to send out this quick post to recognize the proactive IT process demonstrated by our hosts here at WordPress.com.  When I logged into my dashboard today, I found the following message waiting for me:

We will be making some code changes in about 42 hours which will log you out of your WordPress.com account. They should only take a few seconds and you should be able to log in afterwards without any problems. (more info)

It seems like such a simple thing, but to me it demonstrates some excellent IT process competencies:

  • Planned changes: they know well in advance when change is occurring and how long they expect it to take
  • Situational awareness: they know who is going to be impacted and in what fashion
  • Communication: they have a mechanism for alerting the affected parties in a reasonable time frame
  • Transparency: they’re willing to alert their user community that they will be inconvenienced, rather than just letting it happen and hoping nobody notices

While these all may seem like trivial or “of course” items to many of you reading this blog, let me tell you that many of the IT shops that I visit as part of my consulting practice regularly #itfail in some or all of the above areas.

So kudos to the folks at WordPress!  Good show!

Queue Inversion Week

Reliving the last story from my days at the mid-90’s Internet skunkworks reminded me of another bit of tactical IT advice I learned on that job, one that has since become a proven strategy I’ve used on other engagements.  I call it “Queue Inversion Week”.

One aspect of our operations religion at the skunkworks was, “All work must be ticketed” (there’s another blog post behind that mantra, which I’ll get to at some point).  We lived and died by our trouble-ticketing system, and ticket priority values generally drove the order of our work-flow in the group.

The problem that often arises for organizations in this situation, however, is what I refer to as the “tyranny of the queue”.  Everybody on the team is legitimately working on the highest-priority items.  However, due to limited resources in the Operations group, lower-priority items tend to collect at the bottom of the queue and never rise to the level of severity that would get them attention.  The users who submitted these low-priority tickets tend to be very understanding (at least they were at the skunkworks) and would wait weeks or months for somebody in my group to get around to resolving their minor issues.  I suspect that during those weeks or months the organization was actually losing a noticeable amount of worker productivity to these “minor” issues, but we never quantified how much.

What did finally penetrate was a growing rumble of unhappiness from our internal customers.  “We realize you guys are working on bigger issues,” they’d tell me in staff meetings, “but after a few months even a minor issue becomes really irritating to the person affected.”  The logic was undeniable.

I took the feedback back to my team and we started kicking around ideas.  One solution that had a lot of support was to simply include time as a factor in the priority of the item: after the ticket had sat in the queue for some period of time, the ticket would automatically be bumped up one priority level.  The problem is that when we started modeling the idea, we realized it wouldn’t work.  All of the “noise” from the bottom of the queue would eventually get promoted to the point where it would be interfering with critical work.

Then my guy Josh Smift, who basically “owned” the trouble ticketing system as far as customization and updates was concerned, had the critical insight: let’s just “invert” the queue for a week.  In other words, the entire Operations crew would simply work items from the bottom of the queue for a week rather than the top.  It was simple and it was brilliant.
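
A minimal sketch of the idea, assuming tickets carry a numeric priority where a higher number means more urgent (the ticket fields here are hypothetical, not from any particular ticketing system):

    def work_order(tickets, inversion_week=False):
        """Return tickets in the order the team should work them."""
        # Normal weeks: highest priority first.  Queue Inversion Week: flip the sort.
        return sorted(tickets, key=lambda t: t["priority"], reverse=not inversion_week)

    tickets = [
        {"id": 101, "priority": 5, "summary": "Production database replication lag"},
        {"id": 102, "priority": 1, "summary": "Dead pixel on a spare monitor"},
        {"id": 103, "priority": 2, "summary": "Printer driver reinstall"},
    ]
    print([t["id"] for t in work_order(tickets)])                       # [101, 103, 102]
    print([t["id"] for t in work_order(tickets, inversion_week=True)])  # [102, 103, 101]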

So we looked at the project schedule and identified what looked like a “slack” week and declared it to be “Queue Inversion Week.”  We notified our user community and encouraged them to submit tickets for any minor annoyances that they’d been reluctant to bring up for whatever reason.

To say that “Queue Inversion Week” was a raging success would be putting it mildly.  Frankly, all I wanted out of the week was to clear our ticket backlog and get our customers off our backs, but the whole experience was a revelation.  First, the morale of my Operations team went through the roof.  Analyzing the reasons why, I came to several conclusions:

  • It got my folks out among the user community and back in touch with the rest of the company, rather than being locked up in the data center all day long.  The people who my folks helped were grateful and expressed that to my team, which makes a nice change from the usual, mostly negative feedback IT Ops people tend to get.
  • The tickets from the bottom of the queue generally required only the simplest tactical resolutions.  Each member of my team could resolve dozens of these items during the week (in fact, a friendly competition arose to see who could close the most tickets), and feel terrific afterwards because there was so much concrete good stuff they could see that they’d done.
  • Regardless of what outsiders think, I believe most people in IT Operations really want to help the people who are their customers.  It’s depressing to know that there are items languishing at the bottom of the queue that will never get worked on.  This week gave my team an excuse to work on these issues.

I think I can reasonably claim that Queue Inversion Week also had a noticeable impact on the morale of the skunkworks as a whole.  After all, many of the annoying problems that our users had simply been working around were now gone.  Like a good spring cleaning, everybody could breathe a little easier and enjoy the extra sunshine coming through the newly cleaned windows.

We repeated Queue Inversion Week periodically during my tenure at the skunkworks, and every time it was a positive experience that everybody looked forward to and got a great deal out of.  You can’t necessarily put it on a rigid schedule, because other operational priorities interfere.  But any time it looks like you have a little “slack” coming up in the project schedule and the bottom of your queue is full of little annoying tasks, consider declaring your own “Queue Inversion Week” and see if it doesn’t do you and your organization a world of good.

How I Learned to Start Loving Implementation Plans

Lately I’ve been seeing multiple articles pop up arguing for the importance of plans and checklists in many walks of life, but especially in IT process.  I couldn’t agree more.  In fact, when an IT person believes something as strongly as I believe in the power of implementation plans and checklists, you know there must be a story involved.  Let me tell you mine.

Back in the mid-90’s my good friend Jim Hickstein and I were working for the Internet skunkworks of a huge East Coast direct mail marketing company.  At the time I think I was the Operations Manager and Jim was my tech lead in the Sys Admin group, but since we were a small operation both Jim and I were still deeply involved in the tactical implementation and roll-out of new technologies.

In particular, there was one instance where Jim and I were scheduled to roll out a new storage array into our production ecommerce infrastructure.  Our production environment was hosted at company HQ down in Connecticut (our offices were in the Boston area), and in some sense this activity was part of our ongoing evangelism for new “open” technology in what had been largely a mainframe shop until our arrival.  We wanted to show that we were better, faster, cheaper, and just as reliable as the platforms they were used to.

I remember talking to Jim and suggesting that we come up with an implementation plan prior to the installation activity.  But clearly neither one of us had gotten the religion yet, because after talking about it for a while we decided that we “knew what we were doing” (despite never having deployed a unit of this type before) and didn’t need a plan.  So with our figurative cowboy hats firmly in place, we had the equipment shipped and a couple of weeks later we hopped in Jim’s car and followed it down to wildest Connecticut.

Upon arrival, we met up with a couple of the IT folks from HQ who were curious about what we were doing and wanted the chance to observe how we operated.  That was cool with us, because we wanted to evangelize as much as possible, remember?  Off we all went to the data center, unpacked and inspected the equipment, and generally killed time waiting for our three-hour outage window to commence.  Once the outage window arrived, Jim and I began our work.

Crisis set in almost immediately.  I can’t even remember at this point what started going wrong, but I think almost every person who’s done production Operations recognizes that “bottom drops out” feeling you get in your stomach when you suddenly realize what you thought was routine work has gone horribly awry.  As I recall, I was working the back of the machine hooking up the storage array and Jim was on the system console and I clearly remember staring across the rack at Jim and seeing the same horrified look reflected in his eyes.

I don’t remember much of the next two and a half hours.  I know we didn’t panic, but started working through the problems methodically.  And we did end up regaining positive control and actually got the work completed– just barely– within our outage window.  But frankly we had forgotten anything else in the world existed besides our troubled system.

When the crisis was resolved and we returned to something resembling the real world, we remembered that our actions had been scrutinized the whole time by our colleagues from HQ.  I turned to acknowledge their presence, expecting to see (deservedly) self-satisfied smiles from watching us nearly fall on our faces in a very public fashion.  Well, boy was I surprised: they were actually standing there in slack-jawed amazement.  “That was the most incredible thing I’ve ever seen!  You guys are awesome!” one of them said to me.

What could we do?  Jim and I thanked them very much for their kind words, used some humor to blow off some of our excess adrenaline, and extracted ourselves from the situation as gracefully as possible.  After leaving the data center, Jim and I walked silently back to his car, climbed in, and just sat there quietly for a minute.  Finally, Jim turned to me and said, “Let’s never do that again.”

And that was it.  From that point on our operations policy playbook– enforced religiously by both Jim and myself– included the requirement that all planned work include a detailed implementation and rollback plan that must be reviewed by at least one other member of the operations staff before any work would be approved.  And to those who are thinking that this policy must have slowed down our ability to roll out changes in our production infrastructure, you couldn’t be more wrong.  In fact, our ability to make changes improved dramatically because our percentage of “successful” change activity (defined as not requiring any unplanned work or firefighting as a result of the change) got to better than 99%.  We simply weren’t wasting time on unplanned work, so as a consequence we had more time to make effective changes to our infrastructure.

That’s it.  Something as humble as a checklist can make your life immeasurably better.  But you have to be willing to admit that you need the help of a checklist in the first place, which cuts against the “expert audacity” that so many of us in IT feed on.  It’s a tough change to make in oneself.  I only hope you won’t have to learn this lesson the hard way, like Jim and I did.