Calabrese’s Razor

February 26, 2009

Hal Pomeranz, Deer Run Associates

I’ve long held the opinion that “Information Security Experts” agree with each other 90% of the time, but waste 90% of their time arguing to the death with other InfoSec Experts about the remaining 10%.  This was painfully brought home to me several years ago as I was facilitating the consensus process around the Solaris Security document published by the Center for Internet Security.  You won’t believe the amount of time we spent arguing about seemingly trivial things like, “Should the system respond to echo broadcast?”  And as the consensus circle widened, we ended up wasting more time on these issues and repeating the same debates over and over as new people joined the discussion.  In short, it was killing us.  People were burning out and failing to provide constructive feedback, and we were failing to deliver updates in a timely fashion.

I see these kinds of debates causing similar mayhem in the IT Ops and InfoSec groups at many organizations.  The problem is that in these cases the organizations are not simply debating the content of a document full of security recommendations, they’re arguing about matters of operational policy.  This seems to promote even more irrational passions, and it also raises the stakes for failing to come to consensus and actually move forward.

At the low point of our crisis at the Center for Internet Security, the person who was most responsible for finding the solution was Chris Calabrese, who was facilitating the HP-UX benchmark for the Center. At roughly the same time as our issues at the Center, the IT Ops and InfoSec teams at Chris’ employer had gotten bogged down over similar kinds of issues and had decided to come up with an objective metric for deciding which information security controls were important and which ones were just not worth arguing about.  Suddenly the discussion of these issues was transformed from matters of opinion to matters of fact.  Consensus arrived quickly and nobody’s feelings got hurt.

Overview of the Metric

So we decided to adapt the metric that Chris had used to our work at the Center.  After some discussion, we decided that the metric had to account for two major factors: how important the security control was and how much negative operational impact the security control would impose.  Each of the two primary factors was made up of other components.

For example, the factors relating to the relative importance of a security control include:

  • Impact (I): Is the attack just a denial-of-service condition, or does it allow the attacker to actually gain access to the system? Does the attack allow privileged access?
  • Radius (R): Does the attack require local access or can it be conducted in an unauthenticated fashion over the network?
  • Effectiveness (E): Does the attack work against the system’s standard configuration, or is the control in question merely a backup in case of common misconfiguration, or even just a “defense in depth” measure that only comes into play after the failure of multiple controls?

Similarly, the administrative impact of a control was assessed based on two factors:

  • Administrative Impact (A): Would the change require significant changes to current administrative practice?
  • Frequency of Impact (F): How regularly would this impact be felt by the Operations teams?

The equation for deciding which controls were important ended up being simply “(I * R * E) – (A * F)”.  In other words, multiply the terms related to the importance of the control to establish a positive value, then subtract the costs due to the administrative impact of the control.

The only thing missing was the actual numbers.  It turns out a very simple weighting scheme is sufficient:

  • Impact (I): Score 1 if attack is a denial-of-service, 2 if the attack allows unprivileged access, and 3 if the attack allows administrative access (or access to an admin-equivalent account like “oracle”, etc)
  • Radius (R): Score 1 for attacks that require physical access or post-authenticated unprivileged access, and 2 for remote attacks that can be conducted by unauthenticated users
  • Effectiveness (E): Score 1 if the control requires multiple configuration failures to be relevant, 2 if the control is a standard second-order defense for common misconfiguration, and 3 if the attack would succeed against standard configurations without the control in question
  • Administrative Impact (A): Score 1 if the administrative impact is insignificant or none, 2 if the control requires modifications to existing administrative practice, and 3 if the control would completely disable standard administrative practices in some way
  • Frequency of Impact (F): Score 1 if the administrative impact is to a non-standard process or arises less than once per month, 2 if the administrative impact is to a standard but infrequent process that occurs about once per month, and 3 if the impact is to a regular or frequent administrative practice

In the case where a single control can have different levels of impact in different scenarios, what turned out best for us (and avoided the most arguments) was to simply choose the highest justifiable value for each term, even if that value was not the most common or likely impact.
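
If it helps to see the scheme written down as something executable, here is a minimal sketch of the scoring function in Python.  The function name and the sanity checks are arbitrary; the weights and the formula are exactly the ones described above.

    def razor_score(impact, radius, effectiveness, admin_impact, frequency):
        """Calabrese's Razor: (I * R * E) - (A * F).

        impact (I):        1 = denial-of-service, 2 = unprivileged access,
                           3 = administrative (or admin-equivalent) access
        radius (R):        1 = physical or authenticated local access required,
                           2 = remote and unauthenticated
        effectiveness (E): 1 = needs multiple configuration failures,
                           2 = backup for common misconfiguration,
                           3 = attack works against standard configurations
        admin_impact (A):  1 = insignificant, 2 = modifies existing practice,
                           3 = disables a standard administrative practice
        frequency (F):     1 = non-standard or less than monthly,
                           2 = about monthly, 3 = regular or frequent practice
        """
        if radius not in (1, 2):
            raise ValueError("radius is scored 1 or 2")
        for term in (impact, effectiveness, admin_impact, frequency):
            if term not in (1, 2, 3):
                raise ValueError("each remaining term is scored 1, 2, or 3")
        return (impact * radius * effectiveness) - (admin_impact * frequency)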

Applying the Metric

Let’s run the numbers on a couple of controls and see how this works out.  First we’ll try a “motherhood and apple pie” kind of control– disabling unencrypted administrative access like telnet:

  • Impact (I): Worst case scenario here is that an attacker hijacks an administrative session and gains control of the remote system.  So that’s administrative-level access, meaning a score of 3 for this term.
  • Radius (R): Anybody on the network could potentially perform this attack, so this term is set to 2.
  • Effectiveness (E): Again you have to go with the maximal rating here, because the session hijacking threat is a standard “feature” of clear-text protocols– score 3.
  • Administrative Impact (A): Remember, we’re not discussing replacing clear-text administrative protocols with encrypted protocols at this point (justifying encrypted access is a separate conversation).  We’re discussing disabling unencrypted access, so the score here is 3 because we’re planning on completely disabling this administrative practice.
  • Frequency of Impact (F): If telnet is your regular router access scheme, then this change is going to impact you every day.  Again, the score is then 3.

So what’s the final calculation?  Easy: (3 * 2 * 3) – (3 * 3) = 9.  What’s that number mean?  Before I answer that question, let’s get another point of comparison by looking at a more controversial control.

We’ll try my own personal nemesis, the dreaded question of whether the system should respond to echo broadcast packets:

  • Impact (I): Worst case scenario here ends up being a denial of service attack (e.g. “smurf” type attack), so score 1.
  • Radius (R): Depends on whether or not your gateways are configured to pass directed broadcast traffic (hint: they shouldn’t be), but let’s assume the worst case and score this one a 2.
  • Effectiveness (E): Again, being as pessimistic as possible, let’s assume no other compensating controls in the environment and score this one a 3.
  • Administrative Impact (A): The broadcast ping supporters claim that disabling broadcast pings makes it more difficult to assess claimed IP addresses on a network and capture MAC addresses from systems (the so-called “ARP shotgun” approach).  Work-arounds are available, however, so let’s score this one a 2.
  • Frequency of Impact (F): In this case, we have what essentially becomes a site-specific answer.  But let’s assume that your network admins use broadcast pings regularly and score this one a 3.

So the final answer for disabling broadcast pings is: (1 * 2 * 3) – (2 * 3) = 0.  You could quibble about some of the terms, but I doubt you’re going to be able to make a case for this one scoring any higher than a 2 or so.
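
Just to show the arithmetic end to end, here is the same sketch applied to both of the examples above; it reproduces the 9 and the 0:

    # Plugging in the two worked examples:
    telnet = razor_score(impact=3, radius=2, effectiveness=3,
                         admin_impact=3, frequency=3)
    broadcast_ping = razor_score(impact=1, radius=2, effectiveness=3,
                                 admin_impact=2, frequency=3)

    print(telnet)          # (3 * 2 * 3) - (3 * 3) = 9
    print(broadcast_ping)  # (1 * 2 * 3) - (2 * 3) = 0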

Interpreting the Scores

Once we followed this process and produced scores for all of the various controls in our document, a dominant pattern emerged.  The controls that everybody agreed with had scores of 3 or better.  The obviously ineffective controls were scoring 0 or less.  That left items with scores in the 1-2 range as being “on the bubble”, and indeed many of these items were generating our most enduring arguments.

What was also clear was that it wasn’t worth arguing about the items that only came in at 1 or 2.  Most of these ended up being “second-order” type controls for issues that could be mitigated in other ways much more effectively and with much less operational impact.  So we made an organizational decision to simply ignore any items that failed to score at least 3.

As for arguments about the weighting of the individual terms, these tended to be few and far between.  Part of this was our adoption of a “when in doubt, use the maximum justifiable value” stance, and part of it was due to choosing a simple weighting scheme that didn’t leave much room for debate.  Also, once you start plugging the numbers in, it’s obvious that arguing over a one-point change in a single term usually isn’t enough to counteract the other factors and move a given control past the overall qualifying score of 3.

Further Conclusions

What was also interesting about this process is that it gave us an objective measure for challenging the “conventional wisdom” about various security controls.  It’s one thing to say, “We should always do control X”, and quite another to have to plug numbers for the various terms related to “control X” into a spreadsheet.  It quickly becomes obvious when a control has minimal security impact in the real world.

This metric also channelled our discussion into much more productive and much less emotional avenues.  Even the relatively coarse granularity of our instrument was sufficient to break our “squishy” matters of personal opinion into discrete, measurable chunks.  And once you get engineers talking numbers, you know a solution is going to emerge eventually.

So when your organization finds itself in endless, time-wasting discussions regarding operational controls, try applying Chris’ little metric and see if you don’t rapidly approach something resembling clarity.  Your peers will thank you for injecting a little sanity into the proceedings.

Chris Calabrese passed away a little more than a year ago from a sudden and massive heart attack, leaving behind a wife and children.  His insight and quiet leadership are missed by all who knew him.  While Chris developed this metric in concert with his co-workers and later with the input of the participants in the Center for Internet Security’s consensus process, I have chosen to name the metric “Calabrese’s Razor” in his memory.

Hal Pomeranz, Deer Run Associates

Recently my pal Bill Schell and I were gassing on about the current and future state of IT employment, and he brought up the topic of IT jobs being “lost to the Cloud”.  In other words, if we’re to believe in the marketing hype of the Cloud Computing revolution, a great deal of processing is going to move out of the direct control of the individual organizations where it is currently being done.  One would expect IT jobs within those organizations that had previously been supporting that processing to disappear, or at least migrate over to the providers of the Cloud Computing resources.

I commented that the whole Cloud Computing story felt just like another turn in the epic cycle between centralized and decentralized computing.  He and I had both lived through the end of the mainframe era, into “Open Systems” on user desktops, back into centralized computing with X terminals and other “thin clients”, back out onto the desktops again with the rise of extremely powerful, extremely low cost commodity hardware, and now we’re harnessing that commodity hardware into giant centralized clusters that we’re calling “Clouds”.  It’s amazingly painful for the people whose jobs and lives are dislocated by these geologic shifts in computing practice, but the wheel keeps turning.

Bill brought up an economic argument for centralized computing, one that seems to crop up every time the wheel turns back in that direction.  Essentially the argument is summarized as follows:

  • As the capital cost of computing power declines, support costs tend to predominate.
  • Centralized support costs less than decentralized support.
  • Therefore centralized computing models will ultimately win out.

If you believe this argument, by now we should have all embraced a centralized computing model.  Yet instead we’ve seen this cycle between centralized and decentralized computing.  What’s driving the cycle?  It seems to me that there are other factors that work in opposition and keep the wheel turning.

First, it’s generally been a truism that centralized computing power costs more than decentralized computing.  In other words, it’s more expensive to hook 64 processors and 128GB of RAM onto the same backplane than it is to purchase 64 uniprocessor machines each with 2GB of RAM.  The Cloud Computing enthusiasts are promising to crack that problem by “loosely coupling” racks of inexpensive machines into a massive computing array. Though when “loose” is defined as Infiniband switch fabrics and the like, you’ll forgive me if I suspect they may be playing a little Three Card Monte with the numbers on the cost spreadsheets.  The other issue to point out here is that if your “centralized” computing model is really just a rack of “decentralized” servers, you’re giving up some of the savings in support costs that the centralized computing model was supposed to provide.

Another issue that rises to the fore when you move to a centralized computing model is the cost to the organization of maintaining its access to the centralized computing resource.  One obvious cost area is basic “plumbing” like network access– how much is it going to cost you to get all the bandwidth you need (in both directions) at appropriately low latency?  Similarly, when your compute power is decentralized it’s easier to hide environmental costs like power and cooling, as opposed to when all of those machines are racked up together in the same room.  However, a less obvious cost is the cost of keeping the centralized computing resource up and available all the time, because now, with all of your “eggs in one basket” as it were, your entire business can be impacted by the same outage.  “Five-nines” uptime is really, really expensive.  Back when your eggs were spread out across multiple baskets, you didn’t necessarily care as much about the uptime of any single basket, and the aggregate cost of keeping all the baskets available when needed was lower.

The centralized vs. decentralized cycle keeps turning because in any given computing epoch the costs of all of the above factors rise and fall.  This leads IT folks to optimize one factor over another, which promotes shifts in computing strategy, and the wheel turns again.

Despite what the marketeers would have you believe, I don’t think the Cloud Computing model has proven itself to the point where there is a massive impact on the way mainstream business is doing IT.  This may happen, but then again it may not.  The IT job loss we’re seeing now has a lot more to do with the general problems in the world-wide economy than jobs being “lost to the Cloud”.  But it’s worth remembering that massive changes in computing practice do happen on a regular basis, and IT workers need to be able to read the cycles and position themselves appropriately in the job market.

Making Mentoring a Priority

February 19, 2009

Hal Pomeranz, Deer Run Associates

I always appreciate (and am in search of) tips for how to be a better sysadmin. I’ve never had the opportunity … to be in a large IT org. I think I miss out on a lot of learning opportunities by not being a part of a large IT org.

from a comment by “Joe” to “Queue Inversion Week”

This comment reflects an industry trend that I’ve been worrying about for a while now.  Back in the 80’s when I was first learning to do IT Operations, it seemed like there were more opportunities to come up as a junior member of a larger IT organization and be mentored by the more senior members of the team.  It’s not overstating the case to say that I wouldn’t appear to be the “expert” that I seem to be today without liberal application of the “clue bat” by those former co-workers (and thanks to all of you– some of you don’t even know how much you helped me).

These days, however, it seems like there are a lot more “one person shops” in the IT world.  And a lot of IT workers are learning in a less structured way on their own– either on the job, or by fooling around with systems at home.  When they get stuck, their only fallback may be Google.  This has to lead to some less-than-optimal solutions and a lot of frustration and burn-out.

So if you’re a one person shop and you’re feeling the lack of mentoring, let me give you some suggestions for finding a support network.

Local User Groups

See if you can find a user group in your area.  Aside from the fact that most local groups sponsor informative talks, they’re also a good way to “network” with other IT folks in your area.  These are people you can call on when you get stuck on a problem.  There’s also the pure “group therapy” aspect of being able to be in a room with people who are living with the same day-to-day problems that you are and understand your language without need of Star Trek technology translation devices.

Google can help you find groups in your area.  Both SAGE and LOPSA also track local IT groups that are affiliated with those organizations.

If you can’t find an existing local group in your area, you might consider starting one.  I’ve found LinkedIn to be helpful for finding other IT people in my geographic area and contacting them.

Mailing Lists and Internet Forums

I subscribe to several IT-related mailing lists with world-wide memberships.  Some of the most active and useful mailing lists for getting questions answered seem to be the SAGE, LOPSA, and GIAC mailing lists, though there are membership costs and/or conference fees associated with getting access to these lists.  Also, there’s nothing that says you can’t subscribe to the mailing lists for various local user groups, even if you’re not actually close enough to attend their meetings.

There are of course different Internet forums where you can post questions and where you might actually get questions answered occasionally.  I haven’t done an exhaustive survey here, but I have found good Linux advice at the Ubuntu Forums and LinuxQuestions.org.  If you have favorites, you might mention them in the comments section.

Live Mentoring

This one is scary for most people, but you might consider contacting somebody who you think is an “expert” and asking them out to coffee/beer/lunch/dinner.  If they’re too busy, they’ll tell you.  But if you don’t ask you’ll never know, and you might be missing out on a great opportunity.

You must understand that my expectation is that if somebody helps you in this way, you are morally obligated to help someone else in a similar fashion in the future.  This is why I think you’ll find that most “experts” worth their salt are more than willing to extend this courtesy to you– somebody in their past provided them with guidance, and they’re just “paying back” by helping you.

Teaching Others

If my last idea was scary, this one will probably make you want to hide under a rock.  But teaching others is a great way to motivate yourself to learn.  I find that I don’t really master a subject until I have to organize my thoughts well enough to convey it to others.

Can’t locate anybody nearby to teach?  Start a blog and write down your expertise for others to read.  Answer questions for some of the users on the Internet forums mentioned above.  Submit articles to technical journals (as the former Technical Editor for Sys Admin Magazine, I can attest that most of these publications are absolutely desperate for content)– some of them even pay money for articles.

If you’ve taken a SANS course and obtained your GIAC certification, you may be eligible to become a SANS mentor.  This can be an entrée into becoming a SANS Instructor, and is therefore well worth pursuing.

In Conclusion

It’s unfortunate that there are so many folks out there without the built-in support network of working in a large IT organization.  But if you search diligently, I think you may be able to find some other people in your area to network with and get guidance from.  Remember that we all have different levels of expertise in different areas, so sometimes you’re the apprentice and sometimes you’re the “expert” (I’m constantly learning things from my students– yet another reason to teach others).

For the Senior IT folks who are reading this blog, I ask you to please make it a priority to reach out to the more junior members of our profession and help bring them along.  Somebody did it for you, and now it’s your turn.

Hal Pomeranz, Deer Run Associates

This is the second in a series of posts summarizing some of the information I presented in the recent SANS roundtable webcast that I participated in with fellow SANS instructors Ed Skoudis and Mike Poor.  Hopefully this written summary will be of use to those of you who lack the time to sit down and listen to the webcast in its entirety.

During the webcast we were discussing the recent distributed denial-of-service (DDoS) attacks against the Immunity, Metasploit, Milw0rm, and Packet Storm web sites.  H.D. Moore has been posting brief updates about the attacks on his blog, and there are some interesting lessons to be extracted from his commentary.  It’s also good reading for H.D.’s scathing commentary on the quality of the attack.

To summarize, the attacks began around 9pm CST on Friday 2/6. The attack was a SYN-only connection attack on H.D.’s web servers on 80/tcp.  H.D. observed peak traffic loads of approximately 80,000 connections per second.

When faced with a DDoS like this, one of the first questions is whether the source IPs on the traffic are real or spoofed.  H.D. was able to observe his DNS server-related logs and see that about 95% of the source addresses were periodically resolving metasploit.com.  This tends to indicate that the attacker was not bothering to spoof source IP addresses in the traffic.
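
If you ever need to make the same determination yourself, the correlation H.D. describes boils down to asking how many of the attacking addresses also show up querying your domain in your DNS logs.  Here is a rough Python sketch of that check; the file names and the log format are hypothetical (the parsing assumes something roughly like BIND query logging), so adapt it to whatever your capture tools and name server actually produce.

    import re

    # One attacking source address per line, e.g. pulled out of a packet capture.
    with open("syn_sources.txt") as f:
        attack_ips = {line.strip() for line in f if line.strip()}

    # DNS query log lines assumed to contain "client <ip>#<port>" followed by
    # the name being queried, roughly in the style of BIND query logging.
    client_re = re.compile(r"client (\d{1,3}(?:\.\d{1,3}){3})#\d+")
    resolver_clients = set()
    with open("dns_query.log") as f:
        for line in f:
            if "metasploit.com" not in line:
                continue
            match = client_re.search(line)
            if match:
                resolver_clients.add(match.group(1))

    overlap = attack_ips & resolver_clients
    if attack_ips:
        pct = 100.0 * len(overlap) / len(attack_ips)
        print(f"{pct:.1f}% of the attacking addresses also resolved the domain")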

This observation also led to the first corrective action to thwart the DDoS: H.D. simply changed the A record for metasploit.com to 127.0.0.1.  The downside to this action is that now legitimate users couldn’t find metasploit.com, and in particular it broke the standard update mechanism for the Metasploit Framework.  However, since the attack was only targeting the main metasploit.com web site, H.D. was able to move his other services (blog.metasploit.com, etc) to an alternate IP address and keep operating for the most part.  He was also able to use various “Social Media” outlets (like Twitter, his blog, etc) to let people know about the changes and help them find the relocated services (I can hear the Web 2.0 press organs hooting now, “Twitter Foils DDoS Attack!”– *blech*).

Actually, H.D. is now in sort of an odd situation.  The attackers have hit him with a fire hose of data, but effectively left him controlling the direction of the nozzle.  If H.D. were less ethical than he is, he simply could have changed his A record to point at a routable IP address and used the attack to DDoS somebody else.  It’s unlikely that the folks at the re-targeted site would even have realized why this traffic was suddenly being directed at them, since the attacker wasn’t bothering to send actual HTTP transactions with information about the original target site.

In any event, the attack finally abated at about 9pm CST on Sunday, 2/8.  However, it restarted Monday morning (read H.D.’s comment regarding the timing of the attack– priceless).  This time, the attacker appeared to be using the hard-coded IP addresses of both the original metasploit.com web site and the alternate IP that H.D. had relocated to during the initial attack.  So changing the A record would no longer be effective at stopping the attack.

However, the attackers were still only targeting port 80/tcp.  All H.D. had to do was shut down the web servers on 80/tcp and restart them on 8000/tcp (again notifying his normal users via Twitter, etc).  H.D. notes that his SSL traffic on 443/tcp was completely unaffected by the DDoS, which again was a pretty poor showing by the attackers.

And so the attack continues.  By Tuesday, the SYN traffic has reached 15 Mbps, and H.D.’s ISP is starting to get worried.  However, by this point the DDoS sources are apparently back to using DNS resolution, because H.D. notes he’s pointing the metasploit.com A record back to 127.0.0.1 and moving his services to metasploit.org.  Of course, at some point the attackers are going to target H.D. at his new domain.  Maybe H.D. should consider deploying the Metasploit web presence into a fast-flux network to continue evading the DDoS weasels?

In any event, I think the countermeasures that H.D. took against the DDoS (changing the A record, changing IPs, changing ports) are all good stop-gap approaches to consider if you’re ever targeted by a DDoS attack like this.  I also commend H.D. for his use of “social media” as an out-of-band communications mechanism with his regular constituents.  And finally, the transparency of blogging about the attack to inform the rest of the community is certainly helpful to others.  Good show, H.D.!

Hal Pomeranz, Deer Run Associates

Ed Skoudis and Mike Poor were kind enough to invite me to sit in on their recent SANS webcast round-table about emerging security threats.  During the webcast I was discussing some emerging attack trends against the Linux kernel, which I thought I would also jot down here for those of you who don’t have time to sit down and listen to the webcast recording.

Over the last several months, I’ve been observing a noticeable uptick in the number of denial-of-service (DoS) conditions reported in the Linux kernel.  What that says to me is that there are groups out there who are scrutinizing the Linux kernel source code looking for vulnerabilities.  Frankly, I doubt they’re after DoS attacks– it’s much more interesting to find an exploit that gives you control of computing resources rather than one that lets you take them away from other people.

Usually when people go looking for vulnerabilities in an OS kernel they’re looking for privilege escalation attacks.  The kernel is often the easiest way to get elevated privileges on the system.  Indeed, in the past few weeks there have been a couple [1] [2] of fixes for local privilege escalation vulnerabilities checked into the Linux kernel code.  So not only are these types of vulnerabilities being sought after, they’re being found (and probably used).

Now “local privilege escalation” means that the attacker has already found their way into the system as an unprivileged user.  Which raises the question, how are the attackers achieving their first goal of unprivileged access?  Well certainly there are enough insecure web apps running on Linux systems for attackers to have a field day.  But as I was pondering possible attack vectors, I had an uglier thought.

A lot of the public Cloud Computing providers make virtualized Linux images available to their customers.  The Cloud providers have to allow essentially unlimited open access to their services to anybody who wants it– this is, after all, their entire business model.  So in this scenario, the attacker doesn’t need an exploit to get unprivileged access to a Unix system: they get it as part of the Terms of Service.

What worries me is attackers pairing their local privilege escalation exploits with some sort of “virtualization escape” exploit, giving them hypervisor-level access to the Cloud provider’s infrastructure.  That’s a nightmare scenario, because now the attacker potentially has access to other customers’ jobs running in that computing infrastructure in a way that will likely be largely undetectable by those customers.

Now please don’t mistake me.  As far as we know, this scenario has not occurred.  Furthermore, I’m willing to believe that the Cloud providers supply generally higher levels of security than many of their customers could do on their own (the Cloud providers having the resources to get the “pick of the litter” when it comes to security expertise).  At the same time, the scenario I paint above has got to be an attractive one for attackers, and it’s possible we’re seeing the precursor traces of an effort to mount such an attack in the future.

So to all of you playing around in the Clouds I say, “Watch the skies!”

Good show!

February 13, 2009

Hal Pomeranz, Deer Run Associates

Gene Kim, Kevin Behr, and I have been having some fun on Twitter (where we’re @RealGeneKim, @kevinbehr, and @hal_pomeranz respectively) with the #itfail thread.  Every time we come across an example of poor IT practice, one of us throws out a tweet.  It’s driven some entertaining discussions, and generally been a good way to blow off steam.

But I also think folks with good IT practices should be recognized in some way.  So I just wanted to send out this quick post to recognize the proactive IT process demonstrated by our hosts here at WordPress.com.  When I logged into my dashboard today, I found the following message waiting for me:

We will be making some code changes in about 42 hours which will log you out of your WordPress.com account. They should only take a few seconds and you should be able to log in afterwards without any problems. (more info)

It seems like such a simple thing, but to me it demonstrates some excellent IT process competencies:

  • Planned changes: they know well in advance when change is occurring and how long they expect it to take
  • Situational awareness: they know who is going to be impacted and in what fashion
  • Communication: they have a mechanism for alerting the affected parties in a reasonable time frame
  • Transparency:  they’re willing to alert their user community that they will be inconvenienced rather than just let it happen and hope nobody notices

While these all may seem like trivial or “of course” items to many of you reading this blog, let me tell you that many of the IT shops that I visit as part of my consulting practice regularly #itfail in some or all of the above areas.

So kudos to the folks at WordPress!  Good show!

Queue Inversion Week

February 12, 2009

Hal Pomeranz, Deer Run Associates

Reliving the last story from my days at the mid-90’s Internet skunkworks reminded me of another bit of tactical IT advice I learned on that job, and which has become a proven strategy that I’ve used on other engagements.  I call it “Queue Inversion Week”.

One aspect of our operations religion at the skunkworks was, “All work must be ticketed” (there’s another blog post behind that mantra, which I’ll get to at some point).  We lived and died by our trouble-ticketing system, and ticket priority values generally drove the order of our work-flow in the group.

The problem that often occurs in organizations in this situation, however, is what I refer to as the “tyranny of the queue”.  Everybody on the team is legitimately working on the highest-priority items.  However, due to limited resources in the Operations group, there are lower priority items that tend to collect at the bottom of the queue and never rise to the level of severity that would get them attention.  The users who have submitted these low-priority tickets tend to be very understanding (at least they were at the skunkworks) and would wait for weeks or months for somebody in my group to get around to resolving their minor issues.  I suspect that during those weeks/months the organization was actually losing a noticeable amount of worker productivity due to these “minor” issues, but we never quantified how much.

What did finally penetrate was a growing rumble of unhappiness from our internal customers.  “We realize you guys are working on bigger issues,” they’d tell me in staff meetings, “but after a few months even a minor issue becomes really irritating to the person affected.”  The logic was undeniable.

I took the feedback back to my team and we started kicking around ideas.  One solution that had a lot of support was to simply include time as a factor in the priority of the item: after the ticket had sat in the queue for some period of time, the ticket would automatically be bumped up one priority level.  The problem is that when we started modeling the idea, we realized it wouldn’t work.  All of the “noise” from the bottom of the queue would eventually get promoted to the point where it would be interfering with critical work.
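
Just to illustrate the problem, here is a toy simulation of the aging idea in Python.  All of the arrival rates and thresholds below are numbers made up purely for illustration (not the actual figures from our environment), but they show the failure mode: after a few months, most of what sits at the top of the queue is promoted noise competing with the genuinely critical work.

    PRIORITIES = ["low", "medium", "high", "critical"]
    CRITICAL = len(PRIORITIES) - 1
    BUMP_AFTER_DAYS = 14      # age at which a ticket is promoted one level
    NEW_LOW_PER_DAY = 5       # minor annoyances submitted per day
    NEW_CRITICAL_PER_DAY = 1  # genuinely critical items per day
    CLOSED_PER_DAY = 3        # top-of-queue tickets Ops can close daily

    queue = []  # each ticket: {"born": priority index, "prio": priority index, "age": days}

    for day in range(90):
        # New tickets arrive.
        queue += [{"born": 0, "prio": 0, "age": 0} for _ in range(NEW_LOW_PER_DAY)]
        queue += [{"born": CRITICAL, "prio": CRITICAL, "age": 0}
                  for _ in range(NEW_CRITICAL_PER_DAY)]

        # Everything ages; tickets that have sat long enough get bumped one level.
        for t in queue:
            t["age"] += 1
            if t["age"] % BUMP_AFTER_DAYS == 0 and t["prio"] < CRITICAL:
                t["prio"] += 1

        # Ops works strictly from the top of the queue.
        queue.sort(key=lambda t: t["prio"], reverse=True)
        del queue[:CLOSED_PER_DAY]

    top = [t for t in queue if t["prio"] >= PRIORITIES.index("high")]
    noise = [t for t in top if t["born"] == 0]
    print(f"After 90 days, {len(noise)} of the {len(top)} high/critical tickets "
          "started life at the bottom of the queue")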

Then my guy Josh Smift, who basically “owned” the trouble ticketing system as far as customization and updates were concerned, had the critical insight: let’s just “invert” the queue for a week.  In other words, the entire Operations crew would simply work items from the bottom of the queue for a week rather than the top.  It was simple and it was brilliant.

So we looked at the project schedule and identified what looked like a “slack” week and declared it to be “Queue Inversion Week.”  We notified our user community and encouraged them to submit tickets for any minor annoyances that they’d been reluctant to bring up for whatever reason.

To say that “Queue Inversion Week” was a raging success would be to put it mildly indeed.  Frankly, all I wanted out of the week was to clear our ticket backlog and get our customers off our backs, but the whole experience was a revelation.  First, the morale of my Operations team went through the roof.  Analyzing the reasons why, I came to several conclusions:

  • It got my folks out among the user community and back in touch with the rest of the company, rather than being locked up in the data center all day long.  The people who my folks helped were grateful and expressed that to my team, which makes a nice change from the usual, mostly negative feedback IT Ops people tend to get.
  • The tickets from the bottom of the queue generally required only the simplest tactical resolutions.  Each member of my team could resolve dozens of these items during the week (in fact, a friendly competition arose to see who could close the most tickets), and feel terrific afterwards because there was so much concrete good stuff they could see that they’d done.
  • Regardless of what outsiders think, I believe most people in IT Operations really want to help the people who are their customers.  It’s depressing to know that there are items languishing at the bottom of the queue that will never get worked on.  This week gave my team an excuse to work on these issues.

I think I can reasonably claim that Queue Inversion Week also had a noticeable impact on the morale of the skunkworks as a whole.  After all, many of the annoying problems that our users had been just doing work-arounds for were now removed as obstacles.  Like a good spring cleaning, everybody could breathe a little easier and enjoy the extra sunshine that appeared through the newly cleaned windows.

We repeated Queue Inversion Week periodically during my tenure at the skunkworks, and every time it was a positive experience that everybody looked forward to and got much benefit from.  You can’t necessarily have it happen on a rigid schedule, because other operational priorities interfere, but any time it looks like you have a little “slack” in the project schedule coming up and the bottom of your queue is full of little annoying tasks, consider declaring your own “Queue Inversion Week” and see if it doesn’t do you and your organization a world of good.

Hal Pomeranz, Deer Run Associates

Lately I’ve been seeing multiple articles pop up arguing for the importance of plans and checklists in many walks of life, but especially in IT process.  I couldn’t agree more.  In fact, when an IT person believes something as strongly as I believe in the power of implementation plans and checklists, you know there must be a story involved.  Let me tell you mine.

Back in the mid-90’s my good friend Jim Hickstein and I were working for the Internet skunkworks of a huge East Coast direct mail marketing company.  At the time I think I was the Operations Manager and Jim was my tech lead in the Sys Admin group, but since we were a small operation both Jim and I were still deeply involved in the tactical implementation and roll-out of new technologies.

In particular, there was one instance where Jim and I were scheduled to roll out a new storage array into our production ecommerce infrastructure.  Our production environment was hosted at company HQ down in Connecticut (our offices were in the Boston area), and in some sense this IT activity was part of an ongoing evangelism on our part of new “open” technology in what was largely a mainframe shop up until our arrival.  We wanted to show that we were better, faster, cheaper, and just as reliable as the platforms that they were used to.

I remember talking to Jim and suggesting that we come up with an implementation plan prior to the installation activity.  But clearly neither one of us had gotten the religion yet, because after talking about it for a while we decided that we “knew what we were doing” (despite never having deployed a unit of this type before) and didn’t need a plan.  So with our figurative cowboy hats firmly in place, we had the equipment shipped and a couple of weeks later we hopped in Jim’s car and followed it down to wildest Connecticut.

Upon arrival, we met up with a couple of the IT folks from HQ who were curious about what we were doing and wanted the chance to observe how we operated.  That was cool with us, because we wanted to evangelize as much as possible, remember?  Off we all went to the data center, unpacked and inspected the equipment, and generally killed time waiting for our three-hour outage window to commence.  Once the outage window arrived, Jim and I began our work.

Crisis set in almost immediately.  I can’t even remember at this point what started going wrong, but I think almost every person who’s done production Operations recognizes that “bottom drops out” feeling you get in your stomach when you suddenly realize what you thought was routine work has gone horribly awry.  As I recall, I was working the back of the machine hooking up the storage array and Jim was on the system console and I clearly remember staring across the rack at Jim and seeing the same horrified look reflected in his eyes.

I don’t remember much of the next two and a half hours.  I know we didn’t panic, but started working through the problems methodically.  And we did end up regaining positive control and actually got the work completed– just barely– within our outage window.  But frankly we had forgotten anything else in the world existed besides our troubled system.

When the crisis was resolved and we returned to something resembling the real world, we remembered that our actions were being scrutinized by our colleagues from HQ.  I turned to acknowledge their presence, expecting to see (deservedly) self-satisfied smiles on their faces from seeing us nearly fall on our faces in a very public fashion.  Well boy was I surprised: they were actually standing there in slack-jawed amazement.  “That was the most incredible thing I’ve ever seen!  You guys are awesome!” one of them said to me.

What could we do?  Jim and I thanked them very much for their kind words, used some humor to blow off some of our excess adrenaline, and extracted ourselves from the situation as gracefully as possible.  After leaving the data center, Jim and I walked silently back to his car, climbed in, and just sat there quietly for a minute.  Finally, Jim turned to me and said, “Let’s never do that again.”

And that was it.  From that point on our operations policy playbook– enforced religiously by both Jim and myself– included the requirement that all planned work include a detailed implementation and rollback plan that must be reviewed by at least one other member of the operations staff before any work would be approved.  And to those who are thinking that this policy must have slowed down our ability to roll out changes in our production infrastructure, you couldn’t be more wrong.  In fact, our ability to make changes improved dramatically because our percentage of “successful” change activity (defined as not requiring any unplanned work or firefighting as a result of the change) got to better than 99%.  We simply weren’t wasting time on unplanned work, so as a consequence we had more time to make effective changes to our infrastructure.

That’s it.  Something as humble as a checklist can make your life immeasurably better.  But you have to be willing to admit that you need the help of a checklist in the first place, which cuts against the “expert audacity” that so many of us in IT feed on.  It’s a tough change to make in oneself.  I only hope you won’t have to learn this lesson the hard way, like Jim and I did.