Hal Pomeranz, Deer Run Associates

At the end of our recent SANS webcast, Mike Poor closed by emphasizing how important it was for IT and Information Security groups to advertise their operational successes to the rest of the organization (and also to their own people).  Too often these functions are seen as pure cost centers, and in these difficult economic times it’s up to these organizations to demonstrate return value or face severe cutbacks.

The question is: what are the right metrics to publish to demonstrate success?  All too often I see organizations publishing meaningless metrics, or even metrics that create negative cultures and damage the corporate perception of the organization:

  • It seems like a lot of IT Ops groups like to publish their “look how much stuff we operate” metrics: so many thousand machines, so many petabytes of disk, terabytes of backup data per week, etc.  The biggest problem with these metrics is that they can be used to justify massive process inefficiencies.  Maybe you have thousands of machines because every IT project buys its own hardware and you’re actually wasting money and resources that could be saved by consolidating.  Besides, nobody else in the company cares how big your… er, server farm is.
  • Then there are the dreaded help desk ticket metrics: tickets closed per week, average time to close tickets, percentage of open tickets, etc.  The only thing these metrics do is incentivize your help desk to do a slapdash job and thereby annoy your customers.  There’s only one help desk metric that matters: customer satisfaction.  If you’re not doing customer satisfaction surveys on EVERY TICKET and/or you’re not getting good results then you fail.

So what are some good metrics?  Well, I’m a Visible Ops kind of guy, so the metrics that matter to me are things like the amount of unplanned downtime (drive it to zero), the number of successful changes requiring no unplanned work or firefighting (more is better), the number of unplanned or unauthorized changes (drive it to zero), and the number of projects completed on time and on budget (more is better).  Of course, if your IT organization is struggling, you might be tempted NOT to publish these metrics because they show that you’re not performing well.  In that case, accentuate the positive by publishing your improvement numbers rather than the raw data: “This month we had 33% less unplanned downtime than last month.”  This makes your organization look proactive and creates the right cultural imperatives without airing your dirty laundry.
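
If you want to see how little work those improvement numbers take, here’s a minimal sketch in Python (all of the figures and field names are made up purely for illustration) that turns raw monthly counts into month-over-month percentages:

    # Minimal sketch: turn raw monthly figures into month-over-month
    # "improvement" numbers.  All data and field names are illustrative.
    monthly = {
        "last_month": {"unplanned_downtime_min": 540, "unauthorized_changes": 8},
        "this_month": {"unplanned_downtime_min": 360, "unauthorized_changes": 5},
    }

    def percent_change(previous, current):
        # A negative result means a reduction, i.e. an improvement for
        # drive-to-zero metrics like unplanned downtime.
        if previous == 0:
            return 0.0
        return (current - previous) / previous * 100.0

    for metric, previous in monthly["last_month"].items():
        change = percent_change(previous, monthly["this_month"][metric])
        direction = "less" if change < 0 else "more"
        print(f"{metric}: {abs(change):.0f}% {direction} than last month")

Run against these sample numbers it reports 33% less unplanned downtime, which is exactly the kind of line you want in a status roll-up.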

There are a couple of other places where I never fail to toot my own horn:

  • If my organization makes life substantially better for another part of the company, then you’d better believe I’m going to advertise that fact.  For example, when my IT group put together a distributed build system that cut product compiles from over eight hours to less than one hour, it not only went into our regular status roll-ups, but I also got the head of the Release Engineering group to give us some testimonials.
  • Whenever a significant new security vulnerability comes out that is not an issue for us because of our standard builds and/or operations environment, I make sure the people who provide my budget know about it.  It also helps if you can point to “horror story” articles about the amount of money other organizations have had to pay to clean up after incidents related to the vulnerability.  This is one of the few times that Information Security can demonstrate direct value to the organization, and you must never miss out on these chances.

What’s That Smell?

If communicating your successes builds a corporate perception of your organization’s value, being transparent about your failures builds trust with the rest of the business.  If you put a relentlessly positive marketing spin on your accomplishments, your “customers” elsewhere in the company will become suspicious.  Besides, you’ll never bamboozle them with your wins so thoroughly that they won’t notice the elephant in the room when you fall on your face.

The important things to communicate when you fail are that you understand what led to the failure, that you have the situational awareness to understand the failure’s impact on the business, and that you are taking steps to make sure the same failure never happens again (the only real organizational failure is allowing the same failure to happen twice).  Here’s a simple checklist of items you should have in your disclosure statement (a quick sketch of how you might capture them follows the list):

  • Analysis of the process(es) that led to the failure
  • The duration of the outage
  • How the outage was detected
  • The systems and services impacted
  • Which business units were impacted and in what way
  • Actions taken to end the outage
  • Corrective processes to make sure it never happens again
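
None of this is rocket science, but it’s easy to drop an item when you’re writing under pressure.  As a purely illustrative sketch (the class and field names below are hypothetical, not part of any standard), you could capture the checklist in a structured record so every disclosure covers the same ground:

    # Hypothetical sketch: a structured post-mortem record covering the
    # checklist above, so no item gets dropped from the disclosure.
    from dataclasses import dataclass
    from typing import List

    @dataclass
    class PostMortem:
        process_analysis: str               # what process breakdown led to the failure
        outage_duration: str                # how long the outage lasted
        detection: str                      # how the outage was detected
        impacted_systems: List[str]         # systems and services impacted
        impacted_business_units: List[str]  # which business units, and how
        restoration_actions: str            # actions taken to end the outage
        corrective_actions: List[str]       # process changes so it never happens again

        def as_disclosure(self) -> str:
            # Render the record as the body of a disclosure message.
            return "\n".join([
                f"Process analysis: {self.process_analysis}",
                f"Outage duration: {self.outage_duration}",
                f"Detected via: {self.detection}",
                f"Systems/services impacted: {', '.join(self.impacted_systems)}",
                f"Business units impacted: {'; '.join(self.impacted_business_units)}",
                f"Actions taken to restore service: {self.restoration_actions}",
                f"Corrective actions: {'; '.join(self.corrective_actions)}",
            ])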

Note that in some cases it’s necessary to split the disclosure across two or three messages.  The first goes out during the incident to tell your constituents, “Yes, there’s a problem and we’re working it.”  The next is the “services restored at time X, more information forthcoming” message.  Then finally comes your complete post-mortem report.  Try to avoid partial or incomplete disclosure, or idle speculation without all of the facts; you’ll almost always end up with egg on your face.

Conclusion

If you don’t communicate what’s happening in your IT and/or InfoSec organization, then the other business units are basically going to assume you’re not doing anything during the time when you’re not directly working on their requests.  This leads to the perception of IT as nothing more than “revenue-sucking pigs”.

However, you also have to communicate in the right way.  This means publishing worthwhile metrics, and metrics that don’t create bad cultural imperatives for your organization.  It also means being transparent and communicating your failures to the rest of the organization in the most proactive way possible.