Skunkworks in the Clouds

April 23, 2009

Hal Pomeranz, Deer Run Associates

Skunks... clouds... get it?

I was recently asked to make a guest appearance on a podcast related to information security in “the cloud”.  One of the participants brought up an interesting anecdote from one of his clients.  Apparently the IT group at this company had been approached by a member of their marketing team who was looking for some compute resources to tackle a big data crunching exercise.  The IT group responded that they were already overloaded and it would be months before they could get around to providing the necessary infrastructure.  Rebuffed but undeterred, the marketing person used their credit card to purchase sufficient resources from Amazon’s EC2 to process the data set and got the work done literally overnight for a capital cost of approximately $1800.

There ensued the predictable horrified gasping from us InfoSec types on the podcast.  Nothing is more terrifying than skunkworks IT, especially on infrastructure not under our direct control.  “Didn’t they realize how insecure it was to do that?” “What will happen when all of our users realize how easily and conveniently they can do this?” “How can an organization control this type of risky behavior?” We went to bed immersed in our own paranoid but comfortable world-view.

Since then, however, I’ve had the chance to talk with other people about this situation.  In particular, my friend John Sechrest delivered an intellectual “boot to the head” that’s caused me to consider the situation in a new light.  Apparently getting the data processed in a timely fashion was so critical to the marketing department that they figured out their own self-service plan for obtaining the IT resources they needed. If the project was that critical, John asked, was it reasonable from a business perspective for the IT group to effectively refuse to help their marketing department crunch this data?

Maybe the IT group really was overloaded– most of them are these days.  However, the business of the company still needs to move forward, and the clever problem-solving monkeys in various parts of the organization will figure out ways to get their jobs done even without IT support. “Didn’t they realize how insecure it was to do that?”  No, and they didn’t care.  They needed to accomplish a goal, and they did.

“What will happen when all of our users realize how easily and conveniently they can do this?” My guess is they’re going to start doing it a lot more.  Maybe that’s a good thing.  If the IT group is really overloaded, then perhaps it should think about actually empowering its users to do these kinds of “one off” or prototype projects on their own without draining the resources of the core IT group.  Remember that if you let a thousand IT projects bloom, 999 of them are going to wither and die shortly thereafter.  Perhaps IT doesn’t need to waste time managing the death of the 999.

“How can an organization control this type of risky behavior?” You probably can’t.  So perhaps your IT group should provide a secure offering that’s so compelling that your users will want to use your version rather than the commodity offerings that are so readily available.  This solution will have to be tailored to each company, but I think it starts with things like:

  • Pre-configured images with known baseline configurations and relevant tools so that groups can get up and running quickly without having to build and upload their own images.
  • Easy toolkits for migrating data into and out of these images in a secure fashion, with some sort of DLP solution baked in.
  • Secure back-end storage to protect the data at rest in these images with no extra work on the part of the users.
  • Integration with the organization’s existing identity management and/or AAA framework so that users don’t have to re-implement their own solutions.
  • Integration with the organization’s auditing and logging infrastructures so you know what’s going on.

Putting together the kind of framework described above is a major IT project, and will require input and participation from your user community.  But once accomplished, it could provide massive leverage to overtaxed IT organizations.  Rather than IT having to engineer everything themselves, they provide secure self-service building blocks to their customers and let them have at it.
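
To make this a little more concrete, here is a minimal sketch of what one such building block might look like: a small wrapper that will only launch cloud instances from pre-approved baseline images, and that tags everything it creates so the instances show up in your auditing and logging infrastructure.  The example uses Python and the boto3 AWS SDK, and the image IDs, profile names, and tag values are entirely hypothetical; adapt it to whatever your environment actually looks like.

    # Minimal sketch of a self-service "building block": users can launch
    # instances only from pre-approved baseline images, and every instance is
    # tagged so it is visible to central auditing and logging.  Image IDs,
    # profile names, and tag values are hypothetical.
    import boto3

    APPROVED_IMAGES = {
        "linux-baseline": "ami-0123456789abcdef0",     # hardened baseline image
        "analytics-toolkit": "ami-0fedcba9876543210",  # baseline plus data-crunching tools
    }

    def launch_approved_instance(profile, owner, project, instance_type="m5.large"):
        """Launch an instance from an approved baseline image, tagged for auditing."""
        image_id = APPROVED_IMAGES[profile]  # anything not on the approved list fails here
        ec2 = boto3.client("ec2")
        response = ec2.run_instances(
            ImageId=image_id,
            InstanceType=instance_type,
            MinCount=1,
            MaxCount=1,
            TagSpecifications=[{
                "ResourceType": "instance",
                "Tags": [
                    {"Key": "Owner", "Value": owner},
                    {"Key": "Project", "Value": project},
                    {"Key": "Baseline", "Value": profile},
                ],
            }],
        )
        return response["Instances"][0]["InstanceId"]

    if __name__ == "__main__":
        print(launch_approved_instance("analytics-toolkit", owner="marketing", project="data-crunch"))

The particular API calls aren't the point; the point is that the secure, logged, pre-baked path is also the path of least resistance for your users.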

Providing architecture support and guidance in the early stages of each project is probably prudent.  After all, the one hardy little flower that blooms and refuses to die may become a critical resource that eventually needs to be moved back “in house”.  The fact that the building blocks used to create the service are already well-integrated with the organization’s centralized IT infrastructure will help, but having a reasonable architectural design from the start will also be a huge help when it comes time to migrate and continue scaling the service.

Am I advocating skunkworks IT?  No, I like to think I’m advocating self-service IT on a grand scale.  You’ll see what skunkworks IT looks like if you ignore this issue and just let your users develop their own solutions because you’re too busy to help them.


Hal Pomeranz, Deer Run Associates

One of my personal and professional mantras is, “It doesn’t take that much longer to do it right.”  Sure, you can always do a half-assed job on some project just to shove it out the door, but there are inevitably down-stream costs.  Whether it’s time lost having to go back and fix your broken junk or customer frustration and negative perception, your short-cut decision usually ends up costing you more in the long run.

Of course you know I have a story to illustrate this principle.  During the peak of the dot-com boom I was helping a small start-up company move out of their sub-lease arrangement into their new permanent home.  The weekend of the move, we had contracted with a moving company to do the physical move of all the equipment and personal items and I was helping the IT team for the company do the setup, server room build out, telephony configuration, etc.  The building we were moving into was a two-story affair, and the plan for the equipment that was moving onto the second floor was to arrange it on pallets, wrap the pallet securely, and forklift the pallets through a second-story window.  Pretty standard practice, actually.

The moving company, however, had negotiated a fixed-price bid.  So it was in their best interests to get the work done as quickly as possible.  Without our knowledge, they decided to throw caution to the wind and not wrap the palletized equipment.  Sure enough, the first pallet went up on the forklift and as the pallet was moving forward through the window, one of the computers toppled off the edge and fell a full story onto concrete.  The punchline is that the computer belonged to the office manager for the company and was the computer that was going to cut the check to the moving company.  We were actually able to recover the hard drive from the twisted chassis and boot it in another PC, but the moving company had to pay a significant penalty that more than erased any savings that they might have achieved from not wrapping the pallets.  Also, after that incident, we made them stop and wrap all the pallets anyway, which cost them even more time.

Once the equipment was loaded in, the IT team and I could get started with our part of the project.  And it was a big project:  from our Friday evening start to late Sunday night, I think we got maybe 12 hours of downtime to sleep.  But there we were Sunday night and everything appeared to be ready for business to resume Monday morning.

Exhausted, but feeling pretty good about ourselves, we were standing admiring the new server room and somebody pointed out that we forgot to put the covers back on the cable raceways.  There was a collective groan.  We were “done”– surely replacing the covers could wait until the following week when we were all less tired?  But when we all looked each other in the eye, we knew that the upcoming week was going to be so chaotic that if we put off doing it now we’d never get around to it in a timely fashion.  So, without a word being spoken, we all grabbed some covers and spent another half hour or so fitting them into place.

When the execs toured the facility the following morning, we actually did get some compliments about how “neat” and “professional” the server room looked, but I don’t think that’s really why a crew of exhausted geeks spent an extra half hour re-assembling cable raceways.  Nor was it because we’re anal retentive or budding masochists.  I think it was because (a) we wanted to establish a culture of “doing things right” in the new server room so that entropy wouldn’t set in so quickly, and (b) we knew that going back and fixing the problem later would take longer than doing it right then.

So when you find yourself looking for excuses not to “finish” a project and just “get ‘er done”, take a moment and think about whether expediency is really the best policy.  Remember that bugs are always cheaper to fix in development than after the product ships.  It doesn’t take that much longer to do it right.

Avoiding Avoidance

April 6, 2009

Hal Pomeranz, Deer Run Associates

At a recent tech event, I ended up having a conversation with Wendy Kincade and Colleen Dick about how we sometimes let critical projects slide. Wendy said a very wise thing, which is that when people are avoiding a project, it’s most often because they don’t think they’ll reach a successful outcome.  In some sense it becomes a vicious cycle: we know we really should be making progress on that big project, yet we somehow always find other things to be working on that allow us to avoid it.  Of course, the longer we avoid it, the less time we have to complete the project and it only becomes bigger and scarier as a result of the shrinking time window.

Full speed into the unknown

I’ve certainly come to recognize this behavior in myself.  It’s much more comfortable to spend your days doing small, tactical tasks that have short completion times and satisfying outcomes.  It’s a huge leap of faith to set out to tackle an enormous project when you’re not sure if you have all the skills necessary to accomplish the task, where the end of the project is not clearly in sight, and where the cost of failure may be high.  It gives me an uncomfortable feeling in the pit of my stomach, not unlike the feeling you get when contemplating stepping off from a great height.

When I recognize this feeling in myself, I immediately take steps to start tackling the project, because I know from previous experience that if I let it linger the situation is only going to get worse.  Here are some tactics that I’ve developed for getting over the hump:

1. Break it up: Vast monolithic projects are daunting, so break the project up into a set of deliverables, milestones, and dependencies.  Then outline the steps necessary to reach each component of the project.  You don’t have to create a formal project plan– in fact, I’ve seen people spend all their time grooming a plan in MS Project, just to avoid getting started on the actual work.  A simple outline format is fine.

2. Pick an easy one: Once you’ve got a notion of the individual tasks you need to accomplish to finish your project, pick one of the tasks that you think you can complete quickly and get it done.  During our conversation, Colleen commented, “I know that if I can just knock one thing down, that gives me energy to push further into the project.”

3. Make it fun: What motivates you?  I really enjoy figuring out and mastering new technology.  So if there’s a component of the project that requires me to do a bunch of research to figure something out, I’ll tend to do that first plus give myself leeway to spend extra time really getting mastery of that subject.  While I need to be careful to prevent turning the research into an avoidance exercise in and of itself, I also know that any mastery I acquire will be useful at some point in the future, even if it’s not directly relevant to the project at hand.  Some people reward themselves after completing a particular part of the project– take a break to play your favorite video game, hang out with friends, go for a hike, whatever.

4. Consider past success: Reflect on the fact that you’ve accomplished difficult tasks in the past.  Remember the satisfaction you felt when you finally shipped those projects.  Use these feelings to reinforce your belief that you’ll be successful at the project you’re currently embarking on.

While I’ve come to know my own avoidance behaviors and learned to take steps to work around them, I don’t think I’ll ever be entirely free of them.  I think it’s just a natural human risk aversion response.  However, I also recognize that one has to take risks to accomplish great things.  I have a quote from the explorer Magellan on the wall in my office and I look at it often:

Unlike the mediocre, the intrepid spirits seek victory over those things that seem impossible… They embark on the most daring of all endeavors… to meet the shadowy future without fear and conquer the unknown.

Hal Pomeranz, Deer Run Associates

In a comment to Monday’s post on Change Management, John Moore wrote:

I would argue, however, that this level of change management is only appropriate once you reach a certain size company. If the company is more than 100 people, you need to have these policies in place and you must have enforcement, or the cost for running the IT team in a manner that benefits the company is impossible.

In startups, where I have spent much of the last decade, the change management systems you have defined above would be overly prohibitive and remove the flexibility that is critical for success.

John’s not alone in expressing this view.  I’ve heard similar sorts of comments from companies of all different sizes– some of which were substantially larger than John’s suggested 100 person threshold.  But I think that change management is important regardless of what size you’re at, and it doesn’t have to remove any “flexibility” or “agility” from the organization.  Quite the contrary, appropriate change management should enable the organization to move more rapidly because it reduces failed changes and unplanned work that suck resources that could otherwise be more productively channeled.

The key word there is “appropriate”.  Of course the change management process in a 3-10 person start-up looks completely different from the process in a company with hundreds of employees.  In an early-stage start-up you’ve typically got a team of people working very closely together with laser focus on a single line of business.  You don’t tend to have the kind of “process flow control” issues that larger companies do, where you need change review meetings to balance competing priorities and competing schedule issues.

But even three-person start-ups need to make production changes thoughtfully and with rigor.  It’s easy to think “we know what we’re doing” and get yourself into a lot of trouble and cause a significant outage.  It doesn’t take that much longer to sit down and write a detailed implementation plan, have one of your co-workers review it, and then execute it (Hickstein’s, “Think, think, think, type, type, type, `beer’!”).  And the bonus is that the history of implementation plans helps you when you need to grow your infrastructure, because now you have the documented list of configuration changes necessary to produce replicas of your existing systems.

Do I think a three-person start-up needs formal change control meetings?  Heck no!  If you have regular Engineering meetings, set aside a little time to mention scheduled production updates (if any) and solicit feedback.  If you don’t have regular meetings, set up an email alias where notices of production changes can be posted.  That way, at least everybody will be aware of the current state of affairs on the production systems (or can refer back to the archives as appropriate), which is critical information for them to know as they’re developing code for those platforms.

I would, however, recommend that you implement some sort of configuration control process on your production systems.  It could be as simple as implementing an Open Source utility like AIDE or Samhain, just to keep an eye on what’s happening on the system.  Aside from alerting you to cockpit error on the part of your own people, these kinds of tools can also alert you to more nefarious activity and are part of a good baseline security posture.
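
For those who have never looked inside one of these tools, the core idea is simple enough to sketch.  Here is a toy example in Python (not a replacement for AIDE or Samhain, and with purely hypothetical paths) that records a hash baseline for a couple of directories and then reports anything that was added, removed, or modified:

    # Toy file-integrity checker in the spirit of AIDE/Samhain: record a
    # baseline of file hashes, then report anything added, removed, or changed.
    # The watched directories and baseline location are placeholders.
    import hashlib, json, os, sys

    WATCHED_DIRS = ["/etc", "/usr/local/bin"]
    BASELINE_FILE = "/var/lib/integrity-baseline.json"

    def snapshot():
        """Return {path: sha256} for every regular file under the watched dirs."""
        hashes = {}
        for top in WATCHED_DIRS:
            for root, _dirs, files in os.walk(top):
                for name in files:
                    path = os.path.join(root, name)
                    try:
                        with open(path, "rb") as f:
                            hashes[path] = hashlib.sha256(f.read()).hexdigest()
                    except OSError:
                        pass  # skip unreadable or vanished files
        return hashes

    def main():
        current = snapshot()
        if "--init" in sys.argv or not os.path.exists(BASELINE_FILE):
            with open(BASELINE_FILE, "w") as f:
                json.dump(current, f)
            print("Baseline recorded.")
            return
        with open(BASELINE_FILE) as f:
            baseline = json.load(f)
        for path in sorted(set(baseline) | set(current)):
            if path not in baseline:
                print("ADDED    ", path)
            elif path not in current:
                print("REMOVED  ", path)
            elif baseline[path] != current[path]:
                print("MODIFIED ", path)

    if __name__ == "__main__":
        main()

Run it once with --init to record the baseline, then run it regularly from cron and review the output.  The real tools add signed databases, richer attribute checks, and proper reporting, but the principle is the same.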

At some point in the growth cycle of the company, you’re going to start getting feedback from developers that they “don’t care” about the production update notices.  Congratulations! You’ve just reached a major milestone in your company’s maturation process– the beginning of separation of duties.  This is probably also around the time you’ll be hiring your first full-time IT person, so start soliciting resumes.

Your change management processes will also start adjusting to your new realities.  Your new IT person is going to become the keeper of the implementation plans and other change documentation.  They’ll probably also start pushing you for more formal outage windows, just so they can have some predictability in the environment.  And they’re also going to start pushing back on the developers to keep them from making direct changes on the production systems.  Let these things happen.

The next thing you know, you’re going to look up and realize you’ve got several IT folks and they’ve got their own manager.  Furthermore, you’ve got several products now being developed concurrently.  This is the stage where John suggests that your company needs to start embracing a formal change management process like the one described in Visible Ops, and I agree.  Hopefully you figure out you’ve reached this stage before you have a production outage caused by multiple, badly coordinated updates.

Just like the wrong time to fix bugs in your product is after the product has shipped, the wrong time to try to build a culture of change management from scratch is after the company is already established.  It is very hard to change a “cowboy culture” once it’s been allowed to establish itself.  Visible Ops has a quote from Dr. Bob Doppelt, who was actually speaking of public health matters when he uttered it, but it is nonetheless appropriate: “The righter we do the wrong things, the wronger we become.” The problem is that inattention to change management can appear to work for a period of time– mostly because nobody’s bothering to track the amount of time lost to firefighting and unplanned work.  But suddenly an organization wakes up and realizes that it has become utterly crushed by the tyranny of unplanned work.  Digging out of this hole is painful.

So resist the notion that change management is “only for big companies”.  Don’t you hope to be a big company some day? Well you’re not going to receive an angelic visitation complete with fully-functioning change management process on the magic day you somehow cross the “big company” threshold.  Better instead to be a small company that believes strongly in change management and grows naturally into a formal change management process.

Hal Pomeranz, Deer Run Associates

Lately Gene Kim, Kevin Behr, and I have been on a nearly messianic crusade against IT suckage.  Much of our discussion has centered around The Visible Ops Handbook that Gene and Kevin co-authored with George Spafford. Visible Ops is an extremely useful playbook containing four steps that IT groups can follow to help them become much higher performing organizations.

However, I will admit that Visible Ops is sometimes a hard sell.  That’s because the first step of Visible Ops is to create a working change management process within the IT organization– with functional controls and real consequences for people who subvert the change management process.  Aside from being a difficult task in the first place, just the mere concept of change management causes many IT folks to start looking for an exit.  “We hate change management!” they say.  “Don’t do this to us!”  What I quickly try to explain to them is that they don’t hate change management, they just hate bad change management.  And, unfortunately, bad change management is all they’ve experienced to date, so they don’t know there’s a better way.

What are some of the hallmarks of bad change management processes?  See if any of these sound familiar to you:

1. Just a box-checking exercise: The problem here is usually that an organization has implemented change management only because their auditors told them they needed it.  As a result, the process is completely disconnected from the actual operational work of IT in the organization.  It’s simply an exercise in filling out and rubber-stamping whatever ridiculous forms are required to meet the letter of the auditors’ requirements.  It does not add value or additional confidence to the process of making updates in the environment.  Quite the contrary, it’s just extra work for an already over-loaded operations staff.

2. No enforcement: The IT environment has no controls in place to detect changes, much less unauthorized changes.  If the process is already perceived as just a box-checking exercise and IT workers know that no alarms will be raised if they make a change without doing the paperwork, do you think they’ll actually follow the change management process?  Visible Ops has a great story about an organization that implemented a change management process without controls.  In the second month changes were down by 50%, and another 20% in month three, yet the organization was still in chaos and fighting with constant unplanned outages.  When they finally implemented automated change controls, they discovered that the rate of changes was constant, it’s just  the rate of paperwork that was declining.

3. No accountability: What does the organization do when they detect an unauthorized change?  The typical scenario is when a very important member of the operations or development staff makes an unauthorized change that ends up causing a significant outage.  Often this is where IT management fails their “gut check”– they fear angering this critical resource and so the perpetrator ends up getting at worst a slap on the wrist.  Is it any wonder then that the rest of the organization realizes that management is not taking the change management process seriously and thus the entire process can be safely ignored without individual consequences?

I firmly believe that change management can actually help an organization get things done faster, rather than slower.  Seems counter-intuitive, right?  Let me give you some recommendations for improving your change management process and talk about why they make things better:

1. Ask the right questions: What systems, processes, and business units will be affected? During what window will the work be done? Has this change been coordinated with the affected business units and how has it been communicated? What is the detailed implementation plan for performing the change? How will the change be tested for success? What is the back-out plan in case of failure?

Asking the right questions will help the organization achieve higher rates of successful changes, which means less unplanned work.  And unplanned work is the great weight that’s crushing most low-performing IT organizations.  As my friend Jim Hickstein so eloquently put it, “Don’t do: think, type, think, type, think, type, `shit’! Do: think, think, think, type, type, type, `beer’!”  Also, coordinating work properly with other business units means less business impact and greater overall availability.
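
One easy way to make sure those questions actually get asked is to bake them into the change request record itself.  Here is a hypothetical sketch in Python; the field names are my own invention rather than anything out of Visible Ops, but the idea is that a change simply cannot be scheduled until every question has an answer:

    # Hypothetical change request record that forces the "right questions"
    # to be answered before a change can be scheduled.
    from dataclasses import dataclass, field
    from typing import List

    @dataclass
    class ChangeRequest:
        summary: str
        affected_systems: List[str]
        affected_business_units: List[str]
        window_start: str                   # e.g. "2009-04-25 02:00"
        window_end: str
        communicated_to: List[str]          # who was told, and how
        implementation_plan: str            # detailed step-by-step plan
        test_plan: str                      # how success will be verified
        backout_plan: str                   # what to do if it fails
        approvals: List[str] = field(default_factory=list)

        def ready_to_schedule(self) -> bool:
            """A change isn't schedulable until every question has an answer."""
            required = [self.affected_systems, self.affected_business_units,
                        self.window_start, self.window_end, self.communicated_to,
                        self.implementation_plan, self.test_plan, self.backout_plan]
            return all(required)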

2. Learn lessons: The first part of your change management meetings should be reviewing completed changes from the previous cycle.  Pay particular attention to changes that failed or didn’t go smoothly. What happened? How can we make sure it won’t happen next time?  What worked really well?  Like most processes, change management should be subject to continuous improvement.  The only real mistake is making the same mistake twice.

Again the goal of these post-mortems should be to drive down the amount of unplanned work that results from changes in the IT environment.  But hopefully you’ll also learn to make changes better and faster, as well as stream-lining the change management process itself.

3. Keep appropriate documentation: Retain all documentation around change requests, approvals, and implementation details. The most obvious reason to do this is to satisfy your auditors.  If you do a good job organizing this information as part of your change management process, then supplying your auditors with the information they need really should be as easy as hitting a few buttons and generating a report out of your change management database.

However, where all this documentation really adds value on a day-to-day basis is when you can tie the change management documentation into your problem resolution system.  After all, when you’re dealing with an unplanned outage on a system, what’s the first question you should be asking?  “What changed?”  Well, what if your trouble tickets automatically populated themselves with the most recent set of changes associated with the system(s) that are experiencing problems?  Seems like that would reduce your problem resolution times and increase availability, right?  Well guess what?  It really does.
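
Here is a rough sketch of how that tie-in might work if your change records live in a database keyed by the systems they touch.  The schema and table names are purely hypothetical; the point is that a new trouble ticket arrives with the “what changed?” answer already attached:

    # Hypothetical sketch: when a ticket is opened for a system, attach the
    # most recent changes recorded against it.  Schema is illustrative only.
    import sqlite3

    def recent_changes_for(conn, hostname, limit=5):
        """Return the most recent change records touching the given system."""
        cur = conn.execute(
            """SELECT c.change_id, c.completed_at, c.summary
                 FROM changes c
                 JOIN change_systems cs ON cs.change_id = c.change_id
                WHERE cs.hostname = ?
                ORDER BY c.completed_at DESC
                LIMIT ?""",
            (hostname, limit),
        )
        return cur.fetchall()

    def open_ticket(conn, hostname, problem_description):
        """Create a ticket pre-populated with the 'what changed?' answer."""
        return {
            "system": hostname,
            "problem": problem_description,
            "recent_changes": recent_changes_for(conn, hostname),
        }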

4. Implement automated controls and demand accountability: If you want people to follow the change management process, they have to know that unplanned changes will be detected and consequences will ensue.  As I mentioned above, management is sometimes reluctant to follow through on the “consequences” part of the equation.  They feel like they’re held hostage to the brilliant IT heroes who are saving the day on a regular basis yet largely ignoring the change management process.  What management needs to realize is that it’s these same heroes who are getting them into trouble in the first place.  The heroes don’t need to be shown the door, just moved into a role– development perhaps– where they no longer have access to the production systems.

Again, the result is less unplanned work and higher availability.  However, it’s also my experience that having automated change controls teaches you a huge amount about the way your systems and the processes that run on them are functioning.  This greater visibility and understanding of your systems leads to a higher rate of successful changes.

The great thing about the steps in Visible Ops is that each step gives back more resources to the organization than it consumes.  The first step of implementing proper and useful change management processes is no exception.  You probably won’t get it completely right initially, but if you’re committed to continuous improvement and accountability, I think you’ll be amazed at the results.

When benchmarking the high-performing IT organizations identified in Visible Ops, the findings were that these organizations performed 14 times more changes with one quarter the change failure rate of low-performing organizations, and furthermore had one third the amount of unplanned work and 10x faster resolution times when problems did occur.  For the InfoSec folks in the audience, these organizations were five times less likely to experience a breach and five times more likely to detect one when it occurred.  Further, these organizations spent one-third the time on audit prep compared to low-performing organizations and had one quarter the number of repeat audit findings.

If change management is the first step on the road to achieving this kind of success, why wouldn’t you sign up for it?

Hal Pomeranz, Deer Run Associates

Lately I was reading another excellent blog post from Peter Thomas where he was discussing different metaphors for IT projects.  As Peter points out, it’s traditional to schedule IT projects as if they were standard real-world construction projects like building a skyscraper.  Peter writes in his blog:

Building tends to follow a waterfall project plan (as do many IT projects). Of course there may be some iterations, even many of them, but the idea is that the project is made up of discrete base-level tasks whose duration can be estimated with a degree of accuracy. Examples of such a task might include writing a functional specification, developing a specific code module, or performing integration testing between two sub-systems. Adding up all the base-level tasks and the rates of the people involved gets you a cost estimate. Working out the dependencies between the base-level tasks gets you an overall duration estimate.

Peter goes on to have some wise thoughts about why this model may not be appropriate for specific types of IT projects, but his description above got me thinking hard about an IT project management issue that I’ve had to grapple with during my career.  The problem is that the kind of planned project work that Peter is discussing above is only one type of work that your IT staff is engaged in.  Outside of the deliverables they’re responsible for in the project schedule, your IT workers also have routine recurring maintenance tasks that they must perform (monitoring logs, shuffling backup media, etc) as well as losing time to unplanned work and outages.  To stretch our construction analogy to its limits, it’s as if you were trying to build a skyscraper with a construction crew that moonlighted as janitors in the neighboring building and were also on-call 24×7 as the local volunteer fire department.  You were expecting the cement for the foundation to get poured on Thursday, but the crew was somewhere else putting out a fire and when they got done with that they had to polish the floors next door, so now your skyscraper project plan is slipping all over the place.

I’ve developed some strategies for dealing with these kinds of issues, but I don’t feel like I’ve discovered the “silver bullet” for creating predictability in my IT project schedules.  Certainly one important factor is driving down the amount of unplanned work in your IT environment.  Constant fire fighting is a recipe for failure in any IT organization, but how to fix this problem is a topic for another day. Another important strategy is to rotate your “on-call” position through the IT group so that only a fraction of your team is engaged in fire fighting activities in any given week.  When a person is on-call, I normally mark their resources as “unavailable” on my project schedule just as if they were out of the office, and then resource leveling allows you to more accurately predict the date for deliverables that they’re responsible for.

Finally, I recognize that IT workers almost never have 100% of their time available to work on IT projects, and I set their project staffing levels accordingly.  I may only be able to schedule 70% of Charlene’s time to project Whiz-Bang, because Charlene is our Backup Diva and loses 30% of her time on average to routine backup maintenance issues and/or being called in to resolve unplanned issues with the backup system.  And notice the qualifier “on average” there– some weeks Charlene may get caught up in dealing with a critical outage with the backup system and not be able to make any progress on her Project Whiz-Bang deliverables.  When weeks like this happen, you hope that Charlene’s deliverables aren’t on the critical path and that she can make up the time later in the project schedule– or you bring in other resources to pick up the slack.
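
If you want to see how that availability number flows through to the schedule, the back-of-the-envelope math is simple.  The figures below just restate the Charlene example, and the 40-hour week is obviously an assumption:

    # Back-of-the-envelope resource math: a 100-hour deliverable staffed by
    # someone who is only 70% available takes noticeably longer in calendar time.
    def calendar_weeks(effort_hours, availability, hours_per_week=40.0):
        """Calendar duration of a task given fractional availability."""
        return effort_hours / (availability * hours_per_week)

    print(calendar_weeks(100, 1.00))  # 2.5 weeks if fully dedicated
    print(calendar_weeks(100, 0.70))  # roughly 3.6 weeks at 70% availability
    print(calendar_weeks(100, 0.40))  # 6.25 weeks in a bad firefighting month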

Which brings me to another important piece of strategy I’ve picked up through the years: IT project slippage is inevitable, so you want to catch it as quickly as possible.  The worst thing that can happen is that you get to the milestone date for a multi-week deliverable only to discover that work on this segment of the project hasn’t even commenced.  This means you need to break your IT projects down into small deliverables that can be tracked individually and continuously.  I’m uncomfortable unless the lowest-level detail in my project schedule has durations of a few days or less.  Otherwise your project manager is almost guaranteed to be receiving some nasty surprises much too late to fix the problem.

These are some of the strategies I’ve come up with for managing IT projects, but I still admit to a large amount of trepidation when wrangling large IT efforts.  I’m curious to hear whether any of you reading this blog have useful strategies that you’ve developed for managing IT projects in your environment.  Let’s discuss them in the comments!

Hal Pomeranz, Deer Run Associates

Fifteen years ago or more I was listening to a presentation by Vint Cerf where he was advocating for the adoption of CIDR as a solution to many of the routing issues the core Internet providers were facing at the time.  In responding to his critics, he made an off-handed comment to the effect that, “People say to me, ‘Vint, you can have my IP prefix when you pry it out of my cold, dead fingers.’ To which I respond, ‘Remember you said dead.'”

Needless to say, this got a huge laugh out of the audience.  But the kernel of this little comment is a nugget of IT wisdom that applies in so many different situations.  To state it more plainly, there are times of compelling change when the most sensible course is to simply ignore the current installed base issues and just move forward.  By the time you’ve finished your new roll-out, today’s installed base will be completely subsumed into the new technology.

In Vint’s case, his position was spectacularly vindicated of course, because the number of Internet-connected hosts grew by an order of magnitude during the period when CIDR was being rolled out.  But this kind of thinking applies equally well on a smaller scale to operational issues faced by many IT operations groups.  Have new baseline images you want to roll out to your organization but are meeting resistance from your user community?  Roll the images out on newly deployed systems only and wait for attrition to take care of the existing installed base.  Given the cycle rate of technology in most organizations, that gives you a half-life of change in about 18 months.
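
If you want to put rough numbers on that attrition strategy, the math is straightforward.  Assuming the 18-month half-life above (and it is only an estimate; your refresh cycle may differ), the fraction of the old installed base still hanging around after a given number of months looks like this:

    # Rough attrition math: with an 18-month half-life, how much of the old
    # installed base is still around after a given number of months?
    def old_base_remaining(months, half_life_months=18.0):
        return 0.5 ** (months / half_life_months)

    for months in (6, 12, 18, 36):
        pct = round(old_base_remaining(months) * 100)
        print(f"{months:2d} months: {pct}% of the old images left")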

The Blame Game

March 5, 2009

Hal Pomeranz, Deer Run Associates

“A strange game. The only winning move is not to play.”

from the movie “War Games” (1983)

One classic pathology of low-performing IT organizations is that when an outage occurs they spend an inordinate amount of time trying to figure out whose fault it is, rather than working the problem.  Dr. Jim Metzler has even coined a new metric for this activity: Mean Time To Innocence (MTTI), defined as the average time it takes each part of the IT Operations organization to demonstrate that it’s not responsible for the outage.  Whether you call it a “Witch Hunt” or “The Blame Game” or identify it with some other term, it’s a huge waste of time and ends up making everybody involved look like a complete ignoramus.  It’s also one of the classic signs to the rest of the business that IT Operations is completely out of touch, because otherwise they’d be trying to solve the problem rather than working so hard at finding out whose fault it is.

I’m so intolerant of this kind of activity that I will often accept the blame for things I’m not responsible for just so we can move out of the “blame” phase into the “resolution” phase.  As the CIO in Metzler’s article so eloquently put it, “I don’t care where the fault is, I just want them to fix it.”   At the end of the day, nobody will remember whose fault it is, because once the problem is addressed they’ll forget all about it in the rush of all the other things they have to do.  At most they’ll remember the “go-to guy/gal” who made the problem go away.

To illustrate, let me tell you another story from my term as Director of IT for the Internet skunkworks of a big Direct Mail Marketing company.  We were rolling out a new membership program, and as an incentive we were offering the choice of one of three worthless items of Chinese-made junk with each new membership.  I’m talking about the kind of stuff you see as freebie throw-ins on those late-night infomercials– book lights, pocket AM/FM radios, “inside the shell” egg scramblers, etc.  The way the new members got their stuff is that we passed a fulfillment code to the back-end database at corporate that triggered the warehouse mailing the right piece of junk to the new member’s address.

About a week and a half into the campaign our customer support center started getting lots of angry phone calls: “Hey! I requested the egg scrambler and got this crappy book light instead.”  This provoked a very urgent call from one of the supervisors at the call center.  I said it sounded like there was a problem somewhere in the chain from our web site into fulfillment and I’d get to the bottom of it, and in the meantime we agreed that the best policy was to tell the customer to keep the incorrectly shipped junk as our gift and we’d also send them the junk they requested.

Once we started our investigation, the problem was immediately obvious.  We had an email from the fulfillment folks with the code numbers for the various items, and those were the code numbers programmed into our application. However, when we checked the list of fulfillment codes against the back-end data dictionary, we realized that they’d transposed the numbers for the various items when they sent us the email.  Classic snafu and an honest mistake.  Once we figured out the problem, it took seconds to fix the codes and only a few minutes to run off a report listing all of the new members who were very shortly going to be receiving the wrong items.

So the question then became how to communicate the problem and the resolution to the rest of the business.  I settled for simplicity over blame:

“We have recently been made aware of a problem with the fulfillment process in the new rollout of member service XYZ, resulting in new members receiving the wrong promotional items.  This was due to fulfillment codes being incorrectly entered into the web application.  We have corrected the problem and have provided Customer Service and Fulfillment with the list of affected members so that they can ship the appropriate items immediately.”

You will note that I carefully didn’t specify whose “fault” it was that the incorrect codes were inserted into the application.  I’m sure the rest of the business assumed it was my team’s fault.  I’m sure of this because the Product Manager in charge of the campaign called me less than fifteen minutes after I sent out the email and literally screamed at me– I was holding the phone away from my ear– that he knew it wasn’t our fault (he’d seen all the email traffic during the investigation) and how could I let the rest of the company assume we were at fault?

And I told him what I’m going to tell you now: nobody else cared whose fault it was.  Fulfillment was grateful that I’d jumped on this particular hand grenade and saved them from the shrapnel.  My management was impressed that we’d resolved the problem less than two hours after the initial reports and had further produced a list of the affected members so that Customer Service could get out ahead of the problem rather than waiting for irate customers to call.  Total cost was minimal because we caught it early and addressed it promptly.

And that’s the bottom line: all that throwing blame around would have done was make people angry and lengthen our time to resolution.  Finding somebody to blame doesn’t make you feel justified or more fulfilled somehow; it just makes you tired and frustrated.  So always try to short-circuit the blame loop and move straight into making things better.

Hal Pomeranz, Deer Run Associates

At the end of our recent SANS webcast, Mike Poor closed by emphasizing how important it was for IT and Information Security groups to advertise their operational successes to the rest of the organization (and also to their own people).  Too often these functions are seen as pure cost centers, and in these difficult economic times it’s up to these organizations to demonstrate return value or face severe cutbacks.

The question is what are the right metrics to publish in order to indicate success?  All too often I see organizations publishing meaningless metrics, or even metrics that create negative cultures that damage corporate perception of the organization:

  • It seems like a lot of IT Ops groups like to publish their “look how much stuff we operate” metrics: so many thousand machines, so many petabytes of disk, terabytes of backup data per week, etc.  The biggest problem with these metrics is that they can be used to justify massive process inefficiencies.  Maybe you have thousands of machines because every IT project buys its own hardware and you’re actually wasting money and resources that could be saved by consolidating.  Besides, nobody else in the company cares how big your… er, server farm is.
  • Then there are the dreaded help desk ticket metrics: tickets closed per week, average time to close tickets, percentage of open tickets, etc.  The only thing these metrics do is incentivize your help desk to do a slapdash job and thereby annoy your customers.  There’s only one help desk metric that matters: customer satisfaction.  If you’re not doing customer satisfaction surveys on EVERY TICKET and/or you’re not getting good results then you fail.

So what are some good metrics?  Well I’m a Visible Ops kind of guy, so the metrics that matter to me are things like amount of unplanned downtime (drive to zero), number of successful changes requiring no unplanned work or firefighting (more is better), number of unplanned or unauthorized changes (drive to zero), and projects completed on time and on-budget (more is better).  Of course, if your IT organization is struggling, you might be tempted to NOT publish these metrics because they show that you’re not performing well.  In these cases, accentuate the positive by publishing your improvement numbers rather than the raw data: “This month we had 33% less unplanned downtime than last month.”  This makes your organization look proactive and creates the right cultural imperatives without airing your dirty laundry.

There are a couple of other places where I never fail to toot my own horn:

  • If my organization makes life substantially better for another part of the company then you’d better believe I’m going to advertise that fact.  For example, when my IT group put together a distributed build system that cut product compiles down from over eight hours to less than one hour, it not only went into our regular status roll-ups, but I also got the head of the Release Engineering group to give us some testimonials as well.
  • Whenever a significant new security vulnerability comes out that is not an issue for us because of our standard builds and/or operations environment, I make sure the people who provide my budget know about it.  It also helps if you can point to “horror story” articles about the amount of money other organizations have had to pay to clean up after incidents related to the vulnerability.  This is one of the few times that Information Security can demonstrate direct value to the organization, and you must never miss out on these chances.

What’s That Smell?

If communicating your successes builds a corporate perception of your organization’s value, being transparent about your failures builds trust with the rest of the business.  If you try to present a relentlessly positive marketing spin on your accomplishments your “customers” elsewhere in the company will become suspicious.  Plus you’ll never bamboozle them sufficiently with your wins that they won’t notice the elephant in the room when you fall on your face.

The important things to communicate when you fail are that you understand what led to the failure, that you have the situational awareness to understand the impact of the failure on the business, and that you are taking steps to make sure that the same failure never happens again (the only real organizational failure is allowing the same failure to happen twice).  Here’s a simple checklist of items you should have in your disclosure statement:

  • Analysis of the process(es) that led to the failure
  • The duration of the outage
  • How the outage was detected
  • The systems and services impacted
  • Which business units were impacted and in what way
  • Actions taken to end the outage
  • Corrective processes to make sure it never happens again

Note that in some cases it’s necessary to split the disclosure across 2-3 messages.  One is sent during the incident telling your constituents, “Yes, there’s a problem and we’re working it.”  The next is the “services restored at time X, more information forthcoming” message.  And then finally your complete post-mortem report.  Try to avoid partial or incomplete disclosure or idle speculation without all of the facts– you’ll almost always end up with egg on your face.


If you don’t communicate what’s happening in your IT and/or InfoSec organization then the other business units are basically going to assume you’re not doing anything during the time when you’re not directly working on their requests. This leads to the perception of IT as nothing more than “revenue sucking pigs”.

However, you also have to communicate in the right way.  This means communicating worthwhile metrics and metrics which don’t create bad cultural imperatives for your organization.  And it also means being transparent and communicating your failures– in the most proactive way possible– to the rest of the organization.

Calabrese’s Razor

February 26, 2009

Hal Pomeranz, Deer Run Associates

I’ve long held the opinion that the community of “Information Security Experts” agree with each other 90% of the time, but waste 90% of their time arguing to the death with other InfoSec Experts about the remaining 10%.  This was painfully brought home to me several years ago as I was facilitating the consensus process around the Solaris Security document published by the Center for Internet Security.  You won’t believe the amount of time we spent arguing about seemingly trivial things like, “Should the system respond to echo broadcast?”  And as the consensus circle widened, we ended up wasting more time on these issues and repeating debates over and over again as new people joined the discussion.  In short, it was killing us.  People were burning out and failing to provide constructive feedback and we were failing to deliver updates in a timely fashion.

I see these kinds of debates causing similar mayhem in the IT Ops and InfoSec groups at many organizations.  The problem is that in these cases the organizations are not simply debating the content of a document full of security recommendations, they’re arguing about matters of operational policy.  This seems to promote even more irrational passions, and also raises the stakes for failing to come to consensus and actually move forward.

At the low point of our crisis at the Center for Internet Security, the person who was most responsible for finding the solution was Chris Calabrese, who was facilitating the HP-UX benchmark for the Center. At roughly the same time as our issues at the Center, the IT Ops and InfoSec teams at Chris’ employer had gotten bogged down over similar kinds of issues and had decided to come up with an objective metric for deciding which information security controls were important and which ones were just not worth arguing about.  Suddenly the discussion of these issues was transformed from matters of opinion to matters of fact.  Consensus arrived quickly and nobody’s feelings got hurt.

Overview of the Metric

So we decided to adapt the metric that Chris had used to our work at the Center.  After some discussion, we decided that the metric had to account for two major factors: how important the security control was and how much negative operational impact the security control would impose.  Each of the two primary factors was made up of other components.

For example, the factors relating to the relative importance of a security control include:

  • Impact (I): Is the attack just a denial-of-service condition, or does it allow the attacker to actually gain access to the system? Does the attack allow privileged access?
  • Radius (R): Does the attack require local access or can it be conducted in an unauthenticated fashion over the network?
  • Effectiveness (E): Does the attack work against the system’s standard configuration, or is the control in question merely a backup in case of common misconfiguration, or even just a “defense in depth” measure that only comes into play after the failure of multiple controls?

Similarly, the administrative impact of a control was assessed based on two factors:

  • Administrative Impact (A): Would the change require significant changes to current administrative practice?
  • Frequency of Impact (F): How regularly would this impact be felt by the Operations teams?

The equation for deciding which controls were important simply evolved to: “(I * R * E) – (A * F)”.  In other words, multiply the terms related to the importance of the control to establish a positive value and then subtract the costs due to the administrative impact of the control.

The only thing missing was the actual numbers.  It turns out a very simple weighting scheme is sufficient:

  • Impact (I): Score 1 if attack is a denial-of-service, 2 if the attack allows unprivileged access, and 3 if the attack allows administrative access (or access to an admin-equivalent account like “oracle”, etc)
  • Radius (R): Score 1 for attacks that require physical access or post-authenticated unprivileged access, and 2 for remote attacks that can be conducted by unauthenticated users
  • Effectiveness (E): Score 1 if the control requires multiple configuration failures to be relevant, 2 if the control is a standard second-order defense for common misconfiguration, and 3 if the attack would succeed against standard configurations without the control in question
  • Administrative Impact (A): Score 1 if the administrative impact is insignificant or none, 2 if the control requires modifications to existing administrative practice, and 3 if the control would completely disable standard administrative practices in some way
  • Frequency of Impact (F): Score 1 if the administrative impact is to a non-standard process or arises less than once per month, 2 if the administrative impact is to a standard but infrequent process that occurs about once per month, and 3 if the impact is to a regular or frequent administrative practice

In the case where a single control can have different levels of impact in different scenarios, what turned out best for us (and avoided the most arguments) was to simply choose the highest justifiable value for each term, even if that value was not the most common or likely impact.

Applying the Metric

Let’s run the numbers on a couple of controls and see how this works out.  First we’ll try a “motherhood and apple pie” kind of control– disabling unencrypted administrative access like telnet:

  • Impact (I): Worst case scenario here is that an attacker hijacks an administrative session and gains control of the remote system.  So that’s administrative level access, meaning a score of 3 for this term.
  • Radius (R): Anybody on the network could potentially perform this attack, so this term is set to 2.
  • Effectiveness (E): Again you have to go with the maximal rating here, because the session hijacking threat is a standard “feature” of clear-text protocols– score 3.
  • Administrative Impact (A): Remember, we’re not discussing replacing clear-text administrative protocols with encrypted protocols at this point (justifying encrypted access is a separate conversation).  We’re discussing disabling unencrypted access, so the score here is 3 because we’re planning on completely disabling this administrative practice.
  • Frequency of Impact (F): If telnet is your regular router access scheme, then this change is going to impact you every day.  Again, the score is then 3.

So what’s the final calculation?  Easy: (3 * 2 * 3) – (3 * 3) = 9.  What’s that number mean?  Before I answer that question, let’s get another point of comparison by looking at a more controversial control.

We’ll try my own personal nemesis, the dreaded question of whether the system should respond to echo broadcast packets:

  • Impact (I): Worst case scenario here ends up being a denial of service attack (e.g. “smurf” type attack), so score 1.
  • Radius (R): Depends on whether or not your gateways are configured to pass directed broadcast traffic (hint: they shouldn’t be), but let’s assume the worst case and score this one a 2.
  • Effectiveness (E): Again, being as pessimistic as possible, let’s assume no other compensating controls in the environment and score this one a 3.
  • Administrative Impact (A): The broadcast ping supporters claim that disabling broadcast pings makes it more difficult to assess claimed IP addresses on a network and capture MAC addresses from systems (the so-called “ARP shotgun” approach).  Work-arounds are available, however, so let’s score this one a 2.
  • Frequency of Impact (F): In this case, we have what essentially becomes a site-specific answer.  But let’s assume that your network admins use broadcast pings regularly and score this one a 3.

So the final answer for disabling broadcast pings is: (1 * 2 * 3) – (2 * 3) = 0.  You could quibble about some of the terms, but I doubt you’re going to be able to make a case for this one scoring any higher than a 2 or so.
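
Since the razor really is just arithmetic, it is trivial to encode.  Here is a tiny Python sketch that reproduces the two examples above, the telnet score of 9 and the broadcast-ping score of 0:

    # Calabrese's Razor: (Impact * Radius * Effectiveness) - (Admin impact * Frequency)
    def razor_score(I, R, E, A, F):
        return (I * R * E) - (A * F)

    # Disabling unencrypted (telnet) administrative access: clearly worth doing.
    print(razor_score(I=3, R=2, E=3, A=3, F=3))   # 9

    # Disabling responses to echo broadcasts: not worth arguing about.
    print(razor_score(I=1, R=2, E=3, A=2, F=3))   # 0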

Interpreting the Scores

Once we followed this process and produced scores for all of the various controls in our document, a dominant pattern emerged.  The controls that everybody agreed with had scores of 3 or better.  The obviously ineffective controls were scoring 0 or less.  That left items with scores in the 1-2 range as being “on the bubble”, and indeed many of these items were generating our most enduring arguments.

What was also clear was that it wasn’t worth arguing about the items that only came in at 1 or 2.  Most of these ended up being “second-order” type controls for issues that could be mitigated in other ways much more effectively and with much less operational impact.  So we made an organizational decision to simply ignore any items that failed to score at least 3.

As for arguments about the weighting of the individual terms, these tended to be few and far between.  Part of this was our adoption of a “when in doubt, use the maximum justifiable value” stance, and part of it was due to choosing a simple weighting scheme that didn’t leave much room for debate.  Also, once you start plugging the numbers in, it’s obvious that arguing over a 1-point change in a single term usually isn’t enough to counteract the other factors and push a given control over the qualifying score of 3.

Further Conclusions

What was also interesting about this process is that it gave us an objective measure for challenging the “conventional wisdom” about various security controls.  It’s one thing to say, “We should always do control X”, and quite another to have to plug numbers for the various terms related to “control X” into a spreadsheet.  It quickly becomes obvious when a control has minimal security impact in the real world.

This metric also channelled our discussion into much more productive and much less emotional avenues.  Even the relatively coarse granularity of our instrument was sufficient to break our “squishy” matters of personal opinion into discrete, measurable chunks.  And once you get engineers talking numbers, you know a solution is going to emerge eventually.

So when your organization finds itself in endless, time-wasting discussions regarding operational controls, try applying Chris’ little metric and see if you don’t rapidly approach something resembling clarity.  Your peers will thank you for injecting a little sanity into the proceedings.

Chris Calabrese passed away a little more than a year ago from a sudden and massive heart attack, leaving behind a wife and children.  His insight and quiet leadership are missed by all who knew him.  While Chris developed this metric in concert with his co-workers and later with the input of the participants in the Center for Internet Security’s consensus process, I have chosen to name the metric “Calabrese’s Razor” in his memory.