Lately Gene Kim, Kevin Behr, and I have been on a nearly messianic crusade against IT suckage. Much of our discussion has centered around The Visible Ops Handbook that Gene and Kevin co-authored with George Spafford. Visible Ops is an extremely useful playbook containing four steps that IT groups can follow to help them become much higher performing organizations.
However, I will admit that Visible Ops is sometimes a hard sell. That’s because the first step of Visible Ops is to create a working change management process within the IT organization– with functional controls and real consequences for people who subvert the change management process. Aside from being a difficult task in the first place, the mere concept of change management causes many IT folks to start looking for an exit. “We hate change management!” they say. “Don’t do this to us!” What I quickly try to explain to them is that they don’t hate change management, they just hate bad change management. And, unfortunately, bad change management is all they’ve experienced to date, so they don’t know there’s a better way.
What are some of the hallmarks of bad change management processes? See if any of these sound familiar to you:
1. Just a box-checking exercise: The problem here is usually that an organization has implemented change management only because their auditors told them they needed it. As a result, the process is completely disconnected from the actual operational work of IT in the organization. It’s simply an exercise in filling out and rubber-stamping whatever ridiculous forms are required to meet the letter of the auditors’ requirements. It does not add value or additional confidence to the process of making updates in the environment. Quite the contrary, it’s just extra work for an already overloaded operations staff.
2. No enforcement: The IT environment has no controls in place to detect changes, much less unauthorized changes. If the process is already perceived as just a box-checking exercise and IT workers know that no alarms will be raised if they make a change without doing the paperwork, do you think they’ll actually follow the change management process? Visible Ops has a great story about an organization that implemented a change management process without controls. In the second month, reported changes were down by 50%, and they dropped another 20% in month three, yet the organization was still in chaos and fighting with constant unplanned outages. When they finally implemented automated change controls, they discovered that the rate of changes was constant– it was only the rate of paperwork that was declining.
3. No accountability: What does the organization do when they detect an unauthorized change? The typical scenario is when a very important member of the operations or development staff makes an unauthorized change that ends up causing a significant outage. Often this is where IT management fails their “gut check”– they fear angering this critical resource, and so the perpetrator ends up getting at worst a slap on the wrist. Is it any wonder, then, that the rest of the organization concludes that management is not taking the change management process seriously, and that the entire process can safely be ignored without individual consequences?
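The detection gap described in points 2 and 3 boils down to a simple reconciliation: compare what actually changed against what was authorized. A minimal sketch in Python follows; the record layout and field names are invented for illustration, not taken from any real change-management product:

```python
# Hypothetical sketch: flag detected changes that have no matching
# authorization record. Field names are illustrative only.

def find_unauthorized(detected, approved):
    """Return detected changes whose change_id has no approved record."""
    approved_ids = {rec["change_id"] for rec in approved}
    return [chg for chg in detected if chg.get("change_id") not in approved_ids]

approved = [{"change_id": "CHG-1041", "system": "web01"}]
detected = [
    {"change_id": "CHG-1041", "system": "web01"},  # the paperwork was filed
    {"system": "web01"},                           # no ticket at all
]

unauthorized = find_unauthorized(detected, approved)  # just the second entry
```

In real life, `detected` would be fed by an automated control (a file integrity monitor, a config-diff job), which is exactly the piece the organization in the story above was missing.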
I firmly believe that change management can actually help an organization get things done faster, rather than slower. Seems counter-intuitive, right? Let me give you some recommendations for improving your change management process and talk about why they make things better:
1. Ask the right questions: What systems, processes, and business units will be affected? During what window will the work be done? Has this change been coordinated with the affected business units and how has it been communicated? What is the detailed implementation plan for performing the change? How will the change be tested for success? What is the back-out plan in case of failure?
Asking the right questions will help the organization achieve higher rates of successful changes, which means less unplanned work. And unplanned work is the great weight that’s crushing most low-performing IT organizations. As my friend Jim Hickstein so eloquently put it, “Don’t do: think, type, think, type, think, type, ‘shit’! Do: think, think, think, type, type, type, ‘beer’!” Also, coordinating work properly with other business units means less business impact and greater overall availability.
2. Learn lessons: The first part of your change management meetings should be reviewing completed changes from the previous cycle. Pay particular attention to changes that failed or didn’t go smoothly. What happened? How can we make sure it won’t happen next time? What worked really well? Like most processes, change management should be subject to continuous improvement. The only real mistake is making the same mistake twice.
Again, the goal of these post-mortems should be to drive down the amount of unplanned work that results from changes in the IT environment. But hopefully you’ll also learn to make changes better and faster, as well as to streamline the change management process itself.
3. Keep appropriate documentation: Retain all documentation around change requests, approvals, and implementation details. The most obvious reason to do this is to satisfy your auditors. If you do a good job organizing this information as part of your change management process, then supplying your auditors with the information they need really should be as easy as hitting a few buttons and generating a report out of your change management database.
However, where all this documentation really adds value on a day-to-day basis is when you can tie the change management documentation into your problem resolution system. After all, when you’re dealing with an unplanned outage on a system, what’s the first question you should be asking? “What changed?” Well, what if your trouble tickets automatically populated themselves with the most recent set of changes associated with the system(s) that are experiencing problems? Seems like that would reduce your problem resolution times and increase availability, right? Well guess what? It really does.
4. Implement automated controls and demand accountability: If you want people to follow the change management process, they have to know that unplanned changes will be detected and consequences will ensue. As I mentioned above, management is sometimes reluctant to follow through on the “consequences” part of the equation. They feel like they’re held hostage to the brilliant IT heroes who are saving the day on a regular basis yet largely ignoring the change management process. What management needs to realize is that it’s these same heroes who are getting them into trouble in the first place. The heroes don’t need to be shown the door, just moved into a role– development perhaps– where they don’t have access to the production systems.
Again, the result is less unplanned work and higher availability. However, it’s also my experience that having automated change controls also teaches you a huge amount about the way your systems and the processes that run on them are functioning. This greater visibility and understanding of your systems leads to a higher rate of successful changes.
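The “What changed?” integration from recommendation 3 lends itself to a concrete sketch. Here an invented in-memory change log stands in for a real change-management database; in practice `recent_changes` would be a query against that database, but the join logic is the same:

```python
import datetime

# Invented in-memory change log standing in for a change-management database.
CHANGE_LOG = [
    {"system": "db02", "change_id": "CHG-2001", "date": datetime.date(2009, 3, 18)},
    {"system": "db02", "change_id": "CHG-2007", "date": datetime.date(2009, 3, 23)},
    {"system": "web01", "change_id": "CHG-2003", "date": datetime.date(2009, 3, 20)},
]

def recent_changes(system, since):
    """Changes recorded against `system` on or after `since`, newest first."""
    hits = [c for c in CHANGE_LOG if c["system"] == system and c["date"] >= since]
    return sorted(hits, key=lambda c: c["date"], reverse=True)

def open_ticket(system, summary, lookback_days=7, today=None):
    """Open a trouble ticket pre-populated with that system's recent changes."""
    today = today or datetime.date.today()
    since = today - datetime.timedelta(days=lookback_days)
    return {"system": system, "summary": summary,
            "recent_changes": recent_changes(system, since)}
```

The first question a responder asks is answered before they ask it: the ticket arrives already carrying the last week of changes to the affected system.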
The great thing about the steps in Visible Ops is that each step gives back more resources to the organization than it consumes. The first step of implementing proper and useful change management processes is no exception. You probably won’t get it completely right initially, but if you’re committed to continuous improvement and accountability, I think you’ll be amazed at the results.
Benchmarking of the high-performing IT organizations identified in Visible Ops found that they performed 14 times more changes with one-quarter the change failure rate of low-performing organizations, and furthermore had one-third the amount of unplanned work and 10x faster resolution times when problems did occur. For the InfoSec folks in the audience, these organizations were five times less likely to experience a breach and five times more likely to detect one when it occurred. Further, these organizations spent one-third the time on audit prep compared to low-performing organizations and had one-quarter the number of repeat audit findings.
If change management is the first step on the road to achieving this kind of success, why wouldn’t you sign up for it?
March 26, 2009
I’m eternally amazed at how much cheaper computers, disks, networking gear, and pretty much everything IT-related has become since I started working in this industry. In general, it’s a great thing. But my friend Bill Schell pointed out one of the darker aspects of this trend during a recent email exchange. Back in the mid-90s Bill was running the Asia-Pacific network links for a large multi-national. The “hub” of the network was a large Cisco router that cost upwards of a quarter of a million dollars. As Bill pointed out, the company thought nothing of paying Bill a loaded salary of roughly half the purchase price of that router in order to keep it and the corporate WAN running smoothly.
Fifteen years later, you can get the same functionality in a device that costs an order of magnitude or two less. And guess what? Companies are expecting the costs associated with supporting these devices and the services they provide to be dropping at roughly the same rate as the cost of the equipment. This translates to loss of IT jobs, or at least their migration to other IT initiatives. It doesn’t matter that the newer, cheaper devices provide the same functionality as the more expensive equipment they’re replacing, and are perhaps even more complicated to manage. Nor does it matter that the organization is expecting the same service levels or indeed even increased support for new applications and protocols. “Do more with less” is the mantra.
This trend has all sorts of implications: hidden inefficiencies because reduced support levels impact critical business processes, significant security holes allowed to remain open due to insufficient levels of staffing and expertise, etc. But what I want to talk about today is the implications for the career path of my fellow IT workers who are reading this blog. And let me cut right to the bottom line. If you want your IT career to be long and profitable, make sure you’re supporting technology that costs a lot of money. When you see the price of the equipment you’re managing dropping precipitously, start retraining on something new.
Let me give you an example from the early part of my career. My first job out of college was doing IT support in an environment where they were dumping their Vax systems that cost hundreds of thousands of dollars for Unix workstations that cost tens of thousands of dollars. Bye-bye Vax administrators, welcome the new, smaller coterie of workstation admins. And it’s worth noting also that the Vax admins had replaced a small army of mainframe support folks from the previous generation.
And now 20 years later, commodity hardware and virtualization are forcing my generation of system administrators to move up the food chain in search of employment. Some folks were lucky enough to keep their jobs in pursuit of server consolidation efforts, but notice that they’re now supporting orders of magnitude more systems in order to justify their salaries in the face of reduced equipment costs. Storage technology was a nice pot of money to chase for a while there, and many of my people made the transition into SAN administration and similar jobs. But again downward price pressure is being felt in this arena and the writing is on the wall– “do more with less.”
Some IT career choices seem to have historically provided safe havens. The cost of database installations seems to have held steady or even increased as organizations have wanted to harness the power of larger and larger data sets and as the number of databases in organizations has exploded. So DBA has always been a good career choice. Information Security has also been a steady career choice because its budget is typically a constant fraction of total IT spending, rather than being tied to any particular technology. Plus all of the recent regulatory requirements have ensured that Information Security’s percentage of the total IT budget has been going up, even as total IT budgets are shrinking.
So please keep these thoughts in the back of your mind as you’re plotting your next career moves in this difficult economy. I’ve seen too many good friends pushed out the door in the name of “efficiency”.
March 24, 2009
March 24 is Ada Lovelace Day. To honor one of the first female computer scientists, the blogosphere has committed to posting articles about women role models in the computer industry. This is certainly a scheme that I can get behind, and it also gives me the opportunity to talk about one of my earliest mentors.
When I graduated from college in the late 1980’s, my first job was doing Unix support at AT&T Bell Labs Holmdel. I learned a huge amount at that job, and a lot of it was due to my manager, Barbara Lee. “Tough broad” are the only words I can think of to describe Barbara, and I think she’d actually take those words as a compliment. Completely self-taught, Barbara had worked her way up from the bottom and had finally smacked into a glass ceiling after becoming manager of the Unix administrators for the Holmdel Computing Center. Barbara was also extremely active in the internal Bell Labs Computer Security Forum, and had earned her stripes tracking down and catching an attacker who had been running rampant on the Bell Labs networks many years earlier.
My vivid mental picture of Barbara is her banging away on her AT&T vt100 clone, composing some crazy complex ed or sed expression to pull off some amazing Unix kung fu, while occasionally taking drags on her cigarette (yes kids, you could still smoke in offices in those days). Unfortunately, it was those cigarettes that ultimately led to Barbara’s death.
As tough and combative as Barbara was when dealing with most people, she also had a strong caring streak that she mostly kept hidden. Part Cherokee, Barbara arranged for much of our surplus equipment to make it to reservation schools whenever possible. As I recall, we even shipped an entire DEC Vax to a reservation while I was there. I always wondered what they did with that machine, but I’m sure it got put to good use.
And though she didn’t suffer fools gladly, Barbara occasionally took ignorant young savages like me under her wing. Seeing that I had an interest in computer security, Barbara actually took me along to some of the Bell Labs Computer Security Forum meetings and to the USENIX Security Conference. Less than a year out of college and I was getting to hang with folks like Bill Cheswick and Steve Bellovin. How cool was that? Without this early prodding from Barbara, I doubt my career would have turned out the way it did.
My favorite Barbara Lee story, however, involves an altercation I got into with the manager of another group. At Bell Labs, the Electricians’ Union handled all wiring jobs, including network wiring. I was doing a network upgrade one weekend and had arranged for the Electricians to run the cabling for me in advance of the actual cutover. Unfortunately, Friday afternoon rolled around and the wiring work hadn’t even been started.
So I called the manager for that group and asked what the status was. He told me that he was understaffed due to a couple of his people being unexpectedly out of the office and wouldn’t be able to get the work done. The conversation went downhill from there, and ended up with me getting a verbal reaming and the promise of the Union taking the matter up with Barbara first thing Monday morning.
Needless to say, I was sweating bullets all weekend. And I can remember the sinking feeling in the pit of my stomach when Barbara walked into my office Monday morning. “Hal,” she said to me, “you just can’t talk to other managers like you talk to me.” Then she turned around and walked out and never said another word to me about the incident.
I’d have walked through fire for that woman.
March 23, 2009
Some months ago, a fellow Information Security professional posted to one of the mailing lists I monitor, looking for security arguments to refute the latest skunkworks project from her sales department. Essentially, one of the sales folks had developed a thick client application that connected to an internal customer database. The plan was to equip all of the sales agents in the field with this application and allow them to connect directly back through the corporate firewall to the production copy of the database over an unencrypted link. This seemed like a terrible idea, and the poster was looking to marshal arguments against deploying this software.
The predictable discussion ensued, with everybody on the list enumerating the many reasons why this was a bad idea from an InfoSec perspective and in some cases suggesting work-arounds to spackle over deficiencies in the design of the system. My advice was simpler– refute the design on Engineering principles rather than InfoSec grounds. Specifically:
- The system had no provision for allowing the users to work off-line or when the corporate database was unavailable.
- While the system worked fine in the corporate LAN environment, bandwidth and latency issues over the Internet would probably render the application unusable.
Sure enough, when confronted with these reasonable engineering arguments, the project was scrapped as unworkable. The Information Security group didn’t need to waste any of their precious political capital shooting down this obviously bad idea.
This episode ties into a motto I’ve developed during my career: “Never sell security as security.” In general, Information Security only gets a limited number of trump cards they can play to control the architecture and deployment of all the IT-related projects in the pipeline. So anything they can do to create IT harmony and information security without exhausting their hand is a benefit.
It’s also useful to consider my motto when trying to get funding for Information Security related projects. It’s been my experience that many companies will only invest in Information Security a limited number of times: “We spent $35K on a new firewall to keep the nasty hackers at bay and that’s all you get.” To achieve the comprehensive security architecture you need to keep your organization safe, you need to get creative about aligning security procurement with other business initiatives.
For example, file integrity assessment tools like Tripwire have an obvious forensic benefit when a security incident occurs, but the up-front cost of acquiring, deploying, and using these tools just for the occasional forensic benefit often makes them a non-starter for organizations. However, if you change the game and point out that the primary ongoing benefit of these tools is as a control on your own change management processes, then they become something that the organization is willing to pay for. You’ll notice that the nice folks at Tripwire realized this long ago and sell their software as “Configuration Control”, not “Security”.
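At its core, the configuration-control use of a file-integrity tool is just a hash-and-compare loop. Here’s a toy sketch of the idea; real products like Tripwire add a secured baseline database, scheduling, and reporting, all of which this deliberately omits:

```python
import hashlib
import os
import tempfile

def snapshot(paths):
    """Map each file path to the SHA-256 digest of its contents."""
    digests = {}
    for path in paths:
        with open(path, "rb") as f:
            digests[path] = hashlib.sha256(f.read()).hexdigest()
    return digests

def changed_files(baseline, current):
    """Paths whose digest differs (or is missing) since the baseline."""
    return sorted(p for p in baseline if current.get(p) != baseline[p])

# Tiny demonstration with a scratch file standing in for a config file.
workdir = tempfile.mkdtemp()
conf = os.path.join(workdir, "app.conf")
with open(conf, "w") as f:
    f.write("port = 80\n")
baseline = snapshot([conf])

with open(conf, "w") as f:   # somebody edits the file out-of-band
    f.write("port = 8080\n")
drift = changed_files(baseline, snapshot([conf]))
```

Run the comparison on a schedule and reconcile `drift` against your approved change records, and you have the automated change control discussed earlier.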
Sometimes you can get organizational support from even further afield. I once sold an organization on using sudo with the blessings of Human Resources because it streamlined their employee termination processes: nobody knew the root passwords, so the passwords didn’t need to get changed every time somebody from IT left the company. When we ran the numbers, this turned out to be a significant cost-savings for the company.
So be creative and don’t go into every project with your Information Security blinders on. There are lots of projects in the pipeline that may be bad ideas from an Information Security perspective, but it’s likely that they have other problems as well. You can use those problems as leverage to implement architectures that are more efficient and rational from an Engineering as well as from an Information Security perspective. Similarly there are critical business processes that the Information Security group can leverage to implement necessary security controls without necessarily spending Information Security’s capital (or political) budget.
March 19, 2009
Recently I was reading another excellent blog post from Peter Thomas where he was discussing different metaphors for IT projects. As Peter points out, it’s traditional to schedule IT projects as if they were standard real-world construction projects like building a skyscraper. Peter writes in his blog:
Building tends to follow a waterfall project plan (as do many IT projects). Of course there may be some iterations, even many of them, but the idea is that the project is made up of discrete base-level tasks whose duration can be estimated with a degree of accuracy. Examples of such a task might include writing a functional specification, developing a specific code module, or performing integration testing between two sub-systems. Adding up all the base-level tasks and the rates of the people involved gets you a cost estimate. Working out the dependencies between the base-level tasks gets you an overall duration estimate.
Peter goes on to have some wise thoughts about why this model may not be appropriate for specific types of IT projects, but his description above got me thinking hard about an IT project management issue that I’ve had to grapple with during my career. The problem is that the kind of planned project work that Peter is discussing above is only one type of work that your IT staff is engaged in. Outside of the deliverables they’re responsible for in the project schedule, your IT workers also have routine recurring maintenance tasks that they must perform (monitoring logs, shuffling backup media, etc) as well as losing time to unplanned work and outages. To stretch our construction analogy to its limits, it’s as if you were trying to build a skyscraper with a construction crew that moonlighted as janitors in the neighboring building and were also on-call 24×7 as the local volunteer fire department. You were expecting the cement for the foundation to get poured on Thursday, but the crew was somewhere else putting out a fire and when they got done with that they had to polish the floors next door, so now your skyscraper project plan is slipping all over the place.
I’ve developed some strategies for dealing with these kinds of issues, but I don’t feel like I’ve discovered the “silver bullet” for creating predictability in my IT project schedules. Certainly one important factor is driving down the amount of unplanned work in your IT environment. Constant fire fighting is a recipe for failure in any IT organization, but how to fix this problem is a topic for another day. Another important strategy is to rotate your “on-call” position through the IT group so that only a fraction of your team is engaged in fire fighting activities in any given week. When a person is on-call, I normally mark them as “unavailable” on my project schedule just as if they were out of the office, and then resource leveling allows you to more accurately predict the dates for the deliverables they’re responsible for.
Finally, I recognize that IT workers almost never have 100% of their time available to work on IT projects, and I set their project staffing levels accordingly. I may only be able to schedule 70% of Charlene’s time to project Whiz-Bang, because Charlene is our Backup Diva and loses 30% of her time on average to routine backup maintenance issues and/or being called in to resolve unplanned issues with the backup system. And notice the qualifier “on average” there– some weeks Charlene may get caught up in dealing with a critical outage with the backup system and not be able to make any progress on her Project Whiz-Bang deliverables. When weeks like this happen, you hope that Charlene’s deliverables aren’t on the critical path and that she can make up the time later in the project schedule– or you bring in other resources to pick up the slack.
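Assuming a simple linear model (and ignoring context-switching overhead, which in practice makes things worse), the staffing arithmetic above is just effort divided by availability:

```python
import math

def calendar_days(work_days, availability_pct):
    """Calendar working days needed to deliver `work_days` of effort from
    someone who can give the project only `availability_pct` percent of
    their time."""
    return math.ceil(work_days * 100 / availability_pct)

# Charlene at 70% availability: 21 days of project effort
# stretches to 30 working days on the calendar.
charlene = calendar_days(21, 70)
```

Scheduling Charlene as if she were 100% available would have her “slipping” by nine days on this one task alone, through no fault of her own.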
Which brings me to another important piece of strategy I’ve picked up through the years: IT project slippage is inevitable, so you want to catch it as quickly as possible. The worst thing that can happen is that you get to the milestone date for a multi-week deliverable only to discover that work on this segment of the project hasn’t even commenced. This means you need to break your IT projects down into small deliverables that can be tracked individually and continuously. I’m uncomfortable unless the lowest-level detail in my project schedule has durations of a few days or less. Otherwise your project manager is almost guaranteed to be receiving some nasty surprises much too late to fix the problem.
These are some of the strategies I’ve come up with for managing IT projects, but I still admit to a fair amount of trepidation when wrangling large IT efforts. I’m curious to hear whether any of you reading this blog have useful strategies that you’ve developed for managing IT projects in your environment. Let’s discuss them in the comments!
March 12, 2009
Early in my career, I had the opportunity to listen to a talk by Bill Howell on “managing your manager”. I don’t recall much about the talk, but one item that stuck with me was his advice, “Never argue with your boss, because even if you ‘win’, you lose.”
At the time, I was young and cocksure and tended towards confrontation in my interactions with co-workers. If I disagreed with somebody, we each threw down our best technical arguments, wrangled over the problem, and may the biggest geek win. Being “right” was the most important thing. So Bill’s advice seemed outright wrong to me at the time. Of course one should argue with their boss! If they were “wrong”, then let’s mix it up and get to the “correct” solution.
Flash forward a few years, and I was working as a Senior Sys Admin at a company in the San Francisco Bay Area. We were trying to roll out a new architecture for supporting our developer workstations, and I was clashing with my boss over the direction we should go in. Worse still, the rest of the technical team was in favor of the architecture that I was championing. True to form, I insisted on going for the no-holds-barred public discussion. This, of course, transformed the situation from a simple technical disagreement into my completely undercutting my boss’ authority and basically engineering a mutiny in his group.
Matters came to a head at our weekly IT all-hands meeting. Because of the problems our group was having, both my boss and his boss were in attendance. Discussion of our new architecture got pretty heated, but I had an answer for every single one of my boss’ objections to my plan. In short, on a technical level at least, I utterly crushed him. In fact, in the middle of the meeting he announced, “I don’t need this s—“, and walked out of the meeting. I had “won”, and boy was I feeling good about it.
Then I looked around the table at the rest of my co-workers, all of whom were staring at me with looks of open-mouthed horror. I don’t think they could have been more shocked if I had bludgeoned my boss to death with a baseball bat. And frankly I couldn’t blame them. If I was willing to engineer a scene like had just transpired in our all-hands meeting, how could they trust me as a member of their team? I might turn on them next. Suddenly I didn’t feel so great.
I went home that night and did a great deal of soul-searching. Bill Howell’s words came back to me, and I realized that he’d been right. Admittedly, my case was an extreme situation, but if I had followed Bill’s advice from the beginning, things need never have escalated to the pitch that they finally reached. The next morning, I went in and apologized to my boss and agreed to toe the line in the future, though it certainly felt like a case of too little too late. I also started looking for a new job, because I realized nobody there really wanted to work with me after that. I was gone a month later, and my boss lasted several more years.
My situation in this case was preventable. As I look back on it now, I realize that my boss and I could probably have worked out some face-saving compromise behind closed doors before having any sort of public discussion. Of course, sometimes you find yourself in an impossible situation, whether because of incompetence, malice, or venality on the part of your management. In these cases you can sit there and take it (hoping that things will get better), fight the good fight, or “vote with your feet” and seek alternate employment. The problem is that fighting the good fight often ends with you seeking alternate employment anyway, so be sure to start putting out feelers for a new job before entering the ring. Sitting there and taking it should be avoided if at all possible– I’ve seen too many of my friends’ self-esteem totally crippled by psycho managers.
Bottom line is that one of the most important aspects of any job is making your boss look good whenever possible. This doesn’t mean you can’t disagree with your boss. Just make sure that you don’t have those disagreements publicly and make it clear at all times that you’re not attempting to pre-empt your manager’s authority. “Managing up” is a delicate skill that needs to be honed with experience, but as a first step at least try to avoid direct, public disagreements with those above you in the management chain.
And thanks for the advice, Bill. Even if I didn’t listen to you the first time.
March 9, 2009
Fifteen years ago or more I was listening to a presentation by Vint Cerf where he was advocating for the adoption of CIDR as a solution to many of the routing issues the core Internet providers were facing at the time. In responding to his critics, he made an off-handed comment to the effect that, “People say to me, ‘Vint, you can have my IP prefix when you pry it out of my cold, dead fingers.’ To which I respond, ‘Remember you said dead.'”
Needless to say, this got a huge laugh out of the audience. But the kernel of this little comment is a nugget of IT wisdom that applies in so many different situations. To state it more plainly, there are times of compelling change when the most sensible course is to simply ignore the current installed base issues and just move forward. By the time you’ve finished your new roll-out, today’s installed base will be completely subsumed into the new technology.
In Vint’s case, his position was spectacularly vindicated of course, because the number of Internet-connected hosts grew by an order of magnitude during the period when CIDR was being rolled out. But this kind of thinking applies equally well on a smaller scale to operational issues faced by many IT organizations. Have new baseline images you want to roll out to your organization but are meeting resistance from your user community? Roll the images out on newly deployed systems only and wait for attrition to take care of the existing installed base. Given the cycle rate of technology in most organizations, that gives you a half-life of change in about 18 months.
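As a back-of-the-envelope check on that half-life figure, assume a roughly constant fraction of machines is replaced each month (an idealization; real refresh cycles are lumpier). The old installed base then decays geometrically:

```python
def old_fraction(monthly_turnover, months):
    """Fraction of the fleet still running the old image after `months`,
    assuming a constant monthly replacement rate."""
    return (1 - monthly_turnover) ** months

# Monthly turnover implied by a half-life of 18 months: about 3.8%/month.
rate = 1 - 0.5 ** (1 / 18)

# After three years, only a quarter of the old installed base remains.
remaining_36mo = old_fraction(rate, 36)
```

So patience plus normal hardware refresh does most of the migration for you, with no forklift project required.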
March 5, 2009
“A strange game. The only winning move is not to play.”
from the movie “War Games” (1983)
One classic pathology of low-performing IT organizations is that when an outage occurs they spend an inordinate amount of time trying to figure out whose fault it is, rather than working the problem. Dr. Jim Metzler has even coined a new metric for this activity: Mean Time To Innocence (MTTI), defined as the average time it takes each part of the IT Operations organization to demonstrate that it’s not responsible for the outage. Whether you call it a “Witch Hunt” or “The Blame Game” or identify it with some other term, it’s a huge waste of time and ends up making everybody involved look like a complete ignoramus. It’s also one of the classic signs to the rest of the business that IT Operations is completely out of touch, because otherwise they’d be trying to solve the problem rather than working so hard at finding out whose fault it is.
I’m so intolerant of this kind of activity that I will often accept the blame for things I’m not responsible for just so we can move out of the “blame” phase into the “resolution” phase. As the CIO in Metzler’s article so eloquently put it, “I don’t care where the fault is, I just want them to fix it.” At the end of the day, nobody will remember whose fault it is, because once the problem is addressed they’ll forget all about it in the rush of all the other things they have to do. At most they’ll remember the “go-to guy/gal” who made the problem go away.
To illustrate, let me tell you another story from my term as Director of IT for the Internet skunkworks of a big Direct Mail Marketing company. We were rolling out a new membership program, and as an incentive we were offering the choice of one of three worthless items of Chinese-made junk with each new membership. I’m talking about the kind of stuff you see as freebie throw-ins on those late-night infomercials– book lights, pocket AM/FM radios, “inside the shell” egg scramblers, etc. The way new members got their stuff was that we passed a fulfillment code to the back-end database at corporate that triggered the warehouse mailing the right piece of junk to the new member’s address.
About a week and a half into the campaign our customer support center started getting lots of angry phone calls: “Hey! I requested the egg scrambler and got this crappy book light instead.” This provoked a very urgent call from one of the supervisors at the call center. I said it sounded like there was a problem somewhere in the chain from our web site into fulfillment and I’d get to the bottom of it, and in the meantime we agreed that the best policy was to tell the customer to keep the incorrectly shipped junk as our gift and we’d also send them the junk they requested.
Once we started our investigation, the problem was immediately obvious. We had an email from the fulfillment folks with the code numbers for the various items, and those were the code numbers programmed into our application. However, when we checked the list of fulfillment codes against the back-end data dictionary, we realized that they’d transposed the numbers for the various items when they sent us the email. Classic snafu and an honest mistake. Once we figured out the problem, it took seconds to fix the codes and only a few minutes to run off a report listing all of the new members who were very shortly going to be receiving the wrong items.
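A cross-check like the one that caught this problem is easy to automate before go-live. Here's a sketch with entirely hypothetical data and names of my own invention (the real codes and item names aren't given in the story): compare the codes received by email against the back-end data dictionary, and any transposition surfaces immediately.

```python
# Cross-check fulfillment codes received by email against the back-end
# data dictionary. A transposition like the one in the story shows up
# as a pair of mismatches.

def find_mismatches(email_codes: dict, data_dictionary: dict) -> list:
    """Return (item, code_from_email, code_in_dictionary) for every conflict."""
    return [
        (item, code, data_dictionary.get(item))
        for item, code in email_codes.items()
        if data_dictionary.get(item) != code
    ]

# Hypothetical example: book light and egg scrambler codes swapped in the email.
email_codes = {"book light": 101, "am/fm radio": 102, "egg scrambler": 103}
data_dictionary = {"book light": 103, "am/fm radio": 102, "egg scrambler": 101}

for item, emailed, actual in find_mismatches(email_codes, data_dictionary):
    print(f"MISMATCH: {item}: email says {emailed}, dictionary says {actual}")
```

A few lines of reconciliation like this, run as part of pre-launch testing, would have turned a week and a half of angry phone calls into a one-line fix before the campaign ever shipped.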
So the question then became how to communicate the problem and the resolution to the rest of the business. I settled for simplicity over blame:
“We have recently been made aware of a problem with the fulfillment process in the new rollout of member service XYZ, resulting in new members receiving the wrong promotional items. This was due to fulfillment codes being incorrectly entered into the web application. We have corrected the problem and have provided Customer Service and Fulfillment with the list of affected members so that they can ship the appropriate items immediately.”
You will note that I carefully didn’t specify whose “fault” it was that the incorrect codes were inserted into the application. I’m sure the rest of the business assumed it was my team’s fault. I’m sure of this because the Product Manager in charge of the campaign called me less than fifteen minutes after I sent out the email and literally screamed at me– I was holding the phone away from my ear– that he knew it wasn’t our fault (he’d seen all the email traffic during the investigation) and how could I let the rest of the company assume we were at fault?
And I told him what I’m going to tell you now: nobody else cared whose fault it was. Fulfillment was grateful that I’d jumped on this particular hand grenade and saved them from the shrapnel. My management was impressed that we’d resolved the problem less than two hours after the initial reports and had further produced a list of the affected members so that Customer Service could get out ahead of the problem rather than waiting for irate customers to call. Total cost was minimal because we caught it early and addressed it promptly.
And that’s the bottom line: all that throwing blame around would have done was make people angry and lengthen our time to resolution. Finding somebody to blame doesn’t make you feel justified or more fulfilled somehow, it just makes you tired and frustrated. So always try to short-circuit the blame loop and move straight into making things better.
March 2, 2009
At the end of our recent SANS webcast, Mike Poor closed by emphasizing how important it was for IT and Information Security groups to advertise their operational successes to the rest of the organization (and also to their own people). Too often these functions are seen as pure cost centers, and in these difficult economic times it’s up to these organizations to demonstrate return value or face severe cutbacks.
The question is what are the right metrics to publish in order to indicate success? All too often I see organizations publishing meaningless metrics, or even metrics that create negative cultures that damage corporate perception of the organization:
- It seems like a lot of IT Ops groups like to publish their “look how much stuff we operate” metrics: so many thousand machines, so many petabytes of disk, terabytes of backup data per week, etc. The biggest problem with these metrics is that they can be used to justify massive process inefficiencies. Maybe you have thousands of machines because every IT project buys its own hardware and you’re actually wasting money and resources that could be saved by consolidating. Besides, nobody else in the company cares how big your… er, server farm is.
- Then there are the dreaded help desk ticket metrics: tickets closed per week, average time to close tickets, percentage of open tickets, etc. The only thing these metrics do is incentivize your help desk to do a slapdash job and thereby annoy your customers. There’s only one help desk metric that matters: customer satisfaction. If you’re not doing customer satisfaction surveys on EVERY TICKET and/or you’re not getting good results then you fail.
So what are some good metrics? Well I’m a Visible Ops kind of guy, so the metrics that matter to me are things like amount of unplanned downtime (drive to zero), number of successful changes requiring no unplanned work or firefighting (more is better), number of unplanned or unauthorized changes (drive to zero), and projects completed on time and on-budget (more is better). Of course, if your IT organization is struggling, you might be tempted to NOT publish these metrics because they show that you’re not performing well. In these cases, accentuate the positive by publishing your improvement numbers rather than the raw data: “This month we had 33% less unplanned downtime than last month.” This makes your organization look proactive and creates the right cultural imperatives without airing your dirty laundry.
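The improvement-percentage trick is simple arithmetic, but it's worth being precise about the direction of the comparison. A quick sketch (the downtime numbers are invented for illustration; they happen to reproduce the 33% figure quoted above):

```python
# Report improvement percentages rather than raw counts when the raw
# numbers are nothing to brag about yet.

def improvement(previous: float, current: float) -> float:
    """Percentage reduction from the previous period (positive = better)."""
    return (previous - current) / previous * 100

# Hypothetical data: 360 minutes of unplanned downtime last month, 240 this month.
downtime_minutes = {"last month": 360, "this month": 240}
pct = improvement(downtime_minutes["last month"], downtime_minutes["this month"])
print(f"This month we had {pct:.0f}% less unplanned downtime than last month.")
```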
There are a couple of other places where I never fail to toot my own horn:
- If my organization makes life substantially better for another part of the company then you’d better believe I’m going to advertise that fact. For example, when my IT group put together a distributed build system that cut product compiles down from over eight hours to less than one hour, it not only went into our regular status roll-ups, but I also got the head of the Release Engineering group to give us some testimonials as well.
- Whenever a significant new security vulnerability comes out that is not an issue for us because of our standard builds and/or operations environment, I make sure the people who provide my budget know about it. It also helps if you can point to “horror story” articles about the amount of money other organizations have had to pay to clean up after incidents related to the vulnerability. This is one of the few times that Information Security can demonstrate direct value to the organization, and you must never miss out on these chances.
What’s That Smell?
If communicating your successes builds a corporate perception of your organization’s value, being transparent about your failures builds trust with the rest of the business. If you try to present a relentlessly positive marketing spin on your accomplishments your “customers” elsewhere in the company will become suspicious. Plus you’ll never bamboozle them sufficiently with your wins that they won’t notice the elephant in the room when you fall on your face.
The important things to communicate when you fail are that you understand what led to the failure, that you have the situational awareness to understand its impact on the business, and the steps you’re taking to make sure the same failure never happens again (the only real organizational failure is letting the same failure happen twice). Here’s a simple checklist of items you should have in your disclosure statement:
- Analysis of the process(es) that led to the failure
- The duration of the outage
- How the outage was detected
- The systems and services impacted
- Which business units were impacted and in what way
- Actions taken to end the outage
- Corrective processes to make sure it never happens again
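One way to make sure no post-mortem goes out with a section missing is to treat the checklist above as a required-fields template. This is a minimal sketch; the field names are my own shorthand for the checklist items, not any standard:

```python
# Required sections for a post-mortem disclosure, mirroring the checklist.

POSTMORTEM_FIELDS = [
    "process_analysis",        # analysis of the process(es) that led to the failure
    "outage_duration",
    "detection_method",        # how the outage was detected
    "systems_impacted",
    "business_units_impacted",
    "remediation_actions",     # actions taken to end the outage
    "corrective_processes",    # how we keep it from happening again
]

def missing_sections(report: dict) -> list:
    """Names of checklist items absent or empty in a draft report."""
    return [f for f in POSTMORTEM_FIELDS if not report.get(f)]

# Hypothetical draft that is far from ready to send:
draft = {"outage_duration": "47 minutes", "detection_method": "monitoring alert"}
print("Still missing:", missing_sections(draft))
```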
Note that in some cases it’s necessary to split the disclosure across 2-3 messages. One is sent during the incident telling your constituents, “Yes, there’s a problem and we’re working it.” The next is the “services restored at time X, more information forthcoming” message. And then finally your complete post-mortem report. Try to avoid partial or incomplete disclosure or idle speculation without all of the facts– you’ll almost always end up with egg on your face.
If you don’t communicate what’s happening in your IT and/or InfoSec organization, then the other business units are basically going to assume you’re not doing anything during the time when you’re not directly working on their requests. This leads to the perception of IT as nothing more than “revenue-sucking pigs.”
However, you also have to communicate in the right way. This means communicating worthwhile metrics and metrics which don’t create bad cultural imperatives for your organization. And it also means being transparent and communicating your failures– in the most proactive way possible– to the rest of the organization.