Skunkworks in the Clouds

I was recently asked to make a guest appearance on a podcast related to information security in “the cloud”.  One of the participants brought up an interesting anecdote from one of his clients.  Apparently the IT group at this company had been approached by a member of their marketing team who was looking for some compute resources to tackle a big data crunching exercise.  The IT group responded that they were already overloaded and it would be months before they could get around to providing the necessary infrastructure.  Rebuffed but undeterred, the marketing person used their credit card to purchase sufficient resources from Amazon’s EC2 to process the data set and got the work done literally overnight for a capital cost of approximately $1800.

There ensued the predictable horrified gasping from us InfoSec types on the podcast.  Nothing is more terrifying than skunkworks IT, especially on infrastructure not under our direct control.  “Didn’t they realize how insecure it was to do that?” “What will happen when all of our users realize how easily and conveniently they can do this?” “How can an organization control this type of risky behavior?” We went to bed immersed in our own paranoid but comfortable world-view.

Since then, however, I’ve had the chance to talk with other people about this situation.  In particular, my friend John Sechrest delivered an intellectual “boot to the head” that’s caused me to consider the situation in a new light.  Apparently getting the data processed in a timely fashion was so critical to the marketing department that they figured out their own self-service plan for obtaining the IT resources they needed. If the project was that critical, John asked, was it reasonable from a business perspective for the IT group to effectively refuse to help their marketing department crunch this data?

Maybe the IT group really was overloaded– most of them are these days.  However, the business of the company still needs to move forward, and the clever problem-solving monkeys in various parts of the organization will figure out ways to get their jobs done even without IT support. “Didn’t they realize how insecure it was to do that?”  No, and they didn’t care.  They needed to accomplish a goal, and they did.

“What will happen when all of our users realize how easily and conveniently they can do this?” My guess is they’re going to start doing it a lot more.  Maybe that’s a good thing.  If the IT group is really overloaded, then perhaps it should think about actually empowering their users to do these kinds of “one off” or prototype projects on their own without draining the resources of the core IT group.  Remember that if you let a thousand IT projects bloom, 999 of them are going to wither and die shortly thereafter.  Perhaps IT doesn’t need to waste time managing the death of the 999.

“How can an organization control this type of risky behavior?” You probably can’t.  So perhaps your IT group should provide a secure offering that’s so compelling that your users will want to use your version rather than the commodity offerings that are so readily available.  This solution will have to be tailored to each company, but I think it starts with things like:

  • Pre-configured images with known baseline configurations and relevant tools so that groups can get up and running quickly without having to build and upload their own images.
  • Easy toolkits for migrating data into and out of these images in a secure fashion, with some sort of DLP solution baked in.
  • Secure back-end storage to protect the data at rest in these images with no extra work on the part of the users.
  • Integration with the organization’s existing identity management and/or AAA framework so that users don’t have to re-implement their own solutions.
  • Integration with the organization’s auditing and logging infrastructures so you know what’s going on.

Putting together the kind of framework described above is a major IT project, and will require input and participation from your user community.  But once accomplished, it could provide massive leverage to overtaxed IT organizations.  Rather than IT having to engineer everything themselves, they provide secure self-service building blocks to their customers and let them have at it.
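To make the “building blocks” idea a little more concrete, here’s a minimal sketch of what the self-service entry point might look like, using Amazon’s boto3 Python library.  The AMI ID, instance profile name, and tag scheme are hypothetical placeholders– your IT group would substitute its own hardened baseline image and naming conventions:

    # Minimal sketch of a self-service launcher for IT-blessed baseline images.
    # The AMI ID, instance profile, and tags below are hypothetical examples.
    import boto3

    def launch_baseline_instance(owner, project, instance_type="m5.large"):
        """Launch an instance from the pre-configured baseline image,
        tagged so it shows up in central auditing and billing reports."""
        ec2 = boto3.client("ec2")
        response = ec2.run_instances(
            ImageId="ami-0123456789abcdef0",  # hypothetical hardened baseline
            InstanceType=instance_type,
            MinCount=1,
            MaxCount=1,
            # Hypothetical profile granting access to central logging and
            # identity services, so users don't re-implement their own.
            IamInstanceProfile={"Name": "self-service-baseline"},
            TagSpecifications=[{
                "ResourceType": "instance",
                "Tags": [
                    {"Key": "Owner", "Value": owner},
                    {"Key": "Project", "Value": project},
                ],
            }],
        )
        return response["Instances"][0]["InstanceId"]

    print(launch_baseline_instance("marketing", "big-data-crunch"))

The point of a wrapper like this is that the secure defaults– baseline image, identity integration, audit tags– are baked in, so the path of least resistance for your users is also the safe one.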

Providing architecture support and guidance in the early stages of each project is probably prudent.  After all, the one hardy little flower that blooms and refuses to die may become a critical resource to the organization that may eventually need to be moved back “in house”.  The fact that the service was built from blocks already well-integrated with the organization’s centralized IT infrastructure will help, but a reasonable architectural design from the start will also be a huge help when it comes time to migrate and continue scaling the service.

Am I advocating skunkworks IT?  No, I like to think I’m advocating self-service IT on a grand scale.  You’ll see what skunkworks IT looks like if you ignore this issue and just let your users develop their own solutions because you’re too busy to help them.

It Doesn’t Take That Much Longer To Do It Right

One of my personal and professional mantras is, “It doesn’t take that much longer to do it right.”  Sure, you can always do a half-assed job on some project just to shove it out the door, but there are inevitably down-stream costs.  Whether it’s time lost having to go back and fix your broken junk or customer frustration and negative perception, your short-cut decision usually ends up costing you more in the long run.

Of course you know I have a story to illustrate this principle.  During the peak of the dot-com boom I was helping a small start-up company move out of their sub-lease arrangement into their new permanent home.  The weekend of the move, we had contracted with a moving company to do the physical move of all the equipment and personal items and I was helping the IT team for the company do the setup, server room build out, telephony configuration, etc.  The building we were moving into was a two-story affair, and the plan for the equipment that was moving onto the second floor was to arrange it on pallets, wrap the pallet securely, and forklift the pallets through a second-story window.  Pretty standard practice, actually.

The moving company, however, had negotiated a fixed-price bid.  So it was in their best interests to get the work done as quickly as possible.  Without our knowledge, they decided to throw caution to the wind and not wrap the palletized equipment.  Sure enough, the first pallet went up on the forklift and as the pallet was moving forward through the window, one of the computers toppled off the edge and fell a full story onto concrete.  The punchline is that the computer belonged to the office manager for the company and was the computer that was going to cut the check to the moving company.  We were actually able to recover the hard drive from the twisted chassis and boot it in another PC, but the moving company had to pay a significant penalty that more than erased any savings that they might have achieved from not wrapping the pallets.  Also, after that incident, we made them stop and wrap all the pallets anyway, which cost them even more time.

Once the equipment was loaded in, the IT team and I could get started with our part of the project.  And it was a big project:  from our Friday evening start to late Sunday night, I think we got maybe 12 hours of downtime to sleep.  But there we were Sunday night and everything appeared to be ready for business to resume Monday morning.

Exhausted, but feeling pretty good about ourselves, we were standing admiring the new server room and somebody pointed out that we forgot to put the covers back on the cable raceways.  There was a collective groan.  We were “done”– surely replacing the covers could wait until the following week when we were all less tired?  But when we all looked each other in the eye, we knew that the upcoming week was going to be so chaotic that if we put off doing it now we’d never get around to it in a timely fashion.  So, without a word being spoken, we all grabbed some covers and spent another half hour or so fitting them into place.

When the execs toured the facility the following morning, we actually did get some compliments about how “neat” and “professional” the server room looked, but I don’t think that’s really why a crew of exhausted geeks spent an extra half hour re-assembling cable raceways.  Nor was it because we’re anal retentive or budding masochists.  I think it was because (a) we wanted to establish a culture of “doing things right” in the new server room so that entropy wouldn’t set in so quickly, and (b) we knew that going back and fixing the problem later would take longer than doing it right then.

So when you find yourself looking for excuses not to “finish” a project and just “get ‘er done”, take a moment and think about whether expediency is really the best policy.  Remember that bugs are always cheaper to fix in development than after the product ships.  It doesn’t take that much longer to do it right.

Avoiding Avoidance

At a recent tech event, I ended up having a conversation with Wendy Kincade and Colleen Dick about how we sometimes let critical projects slide. Wendy said a very wise thing, which is that when people are avoiding a project, it’s most often because they don’t think they’ll reach a successful outcome.  In some sense it becomes a vicious cycle: we know we really should be making progress on that big project, yet we somehow always find other things to be working on that allow us to avoid it.  Of course, the longer we avoid it, the less time we have to complete the project and it only becomes bigger and scarier as a result of the shrinking time window.

Full speed into the unknown

I’ve certainly come to recognize this behavior in myself.  It’s much more comfortable to spend your days doing small, tactical tasks that have short completion times and satisfying outcomes.  It’s a huge leap of faith to set out to tackle an enormous project when you’re not sure if you have all the skills necessary to accomplish the task, where the end of the project is not clearly in sight, and where the cost of failure may be high.  It gives me an uncomfortable feeling in the pit of my stomach, not unlike the feeling you get when contemplating stepping off from a great height.

When I recognize this feeling in myself, I immediately take steps to start tackling the project, because I know from previous experience that if I let it linger the situation is only going to get worse.  Here are some tactics that I’ve developed for getting over the hump:

1. Break it up: Vast monolithic projects are daunting, so break the project up into a set of deliverables, milestones, and dependencies.  Then outline the steps necessary to reach each component of the project.  You don’t have to create a formal project plan– in fact, I’ve seen people spend all their time grooming a plan in MS Project, just to avoid getting started on the actual work.  A simple outline format is fine.

2. Pick an easy one: Once you’ve got a notion of the individual tasks you need to accomplish to finish your project, pick one of the tasks that you think you can complete quickly and get it done.  During our conversation, Colleen commented, “I know that if I can just knock one thing down, that gives me energy to push further into the project.”

3. Make it fun: What motivates you?  I really enjoy figuring out and mastering new technology.  So if there’s a component of the project that requires me to do a bunch of research to figure something out, I’ll tend to do that first and give myself leeway to spend extra time really mastering that subject.  While I need to be careful to prevent the research from turning into an avoidance exercise in and of itself, I also know that any mastery I acquire will be useful at some point in the future, even if it’s not directly relevant to the project at hand.  Some people reward themselves after completing a particular part of the project– take a break to play your favorite video game, hang out with friends, go for a hike, whatever.

4. Consider past success: Reflect on the fact that you’ve accomplished difficult tasks in the past.  Remember the satisfaction you felt when you finally shipped those projects.  Use these feelings to reinforce your belief that you’ll be successful at the project you’re currently embarking on.

While I’ve come to know my own avoidance behaviors and learned to take steps to work around them, I don’t think I’ll ever be entirely free of them.  I think it’s just a natural human risk aversion response.  However, I also recognize that one has to take risks to accomplish great things.  I have a quote from the explorer Magellan on the wall in my office and I look at it often:

Unlike the mediocre, the intrepid spirits seek victory over those things that seem impossible… They embark on the most daring of all endeavors… to meet the shadowy future without fear and conquer the unknown.

Change Management: It’s Not Just for Big Companies

In a comment to Monday’s post on Change Management, John Moore wrote:

I would argue, however, that this level of change management is only appropriate once you reach a certain size company. If the company is more than 100 people, you need to have these policies in place and you must have enforcement, or the cost for running the IT team in a manner that benefits the company is impossible.

In startups, where I have spent much of the last decade, the change management systems you have defined above would be overly prohibitive and remove the flexibility that is critical for success.

John’s not alone in expressing this view.  I’ve heard similar sorts of comments from companies of all different sizes– some of which were substantially larger than John’s suggested 100 person threshold.  But I think that change management is important regardless of what size you’re at, and it doesn’t have to remove any “flexibility” or “agility” from the organization.  Quite the contrary, appropriate change management should enable the organization to move more rapidly because it reduces the failed changes and unplanned work that suck up resources that could otherwise be more productively channeled.

The key word there is “appropriate”.  Of course the change management process in a 3-10 person start-up looks completely different from the process in a company with hundreds of employees.  In an early-stage start-up you’ve typically got a team of people working very closely together with laser focus on a single line of business.  You don’t tend to have the kind of “process flow control” issues that larger companies do, where you need change review meetings to balance competing priorities and competing schedule issues.

But even three-person start-ups need to make production changes thoughtfully and with rigor.  It’s easy to think “we know what we’re doing” and get yourself into a lot of trouble and cause a significant outage.  It doesn’t take that much longer to sit down and write a detailed implementation plan, have one of your co-workers review it, and then execute it (Hickstein’s “Think, think, think, type, type, type, ‘beer’!”).  And the bonus is that your history of implementation plans helps you when you need to grow your infrastructure, because now you have the documented list of configuration changes necessary to produce replicas of your existing systems.

Do I think a three-person start-up needs formal change control meetings?  Heck no!  If you have regular Engineering meetings, set aside a little time to mention scheduled production updates (if any) and solicit feedback.  If you don’t have regular meetings, set up an email alias where notices of production changes can be posted.  That way, at least everybody will be aware of the current state of affairs on the production systems (or can refer back to the archives as appropriate), which is critical information for them to know as they’re developing code for those platforms.

I would, however, recommend that you implement some sort of configuration control process on your production systems.  It could be as simple as deploying an open-source utility like AIDE or Samhain, just to keep an eye on what’s happening on the system.  Aside from alerting you to cockpit error on the part of your own people, these kinds of tools can also alert you to more nefarious activity and are part of a good baseline security posture.
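AIDE and Samhain are full-featured tools with their own configuration languages, but conceptually they boil down to something like the following toy sketch– hash the files you care about, store a baseline, and report differences.  The watched paths and baseline location here are arbitrary examples, and this is an illustration of the idea, not a substitute for the real tools:

    # Toy illustration of what file-integrity checkers like AIDE do:
    # hash the watched files, store a baseline, and report differences.
    # The watched paths and baseline location are arbitrary examples.
    import hashlib, json, os, sys

    WATCHED = ["/etc", "/usr/local/bin"]
    BASELINE = "/var/tmp/integrity-baseline.json"

    def snapshot():
        hashes = {}
        for top in WATCHED:
            for dirpath, _, filenames in os.walk(top):
                for name in filenames:
                    path = os.path.join(dirpath, name)
                    try:
                        with open(path, "rb") as f:
                            hashes[path] = hashlib.sha256(f.read()).hexdigest()
                    except OSError:
                        pass  # skip unreadable files in this toy version
        return hashes

    if __name__ == "__main__":
        if sys.argv[1:] == ["--init"]:
            with open(BASELINE, "w") as f:
                json.dump(snapshot(), f)
        else:
            with open(BASELINE) as f:
                old = json.load(f)
            new = snapshot()
            for path in sorted(set(old) | set(new)):
                if old.get(path) != new.get(path):
                    print("CHANGED:", path)

Run it once with --init to record the baseline, then periodically (say, from cron) to report changes.  The real tools add signed databases, richer attribute checks, and alerting, which is why you should use them instead of rolling your own.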

At some point in the growth cycle of the company, you’re going to start getting feedback from developers that they “don’t care” about the production update notices.  Congratulations! You’ve just reached a major milestone in your company’s maturation process– the beginning of separation of duties.  This is probably also around the time you’ll be hiring your first full-time IT person, so start soliciting resumes.

Your change management processes will also start adjusting to your new realities.  Your new IT person is going to become the keeper of the implementation plans and other change documentation.  They’ll probably also start pushing you for more formal outage windows, just so they can have some predictability in the environment.  And they’re also going to start pushing back on the developers to keep them from making direct changes on the production systems.  Let these things happen.

The next thing you know, you’re going to look up and realize you’ve got several IT folks and they’ve got their own manager.  Furthermore, you’ve got several products now being developed concurrently.  This is the stage where John suggests that your company needs to start embracing a formal change management process like the one described in Visible Ops, and I agree.  Hopefully you figure out you’ve reached this stage before you have a production outage caused by multiple, badly coordinated updates.

Just as the wrong time to fix bugs in your product is after the product has shipped, the wrong time to build a culture of change management from scratch is after the company is established.  It is very hard to change a “cowboy culture” once it’s been allowed to establish itself.  Visible Ops has a quote from Dr. Bob Doppelt, who was actually speaking of public health matters when he uttered it, but it is nonetheless appropriate: “The righter we do the wrong things, the wronger we become.” The problem is that inattention to change management can appear to work for a period of time– mostly because nobody’s bothering to track the amount of time lost to firefighting and unplanned work.  But suddenly an organization wakes up and realizes that they’ve become utterly crushed by the tyranny of unplanned work.  Digging out of this hole is painful.

So resist the notion that change management is “only for big companies”.  Don’t you hope to be a big company some day? Well, you’re not going to receive an angelic visitation complete with a fully-functioning change management process on the magic day you somehow cross the “big company” threshold.  Better instead to be a small company that believes strongly in change management and grows naturally into a formal change management process.

You Don’t Hate Change Management (You Hate Bad Change Management)

Lately Gene Kim, Kevin Behr, and I have been on a nearly messianic crusade against IT suckage.  Much of our discussion has centered around The Visible Ops Handbook that Gene and Kevin co-authored with George Spafford. Visible Ops is an extremely useful playbook containing four steps that IT groups can follow to help them become much higher performing organizations.

However, I will admit that Visible Ops is sometimes a hard sell.  That’s because the first step of Visible Ops is to create a working change management process within the IT organization– with functional controls and real consequences for people who subvert the change management process.  Aside from being a difficult task in the first place, just the mere concept of change management causes many IT folks to start looking for an exit.  “We hate change management!” they say.  “Don’t do this to us!”  What I quickly try to explain to them is that they don’t hate change management, they just hate bad change management.  And, unfortunately, bad change management is all they’ve experienced to date, so they don’t know there’s a better way.

What are some of the hallmarks of bad change management processes?  See if any of these sound familiar to you:

1. Just a box-checking exercise: The problem here is usually that an organization has implemented change management only because their auditors told them they needed it.  As a result, the process is completely disconnected from the actual operational work of IT in the organization.  It’s simply an exercise in filling out and rubber-stamping whatever ridiculous forms are required to meet the letter of the auditors’ requirements.  It does not add value or additional confidence to the process of making updates in the environment.  Quite the contrary, it’s just extra work for an already over-loaded operations staff.

2. No enforcement: The IT environment has no controls in place to detect changes, much less unauthorized changes.  If the process is already perceived as just a box-checking exercise and IT workers know that no alarms will be raised if they make a change without doing the paperwork, do you think they’ll actually follow the change management process?  Visible Ops has a great story about an organization that implemented a change management process without controls.  In the second month changes were down by 50%, and down another 20% in month three, yet the organization was still in chaos and fighting with constant unplanned outages.  When they finally implemented automated change controls, they discovered that the rate of changes was constant– it was just the rate of paperwork that was declining.

3. No accountability: What does the organization do when they detect an unauthorized change?  The typical scenario is when a very important member of the operations or development staff makes an unauthorized change that ends up causing a significant outage.  Often this is where IT management fails their “gut check”– they fear angering this critical resource and so the perpetrator ends up getting at worst a slap on the wrist.  Is it any wonder then that the rest of the organization realizes that management is not taking the change management process seriously and thus the entire process can be safely ignored without individual consequences?

I firmly believe that change management can actually help an organization get things done faster, rather than slower.  Seems counter-intuitive, right?  Let me give you some recommendations for improving your change management process and talk about why they make things better:

1. Ask the right questions: What systems, processes, and business units will be affected? During what window will the work be done? Has this change been coordinated with the affected business units and how has it been communicated? What is the detailed implementation plan for performing the change? How will the change be tested for success? What is the back-out plan in case of failure?  (A sketch of a change record that captures these answers follows this list.)

Asking the right questions will help the organization achieve higher rates of successful changes, which means less unplanned work.  And unplanned work is the great weight that’s crushing most low-performing IT organizations.  As my friend Jim Hickstein so eloquently put it, “Don’t do: think, type, think, type, think, type, ‘shit’! Do: think, think, think, type, type, type, ‘beer’!”  Also, coordinating work properly with other business units means less business impact and greater overall availability.

2. Learn lessons: The first part of your change management meetings should be reviewing completed changes from the previous cycle.  Pay particular attention to changes that failed or didn’t go smoothly. What happened? How can we make sure it won’t happen next time?  What worked really well?  Like most processes, change management should be subject to continuous improvement.  The only real mistake is making the same mistake twice.

Again the goal of these post-mortems should be to drive down the amount of unplanned work that results from changes in the IT environment.  But hopefully you’ll also learn to make changes better and faster, and to stream-line the change management process itself.

3. Keep appropriate documentation: Retain all documentation around change requests, approvals, and implementation details. The most obvious reason to do this is to satisfy your auditors.  If you do a good job organizing this information as part of your change management process, then supplying your auditors with the information they need really should be as easy as hitting a few buttons and generating a report out of your change management database.

However, where all this documentation really adds value on a day-to-day basis is when you can tie the change management documentation into your problem resolution system.  After all, when you’re dealing with an unplanned outage on a system, what’s the first question you should be asking?  “What changed?”  Well, what if your trouble tickets automatically populated themselves with the most recent set of changes associated with the system(s) that are experiencing problems?  Seems like that would reduce your problem resolution times and increase availability, right?  Well guess what?  It really does.

4. Implement automated controls and demand accountability: If you want people to follow the change management process, they have to know that unplanned changes will be detected and consequences will ensue.  As I mentioned above, management is sometimes reluctant to follow through on the “consequences” part of the equation.  They feel like they’re held hostage to the brilliant IT heroes who are saving the day on a regular basis yet largely ignoring the change management process.  What management needs to realize is that it’s these same heroes who are getting them into trouble in the first place.  The heroes don’t need to be shown the door, just moved into a role– development perhaps– where they don’t have access to the production systems.

Again, the result is less unplanned work and higher availability.  However, it’s also my experience that automated change controls teach you a huge amount about the way your systems and the processes that run on them are functioning.  This greater visibility and understanding of your systems leads to a higher rate of successful changes.
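Pulling recommendations 1 and 3 together, here’s a hypothetical sketch of the minimal record a change management database might keep, plus the “what changed?” lookup described above.  The field names are illustrative– they’re not prescribed by Visible Ops or any particular change management product:

    # Hypothetical sketch of a minimal change record and the "what changed?"
    # lookup described above.  Field names are illustrative only.
    from dataclasses import dataclass
    from datetime import datetime

    @dataclass
    class ChangeRecord:
        summary: str
        systems_affected: list    # answers "what will be affected?"
        window_start: datetime    # answers "when will the work be done?"
        window_end: datetime
        implementation_plan: str  # the detailed step-by-step plan
        test_plan: str            # how success will be verified
        backout_plan: str         # what to do in case of failure
        approved_by: str = ""
        outcome: str = ""         # filled in at the next review meeting

    def recent_changes(change_log, system, limit=5):
        """Return the most recent changes touching a system, so a trouble
        ticket can auto-populate with the likeliest culprits."""
        hits = [c for c in change_log if system in c.systems_affected]
        hits.sort(key=lambda c: c.window_start, reverse=True)
        return hits[:limit]

Even a structure this simple answers the auditors’ questions, feeds the lessons-learned review, and puts “what changed?” one query away when an outage hits.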

The great thing about the steps in Visible Ops is that each step gives back more resources to the organization than it consumes.  The first step of implementing proper and useful change management processes is no exception.  You probably won’t get it completely right initially, but if you’re committed to continuous improvement and accountability, I think you’ll be amazed at the results.

When benchmarking the high-performing IT organizations identified in Visible Ops, the findings were that these organizations performed 14 times more changes with one quarter the change failure rate of low-performing organizations, and furthermore had one third the amount of unplanned work and 10x faster resolution times when problems did occur.  For the InfoSec folks in the audience, these organizations were five times less likely to experience a breach and five times more likely to detect one when it occurred.  Further, these organizations spent one-third the time on audit prep compared to low-performing organizations and had one quarter the number of repeat audit findings.

If change management is the first step on the road to achieving this kind of success, why wouldn’t you sign up for it?

Pondering IT Project Management Issues

Lately I was reading another excellent blog post from Peter Thomas where he was discussing different metaphors for IT projects.  As Peter points out, it’s traditional to schedule IT projects as if they were standard real-world construction projects like building a skyscraper.  Peter writes in his blog:

Building tends to follow a waterfall project plan (as do many IT projects). Of course there may be some iterations, even many of them, but the idea is that the project is made up of discrete base-level tasks whose duration can be estimated with a degree of accuracy. Examples of such a task might include writing a functional specification, developing a specific code module, or performing integration testing between two sub-systems. Adding up all the base-level tasks and the rates of the people involved gets you a cost estimate. Working out the dependencies between the base-level tasks gets you an overall duration estimate.

Peter goes on to have some wise thoughts about why this model may not be appropriate for specific types of IT projects, but his description above got me thinking hard about an IT project management issue that I’ve had to grapple with during my career.  The problem is that the kind of planned project work that Peter is discussing above is only one type of work that your IT staff is engaged in.  Outside of the deliverables they’re responsible for in the project schedule, your IT workers also have routine recurring maintenance tasks that they must perform (monitoring logs, shuffling backup media, etc) as well as losing time to unplanned work and outages.  To stretch our construction analogy to its limits, it’s as if you were trying to build a skyscraper with a construction crew that moonlighted as janitors in the neighboring building and were also on-call 24×7 as the local volunteer fire department.  You were expecting the cement for the foundation to get poured on Thursday, but the crew was somewhere else putting out a fire and when they got done with that they had to polish the floors next door, so now your skyscraper project plan is slipping all over the place.

I’ve developed some strategies for dealing with these kinds of issues, but I don’t feel like I’ve discovered the “silver bullet” for creating predictability in my IT project schedules.  Certainly one important factor is driving down the amount of unplanned work in your IT environment.  Constant fire fighting is a recipe for failure in any IT organization, but how to fix this problem is a topic for another day. Another important strategy is to rotate your “on-call” position through the IT group so that only a fraction of your team is engaged in fire fighting activities in any given week.  When a person is on-call, I normally mark them as “unavailable” on my project schedule just as if they were out of the office, and then resource leveling allows you to more accurately predict the dates for the deliverables they’re responsible for.

Finally, I recognize that IT workers almost never have 100% of their time available to work on IT projects, and I set their project staffing levels accordingly.  I may only be able to schedule 70% of Charlene’s time to Project Whiz-Bang, because Charlene is our Backup Diva and loses 30% of her time on average to routine backup maintenance issues and/or being called in to resolve unplanned issues with the backup system.  And notice the qualifier “on average” there– some weeks Charlene may get caught up in dealing with a critical outage with the backup system and not be able to make any progress on her Project Whiz-Bang deliverables.  When weeks like this happen, you hope that Charlene’s deliverables aren’t on the critical path and that she can make up the time later in the project schedule– or you bring in other resources to pick up the slack.
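The scheduling arithmetic here is simple but worth making explicit: a task’s expected calendar duration is its effort divided by the fraction of time the assignee actually has for project work.  The numbers below are just the ones from the Charlene example:

    # Expected calendar duration = effort / availability.
    def calendar_days(effort_days, availability):
        return effort_days / availability

    # Ten days of effort from Charlene, who is staffed at 70%:
    print(calendar_days(10, 1.0))  # 10.0 calendar days if fully dedicated
    print(calendar_days(10, 0.7))  # ~14.3 calendar days at 70% availability

That’s more than four days of slip on a ten-day task before anything even goes wrong– which is exactly why those availability assumptions belong in the schedule from day one.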

Which brings me to another important piece of strategy I’ve picked up through the years: IT project slippage is inevitable, so you want to catch it as quickly as possible.  The worst thing that can happen is that you get to the milestone date for a multi-week deliverable only to discover that work on this segment of the project hasn’t even commenced.  This means you need to break your IT projects down into small deliverables that can be tracked individually and continuously.  I’m uncomfortable unless the lowest-level detail in my project schedule has durations of a few days or less.  Otherwise your project manager is almost guaranteed to be receiving some nasty surprises much too late to fix the problem.

These are some of the strategies I’ve come up with for managing IT projects, but I still admit to a large amount of trepidation when wrangling large IT efforts.  I’m curious to hear whether any of you reading this blog have developed useful strategies for managing IT projects in your environment.  Let’s discuss them in the comments!

‘Remember You Said Dead’

Fifteen years ago or more I was listening to a presentation by Vint Cerf where he was advocating for the adoption of CIDR as a solution to many of the routing issues the core Internet providers were facing at the time.  In responding to his critics, he made an off-handed comment to the effect that, “People say to me, ‘Vint, you can have my IP prefix when you pry it out of my cold, dead fingers.’ To which I respond, ‘Remember you said dead.'”

Needless to say, this got a huge laugh out of the audience.  But buried in this little comment is a nugget of IT wisdom that applies in so many different situations.  To state it more plainly, there are times of compelling change when the most sensible course is to simply ignore the current installed base issues and just move forward.  By the time you’ve finished your new roll-out, today’s installed base will be completely subsumed into the new technology.

In Vint’s case, his position was spectacularly vindicated of course, because the number of Internet-connected hosts grew by an order of magnitude during the period when CIDR was being rolled out.  But this kind of thinking applies equally well on a smaller scale to operational issues faced by many IT operations.  Have new baseline images you want to roll out to your organization but are meeting resistance from your user community?  Roll the images out on newly deployed systems only and wait for attrition to take care of the existing installed base.  Given the cycle rate of technology in most organizations, that gives you a half-life of change of about 18 months.
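The arithmetic behind that half-life claim is worth spelling out.  Assuming a three-year replacement cycle with machines retired and re-imaged at a steady rate– a typical assumption, not a universal constant:

    # Arithmetic behind the "half-life of about 18 months" claim, assuming
    # a steady three-year (36-month) hardware replacement cycle.
    REPLACEMENT_CYCLE_MONTHS = 36

    def old_image_fraction(months_elapsed):
        """Fraction of the installed base still running the old image."""
        return max(0.0, 1.0 - months_elapsed / REPLACEMENT_CYCLE_MONTHS)

    print(old_image_fraction(18))  # 0.5 -- half the fleet converted

So without forcing a single upgrade, half of your installed base problem quietly retires itself within a year and a half.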