atp

Atp's external memory

Agile Operations

the role of the interruption monkey

I promised Matt Carter I'd start writing up aspects of how we run agile operations (aka devops) at LMAX. Apologies this is so overdue.

One of the major tensions in any small technical operations team is between project work and "interrupt driven" work. Interrupt work is hard to define, other than by the slightly circular "anything that is not project work". In practice this ranges from problems in production through to getting users working mice.

Small IT teams are usually pretty bad at dealing with interruptions, so we've developed the concept of an "Interruption Monkey" to keep things manageable.

Why Interruptions are Bad

There was an interesting article in the Schumpeter column of the July 2nd 2011 edition of The Economist. Titled "Information Overload", it talks about the problem of there being too many interruptions to deal with: email, phones, instant messaging, SMS, or old-fashioned walk-ups. The complaint was that with too many interruptions, nothing actually gets done.

The practical solutions offered were actually quite simple. They involve willpower, schedules and systems where you take a break from sources of interruptions, and allow yourself time to get things done.

Particular quotes I liked were "Ask yourself if what you are doing is constructive, or a mere 'activity'", followed by the exhortation to "focus on a narrow range of objectives and filter out everything else".

On a busy day it's normal to get interrupted verbally, and have emails and instant messages flying in. Sometimes you find yourself in a quiet spot just pressing send/receive on the mail client, because you know that something is bound to come in soon and it's not worth starting something "big" before it does. At the end of the day, you can feel you achieved pretty much nothing. You've been busy, but not actually got anything done. There are a bunch of emails you've written or replied to, and perhaps a couple of tickets resolved.

When you consider it on the way home, what you managed to get done would probably have taken you all of half an hour in a quiet room. Where did the rest of the day go? The answer is usually "in context switching". An interruption can take you out for ten minutes even if it's just a 30 second conversation. Your focus is lost, you make a cup of tea/coffee, you check email...

In short, when you deal with a lot of interrupts, they end up driving you and how you work, and it's easy to end up in a semi-dazed state just waiting for the next bell to ring. All those good intentions you had for what you'd get done today are drowned in a sea of trivia. The important has been overruled by the urgent.

Multiplied across the team, that means that a burst of interruptions can kill any sort of productivity quite rapidly.
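
As a rough back-of-envelope sketch of that cost: the ten minutes of lost focus per interruption is the figure mentioned above, while the interruption counts and the team of four are made-up numbers purely for illustration.

```python
# Back-of-envelope cost of interruptions. The ten-minute recovery figure is
# the one quoted above; the counts passed in below are illustrative only.
RECOVERY_MINUTES = 10


def minutes_lost(interruptions, recovery_minutes=RECOVERY_MINUTES):
    """Focus time lost by one person to a number of interruptions."""
    return interruptions * recovery_minutes


def team_minutes_lost(interruptions_per_person):
    """Total focus time lost across a team, given each person's count."""
    return sum(minutes_lost(n) for n in interruptions_per_person)


# Hypothetical busy day: one engineer interrupted 12 times.
print(minutes_lost(12))                    # 120 minutes, i.e. two hours
# Hypothetical team of four, each interrupted a different number of times.
print(team_minutes_lost([12, 8, 5, 10]))   # 350 minutes
```

Two hours gone for one person from what might be six minutes of actual conversation is the sort of ratio that makes it worth routing interruptions somewhere else.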

The Interruption Monkey

Which brings us to the Interruption Monkey, which was originally a token that sat on the desk of the duty systems administrator(s). However, "duty systems admin" is not as catchy, so we now call the sysadmins doing the job the interruption monkeys.

The role of an interruption monkey is quite simple: he shields the team from interruptions, and allows people to get on with story cards (projects). Each interrupt he handles saves someone else 10-15 minutes of lost work.

In more detail:

  1. The Interruption Monkey (IM) handles walk-ups and requests.
  2. The IM handles short tickets and keeps the queues moving and under control.
  3. If a ticket will take more than half an hour, it should be deferred.
  4. If there are no short tickets, then the highest priority longer ticket should be started.
  5. The IM actively intercepts inbound issues before they reach other team members.
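
The ticket-picking part of those rules can be sketched in a few lines. This is only an illustration, not our actual ticketing setup: the `Ticket` fields, the idea of an up-front estimate, and the "lower number means more urgent" convention are all assumptions.

```python
from dataclasses import dataclass

SHORT_TICKET_LIMIT_MINUTES = 30  # the half-hour rule from the list above


@dataclass
class Ticket:
    title: str
    estimate_minutes: int  # hypothetical field: a rough guess at the effort
    priority: int          # hypothetical convention: lower = more urgent


def next_ticket(queue):
    """Pick what the IM should work on next, or None if the queue is empty."""
    short = [t for t in queue if t.estimate_minutes <= SHORT_TICKET_LIMIT_MINUTES]
    if short:
        # Short tickets go first, most urgent first; long ones are deferred.
        return min(short, key=lambda t: t.priority)
    if queue:
        # No short tickets left, so start the highest-priority long one.
        return min(queue, key=lambda t: t.priority)
    return None


queue = [
    Ticket("rebuild database replica", 240, 1),
    Ticket("replace broken mouse", 10, 3),
    Ticket("reset password", 5, 2),
]
print(next_ticket(queue).title)  # reset password
```

Note that the long ticket is deferred even though it has the highest priority; it only gets picked up once the short ones are cleared.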

The IM(s) are also responsible for keeping the ticket queue length down.

The chief metrics are the number of tickets raised, the number resolved, and the ticket queue length at the end of an iteration. These are shared with the other teams in technology and the business.

In the same way that delivery and prioritisation of features is discussed with the dev teams, the amount of effort devoted to business-as-usual (BAU) work can then be agreed with the rest of the company.
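
A minimal sketch of how those three numbers relate over one iteration. The class and field names here are hypothetical, and the figures are invented for the example.

```python
from dataclasses import dataclass


@dataclass
class IterationStats:
    """Ticket metrics for one iteration (names are illustrative)."""
    queue_at_start: int
    raised: int
    resolved: int

    @property
    def queue_at_end(self):
        # Everything that was open or newly raised, minus what got resolved.
        return self.queue_at_start + self.raised - self.resolved

    @property
    def queue_is_rising(self):
        return self.raised > self.resolved


stats = IterationStats(queue_at_start=40, raised=55, resolved=48)
print(stats.queue_at_end)     # 47
print(stats.queue_is_rising)  # True
```

A rising queue like this one is the signal that starts the conversation about how much effort goes to tickets versus stories.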

If the ticket queues are rising, we may devote more people to the IM role, at the expense of story delivery. On the other hand, we have had situations where the business owner decides that stories are more important, and deprioritises BAU work.

That can be a surprise. If you're working as a focussed small company, people will put up with broken printers and dodgy keyboards, provided that they know why. And it is the job of the business owner to explain the decision to the rest of the company. The tickets don't get done. But people will know why. If they don't, then that's the business owner's problem.
 
The benefits are that you can run a small team on both projects and BAU work, and keep a handle on the impact that the very interrupt driven and bursty work will have on project delivery.

And in Practice...

It requires quite a bit of discipline in the team to make this actually work, as well as training the users to find the interruption monkey.

The problems typically encountered are:

  • Random walk-ups grabbing the closest engineer. Most tech guys are happy to help, and you need to work hard to get them to refer the walk-up to the interruption monkey.
  • "Interesting problems". Again, you need to work against the tech guys' nature. If someone comes over with an interesting problem or, god forbid, a flashy new bit of electronics, then pretty soon you'll have the entire department pitching in.
  • If the IM is away and you get a walk-up, then you need a plan in place to handle that. We normally point them at the IM on the floor, or ask them to raise a ticket, and politely tell them we'll send the IM over once he is back. It's very rare that that does not do the job.
  • Siloing of skills/knowledge. If only one person knows how to build laptops or fix a database, then they're going to get interrupted. This can be fixed with pairing, mentoring and of course rotation across IM work and stories.
  • "Just doing this ticket while I wait for..." or helping out. Team members tend to get nervous about ticket queue length and will work on tickets even if they're not the IM, sometimes to the detriment of their project or story work.

All of these tend to sap effort away from project work onto BAU. All require the team to understand the role of the IM, and the team leads in particular to take an active hand in ensuring they're handled properly.

The last one is particularly bad, because team members think they're helping, when all they're doing is subverting the system.

To see why it's so bad, consider the following scenario.

A dedicated engineer (DE) decides to spend a couple of hours every day knocking off a couple of tickets on the quiet. As a result, the following happen:

  1. You lose some velocity on the story card.
  2. You lose visibility of how hard that ticket was.
  3. The ticket queue drops, which you attribute to the IM.
  4. Your metrics become incorrect.

In practice our dedicated engineer seems a lot slower than he is in completing the card. The team velocity drops. You size up similar cards next time because this one took 3 days, not 2.

As a manager, you conclude that the current resource level of IMs is correct, as the ticket queue is getting shorter. So you may even cut the number of IMs.

Even though people are trying to help, this can be very counterproductive. The mitigation for this is to explain the above to the team, and get the team leads to keep an eye on who is in the ticket system.

Fundamentally this is the dedicated engineer (with the best of intentions) second-guessing the corporate priorities and the allocation of resources.
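
The arithmetic behind that scenario is easy to sketch. The eight-hour day and the 16-hour card are assumptions chosen for illustration; the two hours a day on tickets is the "couple of hours" from the scenario.

```python
import math

HOURS_PER_DAY = 8  # assumption: a nominal working day


def days_to_finish(card_hours, ticket_hours_per_day=0):
    """Whole days needed to finish a card, given hours per day lost to tickets."""
    hours_on_card = HOURS_PER_DAY - ticket_hours_per_day
    return math.ceil(card_hours / hours_on_card)


CARD_HOURS = 16  # hypothetical card: two days of focused work

print(days_to_finish(CARD_HOURS))     # 2 days with no ticket work
print(days_to_finish(CARD_HOURS, 2))  # 3 days with two quiet hours on tickets
```

The card looks like a three-day card, so similar cards get sized up, while the tickets closed on the quiet get credited to the IM.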

Conclusion

When this works well, the IM can keep BAU work under control and prevent it from impacting the delivery of project work as measured by the velocity of the team in delivering story cards.

It allows you to involve the business owner in the trade-off between customer service (good BAU, short ticket queues) and throughput and story delivery.

To use a technical analogy, you can optimise for interrupts and responsiveness (as most desktop OS schedulers do) or you can optimise for throughput at the cost of responsiveness (as server OS schedulers do). You need desktop scheduling for BAU work, and server scheduling for project/story delivery.

At LMAX we are now pretty good at training the rest of the company to use the IM, and this works well for us. Even during those weeks from hell, we can deliver story points, as long as people remember that BAU and stories don't mix.

Written by atp

Sunday 04 September 2011 at 1:00 pm
