Péter: Hey everyone.
Welcome to the Retrospective, the engineering leadership podcast, where
we discuss topics about engineering management, technical leadership, and
similar areas. Today, as usual, with me is Jeremy, and I'm Peter Sasz.
Welcome.
Jeremy: Welcome everybody.
Péter: What is today's topic, Jeremy?
Jeremy: Yeah, so today I wanted to talk about a challenge that I think
every engineering team faces, which is constantly being interrupted and context
switching. I wanted to talk about some of the approaches that people
take and suggest my preferred approach based on some of the experience
I've had, which is this concept of a dedicated firefighter role in the team
that rotates regularly and handles all the interruptions for the team.
So we're going to try to cover a little bit about how to implement a
firefighter role, some of the pitfalls that people experience when they try to
do it, and the best ways to be successful.
Péter: Awesome.
This is a really good topic.
I saw this working very well, and I saw the concrete impact it had
on the focus, performance, and even external perception of the team.
So I'm really curious.
Let's dig into this.
Tell me a bit about unplanned work and interruptions.
Why are they so bad?
We are a team; we need to be able to handle this, right?
Jeremy: Yeah, exactly.
It's a reality, right?
Teams get bugs, you have support contacting you, you have other teams
in the organization who need something from you or have a request for
you to do something, and people like me generate lots of Slack messages.
I joke, because obviously Slack is there.
It's ever present.
Maybe some people still use email at work too.
So there's just a constant buzz at work, and
some of this is actually really good.
Ultimately, with interruptions, when you have a support
request you're helping customers. You need to unblock other teams.
Someone in your team needs a PR reviewed, or an external team
needs someone to review something. Those are all good things.
When you have all these messages sharing info and so on,
it's about getting everybody up to the same level of knowledge so
we can all make better decisions.
There is a good aspect to all of it.
I think we always jump to the negative: interruptions are blocking me, or
unplanned stuff is coming in.
There's a benefit to it, but there's also a cost, right?
And I think that's what we focus on.
We always want more deep work, and we want to be in flow and all the rest.
But I think it's really hard, and I think it's getting harder.
I remember early in my career it was mainly email, and emails
weren't that frequent. Now Slack and instant messages are non-stop!
I remember when HipChat became really big; that was the first moment at
work where this whole thing kicked off.
I think there's one study that I've often heard quoted, which is that every switch
you make has an impact of 23 minutes.
There was another one, which I think is a really interesting
study by UC Irvine, that over half of all tasks get interrupted.
And I think they also say that any task that gets interrupted isn't very
likely to be finished in the same day.
And then there's the mental impact: stuff not getting finished, cognitive
overhead. Interruptions are very bad.
Péter: Yeah.
Tell me more about that study, because it's super interesting.
I think intuitively we all know the negative effects of interruptions.
We feel it ourselves, especially if interruptions get interrupted again
and then you don't even know where you started, but it's great to hear
that there are studies about that.
Tell me more about this.
Jeremy: Yeah.
I don't know.
Just on a personal level, when we work in manager mode (you have the maker mode
and manager mode), part of manager mode is dealing with interrupts and having
a schedule that's really fragmented.
But even then we still have to produce stuff.
And what I found really interesting about this study was that
they studied people doing knowledge work, and they
found that people were switching tasks every three minutes, which is scary.
Yeah.
When I look at myself, I'm wondering if I'm doing the same things.
And I know this bit is definitely true.
We interrupt ourselves more than others interrupt us.
So
Péter: Oh,
Jeremy: even if we handle interrupts as a team, we're still our own worst enemy.
Péter: I don't want to be very gloomy, but
I think today's age, with all this micro-content and the micro-interruptions from
our phones, is contributing to this in a negative way.
I can see in people not just the addiction aspect,
but the attention span aspect.
If a video is longer, and by long I mean 40 seconds instead of 10 seconds,
they don't watch it till the end.
I see the effects on myself also.
So all these interruptions are super, super draining.
Jeremy: Yeah.
I wonder, if they ran this study again,
would the results be even worse?
I have to say I've definitely noticed that
focusing on long-form content is getting harder because of the amount
of short-form stuff that we're getting
hit with.
Péter: We interrupted ourselves.
Let's, let's get back to this.
Jeremy: No, it's okay.
We both feel guilty about whatever
stuff we look at on our phones!
What was interesting, and it's fairly obvious in a way, is that the
more complex the task, the bigger the cost to recover from an interruption.
They said afternoon interruptions hit harder.
The more people you work with in a team, obviously,
the more interruptions you experience.
So that's definitely another case for the small team approach. And
another obvious one: the more projects you have in flight, so
the more work in progress you have,
the bigger the impact.
What was also interesting was that if you're interrupting yourself,
like we said, if I'm switching and then going back, there's actually
a faster recovery time in terms of being back on task,
whereas with an external interruption, the recovery was longer.
And if you're interrupted while you're in flow on a deep task, you're less
likely to get back, or maybe it says you might never get back, to peak focus.
And this is the thing I mentioned earlier: only
40 percent of interrupted tasks get completed the same day.
Péter: That's a problem, because I guess if you add the lost time and the
different focus overnight, then it's just so much harder to get back into focus
the next day and finish the task.
Yeah.
Jeremy: Yeah.
Even if you're interrupted during the day, there's just the ability to get back to that
deep focus to finish out the task.
And this is the final interesting bit.
This is how we're all coping, right?
We all work faster.
And we're getting more stressed because of it. I can feel it: I feel
more stressed today, with the way the world feels
like it's spinning faster, because you have to step up your pace.
And the most productive hours are early morning.
And that's true.
I find I'm up early and I can get stuff done in a focused way that
I struggle to do later in the day.
Péter: I'm curious to know if this is because you have
fewer external interruptions, because others are still sleeping.
Jeremy: Yeah, exactly. Even then, I see people who get into work early, and that's
their productive time before there's a buzz in the office or whatever.
The biggest way that we interrupt ourselves is checking email or messaging,
which I guess in this case is Slack.
And I know that's the case for myself.
What I find interesting about all of this is there are two parts.
There's personal productivity and interrupting yourself, but
there's also the bigger cost
on a team that's trying to get projects done.
It's trying to get big tasks done, and there are interruptions
coming into that team.
And it's a huge cost to the productivity of the team.
Péter: Yeah.
That's what I wanted to ask you about.
I think we talked a bit about the personal aspect.
Let's elevate it to the team aspect and see what the
worst-case scenario looks like.
Describe a team that's constantly interrupted.
Jeremy: They have a lot of projects in flight.
They're constantly being interrupted.
They're in what I would call a reactive mode, and they're not in charge
of their timeline and delivery anymore.
The quality of what they do is decreasing because they have to ship something,
because there's more pressure, but they never hit their deadlines. They're struggling
with tech debt because it's a spiral.
Things get more and more out of control,
and so you make things worse and worse over time. I think the energy in
the team and the morale decrease.
And obviously your external stakeholders and customers of the team start
to lose trust in you because of the way the team is functioning.
It can be really dire.
Things can turn out really badly if teams don't find a way to deal with this.
Péter: It sounds like really a lose-lose situation.
And it's very sad and ironic, because it's coming out of
a need to be better and respond to all the interruptions on time.
But in reality, the result is negative.
So how did you see teams solve this problem?
Because as you say, it drains team energy and morale, and it erodes customer trust.
So they know that this is not a good
place to be.
What are the common approaches you've seen?
Jeremy: It can work in an early-stage, small team.
It can work in an early-stage startup as well.
But essentially you don't really have a process. The team
just decides: you triage stuff and you prioritize as it comes in.
If you think about your time, you have many slices. Every time an interrupt
comes in, you try to have a fast, flexible way to handle it.
And it can work.
It can work at a certain level,
especially if a team is taking on board some of the common advice that you are
given for handling interruptions, which generally works at this stage:
having some kind of focus time in the team, trying to
avoid having too many topics in flight for members and for the team, and having
some kind of framework for communicating with the
outside world, maybe a specific channel.
Maybe some kind of "okay, this is a bug,
it's not too serious,
I'm going to put it in the backlog and we'll do it in the
next sprint" kind of approach.
All of this is having some kind of lightweight framework.
This is where most teams
sit; it's almost the default mode.
And I think that works, right?
I think it works for a certain stage.
But I think the wheels can come off quite
quickly as teams become bigger.
As we talked about, the more connections there are in the team, the
more interrupts you get.
If a business becomes more successful or your product is growing, you have
more users, more customers, more support requests, more things going on,
or you're part of a busy organization where there's just a lot going on. I
think what ends up happening then is that your backlog explodes.
Your team is constantly context switching. If you're using sprints,
you might not be meeting your sprint goals.
Or if you're in Kanban mode, your tasks
will get stuck and not advance, and then they'll bunch up.
You'll fall into all sorts of traps of trying to still push more, and
ultimately you're not predictable.
Your delivery is not predictable anymore because of all those interruptions
that are sidetracking you all the time.
Péter: Yeah, this is a good summary.
The way I see these ad hoc or early-stage approaches is that
they focus a bit more on the individual than on the team.
They have some good ideas and prescriptions for individuals, like focus
time and limiting task switches, but they don't have a holistic approach
to the whole problem of interruption.
So, but,
Jeremy: I love that framing actually, I think that makes a lot of sense.
Péter: I mean, I don't know what you're going to come up with, but I
hope that later approaches are a bit more holistic
and approach the interruption problem better.
What else did you see working in bigger teams or more successful
teams that were handling this?
Jeremy: Yeah, so then I think there's this approach that's been
popularized by Basecamp and Shape Up, which is the cool-down or
dedicated sprint approach, essentially.
You try to ignore the majority of your interrupts, and then
you handle them in a batch mode.
I see different variations of this: one day a week, or after every sprint we have a
cool-down, or whatever. For example, in Shape Up,
they have six-week sprints, then a cool-down, and then another six weeks.
I think there's an aspect of
this that goes to a really good principle, which is timeboxing and batching.
And so it's a great idea.
Teams get desperate and they say, we've got loads of bugs, so
let's stick them into a cool-down.
And I've seen that several times now.
"Oh, that's a cool-down task.
We'll do that."
It works, but I think there
are some things that break down.
First of all, the cool-down can get overfilled.
And I think the other part is that this does not work
where you need to respond quickly to something in a timely way.
Péter: Yeah, you cannot say to all requests: thanks
for your request,
I will handle it next week.
Jeremy: Yeah, exactly.
Bugs become outdated, right?
If I have a bug and then I push it out by even two weeks, how can I reproduce it?
And there's that context switch too: the thing about bugs is the longer you
leave a fix, the further you are from when you originally wrote the
code, and the harder it is to switch
back into that context.
So at least for me, I feel like we want to
triage and fix bugs as soon as possible: shift left,
fix them as early in the process as we can.
I think those cool-downs can overflow, and obviously the user
experience of this is not great.
We talked a little bit about it.
Péter: I mean, if you talk about user experience as the customer
perception of the team and the service the team provides, arguably
there is an improvement, because at least the team is more predictable
than with the ad hoc handling.
So the stakeholder might be upset that their problem is only going
to be handled in two weeks, but that's something they can plan with,
as opposed to the ad hoc mode, where one request gets handled immediately
and the other one in three months.
Jeremy: That's exactly it.
So you are trading off responsiveness for being
more predictable on delivery.
And so this can work, and it does work.
Where you'll see it's not working is when you have crises that interrupt.
And I think your team in this mode is also going to struggle on
user responsiveness.
So I think it's very hard to run in this mode and still avoid interruptions,
Péter: Yeah.
Jeremy: because what you end up
with is a combination of cool-down and ad hoc
to operate properly.
Péter: I guess it depends on the discipline and how well you can keep it.
But to your point, if production is down, then you cannot just
push it to the cool-down period.
So whatever discipline you keep and however strongly you
push back, there are always interruptions that creep through.
Okay.
So thanks for these approaches.
Tell us a bit more about the firefighter role, because it seems
like that's a good compromise.
Jeremy: Honestly, this is the pattern that I found works really well, especially
when you have a bit more of an established organization, with regular support and regular
things flowing into the team, or you're
part of a bigger organization.
No matter how much we try to remove dependencies between teams,
we still have to communicate.
And this is also
an important role, I would say, for platform teams that provide a service
internally to other teams in the organization, where they have both a build and a run
element. Most teams have some run element.
So with the firefighter, instead of having team interrupts, you're
having a rotating role in the team that is
dedicated to handling all of those
interrupts. It allows everybody else to stay focused on their work, and the
firefighter is doing all the run tasks.
They're handling any kind of production incidents, customer support requests,
requests from other teams, and so on.
They're like the first-line support and triage.
Sometimes you still have to interrupt the team,
but they handle the majority of those things.
And then this is a really important part of this role, which I think teams get
wrong and where they fail: the firefighter doesn't do this on top of their normal work.
For the period of
their rotation, you need to plan for them not to be doing any planned work.
And the work that they should be doing beyond all of those support requests is
learning what causes the team to have those requests and
preventing them in the future.
So it might be fixing bugs, but it's more identifying patterns and then figuring
out: oh, we need some better monitoring to avoid this, or I need to create some
quick tooling or something self-service.
I'm going to improve the docs because we're getting these requests.
This piece of tech debt is really
impacting us every time, so I'm going to work on this. Ideally those
are more interruptible tasks, because you're the one
that's supposed to be interrupted.
So this is a person that can work in that more ad hoc mode, but it's just
that one individual, with prioritization.
So anytime that teams are struggling to deal with interruptions, this model
has worked well consistently for me.
Péter: Hmm.
That's interesting.
Let's talk a bit about the practical aspects.
How do you implement something like this in a team?
What are the things to pay attention to?
And then maybe you will answer some of my
questions about the risks before I pose them.
Jeremy: So when you're implementing the rotation, I
think there are some patterns for success.
I think the minimum
length of a rotation for the role is one week, though often two weeks is what people
do, because then you have enough time to actually tackle some of the debt stuff.
Essentially you need a long enough time to build some context,
but it can't be so long that you get frustrated or burnt
out from all of the interrupts.
If you're working in sprints, it makes sense to
align it with the sprints that your team is doing, with the same cadence.
And I think it's important to have some kind of handover
between the person who's finishing and the person doing the next one,
providing maybe some kind of status on ongoing stuff if it's going beyond the rotation.
There's also having some kind of clear criteria, which you
develop over time, for that person to escalate into the team if they can't handle things themselves,
or if they're being overwhelmed by a lot of issues. That's also an
important part, by the way: if that one person is being overwhelmed
all the time, every time, then I think the team has an issue that goes
beyond needing a firefighter role.
I think at that point you need to maybe reconsider what is triggering
so much that it takes more than one person to constantly handle it.
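The rotation described here (a fixed slot of one or two weeks, everyone taking a turn, and a named person to hand over to) can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the `Shift` type, the `build_rotation` helper, the team names, and the start date are all made up for the example, and the two-week default simply mirrors the sprint-aligned cadence mentioned above.

```python
from dataclasses import dataclass
from datetime import date, timedelta
from itertools import cycle, islice


@dataclass(frozen=True)
class Shift:
    firefighter: str    # person shielding the team during this slot
    start: date         # first day of the rotation slot
    end: date           # last day of the slot (inclusive)
    hands_over_to: str  # next person, i.e. who attends the handover


def build_rotation(team, first_day, slot_days=14, slots=None):
    """Round-robin schedule: one firefighter per slot, everyone takes a turn."""
    if not team:
        raise ValueError("team must not be empty")
    slots = len(team) if slots is None else slots
    # Take one extra name so the last shift also knows who takes over.
    people = list(islice(cycle(team), slots + 1))
    schedule = []
    for i in range(slots):
        start = first_day + timedelta(days=i * slot_days)
        end = start + timedelta(days=slot_days - 1)
        schedule.append(Shift(people[i], start, end, people[i + 1]))
    return schedule


# Hypothetical three-person team, with slots starting on a Monday.
rotation = build_rotation(["Ana", "Ben", "Chloe"], date(2024, 1, 8))
for shift in rotation:
    print(f"{shift.start} to {shift.end}: {shift.firefighter} "
          f"(hands over to {shift.hands_over_to})")
```

Passing `slot_days=7` gives the one-week minimum instead. The schedule deliberately carries no sprint work for the person on duty, since the point of the role is that the slot is planned as firefighter time only.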
Péter: Yeah, I think in this case, the person who
cannot handle it because there are just too many interrupts
is going to have a very good idea of how to make an
impactful improvement in the systems and just get into a better state.
Maybe it's documentation, maybe it's setting up some self-serve system.
Maybe it's just tech debt.
Yeah.
Yeah.
Thank you.
Yeah, that's
Jeremy: If the team can't handle that in the course of that firefighter time
slot, then they need to prioritize those tasks as part of the actual sprint, or
prioritize them in their backlog, to make it a whole-team focus.
And I think the last thing is that having some kind of tracking of the
kinds of requests coming in is helpful, to be able to spot the patterns and to
look back and be a bit more data-driven.
I'm trying to say this in a way that I'm not just saying create tickets, because
Péter: what I wanted to ask
Jeremy: that can be really
evil and heavyweight and so on, but yeah, some kind of
way to just understand all the requests that you have.
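A lightweight way to track incoming requests without a heavy ticketing process might look like the Python sketch below. It is an assumption-laden illustration, not a tool mentioned in the conversation: the `Interrupt` record, its fields, the `top_patterns` helper, and the sample entries are all hypothetical.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass(frozen=True)
class Interrupt:
    source: str    # who asked, e.g. "support", "sales", "platform-team"
    category: str  # free-form label chosen by the firefighter
    minutes: int   # rough time spent handling the request


def top_patterns(log, n=3):
    """Return the n categories that consumed the most firefighter time."""
    minutes_by_category = Counter()
    for item in log:
        minutes_by_category[item.category] += item.minutes
    return minutes_by_category.most_common(n)


# Made-up entries for one rotation.
log = [
    Interrupt("support", "password-reset", 15),
    Interrupt("sales", "demo-environment", 45),
    Interrupt("support", "password-reset", 20),
    Interrupt("platform-team", "deploy-question", 10),
    Interrupt("support", "password-reset", 25),
]
print(top_patterns(log, n=2))
```

A summary like this could be brought to the handover or the retrospective: a category that keeps showing up at the top is a candidate for better docs, self-service tooling, or a planned piece of tech debt work.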
Péter: Well, what I like to see working is that the teams stay open.
They usually have Slack or some internal chat tool.
They're open to requests.
They have a dedicated room, and the firefighter responds immediately:
okay, I will start to take a look at it. And while I'm looking at
it or starting to address it, can you just make sure to create a ticket for this?
You can link this chat, whatever. This communicates that the issue is
not held up until the person completes all the administration, but still puts the
responsibility on the original requester to create the ticket, and it reinforces the
culture of documentation and being data-based.
Jeremy: Yeah.
Maybe I would say, I think that there are
even better ways to handle that now than saying, I've got this, but please create a ticket.
I still think "create a ticket" is a horrible thing to
Péter: Yeah, yeah, I hear it.
Jeremy: And
now, if you're using Slack, and I
hope most people are, I hope you're not condemned to Teams, but I think
you can do this with Teams too, but
Péter: I think some of our listeners are so
Jeremy: yeah, sorry for you.
But what you can do is have emojis that you can tag on a Slack
message that trigger a workflow that creates a ticket in Jira or
Linear or whatever tool you're using.
And it just keeps the thread of that message in sync with your ticketing.
And then afterwards, you can do whatever triage you want.
And maybe some of the emojis that you add can help triage that.
But saying "create a ticket for me" is just not something
that I'm really a fan of,
you know.
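The emoji-to-ticket idea can be sketched as a small piece of Python logic. This is not the Slack, Teams, Jira, or Linear API: the event dictionary shape, the emoji-to-priority mapping, the `Ticket` type, and the example link are all hypothetical, and a real setup would replace them with whatever the chat tool's workflow feature and the ticketing tool actually provide.

```python
from dataclasses import dataclass

# Hypothetical mapping from reaction emoji to a triage priority.
EMOJI_PRIORITY = {"fire": "urgent", "bug": "normal", "bulb": "low"}


@dataclass(frozen=True)
class Ticket:
    title: str
    priority: str
    thread_link: str  # keeps the ticket tied to the original chat thread


def ticket_from_reaction(event):
    """Turn a reaction event (a plain dict with a made-up shape) into a Ticket.

    Returns None when the emoji is not one of the triage emojis.
    """
    priority = EMOJI_PRIORITY.get(event["emoji"])
    if priority is None:
        return None
    lines = event["message_text"].strip().splitlines()
    title = lines[0][:80] if lines else "(no message text)"
    return Ticket(title=title, priority=priority, thread_link=event["permalink"])


# Made-up event: someone adds the "bug" emoji to a message in the team channel.
event = {
    "emoji": "bug",
    "message_text": "Export button returns a blank page\nSeen on staging this morning.",
    "permalink": "https://chat.example.invalid/thread/123",
}
print(ticket_from_reaction(event))
```

The design choice worth noting is that the requester only writes a normal chat message; the firefighter (or anyone) adds the emoji, and the administration happens behind the scenes, with the choice of emoji doing a first pass of triage.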
Péter: Not really a service attitude.
Jeremy: Exactly.
It's not user friendly.
It's not user focused.
It's not customer focused.
So, mini rant over, but I wish that people would
focus more on making it easier.
And especially if you're dealing with customer support or sales, it's such
a barrier for them, especially sales.
When they have requests, they're running in the field
trying to chase customers, they're usually not very technical, and we need to meet them where
they are and not put straitjackets and processes around them. But that's maybe,
Péter: And I often saw the frustration of someone reaching out with something like,
hey, I noticed a bug in your systems.
It doesn't hold me back.
It's not a problem for me.
I'm just helping you out by letting you know. And then receiving feedback saying,
okay, create a ticket for that.
It's really frustrating and teaches the person not to reach out next time.
Okay, let's get out of this rabbit hole of communication
culture and just continue.
I really like this firefighter approach because externally
it shows up like a 24/7-support, always-available approach.
And that can be really good for everyone interacting with the team.
And it also creates focus time for the people who are not in the
firefighting role, so they can do some deep work. What are the common pitfalls?
How do you see this failing, or what are the things that we are
trading for this better service?
Jeremy: Yeah.
So some things that I've seen. First of all, the trap of making the firefighter
a permanent part of somebody's job,
for example the best engineer because they know everything,
which is awful because you need to push the junior members of the team.
They might need some support,
but this is how you learn.
We need to take the training wheels off. Maybe it's a pairing at first,
but there needs to be a point where the junior people in the team
are able to handle this role.
The other trap that I see, which I really dislike also, is that the manager of the team
fulfills the role.
I think if you're a hands-on manager, it's perfectly acceptable for you
to have a rotation, and you should be having a rotation as a firefighter
just like everybody else in the team.
But what I don't think is good is if you permanently do it.
I just think it's a real anti-pattern.
The other one, and this is where this implementation always falls
down, is when people make the firefighter still
have work in the sprint or whatever.
This is why a lot of this fails, because
you just overwhelm someone, and their work is still late because of all
the reasons we talked about earlier.
And then there are the other ones we briefly talked about: not having a plan when
you have multiple things kicking off that are overwhelming the
individual, and not having a handover.
Stuff falls through the cracks when you rotate and you don't
have a handover of the in-flight
topics.
I think then customer service suffers as well, or what's worse is you take
it with you into your next sprint.
If you're going to take it with you, you need to plan it as work in
the next sprint that you're going to do.
It needs to be really explicit and not a side thing where,
oh, I was helping that person during my firefighting rotation and now
I'm just going to continue somehow.
Those are all the dangers.
Yeah.
Péter: About juniors on firefighter duty:
I saw this working amazingly in one of the places I worked. There was
a junior engineer who was not very
experienced, and she was afraid of this
role, but she really shone in it because she had very
good soft skills and could help the people who were reaching out with requests.
And it was very motivating for her to see the impact her help could have.
And she balanced very well when to ask for help.
Actually, I think this is one of the risks
if you put junior people in this role: that they don't recognize
the point where they should ask for help and not try to figure it out themselves.
So if you teach them to be comfortable with not knowing things and asking for
help, and how to escalate efficiently, then this can be an amazing, motivating
learning experience for them.
Yeah,
Jeremy: That ties into the podcast episode we did about what makes a senior engineer, because
these are exactly the skills that are
part of growing a junior
person into a senior, the skillset that they need to learn
in order to improve and grow.
So, yeah.
Péter: A quick question before we summarize and say goodbye.
You mentioned the importance of knowledge transfer and making sure
that the handover happens efficiently.
And I agree, because I think one of the values of these firefighter situations
is that the internal knowledge in the team increases if it's done well.
So I'm curious, what are some practical ways you've found
to make this handover efficient?
Jeremy: I think the first is to discuss this in the team retrospective.
Ideally, if you're doing this around some kind of sprint cadence of two weeks,
then you're having a retrospective anyway,
so it's important to build it into the team's retrospective.
And then there's obviously sharing what you're learning
continuously in the daily and so on; that's an important part of it.
And finally, when you're handing over, it's probably not just a one-to-one;
maybe with the manager you have a three-way: the outgoing person,
the incoming person, and maybe the manager.
I'm not sure I would have the whole team involved in
Péter: that's what I'm thinking about.
I guess it
Jeremy: you cover that in the retrospective.
Yeah.
Sorry.
I would just cover that in the retro.
Yeah.
Péter: Yeah, because I'm thinking, if you're talking about a nine-person
team, then I think you're right.
It's a waste of time, or not the best way to spend the time. But if
it's a very small team, like three or four people,
then maybe you can combine this and
just talk briefly about what happened
during the firefighter session,
so everyone learns.
Jeremy: Yeah.
And usually, the bigger the team grows, the more you need the firefighter.
So maybe in a three- or four-person team, you're less likely to need it and you're
on top of stuff, and you may not need to dedicate a full-time firefighter.
I still think that even in small teams this pattern works way better,
because you have one person who's the point of contact for that week.
Péter: Yeah.
Yeah.
And to those who say, oh, but in a four-person team, having a dedicated
firefighter means that we are running at 75 percent capacity:
that's not true.
That's really not true, because that 25 percent is super valuable work, and
it creates stability, resilience, and scalability for the team by not
just answering requests, but also doing the little things you were talking about.
We didn't go deep into it, but
when this person is not interrupted and is just waiting to help, they
can do a lot of very impactful tech debt work: tidying up documentation,
refactoring something, looking into an obscure bug that the team never had time for.
So yeah, push back
if somebody tries to oppose this.
Jeremy: Yeah, plus you're just making the invisible visible by
having that dedicated capacity.
And a lot of our work is just about making stuff visible and visualized.
And if that's the true capacity of the team, then it's just much better to see it.
Péter: All right, we are nearing the end.
Let's wrap it up.
What are the key takeaways on how to handle interruptions in a team?
Jeremy: Interruptions are good.
Interruptions are bad.
What we're trying to achieve is not to eliminate all of them, but
to handle them more efficiently and learn from the interruptions by
constantly and regularly improving how the team works. A great pattern for
that, when ad hoc handling of incoming requests starts to break down, is to
have a dedicated person in the team.
In this case, I call it the firefighter.
And they shield the whole team.
You regularly rotate that role and you have some kind of handover. And what
makes this work well is if, during that rotation, you are actually making
improvements to how the team works or how the team interfaces with others:
things like documentation, automation, self-service.
And obviously the team is learning from what that person is doing,
sharing knowledge of the interruptions inside the team to be
able to make bigger improvements.
Péter: Awesome.
Yeah.
Nice wrap up.
Very practical.
Thank you, Jeremy.
And to our listeners, if you have someone in your personal or
professional circle whose team is struggling with interruptions,
send this episode over to them.
See you in two weeks, when we're going to tackle another
engineering leadership problem.
Jeremy: Yeah.
Brilliant.
And we'll put some links to some great resources in the show notes as well.