I'm John Allspaw, co-founder of Adaptive Capacity Labs, where we help teams use their incidents to learn and improve. We bring research-driven methods and approaches to drive effective incident analysis in software-reliant organizations.
Previously, I was Chief Technology Officer at Etsy. I also have a Master’s degree in Human Factors and Systems Safety from Lund University. I’ve worked at various other companies in software systems engineering and operations roles, and authored several books on these topics.
I’ve noticed a gap between how we tend to think incidents happen and how they actually happen in reality. Real incidents are messy. And common measurements around incidents tend to generate very little insight about the events, or about what directions a team or company might take to make improvements.
But with training and guidance, I believe teams can better understand what their incidents are trying to tell them and really take advantage of these important events to shed light on areas critically valuable for the business.
Start submitting questions now below (yes, it says “answers”) and tune in Tuesday, December 11th from 11 am to 12 pm PST (2-3 pm EST) when I'll be answering questions live.
See you then,
Hey Kat -
Indeed, sometimes life comes at us fast!
One way to think of this is that the quality of what happens after a post-incident review meeting critically depends on what happens before (and during!) that group meeting. A significant opportunity that many organizations miss is what happens in the time between an incident and whatever form of post-incident meeting takes place. These group meetings (sometimes called “post-mortems”) can be really expensive and therefore preparing for them is time well spent.
A significant part of effective preparation for a group post-incident review meeting is finding what areas will yield the greatest participation by the group. A common pattern we see in practice is a post-mortem meeting that is a simple re-hash of what was said on the phone bridge or #warroom chat log, not an exploration of what fueled that dialogue. Look for indications that people responding to the incident were:
If you can look for what was difficult for people in the incident (not just descriptions of what the tech was doing) then you’ll have a better chance that people will reach those “ooohhhhh!” moments. When you can create the conditions where people can realize gaps in their understandings of how the “systems” are supposed to work, then what comes after the meeting will be clearer and more valuable.
Incidents are complicated surprises, and effective incident analysis explores the guts of what made them surprising. Produce more “a ha!” moments in the meetings, and more people will start coming to them because they’ll become known as the place you can learn things that you can’t anywhere else. :)
Nick's question below about what to include in post-incident documents is also relevant to your question. :)
During my Six Sigma process improvement days, tracking and managing incidents with an eye out for that magic 80/20 set of issues that could be addressed through some form of change, one of the biggest detriments I ran into was that incident management systems often have no data, bad data, or the wrong data around root cause analysis.
What would you recommend adding to a tracking system to help capture root cause in a way that can be used to build control charts for data analysis?
To be sure, it's genuinely difficult to determine where to apply effort (read: spend money). The main challenge is that data collections don't speak for themselves, and the typical pace of change in software companies generates new paths to failure that are not represented in past data.
That said, some systems fail long after thoughtful experts recognized they were "going sour." Sometimes the problem is not that people don't see the pattern evolving, but that they are unable to articulate the growing risk in persuasive ways. We've seen pretty big systems 'fall over' in ways that people predicted well before the overt incident started.
To answer your question, the data that we tend to find un-examined (forget about captured or recorded) are for the most part qualitative in nature. Real incidents in the wild are messy and despite the traditional belief that they can (and should) be compared and contrasted against each other via what I would call typically shallow metrics to find “trends” -- what we see in practice reveals that these are for the most part illusory (akin to a Rorschach test).
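To make the "shallow metrics" point concrete, here is a toy sketch (all numbers hypothetical) of how a single aggregate like MTTR can describe no real incident at all when durations are skewed:

```python
# Toy illustration with hypothetical numbers: a single aggregate
# "MTTR" can fail to describe any actual incident when durations are skewed.
from statistics import mean, median

# Hypothetical time-to-restore values (minutes) for one quarter.
# Most incidents resolve quickly; two are long, messy outliers.
durations = [5, 7, 8, 10, 12, 15, 20, 240, 600]

print(f"MTTR (mean): {mean(durations):.1f} min")  # ~101.9
print(f"Median:      {median(durations)} min")    # 12
```

The mean here (~102 minutes) is many times the median and matches none of the actual incidents, so a quarter-over-quarter "trend" in that number says very little about what really happened.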
Instead, see the answer I gave to Kat and Nick’s questions above. I realize that we’re not used to gathering qualitative data in software worlds and it’s a shift from the comforts of statistics. :)
Effective learning from incidents means getting as much out of them as possible, even beyond localized ‘fixes’ to the tech. Compelling narratives about what the teams were surprised by can go a very long way to bootstrap new hires, for example.
Taking a look at “The Field Guide To Understanding Human Error” (3rd edition) by Sidney Dekker is what I would suggest!
Thanks for joining us, John :D
It seems a lot of teams are getting on board with drilling into the more nuanced, holistic understandings of incident causation factors.
But then there are the "up the flagpole" teams (managers, VPs, C-suite) who still expect to be given a simplistic, single "root cause."
How do you set expectations with these teams? How do you convey to the upper levels of management that they will not be given a single-sentence root cause explanation of a recent incident?
This is a great question and I’d like to answer it in a couple of different ways. Before I do that, though, I think it’s important to acknowledge that moving from simplistic and/or singular explanations (“root cause”) to a more holistic and systems-thinking view on accidents is challenging and many domains outside of software (aviation, medicine, etc.) still grapple with this fundamental shift in perspective.
<snark>One response to an exec asking for a single-sentence explanation of an incident is to ask them to boil down their board meetings or earnings calls to a single sentence “root cause”.</snark> :)
All snark aside, I would first attempt to understand what drives their desire for simple answers to what are effectively complex events. Maybe they want a high-level explanation because they foresee needing to describe it to different audiences like auditors or board members or the press?
We avoid the term 'cause'; we prefer to characterize 'sources' and 'contributors'. Complex, heavily defended systems fail because of the combination of multiple interacting conditions and contributors, often minor or even apparently trivial ones. [See "How Complex Systems Fail"]. Frankly, we find that hands-on engineers get this quite easily. Anyone with a modicum of experience with production software systems understands how surprising events can emerge and how baffling their behavior can be -- even to their authors!
When I was CTO at Etsy, the most important concern I had regarding incidents was: are the people responsible for these systems able to “listen” to what these incidents were trying to “tell” them? Are they getting the support (time, resources, training, etc.) they need to continually up their game on extracting the most valuable insights? If not, what can I do to support them better?
Here's the thing: if people are looking hard enough for simple answers to complex challenges, they will find them. If people look hard enough for a "root cause" -- they will find it.
However, the comfort that both of those provide is pretty fleeting and doesn't produce much in the way of shining critical light on the areas that need it.
Something to ponder when being asked for a 'root cause' or other forms of simplistic explanations: ask the same questions about the *absence* of incidents. What was the 'root cause' of NOT having an incident? :)
What tips would you suggest for writing a "good" postmortem "report"? The structure of postmortems is something many companies (in my experience leading ITSM at Symantec and talking to many customers) seem to struggle with, e.g., what to include, and what would really make a difference?
Hey Nick -
I think you’ve captured a core challenge people doing incident analysis face: there is really no end to the data and context that can be put into a written exploration of an incident! What to include, what to focus on or highlight, what to skim over or dismiss as less valuable...all of these are judgment calls that need to be made by the author(s) of these documents.
My short answer is:
I will assume that the audience is internal hands-on engineers for the rest of my answer. :) Quite often, these reports are seen as just a simple vehicle upon which “remediation” or follow-up action items are placed. This is a mistake and a missed opportunity!
Capturing tasks that might help in the future is important, but not sufficient for real learning. A key thing to remember is that these documents need to be written with the reader in mind.
What will the reader want to know about the context of the incident? We could imagine that indicating when an outage happened (Cyber Monday? Same day as your IPO? When your CEO is on stage giving an interview?) might be important in some cases and not really important in others, for example.
What about the handling of the incident might be important to highlight to the reader? For example: sometimes, resolving incidents requires the arcane knowledge of long-tenured engineers who understand the weird behaviors of that legacy stuff that usually “just works.” In that case, noting who those engineers are and what they know might be very important. Maybe all the engineers needed to respond to the incident are attending a conference at the time? Capturing this data in a document can be really valuable for readers of the report who weren’t there at the time or who might want the backstory on why some piece of tech is the way it is.
Effective incident analysis requires developing (and maintaining!) the skills needed not only to write these narratives but also to know what data to collect, what people to interview, what diagrams to include, etc. Organizations that acknowledge this expertise and invest in getting better at it will find that it’s a competitive advantage. :)
Hi John! Big fan of your work with the fault injection and resiliency culture at Etsy.
I used to work on a team that would get so many recurring issues in production that it was customary to let them try to auto-resolve before investigating. Incidents would frequently be closed without implementing a fix for the root cause.
"Oh, it's that issue again, reboot this instance to fix it."
I've heard the best thing to do is not to close the incident, but keep it open till a fix for the root cause is in place - which makes sense.
Do you have any recommendations around starting a culture that prevents a negative cycle like this? I love your write up on blameless post-mortems, and I feel like "Just Culture" could be applied here as well...
Do I push for a new process or tool to help accomplish a shift in perspective? Or push for education? Any recommendations on how to tackle culture shift in general would be helpful too. Thanks so much for your time!
Hey Darrell -
As you might imagine from my other answers in this AMA, I tend to generate more questions in response to questions. :)
In theory, all incidents will be investigated sufficiently as they arise and they will be resolved in prompt fashion. In practice, we tend to see a very different picture.
I would consider how clear and unambiguous the definition for an "incident" is - what we find is that what constitutes an "incident" (much like "severity level") is much more negotiable and contextual than many would like to admit.
On the one hand, this simple example utterance, "Oh, it's that issue again, reboot this instance to fix it," could be seen as a signal of laziness or undisciplined recording of an issue.
On the other hand, we could also construct an explanation for that differently:
"Culture" is a tough term to tackle (I suspect you'd rather I not be "hand-wavy" in my answers!) because it can refer to incentives, motivations, norms, behaviors, or many other phenomena.
An alternative to looking at how to get people to do a specific thing that they're not (or unevenly) doing is to look closely at what they actually are doing. When we do assessments for organizations, we find a great deal of work that is both critically important and for the most part hidden from view!
I think it's quite a good idea to explore at a meta-level how people perceive their work and what value they see in recording/capturing the 'incident' data. But let them explain the context for what they pay attention to (and what they don't) before embarking on persuading them to be diligent about something they might not find valuable.
I am curious about your stance on why timing metrics are meaningless (MTTD, MTTR, and such). I agree that they don't actually tell us much about incidents, but then how do we measure success in Incident Management over time? And more importantly, how do you explain that to leadership?
Hi Vanessa -
Indeed, this is a significant challenge, right?
Preventative design is a goal that is always out of reach. "Bug free" software and "surprise free" architectures do not exist. Therefore, you want to get the most out of the incidents that you do have.
What does success look like? Here are a few indicators that things are going well with respect to extracting meaning and insight from incidents:
We have seen these things, it's possible! :)
For sure, that situation is a tough one. I understand and have experience with the "shoveling sand against the tide" perspective. :)
Remember that effective incident analysis (like engineering teams!) needs to be adaptive. If the tempo of incidents is high, or outpaces a team's ability to make progress on genuinely valuable measures, then I would consider:
In the end, it's still possible to get "squashed" with time. If these events are genuinely understood as incidents to explore then it is worthwhile to spend some attention on what these follow-up action items are actually helping with. :/
Well, my short answer is that my company does that training. :)
The longer answer would be to spend some time working out what expertise is needed that you don't have, and what it would look like if you had that expertise.
Would incidents be easier to work through? Do you think they'd be shorter?
Hey John -
You mention that organizations tend to lack direction once insights into root cause are determined.
What would you say the top three biggest pitfalls are once the 'storm' has ended and we're left to make the same mistakes again?
How do you work around those biggest pitfalls and make incident management a more iterative and collaborative experience for everyone?
Hi Meg -
Only three? :)
There are a couple of notions to unpack in your questions. I think you're spot-on with the idea that incident management (pre, during, and post!) can be an iterative and collaborative experience. In fact, I'd say that gathering and contrasting multiple (and sometimes quite different!) perspectives and experiences is inherent in doing this sort of thing well.
Three significant pitfalls are: counterfactual reasoning, narrowing "micro-fixes", and linear or simplistic causality.
Counterfactual reasoning is a force to be reckoned with. Quite often we'll find people listing actions that were *not* taken as a contributor (or a "cause") of an incident. Or conditions that were *not* present.
For example: "The engineer didn't double-check the test results before running the next command." or "The incident happened because there were no alerts set up for X."
Follow me here...what counterfactuals (did not have, should not have, could have, etc.) do is describe (quite literally!) a reality that didn't exist, as a way to explain a reality that did exist. These descriptions are always given in hindsight, and aren't helpful. What we want to know instead is what people did do, and what brought them to do what they did. People do what makes sense to them at the time; this is what we want to understand...how it made sense for them to do (or not do) what they did.
Narrowing "micro-fixes" means scoping your exploration of an incident only based on what you can change or influence in the future. This limits the exploration and constrains the set of potential routes to take and insights to share.
By "linear or simplistic causality" I mean the most common manifestation of hindsight bias, which is the tendency to construct a narrative of the form "A then B which makes C which makes D, etc." as if it were a chain of events that simply needed breaking at some point. This isn't very helpful either because complex systems (software included) don't follow linear chains but instead represent a network of conditions and forces and sources and triggers.
I mentioned above, but the best resource I could suggest is "The Field Guide To Understanding Human Error" by Sidney Dekker. Get the 3rd edition. :)
I think you hit the nail on the head with simplistic causality, which kind of influences the other two you mentioned.
We see issues as the very narrow thread between "it broke - let's fix" and most organizations lack the ability to zoom out, if you will, and see how the part affects the whole.
Thanks for the book recommendation, I'll be sure to check it out.
1) Where should one start implementing the right incident management frameworks when things are currently not in order?
2) Should we always think about incident management from scratch? Or should we clone the existing best knowledge out in the world (cloning any one of Google, Netflix, etc.)?
Hey Suraj -
#1 - I would say that it depends critically on what actually is not in order. Organizations can't just take an "off the shelf" incident management framework and apply it. Well, they can, but one-size-fits-all approaches like this never really work.
#2 - Even if groups think they're "cloning" practices from other groups, they're still starting from scratch. See #1 above.
Not likely to be satisfying answers, but they are ones I feel firmly about. :)