I'm John Allspaw, Ask Me Anything about incident analysis and postmortems

John Allspaw December 6, 2018

I'm John Allspaw, co-founder of Adaptive Capacity Labs, where we help teams use their incidents to learn and improve. We bring research-driven methods and approaches to drive effective incident analysis in software-reliant organizations.

Previously, I was Chief Technology Officer at Etsy. I also have a Master’s degree in Human Factors and Systems Safety from Lund University. I’ve worked at various other companies in software systems engineering and operations roles, and authored several books on these topics.

I’ve noticed a gap between how we tend to think incidents happen and how they actually happen. Real incidents are messy. And common measurements around incidents tend to generate very little insight about the events, or about what directions a team or company might take to make improvements.

But with training and guidance, I believe teams can better understand what their incidents are trying to tell them and really take advantage of these important events to shed light on areas critically valuable for the business.

Start submitting questions now below (yes, it says "answers") and tune in Tuesday, December 11th from 11 am to 12 pm PST (2-3 pm EST) when I'll be answering questions live.

See you then,

John 

---

Note from Atlassian: For more on John’s work, check out Adaptive Capacity Labs. To learn how Atlassian manages incidents, check out our incident handbook and Jira Ops.

10 answers

5 votes
Kat Warner
Marketplace Partner
December 6, 2018

Do you have tips for what happens AFTER the meeting? So often notes are written up from the meeting but after that there is a new situation, another retrospective, and the same lessons (not) learnt.

John Allspaw December 11, 2018

Hey Kat -

Indeed, sometimes life comes at us fast! 

One way to think of this is that the quality of what happens after a post-incident review meeting critically depends on what happens before (and during!) that group meeting. A significant opportunity that many organizations miss is what happens in the time between an incident and whatever form of post-incident meeting takes place. These group meetings (sometimes called “post-mortems”) can be really expensive and therefore preparing for them is time well spent.

A significant part of effective preparation for a group post-incident review meeting is finding what areas will yield the greatest participation by the group. A common pattern we see in practice is a post-mortem meeting that is a simple re-hash of what was said on the phone bridge or #warroom chat log, and not an exploration of what fueled that dialogue. Look for indications that people responding to the incident experienced:

  • Confusion (about how things were supposed to work, what they were seeing in graphs and logs, etc.)
  • Uncertainty (about who to call for more help, about what was happening, about what to do next, etc.)
  • Ambiguity (when faced with doing A or B, weighing what they thought their options were, what the consequences of getting it wrong were, etc.)

If you can look for what was difficult for people in the incident (not just descriptions of what the tech was doing) then you’ll have a better chance that people will reach those “ooohhhhh!” moments. When you can create the conditions where people can realize gaps in their understandings of how the “systems” are supposed to work, then what comes after the meeting will be clearer and more valuable.

Incidents are complicated surprises, and effective incident analysis explores the guts of what made them surprising. Produce more “a ha!” moments in the meetings, and more people will start coming to them because they’ll become known as the place you can learn things that you can’t anywhere else. :)

Nick's question below about what to include in post-incident documents is relevant to your question as well. :)

4 votes
Scott Theus
Rising Star
December 10, 2018

During my Six Sigma process improvement days, tracking and managing incidents with an eye out for that magic 80/20 set of issues that could be addressed through some form of change, one of the biggest detriments I ran into was that incident management systems often have either no data, bad data, or the wrong data around root cause analysis.

What would you recommend adding to a tracking system to help capture root cause in a way that can be used to build control charts for data analysis?

-Scott

Lee Winbush December 11, 2018

I'm living this right now as a Black Belt analyzing prior incidents.  Eager to hear @John Allspaw's thoughts.

John Allspaw December 11, 2018

To be sure, it's genuinely difficult to determine where to apply effort (read: spend money). The main challenge is that data collections don't speak for themselves, and the typical pace of change in software companies generates new paths to failure that are not represented in past data.

That said, some systems fail long after thoughtful experts recognized they were "going sour." Sometimes the problem is not that people don't see the pattern evolving, but that they are unable to articulate the growing risk in persuasive ways. We've seen pretty big systems 'fall over' in ways that people predicted well before the overt incident started.

To answer your question, the data that we tend to find unexamined (forget about captured or recorded) are for the most part qualitative in nature. Real incidents in the wild are messy, and despite the traditional belief that they can (and should) be compared and contrasted against each other via what I would call shallow metrics to find “trends,” what we see in practice is that those trends are for the most part illusory (akin to a Rorschach test).

Instead, see the answer I gave to Kat and Nick’s questions above. I realize that we’re not used to gathering qualitative data in software worlds and it’s a shift from the comforts of statistics. :)

Effective learning from incidents means getting as much out of them as possible, even beyond localized ‘fixes’ to the tech. Compelling narratives about what the teams were surprised by can go a very long way to bootstrap new hires, for example.

Taking a look at “The Field Guide To Understanding Human Error” (3rd edition) by Sidney Dekker is what I would suggest!

3 votes
blakethorne
Atlassian Team
December 10, 2018

Thanks for joining us, John :D

It seems a lot of teams are getting on board with drilling into the more nuanced, holistic understanding of incident causation factors.

But then there are the "up the flagpole" teams (managers, VPs, C-suite) who still expect to be given a simplistic, single "root cause."

How do you set expectations with these teams? How do you set expectations with the upper levels of management that they will not be given a single-sentence root cause explanation of a recent incident? 

John Allspaw December 11, 2018

This is a great question and I’d like to answer it in a couple of different ways. Before I do that, though, I think it’s important to acknowledge that moving from simplistic and/or singular explanations (“root cause”) to a more holistic and systems-thinking view on accidents is challenging and many domains outside of software (aviation, medicine, etc.) still grapple with this fundamental shift in perspective.

<snark>One response to an exec asking for a single-sentence explanation of an incident is to ask them to boil down their board meetings or earnings calls to a single sentence “root cause”.</snark> :)

All snark aside, I would first attempt to understand what drives their desire for simple answers to what are effectively complex events. Maybe they want a high-level explanation because they foresee needing to describe it to different audiences like auditors or board members or the press?

We avoid the term 'cause'; we prefer to characterize 'sources' and 'contributors'. Complex, heavily defended systems fail because of the combination of multiple interacting conditions and contributors, often minor or even apparently trivial ones. [See "How Complex Systems Fail".] Frankly, we find that hands-on engineers get this quite easily. Anyone with a modicum of experience with production software systems understands how surprising events can emerge and how baffling those systems' behavior can be -- even to their authors!

When I was CTO at Etsy, the most important concern I had regarding incidents was: are the people responsible for these systems able to “listen” to what these incidents were trying to “tell” them? Are they getting the support (time, resources, training, etc.) they need to continually up their game on extracting the most valuable insights? If not, what can I do to support them better?

Here's the thing: if people are looking hard enough for simple answers to complex challenges, they will find them. If people look hard enough for a "root cause" -- they will find it.

However, the comfort that both of those provide is pretty fleeting and doesn't produce much in the way of shining critical light on the areas that need it.

Something to ponder when being asked for a 'root cause' or other forms of simplistic explanations: ask the same questions about the *absence* of incidents. What was the 'root cause' of NOT having an incident? :)

blakethorne
Atlassian Team
December 11, 2018

Thanks, John! Very helpful advice.

3 votes
Nick Coates
Rising Star
December 7, 2018

What tips would you suggest for writing a "good" postmortem "report"? Structure of postmortems is something many companies (in my experience leading ITSM at Symantec and talking to many customers) seem to struggle with, e.g. what to include, and what would really make a difference?

John Allspaw December 11, 2018

Hey Nick -

I think you’ve captured a core challenge people doing incident analysis face: there is really no end to the data and context that can be put into a written exploration of an incident! What to include, what to focus on or highlight, what to skim over or dismiss as less valuable...all of these are judgment calls that need to be made by the author(s) of these documents.

My short answer is:

  1. Consider the audience for this particular document. Is it hands-on practitioners who are in the “trenches” of production systems? Is it the general public or customers under contract? Is it executive leaders, board members, or investors? The answer to this influences almost every part of the document’s construction!
  2. There is no One True and Complete Structure™ of a post-incident review document; all incidents have different insights to shed light on. Therefore, it’s up to the person(s) analyzing the incidents to recognize what the case has the potential to reveal about the teams, the context, the systems, etc. and focus on those.
  3. Getting good at this (not just writing post-incident documents, but data collection/collation, group interviews, etc.) takes time and practice.

I will assume that the audience is internal hands-on engineers for the rest of my answer. :) Quite often, these reports are seen as just a simple vehicle upon which “remediation” or follow-up action items are placed. This is a mistake and a missed opportunity!

Capturing tasks that might help in the future is important, but not sufficient for real learning. A key thing to remember is that these documents need to be written with the reader in mind.

What will the reader want to know about the context of the incident? We could imagine that indicating when an outage happened (Cyber Monday? Same day as your IPO? When your CEO is on stage giving an interview?) might be important in some cases and not really important in others, for example.

What about the handling of the incident might be important to highlight to the reader? For example: sometimes, resolving incidents requires the arcane knowledge of long-tenured engineers who understand the weird behaviors of that legacy stuff that usually “just works.” In that case, noting who those engineers are and what they know might be very important. Maybe all the engineers needed to respond to the incident were attending a conference at the time? Capturing this data in a document can be really valuable for readers of the report who weren't there at the time or who might want the backstory on why some piece of tech is the way it is.

Effective incident analysis requires developing (and maintaining!) the skills needed not only to write these narratives but also to know what data to collect, which people to interview, what diagrams to include, etc. Organizations that acknowledge this expertise and invest in getting better at it will find that it’s a competitive advantage. :)

2 votes
Darrell Pappa December 10, 2018

Hi John! Big fan of your work with the fault injection and resiliency culture at Etsy.

I used to work on a team that would get so many recurring issues in production that it was customary to let them try to auto-resolve themselves before investigation. Incidents would frequently be closed without implementing a fix for the root cause.

"Oh its that issue again, reboot this instance to fix it."

I've heard the best thing to do is not to close the incident, but keep it open till a fix for the root cause is in place - which makes sense.

Do you have any recommendations around starting a culture that prevents a negative cycle like this? I love your write-up on blameless post-mortems, and I feel like "Just Culture" could be applied here as well...

Do I push for a new process or tool to help accomplish a shift in perspective? Or push for education? Any recommendations on how to tackle culture shift in general would be helpful too. Thanks so much for your time!

John Allspaw December 11, 2018

Hey Darrell -

As you might imagine from my other answers in this AMA, I tend to generate more questions in response to questions. :)

In theory, all incidents will be investigated sufficiently as they arise and they will be resolved in prompt fashion. In practice, we tend to see a very different picture.

I would consider how clear and unambiguous the definition for an "incident" is - what we find is that what constitutes an "incident" (much like "severity level") is much more negotiable and contextual than many would like to admit.

On the one hand, this simple example utterance ("Oh, it's that issue again, reboot this instance to fix it.") could be seen as a signal of laziness or undisciplined recording of an issue.

On the other hand, we could also construct an explanation for that differently:

  • "just reboot it" could be a reasonable response because its replacement architecture that is robust against that particular flaw is coming shortly
  • it could also be reasonable because the one person who knew how to actually fix it permanently has left the company and you need to buy time before someone else can get up to speed
  • (I'm sure we could imagine other explanations.) 

"Culture" is a tough term to tackle (I suspect you'd rather me not be "hand-wavy") in my answers!) because it can refer to incentives, motivations, norms, behaviors, or many other phenomena.

An alternative to looking at how to get people to do a specific thing that they're not (or unevenly) doing is to look closely at what they actually are doing. When we do assessments for organizations, we find a great deal of work that is both critically important and for the most part hidden from view!

We find:

  • people solving problems that many others don't know even exist
  • people making adjustments to their work that are not in the procedures for that work, because if they were to follow the procedure by the letter, things would go kablooey!
  • people who have obscure esoteric knowledge of systems behavior (critical to know during a tricky incident!) that they don't share with others because they think everyone else already knows it!

I think it's quite a good idea to explore at a meta-level how people perceive their work and what value they think recording/capturing the 'incident' data has. But help them describe the context for what they pay attention to (and what they don't) before embarking on persuading them to be diligent about something they might not find valuable.

1 vote
Vanessa Huerta December 11, 2018

Hi John,

I am curious about your stance on why timing metrics are meaningless (MTTD, MTTR, and such). I agree that they don't actually tell us much about incidents, but then how do we measure success in Incident Management over time? And more importantly, how do you explain that to leadership?

John Allspaw December 11, 2018

Hi Vanessa -

Indeed, this is a significant challenge, right?

Preventative design is a goal that is always out of reach. "Bug free" software and "surprise free" architectures do not exist. Therefore, you want to get the most out of the incidents that you do have.
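
To make that concrete, here is a tiny, purely hypothetical sketch (the numbers are made up) of why a mean-based timing metric like MTTR reveals so little: two quarters can produce an identical average while describing completely different operational realities.

    # Hypothetical incident durations, in minutes -- illustrative only.
    from statistics import mean

    # Quarter A: a steady stream of routine, similar-sized incidents.
    quarter_a = [30, 35, 40, 45, 50, 40, 35, 45]

    # Quarter B: mostly minor blips, plus one long and confusing outage.
    quarter_b = [5, 5, 10, 10, 5, 10, 5, 270]

    print(f"Quarter A MTTR: {mean(quarter_a):.0f} min")  # 40 min
    print(f"Quarter B MTTR: {mean(quarter_b):.0f} min")  # also 40 min

The aggregate is identical, but what those two quarters have to teach the organization could not be more different; the interesting material lives in the qualitative details that the average throws away.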

What does success look like? Here are a few indicators that things are going well with respect to extracting meaning and insight from incidents:

  • Group post-incident review meetings are voluntarily attended, enough that people ask to hold them in a bigger room.
  • Transcripts of the conversation in the post-incident review meetings are read and discussed
  • New architecture design diagrams and documents reference documents of past incidents
  • Code comments include references to past incidents
  • Post-incident review documents have materials beyond simple "template" fields and contain diagrams, whiteboard drawings, related past projects, etc.
  • Post-incident review documents and discussions are used as part of training materials for engineering or customer support new hires
  • People describe the meetings and documents as uniquely valuable. One organization we've worked with had staff members report that they "go to these meetings to participate and read and contribute to these docs because I can learn things there about our systems that I can't learn anywhere else."

We have seen these things, it's possible! :)

1 vote
mappuji December 10, 2018

Hi @John Allspaw

I read your books and they are very interesting.

My question: what should we do if there are a lot of incidents in a certain period of time, and doing a post-mortem for each one leaves no time to work on improvements after each incident?

John Allspaw December 11, 2018

Hello!

For sure, that situation is a tough one. I understand and have experience with the "shoveling sand against the tide" perspective. :)

Remember that effective incident analysis (like engineering teams!) needs to be adaptive. If the tempo of incidents is high or outpaces a team's ability to make progress on genuinely valuable measures, then I would consider:

  • revisiting the commonly-held definition of "incident" (see my answer to Darrell, above)
  • taking a divide-and-conquer approach to capturing data about these events
  • revisiting the "improvement" items - how are they generated? how are they evaluated as being valuable?

In the end, it's still possible to get "squashed" with time. If these events are genuinely understood as incidents to explore, then it is worthwhile to spend some attention on what those follow-up action items are actually helping with. :/

1 vote
Deleted user December 10, 2018

Hi @John Allspaw,

What training and guidance would you suggest for a Project Management department or PMO that would like to implement incident management and communication with stakeholders?

Thanks 

John Allspaw December 11, 2018

Well, my short answer is that my company does that training. :)

The longer answer would be to spend some time working out what expertise is needed that you don't have, and what it would look like if you had that expertise.

Would incidents be easier to work through? Do you think they'd be shorter?

1 vote
Meg Holbrook
Rising Star
December 10, 2018

Hey John -

You mention that organizations tend to lack direction once insights into root cause are determined.

What would you say the top three biggest pitfalls are once the 'storm' has ended and we're left to make the same mistakes again?

How do you work around those biggest pitfalls and make incident management a more iterative and collaborative experience for everyone? 

John Allspaw December 11, 2018

Hi Meg -

Only three? :)

There are a couple of notions to unpack in your questions. I think you're spot-on with the idea that incident management (pre, during, and post!) can be an iterative and collaborative experience. In fact, I'd say that gathering and contrasting multiple (and sometimes quite different!) perspectives and experiences is inherent in doing this sort of thing well.

Three significant pitfalls are:

  • Counterfactual reasoning.
  • Narrowing "micro-fixes".
  • Linear or simplistic causality.

Counterfactual reasoning is a force to be reckoned with. Quite often we'll find people listing actions that were *not* taken as a contributor (or a "cause") of an incident. Or conditions that were *not* present.

For example: "The engineer didn't double-check the test results before running the next command." or "The incident happened because there were no alerts set up for X."

Follow me here...what counterfactuals (did not have, should not have, could have, etc.) do is describe (quite literally!) a reality that didn't exist, as a way to explain a reality that did exist. These descriptions are always given in hindsight, and aren't helpful. What we want to know instead is what people did do, and what brought them to do what they did. People do what makes sense to them at the time; this is what we want to understand...how it made sense for them to do (or not do) what they did.

Narrowing "micro-fixes" means scoping your exploration of an incident only based on what you can change or influence in the future. This limits the exploration and constrains the set of potential routes to take and insights to share.

By "linear or simplistic causality" I mean the most common manifestation of hindsight bias, which is the tendency to construct a narrative of the form "A then B which makes C which makes D, etc." as if it were a chain of events that simply needed breaking at some point. This isn't very helpful either because complex system (software being included in that) don't follow linear chains but instead represent a network of conditions and forces and sources and triggers.

I mentioned this above, but the best resource I could suggest is "The Field Guide To Understanding Human Error" by Sidney Dekker. Get the 3rd edition. :)

Meg Holbrook
Rising Star
December 12, 2018

I think you hit the nail on the head with simplistic causality, which kind of influences the other two you mentioned. 

We see issues as the very narrow thread of "it broke - let's fix it," and most organizations lack the ability to zoom out, if you will, and see how the part affects the whole. 

Thanks for the book recommendation, I'll be sure to check it out. 

0 votes
Suraj Pokhriyal January 6, 2019

Hi John,

1) Where should one start implementing the right incident management framework when things are currently not in order?

2) Should we always think about incident management from scratch? Or should we clone the existing best knowledge out in the world (cloning someone like Google, Netflix, etc.)?

John Allspaw January 7, 2019

Hey Suraj - 

#1 - I would say that it depends critically on what actually is not in order. Organizations can't just take an "off the shelf" incident management framework and apply it. Well, they can, but one-size-fits-all approaches like this never really work.

#2 - Even if groups think they're "cloning" practices from other groups, they're still starting from scratch. See #1 above. 

Not likely to be satisfying answers, but they are ones I feel firmly about. :) 

John
