Best practices for incident management
15 min
Intermediate
By the end of this lesson, you’ll be able to:
- Implement best practices for managing and learning from major incidents
- Apply best practices at each stage of the incident management process
Can we avoid incidents?
Designing an incident process to eliminate major incidents is often tempting, but you can't eliminate all incidents.
Organizations often introduce safety gates, checkpoints, and other overheads to their software design, development, and release processes in an attempt to eliminate incidents. They carefully scrutinize every detail of the change so no bugs slip through. This is an understandable reaction to incidents, but it’s not the right approach.
Change gates and checkpoints slow the organization’s rate of change, which tends to reduce the rate of incidents in the short term but also reduces momentum. Backlogged changes pile up and ship as larger, less frequent batches, which makes each release riskier. The net effect is that your company can't make changes as fast as before, there is more overhead, and incidents still occur.
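To make the batching effect concrete, here is a minimal sketch. It assumes, purely for illustration, that every change independently carries the same small chance of causing an incident. The function name and the 2% figure are hypothetical and not drawn from real data.

```python
def batch_incident_probability(per_change_risk: float, batch_size: int) -> float:
    """Chance that at least one change in a batch causes an incident.

    Assumes changes fail independently with the same probability,
    which is a simplification for illustration only.
    """
    if not 0.0 <= per_change_risk <= 1.0:
        raise ValueError("per_change_risk must be between 0 and 1")
    if batch_size < 0:
        raise ValueError("batch_size must not be negative")
    return 1.0 - (1.0 - per_change_risk) ** batch_size


# Hypothetical 2% risk per change, shipped alone or in larger batches.
for size in (1, 10, 50):
    risk = batch_incident_probability(0.02, size)
    print(f"batch of {size:>2}: {risk:.1%} chance of at least one incident")
```

Under these assumptions, batching does not remove any risky changes. It only makes each individual release more likely to fail, and leaves more candidate causes to sift through when it does.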
Best approach to reducing incidents
The only way to prevent incidents is to stop change. However, organizations need to make changes to compete and survive, and all change entails risk. The right approach is to reduce risk in ways that allow us to continue to pursue progress.
Having an incident doesn’t mean that you've failed. Instead, regard incidents as the beginning of a process of improvement.
How well does your organization respond to major incidents? A good incident response:
- Quickly turns detection into response
- Escalates to the right people quickly
- Communicates clearly and keeps customers in the loop
- Is simple enough to follow under stress
- Is broad enough to work for a variety of incidents
The primary goal of an incident process is fast resolution, so speed and simplicity are most important.
Incidents are going to happen whether we want them to or not. Your job as the owner of an incident and post-incident review process is to:
- Make each incident as cheap an “investment” as possible (by reducing its impact and duration).
- Extract value from each incident (in the form of learnings and mitigations).
Best practices for each stage in the incident management process
Detect
Incidents can be detected by monitoring and alerting tools or by customers reporting them on the customer portal. A balanced service includes enough monitoring and alerting to detect incidents proactively, even before your customers do. The best monitoring alerts you to problems before they become incidents.
JSM Operations can integrate with many monitoring systems (such as Datadog). The challenge here is to identify what matters most in the noise of alerts.
Using JSM Operations, you can automatically notify on-call users when specific alert criteria are met.
You can also create action policies that automatically run diagnostic or remediation actions in response to incoming alerts. Through integration with third-party automation platforms (such as AWS Systems Manager), JSM Operations will trigger your response playbooks when an alert meets your predefined criteria. The system can then take corrective action without involving your on-call engineers, reducing alert fatigue and MTTR.
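The flow described above can be sketched roughly as follows. This is not the JSM Operations API: the `Alert` and `Policy` types, their fields, and the example policies are hypothetical names invented for illustration. The sketch shows the general idea of matching an incoming alert against predefined criteria, running an automated action when one applies, and paging on-call only when needed.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class Alert:
    source: str      # e.g. the monitoring tool that raised the alert
    service: str
    priority: str    # "P1" (highest) to "P5" (lowest) in this sketch
    message: str


@dataclass
class Policy:
    name: str
    matches: Callable[[Alert], bool]           # predefined alert criteria
    action: Optional[Callable[[Alert], str]]   # automated action, or None
    notify_on_call: bool


def handle_alert(alert: Alert, policies: list[Policy]) -> list[str]:
    """Apply the first matching policy and return a log of what was done."""
    for policy in policies:
        if not policy.matches(alert):
            continue
        log = [f"matched policy: {policy.name}"]
        if policy.action is not None:
            log.append(policy.action(alert))
        if policy.notify_on_call:
            log.append(f"notified on-call for {alert.service}")
        return log
    # No criteria met: record the alert without paging anyone.
    return ["no policy matched; alert recorded only"]


def restart_worker(alert: Alert) -> str:
    # Placeholder for a real remediation step (for example, a runbook
    # triggered on an automation platform).
    return f"ran remediation: restart worker for {alert.service}"


policies = [
    Policy(
        name="auto-remediate stuck queue",
        matches=lambda a: "queue stuck" in a.message.lower(),
        action=restart_worker,
        notify_on_call=False,
    ),
    Policy(
        name="page on high priority",
        matches=lambda a: a.priority in ("P1", "P2"),
        action=None,
        notify_on_call=True,
    ),
]

alert = Alert("monitoring", "checkout", "P1", "Error rate above threshold")
print(handle_alert(alert, policies))
```

Because the first matching policy wins in this sketch, the order of the list matters: an alert that qualifies for automated remediation never pages anyone, so keep such policies narrow and review them regularly.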
Classify and respond
Communicate
Communicating with your customers and stakeholders during a major incident is crucial. Keeping them informed about what's happening and what you're doing to fix the problem builds trust.
Quickly acknowledge the issue, briefly summarize the known impact, promise further updates, and, if possible, alleviate any concerns about security or data loss.
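As a rough illustration, the four elements above could be assembled into a first customer-facing update like this. The function name, parameters, and wording are hypothetical; adapt them to your own communication templates.

```python
from typing import Optional


def initial_update(
    impact_summary: str,
    next_update_minutes: int,
    reassurance: Optional[str] = None,
) -> str:
    """Build a short first status message for an ongoing incident."""
    parts = [
        # 1. Acknowledge the issue quickly.
        "We are aware of an issue and are investigating.",
        # 2. Briefly summarize the known impact.
        f"Known impact: {impact_summary}",
        # 3. Promise further updates.
        f"We will post another update within {next_update_minutes} minutes.",
    ]
    # 4. If possible, address security or data-loss concerns.
    if reassurance:
        parts.append(reassurance)
    return " ".join(parts)


print(
    initial_update(
        impact_summary="some customers cannot log in.",
        next_update_minutes=30,
        reassurance="We have no indication of data loss.",
    )
)
```

The reassurance is optional on purpose: include it only when you can actually confirm it, since a retracted statement costs more trust than a missing one.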
Investigate and recover
At the investigate and recover stage, you need to restore service as quickly as possible. Don't hesitate to act: resolving an incident quickly minimizes the impact on your customers.
High-performing incident response teams use a collection of the right tools and practices. Let's look at some of the team practices.
There’s no single process that will resolve all incidents. Instead, iterate on the following procedure to quickly adapt to incident response scenarios:
- Observe what's going on and share and confirm those observations.
- Develop theories about why the incident is happening.
- Develop experiments that prove or disprove those theories and carry them out.
- Repeat until the incident is resolved.
An incident is resolved when normal service has resumed and the current or imminent business impact has ended.
At this point, the biggest challenges for the incident commander are maintaining the team’s discipline and collaboration.
The incident commander should always be asking:
- Is the team communicating effectively?
- What are the current observations, theories, and streams of work?
- Are we making decisions effectively?
- Are we making changes intentionally and carefully? Do we even know what changes we’re making?
- Are people doing their jobs? Do we need to escalate to more teams?
As an incident commander, it's essential that you:
- Stay calm and don't panic; the rest of the team will take their cue from you.
- Keep an eye on team fatigue and plan team handovers.
A dedicated team risks burning out when resolving complex incidents.
Learn and improve
To prevent the same types of major incidents from happening again, teams need to learn from them. This is one of the most critical stages in incident management.
Create a post-incident review to:
- Understand the contributing causes
- Document the incident for future reference and pattern discovery
- Take actions to reduce the likelihood of recurrence or reduce MTTR
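Tracking MTTR requires little more than the timestamps your post-incident reviews already record. Here is a minimal sketch, assuming MTTR means the mean time from detection to resolution and that each incident is stored as a pair of timestamps. The `Incident` type and the sample data are hypothetical.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass
class Incident:
    detected_at: datetime
    resolved_at: datetime


def mean_time_to_resolve(incidents: list[Incident]) -> timedelta:
    """Average duration from detection to resolution."""
    if not incidents:
        raise ValueError("need at least one incident")
    total = sum(
        (i.resolved_at - i.detected_at for i in incidents),
        timedelta(),
    )
    return total / len(incidents)


# Made-up sample data for illustration.
incidents = [
    Incident(datetime(2024, 1, 5, 9, 0), datetime(2024, 1, 5, 9, 40)),
    Incident(datetime(2024, 2, 11, 14, 0), datetime(2024, 2, 11, 15, 20)),
]
print(mean_time_to_resolve(incidents))
```

Comparing this figure across periods is one way to check whether the actions coming out of your reviews are having an effect.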
If you think of an incident as an unscheduled investment in the reliability of your services, then the post-incident review is how you maximize the return on that investment.
Use the Atlassian Team Playbook
Having the “right” tools and following the “right” processes often aren’t enough. By building a strong culture and adopting team practices based on collaboration and transparency, organizations can develop behaviors that make them resilient and more adaptable to change.
We recommend that service teams use the following plays from the free Atlassian Team Playbook.
The Incident values play helps you identify what your team values most during incident response and create a plan to live those values consistently.
For more information, visit the Atlassian Team Playbook.