Opsgenie is an Atlassian tool for incident management and alert notifications. You might be asking yourself why you need it. The simple answer is that you don't want to spend your time putting out fires, do you? Understanding what Opsgenie offers will help you make better decisions and manage those fires when they do arise.
You can use Opsgenie to manage all your incident, problem, and service management use cases. Once a service request is created, notifications can be triggered and sent to a specific user or agent to alert them to the ongoing incident and the action required. Different rotations can be scheduled to alert a specific user; this is done with the On-Call Schedule feature, which follows a structured timeline. You can then create an “Escalation policy” that defines the series of levels to notify when a notification has not been acknowledged (i.e. the notification process flow hasn’t moved to the next stage once received).
Likewise, there’s also a routing rule, which further defines how a notification is routed, either from L1 to L2 or even L3 as defined in the process flow of the RFI, or based on a user’s work schedule by following the On-Call Schedule.
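To make the escalation idea concrete, here is a minimal sketch in Python of how a policy with L1, L2, and L3 levels might decide who has been notified so far. It is only an illustration of the concept and not how Opsgenie implements it: the `EscalationStep` class, the `responders_to_notify` function, the responder names, and the wait times are all hypothetical.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class EscalationStep:
    """One level of a hypothetical escalation policy."""
    level: str          # e.g. "L1", "L2", "L3"
    responder: str      # person or schedule to notify (illustrative names)
    wait_minutes: int   # minutes after alert creation before this step fires


def responders_to_notify(steps: List[EscalationStep],
                         minutes_since_alert: int,
                         acknowledged_at: Optional[int] = None) -> List[str]:
    """Return the responders notified so far.

    A step fires once its wait time has elapsed, unless the alert was
    acknowledged at or before that step's wait time.
    """
    notified = []
    for step in sorted(steps, key=lambda s: s.wait_minutes):
        if step.wait_minutes > minutes_since_alert:
            break  # this step is not due yet
        if acknowledged_at is not None and acknowledged_at <= step.wait_minutes:
            break  # acknowledged before this level was reached
        notified.append(f"{step.level}:{step.responder}")
    return notified


# Assumed example policy: values are made up for illustration.
policy = [
    EscalationStep("L1", "on-call engineer", 0),
    EscalationStep("L2", "team lead", 10),
    EscalationStep("L3", "engineering manager", 30),
]

# 15 minutes in, nobody has acknowledged: L1 and L2 have been notified.
print(responders_to_notify(policy, 15))
# Acknowledged after 5 minutes: escalation stops at L1.
print(responders_to_notify(policy, 45, acknowledged_at=5))
```

In this sketch the acknowledgement is what stops the chain, which mirrors the point above: an escalation policy only moves to the next level while a notification remains unacknowledged.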
When you’re deploying an ITSM service, your services are expected to be running 24/7, and when something goes wrong, team members are supposed to respond immediately. This process is called incident management, and it is a complex challenge for companies big and small.
An incident is an event that causes disruption to or a reduction in the quality of services that requires an emergency response. An incident is resolved when the faulty service resumes its full functionality in its usual way. This includes only those tasks required to restore full functionality and excludes follow-on tasks such as root cause identification and mitigation, which are part of the postmortem.
Detect → Be the first to know
Respond → Escalate
Recover → Resolve quickly
Learn → Never blame
Improve → Identify the root cause to avoid repeats
Alerting and on-call management: The use of Opsgenie to manage on-call rotations and escalations.
Chat room: a real-time text communication channel is fundamental to diagnosing and resolving the incident as a team.
Video chat: use a tool that can help you quickly discuss and agree on approaches in real time.
Incident tracking: every incident is tracked as a Jira issue, with a follow-up issue created to track the completion of postmortems.
Many companies place Opsgenie at the centre of their incident response process. It helps them rapidly respond to, resolve, and learn from incidents.
Opsgenie streamlines collaboration by automatically posting information to chat tools like Slack and Microsoft Teams, and by creating a virtual war room complete with a native video conference bridge. The deep integrations with Jira Software and Jira Service Management provide visibility into all post-incident tasks, and Confluence is used to share postmortems via blogs.
This involves the incident manager using the available tools to communicate with members of the team once an incident occurs.
What is the impact to customers (internal or external)?
What are customers seeing?
How many customers are affected (some or all)?
When did it start?
How many support cases have customers opened?
Are there other factors, e.g., public attention, security, or data loss?
If you have an outage, in most situations you’re busy firefighting the problem. Your support team starts to see queues fill up and you begin to receive messages on social media platforms. Good incident response isn’t just about getting services back up quickly; it is also about providing upfront and frequent feedback to your customers. Poor communication can have a real impact if you’re running a large enterprise application for multiple teams in your organization. If a fault develops and becomes a problem, the wrong choice of communication platform can spell doom for you and leave you with frustrated customers, which in turn affects the business. A good incident plan is a journey, not a destination. It’s something you constantly improve and iterate on.
Your first responders might be all the people you need in order to resolve the incident, but more often than not, you need to bring other teams into the incident by paging them. We call this escalation.
The key system in this step is a page rostering and alerting tool like Opsgenie.
After you escalate to someone and they come online, the incident manager (IM) delegates a role to them. As long as they understand what their role requires, they will be able to work quickly and effectively as part of the incident team.
A postmortem is a written record of an incident that describes:
The incident’s impact.
The actions taken to mitigate or resolve the incident.
The incident’s causes.
Follow-up actions taken to prevent the incident from recurring.
Opsgenie can automatically generate postmortem reports using prebuilt templates. A postmortem seeks to maximize the value of an incident by understanding all contributing causes, documenting the incident for future reference and pattern discovery, and enacting effective preventative actions to reduce the likelihood or impact of recurrence.
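As a rough illustration of the written record described above, the following Python sketch holds the four parts of a postmortem (impact, actions taken, causes, follow-up actions) and renders them as plain text. It is not Opsgenie's prebuilt template; the `Postmortem` class, its field names, and the sample incident are hypothetical.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Postmortem:
    """Hypothetical container for the four parts of a postmortem."""
    title: str
    impact: str
    actions_taken: List[str] = field(default_factory=list)
    causes: List[str] = field(default_factory=list)
    follow_ups: List[str] = field(default_factory=list)

    def render(self) -> str:
        """Render the record as plain text, flagging empty sections."""
        lines = [f"Postmortem: {self.title}", "", "Impact", self.impact]
        sections = (
            ("Actions taken", self.actions_taken),
            ("Causes", self.causes),
            ("Follow-up actions", self.follow_ups),
        )
        for heading, items in sections:
            lines += ["", heading]
            lines += [f"- {item}" for item in items] or ["- (to be completed)"]
        return "\n".join(lines)


# Made-up incident, purely for illustration.
report = Postmortem(
    title="Checkout service outage",
    impact="Customers could not complete purchases for 40 minutes.",
    actions_taken=["Rolled back the latest deployment"],
    causes=["Configuration change removed a required setting"],
)
print(report.render())
```

Leaving a visible "(to be completed)" marker under an empty section is one simple way to keep follow-up actions from being forgotten once the incident itself is resolved.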
As a modern incident management tool, Opsgenie integrates well with other software. One way this is done is via webhooks: with Jira Service Management Automation, you can configure an automation rule that watches for events and notifies Opsgenie, which can then act on the problem at hand. In the Opsgenie admin console, navigate to the “Alert” page to acknowledge that an alert has been responded to.
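For a sense of what such a notification can look like on the wire, here is a minimal Python sketch that builds a request for the Opsgenie Alert API (`POST https://api.opsgenie.com/v2/alerts`, authenticated with a `GenieKey` header). The helper name `build_create_alert_request`, the API key placeholder, the team name, and the alert text are assumptions for illustration; the sketch only builds the request and does not send it.

```python
import json
import urllib.request

OPSGENIE_ALERTS_URL = "https://api.opsgenie.com/v2/alerts"


def build_create_alert_request(api_key: str,
                               message: str,
                               description: str = "",
                               priority: str = "P3",
                               team: str = "") -> urllib.request.Request:
    """Build (but do not send) a create-alert request.

    The helper itself is hypothetical; the fields used here (message,
    description, priority, responders) follow the Alert API's JSON body.
    """
    payload = {"message": message, "priority": priority}
    if description:
        payload["description"] = description
    if team:
        # Route the alert to a team by name.
        payload["responders"] = [{"name": team, "type": "team"}]
    return urllib.request.Request(
        OPSGENIE_ALERTS_URL,
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Content-Type": "application/json",
            "Authorization": f"GenieKey {api_key}",
        },
        method="POST",
    )


# Placeholder values: replace with a real integration API key and team.
request = build_create_alert_request(
    api_key="placeholder-key",
    message="Checkout service is returning errors",
    description="Raised by an automation rule",
    priority="P2",
    team="Platform",
)
print(request.get_method(), request.full_url)
print(request.data.decode("utf-8"))
```

To actually raise the alert you would pass the request to `urllib.request.urlopen(request)` with a valid API key. An automation rule configured to send a web request would post an equivalent JSON body.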
Use the “Analytics” page to generate reports on what’s happening with your notifications, such as alert reports or an insight into your monthly notifications, including the time taken to acknowledge or to resolve an incident.
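The two timing figures mentioned here are simple averages. The sketch below shows, under assumed data, how mean time to acknowledge and mean time to resolve could be computed from alert timestamps. The `alerts` records and their field names are hypothetical and do not reflect the format of any Opsgenie export.

```python
from datetime import datetime
from statistics import mean

# Made-up alert records with ISO 8601 timestamps.
alerts = [
    {"created": "2024-01-05T10:00:00", "acknowledged": "2024-01-05T10:04:00",
     "resolved": "2024-01-05T10:40:00"},
    {"created": "2024-01-12T22:15:00", "acknowledged": "2024-01-12T22:25:00",
     "resolved": "2024-01-12T23:45:00"},
    {"created": "2024-01-20T03:30:00", "acknowledged": "2024-01-20T03:31:00",
     "resolved": "2024-01-20T03:50:00"},
]


def minutes_between(start: str, end: str) -> float:
    """Minutes elapsed between two ISO 8601 timestamps."""
    delta = datetime.fromisoformat(end) - datetime.fromisoformat(start)
    return delta.total_seconds() / 60


def mean_minutes_to(records, field_name: str) -> float:
    """Average minutes from alert creation to the given event."""
    return mean(minutes_between(r["created"], r[field_name]) for r in records)


mtta = mean_minutes_to(alerts, "acknowledged")
mttr = mean_minutes_to(alerts, "resolved")
print(f"Mean time to acknowledge: {mtta:.1f} min")
print(f"Mean time to resolve: {mttr:.1f} min")
```

Tracking both numbers month over month gives a simple view of whether on-call response is getting faster or slower.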
Group your notifications and alerts by team using the “Teams” page. Create a team and add members to it, then configure an on-call schedule tailored to that team so it can act on an incident once it arises. You can also integrate alerts with other software, define policies for your team, and enable custom roles that control what a user can do; you assign such a role to a user in the “Members” tab of the “Teams” page.
“ The sentiment—wanting to reduce the impact of incidents—is correct, but aiming for zero incidents is a mistake ”
Change in an organization means risk, and risk means incidents. Where there is risk, incidents are bound to happen, and an incident should mark the beginning of a process of improvement. Once you understand that, you can focus on the goals:
What you’ve learned from the incidents
How well your organization can respond to incidents
A good incident process is fast and predictable: it turns detection into response, escalates to the right people by the shortest path, makes communications clear, and keeps your customers in the loop. The primary goal of an incident process is fast resolution, and the process itself is a measure of “how well you respond”.
The primary goal of a postmortem process is to prevent repeats by learning from incidents. You have to learn from each one, extract learnings from several angles, and find the best way to mitigate the issue. Read more in the Atlassian incident handbook.