Opsgenie in a nutshell

What is Opsgenie?

Opsgenie is an Atlassian tool used for incident management and alert notifications. You might be asking yourself why you need it. The simple answer is that you don't want to spend your time putting out fires. Understanding the benefits of Opsgenie will help you make better decisions and manage those fires when they do arise.

incident-management.jpg

What can I do with Opsgenie?

You can use Opsgenie to manage all your incident, problem, and service management use cases. Once a service request is created, notifications can be triggered in the form of:

  • Email

  • SMS

  • Voice

These notifications go to a specific user or agent to alert them to the ongoing incident and the action required. Different rotations can be scheduled to determine which user is alerted. This is done using the On-Call Schedule feature, which follows a structured timeline. You can also create an “Escalation policy” that defines the series of levels to notify when a notification has not been acknowledged (i.e. the notification process flow hasn’t moved to the next stage once received).
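As a rough illustration of the two ideas above, the sketch below models a rotation and an escalation policy in plain Python. It is not Opsgenie's implementation or API: the `Rotation` and `EscalationStep` classes, the participant names, and the delay values are all hypothetical, chosen only to show how a structured timeline picks a responder and how unacknowledged notifications move up through levels.

```python
from __future__ import annotations

from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass
class Rotation:
    """Hypothetical on-call rotation: participants take fixed-length shifts in order."""
    participants: list[str]
    start: datetime
    shift_length: timedelta

    def on_call(self, at: datetime) -> str:
        # Count whole shifts elapsed since the rotation started, then wrap around.
        shifts_elapsed = (at - self.start) // self.shift_length
        return self.participants[shifts_elapsed % len(self.participants)]


@dataclass
class EscalationStep:
    """Hypothetical escalation level: notify `target` after `delay_minutes` unacknowledged."""
    delay_minutes: int
    target: str


def targets_to_notify(steps: list[EscalationStep], minutes_unacknowledged: int) -> list[str]:
    # Every level whose delay has already passed should have been notified by now.
    return [s.target for s in steps if s.delay_minutes <= minutes_unacknowledged]


rotation = Rotation(["alice", "bob", "carol"], datetime(2021, 2, 1), timedelta(days=7))
policy = [
    EscalationStep(0, "on-call engineer"),
    EscalationStep(10, "team lead"),
    EscalationStep(30, "engineering manager"),
]
```

With these assumed values, `rotation.on_call(datetime(2021, 2, 9))` falls in the second weekly shift, and `targets_to_notify(policy, 12)` covers the first two levels because the alert has gone 12 minutes without acknowledgement.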

Likewise, there is a Routing rule, which further defines how a notification is routed, either from L1 to L2 or even L3 as defined in the process flow of the RFI, or based on a user's work schedule by following the On-Call Schedule.
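One way to picture a routing rule is as an ordered list of conditions where the first match decides the destination. The sketch below assumes that first-match behaviour and uses hypothetical names throughout (`RoutingRule`, `route`, the `priority` and `hour` fields, and the L1/L2/L3 targets); it is an illustration of the concept, not how Opsgenie evaluates rules internally.

```python
from __future__ import annotations

from dataclasses import dataclass
from typing import Callable


@dataclass
class RoutingRule:
    """Hypothetical routing rule: send the alert to `target` when `matches` is true."""
    name: str
    matches: Callable[[dict], bool]
    target: str


def route(alert: dict, rules: list[RoutingRule], default: str) -> str:
    # Rules are checked in order; the first matching rule wins.
    for rule in rules:
        if rule.matches(alert):
            return rule.target
    return default


rules = [
    # Critical alerts skip straight to the most senior level.
    RoutingRule("critical", lambda a: a["priority"] == "P1", "L3 engineers"),
    # Outside 09:00-17:00, follow the on-call schedule instead of the service desk.
    RoutingRule("after hours", lambda a: not 9 <= a["hour"] < 17, "L2 on-call schedule"),
]
```

For example, `route({"priority": "P3", "hour": 22}, rules, "L1 service desk")` would follow the after-hours rule, while the same alert at 10:00 would fall through to the default.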

 

Incident Overview

When you deploy an ITSM technology service, your services are expected to be running 24/7, and when something goes wrong, team members are supposed to respond immediately. This process is called incident management, and it is a complex challenge for companies big and small.

What is an Incident?

An incident is an event that causes disruption to or a reduction in the quality of services that requires an emergency response. An incident is resolved when the faulty service resumes its full functionality in its usual way. This includes only those tasks required to restore full functionality and excludes follow-on tasks such as root cause identification and mitigation, which are part of the postmortem.

  1. Detect → Be the first to know

  2. Respond → Escalate

  3. Recover → Resolve quickly

  4. Learn → Never blame

  5. Improve → Identify the root cause to avoid repeats

Tooling Requirements

  • Alerting and on-call management: use Opsgenie to manage on-call rotations and escalations.

  • Chat room: a real-time text communication channel is fundamental to diagnosing and resolving the incident as a team.

  • Video chat: use a tool that can help you quickly discuss and agree on approaches in real time.

  • Incident tracking: every incident is tracked as a Jira issue, with a follow-up issue created to track the completion of postmortems.

Why use Opsgenie?

Many companies put Opsgenie at the centre of their incident response process. It helps them rapidly respond to, resolve, and learn from incidents.

reactive-vs-proactive.png

Opsgenie streamlines collaboration by automatically posting information to chat tools like Slack and Microsoft Teams, and by creating a virtual war room complete with a native video conference bridge. The deep integrations with Jira Software and Jira Service Management provide visibility into all post-incident tasks, and Confluence can be used to share postmortems via blogs.

Open Communication

This involves the incident manager using the available tools to communicate with members of the team once an incident occurs.

Assess

  • What is the impact to customers (internal or external)?

  • What are customers seeing?

  • How many customers are affected (some or all)?

  • When did it start?

  • How many support cases have customers opened?

  • Are there other factors, e.g., public attention, security, or data loss?

Transparency with your Customers

If you have an outage, in most situations you’re busy firefighting the problem. Your support team starts to see queues fill up and you begin to receive messages on social media platforms. Good incident response isn’t only about getting services back up quickly but also about providing upfront and frequent feedback to your customers. Poor communication can have a real impact if you’re running a large enterprise application for multiple teams in your organization. If a fault develops and becomes a problem, the wrong choice of communication platform can leave you with frustrated customers, which in turn hurts the business. A good incident plan is a journey, not a destination. It’s something you constantly improve and iterate on.

Escalate

Your first responders might be all the people you need in order to resolve the incident, but more often than not, you need to bring other teams into the incident by paging them. We call this escalation.

The key system in this step is a page rostering and alerting tool like Opsgenie.

Delegate

After you escalate to someone and they come online, the incident manager (IM) delegates a role to them. As long as they understand what their role requires, they will be able to work quickly and effectively as part of the incident team.

What is a Postmortem?

A postmortem is a written record of an incident that describes:

  • The incident’s impact.

  • The actions taken to mitigate or resolve the incident.

  • The incident’s causes.

  • Follow-up actions taken to prevent the incident from happening again.

Opsgenie can automatically generate postmortem reports using prebuilt templates. A postmortem seeks to maximize the value of an incident by understanding all contributing causes, documenting the incident for future reference and pattern discovery, and enacting effective preventative actions to reduce the likelihood or impact of recurrence.

Integrating Opsgenie with Jira Service Management or Jira Software

As a modern incident management tool, Opsgenie integrates well with other software. One way to do this is via webhooks: with Jira Service Management Automation, you can configure an automation rule that watches for events and notifies Opsgenie, which can then take action on the problem at hand. In the Opsgenie Admin Console, navigate to the “Alert” page to acknowledge that an alert has been responded to.
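To make the webhook idea concrete, here is a minimal sketch of the kind of HTTP request an automation could send to create an alert. It assumes the Opsgenie Alert API endpoint `https://api.opsgenie.com/v2/alerts` and the `GenieKey` authorization scheme; check these against the current API documentation for your account and region. The function name `build_alert_request`, the API key, and the alert contents are placeholders for illustration, and the request is only built here, not sent.

```python
import json
import urllib.request

# Assumed Alert API endpoint; verify against the current Opsgenie documentation.
OPSGENIE_ALERTS_URL = "https://api.opsgenie.com/v2/alerts"
VALID_PRIORITIES = {"P1", "P2", "P3", "P4", "P5"}


def build_alert_request(api_key, message, description="", priority="P3", tags=None):
    """Build (but do not send) a POST request that would create an alert."""
    if priority not in VALID_PRIORITIES:
        raise ValueError(f"priority must be one of {sorted(VALID_PRIORITIES)}")
    payload = {
        "message": message,
        "description": description,
        "priority": priority,
        "tags": tags or [],
    }
    return urllib.request.Request(
        OPSGENIE_ALERTS_URL,
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Content-Type": "application/json",
            # The API key comes from an API integration; "GenieKey" is the auth scheme.
            "Authorization": f"GenieKey {api_key}",
        },
        method="POST",
    )


request = build_alert_request(
    "placeholder-api-key",
    "Checkout service is down",
    description="Raised by a Jira Service Management automation rule",
    priority="P1",
    tags=["jsm-automation"],
)
```

Passing `request` to `urllib.request.urlopen` would send it. In a Jira Service Management automation rule, the equivalent is a web request action configured with the same URL, headers, and JSON body.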

Reporting with Opsgenie

Use the “Analytics” page to generate reports on what is happening with your notifications, such as alert reports or an insight into your monthly notifications, including the time taken to acknowledge and the time taken to resolve incidents.
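The two timing figures mentioned here are commonly summarised as mean time to acknowledge and mean time to resolve. The sketch below shows the arithmetic on a few made-up alerts; the field names (`created`, `acknowledged`, `resolved`) and the timestamps are assumptions for illustration and are not taken from an Opsgenie export.

```python
from datetime import datetime
from statistics import mean


def mean_minutes(alerts, start_field, end_field):
    """Average number of minutes between two timestamps across all alerts."""
    return mean(
        (alert[end_field] - alert[start_field]).total_seconds() / 60
        for alert in alerts
    )


# Hypothetical sample data.
alerts = [
    {
        "created": datetime(2021, 2, 1, 9, 0),
        "acknowledged": datetime(2021, 2, 1, 9, 4),
        "resolved": datetime(2021, 2, 1, 9, 30),
    },
    {
        "created": datetime(2021, 2, 1, 14, 0),
        "acknowledged": datetime(2021, 2, 1, 14, 10),
        "resolved": datetime(2021, 2, 1, 15, 0),
    },
]

time_to_acknowledge = mean_minutes(alerts, "created", "acknowledged")  # 7.0 minutes
time_to_resolve = mean_minutes(alerts, "created", "resolved")          # 45.0 minutes
```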

Working with Teams

Group your notifications and alerts by team using the “Teams” page. Create a team and add members to it. Then configure an on-call schedule tailored to that team so it can act on an incident once it arises. You can also integrate the team's alerts with other software, define policies for your team, and enable custom roles that control what a user can do. Such roles can be assigned to a user in the “Members” tab of the “Teams” page.

Reducing Incidents

“The sentiment, wanting to reduce the impact of incidents, is correct, but aiming for zero incidents is a mistake.”

Change in an organization means risk, and risk means incidents. In the presence of risk, incidents are bound to happen, and an incident should mark the beginning of a process of improvement. Once you understand that, you can focus on the goals:

  • What you’ve learned from the incidents

  • How well your organization can respond to incidents

A good incident process is fast and predictable: it turns detection into response, escalates to the right people by the shortest path, makes communications clear, and keeps your customers in the loop. The primary goal of an incident process is fast resolution, so your incident process comes down to “how well you respond”.

The primary goal of a postmortem process is to prevent repeats by learning from incidents. You will have to extract learnings from several angles and find the best way to mitigate the issue. Read more in the Atlassian Incident Handbook.

2 comments

Jimmy Seddon Community Leader Feb 02, 2021

Great write up @Prince Nyeche!  Thank you for sharing!

Glad to share the knowledge @Jimmy Seddon 
