How do I use Zebrium + Opsgenie for Root Cause Analysis (RCA)?

May 19, 2021

This article was co-authored by Gavin Cohen of Zebrium. Zebrium has a bi-directional integration with Opsgenie and is a machine learning solution for RCA.

We all know the drill. 💤 You're fast asleep, then your phone blares and buzzes, violently waking you. Through blurry eyes you read, “AppDynamics Alert: latency threshold exceeded on page: shopping_cart_checkout”. You think, "Damn, that’s the service that was upgraded this afternoon."


Coffee in hand, Opsgenie guides you to your APM tool. You click your way through some dashboards. All was fine until 11 minutes ago. Now the shopping cart service is down. No revenue is coming in. You restart the shopping cart service and the pending transactions start going through. And then your phone buzzes again and the dashboards turn red. Time to revert to yesterday’s version of the service. But that doesn’t work either.

You reluctantly open up your logging tool and start searching. There are thousands and thousands of errors and warnings across the different logs, but most of them seem normal and harmless. The logs are noisy and difficult to parse, and sleep gets farther and farther away.

Automated root cause analysis right inside Opsgenie

But what if it could be different? Imagine that instead of opening different logging tools and parsing through data, you opened Opsgenie and saw the incident below:

Screen Shot 2021-05-19 at 4.55.11 PM.png

Then you clicked into the timeline and found this:
Screen Shot 2021-05-19 at 4.55.19 PM.png

Bingo! The root cause. One of the developers must have accidentally left oomtest running on your production node, and it's using all the available memory. 🤦🏽

How the root cause “magically” appears in Opsgenie

Zebrium uses unsupervised machine learning to automatically find incident root cause by looking for hotspots of correlated anomalous patterns in logs and metrics (read more here). Zebrium can perform its own detection, but it can also take a “signal” from an external application and use that as a trigger to generate a root cause report.
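
To make the idea of a "hotspot of correlated anomalies" concrete, here is a deliberately simple toy sketch. It is not Zebrium's algorithm, which the article does not describe in detail. It only assumes that anomalies arrive as (timestamp, source) pairs, and it flags time windows in which several different sources misbehave at once. The function name, the window size, and the sample events are all hypothetical.

```python
from collections import defaultdict


def find_hotspots(events, window_seconds=60, min_sources=2):
    """Return start times of windows where anomalies from several sources cluster.

    events: iterable of (unix_timestamp, source_name) pairs, each one anomaly.
    """
    sources_by_window = defaultdict(set)
    for timestamp, source in events:
        # Bucket each anomaly into a fixed-size time window.
        window_start = timestamp - (timestamp % window_seconds)
        sources_by_window[window_start].add(source)
    # A window is a hotspot when enough distinct sources report anomalies in it.
    return sorted(
        start
        for start, sources in sources_by_window.items()
        if len(sources) >= min_sources
    )


# Hypothetical anomalies: three sources within one minute, then a lone event.
events = [
    (600, "kernel.log"),
    (615, "cart-service.log"),
    (630, "memory.metric"),
    (4000, "cart-service.log"),
]
hotspots = find_hotspots(events)
print(hotspots)
```

The lone event at timestamp 4000 is ignored because only one source reported it, which is the intuition behind treating isolated errors as noise and clustered ones as a likely incident.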

With the Zebrium + Opsgenie integration, the signal to provide a root cause report comes from Opsgenie. The process looks like this:

An APM (or other) tool detects a problem (in this case, a threshold has been exceeded) and opens an Opsgenie incident. The pager goes off and the on-call SRE is woken up.
Opsgenie automatically sends a signal to Zebrium requesting a root cause report.
Zebrium responds with a summary of the root cause report and adds it to the already opened incident.
The on-call SRE now understands what happened, fixes the issue, and goes back to sleep!
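
The steps above can be sketched as a small in-memory simulation. This is only an illustration of the flow, not the real integration: the Incident class, the request_root_cause function, and the canned analyzer are hypothetical stand-ins, and no Opsgenie or Zebrium API is called.

```python
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass
class Incident:
    """Hypothetical stand-in for an Opsgenie incident and its timeline."""
    message: str
    source: str
    timeline: List[str] = field(default_factory=list)


def request_root_cause(incident: Incident, analyze: Callable[[str], str]) -> Incident:
    """Signal the RCA service, then attach its summary to the open incident."""
    # Step 2: the incident tool signals the RCA service.
    incident.timeline.append("Signal sent: root cause report requested")
    # Step 3: the RCA service responds and the summary lands on the timeline.
    summary = analyze(incident.message)
    incident.timeline.append(f"Root cause summary: {summary}")
    return incident


def fake_analyzer(message: str) -> str:
    # Placeholder for the ML analysis; returns a canned summary for the demo.
    return "oomtest process is consuming all available memory"


# Step 1: a monitoring tool opens the incident.
incident = Incident(
    message="latency threshold exceeded on page: shopping_cart_checkout",
    source="AppDynamics",
)
request_root_cause(incident, fake_analyzer)
for entry in incident.timeline:
    print(entry)
```

The point of the sketch is the ordering: the incident already exists before the signal is sent, so the summary is appended to an open incident rather than creating a new one.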

Setting up the integration
In Opsgenie, go to Settings/API Key management, add a new API key, and enable it for read, create and update access. Then, in the Zebrium UI, set up an Inbound Alert for Opsgenie and enter the API key you created along with your Opsgenie region. This will create an inbound webhook URL that Opsgenie can use.
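
Before pasting the key into Zebrium, it can help to see how the key and the region fit together. The sketch below builds (but does not send) a read-only request against the public Opsgenie REST API, assuming the standard US and EU API hosts and the GenieKey authorization scheme. The function name and the placeholder key are hypothetical, and this is not part of the Zebrium setup itself.

```python
import urllib.request

# Public Opsgenie REST API hosts, selected by account region.
OPSGENIE_HOSTS = {
    "US": "https://api.opsgenie.com",
    "EU": "https://api.eu.opsgenie.com",
}


def build_key_check_request(api_key: str, region: str = "US") -> urllib.request.Request:
    """Build a read-only 'list alerts' request that exercises the API key."""
    base_url = OPSGENIE_HOSTS[region.upper()]
    return urllib.request.Request(
        f"{base_url}/v2/alerts?limit=1",
        headers={"Authorization": f"GenieKey {api_key}"},
        method="GET",
    )


# "YOUR-API-KEY" is a placeholder; substitute the key created in Opsgenie.
request = build_key_check_request("YOUR-API-KEY", region="eu")
print(request.get_method(), request.full_url)
```

Passing the request to urllib.request.urlopen would perform the actual call. A successful response suggests the key and region match, while an HTTP error usually points to a wrong key, missing access rights, or the wrong region.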


Next, go to the Opsgenie Integration page, click the Zebrium integration, and follow the on-screen instructions. You will need the webhook URL that was just created.

Once the configuration is complete, new incidents will trigger a signal to Zebrium, and a root cause summary will show up in the incident. No more hunting for root cause! (Full instructions are here.)

Use Zebrium + Opsgenie to detect new/unknown issues
Zebrium constantly scans and automatically creates RCA reports for abnormally correlated anomalies across logs and metrics. By setting up an outgoing alert from Zebrium to Opsgenie, you can trigger the creation of Opsgenie incidents.

Since most environments already have monitoring tools that create incidents for major outages, we recommend configuring Zebrium ML-detected incidents as priority P3. Zebrium is not rules-based, so this can be a very effective way of catching new, rare or unknown issues early. Engineers can use them to improve product quality and proactively fix latent bugs, before they manifest as production P1 incidents. We use this feature extensively at Zebrium and have so far avoided several major problems in this way. Learn more here.

Screen Shot 2021-05-19 at 4.55.46 PM.png

The picture above shows the Opsgenie configuration for two-way integration with Zebrium:

Zebrium to Opsgenie: when Zebrium creates a proactive incident, it will create an incident in Opsgenie (we recommend creating them as P3 incidents).
Opsgenie to Zebrium: when a new incident is created by any tool in Opsgenie, it will be automatically augmented with a root cause report created by Zebrium’s ML.
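
To illustrate what a P3 item created from an ML detection might look like on the Opsgenie side, here is a minimal sketch that builds (but does not send) a create request for the public Opsgenie Alert API. The article does not say which endpoint or fields the Zebrium integration actually uses, so the function name, the tags, the description text, and the example report URL are all assumptions.

```python
import json
import urllib.request


def build_p3_alert_request(api_key, title, report_url,
                           base_url="https://api.opsgenie.com"):
    """Build a POST request that would create a P3 alert in Opsgenie."""
    payload = {
        # Opsgenie limits the alert message to 130 characters.
        "message": title[:130],
        # Hypothetical description format linking back to the RCA report.
        "description": f"Zebrium root cause report: {report_url}",
        # P3 keeps ML-detected issues below outage-level priorities.
        "priority": "P3",
        # Hypothetical tags to make these alerts easy to filter.
        "tags": ["zebrium", "ml-detected"],
    }
    return urllib.request.Request(
        f"{base_url}/v2/alerts",
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Authorization": f"GenieKey {api_key}",
            "Content-Type": "application/json",
        },
        method="POST",
    )


# Placeholder key and example URL; neither refers to a real account or report.
alert_request = build_p3_alert_request(
    "YOUR-API-KEY",
    "Anomalous memory pressure on production node",
    "https://example.com/report/123",
)
print(json.loads(alert_request.data)["priority"])
```

Setting the priority explicitly, rather than relying on a default, keeps proactive detections clearly separated from the P1 incidents raised by existing monitoring tools.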

Summary
Most environments already have mechanisms to detect major incidents. Using Opsgenie, the incident response process is made as painless as possible. By leveraging the integration between Zebrium and Opsgenie, your team can nail the Incident Response Orchestration and RCA at the same time. SREs and developers can save countless hours hunting through dashboards and logs, and instead view the RCA right within the context of the Incident Timeline. This means resolving the incident faster, and getting back to bed sooner when you're on-call. It's a win-win.

For more information or to sign up for a free trial of Zebrium, please visit https://www.zebrium.com
