This article was co-authored by Gavin Cohen of Zebrium. Zebrium has a bi-directional integration with Opsgenie and is a machine learning solution for RCA.
We all know the drill. 💤 You're fast asleep, then your phone blares and buzzes, violently waking you. Through blurry eyes you read, “AppDynamics Alert: latency threshold exceeded on page: shopping_cart_checkout”. You think, "Damn, that’s the service that was upgraded this afternoon."
Coffee in hand, you follow Opsgenie to your APM tool. You click your way through some dashboards. All was fine until 11 minutes ago. Now the shopping cart service is down. No revenue is coming in. You restart the shopping cart service and the pending transactions start going through. And then your phone buzzes again and the dashboards turn red. Time to revert to yesterday’s version of the service. But that doesn’t work either.
You reluctantly open up your logging tool and start searching. There are thousands and thousands of errors and warnings across the different logs. But they seem mostly normal and harmless. The logs are noisy and difficult to parse, and sleep gets farther and farther away.
Automated root cause analysis right inside Opsgenie
But what if it could be different? Imagine that instead of opening different logging tools and parsing through data, you opened Opsgenie and saw the incident below:
Then you clicked into the timeline and found this:
Bingo! The root cause. One of the developers must have accidentally left oomtest running on your production node, and it's using all the available memory. 🤦🏽
How the root cause “magically” appears in Opsgenie
Zebrium uses unsupervised machine learning to automatically find the root cause of an incident by looking for hotspots of correlated anomalous patterns in logs and metrics (read more here). Zebrium can perform its own detection, but it can also take a “signal” from an external application and use that as a trigger to generate a root cause report.
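To make "hotspots of correlated anomalous patterns" a little more concrete, here is a toy sketch in Python. It is not Zebrium's algorithm, which this article does not describe in detail. It simply treats event types that occur only once as anomalous and flags any time window in which several of them cluster together. The function name, the thresholds and the sample log messages are all made up for illustration.

```python
from collections import Counter, defaultdict


def find_hotspots(events, window_seconds=60, rare_max_count=1, min_rare_types=2):
    """Toy illustration only; NOT Zebrium's actual algorithm.

    events: list of (timestamp_seconds, event_type) tuples.
    Returns {window_start_seconds: [rare event types seen in that window]}.
    """
    # Count how often each event type occurs; rare types are treated as anomalous.
    type_counts = Counter(event_type for _, event_type in events)
    rare_types = {t for t, count in type_counts.items() if count <= rare_max_count}

    # Group the rare event types by the time window they fall into.
    windows = defaultdict(set)
    for timestamp, event_type in events:
        if event_type in rare_types:
            windows[int(timestamp // window_seconds)].add(event_type)

    # A "hotspot" is a window where several different rare event types cluster.
    return {
        window * window_seconds: sorted(types)
        for window, types in sorted(windows.items())
        if len(types) >= min_rare_types
    }


# Hypothetical log events: (seconds since start, simplified event type).
events = [
    (5, "heartbeat ok"),
    (65, "heartbeat ok"),
    (125, "heartbeat ok"),
    (130, "oomtest started"),
    (150, "out of memory: kill process"),
    (170, "checkout latency high"),
    (185, "heartbeat ok"),
]

hotspots = find_hotspots(events)
print(hotspots)
```

In this made-up data, the routine heartbeat messages are ignored because they repeat, while the three one-off events that land in the same minute are grouped into a single hotspot. A real system has to learn event types from raw log text and cope with far noisier data, which is where the machine learning comes in.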
With the Zebrium + Opsgenie integration, the signal to provide a root cause report comes from Opsgenie. The process looks like this.
Setting up the integration
In Opsgenie, from Settings/API Key management, add a new API key and enable it for read, create and update access. Now in the Zebrium UI, set up an Inbound Alert for Opsgenie, and enter the API key you created and the Opsgenie region. This will create an inbound webhook URL that Opsgenie can use.
Next, go to the Opsgenie Integration page, click the Zebrium integration, and follow the on-screen instructions. You will need the webhook URL that was just created.
Once the configuration is complete, new incidents will trigger a signal to Zebrium, and a root cause summary will show up in the incident. No more hunting for root cause! (Full instructions are here.)
Use Zebrium + Opsgenie to detect new/unknown issues
Zebrium constantly scans and automatically creates RCA reports for abnormally correlated anomalies across logs and metrics. By setting up an outgoing alert from Zebrium to Opsgenie, you can trigger the creation of Opsgenie incidents.
Since most environments already have monitoring tools that create incidents for major outages, we recommend configuring Zebrium ML-detected incidents as priority P3. Zebrium is not rules-based, so this can be a very effective way of catching new, rare or unknown issues early. Engineers can use these incidents to improve product quality and proactively fix latent bugs before they manifest as production P1 incidents. We use this feature extensively at Zebrium and have so far avoided several major problems in this way. Learn more here.
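The integration sets the priority for you once it is configured, but it can help to see what a P3 item looks like on the wire. The sketch below builds (without sending) a request to the Opsgenie Alert API with priority P3. Note that it creates an alert rather than an incident, and it is not what Zebrium sends internally. The function name, the message text, the tags and the "YOUR_API_KEY" placeholder are assumptions for illustration.

```python
import json
import urllib.request

# Opsgenie API hosts by region; EU accounts use a separate host.
OPSGENIE_BASE_URLS = {
    "US": "https://api.opsgenie.com",
    "EU": "https://api.eu.opsgenie.com",
}


def build_alert_request(api_key, message, description, region="US", priority="P3"):
    """Build (but do not send) a create-alert request for the Opsgenie Alert API."""
    payload = {
        "message": message,
        "description": description,
        "priority": priority,  # P3, as recommended for ML-detected issues
        "source": "Zebrium",   # illustrative value
        "tags": ["zebrium", "ml-detected"],  # illustrative tags
    }
    return urllib.request.Request(
        url=OPSGENIE_BASE_URLS[region] + "/v2/alerts",
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Content-Type": "application/json",
            "Authorization": "GenieKey " + api_key,
        },
        method="POST",
    )


# Hypothetical example values; replace the key with one from API Key management.
request = build_alert_request(
    api_key="YOUR_API_KEY",
    message="Zebrium ML-detected anomaly in shopping cart service",
    description="Correlated rare log events detected; see the RCA report.",
)

# To actually send it: urllib.request.urlopen(request)
print(request.get_method(), request.full_url)
```

Keeping these at P3 means they land in the queue without paging anyone at 3 a.m., while P1 and P2 stay reserved for the outages your existing monitoring already catches.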
The picture above shows the Opsgenie configuration for two-way integration with Zebrium.
Most environments already have mechanisms to detect major incidents. Using Opsgenie, the incident response process is made as painless as possible. By leveraging the integration between Zebrium and Opsgenie, your team can nail the Incident Response Orchestration and RCA at the same time. SREs and developers can save countless hours hunting through dashboards and logs, and instead view the RCA right within the context of the Incident Timeline. This means resolving the incident faster, and getting back to bed sooner when you're on-call. It's a win-win.
For more information or to sign up for a free trial of Zebrium, please visit https://www.zebrium.com
Kate Clavet, Atlassian Team