
Service incident rules management with multi-tenant infrastructure

Hi everyone, I need to monitor a multi-tenant infrastructure. We have several customers, and each customer may have one or more installations of our product. Some services need to be monitored per machine (for example IIS, which is shared across instances), while others need to be monitored per installation (such as particular Windows services).

On every physical machine there is an agent that monitors all the installations (instances) on it and sends an API call to OpsGenie when something goes down. To reduce alert fatigue we use aliases and service incident rules. However, this does not scale: for every new installation of our product we must create a service incident rule, manually or via the API, that hard-codes the value used to decide whether an incident should be created. In addition, I noticed that a service supports at most 100 incident rules.
Suppose one alert looks like this:

 

{
    "message":"Service 'x' of 'test' [Instance_1] stopped",
    "alias":"Agent_1-Instance_1-service-x-stopped",
    "description":"Service x stopped at 5/11/2021 8:04:28 AM, reported by Instance_id: Instance_1",
    "tags": ["Service", "x", "test", "Instance_1", "monitor", "shared"],
    "details":{
        "AgentId":"Agent_1",
        "InstanceId":"Instance_1",
        "InstanceCode":"test"
    },
    "priority":"P2"
}



At the moment, this kind of alert creates a new incident for "Instance_1", because service "x" is one of the services monitored per installation. Likewise, when an API call creates an alert for a machine-monitored service, a service incident rule creates a new incident associated with that particular machine. Example of a machine-monitored service alert (the AgentId is the same as the machine ID):


{
    "message":"Service 'y' of 'test' [Instance_1] stopped",
    "alias":"Agent_1-service-y-stopped",
    "description":"Service y stopped at 5/11/2021 8:04:28 AM, reported by Instance_id: Instance_1",
    "tags": ["Service", "y", "test", "Instance_1", "monitor", "shared"],
    "details":{
        "AgentId":"Agent_1"
    },
    "priority":"P2"
}
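To make the difference between the two alias formats explicit, here is a minimal Python sketch of how an agent might build them. The function name `build_alias` and its parameters are hypothetical and are not part of the agent or of OpsGenie; the only thing taken from the examples above is the alias layout.

```python
from typing import Optional


def build_alias(agent_id: str, service: str, state: str,
                instance_id: Optional[str] = None) -> str:
    """Build an alert alias in the layout used by the two examples above.

    Installation-monitored services include the instance ID, so each
    installation deduplicates separately. Machine-monitored services omit
    it, so all installations on one machine share a single alias.
    """
    parts = [agent_id]
    if instance_id is not None:
        parts.append(instance_id)
    parts += ["service", service, state]
    return "-".join(parts)


# Alias for service "x", monitored per installation.
instance_alias = build_alias("Agent_1", "x", "stopped", instance_id="Instance_1")

# Alias for service "y", monitored per machine.
machine_alias = build_alias("Agent_1", "y", "stopped")
```

Because the machine-level alias has no instance component, alerts for a shared service raised on behalf of different installations on the same machine deduplicate against each other.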


The service incident rules currently look like this:
if alert contains key InstanceId and InstanceId == Instance_1 then create incident with priority 2
else if alert does not contain key InstanceId and contains AgentId with value == Agent_2 then create incident with priority 2
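For readers who want to reason about these two rules outside OpsGenie, the logic can be sketched in Python as below. This only models the matching described above; it does not call OpsGenie, and the function name `match_incident_rule` and the returned labels are assumptions made for illustration.

```python
from typing import Optional, Set


def match_incident_rule(alert: dict,
                        instance_ids: Set[str],
                        agent_ids: Set[str]) -> Optional[str]:
    """Return a label for the rule that would fire, or None if none does.

    Mirrors the two rules above: an alert carrying InstanceId is matched on
    that value only; an alert without InstanceId is matched on AgentId.
    """
    details = alert.get("details", {})
    instance_id = details.get("InstanceId")
    agent_id = details.get("AgentId")

    if instance_id is not None:
        # Installation-monitored service: only the InstanceId rule applies.
        return f"instance:{instance_id}" if instance_id in instance_ids else None
    if agent_id is not None and agent_id in agent_ids:
        # Machine-monitored service: no InstanceId key, match on AgentId.
        return f"machine:{agent_id}"
    return None


alert_x = {"details": {"AgentId": "Agent_1", "InstanceId": "Instance_1",
                       "InstanceCode": "test"}, "priority": "P2"}
alert_y = {"details": {"AgentId": "Agent_1"}, "priority": "P2"}

result_x = match_incident_rule(alert_x, {"Instance_1"}, {"Agent_1"})
result_y = match_incident_rule(alert_y, {"Instance_1"}, {"Agent_1"})
```

The two sets stand in for the hard-coded values: every new installation or machine adds one entry, which in OpsGenie corresponds to one more incident rule.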

This method works fine, but we have about 70 agents and more than 100 instances.
Is there a way to avoid hard-coding the values of the AgentId and InstanceId keys in each rule?

P.S. The approach I'm using is probably not the correct one. As a workaround, should I consider creating multiple services, one for each installation, with the incident rules defined per service? With several services available I would no longer hit the limit of 100 incident rules, but I would still have the problem of a hard-coded value in each incident rule.

Answer accepted

Hi Alessandro!

 

Your approach here is the correct one, with the options available in our service/incident framework at the time - and I would recommend going with your idea of adding additional services, perhaps one service for each monitored instance, with each holding its own incident rules. Automating service/incident rule creation through the API is probably your best option here, too, to allow for easier and quicker scaling.
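As a rough illustration of what automating this could look like, the Python sketch below generates one rule description per instance and per agent, then splits them into batches that respect the 100-rule limit mentioned in the question. It deliberately makes no HTTP calls: the dictionary keys are illustrative only and are not the Opsgenie API schema, so the actual request format should be taken from the API documentation.

```python
from typing import Dict, Iterable, List

MAX_RULES_PER_SERVICE = 100  # limit reported in the question


def build_rule_specs(instance_ids: Iterable[str],
                     agent_ids: Iterable[str]) -> List[Dict[str, str]]:
    """Describe one incident rule per instance and per agent.

    The keys used here are hypothetical placeholders, not Opsgenie fields.
    """
    specs = []
    for instance_id in sorted(instance_ids):
        specs.append({"scope": "instance", "field": "InstanceId",
                      "equals": instance_id, "priority": "P2"})
    for agent_id in sorted(agent_ids):
        # Machine rules also require that InstanceId is absent from the alert.
        specs.append({"scope": "machine", "field": "AgentId",
                      "equals": agent_id, "absent": "InstanceId",
                      "priority": "P2"})
    return specs


def chunk_rules(specs: List[Dict[str, str]],
                limit: int = MAX_RULES_PER_SERVICE) -> List[List[Dict[str, str]]]:
    """Split rule specs into batches, each small enough for one service."""
    return [specs[i:i + limit] for i in range(0, len(specs), limit)]


# Example inventory; the IDs are made up for the sketch.
instances = [f"Instance_{i}" for i in range(1, 121)]
agents = [f"Agent_{i}" for i in range(1, 71)]

specs = build_rule_specs(instances, agents)
batches = chunk_rules(specs)
```

Each batch would then map to one service. With one service per monitored instance, as suggested above, the batches become trivially small and only the machine-level rules need grouping.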

Though, from what I gather about your situation, it sounds like what you're really looking for is something analogous to our alert deduplication functionality (where alerts with matching aliases are deduplicated against each other), but for incidents, instead - does that sound accurate? We do have an internal feature request for something like this - so I will add your information and a +1 to that request now. For reference, that request reference ID is OGS-3404.

 

Hopefully that helps to clarify things, but please let me know how else I can assist!

 

Best, 

 

Justin

DEPLOYMENT TYPE
CLOUD