
Service incident rules management with multi-tenant infrastructure

Alessandro Pistola May 11, 2021

Hi everyone, I am currently dealing with a situation where I need to monitor a multi-tenant infrastructure. We have several customers, and each customer may have one or more installations of our product. As a result, some services need to be monitored per machine (for example IIS, which is shared across different instances) and some need to be monitored per installation (such as certain Windows services).

On every physical machine there is an agent that monitors all the installations (instances) on that machine and sends an API call to OpsGenie when something goes down. To reduce alert fatigue, we use aliases and service incident rules. However, that is not a scalable way to proceed, because for every new installation of our product we must manually (or via the API) create a service incident rule that hard-codes the value used to decide whether an incident should be created. In addition, I noticed that the maximum number of incident rules supported for a service is limited to 100.
Suppose one alert is like this:

 

{
    "message":"Service 'x' of 'test' [Instance_1] stopped",
    "alias":"Agent_1-Instance_1-service-x-stopped",
    "description":"Service x stopped at 5/11/2021 8:04:28 AM, reported by Instance_id: Instance_1",
    "tags": ["Service", "x", "test", "Instance_1", "monitor", "shared"],
    "details":{
        "AgentId":"Agent_1",
        "InstanceId":"Instance_1",
        "InstanceCode":"test"
    },
    "priority":"P2"
}



At the moment, this kind of alert will create a new incident for "Instance_1" because service "x" is one of the services monitored per installation. Likewise, when an API call creates an alert for a machine-monitored service, there is a service incident rule that creates a new incident associated with that particular machine. Example of a machine-monitored service alert (the AgentId is the same as the machine ID):


{
    "message":"Service 'y' of 'test' [Instance_1] stopped",
    "alias":"Agent_1-service-y-stopped",
    "description":"Service y stopped at 5/11/2021 8:04:28 AM, reported by Instance_id: Instance_1",
    "tags": ["Service", "y", "test", "Instance_1", "monitor", "shared"],
    "details":{
        "AgentId":"Agent_1"
    },
    "priority":"P2"
}
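To make the difference between the two alias shapes concrete, here is a minimal sketch of how an agent might build them. The function name `build_alias` and its parameters are hypothetical and not part of OpsGenie; only the resulting alias strings come from the two examples above.

```python
from typing import Optional


def build_alias(agent_id: str, service: str, event: str,
                instance_id: Optional[str] = None) -> str:
    """Build an alert alias in the style of the two sample alerts.

    Per-installation services include the instance ID, so each instance
    gets its own alias. Machine-level services omit it, so every instance
    on the same machine reports under one shared alias.
    """
    parts = [agent_id]
    if instance_id is not None:
        parts.append(instance_id)
    parts.extend(["service", service, event])
    return "-".join(parts)


if __name__ == "__main__":
    print(build_alias("Agent_1", "x", "stopped", instance_id="Instance_1"))
    print(build_alias("Agent_1", "y", "stopped"))
```

Because the machine-level alias carries no instance ID, alerts for service "y" reported by different instances on Agent_1 share one alias.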


The service incident rules are currently of this type:
if alert contains key InstanceId and InstanceId == Instance_1 then create incident with priority 2
else if alert does not contain key InstanceId and contains AgentId with value == Agent_2 then create incident with priority 2
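Written out as code, the decision those rules encode looks roughly like the sketch below. This is only a local model of the logic, not something OpsGenie runs: the function name and the two sets of known IDs are hypothetical, and the sets stand in for the values that are currently hard-coded into one rule each.

```python
from typing import Optional


def incident_priority(alert: dict, instance_ids: set, agent_ids: set) -> Optional[str]:
    """Return the incident priority for an alert, or None if no rule matches.

    Mirrors the two rule shapes described above:
    1. the alert carries InstanceId and it is a known instance;
    2. the alert has no InstanceId and its AgentId is a known agent.
    """
    details = alert.get("details", {})
    instance_id = details.get("InstanceId")
    if instance_id is not None:
        return "P2" if instance_id in instance_ids else None
    if details.get("AgentId") in agent_ids:
        return "P2"
    return None


if __name__ == "__main__":
    per_instance = {"details": {"AgentId": "Agent_1", "InstanceId": "Instance_1"}}
    per_machine = {"details": {"AgentId": "Agent_2"}}
    print(incident_priority(per_instance, {"Instance_1"}, {"Agent_2"}))
    print(incident_priority(per_machine, {"Instance_1"}, {"Agent_2"}))
```

In the real setup each element of those sets corresponds to a separate incident rule, which is why the rule count grows with every new agent or instance.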

This method works fine, but we have about 70 agents and more than 100 instances.
Is there a way to avoid manually checking the values of the AgentId and InstanceId keys?

P.S. The approach I'm using is probably not the right one. As a workaround, should I consider creating multiple services, one per installation, each with its own incident rules? That way, with several services available, I would no longer hit the limit of 100 incident rules, but I would still have the problem of the hard-coded value in each incident rule.

1 answer

Answer accepted
Justin Sitarz
Atlassian Team
June 10, 2021

Hi Alessandro!

 

Your approach here is the correct one, given the options available in our service/incident framework at this time, and I would recommend going with your idea of adding additional services, perhaps one service for each monitored instance, each holding its own incident rules. Automating service/incident rule creation through the API is probably your best option here, too, to allow for easier and quicker scaling.
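As a rough illustration of what that automation could look like, the sketch below generates one service per instance and one rule payload per service from a list of instance IDs. It makes no API calls. The service naming scheme and every payload field name (`conditionMatchType`, `conditions`, `incidentProperties`, and so on) are assumptions made for illustration and should be checked against the incident rule API reference before use.

```python
MAX_RULES_PER_SERVICE = 100  # limit mentioned in the question


def build_rule(key: str, value: str, priority: str = "P2") -> dict:
    """Build one incident rule payload (field names are assumed, not verified)."""
    return {
        "conditionMatchType": "match-all-conditions",
        "conditions": [
            {
                "field": "extra-properties",
                "key": key,
                "operation": "equals",
                "expectedValue": value,
            }
        ],
        "incidentProperties": {"priority": priority},
    }


def plan_services(instance_ids: list) -> dict:
    """Map a hypothetical service name to the rules it should hold.

    One service per instance keeps every service far below the rule limit,
    and the hard-coded value is generated instead of typed by hand.
    """
    plan = {}
    for instance_id in instance_ids:
        rules = [build_rule("InstanceId", instance_id)]
        assert len(rules) <= MAX_RULES_PER_SERVICE
        plan[f"service-{instance_id}"] = rules
    return plan


if __name__ == "__main__":
    for name, rules in plan_services(["Instance_1", "Instance_2"]).items():
        print(name, rules)
```

A provisioning script could run this whenever a new installation is deployed and send each payload to the API, so the per-instance value still exists in the rule but no longer has to be maintained manually.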

Though, from what I gather about your situation, it sounds like what you're really looking for is something analogous to our alert deduplication functionality (where alerts with matching aliases are deduplicated against each other), but for incidents, instead - does that sound accurate? We do have an internal feature request for something like this - so I will add your information and a +1 to that request now. For reference, that request reference ID is OGS-3404.

 

Hopefully that helps to clarify things, but please let me know how else I can assist!

 

Best, 

 

Justin
