Service incident rules management with multi-tenant infrastructure

May 11, 2021

Hi everyone, I need to monitor a multi-tenant infrastructure. We have several customers, and each customer may have one or more installations of our product. Some services need to be monitored per machine (for example IIS, which is shared across instances), while others need to be monitored per installation (for example certain Windows services).

On every physical machine there is an agent that monitors all the installations (instances) on that machine and sends an API call to OpsGenie when something goes down. To reduce alert fatigue, we use aliases and service incident rules. However, this does not scale: for every new installation of our product we must create a service incident rule, manually or via the API, that hard-codes the value used to decide whether an incident should be created. In addition, I noticed that a service supports a maximum of 100 incident rules.
Suppose one alert is like this:

{
    "message":"Service 'x' of 'test' [Instance_1] stopped",
    "alias":"Agent_1-Instance_1-service-x-stopped",
    "description":"Service x stopped at 5/11/2021 8:04:28 AM, reported by Instance_id: Instance_1",
    "tags": ["Service", "x", "test", "Instance_1", "monitor", "shared"],
    "details":{
        "AgentId":"Agent_1",
        "InstanceId":"Instance_1",
        "InstanceCode":"test"
    },
    "priority":"P2"
}

At the moment, this kind of alert creates a new incident for "Instance_1" because service "x" is one of the services monitored per installation. Likewise, when an API call creates an alert for a machine-monitored service, a service incident rule creates a new incident associated with that particular machine. Example of a machine-monitored service alert (the AgentId is the same as the machine ID):


{
    "message":"Service 'y' of 'test' [Instance_1] stopped",
    "alias":"Agent_1-service-y-stopped",
    "description":"Service y stopped at 5/11/2021 8:04:28 AM, reported by Instance_id: Instance_1",
    "tags": ["Service", "y", "test", "Instance_1", "monitor", "shared"],
    "details":{
        "AgentId":"Agent_1"
    },
    "priority":"P2"
}
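For reference, the two alias shapes above can be built along these lines. This is only a sketch in Python: `build_alias` and its parameters are hypothetical names for illustration, not part of our agent or of the OpsGenie API.

```python
from typing import Optional


def build_alias(agent_id: str, service: str, instance_id: Optional[str] = None) -> str:
    """Build the alert alias (hypothetical helper, not an OpsGenie call).

    Per-installation services include the instance ID in the alias;
    machine-monitored services use the agent ID only.
    """
    parts = [agent_id]
    if instance_id is not None:
        parts.append(instance_id)
    parts += ["service", service, "stopped"]
    return "-".join(parts)
```

With this shape, alerts for a shared service such as "y" get the same alias no matter which instance on the machine reported them.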

The service incident rules are currently of this type:
if alert contains key InstanceId and InstanceId == Instance_1 then create incident with priority 2
else if alert does not contain key InstanceId and contains AgentId with value == Agent_2 then create incident with priority 2
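
To make the scaling problem concrete, here is a rough Python model of what those rules do when taken together. It is not how OpsGenie evaluates rules internally: `should_create_incident`, `KNOWN_INSTANCES` and `KNOWN_AGENTS` are made-up names, and the two sets stand in for the values hard-coded across the individual rules.

```python
# Values currently hard-coded, one per service incident rule (illustrative only).
KNOWN_INSTANCES = {"Instance_1"}
KNOWN_AGENTS = {"Agent_2"}


def should_create_incident(details: dict) -> bool:
    """Mirror the if / else-if rule pair over every hard-coded value."""
    if "InstanceId" in details:
        # Per-installation alert: match on the instance ID only.
        return details["InstanceId"] in KNOWN_INSTANCES
    # Machine-monitored alert: no InstanceId key, so match on the agent ID.
    return details.get("AgentId") in KNOWN_AGENTS
```

Every new instance or agent means adding another entry to one of these sets, which in OpsGenie terms is another rule on the service.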

This method works fine, but we have about 70 agents and more than 100 instances.
Is there a way to avoid manually checking the value of the AgentId and InstanceId keys in each rule?

P.S. The approach I'm using is probably not the right one. As a workaround, should I create multiple services, one per installation, each with its own incident rules? That way I would no longer hit the limit of 100 incident rules, since the rules would be spread across several services, but each incident rule would still contain a hard-coded value.
