Recently we have been testing out the deduplication feature for repeating alerts.
In principle, it’s a good idea - I don’t want many duplicate alerts telling me my virtual machines are dying when one is enough.
However, I am trying to work out the most ‘effective’ way to use it, especially in combination with Slack notifications.
Let’s give a scenario.
1. It’s 10PM. A virtual machine dies. An alert is created with alias ‘deadVM’, and the alert description mentions the VM ID and the time that it died. A notification is sent to Slack.
2. The person on out-of-hours duty acknowledges the alert for later analysis. It’s not quite a P1. Production is still online, and it looks like it might have been a one-off. They don’t close the alert, as further digging will happen tomorrow in office hours.
3. 9 virtual machines die at 2am. Because the alias matches, another alert is not created. The deduplication count rises. The description is…unchanged. The activity log shows a new deduplication event, but no information on which new VMs died. No Slack notification is sent.
4. A large-scale issue follows, because many VMs died and nothing was investigated overnight.
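To make the scenario concrete, here is a rough Python model of the behaviour I am describing. It is only a toy, not Opsgenie's actual implementation: the `AlertStore` class, the VM IDs, and the counter starting at 1 are all my own assumptions for illustration.

```python
from dataclasses import dataclass


@dataclass
class Alert:
    alias: str
    description: str
    count: int = 1          # assumption: the first occurrence counts as 1
    status: str = "open"


class AlertStore:
    """Toy model of alias-based deduplication (hypothetical, not Opsgenie code)."""

    def __init__(self):
        self.open_alerts = {}     # alias -> Alert
        self.notifications = []   # messages that would have gone to Slack

    def create(self, alias, description):
        existing = self.open_alerts.get(alias)
        if existing is not None:
            # Duplicate: only the counter moves. The new description is dropped
            # and nobody is notified.
            existing.count += 1
            return existing
        alert = Alert(alias, description)
        self.open_alerts[alias] = alert
        self.notifications.append(description)
        return alert


store = AlertStore()

# Step 1: 10PM, one VM dies and a notification goes out.
store.create("deadVM", "VM vm-01 died at 22:00")

# Step 2: acknowledged, but left open for the morning.
store.open_alerts["deadVM"].status = "acknowledged"

# Step 3: 2am, nine more VMs die under the same alias.
for i in range(2, 11):
    store.create("deadVM", f"VM vm-{i:02d} died at 02:00")

alert = store.open_alerts["deadVM"]
print(alert.count, alert.description, len(store.notifications))
```

In this model the counter ends up at 10, but the description still only mentions the first VM and only one notification was ever sent, which is exactly the gap in step 4.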
I’m thinking of turning deduplication off. Is there any way to make this more useful? For example, notify again if deduplication reaches X? The description of an alert is static. There are no further notifications by design but in a lot of cases this leads to lost data rather than added convenience.
Can we add notes automatically on each new deduplication, or edit the description with the new info? Even if we can add a note, I presume it will automatically create a message on Slack, which defeats the point of deduplication as we have now notified anyway 😀 . It seems to me that deduplication might only be effective when using the core Opsgenie notifications, and doesn’t make sense when notifying via Slack.
Would be interested to hear about other people’s setups.
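For what it's worth, this is roughly the behaviour I have in mind, sketched as a small Python gate that would sit in front of whatever raises the alerts. Everything here is hypothetical: `DedupGate`, `renotify_every`, and the `notify` / `add_note` callbacks are names I made up, and I have not confirmed whether Opsgenie or its Slack integration can be configured to keep notes quiet while still re-notifying at a threshold.

```python
from collections import defaultdict


class DedupGate:
    """Hypothetical sketch: record every duplicate, re-notify every X duplicates."""

    def __init__(self, renotify_every, notify, add_note):
        self.renotify_every = renotify_every  # the "X" from the question
        self.notify = notify                  # e.g. post to Slack
        self.add_note = add_note              # e.g. attach detail to the alert
        self.counts = defaultdict(int)        # alias -> occurrences seen

    def handle(self, alias, detail):
        self.counts[alias] += 1
        seen = self.counts[alias]
        if seen == 1:
            # First occurrence: behaves like a normal new alert.
            self.notify(f"[{alias}] {detail}")
            return
        # Duplicate: keep the detail so it is not lost overnight.
        self.add_note(alias, detail)
        duplicates = seen - 1
        if duplicates % self.renotify_every == 0:
            self.notify(f"[{alias}] {duplicates} duplicates so far, latest: {detail}")


sent, notes = [], []
gate = DedupGate(5, sent.append, lambda alias, detail: notes.append((alias, detail)))

gate.handle("deadVM", "VM vm-01 died at 22:00")
for i in range(2, 11):
    gate.handle("deadVM", f"VM vm-{i:02d} died at 02:00")

print(len(sent), len(notes))
```

With X set to 5, the 2am burst would produce one extra message instead of nine, and every dead VM would still be recorded against the alert.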