My hope is that someone from Atlassian who supports or develops the automation engine for Jira Cloud can respond to this question...
What is the expected behavior for rule triggering after an Automation for Jira outage? Specifically, will the events (scheduled or immediate) that happened during the outage window:
I am trying to plan for customer recovery practices after an outage that impacts automation rules.
Thanks, and best regards,
Automation engineer here. Great question. The answer, and this is probably obvious, is your last bullet point. "depends case by case on the nature of the incident leading to the outage".
That said, we can be a bit more informative than that by explaining a bit of the architecture.
I hasten to point out that none of what follows is a promise. Our architecture will change in ways that I can't foresee. We expend great effort to ensure redundancy and good alerting, so any incident which occurs is, by the nature of it being an "incident", a surprise to us.
As events occur in Jira, Automation consumes the resulting webhooks. Mostly. We also consume events from an internal piece of infrastructure which gives access to Atlassian-only events. This is how Automation knows when to trigger a rule. When these events occur, we toss them onto a queue. A worker comes along and consumes the enqueued event. This worker looks up rules which may need to be executed based on the content of the event. If rules are found, these are placed on yet another queue, which has yet more workers that come along and do the actual work of the automation. The most common "work" for an automation is to make a series of API calls back to Jira to perform some data mutation on one or more issues.
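To make the two-queue flow above a little more concrete, here is a toy sketch in Python. It is not Automation's actual code: the `Event` and `Rule` classes, the `match_rules` helper, and the in-memory `deque` queues are all made up for illustration, and the "work" is just a string instead of real API calls back to Jira.

```python
from collections import deque
from dataclasses import dataclass


@dataclass
class Event:
    kind: str        # hypothetical event type, e.g. "issue_created"
    issue_key: str   # hypothetical issue identifier


@dataclass
class Rule:
    name: str
    trigger: str     # the event kind this rule reacts to


def match_rules(event, rules):
    """Look up the rules that may need to run for this event."""
    return [rule for rule in rules if rule.trigger == event.kind]


def run_pipeline(events, rules):
    """Push events through two queues and return the executions performed."""
    event_queue = deque(events)   # first queue: incoming events
    execution_queue = deque()     # second queue: (rule, event) pairs to run

    # First set of workers: consume events and enqueue matching rules.
    while event_queue:
        event = event_queue.popleft()
        for rule in match_rules(event, rules):
            execution_queue.append((rule, event))

    # Second set of workers: do the actual work of each rule.
    executed = []
    while execution_queue:
        rule, event = execution_queue.popleft()
        executed.append(f"{rule.name} on {event.issue_key}")
    return executed


events = [Event("issue_created", "ABC-1"), Event("issue_updated", "ABC-2")]
rules = [Rule("notify", "issue_created")]
print(run_pipeline(events, rules))
```

The point of the sketch is only that there are several hand-off points (event intake, rule lookup, rule execution), and a stall at each one leaves work sitting in a different place.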
This is, of course, a vague sketch of the architecture, but as you may be able to see from that description, failure at different points results in very different outcomes:
I hope this helps to answer your question. Our goal is to create a system which is immune to any single point of failure, which fails gracefully when it fails (doesn't propagate failures across tenants), which scales both as Atlassian grows and as individual tenants increase their usage of automation, and which has strong alerting so that we know about incidents before customers do.
Thank you very much for your well-explained response, and reality-based ideas of possible fault points. I truly appreciate the excellent work your team does for Atlassian and the users in the community.
In addition to what you describe as catch-up/recovery processing following an automation-impacting incident, I expect that rule writers/maintainers may also see additional race conditions from "compressing" rule execution into narrower time frames than when the triggering events/conditions occurred in real time. Something for us to watch for in clean-up activity, or to deal with in rule definition.
One follow-up question: for scheduled trigger rules which do not fire during an impacting outage, are they skipped or batched for execution later? For example, if a scheduled rule triggers every hour, and a 5 hour outage occurs, should we expect no rule runs or multiple back-to-back runs after the outage clears? I completely understand the system isn't a "time machine", such that scheduled rules use logs to accurately recreate the conditions at moments in time over the duration of an outage. Just want to know what to expect.
for scheduled trigger rules which do not fire during an impacting outage, are they skipped or batched for execution later? For example, if a scheduled rule triggers every hour, and a 5 hour outage occurs, should we expect no rule runs or multiple back-to-back runs after the outage clears?
Off the top, a 5 hour outage would be extremely out of character for our team. Interruptions are typically measured in minutes. An hour of not processing rules entails multiple post-mortem meetings and usually slates a fair amount of work to ensure whatever happened doesn't occur again.
As to an actual answer, it depends again on what is causing the incident. Scheduled triggers are handled by Automation's infrastructure, which is separate from Jiraª, so scheduled triggers are resilient in ways that webhook-dependent triggers are not. That said, the first action of most scheduled triggers is to perform a JQL query, so if Jira is down, you're probably out of luck.
If something in our infra is down, it seems likely that scheduled rules could be enqueued but unactioned. This would cause the behaviour that you postulated — multiple scheduled items happening closer in time than was intended.
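As a rough illustration of that "enqueued but unactioned" case, here is a toy simulation in Python. It is only a sketch under simple assumptions, not how Automation's scheduler really works: the `simulate` function is hypothetical, the scheduler is assumed to keep enqueuing one run per hour throughout, and the worker is assumed to drain the whole backlog as soon as it comes back.

```python
from collections import deque


def simulate(total_hours, outage_start, outage_end):
    """Return (scheduled_hour, executed_hour) pairs for an hourly rule.

    Assumes the scheduler keeps enqueuing during the outage, while the
    worker is down for hours in [outage_start, outage_end).
    """
    queue = deque()
    runs = []
    for hour in range(total_hours):
        queue.append(hour)  # the scheduler still enqueues this hour's run
        worker_up = not (outage_start <= hour < outage_end)
        if worker_up:
            # The worker drains everything that piled up, back to back.
            while queue:
                runs.append((queue.popleft(), hour))
    return runs


# Hourly rule over 8 hours, with the worker down for hours 2 through 6.
for scheduled, executed in simulate(8, 2, 7):
    print(f"run scheduled for hour {scheduled} executed at hour {executed}")
```

Under these assumptions, the five runs missed during the outage all execute in the same hour as the first run after recovery, which is the bunching described above.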
ªThis is… kind of misleading? In the same way that Automation is a part of Jira yet runs on its own infrastructure, there aren't many parts of "Jira" which could reasonably be considered a single thing once you get just below the UI.
Oh man, great question.
I remember an outage in November which, curiously, is not listed here: https://status.automationforjira.com/uptime?page=2
At the time, I remember having to manually move a bunch of tickets that were blocked, and being concerned that rules that sent notifications would be "queued" and all fire later (which I believe is what happened).
I see that on the Dec 14 incident, they wrote:
Rules executions will still be delayed during this window.
But yes, it would be good to have some official guidance around what to expect, and hey, maybe some SLAs?
Speaking of SLAs, back in August of last year, I ran into an issue where our Automation rules couldn't send emails because the SES quota was exceeded. :-o
Hopefully they've since put some monitoring on that. I wonder if people started using Automation emails more once it got bundled with Cloud, and Atlassian just didn't anticipate that they'd go over quota.