
what is the expected behavior after an Automation for Jira outage

Bill Sheboy
Rising Star
Rising Stars are recognized for providing high-quality answers to other users. Rising Stars receive a certificate of achievement and are on the path to becoming Community Leaders.
February 5, 2021

Greetings community!

My hope is that someone from Atlassian who supports or develops the automation engine for Jira Cloud can respond to this question...

What is the expected behavior for rule triggering after an Automation for Jira outage?  Specifically, will the events (scheduled or immediate) that happened during the outage window:

  • eventually fire the rules,
  • be lost to the "bit bucket" and so not fire,
  • or depends case by case on the nature of the incident leading to the outage?

I am trying to plan for customer recovery practices after an outage that impacts automation rules.

Thanks, and best regards,

Bill

2 answers

1 accepted

Answer accepted
wwalser
Atlassian Team
Atlassian Team members are employees working across the company in a wide variety of roles.
February 9, 2021

Hi Bill,

Automation engineer here. Great question. The answer, and this is probably obvious, is your last bullet point: it "depends case by case on the nature of the incident leading to the outage".

That said, we can be a bit more informative than that by explaining a bit of the architecture.

I hasten to point out that none of what follows is a promise. Our architecture will change in ways that I can't foresee. We expend great effort to ensure redundancy and good alerting, so any incident which occurs is, by the nature of it being an "incident", a surprise to us.

As events occur in Jira, Automation consumes the resulting webhooks. Mostly. We also consume events from an internal piece of infrastructure which gives access to Atlassian-only events. This is how Automation knows when to trigger a rule. When these events occur, we toss them onto a queue. A worker comes along and consumes the enqueued event. This worker looks up rules which may need to be executed based on the content of the event. If rules are found, these are placed on yet another queue which has yet more workers which come along and do the actual work of the automation. The most common "work" for an automation is to make a series of API calls back to Jira to perform some data mutation on one or more issues.
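To make the flow above concrete, here is a minimal sketch of a two-queue pipeline in Python. It is only an illustration of the description, not Atlassian's implementation: the `Event` and `Rule` classes, the event names, and the `find_matching_rules` and `run_pipeline` helpers are all hypothetical, and the real "work" (API calls back to Jira) is replaced by recording a string.

```python
from collections import deque
from dataclasses import dataclass


@dataclass
class Event:
    kind: str        # hypothetical event name, e.g. "issue_created"
    issue_key: str


@dataclass
class Rule:
    name: str
    trigger_kind: str  # the event kind this rule listens for


def find_matching_rules(event, rules):
    """Look up rules that may need to run for this event."""
    return [rule for rule in rules if rule.trigger_kind == event.kind]


def run_pipeline(events, rules):
    """Push events through two queues and return the executions performed."""
    event_queue = deque(events)   # first queue: incoming events
    execution_queue = deque()     # second queue: (rule, event) pairs to run

    # First kind of worker: consume events and enqueue matching rules.
    while event_queue:
        event = event_queue.popleft()
        for rule in find_matching_rules(event, rules):
            execution_queue.append((rule, event))

    # Second kind of worker: do the actual work of each rule.
    # A real worker would call back into Jira; here we only record the run.
    executed = []
    while execution_queue:
        rule, event = execution_queue.popleft()
        executed.append(f"{rule.name} on {event.issue_key}")
    return executed
```

In this sketch an event with no matching rule simply drops out after the first queue, which matches the "if rules are found" step in the description.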

This is, of course, a vague sketch of the architecture, but as you may be able to see from that description, failures at different points result in very different outcomes:

  • If a fault occurs in Jira which causes webhooks to not be sent, Automation will not have the opportunity to execute rules.
  • If a fault occurs in workers or in Jira's ability to respond to our API calls, this usually means that rules stay enqueued for longer but will eventually execute. Once the fault is resolved, the enqueued items can be churned through rather quickly. The team sees this ability to recover as a positive, and we have strong alerting around queue size and mean rule end-to-end time.
  • Because workers are stateful only for their lifetimes and are isolated from one another, faults on individual tenants do not propagate to other tenants via automation. I can't speak for Jira's other systems as I do not have deep knowledge of them.
  • Incidents in various pieces of AWS infra affect us in different ways depending on the infra. 
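The difference between the first two failure modes above can be sketched with a toy simulation. This is an assumption-laden illustration, not a model of the real service: `simulate`, its parameters, the one-event-per-tick workload, and the `capacity` figure are all made up for the example.

```python
from collections import deque


def simulate(ticks, webhook_down=frozenset(), workers_down=frozenset(), capacity=3):
    """Simulate one event per tick and return (executed, lost, max_backlog).

    executed is a list of (event_tick, execution_tick) pairs.
    webhook_down: ticks in which the event is never delivered (hypothetical).
    workers_down: ticks in which no worker consumes the queue (hypothetical).
    capacity: events the workers can process per tick (hypothetical).
    """
    queue = deque()
    executed, lost = [], []
    max_backlog = 0
    for tick in range(ticks):
        if tick in webhook_down:
            lost.append(tick)      # never delivered, so never enqueued
        else:
            queue.append(tick)
        if tick not in workers_down:
            for _ in range(min(capacity, len(queue))):
                executed.append((queue.popleft(), tick))
        max_backlog = max(max_backlog, len(queue))
    return executed, lost, max_backlog
```

With `webhook_down={1, 2}` the events from those ticks never run at all. With `workers_down={1, 2, 3}` nothing is lost: the backlog grows while the workers are down and is drained within a tick or two of recovery, so the delayed events run close together in time.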

I hope this helps to answer your question. Our goal is to create a system which is immune to any single point of failure, which fails gracefully when it fails (doesn't propagate failures across tenants), which scales both as Atlassian grows and as individual tenants increase their usage of automation, and which has strong alerting so that we know about incidents before customers do.

Best,
Wes

Bill Sheboy
Rising Star
February 10, 2021

Hi @wwalser 

Thank you very much for your well-explained response, and reality-based ideas of possible fault points. I truly appreciate the excellent work your team does for Atlassian and the users in the community.

In addition to what you describe as catch-up/recovery processing following an automation-impacting incident, I expect that rule-writers/maintainers may also see additional race conditions from "compressing" rule execution into narrower time-frames than when the triggering events/conditions occurred in real-time.  Something for us to watch for in clean-up activity, or to deal with in rule definition.

One follow-up question: for scheduled trigger rules which do not fire during an impacting outage, are they skipped or batched for execution later?  For example, if a scheduled rule triggers every hour, and a 5-hour outage occurs, should we expect no rule runs or multiple back-to-back runs after the outage clears?  I completely understand the system isn't a "time machine" in which scheduled rules use logs to accurately recreate the conditions at moments in time over the duration of an outage.  Just want to know what to expect.

Thanks again!

Best regards,

Bill

wwalser
Atlassian Team
February 12, 2021

"for scheduled trigger rules which do not fire during an impacting outage, are they skipped or batched for execution later?  For example, if a scheduled rule triggers every hour, and a 5-hour outage occurs, should we expect no rule runs or multiple back-to-back runs after the outage clears?"

Off the top, a 5-hour outage would be extremely out of character for our team. Interruptions are typically measured in minutes. An hour of not processing rules entails multiple post-mortem meetings and usually slates a fair amount of work to ensure whatever happened doesn't occur again.

As to an actual answer, it depends again on what is causing the incident. Scheduled triggers are handled by Automation's infrastructure, which is separate from Jiraª, so scheduled triggers are resilient in ways that webhook-dependent triggers are not. That said, the first action of most scheduled triggers is to perform a JQL query, so if Jira is down, you're probably out of luck.

If something in our infra is down, it seems likely that scheduled rules could be enqueued but unactioned. This would cause the behaviour that you postulated — multiple scheduled items happening closer in time than was intended.

----

ªThis is… kind of misleading? In the same way that Automation is a part of Jira yet runs on its own infrastructure, there aren't many parts of "Jira" which could reasonably be considered a single thing once you get just below the UI.
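As a rough illustration of the scheduled-trigger case discussed above (an hourly rule and a 5-hour outage), the sketch below compares the two possible outcomes: runs that were enqueued but left unactioned until recovery, versus runs that were never enqueued. The function `scheduled_run_times` and its parameters are hypothetical, and which outcome actually occurs depends on the incident, as the answer explains.

```python
def scheduled_run_times(schedule_hours, outage_start, outage_end,
                        enqueued_during_outage=True):
    """Return (scheduled_hour, actual_run_hour) pairs for an hourly rule.

    The outage covers outage_start <= hour < outage_end (an assumption).
    enqueued_during_outage=True models triggers that were enqueued but not
    actioned until recovery; False models triggers that never fired.
    """
    runs = []
    for hour in schedule_hours:
        in_outage = outage_start <= hour < outage_end
        if not in_outage:
            runs.append((hour, hour))          # ran on time
        elif enqueued_during_outage:
            runs.append((hour, outage_end))    # held in the queue, ran on recovery
        # otherwise the run is skipped entirely
    return runs
```

For `scheduled_run_times(range(0, 9), 2, 7)` the five delayed runs and the on-time run for hour 7 all execute at hour 7, which is the "multiple back-to-back runs" case. Passing `enqueued_during_outage=False` drops those five runs instead.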

Darryl Lee
Community Leader
Community Leaders are connectors, ambassadors, and mentors. On the online community, they serve as thought leaders, product experts, and moderators.
February 5, 2021

Oh man, great question.

I remember an outage in November which, curiously, is not listed here: https://status.automationforjira.com/uptime?page=2

At the time, I remembered having to manually move a bunch of tickets that were blocked, but being concerned that rules that sent notifications would be "queued", and all fire later (which I believe is what happened).

I see that on the Dec 14 incident, they wrote:

"Rules executions will still be delayed during this window."

But yes, it would be good to have some official guidance around what to expect, and hey, maybe some SLAs?

Darryl Lee
Community Leader
February 5, 2021

Speaking of SLAs, back in August of last year, I ran into an issue where our Automation rules couldn't send emails because the SES quota was exceeded. :-o

Hopefully they've since put some monitoring on that. I wonder if people started using Automation emails more once it got bundled with Cloud, and Atlassian just didn't anticipate that they'd go over quota.
