Any strategies for handling Downtime or other Automation Glitches


I work with a tech company that has about 30 development teams, doing our own form of scaled agile.

Within Jira, I have developed a common implementation - same issue types, workflow, custom fields, permissions, screens, etc.

We also use Advanced Roadmaps and have a hierarchy of:
- Pillar > Initiative > Epic > Product Backlog Item (PBI) > SubTask

I have developed almost 100 Automations - automating workflow state changes, syncing workflow (between Epic, PBI, SubTask), syncing fields (between Epic, PBI, SubTask), automatically setting start date/finish date (based upon workflow status), automatically propagating fields (doing copy down from Epic -> PBI -> SubTask or copy up from SubTask -> PBI -> Epic), handling special operations (splitting, cloning, moving, deletion, etc.), calculating cycle time per PBI, calculating average cycle time and velocity per team, propagating comments in development tickets over to service desk tickets, etc.

Every once in a while, Automation for Jira goes down, or there's a glitch, and the automations either don't run or fail. When this occurs, we effectively have "data corruption". That is, what is tracked in Jira doesn't jibe with the state it would have been in had the Automations run. As an example, maybe the Automation to calculate the cycle time failed (for whatever reason), so the data is now missing.

To combat this, I started developing "monitors" to identify such cases. These "monitors" are scheduled to run daily. It's somewhat arbitrary, but I look at issues modified in the past 2 weeks and check for all sorts of anomalies (workflow status incorrectly set, fields between Epic and PBI not matching as expected, etc.). The challenge is that these monitors are quite resource intensive. Also, because we have so many tickets, I need to run the "monitors" for "sets" of teams. As a result, I have ~ 100 monitors running daily.
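A minimal sketch of how such per-set monitor queries could be generated rather than maintained by hand. The project keys, the "Finish Date" field name, and the anomaly clause are hypothetical placeholders, not taken from the actual setup; only the "modified in the past 2 weeks" window and the idea of running per set of teams come from the post.

```python
from typing import Iterable, List


def chunk(items: List[str], size: int) -> List[List[str]]:
    """Split a list of project keys into consecutive sets of at most `size`."""
    if size < 1:
        raise ValueError("size must be at least 1")
    return [items[i:i + size] for i in range(0, len(items), size)]


def monitor_jql(project_keys: Iterable[str], anomaly_clause: str, days: int = 14) -> str:
    """Build the JQL for one monitor covering one set of team projects."""
    projects = ", ".join(f'"{key}"' for key in project_keys)
    return f"project in ({projects}) AND updated >= -{days}d AND ({anomaly_clause})"


# Hypothetical anomaly: a PBI marked Done that has no finish date recorded.
ANOMALY = 'issuetype = "Product Backlog Item" AND status = Done AND "Finish Date" is EMPTY'

# Hypothetical project keys, one per team.
team_projects = ["TEAMA", "TEAMB", "TEAMC", "TEAMD", "TEAME"]

# One query per set of two teams; each could back one scheduled monitor.
queries = [monitor_jql(group, ANOMALY) for group in chunk(team_projects, 2)]
for query in queries:
    print(query)
```

Generating the queries from one anomaly clause and one list of teams keeps the number of things to maintain closer to the number of anomaly types than to the number of monitors.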

I've done my best to generalize things, to minimize maintenance cost. I suppose I spend maybe 4 hours/week maintaining this.

Just recently, there was another outage and I started to wonder: do I need even more monitors, given that the teams expect the Automations to run flawlessly?

Anyhow, I figured that I would reach out to the community to ask - how do you ensure flawless execution (of your Automations)? Have you built a monitoring framework (to verify that the data is getting propagated as expected)?  If not, what do you do to ensure the data is correct?

Thanks,
Doug

3 answers

1 accepted

0 votes
Answer accepted

 

Hi @Mykenna Cepek, @Bill Sheboy, @Matthias Gaiser _K15t_

Thanks for your input.  It is very useful.  As I read through your answers, I realized that there are several forms of "downtime":

1) Automation is completely down (this happened recently, due to an AWS outage)

2) Automation cannot complete a rule, due to a glitch of Automation accessing Jira

An example of #2 is an error I received on Monday:

No subsequent actions were performed since JQL condition did not pass due to error running JQL:
"[category = "Agile Project" and issuetype in standardIssueTypes() and issuetype != EPIC]" - Unexpected error calling match API from Jira


This rule was fully tested and has been working for > 1 year. But every once in a while, an error like the above occurs (issues accessing the Jira API or the Jira API returning some odd error). These errors are intermittent and generally not associated with any one Automation rule. There are other examples (as I mentioned, I have about 100 Automations that are easily invoked more than 1,000 times per day).

As a note, neither #1 nor #2 happens that often. But when either does, we have "data inconsistencies" (i.e. the data doesn't match what we would expect had the automations run successfully to completion).

A colleague of mine suggested that all Automation events should be placed in an event processor (like Kafka) so that, when a failure occurs, they are automatically replayed. But that is easier said than done, as some of my Automations contain multi-step updates, and that would require quite a complex commit/rollback mechanism (where an Automation rule commits only if all steps are completed).
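To make the replay idea concrete, here is a small sketch under stated assumptions. It is not something Automation for Jira provides: an in-memory `deque` stands in for an event processor like Kafka, issues are plain dictionaries, and all function names are hypothetical. It also swaps commit/rollback for a different technique, idempotent steps, so that replaying a partly applied event is harmless.

```python
from collections import deque
from typing import Callable, Deque, Dict, List

Issue = Dict[str, str]
Step = Callable[[Issue], None]


def set_field(name: str, value: str) -> Step:
    """Return an idempotent step: applying it twice leaves the same state."""
    def step(issue: Issue) -> None:
        issue[name] = value
    return step


def run_event(issue: Issue, steps: List[Step]) -> None:
    """Apply every step in order; an exception leaves the issue partly updated."""
    for step in steps:
        step(issue)


def process(queue: Deque, max_attempts: int = 3) -> List:
    """Drain the queue, re-queueing failed events until attempts run out."""
    dead_letters = []
    while queue:
        issue, steps, attempts = queue.popleft()
        try:
            run_event(issue, steps)
        except Exception:
            if attempts + 1 < max_attempts:
                queue.append((issue, steps, attempts + 1))  # replay later
            else:
                dead_letters.append((issue, steps))  # needs manual attention
    return dead_letters


calls = {"n": 0}


def flaky(issue: Issue) -> None:
    """Simulated intermittent glitch: fails on the first call only."""
    calls["n"] += 1
    if calls["n"] == 1:
        raise RuntimeError("Unexpected error calling match API from Jira")
    issue["Cycle Time"] = "5d"


pbi = {"key": "PBI-1"}
queue = deque([(pbi, [set_field("Status", "Done"), flaky], 0)])
failed = process(queue)
print(pbi, failed)
```

The first attempt applies the status step and then fails; the replay reapplies the status step without harm and completes the second step. Steps that are not idempotent (adding a comment, incrementing a counter) would still need the more complex handling described above.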

Oftentimes, in software, over 80% of the development effort is spent handling corner case scenarios (e.g. the scenarios listed above). I am curious if anyone out there is spending time trying to correct for these corner cases. Or, alternatively, do we simply accept that there could be times when the data is inconsistent and shrug our shoulders?

As an example @Mykenna Cepek wrote:

An example of category 2a is the previous rule which takes some action when a new issue is created. One could copy the original rule, hack it to find all recently created issues that did not have the action taken, and then perform the action.

I am curious. Is anyone writing such rules? As I mentioned, I have ~ 100 Automations. If I had a recovery rule for each Automation, that would double my effort (to develop, test, document, maintain).

Again, thanks for your perspective.  There isn't a right answer.  I am just curious how the community handles this.

Doug 

@Doug Levitt, I completely concur with your point that a high percentage of software effort is devoted to handling corner/error cases (having been a developer, I know all the attention I paid to exception paths, recovery, logging, etc).

Automation in Jira is a different beast, and "error/outage recovery" is almost always going to be manual. Even using the Jira REST API to programmatically do the equivalent will run into this, as we'll never really know what boinked in Jira (and therefore can't recover correctly in all cases).

You asked if anyone has written rules to fix situations where rules have failed. I sure have! My "category 2a" comes straight from personal experience. However, out of all our rules, I've only had to do this with a few -- and only after some unexpected failure occurred (never proactively).

I outlined the categories to help folks diagnose which rules might be of most concern (and what to do to help mitigate any fallout).

Honestly, I just lump all this into a "tool maintenance" category, and accept that some percentage of my time as a Jira admin is going to be handling these corner cases. Automation rules are an art in the vein of software development. We know code isn't born perfect, it evolves. The environment in which it runs is not perfect either.

In a larger sense, the Atlassian tool suite consists of complex applications with complex configuration, complicated by ongoing updates and users wanting changes every week. It's just the nature of the beast. The fun part (assuming you don't hate the effort) is getting paid for it.

HOWEVER, I do want to address the real-life example you mentioned:

Unexpected error calling match API from Jira

I've never seen this. Have you created a Support ticket for this? Or do you know why this happens? Intermittent errors suck, and Atlassian Support might not be able to help with this one -- but maybe they can. I've had good luck bringing my "corner cases" to Support.

With my most recent intermittently problematic rule, the discussion with Support was "part of the journey", and I realized along the way that I could tweak my rule to minimize the likelihood. While it wasn't a real fix, the problem hasn't recurred (yet). I call that progress.


@Mykenna Cepek 

Thanks.  Your response answers my question.  Which is:

You asked if anyone has written rules to fix situations where rules have failed. I sure have! My "category 2a" comes straight from personal experience. However, out of all our rules, I've only had to do this with a few -- and only after some unexpected failure occurred (never proactively).

That's been my approach (to date).  I was just wondering if there was a better approach (i.e. was I missing something).

I've never seen this. Have you created a Support ticket for this? Or do you know why this happens? Intermittent errors suck, and Atlassian Support might not be able to help with this one -- but maybe they can. I've had good luck bringing my "corner cases" to Support.

That was only an example of an error.  I have seen other odd errors as well.  They don't occur often.  But it appears that this occurs when we "overly tax" the system.  Below is our execution profile.  I have no idea how typical this is.

[Image: Automations.png]

Mykenna Cepek Community Leader Dec 28, 2021

Awesome metrics and visualization, @Doug Levitt!

I'd be curious if the error spikes correlated with something that would be more obvious at a higher resolution on the horizontal axis. This would assume you have hourly (vs daily) execution count data. I might expect, for example, the error potential to be higher during short-term execution spikes. That, in turn, might help isolate the rule(s) in use during those spikes.
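A rough sketch of that hourly breakdown, assuming the execution data can be exported as (timestamp, rule name, status) records. The record format, the status values, and the sample rows are made up for illustration.

```python
from collections import Counter
from datetime import datetime
from typing import Dict, Iterable, Tuple

# Each record is (ISO timestamp, rule name, status); this format is an assumption.
Record = Tuple[str, str, str]


def hourly_error_rates(records: Iterable[Record]) -> Dict[str, Tuple[int, int, float]]:
    """Map each hour to (executions, errors, error rate)."""
    runs: Counter = Counter()
    errors: Counter = Counter()
    for timestamp, _rule, status in records:
        # Truncate the timestamp to the start of its hour.
        hour = datetime.fromisoformat(timestamp).strftime("%Y-%m-%d %H:00")
        runs[hour] += 1
        if status == "ERROR":
            errors[hour] += 1
    return {h: (runs[h], errors[h], errors[h] / runs[h]) for h in sorted(runs)}


# Made-up sample data.
sample = [
    ("2021-12-27T09:05:00", "Sync Epic status", "SUCCESS"),
    ("2021-12-27T09:40:00", "Copy fields down", "ERROR"),
    ("2021-12-27T10:15:00", "Sync Epic status", "SUCCESS"),
]
rates = hourly_error_rates(sample)
for hour, (count, failed, rate) in rates.items():
    print(hour, count, failed, f"{rate:.0%}")
```

Comparing the hours with the highest execution counts against the hours with the highest error rates would show whether errors cluster around short-term spikes.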

While our totals are an order of magnitude smaller, I've still seen problems occur during execution spikes. By tracking down the specific rules, I've always been able to tweak them to eliminate the errors -- eventually, after some trial and error.

Jira Automation has a lot going on "behind the curtain" for optimization, due to framework limitations, etc. My decades of procedural and object oriented debugging skills only take me so far with failing automation rules. But I've followed hunches and implemented rule adjustments to positive effect. More art than science sometimes, I'm afraid.

Yes, and...I am with @Mykenna Cepek on the suggestion to contact support for those JQL anomalies you are seeing.  They could look at logs we cannot see to offer suggestions.

There is always in-progress work on JQL and the REST API, and so you may be seeing problems that others have not found yet.

Thanks!

@Bill Sheboy , @Mykenna Cepek

I have in the past raised tickets for these types of issues (e.g. "SocketException: Socket closed"; e.g. "The issue with the following issue key could no longer be loaded (perhaps it was deleted or permissions were revoked): APD-885"; e.g. "Error adding comment. Could not retrieve project for issue. Please check the rule actor's permissions."). These occur mostly intermittently, though sometimes they arrive in "batches". When a "storm" occurs (i.e. a bunch of errors during a particular time frame), I will log a ticket.

Sometimes the Jira APIs just fail and there is nothing Support can do (one agent wrote: "But seems like Jira APIs suffered a bit at that moment").

I suppose the bottom line is that this mechanism isn't designed to be highly reliable. It generally works, but there are no guarantees. That is why I posted this particular question: I was curious how others deal with this (particularly when their organization relies on the data being accurate).

1 vote
Mykenna Cepek Community Leader Dec 23, 2021

My answer is going to ignore the "outage detection" aspect of your question, and focus on the "outage recovery" portion.

My take is that this is a game of diminishing returns. Why? As I've dealt with the aftermath of rules not running for various reasons, I realized that some rules simply can't recover from missing their context at trigger time.

A simple perspective is that rules fall into two categories. If a rule is not able to run when it is scheduled or triggered, then either:

    1) It can successfully be run later and achieve the intended result; OR

    2) It cannot be re-run and achieve the intended result.

A real example of category 1 is a rule that runs twice a day for us that scans all issues to ensure that a certain field always has a value, emailing me with a list of offenders. If I miss the morning run, the afternoon run will sufficiently catch me up.

A real example of category 2 is a rule which takes some action when a new issue is created. Issues created during an outage will not have had those actions taken on them.

I'd further suggest that we can separate category 2 into:

    2a) Rules that can be manually fudged to complete the original mission, VS

    2b) Rules that can't complete their original mission (either at all, or without excessive effort).

An example of category 2a is the previous rule which takes some action when a new issue is created. One could copy the original rule, hack it to find all recently created issues that did not have the action taken, and then perform the action.

An example of category 2b is a rule with complex conditionals examining various issue fields which may get changed manually during the outage. In that case, the context at trigger time is forever lost. Only a heroic effort would be able to unravel that mess.

So, what to do?

I would suggest that knowing which rules are in category 1 is helpful -- after an outage those rules could be manually triggered to relatively easily "catch up". Biggest bang for the buck.

Rules in category 2b are a lost cause, and it can help to be clear on that. If those rules were doing something really, really important, it might help to revisit their logic to see if there might be a different way to accomplish those goals -- maybe a different approach which can be re-run after an outage. Perhaps even rewriting those rules.

Rules in category 2a are in-between. Perhaps a light refactoring would allow them to be re-run with success. At the very least, you can save the hacked versions for re-use at the next outage.
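One way to keep that triage handy is a small inventory that tags each rule with its category and prints a post-outage checklist. This is only a sketch: the rule names are invented, and which category a given rule belongs to is a judgment the admin has to make.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Dict, List


class Recovery(Enum):
    RERUN = "1: re-run as-is after the outage"
    CATCH_UP = "2a: run a modified catch-up copy"
    LOST = "2b: context lost; fix manually or redesign"


@dataclass
class Rule:
    name: str
    category: Recovery


def recovery_plan(rules: List[Rule]) -> Dict[Recovery, List[str]]:
    """Group rule names by recovery category, keeping empty categories visible."""
    plan: Dict[Recovery, List[str]] = {category: [] for category in Recovery}
    for rule in rules:
        plan[rule.category].append(rule.name)
    return plan


# Invented rule names, loosely mirroring the examples in this answer.
rules = [
    Rule("Daily required-field scan", Recovery.RERUN),
    Rule("Set defaults on issue created", Recovery.CATCH_UP),
    Rule("Conditional field sync on edit", Recovery.LOST),
]

plan = recovery_plan(rules)
for category, names in plan.items():
    print(f"{category.value}: {', '.join(names) or '(none)'}")
```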

Hope that helps!

1 vote

Hi @Doug Levitt

I don't have practical experience in monitoring that - but I had a thought in mind which could help you in finding out which automations ran successfully.

My idea is adding a label after your automation ran successfully - you could either apply a general one like automation-success or differentiate it based on your various automations.

Then you could check with a JQL query which issues don't have the label, probably combined with some other JQL attributes to correctly identify your subset. With this JQL, you could then e.g. build a dashboard which lists all the issues with the various cases.
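A small sketch of such a query, assuming the `automation-success` label from above. The project key and the other scope clauses are hypothetical. The `labels is EMPTY` clause is included because, in JQL, a negative comparison on labels alone does not match issues that have no labels at all.

```python
def missing_label_jql(label: str, scope: str) -> str:
    """JQL for issues in `scope` that do not carry `label`."""
    # A negative labels clause skips issues with no labels, so add `is EMPTY`.
    return f'{scope} AND (labels is EMPTY OR labels not in ("{label}"))'


# Hypothetical scope: recently created PBIs in one project.
scope = 'project = "APD" AND issuetype = "Product Backlog Item" AND created >= -14d'
jql = missing_label_jql("automation-success", scope)
print(jql)
```

The resulting string could be pasted into a saved filter and used as the source for a dashboard gadget.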

How does that sound?

Cheers,
Matthias.

Hi @Doug Levitt 

Yes, and...to the ideas from @Matthias Gaiser _K15t_ ...

There are several open suggestions to add monitoring and REST API functions for automation rules, and any of these could help with health checks:

And...I just checked: you cannot set up a webhook to push out the automation logs yet either (probably because of the REST API gaps).  So...

What if your key rules wrote to an issue's entity properties when the rule runs successfully?  That would not be visible to users in the UX, and it could still be monitored using a webhook to push out the changes.  With a log parsing/monitoring tool (e.g. SumoLogic) you could dashboard rule execution health.  And you could target corrective measures to rules not run.  Of course, with a major outage perhaps those webhooks could also be impacted.
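As a sketch of what such a success stamp might look like outside of a rule, the snippet below builds (but does not send) a request against the Jira Cloud REST API issue properties endpoint, plus a helper that lists issues with no stamp. The property key, the payload fields, and the site URL are assumptions, and authentication is left out entirely.

```python
import json
from typing import Dict, Iterable, List
from urllib.request import Request

PROPERTY_KEY = "automation.lastSuccess"  # hypothetical property key


def build_stamp_request(base_url: str, issue_key: str, rule: str, ran_at: str) -> Request:
    """Build (but do not send) a PUT that stores a success stamp on the issue."""
    url = f"{base_url}/rest/api/3/issue/{issue_key}/properties/{PROPERTY_KEY}"
    # Payload shape is an assumption; any valid JSON value can be stored.
    body = json.dumps({"rule": rule, "ranAt": ran_at}).encode("utf-8")
    return Request(url, data=body, method="PUT",
                   headers={"Content-Type": "application/json"})


def unstamped(issue_keys: Iterable[str], stamps: Dict[str, dict]) -> List[str]:
    """Return the issue keys that have no recorded success stamp."""
    return [key for key in issue_keys if key not in stamps]


request = build_stamp_request(
    "https://example.atlassian.net",  # placeholder site
    "APD-885",
    "Copy fields down",
    "2021-12-27T09:40:00Z",
)
print(request.get_method(), request.full_url)

# Stamps as a monitor might have collected them (made-up data).
stamps = {"APD-885": {"rule": "Copy fields down", "ranAt": "2021-12-27T09:40:00Z"}}
print(unstamped(["APD-885", "APD-886"], stamps))
```

Inside Automation itself the stamp would be written by the rule's own action for setting an entity property; the request above only illustrates the shape of the data a monitor would look for.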

Kind regards,
Bill
