Share your best practices for JIRA High Availability (Failover Strategies)

Stefan Broda Atlassian Team Nov 11, 2012

Atlassian recently released its Failover for JIRA guide. As JIRA becomes a mission-critical resource in your organization, maximizing application uptime becomes a prime consideration. This best practices guide assembles some of the best advice from our customers, our partners, and internal staff on setting up a failover solution for JIRA that could limit unexpected production downtime to a few minutes.

  • High Availability measures possible with JIRA
  • Guidance on how to implement a failover scenario with JIRA

Please note that your feedback and comments are welcome! We would very much value additional lessons learned and experience with alternative scenarios!

6 answers

1 accepted


A warning about the following:

This is the result of playing around and, while the basic principles work, not all scenarios have been tested. It also only handles the JIRA application failover - there are plenty of resources for handling database failover/clustering, so I'm not going to try and reinvent that wheel.

This is not a complete how-to guide, but simply a few notes taken during my experimentation. It assumes you have a good understanding of Linux administration and JIRA.

This solution is going to require 3 systems: one postgres database and 2 servers running JIRA. Note: there's some info on JIRA licensing requirements which can be found here: http://confluence.atlassian.com/display/JIRA/Is+Clustering+or+Load+Balancing+JIRA+Possible

All machines are running Ubuntu 11.10 server x86_64 in VirtualBox with fairly minimal specs (as this was just done for a proof of concept).

JIRA only needs to be installed on a single VM which can then be cloned after the initial setup.

Install heartbeat for High Availability (HA) failover and csync2 for file system syncing

apt-get install heartbeat csync2

Download JIRA from Atlassian (this was done with 5.0.4)

wget http://www.atlassian.com/software/jira/downloads/binary/atlassian-jira-5.0.4-x64.bin
chmod +x atlassian-jira-5.0.4-x64.bin
./atlassian-jira-5.0.4-x64.bin

Follow the installer using the Express install option and install it as a service when prompted.

Set up an external postgres server and create database 'jira-ha'
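As a rough sketch of the database step, the script below only stages a SQL file for review; it does not touch a server. The role name `jira`, the password placeholder, and the staging directory are assumptions, not part of the original notes. The one detail worth noting is that a database name containing a hyphen has to be double-quoted in SQL.

```shell
#!/bin/sh
# Sketch only: stage the SQL for the 'jira-ha' database in a local directory.
# The role name, password and staging path are hypothetical.
set -eu

STAGE="${STAGE:-./jira-ha-staging}"
mkdir -p "$STAGE"

cat > "$STAGE/create-jira-db.sql" <<'EOF'
-- Hypothetical role that JIRA will connect as.
CREATE ROLE jira LOGIN PASSWORD 'change-me';
-- The hyphen in the name means it must be double-quoted.
CREATE DATABASE "jira-ha" OWNER jira ENCODING 'UTF8' TEMPLATE template0;
EOF

# On the database server, after review (not run here):
#   sudo -u postgres psql -f create-jira-db.sql
echo "Staged $STAGE/create-jira-db.sql"
```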

Work through the setup of JIRA using the new DB.

Disable auto-start of JIRA

update-rc.d jira disable

Shut down the VM

shutdown -hP now

Clone the VM and re-initialise the MAC address

Start up both VMs

On node1 (jira-ha-1), edit the following files (you may need to create them)

/etc/ha.d/ha.cf
/etc/ha.d/haresources
/etc/ha.d/authkeys
/etc/hosts

Put in all your hosts and their IPs. Refer to the heartbeat man pages if required.
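For illustration, here is a minimal sketch of what those files might contain. It writes them to a staging directory rather than /etc, so they can be reviewed first. The IP addresses, the interface name, the `jira-ha` alias, the timing values and the secret are all assumptions; only the node name jira-ha-1 and the `jira` service name come from the notes above (jira-ha-2 is assumed for the clone). Heartbeat expects the node names to match `uname -n` on each machine, so the cloned VM needs its own hostname.

```shell
#!/bin/sh
# Sketch only: stage example heartbeat (v1-style) files for review.
# All addresses, the interface and the secret below are hypothetical.
set -eu

STAGE="${STAGE:-./jira-ha-staging}"   # review here, then copy into /etc/ha.d/
NODE1_IP="192.168.56.11"              # assumed address of jira-ha-1
NODE2_IP="192.168.56.12"              # assumed address of jira-ha-2
VIP="192.168.56.100"                  # assumed virtual IP
IFACE="eth0"                          # assumed interface

mkdir -p "$STAGE/ha.d"

# ha.cf: cluster membership and timing. auto_failback is off, as the
# post recommends further down.
cat > "$STAGE/ha.d/ha.cf" <<EOF
logfacility local0
keepalive 2
deadtime 30
initdead 120
bcast $IFACE
auto_failback off
node jira-ha-1
node jira-ha-2
EOF

# haresources: preferred node, the virtual IP, then the init script to start.
cat > "$STAGE/ha.d/haresources" <<EOF
jira-ha-1 IPaddr::$VIP/24/$IFACE jira
EOF

# authkeys: shared secret; heartbeat refuses to start unless the mode is 600.
cat > "$STAGE/ha.d/authkeys" <<EOF
auth 1
1 sha1 replace-with-a-long-random-secret
EOF
chmod 600 "$STAGE/ha.d/authkeys"

# Lines to append to /etc/hosts on both nodes.
cat > "$STAGE/hosts.append" <<EOF
$NODE1_IP jira-ha-1
$NODE2_IP jira-ha-2
$VIP jira-ha
EOF

echo "Staged heartbeat files in $STAGE"
```

The same ha.cf, haresources and authkeys would go on both nodes.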

Edit /etc/csync2.cfg and sync up the /var/atlassian folder (excluding logs).
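A csync2 configuration along these lines might look like the sketch below, again written to a staging directory for review. The group name, the second host name, the key path and the log directory are assumptions; the log path is based on the installer's default JIRA home and should be checked against your install.

```shell
#!/bin/sh
# Sketch only: stage an example csync2.cfg for review.
# Group name, host names, key path and log path are hypothetical.
set -eu

STAGE="${STAGE:-./jira-ha-staging}"
mkdir -p "$STAGE"

cat > "$STAGE/csync2.cfg" <<'EOF'
group jiraha
{
    host jira-ha-1 jira-ha-2;
    key /etc/csync2.key;
    include /var/atlassian;
    exclude /var/atlassian/application-data/jira/log;
    auto none;
}
EOF

# On the real nodes (not run here):
#   csync2 -k /etc/csync2.key   # generate the shared key once, copy it to both nodes
#   csync2 -xv                  # check for changes and push them to the peer
echo "Staged $STAGE/csync2.cfg"
```

`auto none` leaves conflicts for manual resolution, which seems the safer default when only one node should be writing at a time.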

In JIRA, set the site url to be the virtual IP being used by heartbeat (or the DNS alias of the virtual IP)

You should now be able to start playing with the setup. I found the easiest way to test was to disconnect the network cable in the VM settings.

One setting I changed in heartbeat was to turn off auto_failback; however, I just noticed this has been deprecated and replaced with a setting in pacemaker. However you implement this, you don't want the primary server trying to take back control every 10 minutes if you're getting packet loss, so turn off the failback.

Another change I made was in postgres to only allow connections from the virtual IP to stop the two instances from being able to access it at the same time (they shouldn't try, but who knows).
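One way this restriction could be expressed is a pg_hba.conf entry, sketched below as a staged fragment. The virtual IP and the role name `jira` are assumptions. It is worth verifying which source address the database actually sees, since a node's outgoing connections often use its primary address rather than the virtual one unless configured otherwise.

```shell
#!/bin/sh
# Sketch only: stage a pg_hba.conf fragment that admits the JIRA database
# user from the virtual IP alone. The IP and role name are hypothetical.
set -eu

STAGE="${STAGE:-./jira-ha-staging}"
VIP="192.168.56.100"   # assumed virtual IP managed by heartbeat
mkdir -p "$STAGE"

cat > "$STAGE/pg_hba.conf.append" <<EOF
# TYPE  DATABASE  USER  ADDRESS            METHOD
host    jira-ha   jira  $VIP/32  md5
EOF

# pg_hba.conf rejects connections that match no line, so make sure no
# broader "host" line also matches this database, then reload PostgreSQL.
echo "Staged $STAGE/pg_hba.conf.append"
```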

Something I never tried, but thought about, was to run JIRA through Apache + mod_proxy on both boxes. It wouldn't need to be handled by heartbeat and could be configured to display a nice loading page during the failover (because JIRA can take quite a while to start up in some environments).
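Since this idea was never tried in the original notes, the following is only an untested sketch of how it might be wired up. It stages an Apache virtual host that proxies to JIRA on its default port 8080 and serves a static page when the backend refuses connections. The server name, document root and page name are assumptions, and `proxy` and `proxy_http` would need to be enabled (for example with `a2enmod`).

```shell
#!/bin/sh
# Sketch only: stage an Apache vhost that fronts JIRA and shows a static
# page while the backend is down. Names and paths are hypothetical.
set -eu

STAGE="${STAGE:-./jira-ha-staging}"
mkdir -p "$STAGE"

cat > "$STAGE/jira-proxy.conf" <<'EOF'
<VirtualHost *:80>
    ServerName jira-ha
    DocumentRoot /var/www/jira-standby

    ProxyPreserveHost On
    # Serve the loading page locally instead of proxying it.
    ProxyPass /starting.html !
    ProxyPass / http://localhost:8080/
    ProxyPassReverse / http://localhost:8080/

    # mod_proxy answers 503 when it cannot connect to the backend.
    ErrorDocument 503 /starting.html
</VirtualHost>
EOF

echo "Staged $STAGE/jira-proxy.conf"
```

This covers the window where Tomcat is not yet listening; what users see while Tomcat is up but JIRA is still initialising would need separate testing.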

I'd love to hear if anyone else has tried this method in more than just a POC and what other suggestions you have.


Thank you for the guide. Although it is helpful, it is painful to implement and explore, especially when professional resources are low. It would be very helpful to have an integrated solution in the product, abstracting all the details from the admin... One example of such a solution may be to allow duplication of DB writes or DB sync on any change done (issued by the application).

Stefan Broda Atlassian Team Nov 15, 2012

Thanks Edward for your comment! I'll do some digging and see which feature requests are related to your suggestion.


Hi Stefan,

I am not sure of the correctness of this design given the inconsistencies that could potentially arise with writes to JIRA_HOME differing from the database. It would seem to me that JIRA itself would need to be modified to integrate with an HA distributed lock management system such as Apache ZooKeeper (ZK) so replicated files (including logs) are consistent.

Here ZK is used to a) ensure file updates are transactionally consistent across the cluster, b) provide real-time monitoring rather than polling (since it supports pub/sub), c) provide distributed logging via BookKeeper and d) provide the mechanism to control failover and recovery depending upon the point of failure.

If you are trying to achieve 99.9% or above, the detection and failover need to be automatic, as does the degradation procedure of, say, displaying an outage page instead.

Furthermore, in terms of HA, the components in the stack are not all equal, in that the web tier handles more requests than Tomcat, which in turn handles more than the database. The first thing I have been doing is using a web CDN such as CloudFlare to serve content reliably.

Apart from DB failures the next critical thing would be for JIRA to either a) operate in hot standby mode or b) be able to start in milliseconds (even if in restricted form) rather than slowly as it does now.

Kind Regards,

Philip

Stefan Broda Atlassian Team Nov 16, 2012

Hi Philip,

From your point of view, would home directory inconsistencies occur with any of the listed file replication methods?

I agree that JIRA does not have the capability to provide full High Availability with automatic failover. However, we wrote the guide to outline what is possible with the current version.

Hi Stefan,

The replication mechanisms described above are not transactional. For example, if a file is uploaded as an attachment, a transactional update would see the file appear on each relevant node of the cluster or not at all, rather than in a batch update afterwards. My concern with non-transactional updates is that you risk corrupting your JIRA instance right at the time you can least afford it (i.e. it is under high load and the server crashes). Maybe I am being paranoid, but I note from the JIRA upgrade procedures "Inconsistent database backups may not restore correctly".

I think to implement HA with JIRA you should start with the web tier and work back. The reason I bang on about CDNs is that you get a good first step towards HA for $20/month or thereabouts with no risk.

The second step to implementing HA would be to incorporate a Distributed Lock Management (DLM) plugin within JIRA, encapsulate the JIRA instance with a start/stop monitoring system, and perform the mods needed to prevent the possibility of corrupting one's JIRA instance. The outcome of this step would be to allow a cold standby JIRA instance.

Enabling hot standby would be an obvious next step, with database replication the third. That said, I would not commit resources to this until one had a JIRA Tomcat cluster working properly.

Cheers,

Philip

Hi Philip

When using CloudFlare, did you have to define any page rules? We're finding that some of the cached content is causing problems, e.g. when searching issues. I've currently got a page rule that bypasses the cache for ALL JIRA content but I want to try and reduce that to a subset.

Thanks.

Philip


Focusing on JIRA is fine and a great topic, but high availability should also consider external dependencies.

When JIRA is integrated with external LDAP authentication/authorization, that LDAP server becomes a single point of failure. After experiencing one outage due to the AD node I was using being taken down for unannounced maintenance, I needed to find something to fix that.

Active Directory, a common authN/authZ provider that is accessed through the LDAP protocol, supports mirroring, and most large organisations, like my previous one, have multiple duplicate copies of that data residing on many redundant servers, which is great. What is not great is that JIRA is still not capable of switching to alternate servers if the current one is unavailable. I logged CONF-8867 back in 2007 for this...

To address this, I implemented a quick and dirty (free!) TCP proxy on the JIRA server that proxied out to several of these LDAP mirrors (see comment on issue above) if the primary was unavailable, which worked flawlessly. A more Enterpri$e solution would involve an F5 with TCP load balancing.
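The proxy itself is not shown in this thread, so the following is not that implementation. It is only a small sketch of the failover decision such a proxy has to make: try the mirrors in order and use the first one that accepts a TCP connection. The function name, the `PROBE` variable and the host names are hypothetical; the default probe assumes a netcat that supports `-z` and `-w`.

```shell
#!/bin/sh
# Sketch only: pick the first LDAP mirror that accepts a TCP connection.
# PROBE is called as: $PROBE HOST PORT, and should succeed if reachable.
PROBE="${PROBE:-nc -z -w 2}"

# first_reachable HOST:PORT...
# Prints the first reachable target and returns 0, or returns 1 if none is.
first_reachable() {
    for target in "$@"; do
        host=${target%:*}
        port=${target#*:}
        if $PROBE "$host" "$port" >/dev/null 2>&1; then
            echo "$target"
            return 0
        fi
    done
    return 1
}

# Example call with hypothetical mirrors (not run here):
#   first_reachable ldap1.example.com:389 ldap2.example.com:389
```

A real proxy would then forward connections to the chosen target and repeat the check when it stops responding.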


You can just use something like relayd from OpenBSD and pull off the same thing.
