Atlassian recently released its Failover for JIRA guide. As JIRA becomes a mission critical resource in your organization, maximizing application uptime becomes a prime consideration. This best practices guide assembles some of the best advice from our customers, our partners, and internal staff on setting up a failover solution for JIRA that could limit unexpected production downtime to a few minutes.
Please note that your feedback and comments are welcome! We would very much value additional lessons learned and experience with alternative scenarios!
A warning about the following:
This is the result of playing around and, while the basic principles work, not all scenarios have been tested. It also only handles the JIRA application failover. There are plenty of resources for handling database failover/clustering, so I'm not going to try and reinvent that wheel.
This is not a complete how-to guide, but simply a few notes taken during my experimentation. It assumes you have a good understanding of Linux administration and JIRA.
This solution is going to require 3 systems: one postgres database and 2 servers running JIRA. Note: there's some info on JIRA licensing requirements which can be found here: http://confluence.atlassian.com/display/JIRA/Is+Clustering+or+Load+Balancing+JIRA+Possible
All machines are running Ubuntu 11.10 server x86_64 in VirtualBox with fairly minimal specs (as this was just done for a proof of concept).
JIRA only needs to be installed on a single VM which can then be cloned after the initial setup.
Install heartbeat for High Availability (HA) failover and csync2 for file system syncing
apt-get install heartbeat csync2
Download JIRA from Atlassian (this was done with 5.0.4)
wget http://www.atlassian.com/software/jira/downloads/binary/atlassian-jira-5.0.4-x64.bin
chmod +x atlassian-jira-5.0.4-x64.bin
./atlassian-jira-5.0.4-x64.bin
Follow the installer using the Express install option and install it as a service when prompted.
Set up an external postgres server and create database 'jira-ha'
Work through the setup of JIRA using the new DB.
Disable auto-start of JIRA
update-rc.d jira disable
Shut down the VM
shutdown -hP now
Clone the VM and re-initialise the MAC address
Start up both VMs
On node1 (jira-ha-1), edit the following files (you may need to create them)
/etc/ha.d/ha.cf
/etc/ha.d/haresources
/etc/ha.d/authkeys
/etc/hosts
Put in all your hosts and their IPs. Refer to the heartbeat man pages if required.
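As a rough sketch of what those three heartbeat files might contain, the script below writes them into a staging directory for review before they are copied to /etc/ha.d. Only jira-ha-1 and the jira service name come from the notes above; the second node name, the interface, the virtual IP, the shared secret and the timings are all assumptions for illustration. The node names must match what `uname -n` prints on each machine.

```shell
#!/bin/sh
# Sketch: generate heartbeat (v1-style) config files into a staging directory.
# Interface, virtual IP, secret and timings are illustrative assumptions.
set -eu

OUT_DIR="${OUT_DIR:-./ha.d-staging}"   # review, then copy to /etc/ha.d
NODE1="jira-ha-1"
NODE2="jira-ha-2"                      # assumed name of the cloned node
VIP="192.168.56.100/24/eth0"           # hypothetical virtual IP/netmask/interface
SECRET="change-me"                     # hypothetical shared secret

mkdir -p "$OUT_DIR"

# ha.cf: cluster membership and failure-detection timings.
cat > "$OUT_DIR/ha.cf" <<EOF
logfacility local0
keepalive 2
deadtime 30
bcast eth0
auto_failback off
node $NODE1
node $NODE2
EOF

# haresources: node1 is the preferred owner of the virtual IP and the
# jira init script; heartbeat starts both on whichever node is active.
cat > "$OUT_DIR/haresources" <<EOF
$NODE1 IPaddr::$VIP jira
EOF

# authkeys: must be identical on both nodes and readable by root only.
cat > "$OUT_DIR/authkeys" <<EOF
auth 1
1 sha1 $SECRET
EOF
chmod 600 "$OUT_DIR/authkeys"
```

The same three files go on both nodes unchanged; heartbeat refuses to start if authkeys is readable by anyone other than its owner.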
and sync up the
Folder (excluding logs)
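A csync2 configuration for this kind of sync might look roughly like the sketch below. The folder name is missing from the notes above, so SYNC_DIR is only a placeholder, and the "log" subdirectory name, the group name and the second node name are assumptions too. The script writes the file into a staging directory rather than straight to /etc/csync2.cfg.

```shell
#!/bin/sh
# Sketch: generate a csync2.cfg that syncs one folder between two nodes
# and skips its log subdirectory. Paths are placeholders, not the
# original author's values.
set -eu

OUT_DIR="${OUT_DIR:-./csync2-staging}"   # review, then copy to /etc/csync2.cfg
SYNC_DIR="/path/to/synced/folder"        # placeholder: folder is not named in the notes
mkdir -p "$OUT_DIR"

cat > "$OUT_DIR/csync2.cfg" <<EOF
group jira
{
    host jira-ha-1 jira-ha-2;
    key /etc/csync2.key;
    include $SYNC_DIR;
    exclude $SYNC_DIR/log;
    auto younger;
}
EOF
```

The shared key can be created once with `csync2 -k /etc/csync2.key` and copied to the other node, and `csync2 -x` then pushes pending changes. How often to run it (cron, or a file watcher) is a separate decision.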
In JIRA, set the site url to be the virtual IP being used by heartbeat (or the DNS alias of the virtual IP)
You should now be able to start playing with the setup. I found the easiest way to test was to disconnect the network cable in the VM settings.
One setting I changed in heartbeat was to turn off auto_failback; however, I just noticed this has been deprecated and replaced with a setting in pacemaker. However you implement this, you don't want the primary server trying to take back control every 10 min if you're getting packet loss, so turn off the failback.
Another change I made was in postgres to only allow connections from the virtual IP to stop the two instances from being able to access it at the same time (they shouldn't try, but who knows).
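One way this restriction could be expressed is with two pg_hba.conf rules, sketched below as a small script that prints them. The database name comes from the setup above; the role name and the virtual IP are hypothetical. It also assumes that connections from the active JIRA node really do arrive with the virtual IP as their source address, which depends on how the node picks its outgoing address and is worth checking in the postgres logs.

```shell
#!/bin/sh
# Sketch: print pg_hba.conf rules that admit the JIRA database user from
# one address only. Role name and IP are hypothetical.
set -eu

DB="jira-ha"             # database created earlier in this guide
DB_USER="jira"           # hypothetical database role
VIP="192.168.56.100"     # hypothetical virtual IP managed by heartbeat

hba_rules() {
    # Rules are matched top to bottom, so the allow line must come first.
    printf 'host\t%s\t%s\t%s/32\tmd5\n' "$DB" "$DB_USER" "$VIP"
    printf 'host\t%s\t%s\t0.0.0.0/0\treject\n' "$DB" "$DB_USER"
}

hba_rules
```

The output would be pasted above any broader rules in pg_hba.conf, followed by a reload of postgres.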
Something I never tried, but thought about, was to run JIRA through Apache + mod_proxy on both boxes. It wouldn't need to be handled by heartbeat and could be configured to display a nice loading page during the failover (because JIRA can take quite a while to start up in some environments).
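An untested sketch of that idea: the script below writes an Apache virtual host that proxies to a local JIRA and falls back to a static page when the proxy cannot reach it. The server name, document root, page name and the backend port 8080 are assumptions, and mod_proxy and mod_proxy_http would need to be enabled.

```shell
#!/bin/sh
# Sketch: generate an Apache vhost that proxies to JIRA and shows a
# holding page while the backend is unreachable. All names are assumed.
set -eu

OUT_DIR="${OUT_DIR:-./apache-staging}"   # review, then copy into Apache's site config
SERVER_NAME="jira.example.com"           # hypothetical DNS alias of the virtual IP
DOC_ROOT="/var/www/jira-failover"        # hypothetical folder holding starting.html
BACKEND="http://127.0.0.1:8080/"         # assumes JIRA listens on port 8080 locally

mkdir -p "$OUT_DIR"
cat > "$OUT_DIR/jira-proxy.conf" <<EOF
<VirtualHost *:80>
    ServerName $SERVER_NAME
    DocumentRoot $DOC_ROOT

    # Shown when the proxy cannot connect to JIRA (HTTP 503).
    ErrorDocument 503 /starting.html

    # Serve the holding page locally; proxy everything else.
    ProxyPass /starting.html !
    ProxyPass / $BACKEND
    ProxyPassReverse / $BACKEND
</VirtualHost>
EOF
```

This only covers the window where nothing is listening on the backend port. Once Tomcat accepts connections, whatever JIRA itself returns during startup is passed through.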
I'd love to hear if anyone else has tried this method in more than just a POC and what other suggestions you have.
Thank you for the guide. Although it is helpful, it is painful to implement and explore, especially when professional resources are scarce. It would be very helpful to have an integrated solution in the product that abstracts all the details away from the admin... One example of such a solution may be to allow duplication of DB writes, or a DB sync on any change issued by the application.
I am not sure this design is correct, given the inconsistencies that could arise when writes to JIRA_HOME differ from the database. It seems to me that JIRA itself would need to be modified to integrate with an HA distributed lock management system such as Apache ZooKeeper (ZK) so that replicated files (including logs) are consistent.
Here ZK is used to a) ensure file updates are transactionally consistent across the cluster, b) provide real-time monitoring rather than polling (since it supports pub/sub), c) provide distributed logging via BookKeeper and d) provide the mechanism to control failover and recovery depending upon the point of failure.
If you are trying to achieve 99.9% or above, detection and failover need to be automatic, as does the degradation procedure of, say, displaying an outage page instead.
Furthermore, in terms of HA, not all components in the stack are created equal: the web tier handles more requests than Tomcat, which in turn handles more than the database. The first thing I have been doing is using a web CDN such as CloudFlare to handle content reliably.
Apart from DB failures the next critical thing would be for JIRA to either a) operate in hot standby mode or b) be able to start in milliseconds (even if in restricted form) rather than slowly as it does now.
From your point of view, would home directory inconsistencies occur with any of the listed file replication methods?
I agree that JIRA does not have the capability to provide full High Availability with automatic failover. However, we wrote the guide to outline what is possible with the current version.
The replication mechanisms described above are not transactional. For example, if a file is uploaded as an attachment, a transactional update would see the file appear on each relevant node of the cluster or not at all, rather than in a batch update afterwards. My concern with non-transactional updates is that you risk corrupting your JIRA instance right at the time you can least afford it (i.e. it is under high load and the server crashes). Maybe I am being paranoid, but I note from the JIRA upgrade procedures: "Inconsistent database backups may not restore correctly".
I think to implement HA with JIRA you should start at the web tier and work back. The reason I bang on about CDNs is that you get a good first step towards HA for $20/month or thereabouts with no risk.
The second step to implementing HA would be to incorporate a Distributed Lock Management (DLM) plugin within JIRA, wrap the JIRA instance with a start/stop monitoring system, and perform the mods needed to prevent the possibility of corrupting one's JIRA instance. The outcome of this step would be to allow a cold standby JIRA instance.
Enabling hot standby would be an obvious next step, with database replication the third. That said, I would not commit resources to this until one had a JIRA Tomcat cluster working properly.
When using CloudFlare, did you have to define any page rules? We're finding that some of the cached content is causing problems, e.g. when searching issues. I've currently got a page rule that bypasses the cache for ALL JIRA content but I want to try and reduce that to a subset.
Focusing on JIRA is fine and a great topic, but high availability should also consider external dependencies.
When JIRA is integrated with external LDAP authentication/authorization, that LDAP server becomes a single point of failure. After experiencing one outage due to the AD node I was using being taken down for unannounced maintenance, I needed to find something to fix that.
Active Directory, a common authN/authZ provider that is accessed through the LDAP protocol, supports mirroring, and most large organisations, like my previous one, have multiple duplicate copies of that data residing on many redundant servers, which is great. What is not great is that JIRA is still not capable of switching to alternate servers if the current one is unavailable. I logged CONF-8867 back in 2007 for this...
To address this, I implemented a quick and dirty (free!) TCP proxy on the JIRA server that proxied out to several of these LDAP mirrors (see comment on issue above) if the primary was unavailable, which worked flawlessly. A more Enterpri$e solution would involve an F5 with TCP load balancing.
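The proxy itself is not shown here, but the "use the primary unless it is down" choice behind it can be sketched in a few lines of shell. This is not the proxy described above, only the backend selection such a proxy would need. The host names in the usage note are hypothetical, and the default probe assumes a netcat that supports -z and -w.

```shell
#!/bin/sh
# Sketch: pick the first reachable LDAP mirror from an ordered list.
# PROBE can be overridden, e.g. with a stub for testing.
set -eu

# Default probe: succeed if a TCP connection to host:port opens within 2s.
probe_tcp() { nc -z -w 2 "$1" "$2"; }
PROBE="${PROBE:-probe_tcp}"

# Print the first host that passes the probe; return 1 if none do.
# Usage: first_reachable PORT HOST...
first_reachable() {
    port="$1"; shift
    for host in "$@"; do
        if "$PROBE" "$host" "$port" >/dev/null 2>&1; then
            printf '%s\n' "$host"
            return 0
        fi
    done
    return 1
}
```

For example, `first_reachable 389 ad1.example.com ad2.example.com` prints the primary while it answers and the mirror otherwise. A relay such as socat or haproxy would then forward the local LDAP port to whichever host was chosen.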