Migration of Hipchat server to Data Center - a retrospective

Background

Our Hipchat Server environment was reliable and performing well, but as it is a significant and growing part of our business, the need to leverage the benefits of Hipchat Data Center was clear - high availability over multiple nodes, and the ability to adjust backend server capacity in line with usage instead of right-sizing for peak load. All the usual good stuff. On top of this came the announcement that Hipchat Server support was being phased out, with teams given the choice of moving to Stride (not possible for us due to a business preference) or Data Center.

When we talked to the Hipchat team at Atlassian Summit US 2017, we learned there was a soft limit of ~ 5M messages for the export process as it stood. At the time, we had approx 4.9M messages in our system and growing daily, so we waited for an updated release that supported the number of messages we had.

Around November 2017, an update was released and we started the process.

Hipchat Server (the source)

  • Hipchat Server 2.29
  • Users: 1,600 from 3 different Crowd directories
  • Messages: 7.4M
  • Rooms: 2,596 active, 93 archived, 448 deleted
  • AWS eu-central-1 m4.2xlarge instance
  • Attachments: 152,772 files totaling 114GB 

Preparation / testing

There's an AWS CloudFormation template (https://confluence.atlassian.com/hipchatdc3/deploy-hipchat-data-center-on-aws-909770910.html) available to configure a Hipchat environment, but I had a few false starts with it (it errored out for reasons I now forget). Given we have specific naming conventions and existing VPCs, and I like to understand as much as possible about an environment we'll be managing, I picked the information out of the CloudFormation template and deployed effectively the same configuration.

Migration Tests

The first few exports were full of good learnings:

  • The export from our current system pegged system resources, and given we're a global business, there's not much opportunity for an outage window. We increased the size of the AWS EC2 instance so we could run test exports while the system was under production load without a significant impact on users
  • While I thought I was careful with the AES password for the export, there was one occasion where, after waiting for an export and transfer, I realized that I didn't have the right password and had to start again. The fix was dropping a small bash script on the node with the password and all other command line parameters in place so I could reliably run the exports multiple times. This script was deployed via Puppet from git, so we had change logging by default
  • To help ensure EFS wasn't a bottleneck (which it has been in the past, given its throughput is dictated by the amount of content it stores - https://docs.aws.amazon.com/efs/latest/ug/performance.html ) we "pre-warmed" EFS with ~ 7TB of dummy files, giving us a throughput of around 800MB/sec
  • The first import failed with an "invalid syntax" error on an attachment. After some digging with Hipchat support, it turned out the filename ended in history.json, so the importer attempted to import it as content, but it wasn't in the right format. I believe this was being coded around by Atlassian, but in the interim, we checked with the user who uploaded the file, confirmed they didn't need it anymore, and deleted it from the source environment so it wouldn't cause any further issues
  • On a few dry runs, we weren't shown the final config screen (to set the hostname/admin account):
    • Issues logging in to admin - our old admin credentials didn't seem to be working (turns out we needed some documentation updated), and given we didn't have email enabled, we couldn't do a password recovery. Instead of DB hacking, a quick edit of /hipchat-scm/web/application/views/emails/reset_password.php to echo $reset_url to the browser got us around that
    • We were then able to log in to admin, but no clients (web or app) were loading. We were using a hipchat-uat.example.org style URL, which didn't match our production hostname. A quick bit of digging on community.atlassian.com helped - users had mentioned similar issues when changing the hostname of a Hipchat system, and the fix was to update the fqdn setting in the psql configurations table. On checking the table (select * from configurations WHERE key = 'fqdn';) there were no results, but after an insert (INSERT into configurations VALUES ('fqdn','hipchat-uat.example.org');) and a service restart we were able to use the clients
    • After a bunch of testing (rooms, memberships, private message history, and most importantly - the custom emoticons!) nearly everything was looking as we needed. However, there were a few (critical) rooms that either didn't load any history, or loaded the most recent history but wouldn't load any more on scrolling back. After a quick exchange, Atlassian Support felt it was a known issue (https://jira.atlassian.com/browse/HCPUB-4550). After applying the workaround, room history loaded as expected
    • The only other discrepancy we noticed was that the message count on the system overview page showed ~ 13M messages, whereas the source showed ~ 7M. After another helpful back and forth with support, they determined that there wasn't a discrepancy in the export/import, just in the Redis-calculated value that is only used on this page. A bit of quick psql querying (count(*) on muc-* and count(*) on private-*) gave us the correct value, which we set using the command below:
      redis-cli -h $hostname -p 6379 -a hipchat set 'messages:total' 7123456
      This number then incremented as expected as more room and private messages were sent
    • The other frustration we had in the admin was that the user search page uses a SQL LIKE on the name field (display name). In PostgreSQL, LIKE is case sensitive, so searching for myself using "craig" showed no results; it had to be "Craig". Given we support numerous sister companies in our environment, being able to search by email domain is also a common need for our admins. After a bit of digging on the file system, we were able to achieve both by editing /hipchat-scm/web/application/models/user.php and changing the line (~ 2604) from
      $query->where("name LIKE '%$search_name%'");
      to 
      $query->where("(name ILIKE '%$search_name%' OR email ILIKE '%$search_name%')");
      Using the above comes with the usual warning that this is not supported by Atlassian, and while it's working for us and we've seen no negative impact from this change, use this at your own risk
  • Part of our testing was to check that all services recovered after system restarts. We noticed a huge delay (> 1 hour) between the OS being up and the application actually starting. On checking running processes, there was a set of chown/chmod/chgrp commands running on the EFS share (/file_share). After a bit of digging on the file system, I found that /opt/atlassian/hipchat/sbin/mount_shared_storage.sh contains 2 sets of commands that set the permissions on this share when the system boots
    • It does it twice - once halfway through the script and then again at the end - with the comment # We do it again in case the permissions where wrong before
    • Given we're confident we can manage the permissions correctly, we commented out the 6 commands. It seems wasteful to do this twice, and given you can set the group using chown user:group, the chown and chgrp could at least be combined into a single command. I mentioned this to Atlassian Support and did notice that in the 3.1.4 AMI this script seems to have changed, with some of the duplicate commands removed
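
The message-count correction described in the list above boils down to summing row counts across the room (muc-*) and private-message (private-*) tables and writing that total into Redis. Below is a minimal sketch of that arithmetic. The table names and per-table counts are made up for illustration (the real values came from count(*) queries in psql), the Redis hostname is a placeholder, and the function only builds the redis-cli argument list rather than running it.

```python
from fnmatch import fnmatch

# Hypothetical per-table row counts, e.g. gathered with SELECT count(*) in psql.
# Table names and numbers are illustrative only, not taken from a real system.
TABLE_COUNTS = {
    "muc-2018.02": 3_000_000,
    "muc-2018.03": 2_500_000,
    "private-2018.02": 1_000_000,
    "private-2018.03": 623_456,
    "users": 1_600,  # not a message table, so it must be ignored
}


def total_messages(table_counts):
    """Sum counts for tables matching muc-* (rooms) or private-* (1-1 chats)."""
    return sum(
        count
        for name, count in table_counts.items()
        if fnmatch(name, "muc-*") or fnmatch(name, "private-*")
    )


def redis_set_command(hostname, total, password="hipchat", port=6379):
    """Build (but do not run) the redis-cli call as an argument list."""
    return [
        "redis-cli", "-h", hostname, "-p", str(port), "-a", password,
        "set", "messages:total", str(total),
    ]


total = total_messages(TABLE_COUNTS)
command = redis_set_command("redis.example.org", total)  # placeholder hostname
print(total)
print(" ".join(command))
```

Keeping the command as a list makes it easy to eyeball before passing it to something like subprocess.run, which seems prudent for a value that is written straight into a production datastore.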

The (semi-unplanned) production move

While we were planning to action the production move over the next few weekends, after communicating with the business, there was an issue with our Hipchat Server environment. At some stage in the past, the system had updated elasticsearch to an unsupported version, so we had to manually install 1.5.2 and downgrade the indexes to work on 1.5.2 again. It seems something to do with this fragility then decided it wasn't happy anymore: while all the dependent systems reported as healthy and the admin system worked, regardless of what we tried, the chat client (web/mobile) just wouldn't load. Instead of cobbling the server environment back together, the call was made to bite the bullet and action the migration.

Import server

  • Hipchat Data Center 3.1.3
  • AWS eu-central-1 m4.10xlarge (made it big so there was less chance of resource limits slowing down the import)
  • EFS with ~ 7TB of "pre-warmed" file allocation
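
As a rough illustration of why pre-warming helps, the sketch below estimates EFS bursting-mode throughput from the amount of data stored. The per-TiB rates are assumptions based on how AWS documented bursting mode (a 50 MiB/s baseline and 100 MiB/s burst per TiB stored, with a 100 MiB/s burst floor), so check them against the EFS performance page before relying on the numbers.

```python
# Assumed EFS bursting-mode rates in MiB/s per TiB stored. These are
# assumptions for illustration; verify against the AWS EFS documentation.
BASELINE_MIBPS_PER_TIB = 50
BURST_MIBPS_PER_TIB = 100
MIN_BURST_MIBPS = 100  # assumed burst floor for small file systems


def efs_throughput(stored_tib):
    """Return (baseline, burst) throughput in MiB/s for a given stored size."""
    baseline = stored_tib * BASELINE_MIBPS_PER_TIB
    burst = max(MIN_BURST_MIBPS, stored_tib * BURST_MIBPS_PER_TIB)
    return baseline, burst


def dummy_data_needed_tib(target_burst_mibps, already_stored_tib=0.0):
    """Estimate how much filler data is needed to reach a target burst rate."""
    needed = target_burst_mibps / BURST_MIBPS_PER_TIB
    return max(0.0, needed - already_stored_tib)


for label, tib in [("attachments only (~0.1 TiB)", 0.1), ("pre-warmed (~7 TiB)", 7)]:
    baseline, burst = efs_throughput(tib)
    print(f"{label}: baseline {baseline:g} MiB/s, burst {burst:g} MiB/s")
```

Under these assumed rates, a share holding only the attachments would sit near the bottom of the throughput range, which is the motivation for padding it with dummy files before the import.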

A summary of the timeline is below:

Export - 2 Hours 7 minutes

  1. 2018/03/25 00:51:24 Fetching users
  2. 2018/03/25 00:51:25 Fetching rooms
  3. 2018/03/25 00:51:25 Fetching autojoins
  4. 2018/03/25 00:51:25 Fetching preferences
  5. 2018/03/25 00:51:34 Fetching emoticons
  6. 2018/03/25 00:51:34 Fetching upload files
  7. 2018/03/25 00:51:35 Fetching messages
  8. 2018/03/25 00:55:47 Creating export artifact
  9. 2018/03/25 00:56:12 Writing rooms.json
  10. 2018/03/25 00:56:12 Writing users/###/history.json
  11. 2018/03/25 01:17:45 Exporting sender attachments
  12. 2018/03/25 02:58:45 ======== *** Export completed successfully! *** ========

Exported file was 101GB

Longest phase: File export - 1 hour 41 minutes
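
The phase timings can be derived mechanically from the log rather than by eye. The sketch below parses the export log lines shown above and treats each phase as lasting until the next log line appears. That is an assumption about how the tool logs (one line at the start of each phase), not something the tool documents.

```python
from datetime import datetime

EXPORT_LOG = """\
2018/03/25 00:51:24 Fetching users
2018/03/25 00:51:25 Fetching rooms
2018/03/25 00:51:25 Fetching autojoins
2018/03/25 00:51:25 Fetching preferences
2018/03/25 00:51:34 Fetching emoticons
2018/03/25 00:51:34 Fetching upload files
2018/03/25 00:51:35 Fetching messages
2018/03/25 00:55:47 Creating export artifact
2018/03/25 00:56:12 Writing rooms.json
2018/03/25 00:56:12 Writing users/###/history.json
2018/03/25 01:17:45 Exporting sender attachments
2018/03/25 02:58:45 ======== *** Export completed successfully! *** ========
"""


def parse_log(text):
    """Split each line into (timestamp, message); the timestamp is 19 chars."""
    entries = []
    for line in text.strip().splitlines():
        when = datetime.strptime(line[:19], "%Y/%m/%d %H:%M:%S")
        entries.append((when, line[20:].strip()))
    return entries


def phase_durations(entries):
    """Assume each phase runs until the next log line is written."""
    return [
        (message, following - started)
        for (started, message), (following, _) in zip(entries, entries[1:])
    ]


entries = parse_log(EXPORT_LOG)
phases = phase_durations(entries)
longest = max(phases, key=lambda phase: phase[1])
total = entries[-1][0] - entries[0][0]

print("total:", total)
print("longest:", longest[0], longest[1])
```

Running the same thing against dry-run logs would give a per-phase timing table ahead of the real migration.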

Transfer - 1 Hour 15 minutes

  1. 1 hr 15 mins to move the file from source to destination. In hindsight, given these nodes are in the same VPC (and assuming the same Availability Zone, which EBS requires), I could have created an EBS volume, attached it to the source, exported to it, detached it, attached it to the destination and saved ~ 1 hr 10 minutes.
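
For a sense of scale, the effective transfer rate works out as below. This is simple arithmetic on the figures above, assuming decimal units (1 GB = 1000 MB) and that the whole 1 hr 15 mins was spent copying.

```python
def effective_rate_mb_per_s(size_gb, hours, minutes):
    """Average transfer rate in MB/s, using decimal units (1 GB = 1000 MB)."""
    seconds = hours * 3600 + minutes * 60
    return size_gb * 1000 / seconds


rate = effective_rate_mb_per_s(101, 1, 15)
print(f"{rate:.1f} MB/s, roughly {rate * 8:.0f} Mbit/s")
```

That is well below what instances of this size can usually push over the network, so the copy method was likely the limiting factor rather than the link.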

(There's a gap in the log times here as a few things happened - namely sleep. I then deviated from the tested UAT setup and spun up a 3.1.4 DC node as it was a newer version, but the import tool errored saying it wasn't a supported version, so I had to deploy a 3.1.3 node.)

Import - 5 Hours 16 minutes

  1. 2018/03/25 12:41:17 Import Started
  2. 2018/03/25 12:41:17 Truncating data in database
  3. 2018/03/25 12:41:17 Cleaning room messages for YYYY.MM
  4. 2018/03/25 12:41:25 Cleaning 1-1 messages for YYYY.MM
  5. 2018/03/25 12:41:35 Importing data
  6. 2018/03/25 12:41:47 Inserting uid=XX
  7. 2018/03/25 12:41:47 Created: first.last@example.org with id XX:
  8. 2018/03/25 12:41:58 Importing avatar photo users/XX/avatars/filename.png
  9. 2018/03/25 12:43:46 Importing autojoin pref:autoJoin:XX
  10. 2018/03/25 12:43:46 == Extracting and processing preferences.json
  11. 2018/03/25 12:44:08 == Extracting and processing emoticons.json
  12. 2018/03/25 12:45:51 == Extracting and processing rooms.json
  13. 2018/03/25 12:45:51 == Extracting and processing users/XX/history.json
  14. 2018/03/25 13:09:09 == Extracting and processing users/files/XX/filename
  15. 2018/03/25 16:43:27 == Extracting and processing rooms/XX/history.json
  16. 2018/03/25 17:57:24 ======== *** Import completed successfully! *** ========

Lengthy steps:

  • User history - 25 minutes
  • Attachments - 3 hours 30 minutes
  • Room history - 1 hour 15 minutes

CPU/RAM were again not showing as a bottleneck, but network throughput (to EFS) went to 8MB/sec and flatlined during the attachment phase
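
The ~ 8MB/sec flatline can be cross-checked against the log. The sketch below takes the start of the attachment phase (13:09:09) and the start of the room history phase (16:43:27) from the import log, and assumes all 114GB / 152,772 source attachments were written in that window, using decimal units. It is an average over the whole phase, not a measurement.

```python
from datetime import datetime

FMT = "%Y/%m/%d %H:%M:%S"

# Timestamps from the import log: attachment phase start and the next phase.
attachments_start = datetime.strptime("2018/03/25 13:09:09", FMT)
attachments_end = datetime.strptime("2018/03/25 16:43:27", FMT)

# Source-side figures; assumed to match what was written during the import.
ATTACHMENT_GB = 114
ATTACHMENT_FILES = 152_772


def average_rates(start, end, size_gb, file_count):
    """Return (MB/s, files/s) averaged over the phase, 1 GB = 1000 MB."""
    seconds = (end - start).total_seconds()
    return size_gb * 1000 / seconds, file_count / seconds


mb_per_s, files_per_s = average_rates(
    attachments_start, attachments_end, ATTACHMENT_GB, ATTACHMENT_FILES
)
print(f"{mb_per_s:.1f} MB/s, {files_per_s:.1f} files/s")
```

The average lands close to the flatlined network figure, which suggests (though doesn't prove) that the attachment phase ran at a steady per-file pace rather than being limited by the file system's burst throughput.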

Post config

  • Once the import finishes, send the HUP command
  • Apply the suggested fix from https://jira.atlassian.com/browse/HCPUB-4550 (a handful of rooms weren't scrolling back in history)
  • Go to web UI (https://hipchat.example.org)
  • Complete configuration (load balancer hostname, admin account)
  • Configure Crowd directories
  • Complete smoke-tests
  • Update DNS for production hostname to point to new ELB

Note

While this is documented, it's worth highlighting - the following components ARE NOT included in the export/import process and need to be re-applied

  • API Keys
  • Integrations

Summary

While business impact was minimal in the end, it was still far from ideal to need to rush a migration. Fortunately we'd been able to complete enough testing that we had confidence that the move would (should) be successful. It would have been helpful to measure and document the order of the steps and how long each took during our dry runs, as there was a lot of uncertainty about when the next human action would be needed.

The most frustrating part is that the entire export process is an all-or-nothing action, meaning you're unable to minimize the migration outage window by preparing the destination system as much as possible ahead of time. Out of the almost 8 hours and 45 minutes of process, 1 hr 40 is exporting attachments, 1 hr 15 is transferring a 100GB file (of which > 95% is attachments) and 3 hrs 30 is importing attachments, which leaves only about 2 hrs 15 mins of "other". When we needed to move our Hipchat Server environment from on-prem to AWS, Atlassian support provided a bash script that exported and tar'd the data, plus an import script that un-tar'd it and did what it needed to do. We were able to significantly reduce the time that process took (from ~ 36 hours to ~ 2 hours) by using rsync to pre-sync the attachment directory between the two nodes, so during the final migration we just needed to handle the diff, not the whole attachment transfer. The Data Center export/import tool appears to be a Go binary, and short of reverse engineering it, that approach wasn't going to work.
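
The time budget behind that frustration can be laid out explicitly. This is a sketch using the stage totals from the timeline headings and the rounded attachment figures quoted above. Treating the whole transfer as attachment work is a simplification, given the archive is > 95% attachments.

```python
from datetime import timedelta


def hm(hours, minutes):
    """Shorthand for building a timedelta from hours and minutes."""
    return timedelta(hours=hours, minutes=minutes)


# Stage totals from the timeline headings.
stages = {"export": hm(2, 7), "transfer": hm(1, 15), "import": hm(5, 16)}

# Attachment-related portions, using the rounded figures from the summary.
attachment_work = {
    "export attachments": hm(1, 40),
    "transfer archive": hm(1, 15),
    "import attachments": hm(3, 30),
}

total = sum(stages.values(), timedelta())
attachments = sum(attachment_work.values(), timedelta())
other = total - attachments
attachment_share = attachments / total  # timedelta / timedelta gives a float

print("total:", total)
print("attachments:", attachments)
print("other:", other)
print(f"attachment share: {attachment_share:.0%}")
```

Roughly three quarters of the window is attachment handling, which is the portion a pre-sync (or an option to skip attachments in the tool) could move outside the outage.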

It's been less than 24 hours since the production move completed, but all systems are humming along nicely.

A big thanks to the Atlassian Hipchat/support team - notably @Kent Baxley, who has been patiently assisting with the various issues and checking in with us to see how things were progressing over the last few months.

2 comments

Great notes, thanks

Truly impressive documentation, thanks so much for your patience, good suggestions/improvements sprinkled throughout (badpokerface) and I like the idea of a "--skipattachments" if a customer wants to migrate those by hand themselves.
