While our Hipchat Server environment was reliable and performing well as a significant and growing part of our business, the need to leverage the benefits of Hipchat Data Center was clear: high availability over multiple nodes, and the ability to adjust our backend server capacity in line with usage instead of right-sizing for peak load. All the usual good stuff. On top of this, Atlassian announced that Hipchat Server support was being phased out, with teams given the choice to move to Stride (not possible for us due to a business preference) or Data Center.
When we talked to the Hipchat team at Atlassian Summit US 2017, there was a soft limit of roughly 5M messages for the then-current export process. At the time, we had approximately 4.9M messages in our system and were growing daily, so we waited for an updated release from the team that supported the number of messages we had.
Around November 2017, an update was released and we started the process.
There's an AWS CloudFormation template (https://confluence.atlassian.com/hipchatdc3/deploy-hipchat-data-center-on-aws-909770910.html) available to configure a Hipchat environment, but I had a few false starts with it (it errored out for reasons I now forget). Given that we have specific naming conventions and existing VPCs, and that I also like to understand as much as possible about the environment we'll then be managing, I picked the information out of the CloudFormation template and deployed effectively the same configuration by hand.
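For anyone taking the same approach, one way to pick the pieces out of a downloaded template is to enumerate the resource types it declares and replicate them by hand in your existing VPC. A minimal sketch; the template snippet and resource names below are illustrative stand-ins, not the actual Atlassian template:

```shell
# Write a cut-down, hypothetical template locally (in practice you'd
# download the real template from the Atlassian documentation page).
cat > /tmp/hipchat-dc-template.json <<'EOF'
{
  "Resources": {
    "HipchatNode": {"Type": "AWS::EC2::Instance"},
    "LoadBalancer": {"Type": "AWS::ElasticLoadBalancing::LoadBalancer"},
    "SharedStorage": {"Type": "AWS::EFS::FileSystem"}
  }
}
EOF

# List each logical resource and its type, giving a checklist of what
# needs to be provisioned manually.
python3 -c "
import json
tpl = json.load(open('/tmp/hipchat-dc-template.json'))
for name, res in sorted(tpl['Resources'].items()):
    print(name + ': ' + res['Type'])
"
```

This keeps the CloudFormation template as documentation of record while letting you apply your own naming conventions and reuse existing networking.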
Migration Tests
The first few exports were full of good learnings:
```shell
redis-cli -h $hostname -p 6379 -a hipchat set 'messages:total' 7123456
```

This number then incremented as expected as more room and private messages were sent.
```php
$query->where(
    "name LIKE '%$search_name%'"
);
```

to

```php
$query->where(
    "(name ILIKE '%$search_name%' OR email ILIKE '%$search_name%')"
);
```
Using the above comes with the usual warning that this is not supported by Atlassian; while it's working for us and we've seen no negative impact from the change, use it at your own risk.

While we were planning to action the production move over one of the next few weekends after communicating with the business, there was an issue with our Hipchat Server environment. At some stage in the past, the system had updated Elasticsearch to an unsupported version, so we'd had to manually install 1.5.2 and downgrade the indexes to work on 1.5.2 again. Something to do with this fragility then decided it wasn't happy anymore: although all the dependent systems reported as healthy and the admin system worked, the chat client (web/mobile) just wouldn't load, regardless of what we tried. Instead of cobbling the server environment back together, the call was made to bite the bullet and action the migration.
Import server
A summary of the timeline is below:
```
2018/03/25 00:51:24 Fetching users
2018/03/25 00:51:25 Fetching rooms
2018/03/25 00:51:25 Fetching autojoins
2018/03/25 00:51:25 Fetching preferences
2018/03/25 00:51:34 Fetching emoticons
2018/03/25 02:58:45 ======== *** Export completed successfully! *** ========
```
The exported file was 101GB.
Longest phase: file export, at 1 hour 41 minutes.
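As a rough consistency check on those two figures (and assuming the 101GB archive was written mostly during that file-export phase), the sustained write rate works out with a little shell arithmetic:

```shell
# 101GB archive written over the 1h41m file-export phase (figures from
# the export log above); integer arithmetic is close enough here.
size_mb=$((101 * 1024))         # archive size in MB
secs=$(( (1 * 60 + 41) * 60 ))  # 1h41m in seconds
echo "$((size_mb / secs)) MB/sec sustained"
```

That comes out around 17MB/sec, which suggests disk or single-stream throughput, not CPU, was the limiting factor during export.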
(There's a gap in the log times here as a few things happened - namely sleep, but I also deviated from the tested UAT process and spun up a 3.1.4 DC node as it was a newer version. The import tool then errored saying it wasn't a supported version, so I had to deploy a 3.1.3 node.)
```
2018/03/25 12:41:47 Inserting uid=XX
2018/03/25 12:41:47 Created: first.last@example.org with id XX
2018/03/25 12:43:46 == Extracting and processing preferences.json
2018/03/25 12:44:08 == Extracting and processing emoticons.json
2018/03/25 12:45:51 == Extracting and processing rooms.json
2018/03/25 12:45:51 == Extracting and processing users/XX/history.json
2018/03/25 16:43:27 == Extracting and processing rooms/XX/history.json
2018/03/25 17:57:24 ======== *** Import completed successfully! *** ========
```
Lengthy steps:
CPU/RAM again weren't the bottleneck, but network throughput (to EFS) climbed to 8MB/sec and flatlined there during the attachment phase.
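That flatline accounts for most of the import wall-clock time. A back-of-the-envelope check, assuming roughly 96GB of attachments (over 95% of the 101GB export, per the breakdown below):

```shell
# Approximate attachment payload pushed to EFS at the observed rate.
attach_mb=$((96 * 1024))     # ~96GB of attachments, an approximation
rate=8                       # observed MB/sec to EFS
secs=$((attach_mb / rate))
echo "~$((secs / 3600))h $(((secs % 3600) / 60))m"
```

That lands a little under three and a half hours, which lines up with the long rooms/users history-and-attachment stretch in the import log above - the EFS write rate, not the import tool itself, set the floor on the outage window.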
While this is documented, it's worth highlighting: the following components are NOT included in the export/import process and need to be re-applied.
While business impact was minimal in the end, it was still far from ideal to have to rush a migration. Fortunately we'd completed enough testing to be confident the move would (should) be successful. It would have been helpful to measure and document the order of the steps and how long each took during our dry runs, as there was a lot of not knowing when the next human action would be needed.
The most frustrating part is that the entire export process is all or nothing, meaning you're unable to minimize the outage window by preparing the destination system as much as possible in advance. Out of the almost 8 hours 45 minutes of process, 1 hour 40 minutes is exporting attachments, 1 hour 15 minutes is transferring a 101GB file (of which over 95% is attachments) and 3 hours 30 minutes is importing attachments - that leaves only about 2 hours 15 minutes of "other". When we needed to move our Hipchat Server environment from on-prem to AWS, Atlassian support provided a bash script that exported and tar'd on one side, then un-tar'd and imported on the other. We were able to significantly reduce the time that process took (from ~36 hours to ~2 hours) by using rsync to pre-sync the attachment directory between the two nodes, so the final migration only had to handle the diff rather than the whole attachment transfer. The Data Center export/import tool appears to be a Go binary, and short of reverse engineering it, that approach wasn't going to work.
It's been less than 24 hours since the production move completed, but all systems are humming along nicely.
A big thanks to the Atlassian Hipchat/support team - notably @Kent Baxley who has been patiently assisting with the various issues, and checking in with us to see how things were progressing for the last few months.