Hi all,
First off, the changes we made in November are already bearing fruit, resulting in a two-month streak with no incidents. Yay!
However, Jira Product Discovery was down for 6.5 hours on 4 January.
What was the impact?
No one was able to access their projects during that time, and no updates were posted on the Statuspage.
What happened?
A combination of things:
- Atlassian has a deployment freeze over the end-of-year break, and deployments restarted automatically on 4 January. The changes had piled up during the freeze, and the front-end was deployed automatically while the back-end changes it depended on were not yet in production. And kaboom.
- Because the product is still in beta with a small team (with members in the US and Europe), we do not yet have on-call support outside office hours. Until now, every incident we faced followed a back-end deployment, so we optimized for having people available to watch production for a few hours after each one. This time the deployment happened at the worst possible time, when everyone was asleep, so the team did not learn of the incident until 6 hours after it started. Once we knew, resolving it took only a few minutes.
What are we changing?
We are taking this very seriously, and we are working both to address the root cause of the incident and to improve our incident response:
- To address the root cause, we're looking at how to prevent out-of-sequence deployments, so that a front-end change cannot go live before the back-end changes it needs.
- To improve the incident response, we're implementing on-call support outside office hours over the next few weeks.
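As an illustration only (this is a sketch, not our actual pipeline), an out-of-sequence check could look roughly like this: each front-end release declares the oldest back-end version it works with, and the deployment is blocked until production runs at least that version. The names `FrontendRelease`, `min_backend_version`, and `can_deploy` are hypothetical.

```python
# Hypothetical pre-deployment gate. Every name here is an illustrative
# assumption, not Atlassian's actual tooling.
from dataclasses import dataclass


def parse_version(text: str) -> tuple[int, ...]:
    """Turn a dotted version such as '2.14.0' into a comparable tuple."""
    return tuple(int(part) for part in text.split("."))


@dataclass(frozen=True)
class FrontendRelease:
    version: str
    min_backend_version: str  # oldest back-end this front-end build can talk to


def can_deploy(release: FrontendRelease, backend_in_production: str) -> bool:
    """Allow the front-end deploy only if production already runs a new enough back-end."""
    return parse_version(backend_in_production) >= parse_version(release.min_backend_version)


release = FrontendRelease(version="5.3.0", min_backend_version="2.14.0")
print(can_deploy(release, backend_in_production="2.13.9"))  # back-end too old: blocked
print(can_deploy(release, backend_in_production="2.14.0"))  # dependency met: allowed
```

With a gate like this, a front-end build queued during a freeze would simply wait until its back-end dependency has shipped, instead of going out first.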
Thank you for your continued support!