This blog was written by Bitbucket's Head of Engineering, Robert Krohn.
Last October, I published a blog post describing the efforts we've committed to on the Bitbucket Cloud engineering team to achieve world-class reliability. A lot has happened in the past year (understatement of the year)! What the team has accomplished is tremendous, but we've also learned a thing or two about where we can still improve. In this post, I'd like to address our recent reliability issues and the lessons we've learned from them, and provide an update on some of the performance work we've done over the last 12 months.
Living up to one of Atlassian's values, open company no bullsh*t, we want to lift the curtain and provide an overview of the reliability issues we saw in October. We strive for 99.9% availability across the board, but we have not lived up to that goal consistently. The team and I know all too well that even if our services are available the vast majority of the time, thirty minutes of degraded performance can be incredibly disruptive, especially if it occurs during your team's core working hours.
Over the past few weeks, we've had several incidents that may have impacted your teams. These incidents have highlighted that there is still plenty of room for us to grow. One incident in particular lasted for 11 hours, and I want to share with you a little about what happened.
On the morning of October 6, automated alerts started notifying our engineering teams that something was wrong, including increased memory usage on some hosts, elevated error rates, and extended end-to-end delivery times for outbound webhooks. The incident response team quickly identified that many of our queues responsible for managing background tasks were backing up, with worker processes failing to process tasks quickly enough.
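To make those symptoms concrete, here is a minimal, hypothetical sketch of the kind of check that fired: it flags backed-up task queues and slow end-to-end webhook delivery. The thresholds, queue names, and numbers are purely illustrative, not our actual monitoring configuration.

```python
# Hypothetical thresholds for illustration only; the real alerting lives in
# our monitoring stack, not in application code like this.
MAX_QUEUE_DEPTH = 10_000
MAX_WEBHOOK_DELIVERY_SECONDS = 60


def check_background_health(queue_depths, webhook_p95_seconds):
    """Return alert messages for backed-up queues and slow end-to-end
    webhook delivery, the two symptoms described above."""
    alerts = []
    for queue, depth in queue_depths.items():
        if depth > MAX_QUEUE_DEPTH:
            alerts.append(f"queue '{queue}' backed up: {depth} pending tasks")
    if webhook_p95_seconds > MAX_WEBHOOK_DELIVERY_SECONDS:
        alerts.append(f"webhook delivery p95 at {webhook_p95_seconds}s")
    return alerts


# Example with made-up numbers: both conditions trip.
print(check_background_health({"webhooks": 42_000, "merges": 1_200}, 300))
```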
We were able to mitigate customer impact by making configuration changes to ease the pressure on our queuing infrastructure and restarting many of the worker processes that were failing to keep up with load, noting that some had completely run out of available memory. These changes led to some short-term improvement, but before long the same issues resurfaced: hosts were running out of memory, worker processes were dying, and queues were growing. As a result, many of our background processes, such as webhooks and merging pull requests, were failing or timing out.
Under normal circumstances, the team would have quickly rolled back to the prior release as a precautionary measure, to rule out a code change as the culprit for the incident. We did not pursue that option as quickly as we should have in this case, for a few reasons.
For a period of time, much longer than any of us would have liked, this put us in a state where we were still trying to identify the root cause of the problem, and we didn't have a reliable mitigation for the issue other than restarting worker processes, which would only buy us a limited amount of breathing room before the problem came back. During this time, the team worked on parallel streams: automating the process of safely restarting workers every 1-2 hours while continuing to investigate the root cause.
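As an illustration of that stopgap, here is a minimal sketch of what periodically restarting a worker fleet might look like. The systemd unit name and the interval are hypothetical, not our actual tooling; the point is simply that this kind of blunt automation buys breathing room while the investigation continues.

```python
import subprocess
import time

WORKER_UNIT = "background-workers.service"  # hypothetical systemd unit name
RESTART_INTERVAL_SECONDS = 90 * 60          # roughly every 1-2 hours


def restart_workers_forever():
    """Blunt stopgap: periodically restart the worker fleet to reclaim
    memory before hosts are exhausted, while the root cause is chased."""
    while True:
        subprocess.run(
            ["systemctl", "restart", WORKER_UNIT],
            check=False,  # keep the loop alive even if one restart fails
        )
        time.sleep(RESTART_INTERVAL_SECONDS)


if __name__ == "__main__":
    restart_workers_forever()
```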
A breakthrough came when we discovered that the cause of the increased memory pressure on the background worker servers was a proliferation of orphaned worker processes: processes that had become detached from their parent process and continued consuming memory but stopped processing tasks. A second breakthrough was the realization that the parent processes for these tasks were dying as a result of the SIGABRT signal. While the team continued to investigate the cause of that signal, this gave us a specific set of conditions we could detect and respond to in a more surgical way: instead of blindly restarting worker processes, we could react to the SIGABRT by terminating the orphaned processes it left behind.
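To illustrate that more surgical response, here is a rough sketch of an orphan reaper using psutil. The worker process name is hypothetical, and treating PPID 1 as the orphan signal is an assumption for the sake of the example (it breaks down in containers, and our actual detection keyed off the SIGABRT conditions described above).

```python
import psutil  # assumed available: pip install psutil

WORKER_NAME = "bg-worker"  # hypothetical worker process name


def reap_orphaned_workers():
    """Terminate worker processes that have been reparented to init
    (PPID 1), i.e. their parent died and they are no longer being fed
    tasks but still hold on to memory."""
    for proc in psutil.process_iter(["pid", "ppid", "name"]):
        info = proc.info
        if info["name"] == WORKER_NAME and info["ppid"] == 1:
            try:
                proc.terminate()            # polite SIGTERM first
                proc.wait(timeout=10)
            except psutil.TimeoutExpired:
                proc.kill()                 # escalate to SIGKILL
            except (psutil.NoSuchProcess, psutil.AccessDenied):
                pass                        # already gone, or not ours to kill


if __name__ == "__main__":
    reap_orphaned_workers()
```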
Putting this automation in place ensured that our customers were protected from the effects of this bug, whatever it might be, until we found it. When we did find it, what we discovered turned out to be a painfully ironic plot twist.
One of the most powerful tools we've had to maintain reliability without sacrificing our teams' velocity has been the use of feature flags to roll out changes safely and incrementally. The irony of this incident is that it was the use of a feature flag that caused the SIGABRT, which in turn caused so many ripple effects throughout our systems. Specifically, the problematic code checked a feature flag inside a core logger. Because feature flagging inevitably depends on periodic calls to an external system (in our case, a Memcached server), every feature flag check carries a chance of logging an error due to transient issues with that system. Checking a feature flag within a core logging module is therefore a layering violation, and it caused unwanted effects such as infinite recursion and overflow errors.
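To show why that layering violation is dangerous, here is a minimal, hypothetical Python sketch (not our actual code) of a feature flag check inside a logging handler feeding back into itself until the interpreter gives up:

```python
import logging


class FlagClient:
    """Stand-in for a flag client backed by an external store such as
    Memcached. A transient backend failure gets logged as an error."""

    def is_enabled(self, flag_name):
        # The fateful log call: checking the flag can itself log.
        logging.getLogger("app").error("flag backend unreachable")
        return False


flags = FlagClient()


class FlagCheckingHandler(logging.Handler):
    """Core log handler that consults a feature flag on every emit().
    Emitting a record can trigger another log call from the flag client,
    which re-enters emit() and recurses."""

    def emit(self, record):
        if flags.is_enabled("new-log-format"):  # may log -> re-enters emit()
            pass  # hypothetical new formatting path
        # normal handling would continue here


logger = logging.getLogger("app")
logger.addHandler(FlagCheckingHandler())

try:
    logger.error("first error")  # kicks off the recursive chain
except RecursionError:
    print("the flag check inside the logging path fed back into itself")
```

The remedy in a sketch like this is to keep the logging path free of calls that can themselves log, or at the very least to guard the handler against re-entry.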
Once the team was finally able to identify this change, we quickly reverted it and deployed the fix to our production servers. In a short time the team was able to validate that the problem had been resolved and the incident was closed internally.
We stand by feature flags as a best practice for development: they generally help teams avoid buggy product releases, reduce risk, and take a more experimentation-oriented approach to software development. It is my hope that by sharing lessons learned (in this case, the importance of accepting that checking feature flags does not always reduce risk, but can sometimes increase it) we can help our customers and other software teams avoid making the same mistakes we did.
Some of the outcomes from our post-incident review for this and other recent incidents include:
The good news is that we have invested in performance as well as reliability improvements over the past 12 months, and we will continue to make improvements moving forward. Since October 2019, here are some of the improvements we have made:
While the team has done far more over the past year than I have shared here (this post could easily have been 10x longer!), hopefully this has been an interesting glimpse into the reliability and performance work we've been doing and will continue to do for Bitbucket Cloud. I look forward to our engineers sharing more progress and exciting new developments as we work to make Bitbucket faster, safer, and more reliable every day.
I wanted to share the details of this recent incident for a few reasons. First, I want Bitbucket's customers to trust us, and I know that to earn your trust we need to be transparent about incidents like these when they happen. Lastly, these details highlight the fact that no matter how far we've come, there will always be lessons to learn and further improvements to make. Rest assured, we will remain vigilant.
Security and reliability are top priorities for Atlassian. We will continue making investments in these areas until incidents like the one I described above are a thing of the past, and bring you along on the journey as we do it.
-Robert Krohn