This blog was written by Bitbucket's Head of Engineering, Daniel Tao, and is also published to the Bitbucket Cloud blog.
The past week has been a turbulent time for Bitbucket Cloud’s engineering and support teams, as well as for our customers. Some of you have expressed concern regarding our services' performance and reliability in recent days. Fortunately, for the majority of Bitbucket users, our services have continued to perform smoothly. But given the number of developers who use Bitbucket daily, even a small percentage of our active users represents a lot of people.
The purpose of this article is to provide some answers: what’s going on, why we’re having these issues, and what you should expect moving forward.
The short of it is this: we’re nearing completion on a year-long project to move to an entirely new platform, which will be good for Bitbucket’s security and reliability moving forward. The move has contributed to recent performance issues, which we are actively addressing and which we expect to be resolved over the next few days:
Merging pull requests takes longer than it used to. While we are always investing in performance enhancements and may see improvement in this area, it is important to realize that merges happen asynchronously in the background and so you do not need to wait for a pull request to be merged. If you navigate away, it will finish while you are doing other things. We’re updating the UX to make this clear.
Rendering pull request diffs is currently slower and sometimes timing out. We are actively working on fixing this and expect the issue to be largely resolved for all customers very soon—within days, not weeks.
For over a decade, the majority of Bitbucket’s services have been hosted in a data center. While this has served us well for many years, operating a data center comes with significant overhead as well as risk. For example, when we have had unexpected capacity issues (e.g. hardware failures or unplanned outages in upstream services), we have been limited by the physical servers we had available, impacting our time to recovery.
Over the past year, we have been on a journey to migrate all of Bitbucket Cloud to Micros, Atlassian’s internal cloud platform based on AWS. This is truly a quantum leap for Bitbucket Cloud and will resolve many reliability issues including the one described above and more. Of course, that doesn’t mean the move has been easy; if it were, we would have done it a long time ago.
While we’ve made tremendous progress behind the scenes over the past 12 months, recent performance degradation for some of our larger customers has highlighted significant gaps in our execution, which we are actively working to close.
Perhaps the most significant challenge our teams have faced on this project is file system latency. For 10+ years, Bitbucket’s services have operated with ultra low-latency access to physical file system servers located within the same facility. This allowed assumptions about near-instant file access to become baked into how many features were implemented, from page rendering to API serialization to caching data. Migrating to a new environment where our compute resources and file storage are virtual poses a trade-off. It is very good for our scalability, as we can dynamically increase capacity as needed; however, it comes with a performance cost, as the latency between the servers running our application code and the servers storing repository data is substantially increased.
We did anticipate this increased latency and have made many changes to minimize its impact. These changes include the elimination and refactoring of many code paths that previously required file system access, the introduction of new hooks to automatically optimize repositories whenever they’re changed, widespread caching of Git objects and other data that requires file system access, and offloading the storage of common data to S3. These efforts and more have been the key to successfully migrating our production traffic to Micros without seeing a steep increase in response times. But as the past week has made clear, there is still work to do.
As we planned for this migration, our engineering teams prioritized their efforts to ensure that Bitbucket continued to be responsive while supporting all operations that users typically perform with high frequency throughout the day: visiting the dashboard, browsing source code, reviewing pull requests, etc. Our focus was primarily on read operations since these occur with much greater volume than write operations.
In contrast, we accepted that there would be increased latency for writes—e.g. creating repositories, committing code, creating pull requests—reasoning that these represent actions a typical user performs far less frequently than reads.
The case of merging pull requests is clearly one that warranted more careful consideration. After we started receiving support cases and hearing from users that merges were taking too long, we recognized a gap in our UX (user experience) that had gone largely unnoticed: users perceived merges as a synchronous operation, requiring them to wait for the merge to complete and a banner to display on the page, when in actuality they are asynchronous (happening in the background).
When you click the Merge button, this adds a message to a queue. A background worker process later takes this message from the queue and performs the actual merge. While this is happening, the front end polls Bitbucket’s API for the status of the merge and dynamically updates the page once the changes have been committed to the repository.
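The flow above can be sketched in a few lines of Python. This is an illustrative model only, assuming an in-process queue and worker thread; the names (`request_merge`, `poll_status`, `perform_merge`) are hypothetical stand-ins, not Bitbucket's actual implementation:

```python
import queue
import threading
import time
import uuid

merge_queue = queue.Queue()
merge_status = {}  # merge_id -> "pending" | "in_progress" | "merged"

def perform_merge(pr_id):
    """Stand-in for the real merge, which runs git operations on the repo."""
    time.sleep(0.05)  # simulate the actual work

def worker():
    # Background worker: pull merge requests off the queue and process them.
    while True:
        merge_id, pr_id = merge_queue.get()
        merge_status[merge_id] = "in_progress"
        perform_merge(pr_id)
        merge_status[merge_id] = "merged"
        merge_queue.task_done()

threading.Thread(target=worker, daemon=True).start()

def request_merge(pr_id):
    """Called when the user clicks Merge: enqueue a message and return
    immediately, without waiting for the merge to finish."""
    merge_id = str(uuid.uuid4())
    merge_status[merge_id] = "pending"
    merge_queue.put((merge_id, pr_id))
    return merge_id

def poll_status(merge_id):
    """What the front end polls until the status becomes 'merged'."""
    return merge_status[merge_id]
```

The key property is that `request_merge` returns as soon as the message is enqueued; the merge itself completes in the background whether or not anyone is watching the page.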
The reality is that the vast majority of requests to merge pull requests are processed successfully, but our existing UX does not make this clear, causing users to be confused and concerned. We are in the process of updating our UX to make this much clearer:
We have updated the banner itself to indicate that the merge is happening in the background and you can safely navigate away from the page. (Done)
We are rolling out changes to persist the banner on the page even after a page refresh. This should alleviate confusion for users who refresh the page, see that the banner is missing, and then don’t know what state the pull request is in.
We will also be revisiting how we notify users of success or any errors that may occur during the merge.
Please bear with us as we make these updates to our UX. In the meantime, hopefully at least knowing about our asynchronous approach to merging pull requests is helpful for some teams.
Earlier in this post I shared that we aimed to optimize core capabilities that developers use throughout the day, including reviewing pull requests. The single most important part of a pull request is the diff. And in fact, the average pull request diff loads in just over a second. However, the past week has highlighted some significant issues with our diff rendering functionality that disproportionately affect our largest customers.
As mentioned previously, our new platform affords us improved scalability at the cost of increased file system latency. In order to keep the website responsive, our approach has been to accept trade-offs in areas that are not user-facing in order to optimize website response times, since our website is the service where users are most sensitive to increased wait times (no one likes waiting for pages to load).
To optimize our diff and diffstat endpoints, our engineers implemented a solution where we would proactively generate diffs during pull request creation and cache them. Then, when a user viewed the pull request, the diff, diffstat, and conflict information would all be retrieved from the cache, bypassing the file system and avoiding the increased latency. Unfortunately, despite internal testing, when we promoted these changes to production we discovered a significant number of edge cases and bugs causing diffs to sometimes display with artifacts or, worse, simply fail to load.
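In outline, the caching approach looks something like the following sketch. This is a simplified illustration, not our production code; `compute_diff`, `on_pull_request_created`, and the dictionary cache are all hypothetical stand-ins for the real (much more involved) machinery:

```python
diff_cache = {}  # pr_id -> precomputed diff data

def compute_diff(source, target):
    """Stand-in for the expensive diff computation that hits the file
    system (the slow path on the new platform)."""
    return {"diff": f"diff {source}..{target}", "diffstat": {"files": 1}}

def on_pull_request_created(pr_id, source, target):
    # Proactively generate and cache the diff at creation time, so that
    # later reads never have to touch the file system.
    diff_cache[pr_id] = compute_diff(source, target)

def get_diff(pr_id, source, target):
    # Serve from the cache when possible; fall back to recomputing on a
    # cache miss (the slower path we take while the optimization is off).
    cached = diff_cache.get(pr_id)
    if cached is not None:
        return cached
    return compute_diff(source, target)
```

The hard part, as we learned, is not the happy path shown here but the edge cases: keeping the cache correct as branches move, merges land, and conflicts appear.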
As of right now, we have disabled this caching optimization in order to address the bugs. The good news is that we have already fixed many of them and are on track to have the remainder fixed over the next few days. What this means for now is that diffs are loading more slowly for many users and will continue to be slower than usual until we are able to complete all of our fixes and re-enable optimizations.
Tips for ensuring your diffs load quickly:
Keep your pull request branches up to date. Diffs in Bitbucket include the changes that have occurred since you branched off from the target. By keeping your branch close to the target branch you reduce the cost of computing the diff.
Keep your pull requests small. Breaking large monolithic changes down into multiple smaller, more incremental changes is not just good for performance; it also helps your teammates understand and review your code.
With respect to the issues I’ve described, I will repeat the summary from the top of this post to set expectations for the immediate future.
Merging pull requests takes longer than it used to, but it is important to realize that merges happen asynchronously in the background and so you do not need to wait for a pull request to be merged. If you navigate away, it will finish while you are doing other things. We’re updating the UX to make this clear.
Rendering pull request diffs is currently slower and sometimes timing out, but we are actively working on fixing this and expect the issue to be largely resolved for all customers very soon—within days, not weeks.
Again, I understand that the past week has been frustrating for many of you. I look forward to following up on this post with an update once we have restored the performance of our diff and diffstat APIs, and again once we’ve reached the finish line on our move to the cloud.
dtao
Head of Engineering, Bitbucket Cloud