
Encountering some turbulence on Bitbucket’s journey to a new platform

This blog was written by Bitbucket's Head of Engineering, Daniel Tao, and is also published to the Bitbucket Cloud blog.

The past week has been a turbulent time for Bitbucket Cloud’s engineering and support teams as well as our customers. Some of you have expressed concern regarding our services' performance and reliability in recent days. Fortunately, for the majority of Bitbucket users, our services have continued to perform smoothly. But given the number of developers who use Bitbucket daily, even a small percentage of our active users represents a lot of people.

The purpose of this article is to provide some answers: what’s going on, why we’re having these issues, and what you should expect moving forward.

The short of it is this: we’re nearing completion on a year-long project to move to an entirely new platform, which will be good for Bitbucket’s security and reliability moving forward. The move has contributed to recent performance issues, which we are actively addressing and which we expect to be resolved over the next few days:

  • Merging pull requests takes longer than it used to. While we are always investing in performance enhancements and may see improvement in this area, it is important to realize that merges happen asynchronously in the background and so you do not need to wait for a pull request to be merged. If you navigate away, it will finish while you are doing other things. We’re updating the UX to make this clear.

  • Rendering pull request diffs is currently slower and sometimes timing out. We are actively working on fixing this and expect the issue to be largely resolved for all customers very soon—within days, not weeks.

What’s going on

For over a decade, the majority of Bitbucket’s services have been hosted in a data center. While this has served us well for many years, operating a data center comes with significant overhead as well as risk. For example, when we have had unexpected capacity issues (e.g. hardware failures or unplanned outages in upstream services), we have been limited by the physical servers we had available, impacting our time to recovery.

Over the past year, we have been on a journey to migrate all of Bitbucket Cloud to Micros, Atlassian’s internal cloud platform based on AWS. This is truly a quantum leap for Bitbucket Cloud and will resolve many reliability issues including the one described above and more. Of course, that doesn’t mean the move has been easy; if it were, we would have done it a long time ago.

Why we’ve been having issues

While we’ve made tremendous progress behind the scenes over the past 12 months, recent performance degradation affecting some of our larger customers has highlighted significant gaps in our execution, which we are actively working to close.

Perhaps the most significant challenge our teams have faced on this project is file system latency. For 10+ years, Bitbucket’s services have operated with ultra-low-latency access to physical file system servers located within the same facility. This allowed low-latency assumptions to become baked into the implementation of many features, from page rendering to API serialization to caching. Migrating to a new environment where our compute resources and file storage are virtualized is a trade-off: it is very good for our scalability, as we can dynamically increase capacity as needed, but it comes with a performance cost, as the latency between the servers running our application code and the servers storing repository data is substantially higher.

We did anticipate this increased latency and have made many changes to minimize its impact. These changes include the elimination and refactoring of many code paths that previously required file system access, the introduction of new hooks to automatically optimize repositories whenever they’re changed, widespread caching of Git objects and other data that requires file system access, and offloading the storage of common data to S3. These efforts and more have been the key to successfully migrating our production traffic to Micros without seeing a steep increase in response times. But as the past week has made clear, there is still work to do.
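
To illustrate the general shape of this kind of caching, here is a minimal read-through cache sketch in Python. All names here (`get_git_object`, `read_from_file_system`) are illustrative stand-ins, not Bitbucket's actual code:

```python
# Serve Git object reads from memory when possible and only fall back
# to the (now higher-latency) file system on a miss.

object_cache = {}

def read_from_file_system(oid):
    # Stand-in for a slow read against remote repository storage.
    return f"object-{oid}"

def get_git_object(oid):
    # Read-through: a cache hit avoids the file system entirely.
    if oid not in object_cache:
        object_cache[oid] = read_from_file_system(oid)
    return object_cache[oid]

get_git_object("abc123")  # miss: reads from file storage once
get_git_object("abc123")  # hit: served from the in-memory cache
```

The more reads that can be answered from the cache, the less the increased storage latency matters.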

Merging pull requests

As we planned for this migration, our engineering teams prioritized their efforts to ensure that Bitbucket continued to be responsive while supporting all operations that users typically perform with high frequency throughout the day: visiting the dashboard, browsing source code, reviewing pull requests, etc. Our focus was primarily on read operations since these occur with much greater volume than write operations.

In contrast, we accepted that there would be increased latency for writes—e.g. creating repositories, committing code, creating pull requests—reasoning that these represent actions a typical user performs far less frequently than reads.

The case of merging pull requests is clearly one that warranted more careful consideration. After we started receiving support cases and hearing from users that merges were taking too long, we recognized a gap in our UX (user experience) that had gone largely unnoticed: users perceived merges as a synchronous operation, requiring them to wait for the merge to complete and a banner to display on the page, when in actuality they are asynchronous (happening in the background).

When you click the Merge button, this adds a message to a queue. A background worker process later takes this message from the queue and performs the actual merge. While this is happening, the front end polls Bitbucket’s API for the status of the merge and dynamically updates the page once the changes have been committed to the repository.
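
The flow described above can be sketched roughly as follows. This is a simplified Python analogue; `request_merge`, `poll_status`, and the rest are hypothetical names, not Bitbucket's actual implementation:

```python
import queue
import threading
import time

merge_queue = queue.Queue()
merge_status = {}  # pull request id -> "queued" | "merging" | "merged"

def perform_merge(pr_id):
    # Stand-in for the actual Git merge performed by a worker.
    merge_status[pr_id] = "merging"
    time.sleep(0.01)  # the merge itself happens in the background
    merge_status[pr_id] = "merged"

def worker():
    # A background worker pulls merge messages off the queue one by one.
    while True:
        pr_id = merge_queue.get()
        if pr_id is None:  # sentinel to stop the worker
            break
        perform_merge(pr_id)
        merge_queue.task_done()

def request_merge(pr_id):
    # Clicking "Merge" just enqueues a message and returns immediately.
    merge_status[pr_id] = "queued"
    merge_queue.put(pr_id)

def poll_status(pr_id):
    # The front end polls this until the status becomes "merged".
    return merge_status.get(pr_id, "unknown")

threading.Thread(target=worker, daemon=True).start()
request_merge(42)   # returns immediately; the merge happens later
merge_queue.join()  # (for this demo) wait for the worker to finish
```

The key point is that `request_merge` returns immediately; the front end only polls for status while the worker does the real work, which is why navigating away is safe.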

The reality is that the vast majority of requests to merge pull requests are processed successfully, but our existing UX does not make this clear, causing users to be confused and concerned. We are in the process of updating our UX to make this much clearer:

  1. We have updated the banner itself to indicate that the merge is happening in the background and you can safely navigate away from the page. (Done)

  2. We are rolling out changes to persist the banner on the page even after a page refresh. This should alleviate confusion for users who refresh the page, see that the banner is missing, and then don’t know what state the pull request is in.

  3. We will also be revisiting how we notify users of success or any errors that may occur during the merge.

Please bear with us as we make these updates to our UX. In the meantime, hopefully at least knowing about our asynchronous approach to merging pull requests is helpful for some teams.

Viewing pull request diffs

Moments ago I shared that we aimed to optimize core capabilities that developers use throughout the day, including reviewing pull requests. The single most important part of a pull request is the diff. And in fact, the average pull request diff loads in just over a second. However, the past week has highlighted some significant issues with our diff rendering functionality that disproportionately affect our largest customers.

As mentioned previously, our new platform affords us improved scalability at the cost of increased file system latency. In order to keep the website responsive, our approach has been to accept trade-offs in areas that are not user-facing in order to optimize website response times, since our website is the service where users are most sensitive to increased wait times (no one likes waiting for pages to load).

To optimize our diff and diffstat endpoints, our engineers implemented a solution where we would proactively generate diffs during pull request creation and cache them. Then, when a user viewed the pull request, the diff, diffstat, and conflict information would all be retrieved from the cache, bypassing the file system and avoiding the increased latency. Unfortunately, despite internal testing, when we promoted these changes to production we discovered a significant number of edge cases and bugs causing diffs to sometimes display with artifacts or, worse, simply fail to load.
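
In rough terms, the optimization works like this sketch. This is hypothetical Python; `compute_diff` and the cache are illustrative stand-ins for the real file-system-backed diff machinery:

```python
diff_cache = {}

def compute_diff(source, target):
    # Stand-in for the expensive Git diff against file storage.
    return [line for line in source if line not in target]

def create_pull_request(pr_id, source, target):
    # Proactively generate the diff and cache it at PR creation time.
    diff_cache[pr_id] = compute_diff(source, target)

def view_pull_request(pr_id, source, target):
    # Serve the diff from cache when possible, bypassing file storage;
    # fall back to recomputing (slower) when the cache is disabled or cold.
    if pr_id in diff_cache:
        return diff_cache[pr_id]
    return compute_diff(source, target)

create_pull_request(7, ["a", "b", "c"], ["a", "c"])
view_pull_request(7, ["a", "b", "c"], ["a", "c"])  # -> ["b"], from cache
```

Disabling the optimization corresponds to the fallback path: every view recomputes the diff against file storage, which is why diffs currently load more slowly.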

As of right now, we have disabled this caching optimization in order to address the bugs. The good news is that we have already fixed many of them and are on track to have the remainder fixed over the next few days. What this means for now is that diffs are loading more slowly for many users and will continue to be slower than usual until we are able to complete all of our fixes and re-enable optimizations.

Tips for ensuring your diffs load quickly:

  1. Keep your pull request branches up to date. Diffs in Bitbucket include the changes that have occurred since you branched off from the target. By keeping your branch close to the target branch you reduce the cost of computing the diff.

  2. Keep your pull requests small. Breaking large monolithic changes down into multiple smaller, more incremental changes is not just good for performance; it also helps your teammates understand and review your code.

What you can expect

With respect to the issues I’ve described, I will repeat the summary from the top of this post to set expectations for the immediate future.

  • Merging pull requests takes longer than it used to, but it is important to realize that merges happen asynchronously in the background and so you do not need to wait for a pull request to be merged. If you navigate away, it will finish while you are doing other things. We’re updating the UX to make this clear.

  • Rendering pull request diffs is currently slower and sometimes timing out, but we are actively working on fixing this and expect the issue to be largely resolved for all customers very soon—within days, not weeks.

Again, I understand that the past week has been frustrating for many of you. I look forward to following up on this post with an update once we have restored the performance of our diff and diffstat APIs, and again once we’ve reached the finish line on our move to the cloud.

14 comments

Sumeet Keswani July 9, 2021

We moved to Bitbucket a week back as a pilot from an internal system that we have used for more than a decade. In our first week we have had a case of a less-than-500-line diff not rendering, and it's been a fair bit of productivity loss.

My only comment is that moving your storage to S3 is a great idea from a cost-of-operations perspective but horrible from a latency perspective. S3 access can never beat local file system access. The elevator music on the UI to indicate an async operation might help.

We hope you can resolve the matter at your earliest and fix the issues that result in poor performance. The success of our development projects depends on it.

Ken Zarsky July 15, 2021

I don't know any dev that is comfortable with fire and forget on merges... just not a 'walk away' type of action.  Please consider human nature when decisioning your solution!

Johannes_Nielsen July 16, 2021

This is unbearable. As Ken said: Nobody merges and then walks away. 

Often a merge will trigger a deployment pipeline. Having to wait for the merge means having to wait for the deployment. Sometimes the UI will simply state "We had a problem trying to display some data" or something and you are left there guessing and waiting to see if and when the merge will actually happen. Sometimes you even have to trigger the merge again then.

Horrible developer experience. If this does not get fixed soon, I'll advocate moving all our assets to GitHub. 

Dave B July 16, 2021

As Johannes said, if the merge didn't happen on completion of the UI action, the dev will be coming back until it's done; this may introduce latency of 30-60+ minutes due to having other tasks - a major productivity issue.

dtao
Atlassian Team
Atlassian Team members are employees working across the company in a wide variety of roles.
July 23, 2021

I'll be sharing an update and also updating this post with some more things we've done and things we've learned since this was published. To reply to your comments directly:

@Sumeet Keswani as I'll share in my upcoming update, we've made major improvements to diff response times already; so I hope your issues in that area are mostly resolved. Definitely let me know if you're still seeing significant diff performance issues.

To clarify re: S3, we are not directing any website or API user operations to S3! Specifically what we're using S3 for is supporting repeated clones of the same repository at the same revision, which largely come from CI traffic. The net effect of this is reducing I/O load on our file storage backend which improves user-facing response times.

@Ken Zarsky fair enough—the intent of this post wasn't so much to rationalize slow merge times as it was to reduce confusion. Previously if a merge was taking a while and the user refreshed the page, the banner would disappear, leading them to suspect it had failed or been canceled and then they would try again. We've since updated the UX to persist the banner as long as the merge is still in progress, not to make excuses for merges being slow, but just to more accurately reflect reality. We are actively working on addressing the sources of slowness as well (stay tuned for my next post to provide some updates on that).

@Johannes_Nielsen and @Dave B feedback like yours has helped us appreciate the disproportionate impact these issues are having on specific customers. We know that 99% of merges are completed in under 10 seconds; in fact it's closer to 5 seconds. We're still investigating why there is so much volatility in the top 1% of most expensive merges as well as why certain customers seem to be affected more than others. The reality is it is not expected for merges to be taking 30 seconds or more on a regular basis—and certainly not for simple diffs.

I'll comment again once I've had a chance to publish a more in-depth update.

Sumeet Keswani July 26, 2021

thank you for your attention.

yes, the issue is mostly resolved, we have not had a major problem in about a week or so. 

Rick de Graaf August 4, 2021

I think we are one of the companies in the mentioned 1%. We are still experiencing long merge times, on average 3+ minutes for almost all our merge requests. We had hoped this would improve after the above statement.

Earlier this week we had to cancel a scheduled release because one of the repositories needed was unavailable for more than an hour. 

We are experiencing issues like these and others multiple times a week, and this is really affecting us as a company. We also feel these issues aren't noticed quickly enough: in the example of the unavailable repository, the status page only reflected the problem 30 minutes after we first noticed it.

As a company we are seriously looking into other services and solutions at the moment, as we feel the service as a whole is too unstable, experiences problems way too often, and that issues aren't noticed and solved quickly. It does not seem to be improving over time (this year, especially the last few months).

Through this comment (and if there are better ways, we would like to hear about them) we hope to clearly express the need for improvement and the continuous negative impact on our work process and all its implications.

Lastly we feel improvements can be made in communication in general about these issues and in keeping the status page up to date. 

As software developers we know that services and software aren't 100% reliable and bug-free, but we feel that, especially as this is a paid service, issues are way too frequent and take too long to be resolved.

That being said we sincerely hope things will improve quickly.

Rick de Graaf August 11, 2021

@dtao Would really love to see a reaction. Currently I'm waiting more than 10 minutes for a PR to merge.

Dave B August 11, 2021

We also see slowness when using "sync" from the web UI for PRs that are behind the target branch.

dtao
Atlassian Team
August 16, 2021

@Rick de Graaf thanks for sharing—I've published a follow-up post that should be all good news for you and your team:


I'm very sorry to hear about the disruption to your scheduled release. We've done our best to mitigate customer impact as much as possible as we migrate repositories to our cloud storage layer, e.g. by working weekends on the actual migration itself and communicating scheduled maintenance via our Statuspage. Admittedly we aren't perfect and recognize, given the sheer number of users accessing bitbucket.org even over the weekend, some disruption is inevitable. Fortunately we're on track for this upcoming weekend to be our very last weekend migrating repositories; I expect for us to be 100% done by this time next week.

"Lastly we feel improvements can be made in communication in general about these issues and in keeping the status page up to date."

I totally agree with you here and will be working with our teams as well as the Statuspage team (we're in the same department, after all!) to explore ways we can improve this, e.g. by automating our public statuses based on internal metrics.

The truth is with this move to our cloud platform, we will soon have more resources than ever before to pursue things like this as our engineers, especially our SREs, can spend less time maintaining hardware and operating a data center and focus more on longer-term operational excellence goals.

dtao
Atlassian Team
August 16, 2021

@Dave B please see the article I shared in my above comment and let me know if you're still seeing slowness from the feature to sync out-of-date PRs. I expect that the improvements we made to reduce PR merge times should have benefitted this functionality as well, since both are powered by our background task services.

Rick de Graaf August 16, 2021

@dtao looking forward to seeing the issues being resolved!

Dave B August 18, 2021

@dtao - Thx

I noticed it was improved yesterday - small sample - I'm checking with my team and will post a follow up

Dave B August 30, 2021

@dtao We are seeing improved performance across all requests now
