
Extinguishing our performance fires and rebuilding for the future

This blog was written by Bitbucket's Head of Engineering, Daniel Tao, and is also published to the Bitbucket Cloud blog.

I stepped into the role of Head of Engineering for Bitbucket Cloud in late 2020, having served as one of the team's senior engineering managers for several years. It is an honor and a privilege to lead this team, and I couldn't be prouder of the hard work we've done and continue to do each day to make Bitbucket a world-class product empowering teams to build, test, and deploy software to millions of people around the world.

It has been an eventful journey, and the past few weeks are no exception.

Recent performance incidents

Back in October, our previous Head of Engineering, Robert Krohn, shared a blog post to provide transparency about a major incident that had affected customers, as well as to share some impressive performance improvements the team had delivered. Today I'd like to do the same for recent incidents in late March and again this month that have been especially trying, for our customers but also for our engineering teams. Fortunately, I also have another set of improvements to share, along with plans to continue investing in the reliability and speed of our services.

It would be fair to ask how these recent incidents are any different from past events such as those explained by my predecessor. This is a good opportunity to differentiate between reliability and performance. While these two concerns are highly correlated, they do not always go together. As an example, shifting expensive workloads from high-performance inelastic infrastructure to new infrastructure that sacrifices some speed for scalability could improve reliability while increasing execution time.

Much of our investment in Bitbucket Cloud over the past year has been focused on reliability: better monitoring and alerting, moving to a more scalable architecture, and so on. Recently, our customers experienced severely degraded performance, causing pages to load very slowly and Git operations to appear to hang or time out. While our services did not experience outages during these incidents, the performance degradation was nonetheless highly disruptive to developers trying to use Bitbucket, highlighting the importance of treating performance and reliability as equal partners.

Extinguishing the fires

As the team dug in to investigate our performance degradation issues last week, one key discovery emerged early on: by far the biggest contributor was Bitbucket's file system layer, the piece of our infrastructure responsible for facilitating application access to customers’ source code. This is represented by the blue area in the following graph showing peaks in our response times last Monday and Tuesday.

Every other performance factor, such as databases, caches, and external web requests, was relatively stable compared to file system operations. (You can see that the other colored areas do not show as much variation as the blue area.)

A thorough explanation of what we ultimately discovered regarding the root cause of these performance issues will have to wait for a future blog post. While the investigation proved challenging, this early signal helped us kick into action right away on a plan to mitigate the impact to customers:

  1. By offloading as much work as possible away from the file system and onto lower-latency parts of the infrastructure (e.g. caches)
  2. By identifying expensive operations that depend on the file system that we could stop doing entirely (after all, the fastest code is the code which does not run)

Caching packfiles

One of our greatest sources of file system throughput has always been Git traffic: developers cloning repositories, pulling their teammates’ changes, and pushing their own code. The load of these operations is increased by an order of magnitude by CI tools such as Pipelines, Bamboo, or Jenkins. Customers often have these tools configured to poll Bitbucket for changes, sometimes on aggressive schedules (e.g. every minute!), which can generate huge amounts of file system I/O.

Over the past couple of weeks, our engineers have rolled out an optimization to cache and serve packfiles for repository clones and fetches without touching the file system, which has cut file system throughput nearly in half.
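The post doesn't show how this cache is implemented, but the core idea can be sketched: key a generated packfile on the repository and the exact set of refs requested, so that repeated clones and fetches (for example, from CI tools polling every minute) are served from the cache instead of regenerating the pack from disk. All names below are hypothetical, and an in-memory dict stands in for whatever shared cache the real system uses.

```python
import hashlib


class PackfileCache:
    """Hypothetical read-through cache for generated packfiles.

    Bitbucket's real implementation is not public; this only
    illustrates the shape of the optimization: identical
    clone/fetch requests should never hit the file system twice.
    """

    def __init__(self, generate_packfile):
        self._generate = generate_packfile  # the slow file-system path
        self._store = {}                    # stand-in for a shared cache

    @staticmethod
    def _key(repo_id, want_refs):
        # The same set of refs must produce the same key regardless
        # of request order, so sort before hashing.
        digest = hashlib.sha256(
            "\n".join(sorted(want_refs)).encode()
        ).hexdigest()
        return f"packfile:{repo_id}:{digest}"

    def get_packfile(self, repo_id, want_refs):
        key = self._key(repo_id, want_refs)
        pack = self._store.get(key)
        if pack is None:
            # Cache miss: fall back to generating the pack from disk.
            pack = self._generate(repo_id, want_refs)
            self._store[key] = pack
        return pack
```

In a real deployment the hard part is invalidation: a push to any of the requested refs changes the pack contents, so the key would also need to incorporate the current ref values, which is omitted here for brevity.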

[Graph: file I/O from Git over SSH traffic]
[Graph: file I/O from Git over HTTPS traffic]

While this change does not directly improve the performance of Git operations, it dramatically reduces load on our file system layer, which in turn makes response times faster across all endpoints that perform file system operations.

Custom libgit2 ODB backend

While our Git services have historically been the biggest contributor to our file system throughput, our website and APIs are no slouches either. Rendering source code and commit messages, and especially computing diffs between versions, are all expensive, I/O-bound operations.

The services powering Bitbucket's website and API layers are implemented in Python and use a custom library to handle all access to repositories on disk. This library uses the libgit2 C library under the hood, which has support for pluggable custom backends including an ODB (object database) backend—allowing consumers to specify a data source other than the local file system for looking up Git objects.

A few months ago, some of our engineers implemented a proof of concept for a custom backend that introduces a high-performance caching layer between libgit2 and the file system. We have had this backend enabled in a staging environment for the past couple of months, identifying minor issues and optimization opportunities along the way before deciding it was ready for production. Over the past week, we have enabled it for all users, reducing file system throughput by more than 30%.
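The details of Bitcket's custom ODB backend aren't public, but the read-through shape such a backend takes is straightforward, because Git objects are content-addressed by their OID and therefore immutable: a cached object never needs invalidation. The sketch below uses plain Python with invented names; the actual pygit2/libgit2 backend API has more methods and different signatures.

```python
class CachingOdbBackend:
    """Hypothetical read-through cache between libgit2 and disk.

    Git objects are immutable and keyed by OID, which makes them
    ideal cache entries: a hit never needs to be invalidated.
    """

    def __init__(self, disk_backend, cache):
        self._disk = disk_backend  # loose/packed object files on disk
        self._cache = cache        # e.g. a shared in-memory cache

    def read(self, oid):
        obj = self._cache.get(oid)
        if obj is None:
            obj = self._disk.read(oid)  # the slow file-system path
            self._cache[oid] = obj
        return obj

    def exists(self, oid):
        return oid in self._cache or self._disk.exists(oid)
```

Because the cache sits below the application, every code path that reads objects (rendering files, commit messages, diffs) benefits without modification, which is consistent with the across-the-board I/O reduction described above.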

[Graph: file I/O from website traffic]

Warming and sharing diff caches

One of the most expensive things the Bitbucket website does is compute and render diffs. Diffs are everywhere in Bitbucket: reviewing a pull request, viewing a commit, and comparing two branches all require the computation of diffs. Even functionality that doesn't involve showing a diff sometimes requires computing it, for example to produce diffstat information (summary of files and lines added, changed, and removed between versions of a file) or detect conflicts.

We were already caching diffs in many places to speed up response times, but this caching was ad hoc: different code paths that computed the diff would cache it independently, and even the caching backend itself wasn't consistent in all places; some cached diffs were stored in Memcached, others in temporary files in a shared directory.

Our engineers have started rolling out some major optimizations to these code paths, which consolidate caching (reducing resource usage) and leverage known access patterns to ensure caches are "warmed" prior to being checked. Here's an example: the pull request view utilizes a diff behind the scenes to detect conflicts between the source and destination branches. By updating the UI to defer the request for these conflicts until after the diff has rendered—which ensures the diff is already cached when it's time to check for conflicts—our engineers were able to increase the cache hit rate for the conflicts API endpoint to nearly 100%, resulting in a huge drop in response times.
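The pattern described here, consolidating all diff computation behind one cache key so that rendering the diff "warms" the cache for the later conflicts check, can be sketched as follows. The class and method names are hypothetical; the point is that both code paths share one cache entry, so the second caller almost always hits.

```python
class DiffService:
    """Sketch of consolidated diff caching (hypothetical API).

    Every code path that needs the diff between two commits goes
    through one method and one cache key, so rendering the diff
    for the pull request view warms the cache for later calls
    such as the conflicts check.
    """

    def __init__(self, compute_diff):
        self._compute = compute_diff
        self._cache = {}
        self.hits = 0
        self.misses = 0

    def diff(self, source, destination):
        key = (source, destination)
        if key in self._cache:
            self.hits += 1
        else:
            self.misses += 1
            self._cache[key] = self._compute(source, destination)
        return self._cache[key]

    def has_conflicts(self, source, destination):
        # Deferred by the UI until after the diff has rendered,
        # so in practice this is almost always a cache hit.
        return "<<<<<<<" in self.diff(source, destination)
```

The UI-side change is just sequencing: request the diff first, then the conflicts, so the expensive computation happens exactly once per pull request view.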

[Graph: response times from the pull request conflicts API]

Rebuilding for the future

While I'm thrilled at all of the improvements our engineering teams have been able to make in just the past couple of weeks, I also know from experience that software does not maintain itself (no matter how hard I wish it did!), and that includes performance. Left unchecked, response times will trend upward over time, resource usage will climb, and both reliability and performance will naturally erode.

In order to preserve and build on our recent performance work, we will need to increase our investment in these areas. This investment will take three forms:

  1. In addition to the reliability SLOs our teams have been internally tracking for months now, I will work with each of our teams to formalize complementary performance SLOs so that we are holding ourselves accountable to maintaining high-performing services.
  2. We are already transitioning to a new ownership model in which our engineering teams are empowered to look after a broad set of performance metrics and take appropriate action. This should help us pursue proactive measures more consistently and respond more efficiently to incidents like these, when a system-wide infrastructure issue affects capabilities spanning services owned by multiple teams.
  3. Finally, we will be taking this opportunity to identify automation we can start building to provide guardrails and build a better defense against future performance bottlenecks. This will include both functional tests to detect common sources of performance regressions as well as programmatic circuit breakers to automate incident response and minimize our TTR (time to recover) wherever possible.
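The "programmatic circuit breaker" mentioned in the third item is a standard resilience pattern; a minimal sketch follows, with invented thresholds and API, to show the mechanics. After a run of consecutive failures the circuit opens and calls fail fast for a cooldown period, shedding load from a struggling dependency instead of piling on and lengthening the incident.

```python
import time


class CircuitBreaker:
    """Minimal circuit breaker (illustrative only; thresholds
    and API are invented, not Bitbucket's actual tooling).

    After `max_failures` consecutive errors the circuit opens and
    calls fail fast for `reset_after` seconds before one trial
    call is allowed through again.
    """

    def __init__(self, max_failures=5, reset_after=30.0):
        self.max_failures = max_failures
        self.reset_after = reset_after
        self.failures = 0
        self.opened_at = None

    def call(self, fn, *args, **kwargs):
        if self.opened_at is not None:
            if time.monotonic() - self.opened_at < self.reset_after:
                raise RuntimeError("circuit open: failing fast")
            self.opened_at = None  # half-open: allow one trial call
        try:
            result = fn(*args, **kwargs)
        except Exception:
            self.failures += 1
            if self.failures >= self.max_failures:
                self.opened_at = time.monotonic()
            raise
        self.failures = 0  # success resets the failure count
        return result
```

Wrapping, say, file system calls in a breaker like this is what turns a slow dependency into a fast, explicit failure that automation can act on, which is exactly the TTR reduction the plan describes.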

A commitment to transparency

Last but not least, I intend to continue keeping you informed of the challenges we face and the investments we are making to enhance Bitbucket's performance moving forward. We understand that you have placed your trust in Bitbucket Cloud, and we do not take that trust for granted. An increased focus on transparency is one of the key ways in which I believe we can deepen that trust, while providing an interesting and perhaps educational look at some of the engineering action going on behind the scenes every day to make Bitbucket Cloud the software collaboration and delivery platform of choice for millions of professional teams around the world.

-Dan Tao

33 comments

seanaty Atlassian Team Apr 30, 2021

Thanks for sharing this. I really appreciate the transparency!


An excellent read, Dan! What a great insight into the issues that large platforms can face.

thank you for the detailed article

Thanks for sharing - as luck would have it - we are seeing https://bitbucket.status.atlassian.com/incidents/5025975zcr0v :( 

I can understand there are more challenges in Bitbucket Cloud, as my team is pushing more load to Bitbucket. The team also runs more jobs with Bitbucket Pipelines, which were done on Bamboo before.

Thanks for the transparency, and keep improving.

Thanks for the transparency.


Well articulated incident report and good read for IT guys.

Thanks @dtao ! This was very informative. I am now eager for the line you wrote ..."root cause of these performance issues will have to wait for a future blog post." :)


Thanks for the detailed explanation. 

This was written in April; since then there have been two major incidents, the last one only yesterday :-(

Hoping the suggested steps will be deployed soon


It's too late. We had been Bitbucket customers since 2017, and we had to switch to GitHub in January 2021 because of the lack of features (in-container databases and VPN in Pipelines) and, most importantly, these kinds of recurring issues. Good luck to the Atlassian team, but we're never going back to Bitbucket as long as it remains worse than GitHub and GitLab.


Have you considered moving away from Python to Golang, for example? Or Julia?

Regarding diffs, Julia has powerful tools for working with huge data.

I would suggest creating two initiatives, Golang and Julia, where your smartest folks would take the challenge and try to solve your biggest problems :-) In the case of Julia, you could experiment with backend architectures as well. Golang, in my opinion, would be more of a straight faster replacement.


thanks for sharing this!

I definitely was stressing out with how long git pulls and pushes to the remote repos were taking, glad I know what was up! Thanks

So nice to know about challenges your team is facing. That really has an educational factor as we are all building software. Thanks for sharing.


It was an awesome read. Understanding what actually goes on behind the scenes provides context to end users. Software developers like me can also learn from the mistakes.

Props for going above and beyond, and for the masterclass on being transparent with your customers.

Very interesting, thanks for sharing!

Thanks for giving us visibility into this 

It is really great to see a company that is this transparent about stuff that happens.

Thanks for sharing. However, the pipelines are still way too slow. Unless we see a significant increase in performance in the near future, it is only a matter of time before we move to an environment that meets our needs. The background is that we optimized our build times because fast feedback is essential for effective software development. But with a slow environment, those investments become worthless and productivity decreases...


Thanks, we suffered a bit from it but we appreciate the transparency effort.

Today 50% of SSH requests are failing.

 

The response error when it fails:

Permission denied (publickey).

fatal: Could not read from remote repository

 

But that is not the real problem, because it sometimes fails and sometimes works.


I appreciate the transparency and honesty. Engineering at scale like this is tough work, and it's always great to learn about the problems and how they were solved. +1 on the education.

However, just after I read this post, I have now discovered that I cannot pull code from my repos on BitBucket, it returns 500 after a long delay on every request.

Terrible, ironic timing...


Same for us... Ironic. We are stuck in the middle of PRs. Can't do anything.


We have an emergency hotfix we now cannot deploy... :/
