There have been a number of outages / performance degradations recently for BitBucket. What is the root cause of these and what is Atlassian doing about them?

John Telford October 22, 2015

The month of October has been a rough one for Bitbucket (see http://status.bitbucket.org/history), and the outages are an impediment to our teams. My questions are:

  1. What are the causes of these issues?
  2. What is Atlassian doing about it?
  3. When should we expect to see improvements?

Being in IT, we all have empathy for what is going on internally, but we need information to decide whether Bitbucket is right for us or whether, given our uptime needs, we should pursue another option.


1 answer

1 accepted

6 votes
Answer accepted
@Dan
Atlassian Team
October 22, 2015

Hi John,

I'm the engineering manager for Bitbucket Cloud and, yes, it's been a rough month, as you note. Right now the largest obstacle is a piece of failing network hardware, which led to the problems experienced this week. Networking being what it is, we cannot rule out the possibility that this hardware has been playing a silent role as a contributor to, or the root cause of, many of this month's problems. Unfortunately, we can't be 100% certain of its role because we're not getting any error logs from the switch – only sporadic, externally measurable packet loss. We have raised support tickets with the vendor, but the larger plan is to move off that hardware efficiently and safely. We're doing this today, but it's a tricky business.
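The answer doesn't say how Atlassian tracks that "externally measurable packet loss", so purely as illustration: when a device emits no error logs, one common approach is to probe it from outside and alert on the loss rate over a sliding window. A minimal sketch (all names hypothetical):

```python
from collections import deque


class PacketLossMonitor:
    """Track packet loss over a sliding window of probe results.

    Each probe result is recorded as True (reply received) or
    False (probe lost). The loss rate is computed over the most
    recent `window_size` probes only, so a transient burst of
    loss ages out of the window over time.
    """

    def __init__(self, window_size: int = 100, alert_threshold: float = 0.01):
        # deque with maxlen silently discards the oldest entry
        # once the window is full
        self.results: deque[bool] = deque(maxlen=window_size)
        self.alert_threshold = alert_threshold

    def record(self, reply_received: bool) -> None:
        self.results.append(reply_received)

    @property
    def loss_rate(self) -> float:
        if not self.results:
            return 0.0
        lost = sum(1 for ok in self.results if not ok)
        return lost / len(self.results)

    def should_alert(self) -> bool:
        return self.loss_rate > self.alert_threshold
```

In practice the `record()` calls would be driven by periodic pings or TCP health checks toward the suspect hop; the sketch only shows the bookkeeping side.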

In terms of what we are doing about it, you should expect to see a reduction in service interruptions over the next couple of days. Today we will be making those networking changes. We are also making architectural changes to provide more resilience against network fluctuations (possibly at the expense of some performance) in case there is a deeper problem. We will continue to do a full shakedown of the network stack and to follow up with the vendor.
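The answer doesn't detail those architectural changes, but one common way to add resilience against transient network fluctuations (not necessarily what Atlassian implemented) is client-side retries with exponential backoff and jitter, trading a little latency for a much higher success rate. A hedged sketch:

```python
import random
import time


def with_retries(op, max_attempts=4, base_delay=0.1, sleep=time.sleep):
    """Run `op`, retrying on ConnectionError with exponential backoff.

    The delay before attempt n is base_delay * 2**(n-1), scaled by a
    random jitter factor so that many clients retrying at once do not
    hammer the network in lockstep. The final failure is re-raised.
    `sleep` is injectable so tests can observe delays without waiting.
    """
    for attempt in range(1, max_attempts + 1):
        try:
            return op()
        except ConnectionError:
            if attempt == max_attempts:
                raise  # out of attempts: surface the error
            delay = base_delay * (2 ** (attempt - 1)) * (1 + random.random())
            sleep(delay)
```

This is exactly the kind of change that costs some performance on a flaky path (extra waits) in exchange for fewer user-visible failures, matching the trade-off the answer mentions.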

In terms of future plans, this recent spate of interruptions has actually interfered somewhat with the larger capacity work planned for the next six months: specifically, we are in the midst of expanding our storage, adding Internet uplinks, and scaling up hardware (quadrupling compute to provide headroom for some future plans).

In addition to the long-term infrastructure work, we will continue to pursue smaller short-term performance wins and user-experience improvements in the code base itself. In fact, we just finished what we call "performance week" – a week dedicated to finding inefficiencies in the code and improving the user experience. Unfortunately, we haven't been able to deploy those improvements because of these incidents.

Does that adequately answer your question?

Thanks,

Dan Bennett

John Telford October 22, 2015

Dan – thank you for the detailed and thoughtful response. From our point of view, the biggest obstacle has been the availability of interactions that need to happen synchronously, such as pushing code or merging a pull request. Delays in webhooks and the like are less of an issue for us. The biggest emotional frustration for a developer is to have a feature coded and done and not be able to push it to the repo. We look forward to improvements in the uptime of the platform, and we offer any assistance we can provide. /jrt
