The month of October has been a rough time for Bitbucket (see http://status.bitbucket.org/history) and has been an impediment to our teams. My questions are:
Being in IT, we all have empathy for what is going on internally, but we need information to decide whether Bitbucket is right for us or whether, because of our uptime needs, we have to pursue another option.
I'm the engineering manager for Bitbucket Cloud, and yes, it's been a rough month, as you note. Right now the largest obstacle is a piece of failing network hardware that led to the problems experienced this week. Networking being what it is, we cannot rule out the possibility that this hardware has been playing a silent role as either a contributor to, or the root cause of, many of this month's problems. Sadly, we can't be 100% certain of its role because we're not getting any error logs from the switch – only sporadic, externally measurable packet loss. We have raised support tickets with the vendor, but the larger plan is to efficiently and safely move off that hardware. We're doing this today, but it's a tricky business.
In terms of what we are doing about it, you should expect to see a reduction in service interruptions over the next couple of days. Today, we will be making those networking changes. We are also making architectural changes to provide more resiliency against network fluctuations (possibly at the expense of some performance) in case there is a deeper problem. We will continue to do a full shakedown of the network stack and to follow up with the vendor.
In terms of future plans, this recent spate of interruptions has actually interfered somewhat with larger capacity work planned over the next six months: specifically, we are in the midst of expanding our storage, adding Internet uplinks, and scaling up hardware (quadrupling compute to provide headroom for some future plans).
In addition to the long-term infrastructure work, we will continue to pursue shorter-term performance wins and user-experience improvements in the code base itself. In fact, we just finished what we call "performance week" – a week dedicated to finding inefficiencies in the code and improving the user experience. Unfortunately, we haven't been able to deploy those improvements because of these incidents.
Does that adequately answer your question?
Dan – thank you for the detailed and thoughtful response. From our point of view, the biggest obstacle has been the uptime of interactions that need to happen synchronously, i.e. pushing code or merging a pull request. If there are "delays" in webhooks, etc., it is less of an issue for us. The biggest emotional frustration for a developer is to have a feature coded and done and not be able to push it to the repo. We look forward to improvement in the platform's uptime and offer any assistance we can provide. /jrt