Updates and Q&A on active incident affecting Atlassian products

Stephen Deasy
Atlassian Team
Atlassian Team members are employees working across the company in a wide variety of roles.
April 12, 2022

Update: April 18, 2022

Hello Community,

All customers impacted by the outage have been restored.

If you need assistance, please reply to your support ticket so that our engineers can work with you.

Our teams will be working on a detailed Post Incident Report to share publicly by the end of April.

Stephen

 


Update: April 17, 2022

Hello Community,

We have now restored our customers impacted by the outage and have reached out to key contacts for each affected site.

Our support teams are working with individual customers through any site-specific needs. If you need assistance, please reply to your support ticket so that our engineers can work with you as soon as possible.

Stephen

 


Update: April 16, 2022

Hello Community,

I’m happy to report we have now restored 99% of users impacted by the outage and have reached out to all affected customers.

Our teams are available to help customers with any concerns. If you need assistance, please reply to your support ticket so that our engineers can work with you.

Have a great weekend,

Stephen

 


Update: April 15, 2022

Hello Community,

I’m happy to report we have now restored 78% of users impacted by the outage.

With significant progress over the last 24 hours, our teams will continue to restore sites through the weekend, and we expect to have all sites restored no later than end of day Tuesday, April 19th PT.

We will be reaching out directly if we haven’t already delivered your site for validation and acceptance.

Have a great weekend,

Stephen

 


Update: April 14, 2022

Hello Community,

I’m happy to report that we have now restored functionality for 55% of users impacted by the outage.

With automation in full effect, we have now significantly increased the pace at which we are conducting technical restoration of affected customer sites in addition to reducing validation time of restored sites by half.

We will keep you posted on updates throughout the upcoming weekend and appreciate your continued patience. 

Stephen

 


Update: April 13, 2022

Hello Community,

Wanted to let you know that we have restored functionality for 49% of users impacted by the outage.

We are taking a batch-based approach to restoring customers, and to date this process has been semi-automated. We are beginning to shift towards a more automated process to restore sites. That said, there are still a number of steps required before we hand a site to customers for review and acceptance.

We are restoring affected customers in groups of up to 60 at a time, identified by a mix of variables including site size, complexity, edition, tenure, and several other factors. The full restoration process involves our engineering teams, our customer support teams, and the customer, and has three steps:

  1. Technical restoration involving metadata recovery, data restores across a number of services, and ensuring the data across the different systems is working correctly for product and ecosystem apps

  2. Verification of site functionality to ensure the technical restoration has worked as expected

  3. Lastly, working directly with the affected customer to enable them to verify their data and functionality before enabling for their users
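The batch-based, three-step process described above can be sketched roughly as follows. This is only an illustrative model, not Atlassian's actual tooling: the `Site` class, the `priority` score (standing in for site size, complexity, edition, tenure, and other factors), and the function names are all hypothetical. Only the batch size of 60 and the three step names come from the update.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Iterator, List

BATCH_SIZE = 60  # "groups of up to 60 at a time", per the update


class Stage(Enum):
    TECHNICAL_RESTORATION = 1    # step 1: metadata recovery and data restores
    FUNCTIONAL_VERIFICATION = 2  # step 2: internal check of site functionality
    CUSTOMER_VALIDATION = 3      # step 3: customer verifies data and functionality
    DONE = 4


@dataclass
class Site:
    name: str       # hypothetical site identifier
    priority: int   # hypothetical score; the real selection criteria are not public
    stage: Stage = Stage.TECHNICAL_RESTORATION


def batches(sites: List[Site], size: int = BATCH_SIZE) -> Iterator[List[Site]]:
    """Yield groups of at most `size` sites, highest priority first (assumed ordering)."""
    ordered = sorted(sites, key=lambda s: s.priority, reverse=True)
    for start in range(0, len(ordered), size):
        yield ordered[start:start + size]


def advance(site: Site) -> None:
    """Move a site to the next stage; a site that is DONE stays DONE."""
    order = list(Stage)
    idx = order.index(site.stage)
    if idx < len(order) - 1:
        site.stage = order[idx + 1]


def restore(sites: List[Site]) -> int:
    """Run every batch through all three steps and return the number of batches."""
    count = 0
    for batch in batches(sites):
        for site in batch:
            while site.stage is not Stage.DONE:
                advance(site)
        count += 1
    return count
```

With 130 hypothetical sites, `batches` would produce groups of 60, 60, and 10. In practice each step involves people (engineers, support, and the customer), so a real pipeline would wait on sign-off between stages rather than looping straight through.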

We have also contacted all customers who are *up next* for step 3 in the site restoration process described above. These customers are aware that they are next in queue through their support ticket and/or via a support engineer.

We have proactively reached out to technical contacts and system admins at all impacted customers, and opened support tickets for each of them. However, we learned that some customers have not yet heard from us or engaged with our support team.

If you are experiencing an outage and do not have access to your open ticket, please contact us through our Support Service Desk (choose the Billing, Payments, & Pricing options from the drop-down menu).

As always, thanks for your feedback and continued patience.

Stephen

 


April 12, 2022

Hi Atlassian Community,

I'm Stephen Deasy, Head of Engineering at Atlassian. I want to first apologize for the continued frustration this incident is causing.

Our Chief Technology Officer Sri Viswanath has posted a blog about this incident with more background and details, and our team is working around the clock to move through the various stages for restoration.

Our Statuspage has regular updates, but we want to go beyond that and address questions that you want to raise. Please add your questions here and we will respond as quickly and transparently as we can. Some questions may not be answered until we do an official PIR, but we will let you know that and answer as much as we can now.

While we really want to get in front of you live to answer your questions, we are prioritizing getting customers up and running first and foremost. We will host an AMA (Ask Me Anything) after we get all of our customers fully restored.

Thank you for your continued patience.

7 answers

14 votes
Shane Doerksen April 12, 2022

Hello Atlassian and Community,

First, thank you to Atlassian for finally releasing clear, specific information on the causes of this problem, and a realistic estimate of when this will all be behind us. If this had been done in the first 36 hours or so, when it became apparent that this was going to be a prolonged outage, it would have done wonders to reduce the rage and frustration that has built up in the community over the last week.

Atlassian is a strategic partner for its customers. Feeling ignored at the same time as we're being incredibly let down by a core partner (and eviscerated on a daily basis by our stakeholders) is a recipe for bitterness.

I'm sure there were talks in the executive team with legal and marketing and management from every corner, and maybe one of the approaches was to try to keep this out of the tech news by not talking about it. This was a mistake.

OPINIONS

1) I respectfully but very strongly disagree with the suggestions being made that there should have been more effort to direct us to the status page ("us" being irate customers). The updates being made on that page were so generic as to feel almost deceptive. There were a thousand words about nothing, with no information for days about the potential scale or duration of the problem. This avoidance of clear communication did more to erode trust in the Atlassian brand than the outage.

The non-information on the status page is often what led customers to post in the forums in the first place, and it was the generic non-answers eventually posted by Atlassian staff which made us think that either Atlassian didn't understand or didn't care about the level of pain they were causing our organizations.

A good example of what should have been done is Mr. Viswanath's blog post. It contains exactly what affected customers need to hear. Not all of this information would have been available in the early days of the outage, but having an honest, direct description of the problem closer to the beginning of the situation would have been a massively better decision than the strategy of generic corporate deflection and minimization.

2) Most of the people who interact with Atlassian products are technical stakeholders. We're site admins and technical contacts and power users and developers. The people who pay for the Atlassian subscriptions in our organizations are not. Leaving us, the IT people who recommend your solutions, to answer for your outage when you couldn't be bothered to give us the most basic information about the incident, is just unforgivable. This was a slap in the face to the people who advocate for your products in our organizations.

3) Having the support request form refuse to accept requests without a valid cloud URL was just pouring tabasco sauce in our open wounds. This is another reason why people went to the forums for support -- we couldn't reach Atlassian any other way.

4) I respectfully disagree with Mr. Viswanath's characterization of one of the critical problems as a "faulty script". This is incorrect, as his very next sentence says "The script was executed with the wrong execution mode and the wrong list of IDs." In other words, the script did exactly what it was told to do. The problems are a) that a person or group told it to do the wrong thing; and b) that no other person or group verified the action that was to be performed before it was executed on production.
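To make the two gaps named in point 4 concrete (a wrong instruction, and no second check before execution), here is a minimal sketch of the kind of guard a destructive script could carry. It is purely illustrative and says nothing about how Atlassian's script actually worked: `run_deletion`, `Mode`, `GuardError`, the `approved_ids` allow-list, and the reviewer rule are all hypothetical names and assumptions.

```python
from enum import Enum
from typing import Iterable, List, Optional, Set


class Mode(Enum):
    DRY_RUN = "dry-run"   # report what would happen, change nothing
    EXECUTE = "execute"   # actually perform the deletion


class GuardError(Exception):
    """Raised when a safety check fails before anything is deleted."""


def run_deletion(
    ids: Iterable[str],
    approved_ids: Set[str],
    mode: Mode = Mode.DRY_RUN,       # safe default: nothing is deleted unless asked
    operator: str = "operator",
    reviewer: Optional[str] = None,  # hypothetical second person who checked the list
) -> List[str]:
    requested = list(ids)

    # Check a): refuse any ID that is not on the separately approved list.
    unexpected = [i for i in requested if i not in approved_ids]
    if unexpected:
        raise GuardError(f"IDs not on the approved list: {unexpected}")

    if mode is Mode.DRY_RUN:
        return [f"would delete {i}" for i in requested]

    # Check b): execute mode requires a reviewer other than the operator.
    if reviewer is None or reviewer == operator:
        raise GuardError("execute mode requires a second reviewer")

    # The real deletion would happen here; this sketch only reports it.
    return [f"deleted {i}" for i in requested]
```

Calling `run_deletion(["a"], {"a", "b"})` returns `["would delete a"]` and touches nothing. A destructive run needs both an explicit `mode=Mode.EXECUTE` and a distinct `reviewer`, and any ID outside the approved list stops the run before it starts.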

QUESTIONS

1) Why wasn't a statement like Mr. Viswanath's blog post published within the first 2 days of the outage?

2) Why weren't Atlassian staff active in the forums trying to get ahead of this? I was in several threads on this issue and, in the beginning, there were days between staff activities. Leaving us twisting in the wind, in silence and ignorance beyond the anodyne corporate-speak on the status page, took the situation to an eleven very quickly.

Even just appearing quickly after a comment or question was posted to say "We're very sorry, we're throwing everything we have at getting you back online but we don't have solid information yet; we will keep you posted" would let us know that you were at least paying attention. That silence in the first few days was the worst.

3) Why wasn't the deletion script run on a non-production environment first? It is terrifying to think that your maintenance engineers are shooting from the hip on our production environments.

4) Why didn't Atlassian follow its own recommendations on disaster recovery, specifically regarding having a runbook with communication plans?

https://confluence.atlassian.com/enterprise/disaster-recovery-guide-for-jira-692782022.html

https://confluence.atlassian.com/enterprise/confluence-data-center-disaster-recovery-790795927.html

We understand that this was an unanticipated technical situation, but are there not standards for how to communicate during outages, and what thresholds merit "lights and sirens / man the battle stations" communication efforts?

FINAL ANSWER

We all understand that this was a black swan event. Nobody can predict everything.

We also understand that it happened at a 72 billion dollar company that specializes in IT service management and literally wrote the book on disaster recovery procedures for its products.

This wasn't a technical failure. It was a failure of IT governance.

Scott Farquhar
Atlassian Team
April 12, 2022

@Shane Doerksen you have highlighted the problems correctly.  We should have been more open earlier.

A combination of many execs and me being at our annual Team Conference, alongside COVID taking out many on our communications team, has made things more difficult than normal. No excuses, but as I try to wrestle with why we have ended up here, they factor in.

I do want you to know that at no time were we trying to be heads down, hoping it would blow over.  Hanlon's razor instead.

Your questions are valid, and I will ensure most if not all are answered in the full public PIR.  At the moment we are heads down trying to fix our customers.

I appreciate that you and other customers hold us to the high standard we have failed to meet in this case. We need people like yourself in order for us to be better.

Shane Doerksen April 13, 2022

Hello @Scott Farquhar,

Thank you for your response and the effort your team is putting into clearing the air. These are important steps for rebuilding trust.

We appreciate them.

10 votes
Nic Brough -Adaptavist-
Community Leader
Community Leaders are connectors, ambassadors, and mentors. On the online community, they serve as thought leaders, product experts, and moderators.
April 12, 2022

I saw two issues here:

Communication was weak, despite the push to Statuspage (which was the right thing - that's what Statuspage is for!).  Too many people were asking questions instead of looking at Statuspage.  Why are we not referring people there?

There is no explanation of what has gone wrong beyond some vague mention of some scripting.  Some of the audience don't care past "it isn't working today".  Others would be more forgiving if there was some expansion on the TLDR story.

Stephen Deasy
Atlassian Team
April 12, 2022

Hey Nic, thanks for the feedback, agree with you. On the details, I updated the text above to link to our just-published blog that has that deeper dive.

5 votes
Darryl Lee
Community Leader
April 12, 2022

Sri wrote a great post. Albeit a touch late.

As I mentioned to @Monique vdB in person during the Community Leader sessions at Team 22, as the Community is the only support for free tiers, and often the first or second place paying customers go for support, you could and should have properly redirected several irate customers to the Statuspage.

Instead, Community Leaders and your own staff were stuck running PR for you.

This is not right.

You need a mechanism to put a banner at the top of the Community site. I would also suggest that the banner atop the "Contact Us" page is insufficient. For a major, multi-day outage like this (even as it only affects an overall small percentage of users), you have to suck it up and put it on the PRIMARY https://support.atlassian.com/.

Darryl Lee
Community Leader
April 12, 2022

@Shane Doerksen You're absolutely right. The Statuspage updates were woefully inadequate. But my main point was that there should be a centralized location for authoritative information, and for better or worse, Statuspage is that place.

Agree completely that they should have had a more technically detailed message up and linked to from the Statuspage updates, many of which were simply repeated every several hours.

Instead technical details about this outage "leaked out" through Twitter and Reddit posts where affected customers would copy and paste what Support was telling them.

Shane Doerksen April 13, 2022

I definitely agree with you @Darryl Lee. The status page should be the authoritative source of information. I think that's part of what led to such confusion and frustration, as that page basically said "something happened, meh" over and over again. As you described, in this case we were piecing bits of information together from various sources, which just wasn't great.

Stephen Deasy
Atlassian Team
April 13, 2022

Totally understand the frustration, we know this did not help. Part of our incident review will be going through the status updates and comms too, so this feedback will be a valuable input to that.

3 votes
ahawse April 14, 2022

I am the CIO of one of the impacted companies. We have received 0... 0 ... 0 ... information about anything. As a growing public company I wonder what our next steps are. This is difficult to decide as you guys have literally given us 0 information. We have been in the process of increasing our support, but at this point we can't even get a sales contact to return our calls.

 

It seems like it would be possible to make an estimate about when you might get to our instance. I suppose it would be nice if I could tell my user community something.

 

Alan

Stephen Deasy
Atlassian Team
April 14, 2022

Hi Alan, firstly I apologize for not being in touch. We reached out to technical contacts and site admins for every impacted customer, as well as opening a support ticket automatically for them. We did learn that some customers didn't receive that for various reasons. If you do not have access to your open ticket, please contact us through our Support Service Desk (choose the Billing, Payments, & Pricing options from the drop-down menu): https://support.atlassian.com/contact/#/

From there we will pick it up and be in touch very quickly. If that doesn't work for any reason let me know here and I will get it sorted out.

Shane Doerksen April 14, 2022

Hi @ahawse .  I'm very sorry to hear about your situation. I can't make any predictions about when they will get to your site, but I can relay some of our information and experience with this outage.

First, there is a lot of information in this thread where some community members have shared their interactions with support:

https://community.atlassian.com/t5/Jira-Service-Management/Any-updates-on-current-Atlassian-status-https-jira-service/qaq-p/1994038

The last update above says that they have restored service to 49% of the affected customers at this point. We're not one of them.

We received an update on Sunday that said our site was beginning to be restored, then we received an update yesterday from Nathan Spells in Atlassian who said that our site is in the data verification stage. We still don't have a concrete ETA yet. We are expecting that sometime in the next 1 to 3 days our site will be available, but there's no confirmation on that from Atlassian.

Atlassian has said that it will take 4 to 5 days from the time that a customer receives a notice saying they are beginning the restoration process to the time that the site is released to the customer for final testing and verification.

So I think it is reasonable to say that, if you have not yet received any communication that your site is being restored, your outage will last at least another 4 to 5 days. Again, I can't speak concretely for Atlassian, but this is what they have written in other threads and posts.

I hope that your site is restored soon, and best of luck with everything.

Stephen Sifers
Atlassian Team
April 14, 2022

Hello @ahawse

Just confirming we've heard your issue regarding the lack of updates. I've personally added you to your site's support request per this incident and you should have an email with the details.

If you have not yet received that email please do let me know.

Regards,
Stephen Sifers | Product Lead, Community

1 vote
Mikael Sandberg
Community Leader
April 12, 2022

For the affected customers (I’m one of them), will there be any compensation for the outage, since you fell short of your 99.9% uptime promise?

Leslie Lee
Atlassian Team
April 12, 2022

Hi @Mikael Sandberg - Outreach, thanks for asking the question. I know it's on a lot of our customers' minds. Our top priority right now is recovering our customers’ sites to full functionality and as we do so, affected customers will not receive a bill from us in the short term. Following our efforts to restore, we will reach out to each of our affected customers to discuss how we can make things right in the long term. Post incident, we will also be conducting a detailed review of all of our processes in a complete post incident review (PIR) with an overview made available to all affected customers. This will help ensure we're delivering the service and standard you'd expect from us.

0 votes
melkaoui taha April 20, 2022

Thank you for your service!

0 votes
Chihara
Rising Star
Rising Stars are recognized for providing high-quality answers to other users. Rising Stars receive a certificate of achievement and are on the path to becoming Community Leaders.
April 12, 2022

Atlassian,

Will you translate Sri Viswanath's blog into other languages as soon as possible?

Leslie Lee
Atlassian Team
April 12, 2022

Yes, we will. I'll keep you posted on ETA. 

Ai Hirama
Atlassian Team
April 12, 2022

Hi @Chihara

The Japanese version localized by our team has been published here: https://www.atlassian.com/ja/blog/april-2022-outage-update

Leslie Lee
Atlassian Team
April 14, 2022

Hello! Happy to report that Sri's blog post has been translated into Japanese, German, Spanish, French, Portuguese and Russian. The translated versions can be found at the bottom of the blog here: https://www.atlassian.com/engineering/april-2022-outage-update

Chihara
Rising Star
April 14, 2022
Elizabeth Barr
Atlassian Team
April 29, 2022

@Chihara This review was posted today.

Chihara
Rising Star
May 1, 2022

Elizabeth,

Thank you.

We will talk with our customers about 

https://www.atlassian.com/engineering/post-incident-review-april-2022-outage

Regards,
