Any updates on current Atlassian status - https://jira-service-management.status.atlassian.com/inci?

Olimpia Estela Cáceres-Brown April 5, 2022

We have been unable to access our instance or our sandbox since late Monday EST.

Any updates besides the ones showing on the Atlassian status page?

Thanks,

Olimpia Estela

 

40 answers

1 accepted

6 votes
Answer accepted
Shane Doerksen April 15, 2022

Hi Everyone,

In an Easter miracle of Biblical proportions, our site has been resurrected!

Here is what Atlassian support have sent us:

We have successfully completed site restoration! Please follow the steps below to get started.

Step 1: Validate your site

Please validate your site and respond to this ticket with any outstanding issues or concerns and we’ll make it our priority to work through them with you. Our engineering and support teams are standing by to address any additional issues.

Step 2: Enable mail

Note that your incoming and/or outgoing mail will be disabled at this point. Once you have thoroughly validated the site, you can enable it by going to the link below:

Outgoing mail - https://<YourSite>.atlassian.net/secure/admin/OutgoingMailServers.jspa
Incoming mail - https://<YourSite>.atlassian.net/secure/admin/GlobalMailSettings.jspa

If you would prefer us to turn this functionality back on for you, please reach out to us via this support ticket.

Step 3: Sync groups

If you have user provisioning configured at your Organization you might need to resolve potential conflicts with syncing groups to ensure the changes from your Identity Provider during the outage get reflected in your site. To achieve that, please follow the instructions below:

Access https://admin.atlassian.com/

Select your organization

Navigate to Directory > User provisioning. If you don't see User provisioning under the Directory tab, go to Settings and it should be there.

You might see a banner asking you to resolve the syncing conflicts. Banner will show: <group count> groups pending sync
(If you don't see a banner, no action is needed.)

Click on ‘Review groups before sync’.

Expand the groups to see what will change (who will be added/removed from that group). You can always add/remove users later at the Identity Provider administration dashboard.
This process will link these groups to the ones at your site(s) and allow you to add/remove people based on what we receive from your Identity Provider.

In case it is useful for other customers, here is the timeline of events regarding our restoration:

  • Site went down on April 5th.
  • Support ticket created from community posts: 06/Apr/22 6:28 PM
  • Site entered restoration phase: 10/Apr/22 10:24 AM
  • Site entered validation phase: 14/Apr/22 1:53 AM
  • Site restored and delivered to us for final verification and testing: 15/Apr/22 2:28 PM
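
For anyone who wants to compare their own restoration against ours, the gaps between these milestones can be worked out with a short script. This is only a minimal sketch: the timestamps are copied from the list above, all four are assumed to be in the same time zone (the ticket history did not state one), and the names `milestones`, `hours_between` and `phase_durations` are made up for illustration.

```python
from datetime import datetime

# Timestamps copied from the timeline above. The time zone is an assumption:
# all four are treated as being in the same (unspecified) zone.
milestones = [
    ("Support ticket created", datetime(2022, 4, 6, 18, 28)),
    ("Entered restoration phase", datetime(2022, 4, 10, 10, 24)),
    ("Entered validation phase", datetime(2022, 4, 14, 1, 53)),
    ("Delivered for verification", datetime(2022, 4, 15, 14, 28)),
]


def hours_between(start, end):
    """Elapsed time between two datetimes, in hours."""
    return (end - start).total_seconds() / 3600


def phase_durations(events):
    """Pair each milestone with the next one and return (from, to, hours) tuples."""
    return [
        (a_name, b_name, hours_between(a_time, b_time))
        for (a_name, a_time), (b_name, b_time) in zip(events, events[1:])
    ]


for start, end, hours in phase_durations(milestones):
    print(f"{start} -> {end}: {hours:.1f} h ({hours / 24:.1f} days)")

total = hours_between(milestones[0][1], milestones[-1][1])
print(f"Ticket creation to delivery: {total:.1f} h ({total / 24:.1f} days)")
```

On these numbers, the ticket sat for a little under four days before restoration started, and the whole process from ticket creation to delivery took 212 hours, or a little under nine days. Swap in your own timestamps to see how your site compares.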

We are beginning to test the site and verify that everything has returned to its proper place. At first glance, it seems ok, but we are going through it in detail.

Thank you to everyone in this thread for sharing information about this event. It has been very helpful for our organization.

Best of luck to everyone, and thank you all once again.

Eric D. April 15, 2022

Congrats on your site restoration.  Thank you for providing the timelines as well.  We have not heard anything on our site entering the restoration phase, so I'm guessing we are still at least 5 days away from possibly having our data/site restored, but very much appreciate you documenting and sharing your journey.

 

And as you so elegantly stated:

"May your instance always be online and your data always be backed-up."

Karla April 15, 2022

Our site was restored this morning and we also received the same message. Very similar timeline as mentioned by @Shane Doerksen 

Thanks for continually posting the updates!

Olimpia Estela Cáceres-Brown April 15, 2022

Excellent, @Shane Doerksen. We're in line with you; we're almost ready to open the site to all MIT Libraries staff. This posting really made a difference!

Happy Easter! Happy Passover! Happy Ramadan!

Eric D. April 15, 2022

Awww... I got a lump of coal in my stocking this week.  Maybe next week will be better, but congrats to everyone else, maybe go pick up a lottery ticket.  But please keep sharing your experience/issues.

 

Oh! An update. So my original ticket was closed with the notification that it was a duplicate of one that my manager had opened, but that I was listed as a participant in my manager's ticket. He got a notification earlier that we are queued for restoration now (although I never received that notice). So, by next Friday hopefully we will be fully armed restored and operational.

Leslie Lee
Atlassian Team
Atlassian Team members are employees working across the company in a wide variety of roles.
April 15, 2022

Hallelujah! Thanks for following up with an update. Happy Easter.

Like Stephanie Grice likes this
Eric D. April 16, 2022

Ok, the Easter bunny exchanged that lump of coal for a golden egg. My manager got the notification this morning that our site was restored and ready for verification (still weird that I didn’t get the notification too even though my “duplicate ticket” was closed and they said I’d be added as a participant which I assume meant I’d receive updates as well).

But at a cursory glance, I can indeed access our tickets and things appear to all be there. 

Woohoo. Monday morning will be spent adding in all the new tickets that we’ve been tracking by hand. 

Stephen Sifers
Atlassian Team
April 16, 2022

@Eric D_ 

Very happy to hear your site is ready for verification!

I've also shared your support request with you again and ensured you're a participant. If you're still not getting updates or notifications please let me know so I may resolve that.

Have a great weekend!

Regards,
Stephen Sifers

Eric D. April 16, 2022

Thank you. We are looking forward to getting back to a more normal service desk and experience and this outage has certainly been eye opening. 

I mostly mentioned it because I felt that, as a participant, my not getting notifications wasn’t the outcome you intended, and I wanted to share that in case others were having the same experience. I also had no way of contacting support other than through my manager's emails or through the forums, since my ticket was closed and I wasn’t on the remaining open ticket.

We will now review our Service Management, Confluence, and Insight data for any missing items and reply via email once our verification is done.

Shane Doerksen April 16, 2022

That's great news @Eric D_ . Glad to hear that the world is slowly returning to normal.

 

And thank you to @Stephen Sifers  for monitoring this thread and helping everyone out. We definitely appreciate it.

Like Stephen Sifers likes this
Eric D. April 26, 2022

Did anyone else just experience a BLIP where they couldn't access Service Management? I got the scary "Please try back in 5 - 10 minutes" screen, and after refreshing I got the side menus and could see the work queues, but I was unable to see any of my tickets. Things seem to be working right now, and the blip lasted only maybe 5-10 minutes, but it definitely caused some flashbacks. Co-workers across the country experienced the same issue.

Shane Doerksen April 26, 2022

Hi @Eric D_ . We didn't see anything like that here. That's definitely a bit stressful, though. Hopefully it was just a hiccup.

Karla April 26, 2022

Hi @Eric D_,

 

We also did not see anything like that. Understandably stressful for you. Hope it was just a blip.

Olimpia Estela Cáceres-Brown April 26, 2022

Hi @Eric D_ ,

I hope the BLIP has been resolved for you by now.

We are not experiencing any issues.

Best wishes,

Olimpia Estela

Eric D. April 26, 2022

Yes. It was only for about 5 minutes this afternoon but the panicky part was seeing that all too familiar “try again in 5 to 10 minutes” screen that was displayed during the outage earlier this month. But since that blip this afternoon everything seems normal again. 

6 votes
Shane Doerksen April 6, 2022

Hello Mr. Sifers,

Thank you for replying in the forum; unfortunately your reply doesn't provide us with much information.

At my company, we're in the same position as @Karim Abrik and @Ulf Sahlin and many others. We can't even interact with support because the support form cannot find our site URL (presumably because of this incident).

Yes, you've sent out an email to the affected customers, but that email said that recovery would take several days and it strongly suggested that there was the potential for data loss. It very pointedly did not confirm that the data was safe. We're already at the point where the downtime is measured in days instead of hours.

You can imagine that this has caused a tremendous amount of stress for the affected customers. The updates on the status page are sporadic and devoid of anything that would give a hint as to an ETA.

The way these updates are phrased, as if to be intentionally oblique, also ratchets up the stress. Whenever people talk around an issue instead of directly about it, one starts to think that they aren't giving the unvarnished facts.

All of your customers work in IT. We can see that this is a gigantic screwup somewhere along the line in Atlassian. Ok, stuff happens to the best of us. Nobody is perfect. Even big screwups stay a minor irritation if they are resolved quickly and communicated clearly.

But, given the length of the outage and the way we're being communicated with, this is looking like something that is even more serious than it already appears.

Your customers are pretty unhappy right now.

We're unable to contact support, we're unable to access our instances, the communication from Atlassian has been terrible (although sending the email was a good idea) and we all have our own teams to answer to about this.

We would appreciate it if we could get at least a solid confirmation that no data has been lost and an estimate of when we will have access to our instances again.

We need to communicate that information to our stakeholders, and it is very reasonable for us to expect you to communicate that to us.

If you don't have that information, then please just say "We don't know if any data has been lost and we don't know when or if your site will return to normal."

Thank you again for replying. We do appreciate it.

6 votes
Ulf Sahlin April 6, 2022

Engineers tend to (only) focus on solving the problem, which sadly means missing informing the customer about what's going on. This is a big mistake.

Example: imagine you're on a train from city A to B. Suddenly, the train stops in the middle of nowhere. No public information announcements whatsoever. Twenty minutes later, the train starts rolling again and you arrive at your destination.

Now imagine the same stop scenario, but as the train stops in the middle of nowhere, the train driver/engineer gets on the speaker and informs everyone she's got a red light ahead and needs to call dispatch. Throughout the unscheduled stop, the engineer keeps you informed every ten minutes about what's going on. Thirty minutes after the unscheduled stop, the train starts rolling again and you arrive at your destination.

Which of the above two scenarios would have you fuming as you arrive? I would much prefer a 30-minute delay while being fully informed to a 20-minute delay with no information given.

ATLASSIAN, PLEASE LEARN FROM THE ABOVE.

5 votes
Shane Doerksen April 7, 2022

It is incredibly discouraging that, after this many days and this many posts requesting some kind of real information, nobody from Atlassian has responded since Mr. Sifers' non-answer above.

Today is April 7th. Our sites have been inaccessible since April 5th.

This is the third day of downtime on the products that we use in order to *minimize* our downtime.

The level of disinterest Atlassian is displaying towards its paying customers is hard to comprehend.

You accidentally deleted our sites, might not be able to restore our data, and can't even be bothered to pop into the forums twice a day to say "We're still here, we understand your frustration and we haven't forgotten about you."

We're left to shout into the darkness and commiserate with our fellow customers who have had the rug pulled out from under us.

When problems like this occur, reputational damage control is just as important as technical damage control. If, as the response below indicates, you have hundreds of engineers working on this problem, how do you not have a support team proactively trying to communicate with the affected customers, in the forums or directly?

How do we continue to advocate for using Atlassian products? What should our response be to the executive teams and clients who depend on our recommendations for software solutions?

At least BitBucket isn't down, so we can continue with some development, but how do we answer when our organizations ask "How long until this happens to BitBucket, too?"

Somehow, Atlassian has a market cap north of 70 billion dollars but can't spare anyone to proactively communicate with its customers, even in a worst-case scenario outage such as this.

I have no doubt that I speak for everyone in this thread when I say that we are more than incredibly frustrated here; we are angry.

Three days, and we don't have real information about what happened.

We know that the original statement was not honest or complete because, if our sites were just disabled, they would have simply been re-enabled and this would have been a blip, not a three-day outage.

We know that Atlassian's own statements have gone from "We don't believe any data has been lost at this point." to "We are working on minimising any potential data loss."

We know that we've had to spend the last three days answering questions from our stakeholders asking us "What happened to the service desk requests?" and "Where are our Confluence documents that we've spent three years accumulating?"

Nobody has ever heard of a publicly-traded, industry-standard company in Atlassian's market position taking its core products offline for three days due to human error. It is quite literally unbelievable for IT service managers to hear this.

How do we rely on Atlassian for incident management software when this is how you handle incidents?

Why are we paying for a managed solution when you don't have automated disaster recovery procedures?

Someone has created a support ticket on our behalf now, which is great. For anyone who has not received such a ticket, I will include it below. It is not exceptionally helpful.


Hi Team,

We’re sorry for the continued frustration this incident is causing. We are continuing to move through the various stages for restoration. The team is currently in the verification stage on a subset of instances. Successful verification will then allow us to move to reenabling those sites or identify any other steps needed for restoration.

Once reenabled, support will update accounts via opened incident tickets. Our efforts will continue 24x7 through this process until all instances are restored.

This is our top priority and we have mobilised hundreds of engineers across the organisation to work around the clock to rectify the incident. These restoration efforts may be visible on your sites today, however please wait until support notifies you when the site is fully available and works with your teams to confirm the recovery.

The restoration is expected to continue over a number of days. We are working to reinstate access to all products with the priority on our key products. We are working on minimising any potential data loss.

We can confirm this incident was not the result of a cyberattack and there has been no unauthorized access to your data.

As we work to restore access you can look to us to continue to provide updates:
Every 3 hours, or sooner if we have a material update, at: http://status.atlassian.com

Direct contact via support tickets once we’re able to reinstate your access and your site becomes usable

You can continue to reach out to us at https://support.atlassian.com/contact for any questions, concerns or updates. If you have any issues opening a technical support ticket please open a billing question ticket and we will transfer it into the support teams.

Regards,
Daniel Soo

 

5 votes
Eric D. April 6, 2022

Have the same problem. Can't find our site, but then there is absolutely NO OTHER WAY TO CONTACT THEM!!! And as previously mentioned, their updates are completely useless. Just a repetition of the last update with someone using a thesaurus on a couple of words, but not actually providing useful information on timeframes. And the link to "Reach out to us if you have any questions or comments" dumps you right back to the same useless web page that we can't get past because they can't find our Atlassian URL. Hoping someone from Atlassian sees this and reaches out to us, since we have no way to "Reach out to them".

4 votes
Shane Doerksen April 10, 2022

Hi Everyone,

We received an update on our support ticket this morning that said they are ready to begin restoring our site. I've included part of the update below, as it contains information that might help others plan their return-to-service processes.

I've put the last bit in bold, as that's going to be quite important for some organizations:

Once your site is restored, we are looking for your input on a preferred path forward:

- Option 1 (ALL users): In selecting this option, Atlassian will provide access to your site within hours of your selection confirmation. All of your users will be enabled when the site is restored.

- Option 2 (LIMITED users): In selecting this option, Atlassian will work to restore your site to a limited number of users. This will allow you to perform testing before opening access more broadly. If you select this option we will need you to send us a list of email addresses that you wish to have access when we restore the site.

While we have completed validation of Atlassian owned connect apps, please note that third party apps may not be functional. We will work over the coming days to restore these apps. While Atlassian product data will be safe, we can’t provide guarantees about third party app data.

Olimpia Estela Cáceres-Brown April 10, 2022

@Shane Doerksen Congratulations! You're lucky. Thank you for sharing this important information about what to expect when our site is restored.

Like Shane Doerksen likes this
Shane Doerksen April 10, 2022

Hi @Olimpia Estela Cáceres-Brown . You're quite welcome, and we should all say thank-you to you for starting this thread.

Just as some additional information, we received the update that they were ready to begin restoring our site at 10:36am.

This is what they originally wrote:

"We’ve started the process of site restoration. We are able to move forward with a limited restore of your Atlassian Cloud site. Please note that we don’t yet have an ETA for when this restore will be completed."

They also requested confirmation of which option we preferred (i.e., activate all users or a subset).

I consulted with my colleagues and then replied at 4:26pm that we wanted all users activated.

Atlassian support replied at 9:20pm and said that they would update me again when they had more information.

(all times GMT)

So it has been 16 hours since they first told us that they were starting to restore our site, our site is still down and we don't have an ETA. Our instance is also neither huge nor complex compared to many organizations.

I mention this just as information for other customers, as I'm past the point of complaining about this. Well, almost past the point :)

If I get more information, I will post it here for the group. Thanks.

Shane Doerksen April 12, 2022

Hi Everyone,

Just another quick update that our site is still not back.

It has been 51 hours since we first received an update to our support ticket that said they were beginning restoration.

Nothing has changed and we do not have any kind of ETA.

According to https://opsgenie.status.atlassian.com/ they have restored another 5% of users since yesterday, so it looks like that 4% to 5% per day rate is probably what they're going to maintain. Unless they accelerate that significantly, some people are likely looking at almost 2 more weeks of downtime.

Karla April 12, 2022

Same boat. This is absolutely ridiculous!

Karla April 12, 2022

In this blog post,

https://www.atlassian.com/engineering/april-2022-outage-update

Atlassian says "two critical problems ensued:

Communication gap. First, there was a communication gap between the team that requested the deactivation and the team that ran the deactivation. Instead of providing the IDs of the intended app being marked for deactivation, the team provided the IDs of the entire cloud site where the apps were to be deactivated.
Faulty script. Second, the script we used provided both the "mark for deletion" capability used in normal day-to-day operations (where recoverability is desirable), and the "permanently delete" capability that is required to permanently remove data when required for compliance reasons. The script was executed with the wrong execution mode and the wrong list of IDs. The result was that sites for approximately 400 customers were improperly deleted."


In my opinion, they need to add at least one more "critical problem" to that list: the script wasn't first run against a test environment, with evaluation and comparison afterwards to confirm that all customers and expected products were still active.
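
To illustrate the kind of safeguard being described, here is a hypothetical sketch. It is not Atlassian's script, and every name in it (`Mode`, `Target`, `run_deletion`, the `"app"`/`"site"` labels) is invented for illustration. It assumes each ID carries a label saying what kind of object it refers to, defaults to the recoverable mode, defaults to a dry run, and refuses to proceed if the list contains IDs of the wrong kind.

```python
from dataclasses import dataclass
from enum import Enum


class Mode(Enum):
    SOFT_DELETE = "mark_for_deletion"    # recoverable, for day-to-day use
    HARD_DELETE = "permanently_delete"   # not recoverable, for compliance


@dataclass(frozen=True)
class Target:
    id: str
    kind: str  # hypothetical label: "app" or "site"


def run_deletion(targets, expected_kind, delete_fn,
                 mode=Mode.SOFT_DELETE, dry_run=True):
    """Validate the ID list, then delete only if dry_run is explicitly off.

    delete_fn is a caller-supplied callback taking (id, mode); it stands in
    for whatever the real deletion call would be.
    """
    wrong = [t.id for t in targets if t.kind != expected_kind]
    if wrong:
        # Guards against the "communication gap": site IDs in an app run.
        raise ValueError(f"expected only {expected_kind!r} IDs, got: {wrong}")

    ids = [t.id for t in targets]
    if not dry_run:
        for target_id in ids:
            delete_fn(target_id, mode)

    # The report can be reviewed, or compared against a test environment,
    # before anyone reruns with dry_run=False.
    return {"mode": mode.value, "dry_run": dry_run, "ids": ids}


# Usage: the default call changes nothing and only reports what it would do.
apps = [Target("app-1", "app"), Target("app-2", "app")]
print(run_deletion(apps, "app", delete_fn=lambda i, m: None))
```

With defaults like these, both the permanent mode and the actual deletion have to be requested explicitly, and a list of site IDs passed to an app clean-up fails before anything is touched.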

Eric D. April 13, 2022

So our group just had our conversation with Trisha and Diane from Atlassian, and here are my takeaways.

Once we are notified that our data is being restored, all of the components will be restored at the same time (Confluence, Service Management, and Insight in our case).  Once notification occurs it will take roughly 4 to 5 days before we are able to access our site. 

Due to how the data was deleted, the site and URL were also deleted. The site and URL will be recreated/restored, and we were told that we are on their radar even though our URL is no longer visible.

Atlassian is restoring in batches of 60 customer sites, but they are unable to tell us what batch we are a part of or when to anticipate the initial email stating that recovery of our site is beginning.

The restoration is a manual process, and they are re-adding the deleted sites into the data stores of the unaffected sites. Asking about the possibility of restoring all 400 affected sites to their own data store to get users up quicker did not yield any answers.

So for now, we will wait until our ticket number is called (an email stating data restoration has begun), and then we will be able to head up to the counter to wait the 4 to 5 days until we can begin to access/verify our data.

4 votes
Ulf Sahlin April 9, 2022

Today's support letter below.

Finally we are getting some progress metrics! This is great news.

----

I'm writing to give you an overall incident update. Our team is working 24/7 to progress through site restoration work. At this point, we’ve restored core functionality to 23% of impacted active users and those customers have been notified. Product databases for all other customers are queued up for restoration, which will continue into next week.

We’ve taken a careful and considered approach in the early stages of this restoration process, with the aim of accelerating the restoration process from here.

This incident is our #1 priority. We have mobilized hundreds of engineers who are working around the clock to recover the remaining sites. When your site restoration has started, we will directly notify you via this support ticket.

Ulf Sahlin April 10, 2022

Sunday Update: a delta of 8% since yesterday.

At this eight-percent-per-day speed, and with roughly thirty percent restored so far, you can assume the final people being brought back online will have to wait another 8-9 days.

That makes for a neat two-week service interruption. I have NEVER heard of an IT service provider that has a two-week MTTR as a reasonable level of service. Obviously, there are no disaster recovery plans in place at Atlassian. This is VERY worrying.
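
The 8-9 day figure is a straight-line extrapolation, and it can be sketched as below. The 23% and 31% figures are the ones quoted in the two support letters; the function name `days_remaining` and the assumption that the rate stays constant are illustrative only. Note also that Atlassian's percentages count impacted users, not sites, so this is a rough guide at best.

```python
def days_remaining(restored_pct, pct_per_day):
    """Days until 100% is reached, assuming a constant linear restoration rate."""
    if pct_per_day <= 0:
        raise ValueError("rate must be positive")
    return max(0.0, (100.0 - restored_pct) / pct_per_day)


# Figures from the support letters: 23% on April 9, 31% on April 10.
rate = 31 - 23  # percentage points restored per day
print(f"About {days_remaining(31, rate):.1f} more days at {rate}% per day")
```

If the daily rate drops, the estimate stretches accordingly, so it is worth recomputing with each new status update.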

-----

We want to share the latest update on our progress toward restoring your Atlassian site. Our global engineering teams are continuing to make progress on this incident. At this time, we have restored core functionality to 31% of impacted users. The current stage of restoration is particularly complex due to several steps that are required to validate sites and verify data. These steps require extra time, but they are critical to ensure the integrity of restored sites.

Like Shane Doerksen likes this
4 votes
Shane Doerksen April 8, 2022

Hi Everyone,

Yesterday evening I had a direct conversation with Mr. Sifers, who had responded above, and with Mr. Spells in escalations at Atlassian.

They were quite open about the causes of the current problem, and what Atlassian is doing internally to rectify the situation.

I was asked not to paraphrase what they said (it honestly wasn't much more than the email sent by Mr. Farquhar; it was just a bit less corporate-speak), but I strongly recommended that they participate in the community discussions around this. I can definitely confirm that Mr. Sifers is in this forum and has read everything that has been said. I have no explanation as to why he doesn't actively participate, and, believe me, I asked.

I can also say that both Mr. Sifers and Mr. Spells were very informative, professional and courteous with me. They both seemed to genuinely care about the situation. Mr. Spells in particular went above and beyond to answer my questions and I respect the way he handled the interaction. He was exceptionally accommodating and pleasant.

I don't have any more information about return-to-service times than anyone else, and all the questions I articulated above still stand, but at least there was an attempt to communicate with an affected customer.

Further, as much as the support team is steadfastly refusing to communicate in any kind of normal way publicly about this, my impression is that they legitimately do not have answers much more helpful than the ones they've already given. I just don't think that anyone there has concrete information yet about when this will all be over.

Again, I strongly disagree with the corporate decision to let people languish in the forums without official responses -- in a way, it's almost worse to know that the powers-that-be can see your frustration and are choosing to do nothing.

For now, I will say that I appreciate both Mr. Sifers' and Mr. Spells' time and explanations very much, and I will reiterate how professional they were with me. I could see they were both feeling the weight of a very heavy week.

I'm still frustrated, I'm still angry that all of our projects are at a standstill because our project management platform ate our site, I'm unhappy that they've chosen to stay silent in the forums...there's no satisfaction to be found yet.

I'm writing this post so that maybe knowing Atlassian is actually reading this thread will give a tiny spark of solace to my fellow customers who have been burned by this the same as me.

There will be hard questions for Atlassian to answer at the end of this, and I am not sure that my company will even continue using their services -- once bitten and all that.

We pay for a managed solution from one of the world's largest, most trusted providers of IT service management software, and they have let us down catastrophically. It's hard for me to go to stakeholders and pair that information with the sentence "but I'm sure it won't happen again."

This situation has made everyone in my IT department question the disaster recovery procedures in place at Atlassian, across all their products. It's a very large and complex platform, and the fact that they don't have automated, tested bare-metal recovery processes in place is just inexcusable.

How can we be sure that they have DRP in place at BitBucket? If our pipelines there go down, the situation would be dinosaur-asteroid bad for us.

I don't know if we've lost all our data yet or not, but I do know we've lost a week of work across the company. There is a cost to that which exceeds our subscription fee by a large margin.

If we aren't operational on Monday, then we quite literally have nothing left to lose by switching away from Atlassian.

Even if we are operational on Monday, we can no longer trust Atlassian the way we used to.

That means we need to prepare our own DRP and create some kind of on-premise fallback in case Atlassian goes down again.

And if we need to spend time and money doing that, what do we gain from using Atlassian?

We store data in their formats, in their systems, in exchange for the resiliency and professional capabilities of a 72 billion dollar company specializing in IT service management.

The problem isn't just that there was a technical failure at Atlassian; it's that they weren't prepared for it.

Thank you again to Mr. Sifers and Mr. Spells for reaching out to me. I hope that they are continuing to do that with as many affected customers as possible.

4 votes
Ulf Sahlin April 8, 2022

New day, new hope.

I would assume Atlassian would be able to get that "handful" of sites up very quickly, as "a handful" to me means <100. Now it seems to me the number of affected customers is way higher than that.

I received this nothingburger in the mail this morning. It is now FRIDAY and still no reasonable word on ETA or potential data loss. Out of those "hundreds of engineers" working on it, maybe Atlassian could set aside a few to deal with giving information to customers? 

 

 

Scott Farquhar here, I want to personally apologise for the Atlassian outage that you are experiencing. We understand how mission‑critical our products are to your business, and want to make sure you know we are doing everything we can to resolve this. We hold ourselves to the highest standards in dependability, transparency and customer service, and over the past few days, we have failed to live up to that standard.

On Tuesday morning (April 5th PDT), we conducted a maintenance procedure designed to clean up old data from legacy capabilities. As a result, some sites were unintentionally deactivated, which removed access to our products for you and a small subset of our customers. We can confirm this incident was not the result of a cyberattack and there has been no unauthorised access to your data.

We are working 24/7 to restore your service and will alert you when your products are available. We have already restored partial access for some customers and will continue to restore access into next week. Please know that once we have recovered all of our customers' access, we will review our processes to conduct a complete post incident review. We will make an overview of this post incident review available to you.

In our efforts to restore your site as quickly as possible, there may be some limitations when we make it available to you such as 3rd party app functionality. We will be sure to inform you of these in our direct communications with you.

When your site is available, we will directly notify you via your support ticket along with any details on the limitations mentioned above, as well as guidance for follow‑up support.

We'll continue to provide updates on status.atlassian.com as new information becomes available. If you have further questions, please reach out to us at https://support.atlassian.com/contact. If you have any issues opening a technical support ticket, please open a billing question ticket and we will transfer it into our support teams. It is my and my team's priority to do what we can to make things right.

4 votes
paul.fritz April 7, 2022

We have been down since Tuesday morning. It's beginning to seriously affect our teams. My IT team is coping but we can't access our asset tracking now and are having to handle support requests via email (which is super not ideal). 

I'm also curious to know where in the queue we stand for restoration of services. I have seen some big name companies complaining and fear that we might be considered "too small" to be addressed in a timely fashion.  

Our engineering group is already planning contingencies as this is starting to impact release schedules and seriously hampering their work. 


We have lost a LOT of goodwill towards Atlassian, and I would not be surprised if leadership begins to ask about competitors and/or lawyers get involved.

 

Just give us more information so we (those who represent you at our companies) can give useful information to our leadership and users. Saying "we don't know when it will come back...if it will come back" is horrible. 

4 votes
Eric D. April 7, 2022

So, now that we are going on Day 3 of not being able to access our JIRA Ticketing System, Confluence Pages, or the Insight instance that we've been working on, and being provided with very little to no information about how much longer this outage might be, I'm becoming more and more curious about how many customers are affected in the "small population" mentioned in "While conducting routine maintenance, an action caused a small population of our customers to be unable to access their products and data. Please reference our statuspage for updates."

Who (or how many) is on the list of this subset of the small population that has been without their ticket system for ~60+ hours based on "Update - We are continuing work in the verification stage on a subset of instances. Once reenabled, support will update accounts via opened incident tickets. Restoration of customer sites remains our first priority and we are coordinating with teams globally to ensure that work continues 24/7 until all instances are restored." Since this has been the only provided update at Apr 7, 12:27 UTC, Apr 7, 09:35 UTC, Apr 7, 04:58 UTC.  Is my company part of that subset, or are we farther down the list??

What is the severity of the outage? Is there data loss? Was something compromised? I've seen messages indicating the routine maintenance issue as well as issues with Atlassian AWS cloud. What is going on, so we can prep our customers?

When will we be able to contact support through a means that actually functions?  How do we get updates and timeframes that we can share with our customers to assure them that JIRA IS a legitimate ticketing system solution that is the right solution for them (my company was SUPPOSED to have a live demo of that system yesterday to a customer, but instead had to show them screenshots of the system that we use to support customers 24/7, because it was down and had no ETA on a return to service).

And what (if anything) will Atlassian learn from this outage to ensure that they have methods in place to provide their customers with usable, factual, and adequate updates to inform and manage the expectations of their customers' customers?

---

Page unavailable

Your Atlassian Cloud site is currently unavailable.

Please check Atlassian Status for known problems.
If there are no known problems and your page hasn't appeared again in 5-10 minutes then please contact our support team.
---
It's been 3 days, there should be more information available by now.
4 votes
Ulf Sahlin April 6, 2022

I'm really looking forward to the broken SLA reimbursements.

Likely they will offer a 50% service credit for one (1) month. Not automatic however: one actually needs to APPLY for it.

Imagine how that resonates with the actual cost of all the non-operational organizations.

Service Credits | Atlassian

4 votes
Shane Doerksen April 6, 2022

Mr. Sifers / Atlassian Support Team:

We're still stuck here. The fact that you don't even have someone regularly monitoring your community forums in an incident this significant does not inspire confidence.

You wrote "If you’re unsure if an issue has been created for your site we suggest reaching out to your site admins or technical contacts."

I am the site admin for our company. I would imagine most of the people in this thread are also site admins.

We are already asking you for the status of this issue. I (and I suspect most or all of us) haven't received any communication from Atlassian saying that a support ticket has been created.

We all understand that this isn't something a first-tier support agent will help with. What we're desperately trying to get from Atlassian are answers of some kind, even if the answer is an honest "we don't know how bad the damage is yet".

What we're getting right now is disinterested silence, broken support request forms and boilerplate "we're working on it" updates to the status page.

At my company at least, this outage has already caught the attention of the finance team that pays the bills.

It has also agitated everyone from middle-management down who haven't been able to submit or update any requests to our IT or marketing departments for a very long time now.

In short, Atlassian's failure to communicate about this in any meaningful way has managed to turn pretty much everyone in our company against them in the span of a little over a day.

Your site admins and technical contacts are the people who advocate for Atlassian within our respective organizations, and you're hanging us out to dry.

Please reply with meaningful information that we can communicate to the many thousands of people across our collective organizations who are directly affected by this situation.

Thank you for your time and assistance.

4 votes
Ulf Sahlin April 6, 2022

Dear Stephen,

Thank you for your response.

I assume you did not read this thread. I will try to summarize what has been written for you.

  • The support "reach out" DOES NOT WORK, as it requires giving a functioning Atlassian site URL when registering a ticket. We obviously do not have a functioning Atlassian site URL. A dead end.
  • The contents of the "live status update" page are not up to any form of business operations standard. Business customers need to know 1.) when to expect a functioning service again and 2.) what to tell our inquisitive customers about why our service to them is inoperative. If you would please read through the actual contents of the status updates, they are in essence meaningless to your customers. In particular, the permutation and recycling of already-published updates in newer updates tells us you put no effort into keeping your customers informed. Please do a comparison of the information given during historical outages from major cloud players such as AWS and you'll find that much more detailed information was continuously given to customers.

In summary, technical mistakes and glitches do happen. As to how to handle them, check out the train story above.

4 votes
Karim Abrik April 6, 2022

The irony is that we use Jira to provide our customers with a higher quality of service, while we see that Atlassian, in particular, leaves us completely in the dark.

As a customer, I expect meaningful updates on a regular basis, including something like a predefined update interval, so we know when to expect the next update.
Posting an update without useful information here is pointless and only arouses more irritation. So please tell us a bit more about what you are working on and when you expect to resolve this issue.

3 votes
Shane Doerksen April 8, 2022

Thank you, @Karla Keefe. I hope that your instance is restored soon.

(Saying that feels like some strange Viking blessing..."May your instance always be online and your data always be backed-up.")

3 votes
Karim Abrik April 8, 2022

I really am frustrated and I believe within ITSM, this is what you don't want for your customers!

  • We are still experiencing the outage after 3 days
  • Even though people complain, the information provided by Atlassian is extremely poor and not meaningful. Reposting the same status text is, to me, the same as not communicating or informing at all
  • People affected by this outage are still ignored. We are still unable to report a ticket and nobody at Atlassian is addressing this, e.g. by doing something extremely simple like providing customers who experience this outage with an email address to report their incident.
3 votes
paul.fritz April 6, 2022

Agreed. Seeing this often repeated suggestion to contact support when the support page blocks further contact because of the very problem we are trying to get help for.... is intensely frustrating. We have an entire engineering team idled and our Service Desk team is back to handling calls via direct emails. 

 

To rub salt into the wounds, my colleagues at other companies are happily using their Atlassian instances since this was evidently only impacting a "small segment" of customers. 

3 votes
Stephen Sifers
Atlassian Team
Atlassian Team members are employees working across the company in a wide variety of roles.
April 6, 2022

Hello, We're sorry you've been impacted by this incident. We have sent email communications to all affected customers which provide further details of the incident as well as a link for live status updates. Reach out to us at https://support.atlassian.com/contact if you have any questions or concerns.

Additionally, we’re checking and validating anyone who has posted in Community to report their site has been impacted and we are creating a support request on their behalf. If you’re unsure if an issue has been created for your site we suggest reaching out to your site admins or technical contacts.

Regards,
Stephen Sifers | Product Lead, Community

3 votes
Ulf Sahlin April 6, 2022

When trying to create a support ticket for the issue, the ticket service is out of commission. Likely because my instance is down. Oh, the irony!

down4.png

2 votes
paul.fritz April 18, 2022

Our site came back online at the end of Friday, but the stresses of doing our release without Jira kept our team from reviewing it until today. Everything seems restored except for our Insights data. We think it may be missing some assets. We are working with Atlassian to check. 

 

Ironically the system that we had to migrate to from the system that had worked okay for us..... 

Shane Doerksen April 18, 2022

Glad to hear that your site is back, @paul.fritz. We are still reviewing our data here. Most of it seems to be okay, but we may have some small issues with the ownership of our service request tickets. But at least the site is up, so this week is off to a better start than last.

Stephen Sifers
Atlassian Team
April 18, 2022

@paul.fritz

Very happy to hear y'all are able to access your products again! I see you updated your support case with the missing assets. I'll personally update the team as well to ensure they're aware of your issue post restore.

Regards,
Stephen Sifers

2 votes
Ulf Sahlin April 17, 2022

Our site went online this morning, all data items seem to be present. Yay!

Shane Doerksen April 18, 2022

That's great news, @Ulf Sahlin! Glad to hear it.

2 votes
Shane Doerksen April 11, 2022

Hi Everyone,

Just a quick update that it has now been more than 30 hours since they first told us that they were ready to begin restoring our site, and nothing has changed. Our site is still down and we've received no further updates on the ticket.

As @Ulf Sahlin pointed out, their pace of recovery has actually slowed to 4% per day. This is in contrast to what they wrote on April 9th (emphasis mine):

We’ve taken a careful and considered approach in the early stages of this restoration process, with the aim of accelerating the restoration process from here.

By April 9th they had restored 23% of affected sites.

On April 10th, they restored an additional 8% of affected sites.

On April 11th, they restored an additional 4% of affected sites.

That math says the downtime will be measured in weeks for some customers.
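A rough sketch of that math, using only the percentages quoted above. The assumption that the pace stays flat at a given daily rate is hypothetical (Atlassian has not stated one), and the function name is made up for illustration:

```python
import math

# Percent of affected sites restored, as quoted in the updates above.
restored_by_april_9 = 23
restored_on_april_10 = 8
restored_on_april_11 = 4


def days_to_finish(restored_pct, daily_rate_pct):
    """Days needed to restore the remaining sites at a constant daily rate.

    Assumption: the rate stays flat, which is a guess, not a published figure.
    """
    remaining = 100 - restored_pct
    return math.ceil(remaining / daily_rate_pct)


restored_so_far = restored_by_april_9 + restored_on_april_10 + restored_on_april_11
print(restored_so_far)                     # 35
print(days_to_finish(restored_so_far, 4))  # 17
print(days_to_finish(restored_so_far, 8))  # 9
```

With 35% restored, the remaining 65% would take about 17 more days at 4% per day, and still about 9 days at the April 10th pace of 8% per day.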

As their other support messages have said, their restoration plans don't immediately include activating third-party apps; they will work on that after the initial restoration. So there will almost certainly be a period of time after restoration where the sites are not yet fully usable for many customers.

Even after all that, Atlassian have said that they can't guarantee third-party data restoration.

Atlassian continually writes that the complexity of this restoration requires verification, etc. Yes, great, we all understand that the Atlassian platform is complex. That doesn't need to be explained to us.

What needs to be explained to us is:

- why this is a manual procedure at all
- why they don't have automated site recovery / disaster recovery tools
- how they are verifying our data, and why that's not automated

If their platform architecture is so complex internally that critical management processes like DRP can't be automated, then that is a serious problem.

If they can be automated, then we need an explanation as to why they weren't.

Most importantly, I think we deserve an explanation as to why they don't follow their own recommendations for disaster recovery:

https://confluence.atlassian.com/enterprise/disaster-recovery-guide-for-jira-692782022.html

https://confluence.atlassian.com/enterprise/confluence-data-center-disaster-recovery-790795927.html

Yes, the cloud architecture will be different from data centre or server, but the fundamental principles are very much the same:

1) backup meticulously
2) automate everything
3) test regularly
4) have a runbook so that nobody needs to figure things out on the fly in the middle of a disaster

Point 4 is the most important, as Atlassian's own messages indicate that they are figuring this out as they go.

The service credits that Atlassian offers max out at 50% when their uptime goes below 95% or 36 hours of downtime in a month:

https://www.atlassian.com/legal/sla/service-credits

We're at around 168 hours of downtime, with no end in sight, and we've lost an entire work week to this.

We can't get an actual ETA from anyone at Atlassian and we've exceeded their maximum anticipated downtime by a factor of 5 (so far).
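For anyone who wants to check the numbers, here is a small sketch of the arithmetic. It assumes a 30-day month (720 hours), which is where the 36-hour figure comes from; the function name is illustrative only:

```python
HOURS_IN_30_DAY_MONTH = 30 * 24  # 720 hours; assumes a 30-day month


def allowed_downtime_hours(uptime_pct, hours_in_month=HOURS_IN_30_DAY_MONTH):
    """Hours of downtime in a month that correspond to a given uptime percentage."""
    return hours_in_month * (100 - uptime_pct) / 100


threshold = allowed_downtime_hours(95)  # downtime at the 95% uptime floor
downtime_so_far = 168                   # hours, as stated above

print(threshold)                                                      # 36.0
print(round(downtime_so_far / threshold, 2))                          # 4.67
print(round(100 * (1 - downtime_so_far / HOURS_IN_30_DAY_MONTH), 1))  # 76.7
```

So 168 hours is roughly 4.7 times the 36-hour threshold, and it puts uptime for the month at about 76.7% at best, even if the site came back right now.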

This isn't funny any more.

paul.fritz April 11, 2022

I suspect that they have all of our sites in a single (!) database and didn't notice that the script had deleted our accounts until the database had significantly deviated from the last snapshot. 

 

Since we reportedly are just 400 sites, or somewhere less than 1% of their customer base, they have elected to sacrifice us to avoid losing data from their other customers by restoring from backup. 

 

So my guess is that they have stood up a snapshot from before the script ran and are in fact manually rebuilding our sites in the prod database. Since this is manually tinkering in the prod database, I suspect they are triple checking their changes to avoid further damage. 


To me it speaks of poor architecture and DR planning. They appear to have assumed that the only recovery scenario was one where the majority or the entirety of the data would be damaged or lost, so they only planned for a complete restoration of the database.

 

They never planned for a scenario where a portion of the database becomes corrupt (nor did they anticipate that putting 260k plus sites into one management system might be risky).

 

All of this is hypothetical, but it's the only reason I can come up with for why restoring less than 1% of their system should take 3 weeks and not guarantee that the restored systems will even work correctly.
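To make the difference between the two recovery scenarios concrete, here is a toy sketch. Everything in it is an assumption for illustration: the dictionaries stand in for a shared database, and the site names and function names are made up. It says nothing about how Atlassian's systems actually work.

```python
import copy


def full_restore(snapshot):
    """Roll everything back to the snapshot.

    Unaffected sites lose any changes made after the snapshot was taken.
    """
    return copy.deepcopy(snapshot)


def partial_restore(snapshot, live, deleted_ids):
    """Bring back only the deleted sites; live data for other sites is untouched."""
    result = copy.deepcopy(live)
    for site_id in deleted_ids:
        # Only restore sites that exist in the snapshot and are missing from live.
        if site_id in snapshot and site_id not in result:
            result[site_id] = copy.deepcopy(snapshot[site_id])
    return result


# Hypothetical data: site-b was deleted, site-a kept changing after the snapshot.
snapshot = {"site-a": {"tickets": 10}, "site-b": {"tickets": 5}}
live = {"site-a": {"tickets": 12}}

print(full_restore(snapshot))
# {'site-a': {'tickets': 10}, 'site-b': {'tickets': 5}}
print(partial_restore(snapshot, live, ["site-b"]))
# {'site-a': {'tickets': 12}, 'site-b': {'tickets': 5}}
```

The full restore brings site-b back but throws away site-a's newer tickets, while the partial restore keeps both. The partial path is the one that would need careful per-site checking in a real system.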

Karla April 11, 2022

@Shane Doerksen - no, it isn't fun anymore. It has also been more than 27 hours since they told me they have started the restore process for our site and gave us the same Options (restore all users or a subset of users). No further action. Still down with no ETA. All I can say is ... this is nucking futs!! 

 

@paul.fritz - One would think that if that is the case and they are not doing incremental backups, it may have been much less time-consuming to tell ALL of their customers that they would be down for a brief time-period (4 hours overnight for example) and then take every site down, create another backup that covers the dates in question, restore ALL sites using the full backup and then apply the other backup to ensure sites that weren't affected before, remain unaffected.

But alas, if they are dumping everyone's data together - all I can do is (facepalm), shake my head, roll my eyes and recommend that Atlassian immediately start applying industry standard best practices - I would imagine that this outage will (and understandably should) result in a massive amount of revenue and reputation loss for Atlassian.

paul.fritz April 11, 2022

I agree. This is why I've been playing the "what if" game. Every scenario I can think of from past experience or industry practices doesn't allow for this apparent very manual restoration. 

 

As for revenue and reputation loss, I wonder what the impact actually will be. With it being such a small number of impacted sites, it isn't making many waves in the technology press. Most customers remain unaffected and oblivious.

 

Unless somebody goes to one of the larger tech news sites and/or a serious lawsuit gets filed, I worry that Atlassian will just chalk this up as a variance and burn what little reputation they have left with us.

Being kinda pessimistic as I just got off of Zoom with my leadership asking that we look at alternative systems....

Eric D. April 12, 2022

@paul.fritz you hit the nail on the head when hypothesizing that a single database is the cause of the slow data recovery. My site is still down and we are still waiting for our email telling us that it’s in the restore process. Does anyone know whether recovery, once it begins, includes all Atlassian products or just some of them at a time? (We have Insight, Confluence and Service Management.)

Shane Doerksen April 12, 2022

Hi @Eric D. I actually had a Zoom call with Nathan Spells in Atlassian escalations today. As far as I understand it, all of a site's Atlassian products and data should be restored at the same time, and they are concurrently verifying the data for each product before releasing it to the customer for final verification and testing. He also explained that they have engineering representatives from each of the respective product teams involved in each site restoration as needed. So if you have Confluence, there will be Confluence engineers there, if you have Service Management, you'll have Service Management engineers, etc.

As I understand it, their approach to restoration is site-by-site, not product-by-product.

Eric D. April 12, 2022

Awesome, thank you for letting us know. That was something that we were wondering about, as it might cause additional confusion for users and customers if different parts became available at different times. We are also meeting with Diane from Atlassian tomorrow to discuss status and recovery, so all of this information will certainly allow us to have an informed meeting. Thanks again.

2 votes
Karla April 7, 2022

I received the EXACT same response (verbatim) to my ticket, which was auto created as a result of my comment above. All I can do is SMH and roll my eyes. You hit the nail on the head, Shane!

2 votes
Karla April 6, 2022

And don't tell us you will post an update in an hour or in three hours and then not post an update!

2 votes
Andrew Parle April 6, 2022

We've also been affected and can't create a support ticket.  Nearly 2 full workdays lost for 2 departments, paperwork is going to be fun!  Good luck with the fixes. 
