Atlassian Post-Incident Review on the April 2022 Outage

Hi Atlassian Community,

I'm Stephen Deasy, Head of Engineering at Atlassian. 

Earlier this month, several hundred Atlassian customers were impacted by a site outage. We have published a Post-Incident Review, which includes a technical deep dive on what happened, details on how we restored customers' sites, and the immediate actions we’ve taken to improve our operations and approach to incident management.

To our customers and our partners, we thank you for your continued trust and partnership. We hope the details and actions outlined in our Post-Incident Review demonstrate that we’ll continue to provide a world-class cloud platform and a powerful portfolio of products to meet the needs of every team.

3 comments

Ollie Guan
Community Leader
April 30, 2022

Commitment, Focus, Openness, Respect, Courage.

Wisdom comes by suffering 👍

Chris McEvoy
May 3, 2022

I have read the PIR and found it very informative. One point that wasn't covered was the order in which sites were restored. It would be very helpful to understand the criteria for the order of site restoration: was it based on licence SLAs (e.g. Enterprise first, then Premium), on order of deletion, or on some other criteria?

Jimi Wikman
Rising Star
June 14, 2022

I have to admit... this report shows some serious issues in both technical architecture and collaboration. The fact that this could even happen is baffling.

I can see no mention of securing the API so that each call has a single point of failure instead of multiple points based on assumed input values, nor of clear instructions on the steps that are needed, preferably written by the first team that prepared the actions (and presumably verified them). That does not put my mind at ease.

Even if this is a one-time accident, it shows a lack of architectural control and poor disaster preparation.
