"Don't fuck the customer": what we're changing after the incidents from last week

Hi all,

Two of the Atlassian values are "don't fuck the customer" and "open company no bullshit". In the spirit of these two values, here's a follow-up to last week's post re: the incidents we had - I thought some of you might be interested in how we deal with things like this. As is often the case in these situations we needed to dig a little bit to figure out if it was just bad luck or if there was something more systemic to fix. 

Firstly, what happened? Over the course of a week you saw the following issues:

  • Three outages in 2 days, where you couldn't access discovery projects at all
  • An issue where you couldn't edit issue descriptions 
  • An issue where you couldn't edit select/multi-select fields in list views

Needless to say, by the end of the week team morale wasn't great. We met to discuss what we could learn from these incidents and what we need to change. First things first, we realigned on the principles we use to prioritize new work: #1 reliability, #2 usability, #3 features (RUF).

There's always a tension between focusing on reliability vs new features at the early stage of creating a new product - we want to move fast to find product-market fit, but if we move too fast we're at risk of introducing bugs and incidents. 

For those of you who remember, we already had a run of incidents within a single week a few months back. Since then, 30% of each sprint's capacity had been allocated to tech debt, dealing with scale, bugs and improvements. It worked well for us up until last week, but this time around it wasn't sufficient.

What changed since then:

  1. We increased the surface area in terms of features and things that can go wrong, but we were still relying too heavily on dogfooding and manual testing to spot breakages in staging before they get to production
  2. We refactored a number of things in the backend to be able to better deal with scale as it became clear that the app was finding an audience! But doing this without enough automated test coverage was risky.

What didn't change since then:

  1. We were still using goal-based sprints and trying to hit specific goals each sprint, which served us well to get to market and iterate quickly, but introduced risk as we were sometimes taking shortcuts to get something into the hands of users quickly
  2. Our team has 10 engineers, and we had 1h of meetings a day - 2h on Wednesday for demo and planning. A good chunk of the daily 1h meeting was used for standup, and not enough for sparring and discussions - as a result not enough decisions were debated vs just implemented fast by individual team members.

It became clear that we had reached a phase where this wasn't going to work anymore. So we're changing the following things:

  1. We're introducing a lot more automated tests. There's a hit list of key features that need to always be working for which we're adding tests now, and over time we'll add a test for every bug we find.
  2. The split for work moves to 1/3 reliability + techops + tech debt, 1/3 improvements to the existing experience, 1/3 new features. 
  3. We're changing team rituals:
    • Standup is now moving to Slack. No need to spend 20-30min discussing status when we can just do it async
    • We'll claim the time back in the daily 1h meeting for team sparring, and bring key decisions to the team instead. That's where we'll discuss key risks and how to go about them
    • We're moving away from goal-based sprints and into more of a Kanban model for sprint planning. We'll put a bit less emphasis on the "let's ship fast" aspect so we can be more intentional with the changes we make, and see how that works

The things we're not changing at this stage:

  1. Flat hierarchy / everyone is in every key meeting - at this stage we believe that a shared understanding of everything from user problems to solutions is key
  2. Few meetings - everyone needs time to get shit done
  3. The roadmap is loosely defined and changes frequently based on user feedback

We'll do a retro in about a month to see how that worked, and will keep you posted!

6 comments

Ana Duran
Contributor
November 9, 2021

Thank you for the transparency @Tanguy Crusson. I appreciate you taking the time to explain what happened and the next steps you're taking. Overall, you and your mighty team are killing it and we're enjoying using Jira Discovery every day!

Brent Johnson
Rising Star
November 9, 2021

Thanks @Tanguy Crusson . Like @Ana Duran said, we appreciate the transparency and it's great to see how your team takes this seriously and owns it. Keep up the good work!

quan
Contributor
November 9, 2021

Love the openness and transparency!  Keep up the amazing work you and the team are doing.

Sarah Baker
Contributor
November 15, 2021

Love the openness and courage to post this - I'm about to share this with our team, as I got a lot from reading it, and I wonder if our own customers might feel the same way?

I'm a Scrummaster at heart, though, so I'd love to ask some questions in the spirit of collaborative continuous improvement... and I kinda feel I have your consent because you've left the comments open...

  • you said you have a daily one hour standup meeting, and that 20-30 minutes were spent discussing status. The daily standup should be a short meeting, which "focuses on progress toward the Sprint Goal and produces an actionable plan for the next day of work." (from the Scrum Guide) Do these status updates achieve this, regardless of whether they are in a meeting or left in Slack?
  • you said you'd defocus on "let's ship fast" to ensure quality, but what are you using to measure and maintain this, regardless of the work management model? I wonder if you have a shared definition of done, and whether this needs revisiting? I'm enjoying your approach to automated testing - perhaps this could be included?

Good luck and keep up the retros - very pleased to see that these are still on the table in your struggle to find what's less valuable and can be reduced/removed.

Louis Aguila
Contributor
January 13, 2022

Hi @Tanguy Crusson, this is great insight and very helpful. As it has been a couple of months, I am wondering how the retros have gone and what impact these changes have made?

Tanguy Crusson
Atlassian Team
January 14, 2022

@Louis Aguila it helped us a lot. We did face another incident more recently, but that was for a whole bunch of different reasons: https://community.atlassian.com/t5/Jira-Product-Discovery-articles/Outage-on-4-January/ba-p/1902637

Changing our rituals was super helpful in making sure all our face-to-face time is used for meaningful sparring discussions (not status), and has already helped us discuss important topics we had been punting on for a while.

Removing weekly goals and moving to a Kanban style of delivery helped catch a lot more bugs early because things were less rushed.

We're adding more automated tests, and they've caught quite a few potential bugs already. The one extra change we made on that front is to onboard a new team member to help with QA - although our vision is for all the testing to be automated, we're still a long way from that. In fact, we spend a considerable amount of time fixing the tests themselves since we've introduced more automation (no surprises there). So we're doing more manual testing today and iterating fast on our testing strategy based on what we learn from all this.
