What are your best incident management tips and stories? #HugOps

👋 Community members!

Downtime happens. And great incident response takes a village. Teams like Support, Dev, SRE, Ops, IT, and Marketing have to come together to resolve the problem while keeping customers in the loop from 'investigation' through 'resolution'. We believe in celebrating the teams that work hard to keep the services we rely on up and running 99.99% of the time. That's why we love the HugOps movement so much.

HugOps exalts empathy, cross-team collaboration, and trust as the keys to finding problems and shipping solutions faster. It comes to life as a hashtag on Twitter – a much-welcomed burst of positivity on a platform that sees its fair share of negativity and trolling. While #HugOps are typically shared during outages, we love the idea of showing the HugOps love when things are going right, too. It's always a good time to spread support and appreciation throughout our community.

✨Today we're excited to share our new HugOps hub with you! We hope you feel inspired to tweet HugOps love to your favorite services, share stories from the trenches, and submit your tips for incident response – all while learning from other folks in the incident response community who have already shared theirs. ✨

Comment below with your top tip for incident management and/or your best HugOps story (a time you felt supported or appreciated during incident response).

We'll highlight some of our favorites from this discussion on the HugOps website. We'll even send you a free HugOps poster to proudly display in your workspace. 🙌

BONUS: We'll be drawing one tip or story from this discussion at random on Friday, August 24th. The winner will receive a special HugOps gift package 🎁🎉

With #HugOps Love,

Shannon from Statuspage

9 comments

Tip:

Communication is the key. It is very important to stay calm from the start to the end of an incident; it helps to maintain a diplomatic tone. Updating stakeholders to avoid unwanted noise helps to maintain the right environment for managing the incident properly.

Not me specifically, but our team was appreciated for maintaining a proper line of communication during incidents. It is not always the fix/resolution that comes first; it was proper communication that gave us strength to pass that extra mile in proper incident management throughout these years.

These are awesome tips @Alana Fernando!! Thanks so much for sharing.

proper communication gave us strength to pass that extra mile in proper incident management throughout these years.

^^^ THIS! So true. Communication really is 🔑

P.S. Email me (swinter@atlassian.com) so I can send you a link for a poster!

Tip:

Why bother communicating if your customers cannot easily find out about incidents? Promote your external or internal status page through social media and onboarding emails to customers/users, to name a couple. Integrate with your existing portal(s) or applications (web/iOS/Android) to inform customers of incidents and direct them to a status page. Finally, encourage customers via support tickets (add it to the end of a signature or a pre-defined template) to subscribe to your status page for future incident notifications.

It's been really rewarding to hear comments from our support and engineering teams such as "directing customers to our status page enables our teams to focus on the issue at hand, rather than stress of multiple tickets/calls, and restore service as quickly as possible".

@Nick Coates you are spot on with this!! Thanks for sharing. The best comms in the world are useless if your customers don't know what communication channels you're using (and where to find them). Love all of the ideas you've laid out for promoting them.

If you're up for it, we'd love for you to expand on this in a guest blog post for us. 😊 What do you think?

P.S. Email me (swinter@atlassian.com) so I can send you a link for a poster!

Hey @shannyshan, absolutely! I would be honoured to expand on this in a guest blog post :) I'll drop you an email and we can go from there.

Communication - let people know there is an outage and it is being worked on. Acknowledge the issue. Don't rely on managers to pass messages down - use on-screen messages where possible or visit the working areas.

Focus #1 - return to operations (RTO). Resources should not be diverted from RTO to work out why it happened, what the expected long-term impacts are, and how it can be prevented. Where possible, do try to estimate the expected RTO so this can be communicated.

Excellent tips, @Kat Warner! Focus should definitely be on communication and returning to a fully operational state. Then, after recovery, you hold a postmortem or post-incident review to find the root cause and take steps to ensure the same issue doesn't happen twice. 💪

E-mail me (swinter@atlassian.com) for your HugOps poster. 😄

Always Educate: service desk customers, supporting resources, company leadership, etc on how to best get support either via self-service articles or submitting requests

Keep Adapting: actively seek feedback from all participants in the incident management process and adjust workflows accordingly to maximize effectiveness

Be Transparent: keep expectations managed by providing a breakdown of the incident management workflow and visibility of the service desk queue

Love these @Brittany Steed! Thanks for sharing! Plus, they can be remembered with an easy acronym... always E.A.T during incident response. 😉

"Keep adapting" is a really great one to call out. We always recommend holding a retrospective or incident comms review after an incident. Also great to see you're helping customers get the best help possible during downtime with self-service articles and a transparent service desk process.

Email me at swinter@atlassian.com for your HugOps poster. 🙌

Oooooh, I love that! Who doesn't love to EAT!

Autonomy - Give as much autonomy as possible to customers:

Build a knowledge base. Customers will solve their issues by themselves.
Design a simple customer portal. It will let them create a ticket in a few seconds.

Visibility - Communicate, be transparent:

Share information when there's an incident. Don't hide it, make it an announcement on the customer portal or use StatusPage.
Open a dedicated HipChat/Stride/Slack room. This way, everyone is up-to-date, even distributed teams.
Take some time to write a post-incident report. Share it on Confluence so that everyone can read it and look back at it during future incidents.

💯💯💯 @Manon Soubies-Camy - Thank you! Really strong tips and tricks.

We love the idea of putting the power in the hands of the customer. The last thing someone wants during a stressful incident is to have a clunky and complicated experience trying to get help.

Also great how you point out the importance of communicating both externally (customer portal/Statuspage/etc.) and internally (chat rooms). Both are so important. And if you aren't communicating well internally, it's bound to show with inconsistencies externally.

Awesome idea to draft and share post-incident reviews on Confluence. New Confluence template idea, @Kesha Thill! 🤗 We also have a new and improved postmortem feature in Statuspage for sharing info out with your customers.

Thanks again - email me at swinter@atlassian.com for your poster!

That's an interesting feature! And I'd love to see a PIR template on Confluence :)

Oh I LOVE a good new template idea 🙌🏽 And a PIR is such a great one! Thanks @shannyshan and @Manon Soubies-Camy! I'll add it to my list and hopefully soon 🙃

Communication
Communication is a must during an incident. Customers may measure the quality of the service in such cases. We may think that if our product/service has unique features and strengths, people will stick with it. But when an incident happens, that is the point at which some customers/stakeholders start questioning their decision to trust us. If we fail to communicate with them properly while investigating/fixing the issue, they may end up drawing conclusions about the quality of our product or service.

Updating
Always inform the customer as soon as possible if any incident happens, because it can affect their work or activities. Then we should have a proper way to let them know the status of the incident, such as a status page or something similar, as some of our community members mentioned above. Otherwise they'll keep contacting us over the phone or by email, or else feel helpless while waiting with no information.

Investigating/Fixing
While this communication and updating goes on, the team responsible for the investigation/fix should act immediately and give the incident higher priority than other regular work. Continuous updates will help keep the customer calm. The moment it is fixed, remember to test it again at least twice to verify that it is really fixed. If we miss this step, the customer may end up with a similar incident/issue again, or with something new, because the tests were not done properly.

Finalizing
Finally, let them know it is fixed, apologize for the incident, and let them know that we'll take all the steps needed to make sure this won't happen again.

Thank you for sharing @Anton Perera!!

I love how you show the link between communication and trust. Downtime is inevitable, but being communicative and transparent is a HUGE differentiator in the SaaS world. Companies that get this right are way less likely to lose customers after a big downtime event.

Always inform the customer as soon as possible if any incident happens.

^ YES! Updates early and often is the name of the game. We recommend getting something up right when you identify an issue and then updating every 30ish minutes, even if just to say "still working on x, next update in 30."
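For anyone who wants to turn that cadence into a reminder, here is a minimal sketch in Python. It is only an illustration: the function names `update_schedule` and `holding_message`, the example timestamp, and the default of four updates are all assumptions made up for this example, not part of any Statuspage feature or API.

```python
from datetime import datetime, timedelta


def update_schedule(identified_at, interval_minutes=30, count=4):
    """Return the times at which the next `count` status updates are due.

    Hypothetical helper: the 30-minute default mirrors the "30ish minutes"
    suggestion above; `count` is an arbitrary illustration value.
    """
    step = timedelta(minutes=interval_minutes)
    return [identified_at + step * i for i in range(1, count + 1)]


def holding_message(component, interval_minutes=30):
    """Build a short 'nothing new yet' update for the given component."""
    return f"Still working on {component}, next update in {interval_minutes}."


# Example: an issue identified at 14:00 gets updates due at 14:30, 15:00, ...
start = datetime(2018, 8, 20, 14, 0)
due = update_schedule(start)
for when in due:
    print(when.strftime("%H:%M"), holding_message("x"))
```

The point of the sketch is simply that the next update time is always known in advance, so you can promise it in the current update.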

The finalizing step is great, too. You always want to apologize and close the loop with affected users after an incident with a well-written postmortem. 🙂

Email me (swinter@atlassian.com) for your HugOps poster!

The best tip for incident management? Avoid incident requests created by customers!

There is nothing worse than a customer complaining about a system being down or something not working. We need to be one step ahead, and for this we need good event management that automatically resolves most system outages or informs people that the system is not available at the moment. With good event management, people often would not even notice that an issue occurred.

When you actually have an issue or incident reported, the next thing is communication. If you do not have a centralized system that informs people about availability (like StatusPage), it is a lot harder to focus on the incident itself, since people create requests en masse (if they know where to report an issue). If not, they send many emails or contact the support team directly on IM. Either way, this is wrong. Good incident management should have a centralized way of informing users that there is an issue and someone is working on it. If not, the support team spends its time replying instead of fixing the problem.

Should we inform customers about all incidents? Not really. You need to pay attention to the situations that affect customers (system not working, performance problems, ...). If you are using a data center setup, for example, and one node is down but the applications are working and the customer is not complaining, communication should be more internal, so the customer does not think that there are always some issues.

Thanks @Mirek!! Totally agree that great incident management starts with being able to quickly detect and resolve incidents before they affect our customers. "Detect: Atlassian knows before our customers do" is actually the first of our Atlassian Incident Values.

"Good incident management should have a centralized way of informing users that there is an issue and someone is working on it." We agree! That is actually a pretty huge benefit of having a Statuspage, since it lets you communicate the same message across various channels (email, in-app, SMS, Twitter, etc.).

And really awesome point re: being thoughtful about when you should/shouldn't inform customers about an incident. We find it helpful to define different levels of incident severity so you can determine right away whether a certain issue needs broader comms or not.
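As a rough illustration of that idea, the sketch below maps a severity level to the channels that should get an update. Everything in it is an assumption made for the example: the three levels, their descriptions, the cutoff at SEV2, and the channel names are hypothetical and do not reflect Atlassian's actual severity definitions.

```python
from enum import IntEnum


class Severity(IntEnum):
    # Hypothetical levels, for illustration only.
    SEV1 = 1  # customer-facing outage
    SEV2 = 2  # degraded performance that customers can notice
    SEV3 = 3  # internal only, e.g. one redundant node down, apps still working


def comms_plan(severity):
    """Return the channels to update for an incident of this severity.

    Assumption: anything customers can notice (SEV1, SEV2) goes on the
    public status page; lower-impact issues stay internal.
    """
    if severity <= Severity.SEV2:
        return ["status page", "internal chat"]
    return ["internal chat"]


print(comms_plan(Severity.SEV1))
print(comms_plan(Severity.SEV3))
```

Agreeing on the cutoff ahead of time means nobody has to debate "do we tell customers?" in the middle of an incident.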

Email me (swinter@atlassian.com) for your HugOps poster!

Just another tip with a recent incident we had to manage. (Not to have an extra poster. :D)

Never stick with usual/known things for a long time

Sometimes an incident may look like a known issue, and it may feel like we already know both the cause and the fix. It's always fine to try the known fixes first while working out whether this really is a common or previously seen issue. If it is not, we should act immediately and check other things, meaning the things we did not suspect could be related to this new issue. The incident may look like a known one without being one, and if we keep working through known fixes we lose time. The moment you realize, after the known fixes have failed, that this is not something we have experienced before, start treating the incident as having an unknown cause. That saves time in fixing the issue. Otherwise we may waste extra time on known fixes even though we sensed there was something odd about the new issue compared with the similar ones we experienced before.

Thanks so much for the follow-up, @Anton Perera!! Love hearing about learnings that come out of real incidents. This is a really great point you make. Incidents are messy and complex and you need to treat them as new puzzles to solve every time. You can have great processes and roles and tools in place, but you still need to think outside the box to get things resolved as efficiently as possible. Plus, if you're learning and improving through post-incident reviews, you really shouldn't ever have the exact same incident happen twice, so most issues won't allow you to stick with what you've done before.

I know you aren't doing it for an extra poster, but we want you to have a little extra something for all the great tips. 🤗 Check your inbox!

Oh.. Didn't see that coming.. :D
Never expected that. But thanks. I'll check. :)

Remember this?:

BONUS: We'll be drawing one tip or story from this discussion at random on Friday, August 24th. The winner will receive a special HugOps gift package 🎁🎉

Well, we drew a winner on Friday and are excited to ship off a special gift to our winner. Drumroll please....

@Nick Coates - congrats! We will e-mail you for the best address to ship your gift :)

Oh, awesome, thank you @shannyshan 🎉 Great news just in time for Summit too!

Hooray! We can't wait to see ya in person at Summit. Definitely come find us at the Statuspage booth 😊

Definitely! I'll probably spend a lot of time at the booth 😁
