Your company's portal goes down. You're immediately in firefighting mode, trying to fix the bug and restore the portal. But are you being transparent with your customers? Are your support teams starting to see queues fill with tickets, tweets or chats via your support portal?
Good incident response isn't just about getting services back up quickly. It's also about being upfront with your customers and updating them frequently.
I recently ran my first Atlassian Team Playbook session with our Incident Communication Team. The play we chose was 'Incident response communications', and I would like to share my thoughts.
We have more than 30 enterprise cloud products, with over half of these using Statuspage. The team works with more than 20 engineering/dev teams and handles several customer-impacting incidents a day.
We needed time to step back and take a look at how well we are communicating.
I've been fortunate enough to have one of the Atlassians run several team playbooks for our service teams here in Reading. Based on the positive reaction, I knew playbooks were the way forward. Also, why reinvent the wheel? Atlassian puts a lot of research and dedication into these playbooks, which are completely free.
I stumbled upon the Incident response communications playbook, and knew this was what we needed. I shared this with the managers, and they asked me to go ahead and schedule a play session with the team.
I worked with one of our Incident Communication Managers to look at the past 30 days' worth of incidents and narrow down which incident we wanted to focus on. Out of the handful of incidents, we chose one and started gathering the incident details together on a Confluence page.
As the entire team is remote, we had to come up with a way for the team to interact and engage. If the team had been in the same office, we could have used a whiteboard wall to draw the timeline and add sticky notes, but that wasn't an option.
Therefore, I took the timeline from Confluence and recreated it as a slide. We then added "sticky notes" to the presentation using a whiteboard in Webex (we use this as our video conferencing tool). One of the team members acted as scribe, taking the whiteboard notes and adding them to the timeline in the slide. The team went through the entire incident from start to finish: from the point that engineering received an alert from monitoring to our last update. It was interesting to see that we were actually posting externally before we had internal comms.
Once the team had assessed the incident, we put an action plan in place which included people, process and technology actions. We assigned each action an "owner" and a recommendation on next steps.
A few weeks later we regrouped and the owners provided updates on their actions, which have since improved our communications.
But this isn't a one-off exercise. It's something that we will continue to repeat every few months to help refine and improve our incident communication process.
Have you run this play? What did you find? Have you made any changes to your incident process? Let me know!
Thanks for reading!
Nick Coates
Product Owner - Symantec Status
Nick Coates
Product Owner (Service Status)
Broadcom
United Kingdom