Hello there friendly community! I'm currently working on a guide for folks who are new to Incident Management, or want to improve their current Incident Management process.
The TL;DR
I'd love to see if you have any nuggets of wisdom to share that I might be able to include in the guide. Your advice would be attributed to you in the form of a quote, even alongside your photo if you like!
The goal is to provide guidance for modern IT teams who are looking to start or improve an incident management process.
The whole, detailed story
If you have something to share, please share it below. If you need ideas, feel free to pick one or two of the questions or themes below to focus on!
1. While defining your incident management process, what were some key learnings? What worked/what didn’t?
2. How did you define what constituted an incident versus a notification? Did you find these definitions needed to be fine-tuned over time?
3. How did you define SLAs for different kinds of incidents?
4. How did you assign incident response roles to your team?
5. How did you choose what your “war room” or “command center” would be? (E.g., why did you choose Slack versus a phone bridge?)
6. What were the best methods for communicating with stakeholders about incident status? What were the worst?
7. How did you keep responders motivated, especially during the postmortem process?
8. How did you report back to C-level stakeholders about the incident response and postmortem process?
9. Overall, what advice would you give to folks just starting an Incident Management process?
I'll ask you to proof and approve your quote before we publish the guide. All participants will receive a special thank-you gift for their participation!
Looking forward to hearing from you!
Best,
Kate
This is great! Thanks so much @Matt Doar !
Take really good notes. Try to find out what was happening when the incident happened, from all involved parties/departments, even things that seem like they wouldn't be related. Just put it in the notes with a timestamp. It's interesting what things will just fall out when you get it all on the timeline.
Move through the troubleshooting and fix ideas methodically. Work the easy stuff first. Back out any recent changes, look over your notes and timeline, and follow up on all of those items.
Don't name names, throw blame, or call out people on the incident call. Present the problem, present the symptoms, offer solutions, and be a positive influence. This is a rough patch for everyone involved.
If you discover you made the mistake/caused the issue, own up to it. The problem can then be solved instead of everyone still working out what happened. If someone else made the mistake and owns it, don't tease them about it; that's for the break room over coffee the next morning ;)
Love this! Thanks @Kimberly Deal _Columbus ACE_
Use only one channel for communicating with stakeholders.
Only one source of information and truth.
Let the techs do their incident jobs: responding to managers AND trying to fix something takes more time than just working on the problem.
Don't promise a time to restore if it's a big or complex issue. If you don't fix it in that time, the pressure will double for sure. Instead, promise communication about status and progress, with enough detail to let the techs work and keep the stakeholders at peace (or less angry...).
Having a well-known communication plan, along with the reasons why it needs to be this way, would be helpful.
For me, the negative experiences of users and stakeholders are related to poor communication.
IT is not good at that one, LOL. We need to improve this part (the human one...).