Today's AWS outage was a stark reminder: what happens when the tools you rely on to manage incidents... are part of the incident?
When Slack, Zoom, PagerDuty, and even Statuspage are impacted, how do you get your response team reconnected to solve the underlying problem? Once they're talking to each other, they can improvise a response, but that first step of re-establishing contact is critical.
This isn't just a hypothetical. It's a real-world scenario that can paralyze even the most prepared organizations. Relying on a plan that's tucked away in a long-forgotten document is a recipe for disaster.
Here's what I recommend to the leaders I advise:
🔹 Have a "Rally Point" Plan: Don't just have a backup concept; have a pre-defined, communicated, and accessible fallback plan. Every second counts in an incident, and you can't waste time figuring out where to communicate. If you normally use Slack and Zoom, then think Google Meet or Microsoft Teams for your backup, and vice versa. Maybe even an old-fashioned conference call bridge. The key is that everyone knows where to go when the normal places aren't working.
🔹 Make it Accessible: Your plan is useless if it's on a server that nobody can reach during the outage. Laminated wallet cards, a shared password vault with offline access, or a regularly updated file on every employee's laptop are all viable options.
🔹 Practice, Practice, Practice: Fire drills aren't just for fires. Run drills for your fallback communication plan. This ensures everyone remembers it exists and that the mechanisms still work.
🔹 Don't Forget Security: Assume that your fallback channel is compromised and that outsiders are listening in. Use it only as a rendezvous point to direct responders to more secure, authenticated channels, where you can validate every participant. Don't discuss sensitive information in the open.
Incidents are costly, not just in revenue, but in reputation and team morale. Proactive preparation isn't a luxury; it's a necessity.
What's your team's communication fallback plan? Share your thoughts in the comments below.
#IncidentManagement #BusinessContinuity #SiteReliability #DevOps #AWSOutage