1) The document discusses how to manage a crisis when systems break down. It emphasizes the importance of having a crisis management plan in place before issues occur to define roles, responsibilities, communication processes and metrics to track performance.
2) During a crisis, the plan should be followed, and clear communication with regular updates is essential. A root cause analysis should be conducted afterward to determine what changes are needed to prevent future issues.
3) An after action review identifies what went well, opportunities for improvement and an action plan with goals and deadlines to apply lessons learned from the crisis. Having a high performing team and understanding roles is critical to crisis response.
2. Introductions
When Things Break: Managing a Crisis
Deirdre Woods (@deirdre_woods)
• Principal, Deirdre Woods Technology Advisors
• Former CIO and Associate Dean of the Wharton School
• IT Guru and mentor
Sarah Toms (@sarahetoms)
• IT Technical Director, The Wharton School
• Over 20 years as tech entrepreneur and innovator
• ITIL Practitioner and Leader of Award Winning Teams
3. Deirdre’s Worst Hits
• When you get off the plane and have no
email and 50 voicemails...and you find
out your root passwords have been
hacked. All systems were taken offline
and had to be rebuilt over 48 to 72
hours.
• When your hardware provider provides
terrible support, which creates a two
day outage for 30,000 global
customers and you end up screaming at
the CEO on a Sunday morning.
6. Organizational Maturity
Initial
• Success depends on individual performance
• Reactive and Unpredictable
Managed
• Projects & activities managed
• Based on a set plan
Defined
• Processes defined
• Performance Managed
Measured
• Managed based on measurement
Optimized
• Focus on Innovation & Improvement
7. Organizational Maturity
• Assess where you are today and why
• How does your org maturity impact you in a crisis?
• What steps will help improve maturity and mitigate these risks?
11. Define SLAs & Major Incident Process
• A service-level agreement (SLA) is an official commitment
between a service provider and the customer where aspects of
the service – quality, availability, responsibilities – are agreed
• A major incident is an event which has significant impact or
urgency for the business/organization and which demands a
response beyond the routine incident management process
12. Define Measurements
• Transparency and Improvement
• Track Key Performance Indicators and Critical Success Factors
• KPI for Change Management is number of major incidents caused by an IT
change (with the obvious goal being ZERO!).
• KPIs for Incident Management (especially for P1)
• Number of P1 Major Incidents
• Average Time to Detection
• Average Time to Workaround
• Average Time to Resolution
• Number of users impacted by Major Incident
• Number of hours P1 Services unavailable
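The incident KPIs listed above can be computed from a simple log of major incidents. The following is a minimal sketch, not a prescribed method: the `MajorIncident` record and all of its field names are hypothetical, and it assumes a service counts as unavailable from the start of the incident until a workaround is in place.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from statistics import mean


@dataclass
class MajorIncident:
    # All field names are hypothetical; adapt them to your ticketing system.
    started: datetime       # when the service began failing
    detected: datetime      # when IT became aware of the problem
    workaround: datetime    # when users could work again, even if degraded
    resolved: datetime      # when the underlying fault was fixed
    users_impacted: int
    caused_by_change: bool  # True if an IT change triggered the incident


def hours(delta: timedelta) -> float:
    """Convert a time difference to hours."""
    return delta.total_seconds() / 3600


def p1_kpis(incidents: list[MajorIncident]) -> dict[str, float]:
    """Summarize the P1 KPIs for one reporting period."""
    if not incidents:
        return {"p1_count": 0}
    return {
        "p1_count": len(incidents),
        # Change Management KPI: the goal is zero.
        "caused_by_change": sum(i.caused_by_change for i in incidents),
        "avg_hours_to_detection": mean(hours(i.detected - i.started) for i in incidents),
        "avg_hours_to_workaround": mean(hours(i.workaround - i.started) for i in incidents),
        "avg_hours_to_resolution": mean(hours(i.resolved - i.started) for i in incidents),
        "users_impacted": sum(i.users_impacted for i in incidents),
        # Assumption: unavailability ends when the workaround is in place.
        "hours_unavailable": sum(hours(i.workaround - i.started) for i in incidents),
    }
```

Reporting these figures for each period, rather than only after a bad outage, supports the transparency and improvement goals named on this slide.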
13. SLA: Defining IT Service Priority
• From standpoint of the business users (and not IT)
• Who uses the service and for what?
• Primary hours of use? Specific times of day, month & year
when service is more critical?
• Downtimes, Response times, Updates
• Business contacts
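One way to capture the answers to these questions is a small record per service. This is an illustrative sketch only: the `ServicePriority` class, its field names, the default values, and the example service are all assumptions, not part of any standard SLA format.

```python
from dataclasses import dataclass, field
from datetime import time


@dataclass
class ServicePriority:
    # Hypothetical record; every field name and default here is illustrative.
    service: str
    priority: int                      # 1 = most critical to the business
    users: str                         # who uses the service and for what
    primary_hours: tuple[time, time]   # start and end of primary hours of use
    critical_periods: list[str] = field(default_factory=list)
    max_downtime_minutes: int = 60     # agreed downtime limit
    response_minutes: int = 15         # agreed time to first response
    update_interval_minutes: int = 60  # agreed interval between updates
    business_contacts: list[str] = field(default_factory=list)

    def in_primary_hours(self, at: time) -> bool:
        """True if `at` falls inside the primary hours of use.

        Assumes the window does not cross midnight.
        """
        start, end = self.primary_hours
        return start <= at <= end


# Example entry, written from the business users' point of view.
registration = ServicePriority(
    service="Course registration",
    priority=1,
    users="Students enrolling in courses",
    primary_hours=(time(8, 0), time(20, 0)),
    critical_periods=["first week of each term"],
    business_contacts=["registrar@example.edu"],
)
```

Keeping the record in the business's terms (who, when, how long) rather than in infrastructure terms makes it easier to agree on priority with the people who depend on the service.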
14. Major Incident Process
• Align with SLA
• Define IT Subject Matter Expert(s) for each Service
• Define who is in charge of communication during crisis
• Provide a crisis communication channel (Slack, Skype, etc)
• Define standard communication template
• Test the process by simulating a disaster
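A standard communication template can be as simple as a fill-in-the-blanks message that always states when the next update will arrive. The sketch below is one possible shape; the wording, the field names, and the `status_update` helper are hypothetical and should be agreed on before a crisis happens.

```python
from datetime import datetime, timedelta
from string import Template

# Hypothetical wording; the real template should be agreed on in advance.
STATUS_UPDATE = Template(
    "[$priority] $service: $status\n"
    "Impact: $impact\n"
    "What we are doing: $action\n"
    "Next update by: $next_update\n"
    "Contact: $contact"
)


def status_update(service, status, impact, action, contact, now,
                  priority="P1", interval_minutes=60):
    """Fill in the template and commit to a time for the next update."""
    next_update = now + timedelta(minutes=interval_minutes)
    return STATUS_UPDATE.substitute(
        priority=priority,
        service=service,
        status=status,
        impact=impact,
        action=action,
        contact=contact,
        next_update=next_update.strftime("%Y-%m-%d %H:%M"),
    )


message = status_update(
    service="Email",
    status="Investigating",
    impact="Users cannot send mail",
    action="SMEs are reviewing the change register",
    contact="it-comms@example.com",
    now=datetime(2024, 1, 1, 9, 0),
)
```

The same message can be posted to the crisis communication channel and sent to management, so every audience sees consistent information.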
16. During the Crisis
• Follow the major incident plan
• Communication is key
• Protect SME Investigators
• Schedule check-ins for updates
• Timely escalations
Incident flow: Detection, Classification & Initial Support, Escalation, Investigation & Diagnosis, Resolution
17. Root Cause Analysis
• What changed? Review Change Register
• Rule in / out likely root cause using simple logic
• The cause is usually the most obvious culprit
• Experience helps you to see common patterns, regardless
of the underlying technology
• May have to convince without blame
18. After Action Review
• What went well?
• What needs to improve?
• Priority of making improvements
• Review organizational maturity goals
• Feed into an action plan with deadlines, goals and report
back on improvement measures
19. Common Behaviors During Crises
• Complexity, misplaced optimism, lack of communication, too slow to react, blame
assigned elsewhere, unclear responsibilities
• Team cohesiveness, reaction appropriate to event, clear roles, communications
consistent and effective, after action reviews.
20. Teams and Roles
• The investment in managing and being part of a high performing
team pays off in a crisis
• Individuals need to clearly understand their roles and responsibilities
• Go by the plan, and be aware that the plan may not suit the crisis
• Communications are critical; be certain the crisis is actually resolved
before sounding “all clear”.
• Follow up/mop up is part of the process.
21. Roles
You’ve caused the issue…
Report the issue to your manager as soon as possible. Bad
news never ages well, and an honest report will help resolve
the issue more quickly.
You are charged with solving the issue…
Be clear on your role
Don’t waste energy on anger and blame
You are part of the team…
Offer assistance but don't be insulted if no one takes
you up on it right away - they will if it is a multi-day
crisis.
No job is too small - food, coffee, purchasing, etc.
No blame or gossip
22. Leadership Role
Report issue to your management including impact, cause
and the next time you will provide an update. You must
communicate directly to your management and include
other communications updates.
“Call” the crisis, so protocols are put into action.
Contact vendors and request the highest level assistance.
23. Leadership Role…
• Identify team - SMEs, communications, support - how often to meet (every
2-4 hours), how often to provide updates to internal/ external parties. You
are not the tech team.
• Be clear with team on specific roles so time doesn't get wasted on who's
doing what
• Be clear and positive to your team. They will watch you closely and model
their behavior after yours.
• If this is a multi day issue you will eventually need to sleep. Who has
authority in your absence?
24. Communications
Work with your leadership and tech team to
develop messaging, timing, communications
modalities
Provide updates at promised times, even if
there's nothing new to report.
People will reach out to you individually,
communicate directly with them.
25. Considerations for Women
• Be confident of your decision making
abilities. Repeat yourself until you
are heard
• Stepping up to the challenge will help
your career
• It’s not the time to be soft and have
round edges
• Don’t underestimate your knowledge
and experience
• Use your network for advice and
support
26. About Apologies
• Always apologize and follow it up with action
• A crisis, handled correctly, may give you an opportunity to develop and
deepen customer relationships.
“These are not the standards you expect from our organization”
“I sincerely apologize for this situation”
During a crisis, people need to be given specific and explicit roles and permissions
Who is managing the crisis
Who is in charge of communications: may be multiple people, covering company management, customers, and the crisis team itself
Be quick to provide information but not to say the issue is resolved. Communicate the timing of the next update and stick to it.
Who is managing the vendor
Food and sleep for multi day issues
Blaming and criticism and egos will only slow things down.
The people who run operations may not be the people who are good at managing a crisis
Liability issues…
If the legal team says you can’t apologize, tell them to draft apology language
This was more of an issue several years ago; data now shows that sincere apologies reduce lawsuits and settlements