19. How do we maintain combat effectiveness during a SEV?
20. Incident Manager On-Call (IMOC)
- Should be a small rotation of Engineering Leaders
- Only one person is on-call in this role at any point in time
- These people should possess a wide knowledge of services and
engineering teams
- Will be our version of air traffic control for the SEV, ensuring
the different people working on the SEV are organized and working
coherently as a unit
21. Tech Lead On-Call (TLOC)
- This would be the engineer driving resolution of the SEV
- Should have deep knowledge of a specific domain;
be an SME (Subject Matter Expert)
- Should have a deep knowledge of upstream and downstream
dependencies
22. What we need to define to have these roles:
- IMOC runbook/guide
- Designate a Primary and Secondary IMOC at all times
- Escalation should be automatic
- Monthly sync for all IMOC and TLOC
- A way to quickly triage which systems are affected and find the root cause
  - How would we do this?
- How do we record / document SEVs?
  - Google Form? Git repo? Suggestions?
- SEV naming convention
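To make "escalation should be automatic" concrete, here is a minimal sketch of one possible rule: page the Primary IMOC first, and switch to the Secondary if the page is not acknowledged in time. The names `Page` and `imoc_to_page` and the 5-minute timeout are assumptions for illustration, not part of any existing paging tool.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import Optional


@dataclass
class Page:
    """A single page sent for a SEV (hypothetical structure)."""
    sent_at: datetime
    acked_at: Optional[datetime] = None


def imoc_to_page(
    page: Page,
    now: datetime,
    primary: str,
    secondary: str,
    ack_timeout: timedelta = timedelta(minutes=5),  # assumed value
) -> Optional[str]:
    """Return who should be paged right now, or None once the page is acked."""
    if page.acked_at is not None:
        return None  # someone owns the SEV, stop escalating
    if now - page.sent_at >= ack_timeout:
        return secondary  # primary did not respond in time
    return primary
```

A scheduler would call `imoc_to_page` periodically, so nobody has to remember to escalate by hand.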
25. Technical Issues
● Dependency Failure
● Cloud Provider Region/Zone Failure
● Provider Failure
● Connectivity Issues
● Power issues (our local office power affects AWS RDS!)
● DNS outage/latency
● Misconfiguration of machines/Docker images
● Software Bugs
● Corrupt/unavailable backups
26. Cultural Issues
● Lack of knowledge sharing
● Lack of knowledge handover
● Lack of on-call training
● Lack of chaos engineering
● Lack of a high severity incident management program
● Lack of documentation and playbooks
● Lack of alerts and pages
● Lack of effective alerting thresholds
● Lack of backup strategy
27. How do we prevent SEVs from repeating?
● Combination of:
○ Record outages
○ Correlate failures
○ Track SEVs
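The three steps above can be sketched in a few lines. This is only an illustration under stated assumptions: the `SevRecord` fields, the `SEV<level>-<date>-<sequence>` name format, and the `repeat_offenders` helper are all hypothetical, since the naming convention and record format are still open questions.

```python
from collections import Counter
from dataclasses import dataclass
from datetime import date
from typing import Dict, Iterable, Tuple


@dataclass(frozen=True)
class SevRecord:
    """One recorded SEV (hypothetical fields)."""
    day: date
    sequence: int  # nth SEV opened on that day
    level: int     # assumed scale: lower number = more severe
    service: str
    cause: str     # e.g. "dns", "dependency", "misconfiguration"

    @property
    def name(self) -> str:
        # Assumed naming convention: SEV<level>-<YYYYMMDD>-<sequence>
        return f"SEV{self.level}-{self.day:%Y%m%d}-{self.sequence:02d}"


def repeat_offenders(
    records: Iterable[SevRecord], min_count: int = 2
) -> Dict[Tuple[str, str], int]:
    """Correlate failures: count (service, cause) pairs seen at least min_count times."""
    counts = Counter((r.service, r.cause) for r in records)
    return {key: n for key, n in counts.items() if n >= min_count}
```

Pairs that show up repeatedly are the candidates for follow-up work, because they point at a SEV that has already repeated.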
29. What if we could break things safely?!
What lessons/data could we gather?
30. Chaos Engineering...yes, it is a real thing!
● 2010 - Netflix created Chaos Monkey, which can wreak havoc in
AWS at will by terminating instances (but is fully
customizable/controllable); it has been OSS since 2012
● 2011 - Netflix created the Simian Army, a host of chaos tools to
test failure modes in your infrastructure and applications
● 2014 - The role of Chaos Engineer was created at Netflix
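To give a feel for the idea behind these tools, here is a toy sketch of a "break things safely" experiment: pick one random instance from an opted-in group and terminate it, with a dry-run default. This is not Chaos Monkey's actual code or API; `unleash_monkey`, its parameters, and the `terminate` callback are hypothetical names used only for illustration.

```python
import random
from typing import Callable, Optional, Sequence


def unleash_monkey(
    instances: Sequence[str],
    terminate: Callable[[str], None],
    probability: float = 1.0,
    dry_run: bool = True,
    rng: Optional[random.Random] = None,
) -> Optional[str]:
    """Pick one random victim from an opted-in group.

    Returns the chosen instance ID, or None if nothing was selected.
    The victim is only terminated when dry_run is False.
    """
    rng = rng or random.Random()
    # Skip this run if the group is empty or the dice say "not today".
    if not instances or rng.random() >= probability:
        return None
    victim = rng.choice(list(instances))
    if not dry_run:
        terminate(victim)  # caller supplies the real termination call
    return victim
```

Defaulting to `dry_run=True` and requiring an explicit instance list keeps the experiment controllable: you see what would be hit before anything is actually broken.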