Paging, Alerting, Chaos Eng Overview


matthewbrahms

A talk I gave for #sre-office-hours to our Eng team about how I propose we run Paging/Alerting/Chaos Eng!


Paging/Alerting
Workshop
#sre-office-hours | 04.04.2018

How reliable/accurate
would this information be?

What’s the acceptable time
to be alerted/respond?

An Initial Framework for
Discussing SEVs

What is a SEV?
SEV is a term used to refer to an incident; it is derived from the
word "severity".

Common types of SEV
- Availability Drop
- Product Issue / Feature Broken
- Data Loss
- Security Risk
- etc.

Any SEV which involves a
loss of customer data should
be classified as SEV0.
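A minimal sketch of the data-loss rule above, assuming a numeric scale where 0 is the most severe level. The `Incident` type, its field names, and the example incidents are hypothetical; only the "customer data loss means SEV0" rule comes from the slide.

```python
from dataclasses import dataclass


@dataclass
class Incident:
    title: str
    proposed_level: int  # hypothetical scale: 0 is the most severe
    customer_data_lost: bool = False


def effective_level(incident: Incident) -> int:
    """Apply the data-loss rule: any loss of customer data means SEV0."""
    if incident.customer_data_lost:
        return 0
    return incident.proposed_level


# Made-up incidents, for illustration only.
broken_feature = Incident("Checkout button broken", proposed_level=2)
lost_rows = Incident("Customer rows dropped", proposed_level=2, customer_data_lost=True)
print(effective_level(broken_feature), effective_level(lost_rows))
```

The point of making the override explicit is that nobody has to debate the level during the incident when customer data is involved.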

Calculate Critical
Uptime in 9’s
https://uptime.is/
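To make the "9's" concrete, here is a rough sketch of the arithmetic behind an uptime target: the downtime budget is the period length times (1 - uptime). It assumes a 365-day year, so the numbers may differ slightly from what uptime.is reports; the function name is just for illustration.

```python
# Assumption: a 365-day year (no leap-year adjustment).
SECONDS_PER_YEAR = 365 * 24 * 60 * 60


def allowed_downtime_seconds(uptime_percent: float,
                             period_seconds: int = SECONDS_PER_YEAR) -> float:
    """Return the downtime budget for an uptime target over a period."""
    if not 0 <= uptime_percent <= 100:
        raise ValueError("uptime_percent must be between 0 and 100")
    return period_seconds * (1 - uptime_percent / 100)


for target in (99.0, 99.9, 99.99, 99.999):
    minutes = allowed_downtime_seconds(target) / 60
    print(f"{target}% uptime allows about {minutes:.1f} minutes of downtime per year")
```

Each extra 9 cuts the yearly budget by a factor of ten, which is worth keeping in mind before publishing a number.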

What uptime do you
think we could safely
publish?
Would we be okay
telling our
customers/vendors that
number?

How to measure a SEV?
% loss * outage duration
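One possible reading of the formula above, as a sketch: treat "% loss" as the share of requests (or users) affected and multiply by the outage length in minutes. The unit choice and the function name are assumptions, not something the slide specifies.

```python
def sev_impact(loss_percent: float, outage_minutes: float) -> float:
    """Impact = fraction lost * outage duration, in full-outage-equivalent minutes."""
    if not 0 <= loss_percent <= 100:
        raise ValueError("loss_percent must be between 0 and 100")
    if outage_minutes < 0:
        raise ValueError("outage_minutes must not be negative")
    return (loss_percent / 100) * outage_minutes


# Made-up numbers: a partial but long outage vs. a total but short one.
partial_long = sev_impact(loss_percent=50, outage_minutes=30)
total_short = sev_impact(loss_percent=100, outage_minutes=5)
print(partial_long, total_short)
```

With this measure, the 50% loss over 30 minutes scores higher than the total outage that lasted 5 minutes, which is the kind of comparison the formula is meant to enable.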

How do we maintain
combat effectiveness
during a SEV?

Incident Manager On-Call (IMOC)
- Should be a small rotation of Engineering Leaders
- Only one person is on-call in this role at any point in time
- These people should possess a wide knowledge of services and
engineering teams
- Will be our version of Air-Traffic-Control for the SEV, ensuring
different people working on the SEV are organized and working
coherently as a unit!

Tech Lead On-Call (TLOC)
- This would be the engineer driving resolution of the SEV
- Should have deep knowledge of a specific domain;
be an SME (Subject Matter Expert)
- Should have a deep knowledge of upstream and downstream
dependencies

What we need to define to have these roles:
- IMOC runbook/guide
- Designate a Primary and Secondary IMOC at all times
- Escalation should be automatic
- Monthly sync for all IMOC and TLOC
- Way to quickly triage what systems are affected/find root cause
  - How would we do this?
- How do we record / document SEVs?
  - Google Form? Git repo? Suggestions??
- SEV naming convention
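As a sketch of what "escalation should be automatic" with a Primary and Secondary IMOC could look like: the page moves one step down the chain each time a timeout passes without acknowledgement. The 5-minute timeout, the names in the chain, and the function itself are hypothetical; a paging tool would normally handle this for us through its own escalation settings.

```python
from typing import Sequence


def who_to_page(chain: Sequence[str],
                minutes_unacknowledged: float,
                timeout_minutes: float = 5.0) -> str:
    """Return who should currently hold the page.

    The page moves one step down the chain per elapsed timeout
    and stays with the last person once the chain is exhausted.
    """
    if not chain:
        raise ValueError("escalation chain must not be empty")
    if timeout_minutes <= 0:
        raise ValueError("timeout_minutes must be positive")
    step = max(0, int(minutes_unacknowledged // timeout_minutes))
    return chain[min(step, len(chain) - 1)]


# Hypothetical rotation names.
imoc_chain = ["primary-imoc", "secondary-imoc"]
for waited in (0, 4.9, 5, 60):
    print(f"{waited} min without ack -> page {who_to_page(imoc_chain, waited)}")
```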

What happens when we
don’t meet our uptime
requirements?!?

Technical Issues
● Dependency Failure
● Cloud Provider Region/Zone Failure
● Provider Failure
● Connectivity Issues
● Power issues (our local office power affects AWS RDS!)
● DNS outage/latency
● Misconfiguration of machines/Docker images
● Software Bugs
● Corrupt/unavailable backups

Cultural Issues
● Lack of knowledge sharing
● Lack of knowledge handover
● Lack of on-call training
● Lack of chaos engineering
● Lack of a high severity incident management program
● Lack of documentation and playbooks
● Lack of alerts and pages
● Lack of effective alerting thresholds
● Lack of backup strategy

How do we prevent SEVs from repeating?
● Combination of:
○ Record outages
○ Correlate failures
○ Track SEVs
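A small sketch of how recording and correlating could work in practice: keep one structured record per SEV, then count causes to spot repeats. The record fields, the naming pattern, and the sample data are all made up for illustration; where the records actually live is still an open question.

```python
from collections import Counter
from dataclasses import dataclass
from typing import Dict, Iterable


@dataclass(frozen=True)
class SevRecord:
    name: str   # hypothetical naming convention
    level: int  # 0 is the most severe
    cause: str


def recurring_causes(records: Iterable[SevRecord], min_count: int = 2) -> Dict[str, int]:
    """Return causes that appear at least min_count times across recorded SEVs."""
    counts = Counter(record.cause for record in records)
    return {cause: n for cause, n in counts.items() if n >= min_count}


# Made-up history, for illustration only.
history = [
    SevRecord("sev-001", 1, "DNS outage"),
    SevRecord("sev-002", 2, "Software bug"),
    SevRecord("sev-003", 1, "DNS outage"),
]
print(recurring_causes(history))
```

Any cause that shows up here more than once is a candidate for a follow-up fix or a targeted failure test.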

What if we could break
things safely!?
What lessons/data could
we gather?

Chaos Engineering...yes, it is a real thing!
● 2010 - Netflix created the Chaos Monkey, which can wreak
havoc in AWS at will by deleting instances (but is fully
customizable/controllable); it has been OSS since 2012
● 2011 - Netflix created the Simian Army, a host of chaos tools to
test failure modes in your infrastructure and applications
● 2014 - The role of Chaos Engineer was created at Netflix
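To give a feel for the "deleting instances, but controllable" idea, here is a toy sketch that only works on an in-memory list of names. It is not Chaos Monkey's code or API; the function names, the 25% cap, and the dry-run default are assumptions chosen to show how an experiment can be kept safe.

```python
import random
from typing import List, Optional


def pick_victims(instances: List[str],
                 max_fraction: float = 0.25,
                 rng: Optional[random.Random] = None) -> List[str]:
    """Pick a random subset of instances, capped at max_fraction of the fleet."""
    if not 0 <= max_fraction <= 1:
        raise ValueError("max_fraction must be between 0 and 1")
    rng = rng or random.Random()
    limit = int(len(instances) * max_fraction)
    return rng.sample(instances, limit)


def run_experiment(instances: List[str], dry_run: bool = True, **kwargs) -> List[str]:
    """Report which instances would be terminated; real termination is left unimplemented."""
    victims = pick_victims(instances, **kwargs)
    for name in victims:
        if dry_run:
            print(f"[dry run] would terminate {name}")
        else:
            raise NotImplementedError("wire this to your own infrastructure tooling")
    return victims


# Hypothetical fleet of eight instances.
fleet = [f"web-{i}" for i in range(8)]
run_experiment(fleet, max_fraction=0.25)
```

The cap on how much of the fleet can be hit and the dry-run default are the two knobs that keep the blast radius small while we learn.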

And we all get to “share
the pain” with our new
tool PagerDuty...

Credits
- https://www.gremlin.com/community/tutorials/how-to-establish-a-high-severity-incident-management-program/
- https://www.gremlin.com/community/tutorials/chaos-engineering-the-history-principles-and-practice/
- https://www.gremlin.com/the-discipline-of-chaos-engineering/
- https://github.com/tammybutow/chaos_engineering_bootcamp
- https://www.usenix.org/conference/srecon17americas/program/presentation/andrus
They did a cool workshop about Chaos Engineering
with hands-on labs at SREcon this year. If you like this
notion of chaos, more of us should go next year!
#notjustforSRE


1. Paging/Alerting Workshop #sre-office-hours | 04.04.2018
2. What qualifies as an incident?
3. What should happen in an incident?
4. Who should be alerted/paged?
5. What tools would we use to gather data?
6. How reliable/accurate would this information be?
7. What’s the acceptable time to be alerted/respond?
8. An Initial Framework for Discussing SEVs
9. What is a SEV?
10. Common types of SEV
11. SEV Levels
12. Any SEV which involves a loss of customer data should be classified as SEV0.
13. Calculate Critical Uptime in 9’s
14. What uptime do you think we could safely publish?
15. SEV Terminology
16. Lifecycle of a SEV event
17. How to measure a SEV?
18. Visualizing/tracking SEVs
19. How do we maintain combat effectiveness during a SEV?
20. Incident Manager On-Call (IMOC)
21. Tech Lead On-Call (TLOC)
22. What we need to define to have these roles
23. What happens when we don’t meet our uptime requirements?!?
24. What causes SEVs?
25. Technical Issues
26. Cultural Issues
27. How do we prevent SEVs from repeating?
28. Chaos Engineering!
29. What if we could break things safely!?
30. Chaos Engineering...yes, it is a real thing!
31. Principles of Chaos Engineering
32. Can we do this?
34. And we all get to “share the pain” with our new tool PagerDuty...
35. Credits
36. Questions/thoughts? #sre-office-hours
