Monitoring Graceful Failure

How can you be sure your team is alerted to a failure before it causes an outage for your users? The move from monoliths to microservices has allowed pieces of functionality to be deployed individually and on demand. Isolating functionality makes it possible for one microservice to fail without bringing down the whole system. However, releasing and monitoring the API calls made across services has become more complex. Whether you’re launching a new product or iterating on a feature, delivering a delightful experience is crucial to your success. If something does fail, you’d prefer your users didn’t notice. Be thoughtful about how your system will degrade, how to inject failure to verify your design, and how the result is monitored. In this Sensu Summit 2019 talk, Lorne Kligerman, Director of Product at Gremlin, covers failing gracefully as an engineering goal that can be confidently tested and monitored with Chaos Engineering. By purposely causing the failure of one service at a time in a controlled environment, you can safely observe and react in a timely manner to limit the effect on the end user.
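As a rough illustration of the idea in the abstract, the sketch below shows one way a non-critical dependency could fail without the user noticing, while still leaving a signal for monitoring. It is not taken from the talk: the function names, the fallback list, and the in-memory `metrics` counter are all hypothetical stand-ins, and a real system would ship the counter to its monitoring tool.

```python
from typing import Callable, Dict, List

# Hypothetical static content served when the dependency is unavailable.
FALLBACK_ITEMS: List[str] = ["bestsellers"]

# Hypothetical in-memory counter standing in for a real monitoring metric.
metrics: Dict[str, int] = {"recommendations_degraded": 0}


def recommendations_or_fallback(fetch: Callable[[], List[str]]) -> List[str]:
    """Return live recommendations, or the static fallback if the call fails."""
    try:
        return fetch()
    except Exception:
        # The user still gets a page; monitoring still learns about the failure.
        metrics["recommendations_degraded"] += 1
        return list(FALLBACK_ITEMS)


def render_page(product: str, fetch: Callable[[], List[str]]) -> Dict[str, object]:
    """Build the page; the critical part (the product) never depends on fetch."""
    return {
        "product": product,
        "status": 200,
        "recommendations": recommendations_or_fallback(fetch),
    }


def healthy_service() -> List[str]:
    """Stand-in for a working recommendation service."""
    return ["item-a", "item-b"]


def failing_service() -> List[str]:
    """Stand-in for a recommendation service that is down."""
    raise ConnectionError("recommendation service unreachable")


if __name__ == "__main__":
    print(render_page("kettle", healthy_service))
    print(render_page("kettle", failing_service))
    print(metrics)
```

The design choice worth noting is that the failure is both hidden from the user and counted: a silent fallback with no metric would hide the problem from the team as well.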



1. Monitoring Graceful Failure. Lorne Kligerman, Director of Product, Gremlin (@lklig); Aaron Sachs, Customer Reliability Engineer, Sensu (@asachs01)

4. Be down in 10! T-Ho 2017. Hey team… bit of a spill but I’m fine.

5. We Expect Technology To Just Work™

6. Black Friday Failures. Headlines: “Technical Issues Likely Cost Retailers Billions”; “Macy’s, Lowe’s hit by Black Friday technical glitches”; “Retail outages online leave shoppers frustrated on Black Friday” (People.com).

7. Airline Incidents. Headlines: “Computer Problems Blamed For Flight Delays” (4.1.19); “Major US Airlines hit by delays after glitch at vendor” (4.1.19); “Pilots of doomed Boeing 737 MAX fought the plane’s software and lost” (4.4.19).

8. Technology is fragile. Plan ahead to keep your users happy. Diagram labels: Failure, Graceful Degradation.

9. Why Are Failures So Common?

10. Legacy Systems

11. Lack of Testing. Diagram labels: Failure, UI, End to end, Integration, Unit.

13. What Can We Do About It?

14. Design For Failure

15. Loading Screens Are Not Graceful

16. Fail on Your Own Terms: Key User Stories & Features; Edge Cases From Unexpected User Behaviour; Dependency Failures

17. Inject Failure By Breaking Things On Purpose

18. Inject failure one service at a time. Maintain critical functionality.

19. Degrade Gracefully

20. When one dependency fails, users are often affected. Diagram labels: Storage, Auth, User Data, Content, Cache, Feature 1, Feature 2.

22. Monitoring + Chaos Engineering

23. Let Monitoring Know

26. Let The Right People Know

29. Closing the Loop

32. Reliability Through Chaos Engineering: Graceful Failure. Design for Failure: identify the most critical end-user functionality. Inject Failure: impact your system to be sure your user experience isn’t impacted. Degrade Gracefully: plan for non-critical functionality not to get in the way. Delight Your Users: your product metrics will show behaviour, no matter the condition.

33. USE inthefamily FOR $50 OFF

34. gremlin.com/lorne

35. Q&A. Lorne Kligerman, Director of Product, Gremlin (@lklig); Aaron Sachs, Customer Reliability Engineer, Sensu (@asachs01)
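The advice on slides 17 to 20 (inject failure one service at a time, maintain critical functionality) can be sketched as a small experiment loop. This is a toy model under stated assumptions, not the tooling shown in the talk: the dependency names come from the slide 20 diagram, but the handler, the choice of `content` as the only critical dependency, and every function name are hypothetical.

```python
from typing import Any, Dict, List, Set

# Dependency names borrowed from the slide 20 diagram.
DEPENDENCIES: List[str] = ["storage", "auth", "user_data", "content", "cache"]


def call(name: str, failed: Set[str]) -> str:
    """Simulated dependency call; raises if a failure was injected for it."""
    if name in failed:
        raise ConnectionError(f"{name} unavailable (injected)")
    return f"{name}-ok"


def handle_request(failed: Set[str]) -> Dict[str, Any]:
    """Toy handler. Assumption: only 'content' is critical to the user."""
    try:
        body = call("content", failed)
    except ConnectionError:
        return {"status": 503, "degraded": ["content"]}

    response: Dict[str, Any] = {"status": 200, "content": body, "degraded": []}
    for optional in ("storage", "auth", "user_data", "cache"):
        try:
            response[optional] = call(optional, failed)
        except ConnectionError:
            # Optional dependency: record the degradation and carry on.
            response["degraded"].append(optional)
    return response


def run_experiment() -> Dict[str, int]:
    """Fail exactly one dependency per run and record the resulting status."""
    return {dep: handle_request({dep})["status"] for dep in DEPENDENCIES}


if __name__ == "__main__":
    for dep, status in run_experiment().items():
        print(f"failed={dep:<10} status={status}")
```

In this toy model the loop shows that four of the five dependencies can fail without an error reaching the user, and that `content` cannot. That is the kind of finding a one-at-a-time experiment is meant to surface before a real outage does.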
