In recent years it’s become evident that alerting is one of the biggest challenges facing modern Operations Engineers. Conference talks, hallway tracks, meetups, and the like are rife with discussions of poor signal-to-noise ratios in alerts, fatigue from false positives, and a general lack of actionability.
Our talk (informed by real-world experience designing, building, and maintaining our distributed, multi-tenant metrics/alerting service) takes a fundamental approach, examining alerting requirements and practices in the abstract. We put forth a comprehensive model, with best practices your team should follow regardless of your tool of choice.
This talk is equal parts cultural and technical, encompassing both computational capabilities and social practices, such as:
Defining organizational policy about where and when to set alerts
Ensuring the on-call engineer is armed with the information needed to take action
Best practices for configuring an alert
Fire-fighting after an alert has triggered
Performing analysis across your organization-wide history of alerts
26. What did he just say?
• Notifications are expensive: they hurt people and productivity
• Make people work harder to send them by requiring runbooks
• Runbooks add context to alerts; other kinds of context, like graphs, are valuable too
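One way to put the "require runbooks" rule into practice is a validation step (for example, in CI) that rejects any alert definition missing a runbook link or a graph link. The sketch below is a minimal illustration; the dict schema and field names (`runbook_url`, `graph_url`) are assumptions, not any particular tool’s format:

```python
# Validate that every alert definition carries the context an on-call
# engineer needs: a runbook URL and a link to a graph of the metric.
# The alert schema here is hypothetical, not tied to a specific tool.

def validate_alert(alert: dict) -> list[str]:
    """Return a list of problems; an empty list means the alert passes."""
    problems = []
    if not alert.get("runbook_url"):
        problems.append(f"alert {alert.get('name', '?')!r} has no runbook_url")
    if not alert.get("graph_url"):
        problems.append(f"alert {alert.get('name', '?')!r} has no graph_url")
    return problems

alerts = [
    {"name": "disk_full", "runbook_url": "https://wiki/runbooks/disk_full"},
    {"name": "high_latency",
     "runbook_url": "https://wiki/runbooks/latency",
     "graph_url": "https://graphs/latency"},
]

for alert in alerts:
    for problem in validate_alert(alert):
        print(problem)
```

Wiring a check like this into review or CI is the "make people work harder to send notifications" idea made concrete: an alert without context never ships.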
51. WE CAN REDUCE ALERTS BY IMPROVING OUR TELEMETRY SIGNAL
52. What did he just say?
• Monitoring isn’t a thing; it’s just part of the engineering process
• We’re treating it like a thing that only some types of engineers might want to do, and that’s giving us broken feedback
• Aerospace engineers are rad; they don’t do that
• Fix your monitoring and your alerts will follow
96. What did he just say?
• Choose metrics that tell you about the things you care about
• Alert when the things you care about hit limits you understand
• All alerts below critical go to chatrooms, ticket systems, or dashboards
• Critical alerts use an automated escalation service that enforces on-call policy
• Escalated alerts require acknowledgement
• Escalated alerts require runbook URLs and/or links to graphs of the metric
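The severity-based routing described above can be sketched as a small dispatch function. The sink names (`chatroom`, `escalation_service`) and the `Alert` shape are illustrative assumptions, not a real service’s API:

```python
# Route alerts by severity: anything below critical goes to low-urgency
# sinks (chat, tickets, dashboards); critical alerts go to an escalation
# service that pages on-call and requires acknowledgement. Critical
# alerts are refused entirely unless they carry a runbook URL.

from dataclasses import dataclass

@dataclass
class Alert:
    name: str
    severity: str          # "info", "warning", or "critical"
    runbook_url: str = ""  # required before a critical alert may page

def route(alert: Alert) -> str:
    if alert.severity != "critical":
        return "chatroom"          # low-urgency sink; nobody gets woken up
    if not alert.runbook_url:
        raise ValueError(f"critical alert {alert.name!r} needs a runbook_url")
    return "escalation_service"    # pages on-call, enforces the ack policy

print(route(Alert("high_latency", "warning")))                    # chatroom
print(route(Alert("disk_full", "critical", "https://wiki/disk")))  # escalation_service
```

Raising on a runbook-less critical alert, rather than paging anyway, is the point: the policy is enforced by the pipeline, not by convention.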
106. The Ultimate Recap
• Enforce a notification policy that requires context
• Make monitoring an engineering process
• Use the same signal for all metrics introspection and notification
• Encourage everyone to rely on telemetry data (graphs or it didn’t happen!)
• Everyone who collects a metric gets the keys to dashboard and alert design