Doing monitoring right

This presentation was delivered at Gluecon 2018. It covered how to think about monitoring, a framework for incremental improvement in monitoring and common mistakes that teams make when approaching software monitoring.

Software

Monitoring:
Doing it the right way
Saving your sanity, making the world a better place.
John-Daniel Trask
Raygun

Who is this person?
Funny voice.
Weird name.
What’s his deal?

What we’re covering
Doing monitoring the right way.
Getting started, but also helping identify potential issues with your current monitoring.

Why monitor things?
• You’re not employed to write code.

Business value?
• I got a CS degree mate, not an MBA

Framework
1. Do a best-efforts analysis of what to monitor
• Bad things
• Good things
• Limit to a sprint or two of effort, you won’t get it perfect.
2. Perform post mortems to identify gaps in your monitoring
3. Update/improve monitoring based on findings
4. GOTO 2

Getting started
1. Something is better than nothing.
2. You can go a long way with some simple tools

Metrics & Monitoring
• Metrics are a given value or measure.
• Monitoring encapsulates everything.

Full monitoring: full story about an error

Monitoring vs. Observability
• Is there a difference?

Know what to measure
• You could track almost anything

• Crash reporting
• JavaScript log aggregation
• Metrics server (statsd)
• Alerting and pager tools
• Dashboarding tools
• Usage monitoring
• Real User Monitoring
• Structured and unstructured logging
• Up time monitoring
• Network monitoring
• Application performance monitoring
• Wire-level monitoring
• Server monitoring
• Canary logging
• Log aggregation service
• Distributed tracing
• Intrusion detection monitoring
• Employee device monitoring
• Cloud metrics from cloud provider
• Security monitoring
• Custom event tracking
• Advanced visualizing tooling
• Deployment tracking
• Infrastructure change monitoring
• User navigation and click tracking monitoring
• Infrastructure spend monitoring

The obvious
• Errors & error rate
• Server performance
• Requests per second per service
• Database call times
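
One low-effort way to start collecting a measure such as database call times is to wrap the call in a timer. The sketch below is illustrative only: `timed`, `samples`, and the `db.query_orders` metric name are hypothetical, and `time.sleep` stands in for a real database call.

```python
import time
from collections import defaultdict
from contextlib import contextmanager

# In-memory store of durations in milliseconds, keyed by metric name.
# A real setup would forward these to a metrics backend instead.
samples = defaultdict(list)

@contextmanager
def timed(name):
    """Record how long the wrapped block took, even if it raises."""
    start = time.perf_counter()
    try:
        yield
    finally:
        elapsed_ms = (time.perf_counter() - start) * 1000.0
        samples[name].append(elapsed_ms)

def load_orders():
    # Hypothetical stand-in for a database call.
    with timed("db.query_orders"):
        time.sleep(0.01)
        return ["order-1", "order-2"]

orders = load_orders()
print(samples["db.query_orders"])
```

Because the recording happens in a `finally` clause, slow calls that end in an exception are still measured, which is often when the timing matters most.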

What about the less obvious?
• Back to basics: business value users!

Amazon example
• When is the page loaded?

What about the less obvious?
• Cost to serve each customer
• Feature use tracking to double down on what customers do the most
• Good things
• Any you’d add?

Connect the dots
• Connect all your data together
• Connect teams

Information Radiators
• A fancy way of saying TV

Averages are lies
• Yet so many monitoring tools focus on them

On Average, everyone here is worth $900m.
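
A quick worked example of how a single outlier produces a figure like this. The room below is made up for illustration: 99 attendees worth $100k each, plus one person worth $90bn.

```python
from statistics import mean, median

# Hypothetical room: 99 attendees worth $100k each, plus one worth $90bn.
net_worths = [100_000] * 99 + [90_000_000_000]

average = mean(net_worths)    # dragged up by the single outlier
typical = median(net_worths)  # what most people in the room actually have

print(f"mean:   ${average:,.0f}")
print(f"median: ${typical:,.0f}")
```

The mean comes out at roughly $900m while the median stays at $100k, so the average describes nobody in the room.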

Why are quantiles hard?
• You need to store everything
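
A sketch of why this is so, assuming two hypothetical hosts with made-up latencies. A mean can be rebuilt exactly from a running sum and count per host, but a percentile cannot be rebuilt from per-host percentiles. The `percentile` helper uses the nearest-rank definition, which is only one of several in common use.

```python
import math

def percentile(values, p):
    """Nearest-rank percentile: the smallest value with at least p% of samples at or below it."""
    ordered = sorted(values)
    rank = max(1, math.ceil(p * len(ordered) / 100))
    return ordered[rank - 1]

# Hypothetical latencies in ms: a busy fast host and a quiet slow host.
host_a = [100] * 1000
host_b = [2000] * 10

# Means merge exactly from (sum, count) pairs; no raw samples needed.
merged_mean = (sum(host_a) + sum(host_b)) / (len(host_a) + len(host_b))

# Percentiles do not: averaging per-host p99s gives a misleading number.
avg_of_p99s = (percentile(host_a, 99) + percentile(host_b, 99)) / 2
true_p99 = percentile(host_a + host_b, 99)

print(merged_mean, avg_of_p99s, true_p99)
```

Here the averaged p99 is 1050 ms while the p99 over all requests is 100 ms. Getting the true figure means keeping the raw samples, or accepting an approximation such as sampling or histogram buckets that trades accuracy for bounded memory.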

Common mistakes
• Only measuring your servers

Common mistakes
• Only measuring the server

Common mistakes
• Saving money by flying blind

Common mistakes
• Making it difficult to add to new systems

Common mistakes
• Making it difficult to consume the data

Common mistakes
• Just buying/installing a tool doesn’t help

Common mistakes
• Not getting out of the building

Common mistakes
• Anyone have a mistake they’d love to share?

References & Links
• Observability vs. Monitoring: https://medium.com/@copyconstruct/monitoring-and-observability-8417d1952e1c
• Coda Hale, Metrics, Metrics Everywhere: https://www.youtube.com/watch?v=czes-oa0yik
• Google Site Reliability Book: https://landing.google.com/sre/
• Developers are your GDPR risk: https://jdtrask.com/post/software-developers-are-your-biggest-gdpr-risk.html
• Netflix tech blog: https://medium.com/netflix-techblog/

Questions?
Thank you for coming!
@traskjd
@raygunio
Raygun.com (I also have some swag)

Doing monitoring right

1. Monitoring: Doing it the right way Saving your sanity, making the world a better place. John-Daniel Trask Raygun

2. Who is this person? Funny voice. Weird name. What’s his deal?

3. What we’re covering Doing monitoring the right way. Getting started, but also helping identify potential issues With your current monitoring.

4. Getting started with monitoring

5. Why monitor things? • You’re not employed to write code.

6. Business value? • I got a CS degree mate, not an MBA

7. Framework 1. Do a best-efforts analysis of what to monitor • Bad things • Good things • Limit to a sprint or two of effort, you won’t get it perfect. 2. Perform post mortems to identify gaps in your monitoring 3. Update/improve monitoring based on findings 4. GOTO 2

8. Getting started 1. Something is better than nothing. 2. You can go a long way with some simple tools

10. Metrics & Monitoring • Metrics are a given value or measure. • Monitoring encapsulates everything.

11. Metric: error rate over time

12. Full monitoring: full story about an error

13. Monitoring vs. Observability • Is there a difference?

14. User Server Application

15. Know what to measure • You could track almost anything

16. Crash reporting JavaScript log aggregation Metrics server (statsd) Alerting and pager tools Dashboarding tools Usage monitoring Real User Monitoring Structured and unstructured logging Up time monitoring Network monitoring Application performance monitoring Wire-level monitoring Server monitoring Canary logging Log aggregation service Distributed tracing Intrusion detection monitoring Employee device monitoring Cloud metrics from cloud provider Security monitoring Custom event tracking Advanced visualizing tooling Deployment tracking Infrastructure change monitoring User navigation and click tracking monitoring Infrastructure spend monitoring

17. The obvious • Errors & error rate • Server performance • Requests per second per service • Database call times

18. What about the less obvious? • Back to basics: business value users!

19. Amazon example • When is the page loaded?

21. What about the less obvious? • Cost to serve each customer • Feature use tracking to double down on what customers do the most • Good things • Any you’d add?

22. Getting the most from monitoring

23. Connect the dots • Connect all your data together • Connect teams

24. Information Radiators • A fancy way of saying TV

26. Averages are lies • Yet so many monitoring tools focus on them

27. On Average, everyone here is worth $900m.

28. Quantiles • Median • P90 • P99 • P99.9
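
As an illustration of what these quantiles show that a mean hides, here is a small sketch over made-up response times. The data and the nearest-rank `percentile` helper are assumptions for the example; monitoring tools may compute percentiles with a different interpolation rule.

```python
import math
from statistics import mean

def percentile(values, p):
    """Nearest-rank percentile of a list of numbers."""
    ordered = sorted(values)
    rank = max(1, math.ceil(p * len(ordered) / 100))
    return ordered[rank - 1]

# Hypothetical response times in ms: 980 fast requests, 20 very slow ones.
latencies = [100] * 980 + [5000] * 20

summary = {
    "mean": mean(latencies),
    "median": percentile(latencies, 50),
    "p90": percentile(latencies, 90),
    "p99": percentile(latencies, 99),
    "p99.9": percentile(latencies, 99.9),
}
for name, value in summary.items():
    print(f"{name:>6}: {value} ms")
```

The mean (198 ms) matches no real request. The median and P90 show that most users see 100 ms, and P99 shows that the slowest 2% wait 5 seconds.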

29. P25 P75

31. Why are quantiles hard? • You need to store everything

33. Common monitoring mistakes

34. Common mistakes • Only measuring your servers

36. Common mistakes • Only measuring the server

37. Common mistakes • Saving money by flying blind

38. Common mistakes • Bad sampling of data

39. Common mistakes • Building it yourself

40. Common mistakes • Making it difficult to add to new systems

41. Common mistakes • Making it difficult to consume the data

43. Common mistakes • Just buying/installing a tool doesn’t help

44. Common mistakes • Not getting out of the building

45. Common mistakes • NEW: Compliance!

46. Common mistakes • Anyone have a mistake they’d love to share?

47. References & Links • Observability vs. Monitoring: https://medium.com/@copyconstruct/monitoring-and-observability-8417d1952e1c • Coda Hale, Metrics, Metrics Everywhere: https://www.youtube.com/watch?v=czes-oa0yik • Google Site Reliability Book: https://landing.google.com/sre/ • Developers are your GDPR risk: https://jdtrask.com/post/software-developers-are-your-biggest-gdpr-risk.html • Netflix tech blog: https://medium.com/netflix-techblog/

48. Questions? Thank you for coming! @traskjd @raygunio Raygun.com (I also have some swag)

Editor's Notes

I’m John-Daniel Trask, or JD to everyone. My first name is two names. I’ve loved code since the age of 9: more than 25 years of coding away any chance I got. I’m a 10-year Microsoft MVP, a distinguished alumnus, and was awarded Wellingtonian of the Year in science and technology. I have VM snapshots of various machines, and thought it amusing that I was writing monitoring tools when I was in my early teens (“Console”, which would track everything). I ran businesses through high school and university. At high school I sold “browser privacy tools” to classmates… In 2013 we launched Raygun, a software crash reporting product. In 2015, a Real User Monitoring product. And in April we announced our innovative approach to APM. We’re processing billions of data points while I’m standing here. A lot of my learnings are from our own experience in monitoring, but also from conversations with customers.
Reminder, in case you’re in the wrong room or can’t remember what this talk was going to be about. The target is more for folks getting started, but I aim to provide value even to the folks already focusing on monitoring in their org. The slides will be posted online. Easiest way to get them once posted: follow me on Twitter: @traskjd. This is about monitoring your software, not everything else (e.g. osquery for monitoring your team’s machines).
How should we be thinking about monitoring? Here’s how to get started, how to think about monitoring and even if you have monitoring in place, hopefully this challenges your thinking about what monitoring is really about.
Coda Hale: You’re not employed to code, you’re employed to create business value.
What is business value? - Adding a new feature that customers want - Improving an existing feature to please customers - Reducing bugs that annoy customers - Making our software faster so it doesn’t annoy our customers - Making our site look better (could be worse!) to please customers What is the common thread? Customers. I talk about ‘we write code for human beings’, yet most of us rarely think about the user, or worse – hold them in disdain.
This is a basic getting started framework. Fact is, there’s so much stuff out there to help. Look at Raygun, we do 3 things now – CR, RUM, APM. Still get asked about Logs, custom metrics, uptime monitoring, security reporting, statsd endpoints, wire level monitoring,
Big one for Raygun was StatsD.
This was what got us excited – so easy to start instrumenting our code.
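To give a feel for why StatsD makes instrumenting code so easy, here is a minimal sketch of emitting metrics by hand. It assumes a StatsD server listening on UDP port 8125 on localhost (the usual default); the metric names and helper functions are made up for illustration and are not from the talk.

```python
import socket


def format_metric(name: str, value: float, metric_type: str) -> str:
    """Build one StatsD line: <name>:<value>|<type> (c = counter, ms = timer, g = gauge)."""
    return f"{name}:{value}|{metric_type}"


def send_metric(line: str, host: str = "127.0.0.1", port: int = 8125) -> None:
    """Send one metric line over UDP. No reply is expected from the server."""
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        sock.sendto(line.encode("ascii"), (host, port))


if __name__ == "__main__":
    # Hypothetical metric names: count a completed checkout and time it in milliseconds.
    send_metric(format_metric("checkout.completed", 1, "c"))
    send_metric(format_metric("checkout.duration", 320, "ms"))
```

In practice you would likely use an existing StatsD client library rather than raw sockets, but the point stands: one line of code per metric.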
Metrics are great for spotting trends, or issues, but they don’t tell you the why or how. The “what’s broken” indicates the symptom; the “why” indicates a (possibly intermediate) cause. “What” versus “why” is one of the most important distinctions in writing good monitoring with maximum signal and minimum noise.
While here’s the full story, the data behind the metric. Helping me as a developer figure out the HOW and the WHY, so I can resolve the issue.
Discussion going on about these two, whereby the basics seem to be that observability is a super-set of monitoring… Twitter defined observability as: monitoring; alerting/visualization; distributed systems tracing infrastructure; log aggregation/analytics. However I count all of that as monitoring. https://medium.com/@copyconstruct/monitoring-and-observability-8417d1952e1c
Something at each level. Doesn’t need to be perfect, but shouldn’t lie to you (more on this later!) Why have I ordered it this way? The user is the most important. If they aren’t happy, we aren’t getting paid. Best to track that most. The application helps understand things that are likely to impact the user. Server monitoring. But isn’t server monitoring super important? It is, but oftentimes its value is in correlating with user monitoring. For example, measure the load time the user experiences; if it’s slow, look at the server data correlated with it. Maybe it’s a sign of maxed out
Next slide
Look at this, here’s just some stuff we could be doing…. So let’s get real. It’s why my framework is to only do some at the start and then build it up over time. Trying to handle everything will waste a lot of time, money and won’t help. You’ll still find issues (kind of like 100% code coverage in unit tests – you still have bugs)
I’m biased, but errors are a very easy-to-add, high-value thing to track. They are literally where you crap all over your customer. We see this “we don’t use this anymore”, but they have 68,000 users a month getting errors… I wonder what the CEO would think about the team not bothering with 68,000 customers being let down each month. It also gives you the ammunition you need to ask for time to pay down technical debt, which is a common need, yet engineers typically get asked to keep doing feature development.
While the items that I listed impact users, we also want to be creative and think about the non-obvious.
Forget about the “well technically”, which is common for us engineers. Think about the business value, the end user. That changes what we measure!
There are lots of things that aren’t immediately obvious. However, they can create enormous business value. Cost to serve is a huge one for many earlier stage organizations. If you’re spending more to provide the service than the customer pays, you won’t be around very long. This is a number typically managed by VPs or higher, but helping them is never a bad idea. It also leads to helping understand the cost to scale. I’m sure there are some examples in the audience? What’s a thing you monitored and were surprised by?
Getting the most out of your investment
Connect your data together. Key is often being able to easily correlate data across different monitors. For example, seeing a response time start exploding and rapidly identifying if there’s an activity issue on your web server, the underlying database, one of the caches, etc. Connect your teams. One of the biggest wins we see is making monitoring more than just an engineering or SRE concern. Being able to lift error reports into Jira is one example – it connects product and project managers and helps them work how they like to, but in collaboration with engineering.
TVs. Just like I believe whiteboards are better than almost any digital equivalent, getting dashboards of live data on the wall is amazing. Suddenly key metrics become part of the water cooler chat. Jump to next slide.
Averages are lies. Why do so many tools in this area use them? Because it’s super cheap. But a cheap lie doesn’t make it a good lie.
Quantiles help us understand distribution
Bell Curve - What we’re taught distributions look like. - This shows the median and the 25% and 75% - This is kind of bullshit. Think back to the Gates example, it ain’t a bell curve distribution. It’s almost always the same in software.
Actual distribution - This is more common - Sometimes you may even see a lump near the end - Understanding outliers is key to better monitoring
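A small sketch of why the average misleads on a long-tailed distribution while quantiles don't. The response times below are invented for illustration (98 fast requests, 2 very slow ones); only the Python standard library is used.

```python
import statistics

# Hypothetical response times in milliseconds: 98 fast requests and 2 very slow ones.
response_times_ms = [100] * 98 + [5000, 9000]

mean = statistics.mean(response_times_ms)
median = statistics.median(response_times_ms)

# quantiles(n=100) returns the 99 cut points p1..p99 of the data.
percentiles = statistics.quantiles(response_times_ms, n=100, method="inclusive")
p50, p95, p99 = percentiles[49], percentiles[94], percentiles[98]

print(f"mean={mean} median={median} p50={p50} p95={p95} p99={p99}")
```

The mean comes out at 238 ms, a number no request actually experienced: 98% of requests took 100 ms and the rest took seconds. The median shows the typical case and the high percentiles expose the outliers.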
Why does more tooling not support this? You need to store A LOT of data, and you need to then look at the % points after sorting it. This gets very slow. Example: 100m events, which is not actually a lot. With 64-bit numbers at 8 bytes each, you’re loading roughly 762MB of data into memory, sorting it and taking single values at positions. Even at 32-bit it’s a lot of data, but remember, 100m events is not that much when it comes to machine data!
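The arithmetic above can be checked in a few lines, alongside a sketch of what an exact percentile requires. The nearest-rank method used here is one common definition, chosen for illustration; it is an assumption, not necessarily what any particular tool implements.

```python
import math

EVENTS = 100_000_000
BYTES_PER_VALUE = 8  # one 64-bit number

total_bytes = EVENTS * BYTES_PER_VALUE
mebibytes = total_bytes / (1024 ** 2)  # 800,000,000 bytes, roughly 762.9 MiB


def nearest_rank_percentile(sorted_values, pct):
    """Exact percentile by nearest rank. Requires the full data set, already sorted."""
    rank = max(1, math.ceil(pct / 100 * len(sorted_values)))
    return sorted_values[rank - 1]


# Tiny hypothetical sample; the real cost is holding and sorting every event.
sample = sorted([5, 1, 3, 2, 4])
print(round(mebibytes, 1), nearest_rank_percentile(sample, 50))
```

The lookup itself is a single index operation. The expense is that every raw value must be kept and sorted first, whereas an average only needs a running sum and a count.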
Getting the most out of your investment
What happens on your server is not what happens to the user. Ensure you track the customer experience. Note about RUM and what we see with today’s very heavy JS frameworks
Noticing a trend here? I’m big on making sure we always focus on the user.
Not uncommon to see tech teams try and avoid the costs associated with monitoring. They might only monitor some things, or only a few servers. This causes problems. Also, asking for money is easy if you are connecting it to the business value. Noticing a pattern here? 
Sampling has a place, but be wary around your tools. Example: ecommerce provider with 1 server, costing 10% of all sales. Another CR tool was sampling but buried that note in their docs, so the customer couldn’t see the issue.
Always, ALWAYS takes longer than you expect. Not a sales pitch, but if I’ve spent $10m building a product, tell me how you’re going to do it yourself in six months? I want to hire you. Also, statistics can be very hard. Also, introduces concern that maybe the bug is in the monitoring tools. There are great open source projects also, but consider the TCO of now managing that internally. DOES BUILDING IT YOURSELF CREATE BUSINESS VALUE? No. Unless you are Netflix etc.
Make it easy to surface statistics, monitor data, etc. If it’s difficult, it likely won’t be added when the time pressure is on. Similar impact as with Unit Tests, oftentimes it won’t be done unless somebody else has already laid all the groundwork with mocks, fakes etc. Make it so easy that it’s not considered a real cost to add (see: impact of StatsD)
Raygun story of CTO’s pet project: error tracking, that almost nobody in the business can use. Did some magical things, shame only one person in this company of thousands actually could use the thing… Other story: one customer had to employ a full time person to teach the team how to use dashboards! wtfbqq
We see this all the time, and it’s frustrating. Raygun story: The highest value thing we can do is hold training sessions with the team. Story of Board Meetings (rare, but should be common). Just installing it is kind of like buying your painkillers but never actually using them when in pain.
Remember how almost everything goes back to fellow humans? Look, I know it’s awesome coding away. Raygun Story: Events, taking engineers rather than sales people. 180 degree change. See the impact, feel the pain. Next-level engineer.
Welcome to GDPR. Where all your ‘I will build this or cobble it together myself’ could cost your company 4% of revenue when you’re audited. Youch! Yet, I keep seeing this, and I think it’s the biggest threat to businesses in relation to compliance.
SUM UP WHAT WE COVERED
