This presentation was delivered at Gluecon 2018. It covered how to think about monitoring, a framework for incremental improvement in monitoring and common mistakes that teams make when approaching software monitoring.
7. Framework
1. Do a best-efforts analysis of what to monitor
• Bad things
• Good things
• Limit to a sprint or two of effort, you won’t get it perfect.
2. Perform post mortems to identify gaps in your monitoring
3. Update/improve monitoring based on findings
4. GOTO 2
21. What about the less obvious?
• Cost to serve each customer
• Feature use tracking to double down on what customers do the most
• Good things
• Any you’d add?
I’m John-Daniel Trask, or JD to everyone. First name is two names.
I’ve loved code since the age of 9: more than 25 years of coding away any chance I got. I’m a 10-year Microsoft MVP, a distinguished alumnus, and was awarded Wellingtonian of the Year in science and technology.
I have VM snapshots of various machines, and thought it amusing that I was writing monitoring tools when I was in my early teens (“Console”, which would track everything).
I ran businesses through high school and university. At high school I sold “browser privacy tools” to classmates…
In 2013 we launched Raygun, a software crash reporting product, followed in 2015 by a Real User Monitoring product. In April we announced our innovative approach to APM.
We’re processing billions of data points while I’m standing here. A lot of my learnings come from our own experience in monitoring, but also from conversations with customers.
Reminder, in case you’re in the wrong room or can’t remember what this talk was going to be about.
The target audience is mostly folks getting started, but I aim to provide value even to folks already focused on monitoring in their org.
The slides will be posted online. Easiest way to get them once posted: follow me on Twitter: traskjd
This is about monitoring your software, not everything else (e.g. osquery for monitoring your team’s machines).
How should we be thinking about monitoring? Here’s how to get started, and even if you already have monitoring in place, hopefully this challenges your thinking about what monitoring is really about.
Coda Hale: You’re not employed to code, you’re employed to create business value.
What is business value?
- Adding a new feature that customers want
- Improving an existing feature to please customers
- Reducing bugs that annoy customers
- Making our software faster so it doesn’t annoy our customers
- Making our site look better (could be worse!) to please customers
What is the common thread? Customers.
I talk about ‘we write code for human beings’, yet most of us rarely think about the user, or worse – hold them in disdain.
This is a basic getting started framework.
Fact is, there’s so much stuff out there to help. Look at Raygun: we do 3 things now, Crash Reporting (CR), Real User Monitoring (RUM) and APM.
We still get asked about logs, custom metrics, uptime monitoring, security reporting, StatsD endpoints and wire-level monitoring.
Big one for Raygun was StatsD.
This was what got us excited – so easy to start instrumenting our code.
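A minimal sketch of how little code that instrumentation takes, assuming a StatsD daemon listening on UDP at 127.0.0.1:8125 (the conventional default). The metric names and the `format_metric` and `send_metric` helpers are hypothetical, not Raygun’s code; only the `name:value|type` line format comes from StatsD itself.

```python
import socket

STATSD_ADDR = ("127.0.0.1", 8125)  # assumption: local StatsD daemon on its default port


def format_metric(name, value, metric_type):
    """Build one StatsD line: <name>:<value>|<type> (c = counter, ms = timer, g = gauge)."""
    return f"{name}:{value}|{metric_type}".encode("ascii")


def send_metric(name, value, metric_type, addr=STATSD_ADDR):
    """Fire-and-forget over UDP; monitoring must never break the application."""
    payload = format_metric(name, value, metric_type)
    try:
        with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
            sock.sendto(payload, addr)
    except OSError:
        pass  # no network or no daemon: drop the metric rather than fail the request
    return payload


# Hypothetical metrics: count a completed checkout and time how long it took.
send_metric("checkout.completed", 1, "c")
send_metric("checkout.duration", 320, "ms")
```

Because the transport is UDP, the calls return immediately whether or not anything is listening, which is a large part of why sprinkling them through existing code feels so cheap.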
Metrics are great for spotting trends, or issues, but they don’t tell you the why or how.
The “what’s broken” indicates the symptom; the “why” indicates a (possibly intermediate) cause. “What” versus “why” is one of the most important distinctions in writing good monitoring with maximum signal and minimum noise.
Whereas here’s the full story: the data behind the metric, helping me as a developer figure out the HOW and the WHY so I can resolve the issue.
There’s a discussion going on about monitoring versus observability, where the basic position seems to be that observability is a superset of monitoring….
Twitter defined observability as:
- Monitoring
- Alerting/visualization
- Distributed systems tracing infrastructure
- Log aggregation/analytics
However I count all of that as monitoring.
https://medium.com/@copyconstruct/monitoring-and-observability-8417d1952e1c
Something at each level. Doesn’t need to be perfect, but shouldn’t lie to you (more on this later!)
Why have I ordered it this way?
The user is the most important. If they aren’t happy, we aren’t getting paid. Best to track that most
The application helps understand things that are likely to impact the user.
Server monitoring.
But isn’t server monitoring super important? It is, but oftentimes its value lies in correlating it with user monitoring. For example, measure the load experience for the user; if it’s slow, look at the server data correlated with it. Maybe it’s a sign of maxed out
Next slide
Look at this, here’s just some stuff we could be doing…. So let’s get real.
It’s why my framework is to only do some at the start and then build it up over time. Trying to handle everything will waste a lot of time, money and won’t help. You’ll still find issues (kind of like 100% code coverage in unit tests – you still have bugs)
I’m biased, but errors are very easy to add and a high-value thing to track. They are literally where you crap all over your customer.
We see teams say “we don’t use this anymore”, yet they have 68,000 users a month getting errors… I wonder what the CEO would think about the team not bothering with 68,000 customers being let down each month. Error data also gives you the ammunition you need to ask for time to pay down technical debt, which is a common need, yet engineers typically get asked to keep doing feature development.
While the items that I listed impact users, we also want to be creative and think about the non-obvious.
Forget about the “well technically”, which is common for us engineers.
Think about the business value, the end user. That changes what we measure!
There’s lots of things that aren’t immediately obvious. However, they can create enormous business value.
Cost to serve is a huge one for many earlier-stage organizations. If you’re spending more to provide the service than the customer pays, you won’t be around very long. This is a number typically managed by VPs or higher, but helping them is never a bad idea.
It also helps you understand the cost to scale.
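As a rough illustration of cost to serve, here is a sketch that splits a shared monthly infrastructure bill across customers in proportion to their usage and flags anyone who costs more than they pay. The customer names, request counts, bill and revenue figures are all made up, and proportional allocation is just one simple assumption about how to attribute cost.

```python
def cost_to_serve(monthly_infra_cost, usage_by_customer):
    """Split a shared bill across customers in proportion to their usage."""
    total_usage = sum(usage_by_customer.values())
    return {
        customer: monthly_infra_cost * usage / total_usage
        for customer, usage in usage_by_customer.items()
    }


# Made-up numbers: requests served per customer and what each pays per month.
usage = {"acme": 6_000_000, "globex": 3_000_000, "initech": 1_000_000}
revenue = {"acme": 400, "globex": 500, "initech": 300}

costs = cost_to_serve(1000, usage)
unprofitable = [c for c in usage if costs[c] > revenue[c]]

for customer in usage:
    print(f"{customer}: costs {costs[customer]:.0f}, pays {revenue[customer]}")
print("costs more than they pay:", unprofitable)
```

With these numbers the heaviest user is also the only customer served at a loss, which is the kind of thing that stays invisible until usage is tracked per customer.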
I’m sure there are some examples in the audience. What’s a thing you monitored and were surprised by?
Getting the most out of your investment
Connect your data together
The key is often being able to easily correlate data across different monitors. For example, seeing a response time start exploding and rapidly identifying whether there’s an activity issue on your web server, the underlying database, one of the caches, etc.
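A rough sketch of what "correlating across monitors" can look like in practice. Everything here is made up for illustration: the metric names, the numbers, and the choice of a plain Pearson correlation are assumptions, not something from a specific tool. The idea is just to rank which backend metric moves together with the response time.

```python
import math

def pearson(xs, ys):
    """Pearson correlation between two equally long series."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)

# Hypothetical samples taken at the same six points in time.
response_time_ms = [110, 115, 120, 400, 900, 1500]
metrics = {
    "web_cpu_pct": [35, 36, 34, 35, 36, 35],
    "db_cpu_pct": [40, 42, 45, 80, 95, 99],
    "cache_hit_pct": [97, 96, 97, 96, 97, 96],
}

# Rank the monitors by how strongly they move with the response time.
ranked = sorted(
    metrics,
    key=lambda name: abs(pearson(response_time_ms, metrics[name])),
    reverse=True,
)
print(ranked[0])  # the first place to go and look
```

A high correlation is only a hint about where to look first, not proof of the cause.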
Connect your teams
One of the biggest wins we see is making monitoring more than just an engineering or SRE concern. Being able to lift error reports into Jira is one example – it connects product and project managers and helps them work how they like to, but in collaboration with engineering.
TVs
Just like I believe whiteboards are better than almost any digital equivalent, getting dashboards of live data on the wall is amazing. Suddenly key metrics become part of the water cooler chat.
Jump to next slide.
Averages are lies.
Why do so many tools in this area use them? Because it’s super cheap. But a cheap lie doesn’t make it a good lie.
Quantiles help us understand distribution
Bell Curve
- What we’re taught distributions look like.
- This shows the median and the 25% and 75%
- This is kind of bullshit. Think back to the Gates example, it ain’t a bell curve distribution. It’s almost always the same in software.
Actual distribution
- This is more common
- Sometimes you may even see a lump near the end
- Understanding outliers is key to better monitoring
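To make the "averages are lies" point concrete, here is a small sketch with invented response times (the numbers are an assumption, chosen to have a long tail). The percentile helper uses the simple nearest-rank method; real tools may interpolate differently.

```python
import statistics

# Hypothetical response times in milliseconds: most requests are fast,
# a handful are painfully slow (the long tail).
response_times_ms = [120] * 90 + [150] * 5 + [4000] * 5

def percentile(values, pct):
    """Nearest-rank percentile for an integer pct between 1 and 100."""
    ordered = sorted(values)
    rank = max(1, -(-pct * len(ordered) // 100))  # integer ceiling division
    return ordered[rank - 1]

mean_ms = statistics.mean(response_times_ms)
median_ms = statistics.median(response_times_ms)
p95_ms = percentile(response_times_ms, 95)
p99_ms = percentile(response_times_ms, 99)

print(mean_ms, median_ms, p95_ms, p99_ms)
```

The mean lands at a value that no user actually experienced, while the median and the upper percentiles show both the typical request and the unhappy tail.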
Why does more tooling not support this?
You need to store A LOT of data, and you need to then look at the % points after sorting it. This gets very slow.
Example: 100m events, which is not actually a lot. 8 bits in a byte, 64-bit numbers: you’re loading 762MB of data into memory, sorting it and taking single values at positions. Even if it’s 32-bit it’s a lot of data, but remember – 100m events is not that much when it comes to machine data!
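The arithmetic behind that number, as a quick sketch. It assumes one raw numeric value per event and nothing else (no timestamps, no object overhead), and counts a megabyte as 1024 × 1024 bytes.

```python
EVENTS = 100_000_000
BITS_PER_BYTE = 8

def raw_size_mib(events, bits_per_value):
    """Memory needed to hold one raw value per event, in MiB."""
    total_bytes = events * (bits_per_value // BITS_PER_BYTE)
    return total_bytes / (1024 * 1024)

size_64bit = raw_size_mib(EVENTS, 64)
size_32bit = raw_size_mib(EVENTS, 32)

print(f"64-bit: {size_64bit:.1f} MiB, 32-bit: {size_32bit:.1f} MiB")
```

That is before sorting, which is the expensive part of reading off exact percentile positions.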
Getting the most out of your investment
What happens on your server is not what happens to the user.
Ensure you track the customer experience.
Note about RUM and what we see with today’s very heavy JS frameworks
Noticing a trend here?
I’m big on making sure we always focus on the user.
Not uncommon to see tech teams try and avoid the costs associated with monitoring.
They might only monitor some things, or only a few servers. This causes problems.
Also, asking for money is easy if you are connecting it to the business value.
Noticing a pattern here?
Sampling has a place, but be wary around your tools.
Example: ecommerce provider with 1 server, costing 10% of all sales. Another CR tool was sampling but buried that note in their docs, so the customer couldn’t see the issue.
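A toy illustration of how partial coverage hides a problem like that. The setup is entirely hypothetical (ten servers, one of them failing every request, monitoring on only three); it is not the actual customer's data.

```python
from collections import Counter

# Hypothetical fleet: ten servers with 100 requests each,
# and one server ("web-10") failing every request it gets.
servers = [f"web-{i}" for i in range(1, 11)]
requests = [(server, server == "web-10") for server in servers for _ in range(100)]

def error_rate_by_server(events):
    """Map each server to the fraction of its requests that failed."""
    totals, failures = Counter(), Counter()
    for server, failed in events:
        totals[server] += 1
        failures[server] += failed  # True counts as 1
    return {server: failures[server] / totals[server] for server in totals}

overall_failure_rate = sum(failed for _, failed in requests) / len(requests)

# Full coverage sees the broken server.
full = error_rate_by_server(requests)

# Monitoring only a few servers sees nothing wrong at all.
monitored = {"web-1", "web-2", "web-3"}
partial = error_rate_by_server([e for e in requests if e[0] in monitored])

print(overall_failure_rate, full["web-10"], max(partial.values()))
```

One request in ten fails, yet the partial view reports a perfectly healthy system.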
Always, ALWAYS takes longer than you expect.
Not a sales pitch, but if I’ve spent $10m building a product, tell me how you’re going to do it yourself in six months? I want to hire you.
Also, statistics can be very hard.
It also introduces the concern that maybe the bug is in the monitoring tools.
There are great open source projects also, but consider the TCO of now managing that internally
DOES BUILDING IT YOURSELF CREATE BUSINESS VALUE? No. Unless you are Netflix etc.
Make it easy to surface statistics, monitor data, etc. If it’s difficult, it likely won’t be added when the time pressure is on.
Similar impact as with Unit Tests, oftentimes it won’t be done unless somebody else has already laid all the groundwork with mocks, fakes etc.
Make it so easy that it’s not considered a real cost to add (see: impact of StatsD)
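Part of why StatsD had that impact is that a metric is a single short line of text sent over UDP, in the form name:value|type. A minimal sketch follows; the metric names are hypothetical, and the host and port are assumptions (8125 is the usual StatsD default).

```python
import socket

def format_metric(name, value, metric_type):
    """Build a StatsD line, e.g. 'checkout.errors:1|c' for a counter."""
    return f"{name}:{value}|{metric_type}"

def send_metric(line, host="127.0.0.1", port=8125):
    """Fire-and-forget UDP send; no reply is expected from the daemon."""
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        sock.sendto(line.encode("ascii"), (host, port))

# Hypothetical metrics: "c" is a counter, "ms" is a timer.
error_count = format_metric("checkout.errors", 1, "c")
response_timer = format_metric("checkout.response_time", 320, "ms")

# Adding a metric at a call site is then a one-liner:
# send_metric(response_timer)
```

When adding a measurement costs one line and cannot block the request, it tends to get added even under time pressure.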
Raygun story of CTO’s pet project: error tracking, that almost nobody in the business can use. Did some magical things, shame only one person in this company of thousands actually could use the thing…
Other story: one customer had to employ a full time person to teach the team how to use dashboards! wtfbqq
We see this all the time, and it’s frustrating.
Raygun story: The highest-value thing we can do is hold training sessions with the team.
Story of Board Meetings (rare, but should be common).
Just installing it is kind of like buying painkillers but never actually using them when you’re in pain.
Remember how almost everything goes back to fellow humans?
Look, I know it’s awesome coding away.
Raygun Story: Events, taking engineers rather than sales people. 180 degree change. See the impact, feel the pain. Next-level engineer.
Welcome to GDPR. Where all your ‘I will build this or cobble it together myself’ could cost your company 4% of revenue when you’re audited. Youch!
Yet, I keep seeing this, and I think it’s the biggest threat to businesses in relation to compliance.