SlideShare a Scribd company logo
1 of 27
Download to read offline
chronosphere.io
From Cardinal(ity) Sins to
Cost Efficient Metrics
Aggregation
Paige Cruz, retired SRE
open source observability advocate
chronosphere.io
CFO looking at the
o11y bill
chronosphere.io
chronosphere.io
chronosphere.io
Cloud Native Observability
bills are outrageous
Cloud Native Data Growth
7
Cloud
(IaaS,
VM-based)
2008 - 2018
Cloud Native
(Microservices and Containers)
2018 - ?
On-Premises
(Data center)
1998 - 2008
Business
Increase in
Scale
Observability
Data Increase
in Scale
*Source: ESG Distributed Cloud Series: Observability, Feb 2022, Scott Sinclair and Rob Strechay
chronosphere.io
Most recently [vendor] was looked at to help monitor a small Kubernetes test cluster. 3
nodes.
Now the base rate of $18/mo is fine…except now they charge $1 per container per month
past 10 containers per host.
Since K8s (depending on how you install it) runs a bunch of little containers handling various
back end things, you might not deploy anything to the cluster and still be WAY over that 10
container limit.
In our case it came out to like $200/mo to monitor 3 nodes - that were nowhere fully
loaded.
- Hacker News thread
chronosphere.io
Data volume
Experiment:
- Hello World app on 4 node
Kubernetes cluster with
Tracing, End User Metrics
(EUM), Logs, Metrics
(containers / nodes)
- 30 days == +450 GB
Mighty Metrics
“ 1 in 10 metrics are
actually directly
queried
- ServiceNow
Contributing Factors to the Metrics Bill
12
How many
things you’re
monitoring
# of containers
and infra
components
How often each
metric is
scraped
Metric
Granularity
How long you
keep the data
Retention
Window
How many
unique combos
of dimensions
on metrics
Cardinality
12
13
14
Cost of monitoring can be a factor in determining how quickly
to deprecate or sunset features/services/environments
# of containers and infra components
14
15
Emission time = adjust scrape_interval (from 10s samples ->
30s samples)
Ingest time = aggregate
Over time post-storage = downsampling
Metric Granularity
15
16
Aggregation: Roll Up
16
17
For operational metrics……most (99.9%) of queries do
not pass 7 days but average retention at original
granularity ranges from 2-4 weeks
Retention Window
17
18
Low value tags or entire metrics should be dropped
as early as possible
Dropping Data
18
19
Cardinality
19
Auditing Your
Metrics
“
What is the value
of this metric?
- You, when auditing metrics
Auditing Your Metrics
22
22
● Scope what your team is responsible for
○ filter queries with team:YOURS
● Identify easy wins. Metrics that aren’t
○ In a monitor definition
○ Directly queried by end users
○ Powering charts for visited dashboards
● Identify labels that are unnecessary
○ e.g. prometheus instance label or instance_type
● Share your successes!
23
23
24
24
CFO looking at the
cost efficiency of
metrics
Resources
- How Gloo uses the OTel Collector to drop metrics/labels
and provide the Minimum Metrics Set
- How to drop and delete metrics in Prometheus
- How can recording and data roll-up rules help your
metrics?
- Observability is Too Damn Expensive - DevOpsDays London
Catch up with me:
- Rescuing On-Call Engineers
(send your manager)
- KubeCon OTel 101: Let’s
Instrument! (tracing) workshop
- There’s No Place Like
Production Conf42 Incident
Management
paigerduty@
chronosphere.io
hachyderm.io
LinkedIn
Q&A

More Related Content

Similar to From Cardinal(ity) Sins to Cost-Efficient Metrics Aggregation

Cloud-Native Fundamentals: Accelerating Development with Continuous Integration
Cloud-Native Fundamentals: Accelerating Development with Continuous IntegrationCloud-Native Fundamentals: Accelerating Development with Continuous Integration
Cloud-Native Fundamentals: Accelerating Development with Continuous Integration
VMware Tanzu
 
Connecting Kafka to Cash (CKC) (Lyndon Hedderly, Confluent) Kafka Summit Lond...
Connecting Kafka to Cash (CKC) (Lyndon Hedderly, Confluent) Kafka Summit Lond...Connecting Kafka to Cash (CKC) (Lyndon Hedderly, Confluent) Kafka Summit Lond...
Connecting Kafka to Cash (CKC) (Lyndon Hedderly, Confluent) Kafka Summit Lond...
confluent
 

Similar to From Cardinal(ity) Sins to Cost-Efficient Metrics Aggregation (20)

Building and deploying microservices with event sourcing, CQRS and Docker (Ha...
Building and deploying microservices with event sourcing, CQRS and Docker (Ha...Building and deploying microservices with event sourcing, CQRS and Docker (Ha...
Building and deploying microservices with event sourcing, CQRS and Docker (Ha...
 
Clickhouse MeetUp@ContentSquare - ContentSquare's Experience Sharing
Clickhouse MeetUp@ContentSquare - ContentSquare's Experience SharingClickhouse MeetUp@ContentSquare - ContentSquare's Experience Sharing
Clickhouse MeetUp@ContentSquare - ContentSquare's Experience Sharing
 
ClickHouse Paris Meetup. ClickHouse at ContentSquare, by Christophe Kalenzaga...
ClickHouse Paris Meetup. ClickHouse at ContentSquare, by Christophe Kalenzaga...ClickHouse Paris Meetup. ClickHouse at ContentSquare, by Christophe Kalenzaga...
ClickHouse Paris Meetup. ClickHouse at ContentSquare, by Christophe Kalenzaga...
 
Container world 2019 Canary Release
Container world 2019 Canary ReleaseContainer world 2019 Canary Release
Container world 2019 Canary Release
 
Cloud-Native Fundamentals: Accelerating Development with Continuous Integration
Cloud-Native Fundamentals: Accelerating Development with Continuous IntegrationCloud-Native Fundamentals: Accelerating Development with Continuous Integration
Cloud-Native Fundamentals: Accelerating Development with Continuous Integration
 
Victor Chang: Cloud computing business framework
Victor Chang: Cloud computing business frameworkVictor Chang: Cloud computing business framework
Victor Chang: Cloud computing business framework
 
Connecting Kafka to Cash (CKC) (Lyndon Hedderly, Confluent) Kafka Summit Lond...
Connecting Kafka to Cash (CKC) (Lyndon Hedderly, Confluent) Kafka Summit Lond...Connecting Kafka to Cash (CKC) (Lyndon Hedderly, Confluent) Kafka Summit Lond...
Connecting Kafka to Cash (CKC) (Lyndon Hedderly, Confluent) Kafka Summit Lond...
 
Webinar: Dyn + DataStax - helping companies deliver exceptional end-user expe...
Webinar: Dyn + DataStax - helping companies deliver exceptional end-user expe...Webinar: Dyn + DataStax - helping companies deliver exceptional end-user expe...
Webinar: Dyn + DataStax - helping companies deliver exceptional end-user expe...
 
IBM Monitoring and Event Management Solutions
IBM Monitoring and Event Management SolutionsIBM Monitoring and Event Management Solutions
IBM Monitoring and Event Management Solutions
 
Running OpenStack in Production
Running OpenStack in ProductionRunning OpenStack in Production
Running OpenStack in Production
 
Outsourcing IT Projects to Managed Hosting of the Cloud
Outsourcing IT Projects to Managed Hosting of the CloudOutsourcing IT Projects to Managed Hosting of the Cloud
Outsourcing IT Projects to Managed Hosting of the Cloud
 
Case Study: HCL Technologies On Capacity Planning for Cloud and Virtualized E...
Case Study: HCL Technologies On Capacity Planning for Cloud and Virtualized E...Case Study: HCL Technologies On Capacity Planning for Cloud and Virtualized E...
Case Study: HCL Technologies On Capacity Planning for Cloud and Virtualized E...
 
1806 cosmic progress
1806 cosmic progress1806 cosmic progress
1806 cosmic progress
 
Jazz for Service Management
Jazz for Service ManagementJazz for Service Management
Jazz for Service Management
 
IRJET- Predicting Bitcoin Prices using Convolutional Neural Network Algor...
IRJET-  	  Predicting Bitcoin Prices using Convolutional Neural Network Algor...IRJET-  	  Predicting Bitcoin Prices using Convolutional Neural Network Algor...
IRJET- Predicting Bitcoin Prices using Convolutional Neural Network Algor...
 
When Data Visualizations and Data Imports Just Don’t Work
When Data Visualizations and Data Imports Just Don’t WorkWhen Data Visualizations and Data Imports Just Don’t Work
When Data Visualizations and Data Imports Just Don’t Work
 
Adtech scala-performance-tuning-150323223738-conversion-gate01
Adtech scala-performance-tuning-150323223738-conversion-gate01Adtech scala-performance-tuning-150323223738-conversion-gate01
Adtech scala-performance-tuning-150323223738-conversion-gate01
 
Adtech x Scala x Performance tuning
Adtech x Scala x Performance tuningAdtech x Scala x Performance tuning
Adtech x Scala x Performance tuning
 
Enhancing Data Security in Cloud Storage Auditing With Key Abstraction
Enhancing Data Security in Cloud Storage Auditing With Key AbstractionEnhancing Data Security in Cloud Storage Auditing With Key Abstraction
Enhancing Data Security in Cloud Storage Auditing With Key Abstraction
 
The Future of Secure Digital Transactions: QTMaaS
The Future of Secure Digital Transactions: QTMaaSThe Future of Secure Digital Transactions: QTMaaS
The Future of Secure Digital Transactions: QTMaaS
 

More from Paige Cruz

More from Paige Cruz (14)

Power Up with Podman - Kubernetes Community Day LA
Power Up with Podman - Kubernetes Community Day LAPower Up with Podman - Kubernetes Community Day LA
Power Up with Podman - Kubernetes Community Day LA
 
99.99% of Your Traces Are (Probably) Trash (SRECon NA 2024).pdf
99.99% of Your Traces  Are (Probably) Trash (SRECon NA 2024).pdf99.99% of Your Traces  Are (Probably) Trash (SRECon NA 2024).pdf
99.99% of Your Traces Are (Probably) Trash (SRECon NA 2024).pdf
 
OTel Orientation: How to Train Teams (OTel in Practice)
OTel Orientation: How to Train Teams (OTel in Practice)OTel Orientation: How to Train Teams (OTel in Practice)
OTel Orientation: How to Train Teams (OTel in Practice)
 
Avoiding Alert Bankruptcy and Burnout
 Avoiding Alert Bankruptcy and Burnout Avoiding Alert Bankruptcy and Burnout
Avoiding Alert Bankruptcy and Burnout
 
Tracing Adventures from PR - Production
Tracing Adventures from PR - ProductionTracing Adventures from PR - Production
Tracing Adventures from PR - Production
 
Threat Modeling in the Cloud
Threat Modeling in the CloudThreat Modeling in the Cloud
Threat Modeling in the Cloud
 
There's No Place Like Production
There's No Place Like ProductionThere's No Place Like Production
There's No Place Like Production
 
Taming Feral DevOps
Taming Feral DevOps Taming Feral DevOps
Taming Feral DevOps
 
SRECon23 Cognitive Apprenticeship in Action_ Alert Triage Hour of Power
SRECon23 Cognitive Apprenticeship in Action_ Alert Triage Hour of PowerSRECon23 Cognitive Apprenticeship in Action_ Alert Triage Hour of Power
SRECon23 Cognitive Apprenticeship in Action_ Alert Triage Hour of Power
 
Pushing Observability Uphill - The Single “Pain” of Glass
Pushing Observability Uphill - The Single “Pain” of GlassPushing Observability Uphill - The Single “Pain” of Glass
Pushing Observability Uphill - The Single “Pain” of Glass
 
Power Up with Podman
Power Up with PodmanPower Up with Podman
Power Up with Podman
 
Intro to Instrumentation
Intro to InstrumentationIntro to Instrumentation
Intro to Instrumentation
 
99.9% of Your Traces are Trash
99.9% of Your Traces are Trash99.9% of Your Traces are Trash
99.9% of Your Traces are Trash
 
3rd Wave Observability: Open or Bust
3rd Wave Observability: Open or Bust 3rd Wave Observability: Open or Bust
3rd Wave Observability: Open or Bust
 

Recently uploaded

IAC 2024 - IA Fast Track to Search Focused AI Solutions
IAC 2024 - IA Fast Track to Search Focused AI SolutionsIAC 2024 - IA Fast Track to Search Focused AI Solutions
IAC 2024 - IA Fast Track to Search Focused AI Solutions
Enterprise Knowledge
 
Artificial Intelligence: Facts and Myths
Artificial Intelligence: Facts and MythsArtificial Intelligence: Facts and Myths
Artificial Intelligence: Facts and Myths
Joaquim Jorge
 
CNv6 Instructor Chapter 6 Quality of Service
CNv6 Instructor Chapter 6 Quality of ServiceCNv6 Instructor Chapter 6 Quality of Service
CNv6 Instructor Chapter 6 Quality of Service
giselly40
 

Recently uploaded (20)

08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
 
Presentation on how to chat with PDF using ChatGPT code interpreter
Presentation on how to chat with PDF using ChatGPT code interpreterPresentation on how to chat with PDF using ChatGPT code interpreter
Presentation on how to chat with PDF using ChatGPT code interpreter
 
IAC 2024 - IA Fast Track to Search Focused AI Solutions
IAC 2024 - IA Fast Track to Search Focused AI SolutionsIAC 2024 - IA Fast Track to Search Focused AI Solutions
IAC 2024 - IA Fast Track to Search Focused AI Solutions
 
08448380779 Call Girls In Civil Lines Women Seeking Men
08448380779 Call Girls In Civil Lines Women Seeking Men08448380779 Call Girls In Civil Lines Women Seeking Men
08448380779 Call Girls In Civil Lines Women Seeking Men
 
Workshop - Best of Both Worlds_ Combine KG and Vector search for enhanced R...
Workshop - Best of Both Worlds_ Combine  KG and Vector search for  enhanced R...Workshop - Best of Both Worlds_ Combine  KG and Vector search for  enhanced R...
Workshop - Best of Both Worlds_ Combine KG and Vector search for enhanced R...
 
A Domino Admins Adventures (Engage 2024)
A Domino Admins Adventures (Engage 2024)A Domino Admins Adventures (Engage 2024)
A Domino Admins Adventures (Engage 2024)
 
Powerful Google developer tools for immediate impact! (2023-24 C)
Powerful Google developer tools for immediate impact! (2023-24 C)Powerful Google developer tools for immediate impact! (2023-24 C)
Powerful Google developer tools for immediate impact! (2023-24 C)
 
Artificial Intelligence: Facts and Myths
Artificial Intelligence: Facts and MythsArtificial Intelligence: Facts and Myths
Artificial Intelligence: Facts and Myths
 
Exploring the Future Potential of AI-Enabled Smartphone Processors
Exploring the Future Potential of AI-Enabled Smartphone ProcessorsExploring the Future Potential of AI-Enabled Smartphone Processors
Exploring the Future Potential of AI-Enabled Smartphone Processors
 
Boost Fertility New Invention Ups Success Rates.pdf
Boost Fertility New Invention Ups Success Rates.pdfBoost Fertility New Invention Ups Success Rates.pdf
Boost Fertility New Invention Ups Success Rates.pdf
 
2024: Domino Containers - The Next Step. News from the Domino Container commu...
2024: Domino Containers - The Next Step. News from the Domino Container commu...2024: Domino Containers - The Next Step. News from the Domino Container commu...
2024: Domino Containers - The Next Step. News from the Domino Container commu...
 
ProductAnonymous-April2024-WinProductDiscovery-MelissaKlemke
ProductAnonymous-April2024-WinProductDiscovery-MelissaKlemkeProductAnonymous-April2024-WinProductDiscovery-MelissaKlemke
ProductAnonymous-April2024-WinProductDiscovery-MelissaKlemke
 
CNv6 Instructor Chapter 6 Quality of Service
CNv6 Instructor Chapter 6 Quality of ServiceCNv6 Instructor Chapter 6 Quality of Service
CNv6 Instructor Chapter 6 Quality of Service
 
Evaluating the top large language models.pdf
Evaluating the top large language models.pdfEvaluating the top large language models.pdf
Evaluating the top large language models.pdf
 
Data Cloud, More than a CDP by Matt Robison
Data Cloud, More than a CDP by Matt RobisonData Cloud, More than a CDP by Matt Robison
Data Cloud, More than a CDP by Matt Robison
 
The Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdf
The Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdfThe Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdf
The Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdf
 
Scaling API-first – The story of a global engineering organization
Scaling API-first – The story of a global engineering organizationScaling API-first – The story of a global engineering organization
Scaling API-first – The story of a global engineering organization
 
🐬 The future of MySQL is Postgres 🐘
🐬  The future of MySQL is Postgres   🐘🐬  The future of MySQL is Postgres   🐘
🐬 The future of MySQL is Postgres 🐘
 
GenAI Risks & Security Meetup 01052024.pdf
GenAI Risks & Security Meetup 01052024.pdfGenAI Risks & Security Meetup 01052024.pdf
GenAI Risks & Security Meetup 01052024.pdf
 
Automating Google Workspace (GWS) & more with Apps Script
Automating Google Workspace (GWS) & more with Apps ScriptAutomating Google Workspace (GWS) & more with Apps Script
Automating Google Workspace (GWS) & more with Apps Script
 

From Cardinal(ity) Sins to Cost-Efficient Metrics Aggregation

  • 1. chronosphere.io From Cardinal(ity) Sins to Cost Efficient Metrics Aggregation Paige Cruz, retired SRE open source observability advocate
  • 7. Cloud Native Data Growth 7 Cloud (IaaS, VM-based) 2008 - 2018 Cloud Native (Microservices and Containers) 2018 - ? On-Premises (Data center) 1998 - 2008 Business Increase in Scale Observability Data Increase in Scale *Source: ESG Distributed Cloud Series: Observability, Feb 2022, Scott Sinclair and Rob Strechay
  • 8. chronosphere.io Most recently [vendor] was looked at to help monitor a small Kubernetes test cluster. 3 nodes. Now the base rate of $18/mo is fine…except now they charge $1 per container per month past 10 containers per host. Since K8s (depending on how you install it) runs a bunch of little containers handling various back end things, you might not deploy anything to the cluster and still be WAY over that 10 container limit. In our case it came out to like $200/mo to monitor 3 nodes - that were nowhere fully loaded. - Hacker News thread
  • 9. chronosphere.io Data volume Experiment: - Hello World app on 4 node Kubernetes cluster with Tracing, End User Metrics (EUM), Logs, Metrics (containers / nodes) - 30 days == +450 GB
  • 11. “ 1 in 10 metrics are actually directly queried - ServiceNow
  • 12. Contributing Factors to the Metrics Bill 12 How many things you’re monitoring # of containers and infra components How often each metric is scraped Metric Granularity How long you keep the data Retention Window How many unique combos of dimensions on metrics Cardinality 12
  • 13. 13
  • 14. 14 Cost of monitoring can be a factor in determining how quickly to deprecate or sunset features/services/environments # of containers and infra components 14
  • 15. 15 Emission time = adjust scrape_interval (from 10s samples -> 30s samples) Ingest time = aggregate Over time post-storage = downsampling Metric Granularity 15
  • 17. 17 For operational metrics……most (99.9%) of queries do not pass 7 days but average retention at original granularity ranges from 2-4 weeks Retention Window 17
  • 18. 18 Low value tags or entire metrics should be dropped as early as possible Dropping Data 18
  • 21. “ What is the value of this metric? - You, when auditing metrics
  • 22. Auditing Your Metrics 22 22 ● Scope what your team is responsible for ○ filter queries with team:YOURS ● Identify easy wins. Metrics that aren’t ○ In a monitor definition ○ Directly queried by end users ○ Powering charts for visited dashboards ● Identify labels that are unnecessary ○ e.g. prometheus instance label or instance_type ● Share your successes!
  • 23. 23 23
  • 24. 24 24 CFO looking at the cost efficiency of metrics
  • 25. Resources - How Gloo uses the OTel Collector to drop metrics/labels and provide the Minimum Metrics Set - How to drop and delete metrics in Prometheus - How can recording and data roll-up rules help your metrics? - Observability is Too Damn Expensive - DevOpsDays London
  • 26. Catch up with me: - Rescuing On-Call Engineers (send your manager) - KubeCon OTel 101: Let’s Instrument! (tracing) workshop - There’s No Place Like Production Conf42 Incident Management paigerduty@ chronosphere.io hachyderm.io LinkedIn
  • 27. Q&A