Reduce SRE Stress
Minimizing Service Downtime with Grafana, InfluxDB and Telegraf
© 2022 NetApp, Inc. All rights reserved.
Dustin Sorge
Lead Site Reliability Engineer
Agenda
• About NetApp
• BAERO Team
• About me
• What is SRE?
• Life Before InfluxDB
• Example NetApp Services
• Key Metrics for SRE
• SLOs, SLIs, Error Budgets, MTTA, MTTR, MTBF
• Visualization via custom dashboards
• Measuring Key Metrics with InfluxDB
• Service outage measurements
• Strategic tags and fields
• Real-time SLI calculation example with TICKscript
• Coordinated Incident Response
• Follow the Sun
• Slack alerting via Kapacitor
• Measuring System Resources
• Tips
• Next Steps
• Summary
• Questions?
NetApp at a glance
• Global cloud-led, data-centric software company
• Industry-leading software, systems, and cloud services
• Founded in 1992, headquartered in San Jose, California
• $5.74B FY21 revenue
• 11,000+ employees and 4,000+ partners helping organizations thrive in a hybrid multi-cloud world
• Fortune 500 company (NASDAQ: NTAP)
Industry-leading software and systems
• Object storage
• Converged and hybrid cloud infrastructure
• Flash and hybrid storage
• Protection and security
• Enterprise solutions
Industry-leading cloud services
• Cloud storage
• Compute operations
• Cloud controls
• Cloud services and analytics
Industry-leading solutions with an open ecosystem of partners
NetApp’s BAERO Team
Build, Automation, and EngineeRing Operations
• Responsible for NetApp’s Build, Test, and Automation Infrastructure
• Driven by several critical services that support NetApp ONTAP Engineering’s development and quality
• Data ONTAP Build Farm
• Common Test Lab Environment
• Continuous Integration Testing Environment
• AI-driven Code Pre-submission Testing
About me
• Reside in Pittsburgh, PA
• Site Reliability Engineering Technical Lead
• Been with NetApp since 2011
• Graduate of both University of Pittsburgh and Carnegie Mellon University
• Previously High Performance Computing Operations Engineer at the Pittsburgh Supercomputing Center
What is SRE?
• Site Reliability Engineering
• SRE is a role whose primary focus is production engineering.
• SREs blend the skill sets of software engineering and operations to deliver and maintain highly available services.
• SRE is an implementation of the DevOps philosophy.
• SRE is a role; DevOps is a way of doing things.
• SRE is narrowly focused on service uptime.
• “Automate yourself out of a job”
What is SRE?
The Balloon Game!
Life Before InfluxDB
Custom Metrics
• Small number of metrics stored in relational databases.
System Resource Monitoring
• Reliance on traditional Linux observability tools
• top/htop/atop
• mpstat
• iostat
• sar
Alerting
• Nagios
Why InfluxDB?
Scalability!
• Evaluated Prometheus as a possible solution
• It was common practice to write to Prometheus and then store the data in InfluxDB.
• Cut out the middleman
Architecture
[Architecture diagram. Labels: InfluxDB Enterprise (data nodes, meta nodes); Grafana (queries, response); Kapacitor (email, Slack, PagerDuty); Telegraf (Linux servers, database servers, misc. infra); Jenkins; Scripts/Automation; Flask REST API service; Grafana API; shared storage; InfluxDB OSS]
Example NetApp Services
NetApp CTL Environment
NetApp’s Common Test Lab (CTL) consists of thousands of nodes (physical and virtual) and thousands of client VMs to support the testing requirements of NetApp developers and QA engineers.
Common Workflows
• Execution
• Submit a single test for a defined compute configuration and obtain the results via email immediately following the test.
• Reservation
• Obtain compute and clients for a predefined period for unlimited testing in the reservation window.
~32K compute nodes
~35K client VMs
~100,000 hours of ONTAP testing per month
~90 unique hardware configs
NetApp CIT Environment
NetApp’s Continuous Integration Test (CIT) Environment is crucial for protecting our code line stability.
It provides bisect and revert functionality to ensure problematic code submissions stay out of the code line.
Bisect
• Run integration tests across a narrowing range of changes to isolate the problematic change.
Revert
• Undo changes to code that caused integration test failures.
~440,000 hours of ONTAP testing per month
103 reverts per month
3.9K test and ONTAP submissions per month
~11K CIT tests running on 95% virtualized HW
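The bisect step can be illustrated with a small binary search. This is only a sketch, not NetApp's CIT implementation: `bisect_first_bad` and the `passes` callback are hypothetical names, and it assumes the tests pass with no changes applied and keep failing once the bad change is included.

```python
from typing import Callable, Sequence


def bisect_first_bad(
    changes: Sequence[str],
    passes: Callable[[Sequence[str]], bool],
) -> str:
    """Return the first change whose inclusion makes the integration tests fail.

    Assumes passes(changes[:0]) is True and passes(changes) is False.
    """
    if not changes:
        raise ValueError("need at least one change to bisect")
    # Invariant: the prefix of length lo passes, the prefix of length hi fails.
    lo, hi = 0, len(changes)
    while hi - lo > 1:
        mid = (lo + hi) // 2
        if passes(changes[:mid]):
            lo = mid  # the bad change is later in the range
        else:
            hi = mid  # the bad change is at or before mid
    return changes[hi - 1]
```

Each round halves the suspect range, so isolating one bad change among N submissions needs roughly log2(N) test runs instead of N. The change returned is the candidate for a revert.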
Key Metrics for SRE
SLI and SLO
Service Level Indicator (SLI)
• Gives us a way to quantify the level of service being provided.
• Moving targets
• Can adjust as service improves
• Displayed as a percentage
• (Good events / Expected events) * 100
Service Level Objective (SLO)
• Sets the target level of availability for a given service.
• Should not have a goal SLO of 100%.
• Meeting and/or exceeding SLO should result in happy consumers of your service.
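A minimal sketch of the SLI formula and its comparison against an SLO. The function names and the example numbers are illustrative assumptions, not values from the talk.

```python
def sli_percent(good_events: int, expected_events: int) -> float:
    """SLI as a percentage: (good events / expected events) * 100."""
    if expected_events <= 0:
        raise ValueError("expected_events must be positive")
    return 100 * good_events / expected_events


def meets_slo(sli: float, slo: float) -> bool:
    """True when the measured SLI is at or above the target SLO."""
    return sli >= slo


# Hypothetical example: 9,985 good requests out of 10,000 expected.
sli = sli_percent(9_985, 10_000)
print(f"SLI = {sli:.2f}%, meets a 99.5% SLO: {meets_slo(sli, 99.5)}")
```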
Error Budgets
• Error budget = 100% - SLO
• Provides a clear, objective metric that states how “unreliable” a service is allowed to be.
• Outages from both infrastructure and unstable features will be taken out of the error budget.
• Allows development teams and SRE to balance innovation and reliability.
• Larger error budgets allow development to take more risks.
• Smaller error budgets mean development should be more conservative.
• Should be constantly measured and displayed.
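To make the error budget concrete, here is a sketch that converts an SLO into allowed downtime. It assumes a time-based budget over a 30-day window; the window length and function names are assumptions for illustration.

```python
def error_budget_percent(slo_percent: float) -> float:
    """Error budget = 100% - SLO."""
    return 100.0 - slo_percent


def allowed_downtime_minutes(slo_percent: float, window_days: int = 30) -> float:
    """Downtime the budget permits over the window (assumes a time-based budget)."""
    window_minutes = window_days * 24 * 60
    return window_minutes * error_budget_percent(slo_percent) / 100


def budget_remaining_minutes(
    slo_percent: float, downtime_minutes: float, window_days: int = 30
) -> float:
    """Budget left after the outages so far; negative means the budget is spent."""
    return allowed_downtime_minutes(slo_percent, window_days) - downtime_minutes
```

Under these assumptions a 99.9% SLO leaves about 43.2 minutes of budget per 30 days, and a 25-minute outage would leave roughly 18 minutes for the rest of the window.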
MTTA, MTTR, and MTBF
MTTA
Mean Time to Alert
• The average amount of time it takes to know that there is a problem.
MTTR
Mean Time to Recovery
• The average amount of time it takes to recover the service.
Mean Time to Respond
• The average amount of time for SRE to engage.
MTBF
Mean Time Between Failures
• The average amount of time between disruptions for a given service.
Outage Timeline
Bad Thing Happens → Alert is Triggered → SRE Engages → Service Recovered
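The four points on the outage timeline are enough to compute each metric. The sketch below is illustrative only: the `Outage` class is hypothetical, time to respond is assumed to run from alert to engagement, and MTBF is assumed to be the gap between one recovery and the next failure.

```python
from dataclasses import dataclass
from statistics import mean
from typing import Sequence


@dataclass(frozen=True)
class Outage:
    start: int      # bad thing happens (epoch seconds)
    alerted: int    # alert is triggered
    engaged: int    # SRE engages
    recovered: int  # service recovered


def mtta(outages: Sequence[Outage]) -> float:
    """Mean time to alert: failure start until the alert fires."""
    return mean(o.alerted - o.start for o in outages)


def mean_time_to_respond(outages: Sequence[Outage]) -> float:
    """Assumed here to run from the alert until an SRE engages."""
    return mean(o.engaged - o.alerted for o in outages)


def mean_time_to_recovery(outages: Sequence[Outage]) -> float:
    """Failure start until the service is recovered."""
    return mean(o.recovered - o.start for o in outages)


def mtbf(outages: Sequence[Outage]) -> float:
    """Average gap between a recovery and the next failure (needs 2+ outages)."""
    ordered = sorted(outages, key=lambda o: o.start)
    return mean(b.start - a.recovered for a, b in zip(ordered, ordered[1:]))
```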
Measuring Key Metrics with InfluxDB
Outage Measurement
• Service-specific outage information is stored in an outage measurement
• Record is populated by the SRE REST API Service following incident postmortem analysis.
• Custom scripts read the ctl_outage measurement and generate current SLO, MTTA, MTTR, and MTBF metrics to be displayed by the SRE metrics dashboard in Grafana.
• Example: CTL Service Outage
• Database: ctl
• Measurement: ctl_outage
• Tags:
• site
• service
• planned
• category
• Fields:
• description
• outage_start (epoch seconds)
• outage_end (epoch seconds)
• outage_notify (epoch seconds)
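As a sketch of what one ctl_outage record could look like on the wire, the code below builds an InfluxDB line protocol string by hand. The tag and field keys come from the slide; the tag values, the description, and the helper names are hypothetical, and this is not the SRE REST API Service's actual code.

```python
def escape_tag(value: str) -> str:
    """Escape the characters line protocol treats specially in tag keys and values."""
    return value.replace(",", "\\,").replace("=", "\\=").replace(" ", "\\ ")


def escape_field_string(value: str) -> str:
    """Escape backslashes and double quotes inside a string field value."""
    return value.replace("\\", "\\\\").replace('"', '\\"')


def outage_line(tags: dict, description: str, start: int, notify: int,
                end: int, timestamp_s: int) -> str:
    """Build one line-protocol record for the ctl_outage measurement."""
    tag_part = ",".join(
        f"{escape_tag(k)}={escape_tag(v)}" for k, v in sorted(tags.items())
    )
    field_part = ",".join([
        f'description="{escape_field_string(description)}"',
        f"outage_start={start}i",    # trailing i marks an integer field
        f"outage_end={end}i",
        f"outage_notify={notify}i",
    ])
    # The timestamp is in seconds, so the write must use precision=s.
    return f"ctl_outage,{tag_part} {field_part} {timestamp_s}"


# Hypothetical tag values for illustration only.
print(outage_line(
    {"site": "CTL", "service": "reservation", "planned": "false", "category": "network"},
    "switch failure", 1657523791, 1657523851, 1657527391, 1657523791,
))
```

The low-cardinality attributes (site, service, planned, category) are tags so they can be filtered and grouped on, while the free-text description and the timestamps are fields.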
Example SLI: reservation_add() API latency
api.log
2022/07/11 03:16:31 api::Controller::Default::jsonrpc INFO: {"remote_ip_source":"X-Forwarded-For","api_user":"<sanitized>","remote_ip":"<sanitized>","api_caller":"gui","params":{"priority":7,"testbed_attributes":"","testbed_type":"1Vsim","project":"Wafl","bad_model_combo_check":1,"init":"config/1Vsim.nsha.sdot","secondary_owner":"","build_variant":"x86_64.sim","project_type":"other","pool":"common","project_reason":"WRR CP Trigger test","owner":"<sanitized>","build_promoted":0,"build_root":"<sanitized>","duration":14,"build_resolved":"later","lab":"common","clients":"","label":""},"site":"CTL","latency_ms":2824,"request_id":"<sanitized>","method":"reservation_add"} [YzahrwVy7-f2] (ea1af4cc)
telegraf.conf
[[inputs.logparser]]
  # Log file(s) to follow
  files = ["<path_to_api_log>"]
  watch_method = "poll"
  from_beginning = false

  [inputs.logparser.tags]
    service = "api"

  [inputs.logparser.grok]
    timezone = "America/New_York"
    measurement = "api_latency_raw"
    # The first pattern matches jsonrpc log lines; the second drops everything else
    patterns = [
      '%{TS_UNIX:timestamp:ts-"2006/01/02 15:04:05"} api::Controller::Default::jsonrpc %{LEVEL:level:tag}: %{GREEDYDATA:api_json} %{REQUESTID:instance:drop} %{SERVICEID:instance:drop}',
      '%{GREEDYDATA:garbage:drop}'
    ]
    custom_patterns = '''
      TS_UNIX %{YEAR}/%{MONTHNUM}/%{MONTHDAY} %{HOUR}:%{MINUTE}:%{SECOND}
      LEVEL (\bINFO\b)
      REQUESTID (\[.*\])
      SERVICEID (\(.*\))
    '''

# Parse the JSON payload captured in api_json into tags and fields
[[processors.parser]]
  merge = "override"
  parse_fields = ["api_json"]
  data_format = "json"
  tag_keys = ["api_user", "method", "site", "api_caller"]
  json_string_fields = ["params", "latency_ms"]
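To show what the grok pattern and the JSON parser extract from a log line, here is a rough Python equivalent. It is a sketch for illustration, not how Telegraf works internally; `parse_line` and `PATTERN` are hypothetical names, and the sample line is a shortened version of the api.log entry above.

```python
import json
import re
from datetime import datetime

# Shortened sample of the api.log line; the JSON payload is trimmed.
LOG_LINE = (
    '2022/07/11 03:16:31 api::Controller::Default::jsonrpc INFO: '
    '{"api_caller":"gui","site":"CTL","latency_ms":2824,"method":"reservation_add"} '
    '[YzahrwVy7-f2] (ea1af4cc)'
)

# Regex counterpart of the grok pattern: timestamp, level, JSON payload,
# then the bracketed request ID and parenthesized service ID (both dropped).
PATTERN = re.compile(
    r'^(?P<timestamp>\d{4}/\d{2}/\d{2} \d{2}:\d{2}:\d{2}) '
    r'api::Controller::Default::jsonrpc (?P<level>\bINFO\b): '
    r'(?P<api_json>.*) \[.*\] \(.*\)$'
)


def parse_line(line: str):
    """Return time, tags, and fields for a jsonrpc log line, or None if no match."""
    match = PATTERN.match(line)
    if match is None:
        return None  # plays the role of the catch-all pattern that drops the line
    payload = json.loads(match.group("api_json"))
    return {
        "time": datetime.strptime(match.group("timestamp"), "%Y/%m/%d %H:%M:%S"),
        "tags": {
            "level": match.group("level"),
            "method": payload["method"],
            "site": payload["site"],
            "api_caller": payload["api_caller"],
        },
        "fields": {"latency_ms": payload["latency_ms"]},
    }


print(parse_line(LOG_LINE))
```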
Example SLI: reservation_add() API latency cont.
Calculate and Publish SLI Value in Realtime
Display on Grafana Dashboard
Alert to Slack if Necessary
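The TICKscript from the talk is not reproduced in this transcript. As a stand-in, the sketch below shows one way a latency SLI could be computed over a sliding window in Python. The 3000 ms threshold, the 5-minute window, and the `LatencySli` class are assumptions for illustration, not the values or logic used at NetApp.

```python
from collections import deque
from typing import Optional


class LatencySli:
    """Share of requests within a latency threshold over a sliding time window."""

    def __init__(self, threshold_ms: float, window_s: int) -> None:
        self.threshold_ms = threshold_ms
        self.window_s = window_s
        self._samples = deque()  # (timestamp_s, latency_ms), oldest first

    def add(self, timestamp_s: int, latency_ms: float) -> None:
        """Record a sample and evict samples older than the window.

        Assumes samples arrive in non-decreasing timestamp order.
        """
        self._samples.append((timestamp_s, latency_ms))
        cutoff = timestamp_s - self.window_s
        while self._samples and self._samples[0][0] <= cutoff:
            self._samples.popleft()

    def value(self) -> Optional[float]:
        """Current SLI as a percentage, or None when the window is empty."""
        if not self._samples:
            return None
        good = sum(1 for _, ms in self._samples if ms <= self.threshold_ms)
        return 100 * good / len(self._samples)
```

A dashboard would plot `value()` over time, and an alert would fire when it drops below the SLO.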
Coordinated Incident Response
Follow The Sun
• SRE is best implemented with a “follow the sun” model.
• Geographic distribution allows for more complete coverage of services.
• Reduces off-hour pages
Slack Alerting with Kapacitor
• Custom alerts created with TICKscripts are routed to service-specific alerting channels in Slack.
• Allows for immediate response and global collaboration in a single location.
• SREs from the US and India are able to triage issues in real time, which minimizes service downtime.
Slack Alerting with Kapacitor cont.
Ex: Using deadman alert to detect potential issues
Issue is triaged in channel thread
Back to normal!
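The idea behind a deadman alert is that silence is itself a signal: if too few points arrive within an interval, something upstream has probably stopped. The sketch below expresses that check in Python under stated assumptions. It is not Kapacitor's implementation, and `deadman_triggered` is a hypothetical name.

```python
from typing import Iterable


def deadman_triggered(
    point_times_s: Iterable[int],
    now_s: int,
    interval_s: int,
    threshold: int = 0,
) -> bool:
    """True when the points seen in the last interval number at or below threshold."""
    recent = sum(1 for t in point_times_s if now_s - interval_s < t <= now_s)
    return recent <= threshold


# No points in the last 60 seconds: the alert fires.
print(deadman_triggered([10, 20, 30], now_s=100, interval_s=60))
# Two points in the last 60 seconds: back to normal.
print(deadman_triggered([10, 50, 90], now_s=100, interval_s=60))
```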
Measuring System Resources
Telegraf For The Win!
• System stats collected by default with Telegraf out of the box!
• Once a Grafana template is established, new hosts can be added in seconds!
1. Install the Telegraf RPM
2. Copy the Telegraf config to /etc/telegraf/telegraf.conf
3. Start the telegraf service
Alerting for Resource Utilization
Know your file systems are filling up before the box tips over!
Know when CPU is spiking on critical infrastructure!
You can alert on whatever system stats you wish!
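A resource alert boils down to comparing a collected stat against warning and critical thresholds. The sketch below assumes a utilization percentage such as the `used_percent` field reported by Telegraf's disk input; the 80% and 90% thresholds, the host names, and the function name are illustrative choices.

```python
def alert_level(used_percent: float, warn: float = 80.0, crit: float = 90.0) -> str:
    """Classify a utilization percentage against warning and critical thresholds."""
    if used_percent >= crit:
        return "CRITICAL"
    if used_percent >= warn:
        return "WARNING"
    return "OK"


# Hypothetical readings keyed by host name.
readings = {"build-01": 62.5, "build-02": 84.0, "db-01": 96.3}
for host, used in readings.items():
    print(host, alert_level(used))
```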
Tips
Watcher of Watchers
• Write Telegraf data from enterprise data nodes to an open-source instance of InfluxDB
• Single-node InfluxDB OSS (it’s FREE)
• Critical when attempting to triage performance issues with your enterprise cluster.
• Enables you to use Grafana to visualize system resource utilization of enterprise data nodes.
Runaway Series Cardinality
• Be aware of what makes a good tag and what doesn’t!
• Data with a large degree of uniqueness does not make a good tag
• Example: unique job IDs
• Fields don’t have to be numerical!
• Unexpected spikes in memory consumption may be the result of runaway series cardinality
• Use Chronograf to quickly explore DB/RP combos for tags with a high number of unique values
• Tags are indexed in memory, so problems may manifest as memory spikes
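The effect of a high-uniqueness tag can be seen by counting series directly. This sketch counts unique measurement and tag-set combinations and ignores field keys for simplicity; the `test_run` measurement, the site values, and the job counts are made up for illustration.

```python
from typing import Dict, Iterable, Tuple


def series_cardinality(points: Iterable[Tuple[str, Dict[str, str]]]) -> int:
    """Count unique (measurement, tag set) combinations among the points."""
    return len({(m, tuple(sorted(tags.items()))) for m, tags in points})


sites = ["CTL", "CIT"]

# job_id stored as a tag: every job creates a new series.
job_as_tag = [
    ("test_run", {"site": sites[i % 2], "job_id": str(i)}) for i in range(1000)
]

# job_id stored as a field instead: only the site tag defines the series.
job_as_field = [("test_run", {"site": sites[i % 2]}) for i in range(1000)]

print(series_cardinality(job_as_tag), series_cardinality(job_as_field))
```

With the job ID as a tag, series count grows with every job that runs; as a field it stays fixed at the number of sites.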
Next Steps
Leverage Flux!
• TICKscripts have been very effective for downsampling and custom alerting
• Should look to see how we can leverage Flux to enhance our workflows
Hardware Modernization
• Currently using a 4x8 enterprise license on VMs
• Backend data and WAL volumes are AFF mounted over iSCSI
• As workloads increase, we may want to look at bare metal with NVMe drives
• Increased CPU
• More CPUs = more concurrent compactions!
Summary
Questions?

Steinkamp, Clifford [InfluxData] | Closing Thoughts | InfluxDays 2022Steinkamp, Clifford [InfluxData] | Closing Thoughts | InfluxDays 2022
Steinkamp, Clifford [InfluxData] | Closing Thoughts | InfluxDays 2022InfluxData
 
Steinkamp, Clifford [InfluxData] | Welcome to InfluxDays 2022 - Day 2 | Influ...
Steinkamp, Clifford [InfluxData] | Welcome to InfluxDays 2022 - Day 2 | Influ...Steinkamp, Clifford [InfluxData] | Welcome to InfluxDays 2022 - Day 2 | Influ...
Steinkamp, Clifford [InfluxData] | Welcome to InfluxDays 2022 - Day 2 | Influ...InfluxData
 
Steinkamp, Clifford [InfluxData] | Closing Thoughts Day 1 | InfluxDays 2022
Steinkamp, Clifford [InfluxData] | Closing Thoughts Day 1 | InfluxDays 2022Steinkamp, Clifford [InfluxData] | Closing Thoughts Day 1 | InfluxDays 2022
Steinkamp, Clifford [InfluxData] | Closing Thoughts Day 1 | InfluxDays 2022InfluxData
 

More from InfluxData (20)

Announcing InfluxDB Clustered
Announcing InfluxDB ClusteredAnnouncing InfluxDB Clustered
Announcing InfluxDB Clustered
 
Best Practices for Leveraging the Apache Arrow Ecosystem
Best Practices for Leveraging the Apache Arrow EcosystemBest Practices for Leveraging the Apache Arrow Ecosystem
Best Practices for Leveraging the Apache Arrow Ecosystem
 
How Bevi Uses InfluxDB and Grafana to Improve Predictive Maintenance and Redu...
How Bevi Uses InfluxDB and Grafana to Improve Predictive Maintenance and Redu...How Bevi Uses InfluxDB and Grafana to Improve Predictive Maintenance and Redu...
How Bevi Uses InfluxDB and Grafana to Improve Predictive Maintenance and Redu...
 
Power Your Predictive Analytics with InfluxDB
Power Your Predictive Analytics with InfluxDBPower Your Predictive Analytics with InfluxDB
Power Your Predictive Analytics with InfluxDB
 
How Teréga Replaces Legacy Data Historians with InfluxDB, AWS and IO-Base
How Teréga Replaces Legacy Data Historians with InfluxDB, AWS and IO-Base How Teréga Replaces Legacy Data Historians with InfluxDB, AWS and IO-Base
How Teréga Replaces Legacy Data Historians with InfluxDB, AWS and IO-Base
 
Build an Edge-to-Cloud Solution with the MING Stack
Build an Edge-to-Cloud Solution with the MING StackBuild an Edge-to-Cloud Solution with the MING Stack
Build an Edge-to-Cloud Solution with the MING Stack
 
Meet the Founders: An Open Discussion About Rewriting Using Rust
Meet the Founders: An Open Discussion About Rewriting Using RustMeet the Founders: An Open Discussion About Rewriting Using Rust
Meet the Founders: An Open Discussion About Rewriting Using Rust
 
Introducing InfluxDB Cloud Dedicated
Introducing InfluxDB Cloud DedicatedIntroducing InfluxDB Cloud Dedicated
Introducing InfluxDB Cloud Dedicated
 
Gain Better Observability with OpenTelemetry and InfluxDB
Gain Better Observability with OpenTelemetry and InfluxDB Gain Better Observability with OpenTelemetry and InfluxDB
Gain Better Observability with OpenTelemetry and InfluxDB
 
How a Heat Treating Plant Ensures Tight Process Control and Exceptional Quali...
How a Heat Treating Plant Ensures Tight Process Control and Exceptional Quali...How a Heat Treating Plant Ensures Tight Process Control and Exceptional Quali...
How a Heat Treating Plant Ensures Tight Process Control and Exceptional Quali...
 
How Delft University's Engineering Students Make Their EV Formula-Style Race ...
How Delft University's Engineering Students Make Their EV Formula-Style Race ...How Delft University's Engineering Students Make Their EV Formula-Style Race ...
How Delft University's Engineering Students Make Their EV Formula-Style Race ...
 
Introducing InfluxDB’s New Time Series Database Storage Engine
Introducing InfluxDB’s New Time Series Database Storage EngineIntroducing InfluxDB’s New Time Series Database Storage Engine
Introducing InfluxDB’s New Time Series Database Storage Engine
 
Start Automating InfluxDB Deployments at the Edge with balena
Start Automating InfluxDB Deployments at the Edge with balena Start Automating InfluxDB Deployments at the Edge with balena
Start Automating InfluxDB Deployments at the Edge with balena
 
Understanding InfluxDB’s New Storage Engine
Understanding InfluxDB’s New Storage EngineUnderstanding InfluxDB’s New Storage Engine
Understanding InfluxDB’s New Storage Engine
 
Streamline and Scale Out Data Pipelines with Kubernetes, Telegraf, and InfluxDB
Streamline and Scale Out Data Pipelines with Kubernetes, Telegraf, and InfluxDBStreamline and Scale Out Data Pipelines with Kubernetes, Telegraf, and InfluxDB
Streamline and Scale Out Data Pipelines with Kubernetes, Telegraf, and InfluxDB
 
Ward Bowman [PTC] | ThingWorx Long-Term Data Storage with InfluxDB | InfluxDa...
Ward Bowman [PTC] | ThingWorx Long-Term Data Storage with InfluxDB | InfluxDa...Ward Bowman [PTC] | ThingWorx Long-Term Data Storage with InfluxDB | InfluxDa...
Ward Bowman [PTC] | ThingWorx Long-Term Data Storage with InfluxDB | InfluxDa...
 
Scott Anderson [InfluxData] | New & Upcoming Flux Features | InfluxDays 2022
Scott Anderson [InfluxData] | New & Upcoming Flux Features | InfluxDays 2022Scott Anderson [InfluxData] | New & Upcoming Flux Features | InfluxDays 2022
Scott Anderson [InfluxData] | New & Upcoming Flux Features | InfluxDays 2022
 
Steinkamp, Clifford [InfluxData] | Closing Thoughts | InfluxDays 2022
Steinkamp, Clifford [InfluxData] | Closing Thoughts | InfluxDays 2022Steinkamp, Clifford [InfluxData] | Closing Thoughts | InfluxDays 2022
Steinkamp, Clifford [InfluxData] | Closing Thoughts | InfluxDays 2022
 
Steinkamp, Clifford [InfluxData] | Welcome to InfluxDays 2022 - Day 2 | Influ...
Steinkamp, Clifford [InfluxData] | Welcome to InfluxDays 2022 - Day 2 | Influ...Steinkamp, Clifford [InfluxData] | Welcome to InfluxDays 2022 - Day 2 | Influ...
Steinkamp, Clifford [InfluxData] | Welcome to InfluxDays 2022 - Day 2 | Influ...
 
Steinkamp, Clifford [InfluxData] | Closing Thoughts Day 1 | InfluxDays 2022
Steinkamp, Clifford [InfluxData] | Closing Thoughts Day 1 | InfluxDays 2022Steinkamp, Clifford [InfluxData] | Closing Thoughts Day 1 | InfluxDays 2022
Steinkamp, Clifford [InfluxData] | Closing Thoughts Day 1 | InfluxDays 2022
 

Recently uploaded

"ML in Production",Oleksandr Bagan
"ML in Production",Oleksandr Bagan"ML in Production",Oleksandr Bagan
"ML in Production",Oleksandr BaganFwdays
 
My INSURER PTE LTD - Insurtech Innovation Award 2024
My INSURER PTE LTD - Insurtech Innovation Award 2024My INSURER PTE LTD - Insurtech Innovation Award 2024
My INSURER PTE LTD - Insurtech Innovation Award 2024The Digital Insurer
 
Artificial intelligence in the post-deep learning era
Artificial intelligence in the post-deep learning eraArtificial intelligence in the post-deep learning era
Artificial intelligence in the post-deep learning eraDeakin University
 
New from BookNet Canada for 2024: BNC BiblioShare - Tech Forum 2024
New from BookNet Canada for 2024: BNC BiblioShare - Tech Forum 2024New from BookNet Canada for 2024: BNC BiblioShare - Tech Forum 2024
New from BookNet Canada for 2024: BNC BiblioShare - Tech Forum 2024BookNet Canada
 
Bun (KitWorks Team Study 노별마루 발표 2024.4.22)
Bun (KitWorks Team Study 노별마루 발표 2024.4.22)Bun (KitWorks Team Study 노별마루 발표 2024.4.22)
Bun (KitWorks Team Study 노별마루 발표 2024.4.22)Wonjun Hwang
 
Kotlin Multiplatform & Compose Multiplatform - Starter kit for pragmatics
Kotlin Multiplatform & Compose Multiplatform - Starter kit for pragmaticsKotlin Multiplatform & Compose Multiplatform - Starter kit for pragmatics
Kotlin Multiplatform & Compose Multiplatform - Starter kit for pragmaticscarlostorres15106
 
Tech-Forward - Achieving Business Readiness For Copilot in Microsoft 365
Tech-Forward - Achieving Business Readiness For Copilot in Microsoft 365Tech-Forward - Achieving Business Readiness For Copilot in Microsoft 365
Tech-Forward - Achieving Business Readiness For Copilot in Microsoft 3652toLead Limited
 
WordPress Websites for Engineers: Elevate Your Brand
WordPress Websites for Engineers: Elevate Your BrandWordPress Websites for Engineers: Elevate Your Brand
WordPress Websites for Engineers: Elevate Your Brandgvaughan
 
Designing IA for AI - Information Architecture Conference 2024
Designing IA for AI - Information Architecture Conference 2024Designing IA for AI - Information Architecture Conference 2024
Designing IA for AI - Information Architecture Conference 2024Enterprise Knowledge
 
"Debugging python applications inside k8s environment", Andrii Soldatenko
"Debugging python applications inside k8s environment", Andrii Soldatenko"Debugging python applications inside k8s environment", Andrii Soldatenko
"Debugging python applications inside k8s environment", Andrii SoldatenkoFwdays
 
Unraveling Multimodality with Large Language Models.pdf
Unraveling Multimodality with Large Language Models.pdfUnraveling Multimodality with Large Language Models.pdf
Unraveling Multimodality with Large Language Models.pdfAlex Barbosa Coqueiro
 
CloudStudio User manual (basic edition):
CloudStudio User manual (basic edition):CloudStudio User manual (basic edition):
CloudStudio User manual (basic edition):comworks
 
SIP trunking in Janus @ Kamailio World 2024
SIP trunking in Janus @ Kamailio World 2024SIP trunking in Janus @ Kamailio World 2024
SIP trunking in Janus @ Kamailio World 2024Lorenzo Miniero
 
My Hashitalk Indonesia April 2024 Presentation
My Hashitalk Indonesia April 2024 PresentationMy Hashitalk Indonesia April 2024 Presentation
My Hashitalk Indonesia April 2024 PresentationRidwan Fadjar
 
APIForce Zurich 5 April Automation LPDG
APIForce Zurich 5 April  Automation LPDGAPIForce Zurich 5 April  Automation LPDG
APIForce Zurich 5 April Automation LPDGMarianaLemus7
 
Pigging Solutions Piggable Sweeping Elbows
Pigging Solutions Piggable Sweeping ElbowsPigging Solutions Piggable Sweeping Elbows
Pigging Solutions Piggable Sweeping ElbowsPigging Solutions
 
Transcript: New from BookNet Canada for 2024: BNC BiblioShare - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: BNC BiblioShare - Tech Forum 2024Transcript: New from BookNet Canada for 2024: BNC BiblioShare - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: BNC BiblioShare - Tech Forum 2024BookNet Canada
 
Bluetooth Controlled Car with Arduino.pdf
Bluetooth Controlled Car with Arduino.pdfBluetooth Controlled Car with Arduino.pdf
Bluetooth Controlled Car with Arduino.pdfngoud9212
 
Beyond Boundaries: Leveraging No-Code Solutions for Industry Innovation
Beyond Boundaries: Leveraging No-Code Solutions for Industry InnovationBeyond Boundaries: Leveraging No-Code Solutions for Industry Innovation
Beyond Boundaries: Leveraging No-Code Solutions for Industry InnovationSafe Software
 

Recently uploaded (20)

"ML in Production",Oleksandr Bagan
"ML in Production",Oleksandr Bagan"ML in Production",Oleksandr Bagan
"ML in Production",Oleksandr Bagan
 
My INSURER PTE LTD - Insurtech Innovation Award 2024
My INSURER PTE LTD - Insurtech Innovation Award 2024My INSURER PTE LTD - Insurtech Innovation Award 2024
My INSURER PTE LTD - Insurtech Innovation Award 2024
 
Artificial intelligence in the post-deep learning era
Artificial intelligence in the post-deep learning eraArtificial intelligence in the post-deep learning era
Artificial intelligence in the post-deep learning era
 
New from BookNet Canada for 2024: BNC BiblioShare - Tech Forum 2024
New from BookNet Canada for 2024: BNC BiblioShare - Tech Forum 2024New from BookNet Canada for 2024: BNC BiblioShare - Tech Forum 2024
New from BookNet Canada for 2024: BNC BiblioShare - Tech Forum 2024
 
Bun (KitWorks Team Study 노별마루 발표 2024.4.22)
Bun (KitWorks Team Study 노별마루 발표 2024.4.22)Bun (KitWorks Team Study 노별마루 발표 2024.4.22)
Bun (KitWorks Team Study 노별마루 발표 2024.4.22)
 
Kotlin Multiplatform & Compose Multiplatform - Starter kit for pragmatics
Kotlin Multiplatform & Compose Multiplatform - Starter kit for pragmaticsKotlin Multiplatform & Compose Multiplatform - Starter kit for pragmatics
Kotlin Multiplatform & Compose Multiplatform - Starter kit for pragmatics
 
Tech-Forward - Achieving Business Readiness For Copilot in Microsoft 365
Tech-Forward - Achieving Business Readiness For Copilot in Microsoft 365Tech-Forward - Achieving Business Readiness For Copilot in Microsoft 365
Tech-Forward - Achieving Business Readiness For Copilot in Microsoft 365
 
WordPress Websites for Engineers: Elevate Your Brand
WordPress Websites for Engineers: Elevate Your BrandWordPress Websites for Engineers: Elevate Your Brand
WordPress Websites for Engineers: Elevate Your Brand
 
Designing IA for AI - Information Architecture Conference 2024
Designing IA for AI - Information Architecture Conference 2024Designing IA for AI - Information Architecture Conference 2024
Designing IA for AI - Information Architecture Conference 2024
 
"Debugging python applications inside k8s environment", Andrii Soldatenko
"Debugging python applications inside k8s environment", Andrii Soldatenko"Debugging python applications inside k8s environment", Andrii Soldatenko
"Debugging python applications inside k8s environment", Andrii Soldatenko
 
Unraveling Multimodality with Large Language Models.pdf
Unraveling Multimodality with Large Language Models.pdfUnraveling Multimodality with Large Language Models.pdf
Unraveling Multimodality with Large Language Models.pdf
 
CloudStudio User manual (basic edition):
CloudStudio User manual (basic edition):CloudStudio User manual (basic edition):
CloudStudio User manual (basic edition):
 
SIP trunking in Janus @ Kamailio World 2024
SIP trunking in Janus @ Kamailio World 2024SIP trunking in Janus @ Kamailio World 2024
SIP trunking in Janus @ Kamailio World 2024
 
My Hashitalk Indonesia April 2024 Presentation
My Hashitalk Indonesia April 2024 PresentationMy Hashitalk Indonesia April 2024 Presentation
My Hashitalk Indonesia April 2024 Presentation
 
APIForce Zurich 5 April Automation LPDG
APIForce Zurich 5 April  Automation LPDGAPIForce Zurich 5 April  Automation LPDG
APIForce Zurich 5 April Automation LPDG
 
Pigging Solutions Piggable Sweeping Elbows
Pigging Solutions Piggable Sweeping ElbowsPigging Solutions Piggable Sweeping Elbows
Pigging Solutions Piggable Sweeping Elbows
 
Vulnerability_Management_GRC_by Sohang Sengupta.pptx
Vulnerability_Management_GRC_by Sohang Sengupta.pptxVulnerability_Management_GRC_by Sohang Sengupta.pptx
Vulnerability_Management_GRC_by Sohang Sengupta.pptx
 
Transcript: New from BookNet Canada for 2024: BNC BiblioShare - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: BNC BiblioShare - Tech Forum 2024Transcript: New from BookNet Canada for 2024: BNC BiblioShare - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: BNC BiblioShare - Tech Forum 2024
 
Bluetooth Controlled Car with Arduino.pdf
Bluetooth Controlled Car with Arduino.pdfBluetooth Controlled Car with Arduino.pdf
Bluetooth Controlled Car with Arduino.pdf
 
Beyond Boundaries: Leveraging No-Code Solutions for Industry Innovation
Beyond Boundaries: Leveraging No-Code Solutions for Industry InnovationBeyond Boundaries: Leveraging No-Code Solutions for Industry Innovation
Beyond Boundaries: Leveraging No-Code Solutions for Industry Innovation
 

Reduce SRE Stress: Minimizing Service Downtime with Grafana, InfluxDB and Telegraf

  • 1. Reduce SRE Stress: Minimizing Service Downtime with Grafana, InfluxDB and Telegraf. Dustin Sorge, Lead Site Reliability Engineer. © 2022 NetApp, Inc. All rights reserved.
  • 2. Agenda
      • About NetApp
      • BAERO Team
      • About me
      • What is SRE?
      • Life Before InfluxDB
      • Example NetApp Services
      • Key Metrics for SRE
      • SLOs, SLIs, Error Budgets, MTTA, MTTR, MTBF
      • Visualization via custom dashboards
      • Measuring Key Metrics with InfluxDB
      • Service outage measurements
      • Strategic tags and fields
      • Real-time SLI calculation example with TICKscript
      • Coordinated Incident Response
      • Follow the Sun
      • Slack alerting via Kapacitor
      • Measuring System Resources
      • Tips
      • Next Steps
      • Summary
      • Questions?
  • 3. NetApp at a glance
      • Global cloud-led, data-centric software company
      • Industry-leading software, systems, and cloud services
      • Founded in 1992, headquartered in San Jose, California
      • $5.74B FY21 revenue
      • 11,000+ employees and 4,000+ partners helping organizations thrive in a hybrid multi-cloud world
      • Fortune 500 company (NASDAQ: NTAP)
      • Industry-leading software and systems: object storage, converged and hybrid cloud infrastructure, flash and hybrid storage, protection and security, enterprise solutions
      • Industry-leading cloud services: cloud storage, compute operations, cloud controls, cloud services and analytics
      • Industry-leading solutions with an open ecosystem of partners
  • 4. NetApp’s BAERO Team: Build, Automation, and EngineeRing Operations
      • Responsible for NetApp’s Build, Test, and Automation Infrastructure
      • Driven by several critical services that support NetApp ONTAP Engineering’s development and quality:
          • Data ONTAP Build Farm
          • Common Test Lab Environment
          • Continuous Integration Testing Environment
          • AI-driven Code Pre-submission Testing
  • 5. About me
      • Reside in Pittsburgh, PA
      • Site Reliability Engineering Technical Lead
      • Been with NetApp since 2011
      • Graduate of both University of Pittsburgh and Carnegie Mellon University
      • Previously High Performance Computing Operations Engineer at the Pittsburgh Supercomputing Center
  • 6. What is SRE?
      • Site Reliability Engineering
      • SRE is a role whose primary focus is on production engineering.
      • SREs blend the skillsets of software engineering and operations to deliver and maintain highly available services.
      • SRE is an implementation of the DevOps philosophy.
          • SRE is a role; DevOps is a way of doing things
          • SRE is narrowly focused on service uptime
      • “Automate yourself out of a job”
  • 7. What is SRE? The Balloon Game!
  • 8. Life Before InfluxDB
      • Custom Metrics: a small number of metrics DBs, stored in relational databases.
      • System Resource Monitoring: reliance on traditional Linux observability tools (top/htop/atop, mpstat, iostat, sar)
      • Alerting: Nagios
  • 9. Why InfluxDB? Scalability!
      • Evaluated Prometheus as a possible solution
      • It was common practice to write to Prometheus and then store the data in InfluxDB.
      • Cut out the middleman
  • 10. Architecture (diagram). Components shown: Telegraf on Linux servers, database servers, and misc. infrastructure; Jenkins; scripts/automation; Flask REST API service; InfluxDB Enterprise (data nodes, meta nodes, shared storage); Grafana (queries and responses, Grafana API); Kapacitor (email, Slack, PagerDuty); InfluxDB OSS.
  • 11. Example NetApp Services
  • 12. NetApp CTL Environment
      NetApp’s Common Test Lab (CTL) consists of thousands of nodes (physical and virtual) and thousands of client VMs to support the testing requirements of NetApp developers and QA engineers.
      Common Workflows
      • Execution: submit a single test for a defined compute configuration and obtain the results via email immediately following the test.
      • Reservation: obtain compute and clients for a predefined period for unlimited testing in the reservation window.
      • ~32K compute nodes
      • ~35K client VMs
      • ~100,000 hours of ONTAP testing per month
      • ~90 unique hardware configs
  • 13. NetApp CIT Environment
      NetApp’s Continuous Integration Test (CIT) Environment is crucial for protecting our code line stability. It provides bisect and revert functionality to ensure problematic code submissions stay out of the code line.
      • Bisect: run integration tests across a narrowing range of changes to isolate the problematic change.
      • Revert: undo changes to code that caused integration test failures.
      • ~440,000 hours of ONTAP testing per month
      • 103 reverts per month
      • 3.9K test and ONTAP submissions per month
      • ~11K CIT tests running on 95% virtualized HW
  • 14. Key Metrics for SRE
  • 15. SLI and SLO
      Service Level Indicator (SLI)
      • Gives us a way to quantify the level of service being provided.
      • Moving targets
      • Can adjust as service improves
      • Displayed as a percentage
      • (Good events / Expected events) * 100
      Service Level Objective (SLO)
      • Sets the target level of availability for a given service.
      • Should not have a goal SLO of 100%.
      • Meeting and/or exceeding SLO should result in happy consumers of your service.
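The SLI formula on this slide can be sketched in a few lines of Python. This is only an illustration: the function names and the sample numbers (9,990 good events out of 10,000, a 99.5% SLO) are assumptions and do not come from the talk.

```python
def sli_percent(good_events: int, expected_events: int) -> float:
    """Return the SLI as a percentage: (good events / expected events) * 100."""
    if expected_events <= 0:
        raise ValueError("expected_events must be positive")
    return good_events / expected_events * 100.0


def meets_slo(sli: float, slo: float) -> bool:
    """True when the measured SLI is at or above the SLO target."""
    return sli >= slo


# Hypothetical numbers: 9,990 of 10,000 requests counted as "good".
sli = sli_percent(9_990, 10_000)
print(f"SLI = {sli:.2f}%, meets a 99.5% SLO: {meets_slo(sli, 99.5)}")
```

What counts as a "good" event is service specific; for a latency SLI it could be a request that completes under an agreed threshold.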
  • 16. Error Budgets
      • Error budget = 100% - SLO
      • Provides a clear, objective metric that states how “unreliable” a service is allowed to be.
      • Outages from both infrastructure and unstable features will be taken out of the error budget.
      • Allows development teams and SRE to balance innovation and reliability.
          • Larger error budgets allow development to take more risks.
          • Smaller error budgets mean development should be more conservative.
      • Should be constantly measured and displayed.
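To make "Error budget = 100% - SLO" concrete, the sketch below converts a budget into minutes of allowed downtime. The 30-day window and the example SLO of 99.9% are assumptions chosen for illustration, not values from the talk.

```python
def error_budget_percent(slo_percent: float) -> float:
    """Error budget as a percentage: 100% minus the SLO."""
    return 100.0 - slo_percent


def error_budget_minutes(slo_percent: float, window_days: int = 30) -> float:
    """Allowed downtime, in minutes, over the measurement window."""
    window_minutes = window_days * 24 * 60
    return window_minutes * error_budget_percent(slo_percent) / 100.0


def budget_remaining_minutes(slo_percent: float, downtime_minutes: float,
                             window_days: int = 30) -> float:
    """Budget left after subtracting downtime already spent in the window."""
    return error_budget_minutes(slo_percent, window_days) - downtime_minutes


# Hypothetical: a 99.9% SLO with 30 minutes of outages so far this window.
print(f"budget: {error_budget_minutes(99.9):.1f} min")
print(f"remaining: {budget_remaining_minutes(99.9, 30.0):.1f} min")
```

A negative remaining value means the budget is exhausted, which is the signal for development to become more conservative.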
  • 17. MTTA, MTTR, and MTBF
      • MTTA (Mean Time to Alert): the average amount of time it takes to know that there is a problem.
      • MTTR (Mean Time to Recovery): the average amount of time it takes to recover the service.
      • MTTR (Mean Time to Respond): the average amount of time for SRE to engage.
      • MTBF (Mean Time Between Failure): the average amount of time between disruptions for a given service.
      • Outage timeline: Bad Thing Happens → Alert is Triggered → SRE Engages → Service Recovered
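A minimal sketch of how these averages could be derived from outage records. It assumes each record carries `outage_start`, `outage_notify`, and `outage_end` in epoch seconds (the field names used on the outage-measurement slide), that `outage_notify` marks the moment the alert fired, and that MTBF is measured from the end of one outage to the start of the next. The sample records are invented.

```python
from statistics import mean

# Hypothetical outage records, times in epoch seconds.
outages = [
    {"outage_start": 1_000, "outage_notify": 1_120, "outage_end": 2_800},
    {"outage_start": 90_000, "outage_notify": 90_060, "outage_end": 90_600},
    {"outage_start": 200_000, "outage_notify": 200_300, "outage_end": 203_600},
]


def mtta(records):
    """Mean time to alert: average of (notify - start)."""
    return mean(r["outage_notify"] - r["outage_start"] for r in records)


def mttr(records):
    """Mean time to recovery: average of (end - start)."""
    return mean(r["outage_end"] - r["outage_start"] for r in records)


def mtbf(records):
    """Mean time between failures: average gap from one outage's end
    to the next outage's start (an assumed definition)."""
    ordered = sorted(records, key=lambda r: r["outage_start"])
    if len(ordered) < 2:
        raise ValueError("need at least two outages to compute MTBF")
    gaps = [nxt["outage_start"] - prev["outage_end"]
            for prev, nxt in zip(ordered, ordered[1:])]
    return mean(gaps)


print(f"MTTA={mtta(outages)}s MTTR={mttr(outages)}s MTBF={mtbf(outages)}s")
```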
  • 18. Measuring Key Metrics with InfluxDB
  • 19. Outage Measurement
      • Service-specific outage information is stored in an outage measurement.
      • The record is populated by the SRE REST API service following incident postmortem analysis.
      • Custom scripts read the ctl_outage measurement and generate current SLO, MTTA, MTTR, and MTBF metrics to be displayed by the SRE metrics dashboard in Grafana.
      • Example: CTL Service Outage
          • Database: ctl
          • Measurement: ctl_outage
          • Tags: site, service, planned, category
          • Fields: description, outage_start (epoch seconds), outage_end (epoch seconds), outage_notify (epoch seconds)
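As an illustration of the schema above, the sketch below renders one outage record as an InfluxDB line protocol string. It is not the team's REST API service: the tag values, the "true"/"false" encoding of `planned`, and the choice to timestamp the point at `outage_start` in seconds (which requires writing with second precision) are all assumptions.

```python
def escape_tag(value: str) -> str:
    """Escape commas, equals signs, and spaces in a tag value."""
    return value.replace(",", "\\,").replace("=", "\\=").replace(" ", "\\ ")


def escape_string_field(value: str) -> str:
    """Escape backslashes and double quotes inside a string field value."""
    return value.replace("\\", "\\\\").replace('"', '\\"')


def outage_to_line(site, service, planned, category, description,
                   outage_start, outage_end, outage_notify):
    """Build a line protocol record for the ctl_outage measurement."""
    tags = {
        "site": site,
        "service": service,
        "planned": "true" if planned else "false",  # assumed encoding
        "category": category,
    }
    tag_part = ",".join(f"{k}={escape_tag(v)}" for k, v in sorted(tags.items()))
    field_part = ",".join([
        f'description="{escape_string_field(description)}"',
        f"outage_start={int(outage_start)}i",    # trailing "i" marks an integer field
        f"outage_end={int(outage_end)}i",
        f"outage_notify={int(outage_notify)}i",
    ])
    # Assumption: the point's own timestamp is the outage start, in seconds.
    return f"ctl_outage,{tag_part} {field_part} {int(outage_start)}"


line = outage_to_line(
    site="CTL", service="api", planned=False, category="network",
    description="Core switch failure",
    outage_start=1657523791, outage_end=1657525591, outage_notify=1657523911,
)
print(line)
```

Keeping `description` and the epoch values as fields rather than tags follows the slide's schema; tags are reserved for low-variety values that dashboards filter or group by.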
  • 20. Example SLI: reservation_add() API latency
      api.log:
          2022/07/11 03:16:31 api::Controller::Default::jsonrpc INFO: {"remote_ip_source":"X-Forwarded-For","api_user":"<sanitized>","remote_ip":"<sanitized>","api_caller":"gui","params":{"priority":7,"testbed_attributes":"","testbed_type":"1 Vsim","project":"Wafl","bad_model_combo_check":1,"init":"config/1Vsim.nsha.sdot","secondary_owner":"","build_variant":"x86_64.sim","project_type":"other","pool":"common","project_reason":"WRR CP Trigger test","owner":"<sanitized>","build_promoted":0,"build_root":"<sanitized>","duration":14,"build_resolved":"later","lab":"common","clients":" ","label":""},"site":"CTL","latency_ms":2824,"request_id":"<sanitized>","method":"reservation_add"} [YzahrwVy7-f2] (ea1af4cc)
      telegraf.conf:
          [[inputs.logparser]]
            files = ["<path_to_api_log>"]
            watch_method = "poll"
            from_beginning = false
            [inputs.logparser.tags]
              service = "api"
            [inputs.logparser.grok]
              timezone = "America/New_York"
              measurement = "api_latency_raw"
              patterns = [
                '%{TS_UNIX:timestamp:ts-"2006/01/02 15:04:05"} api::Controller::Default::jsonrpc %{LEVEL:level:tag}: %{GREEDYDATA:api_json} %{REQUESTID:instance:drop} %{SERVICEID:instance:drop}',
                '%{GREEDYDATA:garbage:drop}'
              ]
              custom_patterns = '''
                TS_UNIX %{YEAR}/%{MONTHNUM}/%{MONTHDAY} %{HOUR}:%{MINUTE}:%{SECOND}
                LEVEL (\bINFO\b)
                REQUESTID (\[.*\])
                SERVICEID (\(.*\))
              '''

          [[processors.parser]]
            merge = "override"
            parse_fields = ["api_json"]
            data_format = "json"
            tag_keys = ["api_user", "method", "site", "api_caller"]
            json_string_fields = ["params", "latency_ms"]
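For readers who want to see what the grok pattern and the JSON parser extract, here is a rough Python equivalent applied to a shortened version of the log line. It is a sketch for understanding only, not part of the Telegraf pipeline; the regular expression, the function name, and the trimmed sample line are assumptions.

```python
import json
import re
from datetime import datetime

# Mirrors the grok pattern: timestamp, logger name, level, JSON payload,
# then a bracketed request id and a parenthesized service id.
LINE_RE = re.compile(
    r"^(?P<ts>\d{4}/\d{2}/\d{2} \d{2}:\d{2}:\d{2}) "
    r"api::Controller::Default::jsonrpc (?P<level>INFO): "
    r"(?P<api_json>\{.*\}) \[[^\]]*\] \([^)]*\)$"
)

TAG_KEYS = ("api_user", "method", "site", "api_caller")


def parse_api_log_line(line: str):
    """Return timestamp, tags, and latency for a matching line, else None."""
    match = LINE_RE.match(line)
    if match is None:
        return None  # plays the role of the catch-all "garbage" pattern
    payload = json.loads(match.group("api_json"))
    return {
        # Naive timestamp; Telegraf applies the configured timezone itself.
        "timestamp": datetime.strptime(match.group("ts"), "%Y/%m/%d %H:%M:%S"),
        "tags": {key: str(payload.get(key, "")) for key in TAG_KEYS},
        "latency_ms": int(payload["latency_ms"]),
    }


sample = (
    '2022/07/11 03:16:31 api::Controller::Default::jsonrpc INFO: '
    '{"api_user":"<sanitized>","api_caller":"gui","site":"CTL",'
    '"latency_ms":2824,"method":"reservation_add"} [YzahrwVy7-f2] (ea1af4cc)'
)
parsed = parse_api_log_line(sample)
print(parsed["tags"]["method"], parsed["latency_ms"])
```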
  • 21. Example SLI: reservation_add() API latency cont. Calculate and Publish SLI Value in Realtime
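The TICKscript from this slide is not reproduced in the transcript. As a stand-in, the Python sketch below shows one plausible way to compute a latency SLI per time window: the share of requests that finish within a threshold. The 5,000 ms threshold, the 300-second window, and the sample points are assumptions, not the values used at NetApp.

```python
from collections import defaultdict


def latency_sli_by_window(samples, threshold_ms=5000, window_s=300):
    """Return {window_start_epoch: SLI percent} for (epoch_s, latency_ms) samples.

    A request is "good" when its latency is at or below threshold_ms.
    """
    good = defaultdict(int)
    total = defaultdict(int)
    for epoch_s, latency_ms in samples:
        window_start = epoch_s - epoch_s % window_s  # tumbling window bucket
        total[window_start] += 1
        if latency_ms <= threshold_ms:
            good[window_start] += 1
    return {w: good[w] / total[w] * 100.0 for w in sorted(total)}


# Hypothetical latency points: (epoch seconds, latency in ms).
samples = [(0, 2824), (10, 6000), (299, 100), (300, 7000), (310, 200)]
for window_start, sli in latency_sli_by_window(samples).items():
    print(f"window {window_start}: SLI {sli:.1f}%")
```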
  • 22. Example SLI: reservation_add() API latency cont. Display on Grafana Dashboard. Alert to Slack if Necessary.
  • 24. Follow The Sun
      • SRE is best implemented with a “follow the sun” model.
      • Geographic distribution allows for more complete coverage of services.
      • Reduces off-hour pages
  • 25. Slack Alerting with Kapacitor
      • Custom alerts created with TICKscripts are routed to service-specific alerting channels in Slack.
      • Allows for immediate response and global collaboration in a single location.
      • SREs in the US and India are able to triage issues in real time, which minimizes service downtime.
  • 26. Slack Alerting with Kapacitor cont.
      • Ex: Using deadman alert to detect potential issues
      • Issue is triaged in channel thread
      • Back to normal!
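The idea behind a deadman alert is that silence is itself a signal: if a source stops emitting points, something is probably wrong. The Python sketch below is loosely modeled on that behavior and is not Kapacitor code; the function name, the 60-second interval, and the threshold of 0 are assumptions.

```python
def deadman_triggered(point_times, now, interval_s=60.0, threshold=0):
    """True when the number of points seen in the last interval_s seconds
    is at or below threshold, i.e. the source has gone quiet."""
    recent = sum(1 for t in point_times if now - interval_s < t <= now)
    return recent <= threshold


# Hypothetical arrival times (epoch seconds) of points from one host.
print(deadman_triggered([0, 10, 20], now=100))  # nothing in the last 60 s
print(deadman_triggered([50, 90], now=100))     # two recent points
```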
  • 27. Measuring System Resources
  • 28. Telegraf For The Win!
      • System stats are collected by default with Telegraf out of the box!
      • Once a Grafana template is established, new hosts can be added in seconds!
          1. Install the Telegraf RPM
          2. Copy the Telegraf config to /etc/telegraf/telegraf.conf
          3. Start the telegraf service
  • 29. Alerting for Resource Utilization
      • Know your file systems are filling up before the box tips over!
      • Know when CPU is spiking on critical infrastructure!
      • You can alert on whatever system stats you wish!
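The threshold logic behind alerts like these can be sketched as follows. The warning and critical levels (80% and 90%) and the rule that CPU must stay above the threshold for several consecutive samples are assumptions chosen for illustration; the talk does not state the thresholds it uses.

```python
def alert_level(used_percent: float, warn: float = 80.0, crit: float = 90.0) -> str:
    """Classify a utilization reading, e.g. file system used percent."""
    if used_percent >= crit:
        return "CRITICAL"
    if used_percent >= warn:
        return "WARNING"
    return "OK"


def sustained_breach(samples, threshold: float, min_consecutive: int) -> bool:
    """True when at least min_consecutive samples in a row exceed threshold.

    Requiring a run of samples avoids paging on a single short CPU spike.
    """
    run = 0
    for value in samples:
        run = run + 1 if value > threshold else 0
        if run >= min_consecutive:
            return True
    return False


print(alert_level(93.5))                                 # disk nearly full
print(sustained_breach([50, 95, 96, 97, 40], 90.0, 3))   # CPU high for 3 samples
```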
  • 30. Tips
  • 31. Watcher of Watchers
      • Write Telegraf data from enterprise data nodes to an open-source instance of InfluxDB
          • Single-node InfluxDB OSS (it’s FREE)
      • Critical when attempting to triage performance issues with your enterprise cluster.
      • Enables you to use Grafana to visualize system resource utilization of enterprise data nodes.
  • 32. Runaway Series Cardinality
      • Be aware of what makes a good tag and what doesn’t!
          • Data with a large degree of uniqueness does not make a good tag
          • Example: unique job IDs
          • Fields don’t have to be numerical!
      • Unexpected spikes in memory consumption may be the result of runaway series cardinality
          • Use Chronograf to quickly explore DB/RP combos for tags with a high number of values
          • Tags are indexed in memory, so problems may manifest as memory spikes
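A small Python sketch shows why a unique job ID makes a poor tag. It counts distinct tag-value combinations, which is a simplified stand-in for series cardinality within one measurement; the tag names and generated data are hypothetical.

```python
def series_cardinality(points, tag_keys):
    """Count the distinct tag-value combinations across the given points."""
    return len({tuple(point[key] for key in tag_keys) for point in points})


sites = ["site-a", "site-b"]
services = ["api", "scheduler", "gui"]

# 1,000 hypothetical points, each from a different job.
points = [
    {"site": sites[i % 2], "service": services[i % 3], "job_id": f"job-{i}"}
    for i in range(1000)
]

# Low-variety tags keep the series count small.
print(series_cardinality(points, ["site", "service"]))
# Adding the unique job ID as a tag creates one series per job.
print(series_cardinality(points, ["site", "service", "job_id"]))
```

Storing the job ID as a string field instead keeps it queryable without growing the in-memory index.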
  • 33. Next Steps
      Leverage Flux!
      • TICKscripts have been very effective for downsampling and custom alerting
      • Should look to see how we can leverage Flux to enhance our workflows
      Hardware Modernization
      • Currently using a 4x8 enterprise license on VMs
      • Backend data and WAL volumes are AFF mounted over iSCSI
      • As workloads increase, we may want to look at bare metal with NVMe drives
      • Increased CPU
          • More CPUs = more concurrent compactions!
  • 34. Summary
  • 35. Questions?