Gianluca Arbezzano / Site Reliability Engineer
Kubernetes Monitoring
with InfluxDB
© 2018 InfluxData. All rights reserved.
Who I am
Gianluca Arbezzano
Site Reliability Engineer @InfluxData
• http://gianarb.it
• @gianarb
What I like:
• I make dirty hacks that look awesome
• I grow my vegetables 🍅🌻🍆
• Travel for fun and work
How distributed systems
monitoring is different
● Partial failure
● Fault tolerance and resiliency
○ Space dimension = replication
○ Time dimension = retries
● “normal state” is hard to define
[Diagram: Client A, Client B, two load balancers, Cache A, Cache B, two DB 1 boxes]
Kubernetes
Kubernetes architecture diagram
Telegraf as daemonset to get nodes stats
[[inputs.internal]]
[[inputs.cpu]]
[[inputs.disk]]
ignore_fs = ["tmpfs", "devtmpfs", "devfs"]
[[inputs.diskio]]
[[inputs.kernel]]
[[inputs.mem]]
[[inputs.processes]]
[[inputs.swap]]
[[inputs.system]]
[[inputs.docker]]
endpoint = "unix:///var/run/docker.sock"
[[inputs.kubernetes]]
url = "http://127.0.0.1:10255"
Telegraf as daemonset to get nodes stats
volumeMounts:
  - name: sys
    mountPath: /rootfs/sys
    readOnly: true
  - name: docker
    mountPath: /var/run/docker.sock
    readOnly: true
  - name: proc
    mountPath: /rootfs/proc
    readOnly: true
  - name: utmp
    mountPath: /var/run/utmp
    readOnly: true
● hostNetwork: true
● dnsPolicy: ClusterFirst
Telegraf as daemonset reachable from a container
env:
  - name: HOST_IP
    valueFrom:
      fieldRef:
        fieldPath: status.hostIP
  - name: MONITOR_HOST
    value: "http://$(HOST_IP):8086"
[[inputs.http_listener]]
## Address and port to host HTTP listener on
service_address = ":8086"
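As a rough sketch of how an application container could use MONITOR_HOST, the Python below formats one point in InfluxDB line protocol and builds a POST request for the listener's /write endpoint. The measurement, tag, and field names, the fallback address, and the example host IP are illustrative assumptions; escaping of special characters is left out, and the request is only constructed, not sent.

```python
import os
import urllib.request


def _format_field(value):
    """Render a field value using line protocol type suffixes."""
    # bool must be checked before int because bool is a subclass of int
    if isinstance(value, bool):
        return "true" if value else "false"
    if isinstance(value, int):
        return f"{value}i"
    if isinstance(value, float):
        return repr(value)
    return '"' + str(value).replace('"', '\\"') + '"'


def to_line_protocol(measurement, tags, fields, timestamp_ns=None):
    """Format one point as line protocol (simplified: no key escaping)."""
    tag_part = "".join(f",{k}={v}" for k, v in sorted(tags.items()))
    field_part = ",".join(
        f"{k}={_format_field(v)}" for k, v in sorted(fields.items())
    )
    line = f"{measurement}{tag_part} {field_part}"
    if timestamp_ns is not None:
        line += f" {timestamp_ns}"
    return line


def build_write_request(line, monitor_host=None):
    """Build (but do not send) a POST to the Telegraf listener."""
    # MONITOR_HOST is the variable injected by the pod spec above;
    # the fallback address is a hypothetical default for local testing.
    host = monitor_host or os.environ.get("MONITOR_HOST", "http://127.0.0.1:8086")
    return urllib.request.Request(
        url=f"{host}/write", data=line.encode("utf-8"), method="POST"
    )


# Hypothetical metric emitted by an application container
line = to_line_protocol(
    "checkout", {"status": "ok"}, {"duration_ms": 12.5, "items": 3}
)
request = build_write_request(line, "http://10.0.0.5:8086")
# urllib.request.urlopen(request, timeout=5) would actually send the point
```

Because the DaemonSet runs one Telegraf per node, resolving the target from status.hostIP keeps the write on the local node.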
Telegraf as a Sidecar
apiVersion: apps/v1beta1
kind: StatefulSet
metadata:
  name: "etcd"
  labels:
spec:
  serviceName: "etcd"
  replicas: 3
  template:
    metadata:
      name: "etcd"
      labels:
        component: "etcd"
    spec:
      containers:
        - name: "telegraf"
          image: "docker.io/library/telegraf:1.4"
        - name: "etcd"
          image: "quay.io/coreos/etcd:v3.2.9"
https://www.influxdata.com/blog/monitoring-kubernetes-architecture/
Feedback from “real life”
● A high number of Telegraf instances running inside the cluster
● For Prometheus metrics there is a better way
(I will tell you how later)
● Pull vs Push
Pull and Push
/metrics
# HELP storage_cache_age_seconds Age in seconds of the current cache (time since last snapshot or initialisation).
# TYPE storage_cache_age_seconds gauge
storage_cache_age_seconds{engine_id="0",node_id="0"} 112.999976922
storage_cache_age_seconds{engine_id="1",node_id="0"} 26.999942596
storage_cache_age_seconds{engine_id="16",node_id="0"} 188.999943578
storage_cache_age_seconds{engine_id="17",node_id="0"} 127.999951674
storage_cache_age_seconds{engine_id="24",node_id="0"} 591.999925169
storage_cache_age_seconds{engine_id="25",node_id="0"} 444.999924453
storage_cache_age_seconds{engine_id="32",node_id="0"} 578.999943156
storage_cache_age_seconds{engine_id="33",node_id="0"} 340.99992462
storage_cache_age_seconds{engine_id="40",node_id="0"} 427.999951022
storage_cache_age_seconds{engine_id="41",node_id="0"} 375.999926161
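To make the pull side concrete, here is a minimal sketch of reading such a /metrics payload in Python. It is a simplified parser written for this illustration (no escaped quotes in label values, no timestamps), not the parser that Telegraf or Prometheus uses.

```python
import re

# metric_name{label="value",...} value
SAMPLE_RE = re.compile(
    r'^(?P<name>[a-zA-Z_:][a-zA-Z0-9_:]*)'
    r'(?:\{(?P<labels>.*)\})?\s+(?P<value>\S+)'
)
LABEL_RE = re.compile(r'([a-zA-Z_][a-zA-Z0-9_]*)="([^"]*)"')


def parse_metrics(text):
    """Return a list of (name, labels, value) tuples from exposition text."""
    samples = []
    for raw in text.splitlines():
        line = raw.strip()
        # Skip blank lines and the "# HELP" / "# TYPE" comment lines
        if not line or line.startswith("#"):
            continue
        match = SAMPLE_RE.match(line)
        if match is None:
            continue
        labels = dict(LABEL_RE.findall(match.group("labels") or ""))
        samples.append((match.group("name"), labels, float(match.group("value"))))
    return samples


example = """# TYPE storage_cache_age_seconds gauge
storage_cache_age_seconds{engine_id="0",node_id="0"} 112.999976922
storage_cache_age_seconds{engine_id="1",node_id="0"} 26.999942596
"""
samples = parse_metrics(example)
```

Each sample maps naturally onto a time series point: the metric name becomes the measurement, the labels become tags, and the number becomes the field value.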
Kubernetes discovery with the Prometheus Plugin
[[inputs.prometheus]]
monitor_kubernetes_pods = true
Enabling this option lets the plugin scrape Kubernetes pods that carry Prometheus annotations:
• prometheus.io/scrape Enable scraping for this pod.
• prometheus.io/scheme If the metrics endpoint is secured, set this to https and most likely also set the TLS config. (default 'http')
• prometheus.io/path Override the path for the metrics endpoint on the service. (default '/metrics')
• prometheus.io/port Override the port. (default 9102)
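The annotation defaults listed above can be sketched as a small function that turns a pod's annotations into a scrape URL. This only models the documented defaults; it is not the plugin's actual code, and the pod IP and annotation values below are made up for illustration.

```python
def scrape_url(pod_ip, annotations):
    """Build the metrics URL for a pod, or None if scraping is not enabled."""
    if annotations.get("prometheus.io/scrape") != "true":
        return None
    # Defaults as listed on the slide
    scheme = annotations.get("prometheus.io/scheme", "http")
    path = annotations.get("prometheus.io/path", "/metrics")
    port = annotations.get("prometheus.io/port", "9102")
    if not path.startswith("/"):
        path = "/" + path
    return f"{scheme}://{pod_ip}:{port}{path}"


# Hypothetical pods
default_url = scrape_url("10.1.2.3", {"prometheus.io/scrape": "true"})
custom_url = scrape_url(
    "10.1.2.4",
    {
        "prometheus.io/scrape": "true",
        "prometheus.io/scheme": "https",
        "prometheus.io/path": "/internal/metrics",
        "prometheus.io/port": "8443",
    },
)
skipped = scrape_url("10.1.2.5", {})
```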
Monitor your ingestion pipeline
• internal_memstats
• internal_agent
– metrics_dropped
– metrics_gathered
• internal_gather
• internal_write
start = -6h
interval = 3m
from(bucket: "kube-infra/monthly")
|> range(start: start)
|> filter(fn: (r) =>
r._measurement == "internal_agent"
and r.env == "acc"
and r.host =~ /^telegraf-prom-discovery/)
|> filter(fn: (r) =>
r._field == "metrics_dropped"
or r._field == "metrics_gathered"
or r._field == "metrics_written")
|> window(every: interval)
|> mean() // defaults to "_value"
|> group(columns: ["_field"])
|> derivative(nonNegative: true, timeColumn: "_stop")
Monitor your ingestion pipeline
• You can use inputs.http_response to check whether Telegraf is healthy.
• You can configure Kubernetes liveness and readiness probes to manage Telegraf availability.
ReadinessProbe and LivenessProbe
LivenessProbe: applications can eventually transition to broken states from which they cannot recover except by being restarted. Kubernetes provides liveness probes to detect and remedy such situations.
ReadinessProbe: applications can become busy or temporarily unavailable. A pod whose containers report that they are not ready does not receive traffic through Kubernetes Services.
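Both probe types are evaluated repeatedly, and Kubernetes acts on consecutive results through the failureThreshold and successThreshold probe fields. The sketch below is a simplified simulation of that counting, not Kubernetes code: for a liveness probe a False state stands for "restart the container", for a readiness probe it stands for "removed from Service endpoints". The function name and the sample sequence are illustrative.

```python
def probe_states(results, failure_threshold=3, success_threshold=1, initially_ok=True):
    """Return the healthy/unhealthy state after each probe result."""
    ok = initially_ok
    failures = 0
    successes = 0
    states = []
    for passed in results:
        if passed:
            successes += 1
            failures = 0
            # Flip back to healthy only after enough consecutive successes
            if not ok and successes >= success_threshold:
                ok = True
        else:
            failures += 1
            successes = 0
            # Flip to unhealthy only after enough consecutive failures
            if ok and failures >= failure_threshold:
                ok = False
        states.append(ok)
    return states


# Two isolated failures are tolerated; three in a row are not
history = [True, False, False, True, False, False, False, True]
states = probe_states(history)
```

The thresholds are what keep a single slow response from restarting Telegraf or pulling it out of rotation.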
Telegraf as Sidecar gives you control
[[inputs.internal]]
[[inputs.prometheus]]
urls = ["http://127.0.0.1:9999/metrics"]
[[processors.converter]]
[processors.converter.tags]
string = ["user_agent"]
[[outputs.influxdb]]
urls = ["$MONITOR_HOST"]
database = "$MONITOR_DATABASE"
timeout = "5s"
[[outputs.influxdb_v2]]
urls=["http://us-west-2-1.aws.cloud2.influxdata.com"]
token = "$TOKEN"
organization = "$ORG"
bucket = "$BUCKET"
timeout = "5s"
namepass = ["internal_*"]
Telegraf Guard Rails
[[inputs.internal]]
[[inputs.prometheus]]
urls = ["http://127.0.0.1:9999/metrics"]
[[processors.tag_limit]]
limit = 3
## List of tags to preferentially preserve
keep = ["handler", "method", "status"]
[[outputs.influxdb]]
urls = ["$MONITOR_HOST"]
database = "$MONITOR_DATABASE"
timeout = "5s"
[[outputs.influxdb_v2]]
urls=["http://us-west-2-1.aws.cloud2.influxdata.com"]
token = "$TOKEN"
organization = "$ORG"
bucket = "$BUCKET"
timeout = "5s"
namepass = ["internal_*"]
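The idea behind the tag_limit guard rail can be sketched in a few lines of Python: when a metric carries more tags than the limit, tags outside the keep list are dropped until the limit is met. This is a simplified model for illustration, not Telegraf's implementation; the example tags are invented, and visiting tags in sorted key order is an assumption of this sketch.

```python
def limit_tags(tags, limit, keep):
    """Drop tags that are not in keep until at most limit tags remain."""
    result = dict(tags)
    for key in sorted(tags):
        if len(result) <= limit:
            break
        if key not in keep:
            del result[key]
    # If the keep list alone exceeds the limit, the result stays above it
    return result


# Hypothetical metric with two high-cardinality tags
tags = {
    "handler": "/write",
    "method": "POST",
    "status": "204",
    "user_agent": "curl/7.64",
    "path": "/write?db=x",
}
limited = limit_tags(tags, limit=3, keep={"handler", "method", "status"})
```

Capping the tag count at the edge protects the database from series cardinality explosions caused by a single misbehaving application.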
Lessons
Scaling is NOT More Manual Processes
Scaling is NOT saying “You’re Doing it Wrong”
Scaling IS Empowering Developers
Scaling IS Predictability of Failure Modes
Lesson
Architecture is a never-ending story…
A Telegraf sidecar, owned by your developers, writes to the Telegraf DaemonSet; the DaemonSet, owned by your ops team and configured with safeguards, writes to InfluxDB.
Complex, maybe, but possible!
Monitor is up
when you are down
InfluxDB makes everything simpler, but your monitoring still has to notify you when your infrastructure is down, and that is not simple.
● Different infrastructure
● Reliability team
● Redundancy
● Or you can use a SaaS (InfluxCloud is 100%
compatible with OSS for write/read)
Number of pod restarts
from(bucket:"kube-infra/monthly")
|> range(start: dashboardTime, stop: upperDashboardTime)
|> filter(fn: (r) =>
r._measurement == "kube_pod_container_status_restarts_total"
and r._field == "counter"
and r.container == "xxxx"
and r.namespace == "xxxx")
|> difference(nonNegative: true)
|> group()
|> aggregateWindow(every: autoInterval, fn: sum, createEmpty: false)
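The core of this query is turning an ever-growing restart counter into per-interval increments. The Python below sketches that step for a single series. It assumes that a drop in the counter means a reset and that the new value is then counted from zero, which is how this sketch reads nonNegative: true; check the Flux documentation for the exact semantics. The sample values are invented.

```python
def non_negative_differences(values):
    """Differences between consecutive counter samples, tolerating resets."""
    diffs = []
    previous = None
    for value in values:
        if previous is not None:
            # Assumption: a decrease means the counter restarted from zero
            diffs.append(value - previous if value >= previous else value)
        previous = value
    return diffs


def total_restarts(values):
    """Sum of increments, i.e. restarts observed over the sampled range."""
    return sum(non_negative_differences(values))


# Hypothetical counter samples; the drop from 6 to 0 is a counter reset
samples = [3, 3, 4, 6, 0, 1, 1]
diffs = non_negative_differences(samples)
restarts = total_restarts(samples)
```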
Persistent Volume % usage
start = -20m
from(bucket: "kube-infra/monthly")
|> range(start: start)
|> filter(fn: (r) =>
r._measurement == "kubernetes_pod_volume"
and (r._field == "used_bytes" or r._field == "capacity_bytes"))
|> aggregateWindow(every: 5m, fn: mean, createEmpty: false)
|> pivot(rowKey: ["_time"], columnKey: ["_field"], valueColumn: "_value")
|> map(fn: (r) => ({_time: r._time, _value: 100.0 * r.used_bytes / r.capacity_bytes}))
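The pivot and map steps can be pictured with a small Python sketch: records arrive as one row per field, are regrouped by timestamp so both fields sit side by side, and the percentage is then computed per timestamp. The record layout and sample numbers are illustrative assumptions, and the sketch handles one volume only.

```python
from collections import defaultdict


def pivot_by_time(records):
    """Turn one-row-per-field records into {time: {field: value}}."""
    rows = defaultdict(dict)
    for rec in records:
        rows[rec["_time"]][rec["_field"]] = rec["_value"]
    return dict(rows)


def usage_percent(records):
    """Percentage of capacity used at each timestamp."""
    result = {}
    for time, fields in sorted(pivot_by_time(records).items()):
        used = fields.get("used_bytes")
        capacity = fields.get("capacity_bytes")
        # Skip incomplete rows and avoid dividing by zero
        if used is None or not capacity:
            continue
        result[time] = 100.0 * used / capacity
    return result


# Hypothetical 5-minute means for one volume
records = [
    {"_time": "00:00", "_field": "used_bytes", "_value": 25.0},
    {"_time": "00:00", "_field": "capacity_bytes", "_value": 100.0},
    {"_time": "00:05", "_field": "used_bytes", "_value": 60.0},
    {"_time": "00:05", "_field": "capacity_bytes", "_value": 80.0},
]
usage = usage_percent(records)
```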
Thank You!
@gianarb

More Related Content

Similar to Learn How to Use a Time Series Platform to Monitor All Aspects of Your Kubernetes Deployment

Scaling Prometheus Metrics in Kubernetes with Telegraf | Chris Goller | Influ...
Scaling Prometheus Metrics in Kubernetes with Telegraf | Chris Goller | Influ...Scaling Prometheus Metrics in Kubernetes with Telegraf | Chris Goller | Influ...
Scaling Prometheus Metrics in Kubernetes with Telegraf | Chris Goller | Influ...
InfluxData
 
Performance vision Version 2.15 news
Performance vision Version 2.15 newsPerformance vision Version 2.15 news
Performance vision Version 2.15 news
PerformanceVision (previously SecurActive)
 
Taming the Tiger: Tips and Tricks for Using Telegraf
Taming the Tiger: Tips and Tricks for Using TelegrafTaming the Tiger: Tips and Tricks for Using Telegraf
Taming the Tiger: Tips and Tricks for Using Telegraf
InfluxData
 
Taming the Tiger: Tips and Tricks for Using Telegraf
Taming the Tiger: Tips and Tricks for Using TelegrafTaming the Tiger: Tips and Tricks for Using Telegraf
Taming the Tiger: Tips and Tricks for Using Telegraf
InfluxData
 
InfluxDB 101 – Concepts and Architecture by Michael DeSa, Software Engineer |...
InfluxDB 101 – Concepts and Architecture by Michael DeSa, Software Engineer |...InfluxDB 101 – Concepts and Architecture by Michael DeSa, Software Engineer |...
InfluxDB 101 – Concepts and Architecture by Michael DeSa, Software Engineer |...
InfluxData
 
Shashi Raina [AWS] & Al Sargent [InfluxData] | Build Modern Monitoring with I...
Shashi Raina [AWS] & Al Sargent [InfluxData] | Build Modern Monitoring with I...Shashi Raina [AWS] & Al Sargent [InfluxData] | Build Modern Monitoring with I...
Shashi Raina [AWS] & Al Sargent [InfluxData] | Build Modern Monitoring with I...
InfluxData
 
使用 Prometheus 監控 Kubernetes Cluster
使用 Prometheus 監控 Kubernetes Cluster 使用 Prometheus 監控 Kubernetes Cluster
使用 Prometheus 監控 Kubernetes Cluster
inwin stack
 
Securing your Oracle Fusion Middleware Environment, On-Prem and in the Cloud
Securing your Oracle Fusion Middleware Environment, On-Prem and in the CloudSecuring your Oracle Fusion Middleware Environment, On-Prem and in the Cloud
Securing your Oracle Fusion Middleware Environment, On-Prem and in the Cloud
Revelation Technologies
 
Incrementalism: An Industrial Strategy For Adopting Modern Automation
Incrementalism: An Industrial Strategy For Adopting Modern AutomationIncrementalism: An Industrial Strategy For Adopting Modern Automation
Incrementalism: An Industrial Strategy For Adopting Modern Automation
Sean Chittenden
 
Ruby Driver Explained: DataStax Webinar May 5th 2015
Ruby Driver Explained: DataStax Webinar May 5th 2015Ruby Driver Explained: DataStax Webinar May 5th 2015
Ruby Driver Explained: DataStax Webinar May 5th 2015
DataStax
 
Advanced kapacitor
Advanced kapacitorAdvanced kapacitor
Advanced kapacitor
InfluxData
 
Modern Scheduling for Modern Applications with Nomad
Modern Scheduling for Modern Applications with NomadModern Scheduling for Modern Applications with Nomad
Modern Scheduling for Modern Applications with Nomad
Mitchell Pronschinske
 
InfluxDB Client Libraries and Applications | Miroslav Malecha | Bonitoo
InfluxDB Client Libraries and Applications | Miroslav Malecha | BonitooInfluxDB Client Libraries and Applications | Miroslav Malecha | Bonitoo
InfluxDB Client Libraries and Applications | Miroslav Malecha | Bonitoo
InfluxData
 
JAX London 2021: Jumpstart Your Cloud Native Development: An Overview of Prac...
JAX London 2021: Jumpstart Your Cloud Native Development: An Overview of Prac...JAX London 2021: Jumpstart Your Cloud Native Development: An Overview of Prac...
JAX London 2021: Jumpstart Your Cloud Native Development: An Overview of Prac...
Daniel Bryant
 
Monitoring in Motion: Monitoring Containers and Amazon ECS
Monitoring in Motion: Monitoring Containers and Amazon ECSMonitoring in Motion: Monitoring Containers and Amazon ECS
Monitoring in Motion: Monitoring Containers and Amazon ECS
Amazon Web Services
 
Hta t07-did-you-read-the-news-http-request-hijacking
Hta t07-did-you-read-the-news-http-request-hijackingHta t07-did-you-read-the-news-http-request-hijacking
Hta t07-did-you-read-the-news-http-request-hijacking
Комсс Файквэе
 
'DOCKER' & CLOUD: ENABLERS For DEVOPS
'DOCKER' & CLOUD:  ENABLERS For DEVOPS'DOCKER' & CLOUD:  ENABLERS For DEVOPS
'DOCKER' & CLOUD: ENABLERS For DEVOPS
ACA IT-Solutions
 
Docker and Cloud - Enables for DevOps - by ACA-IT
Docker and Cloud - Enables for DevOps - by ACA-ITDocker and Cloud - Enables for DevOps - by ACA-IT
Docker and Cloud - Enables for DevOps - by ACA-IT
Stijn Wijndaele
 
Self scaling Multi cloud nomad workloads
Self scaling Multi cloud nomad workloadsSelf scaling Multi cloud nomad workloads
Self scaling Multi cloud nomad workloads
Bram Vogelaar
 
Monitoring using Prometheus and Grafana
Monitoring using Prometheus and GrafanaMonitoring using Prometheus and Grafana
Monitoring using Prometheus and Grafana
Arvind Kumar G.S
 

Similar to Learn How to Use a Time Series Platform to Monitor All Aspects of Your Kubernetes Deployment (20)

Scaling Prometheus Metrics in Kubernetes with Telegraf | Chris Goller | Influ...
Scaling Prometheus Metrics in Kubernetes with Telegraf | Chris Goller | Influ...Scaling Prometheus Metrics in Kubernetes with Telegraf | Chris Goller | Influ...
Scaling Prometheus Metrics in Kubernetes with Telegraf | Chris Goller | Influ...
 
Performance vision Version 2.15 news
Performance vision Version 2.15 newsPerformance vision Version 2.15 news
Performance vision Version 2.15 news
 
Taming the Tiger: Tips and Tricks for Using Telegraf
Taming the Tiger: Tips and Tricks for Using TelegrafTaming the Tiger: Tips and Tricks for Using Telegraf
Taming the Tiger: Tips and Tricks for Using Telegraf
 
Taming the Tiger: Tips and Tricks for Using Telegraf
Taming the Tiger: Tips and Tricks for Using TelegrafTaming the Tiger: Tips and Tricks for Using Telegraf
Taming the Tiger: Tips and Tricks for Using Telegraf
 
InfluxDB 101 – Concepts and Architecture by Michael DeSa, Software Engineer |...
InfluxDB 101 – Concepts and Architecture by Michael DeSa, Software Engineer |...InfluxDB 101 – Concepts and Architecture by Michael DeSa, Software Engineer |...
InfluxDB 101 – Concepts and Architecture by Michael DeSa, Software Engineer |...
 
Shashi Raina [AWS] & Al Sargent [InfluxData] | Build Modern Monitoring with I...
Shashi Raina [AWS] & Al Sargent [InfluxData] | Build Modern Monitoring with I...Shashi Raina [AWS] & Al Sargent [InfluxData] | Build Modern Monitoring with I...
Shashi Raina [AWS] & Al Sargent [InfluxData] | Build Modern Monitoring with I...
 
使用 Prometheus 監控 Kubernetes Cluster
使用 Prometheus 監控 Kubernetes Cluster 使用 Prometheus 監控 Kubernetes Cluster
使用 Prometheus 監控 Kubernetes Cluster
 
Securing your Oracle Fusion Middleware Environment, On-Prem and in the Cloud
Securing your Oracle Fusion Middleware Environment, On-Prem and in the CloudSecuring your Oracle Fusion Middleware Environment, On-Prem and in the Cloud
Securing your Oracle Fusion Middleware Environment, On-Prem and in the Cloud
 
Incrementalism: An Industrial Strategy For Adopting Modern Automation
Incrementalism: An Industrial Strategy For Adopting Modern AutomationIncrementalism: An Industrial Strategy For Adopting Modern Automation
Incrementalism: An Industrial Strategy For Adopting Modern Automation
 
Ruby Driver Explained: DataStax Webinar May 5th 2015
Ruby Driver Explained: DataStax Webinar May 5th 2015Ruby Driver Explained: DataStax Webinar May 5th 2015
Ruby Driver Explained: DataStax Webinar May 5th 2015
 
Advanced kapacitor
Advanced kapacitorAdvanced kapacitor
Advanced kapacitor
 
Modern Scheduling for Modern Applications with Nomad
Modern Scheduling for Modern Applications with NomadModern Scheduling for Modern Applications with Nomad
Modern Scheduling for Modern Applications with Nomad
 
InfluxDB Client Libraries and Applications | Miroslav Malecha | Bonitoo
InfluxDB Client Libraries and Applications | Miroslav Malecha | BonitooInfluxDB Client Libraries and Applications | Miroslav Malecha | Bonitoo
InfluxDB Client Libraries and Applications | Miroslav Malecha | Bonitoo
 
JAX London 2021: Jumpstart Your Cloud Native Development: An Overview of Prac...
JAX London 2021: Jumpstart Your Cloud Native Development: An Overview of Prac...JAX London 2021: Jumpstart Your Cloud Native Development: An Overview of Prac...
JAX London 2021: Jumpstart Your Cloud Native Development: An Overview of Prac...
 
Monitoring in Motion: Monitoring Containers and Amazon ECS
Monitoring in Motion: Monitoring Containers and Amazon ECSMonitoring in Motion: Monitoring Containers and Amazon ECS
Monitoring in Motion: Monitoring Containers and Amazon ECS
 
Hta t07-did-you-read-the-news-http-request-hijacking
Hta t07-did-you-read-the-news-http-request-hijackingHta t07-did-you-read-the-news-http-request-hijacking
Hta t07-did-you-read-the-news-http-request-hijacking
 
'DOCKER' & CLOUD: ENABLERS For DEVOPS
'DOCKER' & CLOUD:  ENABLERS For DEVOPS'DOCKER' & CLOUD:  ENABLERS For DEVOPS
'DOCKER' & CLOUD: ENABLERS For DEVOPS
 
Docker and Cloud - Enables for DevOps - by ACA-IT
Docker and Cloud - Enables for DevOps - by ACA-ITDocker and Cloud - Enables for DevOps - by ACA-IT
Docker and Cloud - Enables for DevOps - by ACA-IT
 
Self scaling Multi cloud nomad workloads
Self scaling Multi cloud nomad workloadsSelf scaling Multi cloud nomad workloads
Self scaling Multi cloud nomad workloads
 
Monitoring using Prometheus and Grafana
Monitoring using Prometheus and GrafanaMonitoring using Prometheus and Grafana
Monitoring using Prometheus and Grafana
 

More from DevOps.com

Modernizing on IBM Z Made Easier With Open Source Software
Modernizing on IBM Z Made Easier With Open Source SoftwareModernizing on IBM Z Made Easier With Open Source Software
Modernizing on IBM Z Made Easier With Open Source Software
DevOps.com
 
Comparing Microsoft SQL Server 2019 Performance Across Various Kubernetes Pla...
Comparing Microsoft SQL Server 2019 Performance Across Various Kubernetes Pla...Comparing Microsoft SQL Server 2019 Performance Across Various Kubernetes Pla...
Comparing Microsoft SQL Server 2019 Performance Across Various Kubernetes Pla...
DevOps.com
 
Comparing Microsoft SQL Server 2019 Performance Across Various Kubernetes Pla...
Comparing Microsoft SQL Server 2019 Performance Across Various Kubernetes Pla...Comparing Microsoft SQL Server 2019 Performance Across Various Kubernetes Pla...
Comparing Microsoft SQL Server 2019 Performance Across Various Kubernetes Pla...
DevOps.com
 
Next Generation Vulnerability Assessment Using Datadog and Snyk
Next Generation Vulnerability Assessment Using Datadog and SnykNext Generation Vulnerability Assessment Using Datadog and Snyk
Next Generation Vulnerability Assessment Using Datadog and Snyk
DevOps.com
 
Vulnerability Discovery in the Cloud
Vulnerability Discovery in the CloudVulnerability Discovery in the Cloud
Vulnerability Discovery in the Cloud
DevOps.com
 
2021 Open Source Governance: Top Ten Trends and Predictions
2021 Open Source Governance: Top Ten Trends and Predictions2021 Open Source Governance: Top Ten Trends and Predictions
2021 Open Source Governance: Top Ten Trends and Predictions
DevOps.com
 
A New Year’s Ransomware Resolution
A New Year’s Ransomware ResolutionA New Year’s Ransomware Resolution
A New Year’s Ransomware Resolution
DevOps.com
 
Getting Started with Runtime Security on Azure Kubernetes Service (AKS)
Getting Started with Runtime Security on Azure Kubernetes Service (AKS)Getting Started with Runtime Security on Azure Kubernetes Service (AKS)
Getting Started with Runtime Security on Azure Kubernetes Service (AKS)
DevOps.com
 
Don't Panic! Effective Incident Response
Don't Panic! Effective Incident ResponseDon't Panic! Effective Incident Response
Don't Panic! Effective Incident Response
DevOps.com
 
Creating a Culture of Chaos: Chaos Engineering Is Not Just Tools, It's Culture
Creating a Culture of Chaos: Chaos Engineering Is Not Just Tools, It's CultureCreating a Culture of Chaos: Chaos Engineering Is Not Just Tools, It's Culture
Creating a Culture of Chaos: Chaos Engineering Is Not Just Tools, It's Culture
DevOps.com
 
Role Based Access Controls (RBAC) for SSH and Kubernetes Access with Teleport
Role Based Access Controls (RBAC) for SSH and Kubernetes Access with TeleportRole Based Access Controls (RBAC) for SSH and Kubernetes Access with Teleport
Role Based Access Controls (RBAC) for SSH and Kubernetes Access with Teleport
DevOps.com
 
Monitoring Serverless Applications with Datadog
Monitoring Serverless Applications with DatadogMonitoring Serverless Applications with Datadog
Monitoring Serverless Applications with Datadog
DevOps.com
 
Deliver your App Anywhere … Publicly or Privately
Deliver your App Anywhere … Publicly or PrivatelyDeliver your App Anywhere … Publicly or Privately
Deliver your App Anywhere … Publicly or Privately
DevOps.com
 
Securing medical apps in the age of covid final
Securing medical apps in the age of covid finalSecuring medical apps in the age of covid final
Securing medical apps in the age of covid final
DevOps.com
 
How to Build a Healthy On-Call Culture
How to Build a Healthy On-Call CultureHow to Build a Healthy On-Call Culture
How to Build a Healthy On-Call Culture
DevOps.com
 
The Evolving Role of the Developer in 2021
The Evolving Role of the Developer in 2021The Evolving Role of the Developer in 2021
The Evolving Role of the Developer in 2021
DevOps.com
 
Service Mesh: Two Big Words But Do You Need It?
Service Mesh: Two Big Words But Do You Need It?Service Mesh: Two Big Words But Do You Need It?
Service Mesh: Two Big Words But Do You Need It?
DevOps.com
 
Secure Data Sharing in OpenShift Environments
Secure Data Sharing in OpenShift EnvironmentsSecure Data Sharing in OpenShift Environments
Secure Data Sharing in OpenShift Environments
DevOps.com
 
How to Govern Identities and Access in Cloud Infrastructure: AppsFlyer Case S...
How to Govern Identities and Access in Cloud Infrastructure: AppsFlyer Case S...How to Govern Identities and Access in Cloud Infrastructure: AppsFlyer Case S...
How to Govern Identities and Access in Cloud Infrastructure: AppsFlyer Case S...
DevOps.com
 
Elevate Your Enterprise Python and R AI, ML Software Strategy with Anaconda T...
Elevate Your Enterprise Python and R AI, ML Software Strategy with Anaconda T...Elevate Your Enterprise Python and R AI, ML Software Strategy with Anaconda T...
Elevate Your Enterprise Python and R AI, ML Software Strategy with Anaconda T...
DevOps.com
 

More from DevOps.com (20)

Modernizing on IBM Z Made Easier With Open Source Software
Modernizing on IBM Z Made Easier With Open Source SoftwareModernizing on IBM Z Made Easier With Open Source Software
Modernizing on IBM Z Made Easier With Open Source Software
 
Comparing Microsoft SQL Server 2019 Performance Across Various Kubernetes Pla...
Comparing Microsoft SQL Server 2019 Performance Across Various Kubernetes Pla...Comparing Microsoft SQL Server 2019 Performance Across Various Kubernetes Pla...
Comparing Microsoft SQL Server 2019 Performance Across Various Kubernetes Pla...
 
Comparing Microsoft SQL Server 2019 Performance Across Various Kubernetes Pla...
Comparing Microsoft SQL Server 2019 Performance Across Various Kubernetes Pla...Comparing Microsoft SQL Server 2019 Performance Across Various Kubernetes Pla...
Comparing Microsoft SQL Server 2019 Performance Across Various Kubernetes Pla...
 
Next Generation Vulnerability Assessment Using Datadog and Snyk
Next Generation Vulnerability Assessment Using Datadog and SnykNext Generation Vulnerability Assessment Using Datadog and Snyk
Next Generation Vulnerability Assessment Using Datadog and Snyk
 
Vulnerability Discovery in the Cloud
Vulnerability Discovery in the CloudVulnerability Discovery in the Cloud
Vulnerability Discovery in the Cloud
 
2021 Open Source Governance: Top Ten Trends and Predictions
2021 Open Source Governance: Top Ten Trends and Predictions2021 Open Source Governance: Top Ten Trends and Predictions
2021 Open Source Governance: Top Ten Trends and Predictions
 
A New Year’s Ransomware Resolution
A New Year’s Ransomware ResolutionA New Year’s Ransomware Resolution
A New Year’s Ransomware Resolution
 
Getting Started with Runtime Security on Azure Kubernetes Service (AKS)
Getting Started with Runtime Security on Azure Kubernetes Service (AKS)Getting Started with Runtime Security on Azure Kubernetes Service (AKS)
Getting Started with Runtime Security on Azure Kubernetes Service (AKS)
 
Don't Panic! Effective Incident Response
Don't Panic! Effective Incident ResponseDon't Panic! Effective Incident Response
Don't Panic! Effective Incident Response
 
Creating a Culture of Chaos: Chaos Engineering Is Not Just Tools, It's Culture
Creating a Culture of Chaos: Chaos Engineering Is Not Just Tools, It's CultureCreating a Culture of Chaos: Chaos Engineering Is Not Just Tools, It's Culture
Creating a Culture of Chaos: Chaos Engineering Is Not Just Tools, It's Culture
 
Role Based Access Controls (RBAC) for SSH and Kubernetes Access with Teleport
Role Based Access Controls (RBAC) for SSH and Kubernetes Access with TeleportRole Based Access Controls (RBAC) for SSH and Kubernetes Access with Teleport
Role Based Access Controls (RBAC) for SSH and Kubernetes Access with Teleport
 
Monitoring Serverless Applications with Datadog
Monitoring Serverless Applications with DatadogMonitoring Serverless Applications with Datadog
Monitoring Serverless Applications with Datadog
 
Deliver your App Anywhere … Publicly or Privately
Deliver your App Anywhere … Publicly or PrivatelyDeliver your App Anywhere … Publicly or Privately
Deliver your App Anywhere … Publicly or Privately
 
Securing medical apps in the age of covid final
Securing medical apps in the age of covid finalSecuring medical apps in the age of covid final
Securing medical apps in the age of covid final
 
How to Build a Healthy On-Call Culture
How to Build a Healthy On-Call CultureHow to Build a Healthy On-Call Culture
How to Build a Healthy On-Call Culture
 
The Evolving Role of the Developer in 2021
The Evolving Role of the Developer in 2021The Evolving Role of the Developer in 2021
The Evolving Role of the Developer in 2021
 
Service Mesh: Two Big Words But Do You Need It?
Service Mesh: Two Big Words But Do You Need It?Service Mesh: Two Big Words But Do You Need It?
Service Mesh: Two Big Words But Do You Need It?
 
Secure Data Sharing in OpenShift Environments
Secure Data Sharing in OpenShift EnvironmentsSecure Data Sharing in OpenShift Environments
Secure Data Sharing in OpenShift Environments
 
How to Govern Identities and Access in Cloud Infrastructure: AppsFlyer Case S...
How to Govern Identities and Access in Cloud Infrastructure: AppsFlyer Case S...How to Govern Identities and Access in Cloud Infrastructure: AppsFlyer Case S...
How to Govern Identities and Access in Cloud Infrastructure: AppsFlyer Case S...
 
Elevate Your Enterprise Python and R AI, ML Software Strategy with Anaconda T...
Elevate Your Enterprise Python and R AI, ML Software Strategy with Anaconda T...Elevate Your Enterprise Python and R AI, ML Software Strategy with Anaconda T...
Elevate Your Enterprise Python and R AI, ML Software Strategy with Anaconda T...
 

Recently uploaded

A Deep Dive into ScyllaDB's Architecture
A Deep Dive into ScyllaDB's ArchitectureA Deep Dive into ScyllaDB's Architecture
A Deep Dive into ScyllaDB's Architecture
ScyllaDB
 
"Choosing proper type of scaling", Olena Syrota
"Choosing proper type of scaling", Olena Syrota"Choosing proper type of scaling", Olena Syrota
"Choosing proper type of scaling", Olena Syrota
Fwdays
 
How to Interpret Trends in the Kalyan Rajdhani Mix Chart.pdf
How to Interpret Trends in the Kalyan Rajdhani Mix Chart.pdfHow to Interpret Trends in the Kalyan Rajdhani Mix Chart.pdf
How to Interpret Trends in the Kalyan Rajdhani Mix Chart.pdf
Chart Kalyan
 
QA or the Highway - Component Testing: Bridging the gap between frontend appl...
QA or the Highway - Component Testing: Bridging the gap between frontend appl...QA or the Highway - Component Testing: Bridging the gap between frontend appl...
QA or the Highway - Component Testing: Bridging the gap between frontend appl...
zjhamm304
 
Poznań ACE event - 19.06.2024 Team 24 Wrapup slidedeck
Poznań ACE event - 19.06.2024 Team 24 Wrapup slidedeckPoznań ACE event - 19.06.2024 Team 24 Wrapup slidedeck
Poznań ACE event - 19.06.2024 Team 24 Wrapup slidedeck
FilipTomaszewski5
 
AppSec PNW: Android and iOS Application Security with MobSF
AppSec PNW: Android and iOS Application Security with MobSFAppSec PNW: Android and iOS Application Security with MobSF
AppSec PNW: Android and iOS Application Security with MobSF
Ajin Abraham
 
"Scaling RAG Applications to serve millions of users", Kevin Goedecke
"Scaling RAG Applications to serve millions of users",  Kevin Goedecke"Scaling RAG Applications to serve millions of users",  Kevin Goedecke
"Scaling RAG Applications to serve millions of users", Kevin Goedecke
Fwdays
 
Main news related to the CCS TSI 2023 (2023/1695)
Main news related to the CCS TSI 2023 (2023/1695)Main news related to the CCS TSI 2023 (2023/1695)
Main news related to the CCS TSI 2023 (2023/1695)
Jakub Marek
 
“Temporal Event Neural Networks: A More Efficient Alternative to the Transfor...
“Temporal Event Neural Networks: A More Efficient Alternative to the Transfor...“Temporal Event Neural Networks: A More Efficient Alternative to the Transfor...
“Temporal Event Neural Networks: A More Efficient Alternative to the Transfor...
Edge AI and Vision Alliance
 
Apps Break Data
Apps Break DataApps Break Data
Apps Break Data
Ivo Velitchkov
 
Learn How to Use a Time Series Platform to Monitor All Aspects of Your Kubernetes Deployment

  • 1. Gianluca Arbezzano / Site Reliability Engineer Kubernetes Monitoring with InfluxDB
  • 2. Who I am
      Gianluca Arbezzano, Site Reliability Engineer @InfluxData
      • http://gianarb.it
      • @gianarb
      What I like:
      • I make dirty hacks that look awesome
      • I grow my vegetables 🍅🌻🍆
      • Travel for fun and work
  • 7. How distributed systems monitoring is different
      ● Partial failure
      ● Fault tolerance and resiliency
        ○ Space dimension = replications
        ○ Time dimension = retries
      ● “normal state” is hard to define
  • 8. [Diagram: Client A, Client B, Load Balancer, Load Balancer, Cache A, Cache B, DB 1, DB 1]
  • 10. Kubernetes architecture diagram
  • 11. Telegraf as daemonset to get nodes stats
      [[inputs.internal]]
      [[inputs.cpu]]
      [[inputs.disk]]
        ignore_fs = ["tmpfs", "devtmpfs", "devfs"]
      [[inputs.diskio]]
      [[inputs.kernel]]
      [[inputs.mem]]
      [[inputs.processes]]
      [[inputs.swap]]
      [[inputs.system]]
      [[inputs.docker]]
        endpoint = "unix:///var/run/docker.sock"
      [[inputs.kubernetes]]
        url = "http://127.0.0.1:10255"
  • 12. Telegraf as daemonset to get nodes stats
      volumeMounts:
        - name: sys
          mountPath: /rootfs/sys
          readOnly: true
        - name: docker
          mountPath: /var/run/docker.sock
          readOnly: true
        - name: proc
          mountPath: /rootfs/proc
          readOnly: true
        - name: utmp
          mountPath: /var/run/utmp
          readOnly: true
      ● hostNetwork: true
      ● dnsPolicy: ClusterFirst
  • 13. Telegraf as daemonset reachable from a container
      env:
        - name: HOST_IP
          valueFrom:
            fieldRef:
              fieldPath: status.hostIP
        - name: MONITOR_HOST
          value: "http://$(HOST_IP):8086"

      [[inputs.http_listener]]
        ## Address and port to host HTTP listener on
        service_address = ":8086"
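A minimal Python sketch of the application side of this setup, under stated assumptions: the container reads the `MONITOR_HOST` variable from the slide and posts InfluxDB line protocol to the `/write` path of the node-local Telegraf listener. The helper names `to_line_protocol` and `build_write_request` are hypothetical, only numeric fields are handled, and the request is built but not sent.

```python
import os
import urllib.request


def _escape(value, chars):
    # Backslash-escape each character that is special in this position.
    for ch in chars:
        value = value.replace(ch, "\\" + ch)
    return value


def to_line_protocol(measurement, tags, fields, timestamp_ns=None):
    """Format one point as InfluxDB line protocol (numeric fields only)."""
    head = _escape(measurement, ", ")
    for key in sorted(tags):
        head += "," + _escape(key, ",= ") + "=" + _escape(str(tags[key]), ",= ")
    body = ",".join(
        _escape(key, ",= ") + "=" + repr(float(value))
        for key, value in sorted(fields.items())
    )
    line = head + " " + body
    if timestamp_ns is not None:
        line += " " + str(int(timestamp_ns))
    return line


def build_write_request(line, monitor_host=None):
    """Build (but do not send) a POST to the listener's /write endpoint."""
    # MONITOR_HOST comes from the pod spec; the fallback is an assumption.
    base = monitor_host or os.environ.get("MONITOR_HOST", "http://127.0.0.1:8086")
    return urllib.request.Request(
        base.rstrip("/") + "/write", data=line.encode("utf-8"), method="POST"
    )
```

Passing the returned request to `urllib.request.urlopen` would perform the write; a real client would also batch points and retry on failure.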
  • 14. Telegraf as a Sidecar
  • 15. Telegraf as a Sidecar
      apiVersion: apps/v1beta1
      kind: StatefulSet
      metadata:
        name: "etcd"
        labels:
      spec:
        serviceName: "etcd"
        replicas: 3
        template:
          metadata:
            name: "etcd"
            labels:
              component: "etcd"
          spec:
            containers:
              - name: "telegraf"
                image: "docker.io/library/telegraf:1.4"
              - name: "etcd"
                image: "quay.io/coreos/etcd:v3.2.9"
  • 16. https://www.influxdata.com/blog/monitoring-kubernetes-architecture/
  • 17. Feedback from “real life”
      ● High number of Telegraf instances running inside the cluster
      ● For Prometheus metrics there is a better way (I will tell you how later)
      ● Pull vs Push
  • 19. Pull and Push
  • 20. /metrics
      # HELP storage_cache_age_seconds Age in seconds of the current cache (time since last snapshot or initialisation).
      # TYPE storage_cache_age_seconds gauge
      storage_cache_age_seconds{engine_id="0",node_id="0"} 112.999976922
      storage_cache_age_seconds{engine_id="1",node_id="0"} 26.999942596
      storage_cache_age_seconds{engine_id="16",node_id="0"} 188.999943578
      storage_cache_age_seconds{engine_id="17",node_id="0"} 127.999951674
      storage_cache_age_seconds{engine_id="24",node_id="0"} 591.999925169
      storage_cache_age_seconds{engine_id="25",node_id="0"} 444.999924453
      storage_cache_age_seconds{engine_id="32",node_id="0"} 578.999943156
      storage_cache_age_seconds{engine_id="33",node_id="0"} 340.99992462
      storage_cache_age_seconds{engine_id="40",node_id="0"} 427.999951022
      storage_cache_age_seconds{engine_id="41",node_id="0"} 375.999926161
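To make the pull model concrete, here is a small Python sketch of how a scraper might turn a `/metrics` payload like the one on this slide into samples. It is a simplified illustration, not the parser Telegraf or Prometheus uses: the function name `parse_metrics` is hypothetical, comment lines are skipped, escaped characters inside label values are left as-is, and optional timestamps are ignored.

```python
import re

# metric_name{label="value",...} value
_SAMPLE = re.compile(r'^([a-zA-Z_:][a-zA-Z0-9_:]*)(?:\{(.*)\})?\s+(\S+)')
_LABEL = re.compile(r'([a-zA-Z_][a-zA-Z0-9_]*)="((?:[^"\\]|\\.)*)"')


def parse_metrics(text):
    """Return a list of (name, labels, value) tuples from exposition text."""
    samples = []
    for raw in text.splitlines():
        line = raw.strip()
        # Skip blank lines and "# HELP" / "# TYPE" comment lines.
        if not line or line.startswith("#"):
            continue
        match = _SAMPLE.match(line)
        if match is None:
            continue
        name, label_block, value = match.groups()
        labels = dict(_LABEL.findall(label_block or ""))
        samples.append((name, labels, float(value)))
    return samples
```

Each returned tuple maps naturally onto a time series point: the metric name becomes the measurement or field, the labels become tags, and the float becomes the value.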
  • 22. Kubernetes discovery with the Prometheus Plugin
      [[inputs.prometheus]]
        monitor_kubernetes_pods = true
      Enabling this option lets the plugin discover and scrape Kubernetes pods that carry Prometheus annotations:
      • prometheus.io/scrape: Enable scraping for this pod.
      • prometheus.io/scheme: If the metrics endpoint is secured, set this to https and most likely set the TLS config. (default 'http')
      • prometheus.io/path: Override the path for the metrics endpoint on the service. (default '/metrics')
      • prometheus.io/port: Used to override the port. (default 9102)
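The annotation defaults listed on this slide can be sketched in Python as follows. This is an illustrative model of how a scrape URL could be derived from pod annotations, not the plugin's code; the function name `scrape_url` and the assumption that scraping requires the annotation value `"true"` are mine.

```python
def scrape_url(pod_ip, annotations):
    """Derive a scrape URL from pod annotations, or None if not opted in."""
    # Assumption: a pod opts in with prometheus.io/scrape set to "true".
    if annotations.get("prometheus.io/scrape") != "true":
        return None
    # Defaults as listed on the slide.
    scheme = annotations.get("prometheus.io/scheme", "http")
    path = annotations.get("prometheus.io/path", "/metrics")
    port = annotations.get("prometheus.io/port", "9102")
    if not path.startswith("/"):
        path = "/" + path
    return f"{scheme}://{pod_ip}:{port}{path}"
```

For example, a pod annotated only with `prometheus.io/scrape: "true"` would be scraped on port 9102 at `/metrics` over plain HTTP.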
  • 23. Monitor your ingestion pipeline
      • internal_memstats
      • internal_agent
        – metrics_dropped
        – metrics_gathered
      • internal_gather
      • internal_write
  • 24. start = -6h
      interval = 3m
      from(bucket: "kube-infra/monthly")
        |> range(start: start)
        |> filter(fn: (r) => r._measurement == "internal_agent" and r.env == "acc" and r.host =~ /^telegraf-prom-discovery/)
        |> filter(fn: (r) => r._field == "metrics_dropped" or r._field == "metrics_gathered" or r._field == "metrics_written")
        |> window(every: interval)
        |> mean() // defaults to "_value"
        |> group(columns: ["_field"])
        |> derivative(nonNegative: true, timeColumn: "_stop")
  • 25. Monitor your ingestion pipeline
      • You can use inputs.http_response to check whether Telegraf is healthy.
      • You can configure Kubernetes liveness and readiness probes to manage Telegraf availability.
  • 26. ReadinessProbe and LivenessProbe
      LivenessProbe: applications can eventually transition to broken states from which they cannot recover except by being restarted. Kubernetes provides liveness probes to detect and remedy such situations.
      ReadinessProbe: applications can become busy or temporarily unavailable. A pod whose containers report that they are not ready does not receive traffic through Kubernetes Services.
  • 27. Telegraf as Sidecar gives you control
      [[inputs.internal]]
      [[inputs.prometheus]]
        urls = ["http://127.0.0.1:9999/metrics"]
      [[processors.converter]]
        [processors.converter.tags]
          string = ["user_agent"]
      [[outputs.influxdb]]
        urls = ["$MONITOR_HOST"]
        database = "$MONITOR_DATABASE"
        timeout = "5s"
      [[outputs.influxdb_v2]]
        urls = ["http://us-west-2-1.aws.cloud2.influxdata.com"]
        token = "$TOKEN"
        organization = "$ORG"
        bucket = "$BUCKET"
        timeout = "5s"
        namepass = ["internal"]
  • 28. Telegraf Guard Rails
      [[inputs.internal]]
      [[inputs.prometheus]]
        urls = ["http://127.0.0.1:9999/metrics"]
      [[processors.tag_limit]]
        limit = 3
        ## List of tags to preferentially preserve
        keep = ["handler", "method", "status"]
      [[outputs.influxdb]]
        urls = ["$MONITOR_HOST"]
        database = "$MONITOR_DATABASE"
        timeout = "5s"
      [[outputs.influxdb_v2]]
        urls = ["http://us-west-2-1.aws.cloud2.influxdata.com"]
        token = "$TOKEN"
        organization = "$ORG"
        bucket = "$BUCKET"
        timeout = "5s"
        namepass = ["internal"]
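The guard-rail idea behind `limit` and `keep` can be modelled in a few lines of Python. This is a simplified sketch and not Telegraf's implementation: the function name `limit_tags` is hypothetical, and which non-preferred tags survive when there is room left (here, the alphabetically first ones) is an arbitrary choice made for the illustration.

```python
def limit_tags(tags, limit, keep):
    """Return at most `limit` tags, preferring the keys listed in `keep`."""
    if len(tags) <= limit:
        return dict(tags)
    # Preferred tags are retained first; the rest fill any remaining room.
    preferred = [key for key in sorted(tags) if key in keep]
    others = [key for key in sorted(tags) if key not in keep]
    kept = (preferred + others)[:limit]
    return {key: tags[key] for key in sorted(kept)}
```

With `limit = 3` and `keep = ["handler", "method", "status"]` as on the slide, a point that arrives with an extra high-cardinality tag such as `user_agent` would leave the processor with only the three preferred tags, which bounds series cardinality regardless of what the application emits.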
  • 29. Lessons
      Scaling is NOT More Manual Processes
      Scaling is NOT saying “You’re Doing it Wrong”
      Scaling IS Empowering Developers
      Scaling IS Predictability of Failure Modes
  • 30. Lesson
      Architecture is a never-ending story…
      A Telegraf sidecar for your developers writes to the daemonset; the daemonset, run by your ops with safeguards, writes to InfluxDB. Maybe complex, but possible!
  • 31. Your monitoring must be up when you are down
      InfluxDB makes everything simpler, but your monitoring has to notify you when your infrastructure is down, and that is not simple.
      ● Different infrastructure
      ● Reliability team
      ● Redundancy
      ● Or you can use a SaaS (InfluxCloud is 100% compatible with OSS for write/read)
  • 32. Number of pod restarts
      from(bucket: "kube-infra/monthly")
        |> range(start: dashboardTime, stop: upperDashboardTime)
        |> filter(fn: (r) => r._measurement == "kube_pod_container_status_restarts_total" and r._field == "counter" and r.container == "xxxx" and r.namespace == "xxxx")
        |> difference(nonNegative: true)
        |> group()
        |> aggregateWindow(every: autoInterval, fn: sum, createEmpty: false)
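The arithmetic behind this query can be illustrated in Python. The sketch assumes the behaviour I understand `difference(nonNegative: true)` to have, namely that when a counter value drops below the previous one (for example after a container is recreated) the previous value is treated as zero. The function names are hypothetical, and the per-window bucketing done by `aggregateWindow` is left out: the sketch only sums over the whole range.

```python
def non_negative_difference(values):
    """Differences between consecutive counter samples, tolerant of resets."""
    diffs = []
    for previous, current in zip(values, values[1:]):
        # On a reset (current < previous), assume the counter restarted at 0.
        diffs.append(current if current < previous else current - previous)
    return diffs


def total_restarts(series):
    """Sum restart increments across all series (one counter per pod)."""
    return sum(sum(non_negative_difference(values)) for values in series.values())
```

Taking differences per series before merging matters: ungrouping first would interleave counters from different pods and produce meaningless deltas, which is why `group()` comes after `difference()` in the query.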
  • 33. Persistent Volume % usage
      start = -20m
      from(bucket: "kube-infra/monthly")
        |> range(start: start)
        |> filter(fn: (r) => r._measurement == "kubernetes_pod_volume" and (r._field == "used_bytes" or r._field == "capacity_bytes"))
        |> aggregateWindow(every: 5m, fn: mean, createEmpty: false)
        |> pivot(rowKey: ["_time"], columnKey: ["_field"], valueColumn: "_value")
        |> map(fn: (r) => ({_time: r._time, _value: 100.0 * r.used_bytes / r.capacity_bytes}))
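The pivot-then-map step of this query can be sketched in Python to show what it computes. This models a single volume only; the function name `volume_usage_percent` and the `(time, field, value)` row shape are assumptions made for the illustration, and rows missing either field (or with zero capacity) are skipped.

```python
def volume_usage_percent(rows):
    """Pivot (time, field, value) rows by time and compute percent used."""
    by_time = {}
    for time, field, value in rows:
        # Equivalent of pivot(): one record per timestamp, one column per field.
        by_time.setdefault(time, {})[field] = value
    result = []
    for time in sorted(by_time):
        columns = by_time[time]
        # Equivalent of map(): 100 * used / capacity, when both are present.
        if "used_bytes" in columns and columns.get("capacity_bytes"):
            percent = 100.0 * columns["used_bytes"] / columns["capacity_bytes"]
            result.append((time, percent))
    return result
```

The pivot is what makes the division possible: before it, `used_bytes` and `capacity_bytes` arrive as separate rows, so there is no single record holding both values.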