15 Kubernetes failure points you should watch
Jorge Salamero - @bencerillo
About me
Jorge Salamero
Tech Marketing aka container gamer @ Sysdig
github.com/bencer
@bencerillo
OSS fan
Monitoring, containers, IoT/home-automation, cars
Monitoring & Security Platform for Containers
Monitoring 15 Kubernetes failure points
- Apps
- Hosts
- Orchestration
- Containers
- Yourself
https://sysdig.com/blog/monitoring-kubernetes-with-sysdig-cloud/
https://sysdig.com/blog/alerting-kubernetes/
The holy service metrics
- KPI / biz metrics / synthetic monitoring / user metrics
- Google SRE book: “The Four Golden Signals”
  Latency + Traffic + Errors + Saturation
USE method
- Utilization
(how busy we are, close to 100% bottleneck)
- Saturation
(amount of work waiting on the queue)
- Errors
RED method
- Request Rate
- Request Errors
- Request Duration
The holy service metrics
- Code instrumentation (statsd, JMX or Prometheus metrics):
// Histogram of HTTP request durations, labeled by method, route and status code.
var httpDurationsHistogram = prometheus.NewHistogramVec(prometheus.HistogramOpts{
    Name:    "http_durations_histogram_seconds",
    Help:    "Seconds spent serving HTTP requests.",
    Buckets: prometheus.DefBuckets,
}, []string{"method", "route", "status_code"})
// Register once at startup, e.g. from init() or main().
prometheus.MustRegister(httpDurationsHistogram)
- or Sysdig autodiscovery ;-)
1. connections per second
net.request.count
2. response time
net.response.time
3. errors
net.request.error.count
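For the code-instrumentation route, the histogram above still needs something that feeds it one observation per request. A minimal sketch of such a middleware follows, using only the standard library. The observeFunc callback, the instrument helper and the "/cart" route are hypothetical names for this example; with the Prometheus Go client the callback would typically call httpDurationsHistogram.WithLabelValues(method, route, statusCode).Observe(seconds).

```go
package main

import (
	"fmt"
	"net/http"
	"net/http/httptest"
	"strconv"
	"time"
)

// observeFunc receives the three label values plus the elapsed seconds.
// It stands in for the histogram's Observe call (an assumption of this sketch).
type observeFunc func(method, route, statusCode string, seconds float64)

// statusRecorder remembers the status code written by the wrapped handler.
type statusRecorder struct {
	http.ResponseWriter
	status int
}

func (r *statusRecorder) WriteHeader(code int) {
	r.status = code
	r.ResponseWriter.WriteHeader(code)
}

// instrument times every request handled by next and reports it via observe.
func instrument(route string, observe observeFunc, next http.Handler) http.Handler {
	return http.HandlerFunc(func(w http.ResponseWriter, req *http.Request) {
		start := time.Now()
		rec := &statusRecorder{ResponseWriter: w, status: http.StatusOK}
		next.ServeHTTP(rec, req)
		observe(req.Method, route, strconv.Itoa(rec.status), time.Since(start).Seconds())
	})
}

// demo sends one request through the middleware and returns the labels seen.
func demo() string {
	var labels string
	observe := func(method, route, statusCode string, seconds float64) {
		labels = fmt.Sprintf("%s %s %s", method, route, statusCode)
	}
	handler := instrument("/cart", observe, http.HandlerFunc(
		func(w http.ResponseWriter, _ *http.Request) {
			w.WriteHeader(http.StatusNotFound)
		}))
	handler.ServeHTTP(httptest.NewRecorder(), httptest.NewRequest(http.MethodGet, "/cart", nil))
	return labels
}

func main() {
	fmt.Println(demo()) // prints "GET /cart 404"
}
```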
Prometheus + Grafana UI
Kubernetes orchestration
Kubernetes hierarchy
Services vs hosts+containers
Kubernetes metadata: labels
Pod
app: shopping
tier: api
Pod
app: shopping
tier: db
Pod
app: social
tier: api
role: search
Pod
app: social
tier: api
role: search
Leverage metadata (by service)
Leverage metadata (by pod)
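To make the label idea concrete, here is a small sketch of equality-based label selection over the four pods shown above. The pod names and the helper functions are hypothetical; only the label keys and values come from the slide.

```go
package main

import (
	"fmt"
	"sort"
)

// pod is a hypothetical, stripped-down view of a Kubernetes pod.
type pod struct {
	name   string
	labels map[string]string
}

// matches reports whether labels contains every key/value pair of selector.
func matches(labels, selector map[string]string) bool {
	for key, want := range selector {
		if got, ok := labels[key]; !ok || got != want {
			return false
		}
	}
	return true
}

// selectPods returns the sorted names of the pods matching selector.
func selectPods(pods []pod, selector map[string]string) []string {
	var names []string
	for _, p := range pods {
		if matches(p.labels, selector) {
			names = append(names, p.name)
		}
	}
	sort.Strings(names)
	return names
}

// examplePods mirrors the labels on the slide; the names are made up.
func examplePods() []pod {
	return []pod{
		{"shopping-api", map[string]string{"app": "shopping", "tier": "api"}},
		{"shopping-db", map[string]string{"app": "shopping", "tier": "db"}},
		{"social-search-1", map[string]string{"app": "social", "tier": "api", "role": "search"}},
		{"social-search-2", map[string]string{"app": "social", "tier": "api", "role": "search"}},
	}
}

// report formats the result of one selection.
func report(selector map[string]string) string {
	return fmt.Sprintf("%v -> %v", selector, selectPods(examplePods(), selector))
}

func main() {
	fmt.Println(report(map[string]string{"app": "social", "tier": "api"}))
	fmt.Println(report(map[string]string{"tier": "api"}))
}
```

Aggregating metrics by service or by pod is the same operation: pick a set of labels, then group whatever matches.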
Health vs state monitoring
- Health:
  - CPU, memory, disk
  - connections, response time, errors
- State (orchestration):
  - Are containers up and running properly?
- kube-state-metrics: calculates new metrics based on the state of Kubernetes resources
  https://github.com/kubernetes/kube-state-metrics
  https://sysdig.com/blog/introducing-kube-state-metrics/
Container scheduling
- Need to deploy a container:
  - given the requirements, where can we run it?
and let’s ignore affinity, taints and tolerations:
https://sysdig.com/blog/kubernetes-scheduler/
- capacity planning
4. node availability
Based on the host or the kubelet component status:
kube_node_status_condition{condition="Ready",status="true"} == 0
count(kube_node_status_condition{condition="Ready",status="true"} == 0) > 1 and
(count(kube_node_status_condition{condition="Ready",status="true"} == 0) /
count(kube_node_status_condition{condition="Ready",status="true"})) > 0.2
count(up{job="kubelet"} == 0) / count(up{job="kubelet"}) * 100 > 3
kube_node_status_condition: kube_node_status_ready,
kube_node_status_out_of_disk, kube_node_status_memory_pressure,
kube_node_status_disk_pressure, and kube_node_status_network_unavailable
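The second expression above combines an absolute and a relative threshold: more than one node not Ready and more than 20% of all nodes not Ready. A minimal sketch of that condition, with hypothetical node names and helper functions:

```go
package main

import "fmt"

// notReadyAlert mirrors the combined condition: it fires only when more than
// one node is not Ready AND those nodes are more than 20% of the cluster.
func notReadyAlert(ready map[string]bool) bool {
	if len(ready) == 0 {
		return false
	}
	notReady := 0
	for _, isReady := range ready {
		if !isReady {
			notReady++
		}
	}
	fraction := float64(notReady) / float64(len(ready))
	return notReady > 1 && fraction > 0.2
}

// cluster builds a hypothetical cluster of total nodes where the first
// down nodes are not Ready.
func cluster(total, down int) map[string]bool {
	nodes := make(map[string]bool, total)
	for i := 0; i < total; i++ {
		nodes[fmt.Sprintf("node-%d", i)] = i >= down
	}
	return nodes
}

func main() {
	fmt.Println(notReadyAlert(cluster(3, 1)))  // one node down: false
	fmt.Println(notReadyAlert(cluster(12, 2))) // two down, about 17%: false
	fmt.Println(notReadyAlert(cluster(8, 2)))  // two down, 25%: true
}
```

The absolute threshold keeps a small cluster from alerting on a single node, while the relative one keeps a large cluster from alerting on routine churn.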
Sysdig alert UI
Container resource requirements
resources:
  requests:
    memory: "256Mi"
    cpu: "250m"
  limits:
    memory: "512Mi"
    cpu: "500m"
https://github.com/kubernetes-incubator/cluster-capacity
5. CPU resources
6. memory resources
kube_node_status_capacity_pods
kube_node_status_allocatable_pods
kube_node_status_capacity_cpu_cores
kube_node_status_capacity_memory_bytes
kube_node_status_allocatable_cpu_cores
kube_node_status_allocatable_memory_bytes
capacity - used (by OS and kube services) = allocatable
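The arithmetic behind "capacity - used = allocatable" can be sketched as below, together with a rough answer to "how many more pods with this request fit on the node?". The node size and the reserved amounts are invented for the example; the 250m / 256Mi request matches the resource spec shown earlier. Scheduling decisions are based on requests, not limits.

```go
package main

import "fmt"

// resources holds CPU in millicores and memory in bytes.
type resources struct {
	milliCPU    int64
	memoryBytes int64
}

const mi = int64(1) << 20 // one mebibyte

// allocatable subtracts what the OS and Kubernetes services reserve.
func allocatable(capacity, reserved resources) resources {
	return resources{
		milliCPU:    capacity.milliCPU - reserved.milliCPU,
		memoryBytes: capacity.memoryBytes - reserved.memoryBytes,
	}
}

// maxReplicas returns how many pods with the given request fit into alloc.
// Both request values must be positive in this simplified sketch.
func maxReplicas(alloc, request resources) int64 {
	if request.milliCPU <= 0 || request.memoryBytes <= 0 {
		return 0
	}
	byCPU := alloc.milliCPU / request.milliCPU
	byMemory := alloc.memoryBytes / request.memoryBytes
	if byCPU < byMemory {
		return byCPU
	}
	return byMemory
}

// report runs the example: a hypothetical 2-core, 4Gi node.
func report() string {
	capacity := resources{2000, 4096 * mi}
	reserved := resources{200, 512 * mi} // assumed OS + kube services usage
	request := resources{250, 256 * mi}  // cpu: "250m", memory: "256Mi"
	alloc := allocatable(capacity, reserved)
	return fmt.Sprintf("allocatable: %dm CPU, %dMi memory; fits %d pods",
		alloc.milliCPU, alloc.memoryBytes/mi, maxReplicas(alloc, request))
}

func main() {
	fmt.Println(report()) // prints "allocatable: 1800m CPU, 3584Mi memory; fits 7 pods"
}
```

Here CPU is the binding constraint (7 pods) even though memory alone would allow 14, which is why both resources need watching.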
Container disk requirements
here things get more complicated...
- ephemeral disk usage
- persistent volumes claims
7. disk resources
predict_linear(node_filesystem_free[30m], 3600 * 2) < 0
kube_node_status_condition: kube_node_status_out_of_disk
but within containers this is still WIP, at least as of Kubernetes 1.8:
container_fs_* doesn’t work with PV
https://github.com/kubernetes/kubernetes/pull/59170
https://github.com/kubernetes/kubernetes/pull/51553
https://kubernetes.io/docs/concepts/cluster-administration/controller-metrics/
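The predict_linear expression above fits a straight line to the last 30 minutes of free-space samples and asks whether that line drops below zero two hours from now. A minimal sketch of the idea, using ordinary least squares, follows. It extrapolates from the last sample rather than from the query evaluation time, and the sample values are invented, so treat it as an approximation of the concept rather than a reimplementation of Prometheus.

```go
package main

import "fmt"

// sample is one observation: t in seconds, v in MiB of free disk space.
type sample struct {
	t, v float64
}

// predictLinear fits v = intercept + slope*t by least squares and returns the
// predicted value horizon seconds after the last sample. ok is false when
// there are fewer than two samples or all samples share the same timestamp.
func predictLinear(samples []sample, horizon float64) (value float64, ok bool) {
	if len(samples) < 2 {
		return 0, false
	}
	n := float64(len(samples))
	var sumT, sumV, sumTV, sumTT float64
	for _, s := range samples {
		sumT += s.t
		sumV += s.v
		sumTV += s.t * s.v
		sumTT += s.t * s.t
	}
	denom := n*sumTT - sumT*sumT
	if denom == 0 {
		return 0, false
	}
	slope := (n*sumTV - sumT*sumV) / denom
	intercept := (sumV - slope*sumT) / n
	last := samples[len(samples)-1].t
	return intercept + slope*(last+horizon), true
}

// report predicts free space two hours ahead for a hypothetical filling disk.
func report() string {
	samples := []sample{{0, 4000}, {600, 3700}, {1200, 3400}, {1800, 3100}}
	predicted, ok := predictLinear(samples, 3600*2)
	if !ok {
		return "not enough data"
	}
	return fmt.Sprintf("predicted free space in 2h: %.0f MiB, alert: %t", predicted, predicted < 0)
}

func main() {
	fmt.Println(report())
}
```

With these samples the disk loses 300 MiB every 10 minutes, so the fitted line crosses zero well inside the two-hour horizon and the alert condition holds.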
Container orchestration
- ReplicationController
- ReplicaSet
- Deployment
- DaemonSet
- StatefulSet
Kubernetes deployments
Is Kubernetes doing what it is
supposed to do?
Orchestration needs monitoring too.
8. running instances
9. desired instances
((kube_deployment_status_replicas_updated != kube_deployment_spec_replicas)
or
(kube_deployment_status_replicas_available != kube_deployment_spec_replicas))
10. deployment updates glitches
kube_deployment_status_observed_generation !=
kube_deployment_metadata_generation
kube_deployment_spec_paused
kube_deployment_spec_strategy_rollingupdate_max_unavailable
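The deployment checks above boil down to comparing what was asked for with what is currently observed. A small sketch of those comparisons, where the deploymentState struct and its field names are hypothetical stand-ins for the corresponding kube-state-metrics series:

```go
package main

import "fmt"

// deploymentState is a hypothetical snapshot of the values exposed by the
// kube_deployment_* metrics referenced above.
type deploymentState struct {
	specReplicas       int
	updatedReplicas    int
	availableReplicas  int
	observedGeneration int64
	metadataGeneration int64
}

// findings lists every mismatch between desired and observed state.
func findings(d deploymentState) []string {
	var out []string
	if d.updatedReplicas != d.specReplicas {
		out = append(out, fmt.Sprintf("updated replicas: %d, desired: %d",
			d.updatedReplicas, d.specReplicas))
	}
	if d.availableReplicas != d.specReplicas {
		out = append(out, fmt.Sprintf("available replicas: %d, desired: %d",
			d.availableReplicas, d.specReplicas))
	}
	if d.observedGeneration != d.metadataGeneration {
		out = append(out, fmt.Sprintf("observed generation %d lags behind %d",
			d.observedGeneration, d.metadataGeneration))
	}
	return out
}

func main() {
	healthy := deploymentState{3, 3, 3, 7, 7}
	stuck := deploymentState{3, 1, 2, 7, 8}
	fmt.Println(len(findings(healthy))) // 0
	for _, f := range findings(stuck) {
		fmt.Println(f)
	}
}
```

A mismatch is normal for a short time during a rollout, so an alert on these conditions would usually require them to hold for a while before firing.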
Container lifecycle state
https://kubernetes.io/docs/concepts/workloads/pods/pod-lifecycle/
Liveness probes
To know when to restart a container:
livenessProbe:
  httpGet:
    path: /healthz
    port: 8080
    httpHeaders:
    - name: X-Custom-Header
      value: Awesome
  initialDelaySeconds: 3
  periodSeconds: 3
Readiness probes
To know when a container is ready to start accepting traffic:
readinessProbe:
  exec:
    command:
    - cat
    - /tmp/healthy
  initialDelaySeconds: 5
  periodSeconds: 5
11. pod status
kube_pod_status_phase: Pending|Running|Succeeded|Failed|Unknown
kube_pod_status_ready
kube_pod_status_scheduled
kube_pod_container_status_waiting
kube_pod_container_status_running
kube_pod_container_status_terminated
kube_pod_container_status_ready
12. pod restarts
You can look at this as a metric or as an event:
ALERT PodRestartingTooMuch
  IF rate(k8s_pod_status_restartCount[1m]) > 1/(5*60)
  FOR 1h
  LABELS { severity="warning" }
  ANNOTATIONS {
    summary = "Pod {{$labels.namespace}}/{{$labels.name}} restarting too much.",
    description = "Pod {{$labels.namespace}}/{{$labels.name}} restarting too much.",
  }
CrashLoopBackOff event
https://sysdig.com/blog/debug-kubernetes-crashloopbackoff/
Sysdig Inspect
https://github.com/draios/sysdig-inspect
Kubernetes internals
- APIserver
- KubeDNS / Istio
- container registry
- any other piece of Kubernetes
https://sysdig.com/blog/monitor-etcd/
13. APIserver
rate(apiserver_request_count{code=~"^(?:5..)$"}[5m]) /
rate(apiserver_request_count[5m]) * 100 > 5
apiserver_latency_seconds:quantile{quantile="0.99",subresource!="log",verb!~"^(?:WATCH|WATCHLIST|PROXY|CONNECT)$"} > 4
Or just do Golden signals on APIserver endpoint too :-)
14. KubeDNS / Istio
histogram_quantile(0.95,
sum(rate(kubedns_probe_kubedns_latency_ms_bucket[1m])) BY (le,
kubernetes_pod_name)) > 1000
All export native metrics in Prometheus format, just scrape them!
https://sysdig.com/blog/monitor-istio/
What are we deploying?
- CI/CD and commits
- Manual deploys
You need to validate what you tell Kubernetes too!
15. monitor your commands
kubeval: validates YAML and JSON config files
https://github.com/garethr/kubeval
kubediff: shows differences between running state and version-controlled configuration
https://github.com/weaveworks/kubediff
Configuration reconciliation discussion:
https://github.com/kubernetes/kubernetes/issues/1702
Although this is getting automated too:
https://sysdig.com/blog/kubernetes-scaler/
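The idea behind comparing running state with version-controlled configuration can be sketched as a plain key-by-key diff. This is not how kubediff is implemented; the flat key/value representation, the field names and the messages are assumptions chosen to keep the example short.

```go
package main

import (
	"fmt"
	"sort"
)

// diff compares the desired configuration (from version control) with the
// running one and returns a sorted list of human-readable differences.
func diff(desired, running map[string]string) []string {
	var out []string
	for key, want := range desired {
		got, ok := running[key]
		switch {
		case !ok:
			out = append(out, fmt.Sprintf("%s: want %q, missing in cluster", key, want))
		case got != want:
			out = append(out, fmt.Sprintf("%s: want %q, running %q", key, want, got))
		}
	}
	for key, got := range running {
		if _, ok := desired[key]; !ok {
			out = append(out, fmt.Sprintf("%s: running %q, not in version control", key, got))
		}
	}
	sort.Strings(out)
	return out
}

func main() {
	// Hypothetical flattened fields of one deployment.
	desired := map[string]string{"image": "shop:1.4", "replicas": "3"}
	running := map[string]string{"image": "shop:1.3", "replicas": "3", "debug": "true"}
	for _, line := range diff(desired, running) {
		fmt.Println(line)
	}
}
```

Differences in either direction matter: a field that only exists in the cluster usually points to a manual change that never made it into version control.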
Recap
1. connections per second
2. response time
3. errors
4. node availability
5. CPU resources
6. memory resources
7. disk and external resources
Recap (2)
8. running instances
9. desired instances
10. deployment updates glitches
Recap (3)
11. pod status
12. pod restarts
13. APIserver health
14. KubeDNS / Istio health
15. monitor your commands
Thank you!
Jorge Salamero - @bencerillo
https://sysdig.com/blog/

Applications of artificial Intelligence in Mechanical Engineering.pdf
 
哪里办理(csu毕业证书)查尔斯特大学毕业证硕士学历原版一模一样
哪里办理(csu毕业证书)查尔斯特大学毕业证硕士学历原版一模一样哪里办理(csu毕业证书)查尔斯特大学毕业证硕士学历原版一模一样
哪里办理(csu毕业证书)查尔斯特大学毕业证硕士学历原版一模一样
 
CompEx~Manual~1210 (2).pdf COMPEX GAS AND VAPOURS
CompEx~Manual~1210 (2).pdf COMPEX GAS AND VAPOURSCompEx~Manual~1210 (2).pdf COMPEX GAS AND VAPOURS
CompEx~Manual~1210 (2).pdf COMPEX GAS AND VAPOURS
 
Certificates - Mahmoud Mohamed Moursi Ahmed
Certificates - Mahmoud Mohamed Moursi AhmedCertificates - Mahmoud Mohamed Moursi Ahmed
Certificates - Mahmoud Mohamed Moursi Ahmed
 
integral complex analysis chapter 06 .pdf
integral complex analysis chapter 06 .pdfintegral complex analysis chapter 06 .pdf
integral complex analysis chapter 06 .pdf
 
Redefining brain tumor segmentation: a cutting-edge convolutional neural netw...
Redefining brain tumor segmentation: a cutting-edge convolutional neural netw...Redefining brain tumor segmentation: a cutting-edge convolutional neural netw...
Redefining brain tumor segmentation: a cutting-edge convolutional neural netw...
 

15 Kubernetes failure points you should watch

  • 1. 15 Kubernetes failure points you should watch - Jorge Salamero - @bencerillo
  • 2. About me - Jorge Salamero - Tech Marketing aka container gamer @ Sysdig - github.com/bencer - @bencerillo - OSS fan - Monitoring, containers, IoT/home-automation, cars
  • 3. Monitoring & Security Platform for Containers
  • 4. Monitoring 15 Kubernetes failure points - Apps - Hosts - Orchestration - Containers - Yourself https://sysdig.com/blog/monitoring-kubernetes-with-sysdig-cloud/ https://sysdig.com/blog/alerting-kubernetes/
  • 5. The holy service metrics - KPI / biz metrics / synthetic monitoring / user metrics - Google SRE book: “The Four Golden Signals” Latency+Traffic+Errors+Saturation
  • 6. USE method - Utilization (how busy we are, close to 100% bottleneck) - Saturation (amount of work waiting on the queue) - Errors
  • 7. RED method - Request Rate - Request Errors - Request Duration
  • 8. The holy service metrics - Code instrumentation (statsd, JMX or Prometheus metrics):
      var httpDurationsHistogram = prometheus.NewHistogramVec(prometheus.HistogramOpts{
          Name:    "http_durations_histogram_seconds",
          Help:    "Seconds spent serving HTTP requests.",
          Buckets: prometheus.DefBuckets,
      }, []string{"method", "route", "status_code"})
      prometheus.MustRegister(httpDurationsHistogram)
      - or Sysdig autodiscovery ;-)
  • 9. 1. connections per second net.request.count 2. response time net.response.time 3. errors net.request.error.count
  • 14. Kubernetes metadata: labels Pod app: shopping tier: api Pod app: shopping tier: db Pod app: social tier: api role: search Pod app: social tier: api role: search
  • 17. Health vs state monitoring - Health: - CPU, memory, disk - connections, response time, errors
  • 18. Health vs state monitoring - State (orchestration): - Are containers up and running properly?
  • 19. Health vs state monitoring - kube-state-metrics https://github.com/kubernetes/kube-state-metrics https://sysdig.com/blog/introducing-kube-state-metrics/ calculate new metrics based on the state of Kubernetes resources
  • 20. Container scheduling - Need to deploy a container: - given the requirements, where can we run it? and let’s ignore affinity, taints and tolerations: https://sysdig.com/blog/kubernetes-scheduler/ - capacity planning
  • 21. 4. node availability - Based on the host or the kubelet component status:
      kube_node_status_condition{condition="Ready",status="true"} == 0
      count(kube_node_status_condition{condition="Ready",status="true"} == 0) > 1 and (count(kube_node_status_condition{condition="Ready",status="true"} == 0) / count(kube_node_status_condition{condition="Ready",status="true"})) > 0.2
      count(up{job="kubelet"} == 0) / count(up{job="kubelet"}) * 100 > 3
      kube_node_status_condition: kube_node_status_ready, kube_node_status_out_of_disk, kube_node_status_memory_pressure, kube_node_status_disk_pressure, and kube_node_status_network_unavailable
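The combined rule on this slide fires only when more than one node is not Ready and the not-Ready nodes are more than 20% of the cluster, which avoids paging for a single node in a large cluster. A minimal sketch of that logic, with the hypothetical function name `NodesUnavailableAlert` and a plain slice of Ready flags standing in for the metric samples:

```go
package main

import "fmt"

// NodesUnavailableAlert mirrors the combined condition from the slide:
// more than one node not Ready AND more than 20% of all nodes not Ready.
// ready[i] is true when node i reports condition Ready with status true.
func NodesUnavailableAlert(ready []bool) bool {
	if len(ready) == 0 {
		return false
	}
	notReady := 0
	for _, ok := range ready {
		if !ok {
			notReady++
		}
	}
	ratio := float64(notReady) / float64(len(ready))
	return notReady > 1 && ratio > 0.2
}

func main() {
	// 3 of 10 nodes down: both conditions hold.
	fmt.Println(NodesUnavailableAlert([]bool{false, false, false, true, true, true, true, true, true, true}))
	// 1 of 3 nodes down: ratio is high, but only one node is affected.
	fmt.Println(NodesUnavailableAlert([]bool{false, true, true}))
}
```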
  • 23. Container resource requirements
      resources:
        requests:
          memory: "256Mi"
          cpu: "250m"
        limits:
          memory: "512Mi"
          cpu: "500m"
      https://github.com/kubernetes-incubator/cluster-capacity
  • 24. 5. CPU resources 6. memory resources kube_node_status_capacity_pods kube_node_status_allocatable_pods kube_node_status_capacity_cpu_cores kube_node_status_capacity_memory_bytes kube_node_status_allocatable_cpu_cores kube_node_status_allocatable_memory_bytes capacity - used (by OS and kube services) = allocatable
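The relation "capacity - used = allocatable" is what decides whether another container can be placed on a node: the scheduler compares the sum of container requests (not limits) against the node's allocatable resources. A minimal sketch of that arithmetic follows; `Resources`, `Allocatable` and `Fits` are hypothetical names, and the node sizes in `main` are made-up example values.

```go
package main

import "fmt"

// Mi is one mebibyte in bytes.
const Mi = int64(1) << 20

// Resources is a CPU and memory amount, in millicores and bytes.
type Resources struct {
	MilliCPU int64
	MemBytes int64
}

// Allocatable applies the slide's formula: capacity minus what the OS
// and Kubernetes services reserve.
func Allocatable(capacity, reserved Resources) Resources {
	return Resources{
		MilliCPU: capacity.MilliCPU - reserved.MilliCPU,
		MemBytes: capacity.MemBytes - reserved.MemBytes,
	}
}

// Fits reports whether pod's requests fit on a node whose running pods
// already request the amounts in requested.
func Fits(allocatable Resources, requested []Resources, pod Resources) bool {
	sum := pod
	for _, r := range requested {
		sum.MilliCPU += r.MilliCPU
		sum.MemBytes += r.MemBytes
	}
	return sum.MilliCPU <= allocatable.MilliCPU && sum.MemBytes <= allocatable.MemBytes
}

func main() {
	// Example node: 4 cores / 8Gi capacity, 200m / 512Mi reserved.
	node := Allocatable(
		Resources{MilliCPU: 4000, MemBytes: 8192 * Mi},
		Resources{MilliCPU: 200, MemBytes: 512 * Mi},
	)
	// Requests taken from the resources example: 250m CPU, 256Mi memory.
	pod := Resources{MilliCPU: 250, MemBytes: 256 * Mi}

	running := make([]Resources, 14)
	for i := range running {
		running[i] = pod
	}
	fmt.Println("15th pod fits:", Fits(node, running, pod))
	running = append(running, pod)
	fmt.Println("16th pod fits:", Fits(node, running, pod))
}
```

With these numbers CPU runs out before memory does, which is the kind of imbalance capacity planning is meant to surface.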
  • 25. Container disk requirements here things get more complicated... - ephemeral disk usage - persistent volumes claims
  • 26. 7. disk resources
      predict_linear(node_filesystem_free[30m], 3600 * 2) < 0
      kube_node_status_condition: kube_node_status_out_of_disk
      but within containers this is still WIP, at least as of Kubernetes 1.8: container_fs_* doesn’t work with PV
      https://github.com/kubernetes/kubernetes/pull/59170
      https://github.com/kubernetes/kubernetes/pull/51553
      https://kubernetes.io/docs/concepts/cluster-administration/controller-metrics/
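The `predict_linear` rule above asks whether free disk space, extrapolated from the last 30 minutes, reaches zero within two hours. The idea behind it is a least-squares line fit, sketched below. This is a simplification under stated assumptions: the function and type names are hypothetical, and the prediction is made relative to the last sample rather than to the rule's evaluation time.

```go
package main

import "fmt"

// Sample is one (timestamp, value) point; T is in seconds.
type Sample struct {
	T float64
	V float64
}

// PredictLinear fits a least-squares line through samples and returns
// the value predicted `ahead` seconds after the last sample.
// The boolean is false when a line cannot be fitted.
func PredictLinear(samples []Sample, ahead float64) (float64, bool) {
	if len(samples) < 2 {
		return 0, false
	}
	n := float64(len(samples))
	var sumT, sumV, sumTV, sumTT float64
	for _, s := range samples {
		sumT += s.T
		sumV += s.V
		sumTV += s.T * s.V
		sumTT += s.T * s.T
	}
	den := n*sumTT - sumT*sumT
	if den == 0 {
		return 0, false // all samples share one timestamp
	}
	slope := (n*sumTV - sumT*sumV) / den
	intercept := (sumV - slope*sumT) / n
	last := samples[len(samples)-1].T
	return intercept + slope*(last+ahead), true
}

func main() {
	// Free space (arbitrary units) sampled every 10 minutes for 30 minutes.
	free := []Sample{{0, 1000}, {600, 900}, {1200, 800}, {1800, 700}}
	predicted, ok := PredictLinear(free, 3600*2)
	fmt.Println(predicted, ok, "alert:", ok && predicted < 0)
}
```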
  • 27. Container orchestration - ReplicationController - ReplicaSet - Deployment - DaemonSet - StatefulSet
  • 28. Kubernetes deployments - Is Kubernetes doing what it is supposed to do? Orchestration needs monitoring too.
  • 30. 9. desired instances ((kube_deployment_status_replicas_updated != kube_deployment_spec_replicas) or (kube_deployment_status_replicas_available != kube_deployment_spec_replicas))
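The expression above compares updated and available replicas with the replicas the Deployment spec asks for. Such a mismatch is expected for a short time during a rollout, so a check usually wants it to persist before alerting. A minimal sketch, assuming a hypothetical `DeploymentStatus` struct filled from those three gauges and a hypothetical `StuckFor` helper:

```go
package main

import "fmt"

// DeploymentStatus holds the three values compared on the slide.
type DeploymentStatus struct {
	SpecReplicas      int // kube_deployment_spec_replicas
	UpdatedReplicas   int // kube_deployment_status_replicas_updated
	AvailableReplicas int // kube_deployment_status_replicas_available
}

// Mismatch reports whether updated or available replicas differ
// from the desired count.
func (d DeploymentStatus) Mismatch() bool {
	return d.UpdatedReplicas != d.SpecReplicas ||
		d.AvailableReplicas != d.SpecReplicas
}

// StuckFor reports whether the last n observations all show a mismatch.
func StuckFor(history []DeploymentStatus, n int) bool {
	if n <= 0 || len(history) < n {
		return false
	}
	for _, d := range history[len(history)-n:] {
		if !d.Mismatch() {
			return false
		}
	}
	return true
}

func main() {
	history := []DeploymentStatus{
		{SpecReplicas: 3, UpdatedReplicas: 3, AvailableReplicas: 3},
		{SpecReplicas: 3, UpdatedReplicas: 1, AvailableReplicas: 3},
		{SpecReplicas: 3, UpdatedReplicas: 3, AvailableReplicas: 2},
	}
	fmt.Println(StuckFor(history, 2), StuckFor(history, 3))
}
```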
  • 31. 10. deployment updates glitches kube_deployment_status_observed_generation != kube_deployment_metadata_generation kube_deployment_spec_paused kube_deployment_spec_strategy_rollingupdate_max_unavailable
  • 33. Liveness probes - To know when to restart a container:
      livenessProbe:
        httpGet:
          path: /healthz
          port: 8080
          httpHeaders:
          - name: X-Custom-Header
            value: Awesome
        initialDelaySeconds: 3
        periodSeconds: 3
  • 34. Readiness probes - To know when a container is ready to start accepting traffic:
      readinessProbe:
        exec:
          command:
          - cat
          - /tmp/healthy
        initialDelaySeconds: 5
        periodSeconds: 5
  • 35. 11. pod status kube_pod_status_phase: Pending|Running|Succeeded|Failed|Unknown kube_pod_status_ready kube_pod_status_scheduled kube_pod_container_status_waiting kube_pod_container_status_running kube_pod_container_status_terminated kube_pod_container_status_ready
  • 36. 12. pod restarts - You can look at this as a metric or as an event:
      ALERT PodRestartingTooMuch
        IF rate(k8s_pod_status_restartCount[1m]) > 1/(5*60)
        FOR 1h
        LABELS { severity="warning" }
        ANNOTATIONS {
          summary = "Pod {{$labels.namespace}}/{{$labels.name}} restarting too much.",
          description = "Pod {{$labels.namespace}}/{{$labels.name}} restarting too much.",
        }
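The threshold `1/(5*60)` in this rule is a per-second rate, equal to one restart every five minutes. A rough sketch of the comparison, assuming hypothetical names (`CounterSample`, `RestartRate`, `RestartingTooMuch`) and ignoring the counter-reset handling and extrapolation that a real `rate()` performs:

```go
package main

import "fmt"

// CounterSample is one reading of a restart counter; T is in seconds.
type CounterSample struct {
	T     float64
	Count float64
}

// restartThreshold is one restart per five minutes, in restarts/second.
const restartThreshold = 1.0 / (5 * 60)

// RestartRate returns restarts per second between the first and last
// sample in the window. Counter resets are not handled in this sketch.
func RestartRate(window []CounterSample) float64 {
	if len(window) < 2 {
		return 0
	}
	first, last := window[0], window[len(window)-1]
	dt := last.T - first.T
	if dt <= 0 {
		return 0
	}
	return (last.Count - first.Count) / dt
}

// RestartingTooMuch applies the slide's threshold to one window.
func RestartingTooMuch(window []CounterSample) bool {
	return RestartRate(window) > restartThreshold
}

func main() {
	// One restart within a one-minute window: 1/60 per second.
	fmt.Println(RestartingTooMuch([]CounterSample{{0, 3}, {60, 4}}))
	// No restarts in the window.
	fmt.Println(RestartingTooMuch([]CounterSample{{0, 3}, {60, 3}}))
}
```

The `FOR 1h` clause in the rule then requires this condition to hold continuously for an hour before the alert fires.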
  • 39. Kubernetes internals - APIserver - KubeDNS / Istio - container registry - any other piece of Kubernetes https://sysdig.com/blog/monitor-etcd/
  • 40. 13. APIserver
      rate(apiserver_request_count{code=~"^(?:5..)$"}[5m]) / rate(apiserver_request_count[5m]) * 100 > 5
      apiserver_latency_seconds:quantile{quantile="0.99",subresource!="log",verb!~"^(?:WATCH|WATCHLIST|PROXY|CONNECT)$"} > 4
      Or just do Golden signals on APIserver endpoint too :-)
  • 41. 14. KubeDNS / Istio histogram_quantile(0.95, sum(rate(kubedns_probe_kubedns_latency_ms_bucket[1m])) BY (le, kubernetes_pod_name)) > 1000 All export native metrics in Prometheus format, just scrape them! https://sysdig.com/blog/monitor-istio/
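The `histogram_quantile(0.95, ...)` call above estimates a percentile from cumulative bucket counts rather than from raw latencies. A minimal sketch of that estimation with linear interpolation inside the matching bucket follows. `Bucket` and `HistogramQuantile` are hypothetical names, and the sketch assumes finite upper bounds sorted in ascending order, so it does not cover the `+Inf` bucket that real Prometheus histograms carry.

```go
package main

import "fmt"

// Bucket is a cumulative histogram bucket: Count is the number of
// observations less than or equal to UpperBound.
type Bucket struct {
	UpperBound float64
	Count      float64
}

// HistogramQuantile estimates the q-quantile (0 <= q <= 1) from cumulative
// buckets sorted by ascending upper bound, interpolating linearly within
// the bucket that contains the target rank. The lowest bucket is assumed
// to start at 0.
func HistogramQuantile(q float64, buckets []Bucket) (float64, bool) {
	if q < 0 || q > 1 || len(buckets) == 0 {
		return 0, false
	}
	total := buckets[len(buckets)-1].Count
	if total == 0 {
		return 0, false
	}
	rank := q * total
	lower, prevCount := 0.0, 0.0
	for _, b := range buckets {
		if b.Count >= rank {
			inBucket := b.Count - prevCount
			if inBucket == 0 {
				return lower, true
			}
			return lower + (b.UpperBound-lower)*(rank-prevCount)/inBucket, true
		}
		lower, prevCount = b.UpperBound, b.Count
	}
	return buckets[len(buckets)-1].UpperBound, true
}

func main() {
	// Latency buckets in milliseconds with cumulative counts.
	buckets := []Bucket{{100, 50}, {500, 90}, {1000, 100}}
	p95, ok := HistogramQuantile(0.95, buckets)
	fmt.Println(p95, ok, "alert:", ok && p95 > 1000)
}
```

Because the result is interpolated, its accuracy depends on how closely the bucket boundaries bracket the threshold being alerted on.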
  • 42. What are we deploying? - CI/CD and commits - Manual deploys You need to validate what you tell Kubernetes too!
  • 43. 15. monitor your commands
      kubeval: validates YAML and JSON config files https://github.com/garethr/kubeval
      kubediff: shows differences between running state and version-controlled configuration https://github.com/weaveworks/kubediff
      Configuration reconciliation discussion: https://github.com/kubernetes/kubernetes/issues/1702
      Although this is getting automated too: https://sysdig.com/blog/kubernetes-scaler/
  • 44. Recap 1. connections per second 2. response time 3. errors 4. node availability 5. CPU resources 6. memory resources 7. disk and external resources
  • 45. Recap (2) 8. running instances 9. desired instances 10. deployment updates glitches
  • 46. Recap (3) 11. pod status 12. pod restarts 13. APIserver health 14. KubeDNS / Istio health 15. monitor your commands
  • 47. Grazie! Jorge Salamero - @bencerillo https://sysdig.com/blog/