
Crashing Pods: How to Compensate for such an Outage?

Kubernetes offers many features to keep pod downtime very low. Graceful shutdowns and zero-downtime deployments are certainly possible with Kubernetes. However, this only applies to the orderly transition of containers or pods. Despite all the precautions Kubernetes takes, a service crash can still lead to HTTP 5xx responses. Additional measures must be taken to fully compensate for services that are in such an error state.

This session shows why the classic approach of using a resilience framework cannot completely solve these kinds of problems. To that end, the pod lifecycle is examined and the Kubernetes workflow for replacing faulty pods is analyzed. One possible solution strategy is client-side load balancing. A service mesh such as Istio is used to demonstrate what it takes to achieve full compensation with this strategy.


  1. Crashing Pods – How to compensate for such an outage? Michael Hofmann, Hofmann IT-Consulting, info@hofmann-itconsulting.de, https://hofmann-itconsulting.de
  2. Crashing pods?
     ● Controlled (error state): rolling update
     ● Application deadlock
       – Thread pool full
       – Thread deadlock situation (detection in the JVM)
     ● Memory leak (out of memory)
     ● Bug in the application or application server
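     The deck notes that the JVM can detect a thread deadlock itself. As an illustration only (not part of the original slides), a minimal Java sketch of querying that built-in detection; a liveness check could, for example, use such a probe so Kubernetes restarts the pod:

     import java.lang.management.ManagementFactory;
     import java.lang.management.ThreadMXBean;

     public class DeadlockProbe {

         // true if the JVM currently reports deadlocked threads
         public static boolean hasDeadlock() {
             ThreadMXBean threads = ManagementFactory.getThreadMXBean();
             long[] deadlocked = threads.findDeadlockedThreads();
             return deadlocked != null && deadlocked.length > 0;
         }

         public static void main(String[] args) {
             System.out.println("deadlock detected: " + hasDeadlock());
         }
     }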
  3. Mitigation/Compensation Strategies
     ● Quick recognition of the error state for recovery
     ● Short time for eventual consistency
     ● Controlled error state (e.g. rolling update)
     ● Intelligent routing (outlier detection)
     ● Classic resilience
  4. Kubernetes Architecture (source: https://kubernetes.io/docs/concepts/overview/components/)
  5. Liveness and readiness probes
     spec:
       containers:
       - name: crashing-pod
         image: hofmann/crashing-pod:latest
         imagePullPolicy: Never
         ports:
         - containerPort: 9080
         livenessProbe:
           httpGet:
             path: /health/live
             port: 9080
           initialDelaySeconds: 10
           periodSeconds: 5
           timeoutSeconds: 1
           failureThreshold: 5
         readinessProbe:
           httpGet:
             path: /health/ready
             port: 9080
           initialDelaySeconds: 15
           periodSeconds: 5
           timeoutSeconds: 1
           failureThreshold: 5
     ● Difference between liveness and readiness probes
     ● Only pods with a successful readiness probe are assigned to a service (endpoint, gets an IP)
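     The probe paths /health/live and /health/ready follow the MicroProfile Health convention, which fits the JVM/application-server context of the deck. Purely as an illustration (not part of the original slides), a minimal sketch of the corresponding Java health checks, assuming a MicroProfile-based service; the readiness condition is a placeholder:

     import javax.enterprise.context.ApplicationScoped;
     import org.eclipse.microprofile.health.HealthCheck;
     import org.eclipse.microprofile.health.HealthCheckResponse;
     import org.eclipse.microprofile.health.Liveness;
     import org.eclipse.microprofile.health.Readiness;

     @Liveness
     @ApplicationScoped
     public class LivenessCheck implements HealthCheck {
         @Override
         public HealthCheckResponse call() {
             // "up" as long as the process can still answer at all
             return HealthCheckResponse.up("crashing-pod-live");
         }
     }

     @Readiness
     @ApplicationScoped
     class ReadinessCheck implements HealthCheck {
         @Override
         public HealthCheckResponse call() {
             // placeholder condition: report "down" while the service cannot accept traffic
             boolean ready = true;
             return ready
                     ? HealthCheckResponse.up("crashing-pod-ready")
                     : HealthCheckResponse.down("crashing-pod-ready");
         }
     }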
  6. From service to pod
     ● Service
       – Called by clients: <svc-name>.<ns>.svc.cluster.local
       – Basis for DNS naming
       – References pods by labels
     ● Pods
       – Assigned IPs
     ● Endpoints
       – Connect the service to pod instances (IPs)
       – Stored in etcd: IP and port
       – Endpoint refresh: pod created, pod deleted, pod label modified
       – Basis for: kube-proxy, ingress controller, CoreDNS, cloud provider, service mesh
  7. Workflow
  8. Endpoint outdated
     ● Kubelet:
       – Readiness probes
       – Housekeeping interval to update the endpoint
     ● Kube-proxy (iptables settings)
     ● Kubernetes DNS (CoreDNS)
     ● Caching of DNS values in the client
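     The last point refers to the JVM's own DNS cache, which can keep a client pointing at the IP of an already evicted pod. As an illustration only, a minimal sketch of lowering that cache's TTL via the JVM security properties so refreshed cluster DNS records are picked up sooner; the TTL values are arbitrary examples and must be set before the first lookup:

     import java.security.Security;

     public class DnsCacheConfig {
         public static void main(String[] args) {
             // TTL for successful DNS lookups, in seconds
             Security.setProperty("networkaddress.cache.ttl", "5");
             // failed lookups are cached separately
             Security.setProperty("networkaddress.cache.negative.ttl", "5");
         }
     }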
  9. Rolling update
     ● Update running pods
       – Defined by the rolling update strategy
     ● Influenced by
       – Liveness and readiness probes
       – preStop lifecycle hook
         ● Distributed infrastructure can react to the state change (update its components)
       – SIGTERM (not SIGKILL)
         ● Shutdown hook in the application server (finish open requests)
     ● Target: zero-downtime deployment (should...)
  10. Rolling update & preStop hook
      Container:
        readinessProbe:
          ...
        lifecycle:
          preStop:
            exec:
              command: ["/bin/bash", "-c", "sleep 30"]
      Deployment:
        strategy:              # default of k8s
          type: RollingUpdate
          rollingUpdate:
            maxSurge: 1        # max. 1 over-provisioned pod
            maxUnavailable: 0  # no unavailable pod during update
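      Slide 9 mentions that the application server should finish open requests on SIGTERM; most servers register this behavior themselves. Purely as an illustration, a minimal sketch of a JVM shutdown hook that delays termination until in-flight work has drained; the counter and the 30-second deadline (matching the preStop sleep above) are illustrative assumptions:

      import java.util.concurrent.atomic.AtomicInteger;

      public class GracefulShutdown {

          // illustrative counter of requests currently being processed
          private static final AtomicInteger inFlight = new AtomicInteger();

          public static void install() {
              // the hook runs when the JVM receives SIGTERM (e.g. from the kubelet)
              Runtime.getRuntime().addShutdownHook(new Thread(() -> {
                  long deadline = System.currentTimeMillis() + 30_000;
                  while (inFlight.get() > 0 && System.currentTimeMillis() < deadline) {
                      try {
                          Thread.sleep(100); // wait for open requests to finish
                      } catch (InterruptedException e) {
                          Thread.currentThread().interrupt();
                          break;
                      }
                  }
              }));
          }
      }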
  11. Intelligent Routing
      ● Server-side load balancing
        – Endpoint handling is done by the infrastructure (K8s)
        – Requests are routed to the faulty instance until the platform evicts it
      ● Client-side load balancing
        – Client must know all endpoints: dependency on the infrastructure (service registry)
        – Can react to a faulty request
      ● Outlier detection (in addition to client-side LB)
        – A faulty instance (HTTP >= 500) is evicted (for a period of time)
        – Reacts faster than the distributed infrastructure
  12. Resilience
      ● Frameworks
        – Server-side load balancing
          ● Retry storm on faulty pods
        – Spring Cloud LoadBalancer (client-side LB)
          ● Since 2020
          ● Generic replacement for Netflix Ribbon
          ● Kubernetes and Cloud Foundry service registry
      ● Service mesh
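      As an illustration only, a minimal sketch of enabling client-side load balancing with Spring Cloud LoadBalancer; the assumption is that a service registry backend such as spring-cloud-kubernetes is on the classpath so the logical service name resolves to individual pod instances:

      import org.springframework.boot.SpringApplication;
      import org.springframework.boot.autoconfigure.SpringBootApplication;
      import org.springframework.cloud.client.loadbalancer.LoadBalanced;
      import org.springframework.context.annotation.Bean;
      import org.springframework.web.client.RestTemplate;

      @SpringBootApplication
      public class ClientApp {

          // @LoadBalanced lets Spring Cloud LoadBalancer resolve the logical service
          // name against the configured service registry instead of the cluster VIP/DNS
          @Bean
          @LoadBalanced
          RestTemplate restTemplate() {
              return new RestTemplate();
          }

          public static void main(String[] args) {
              SpringApplication.run(ClientApp.class, args);
          }
      }

      A caller would then use a logical name, e.g. restTemplate.getForObject("http://crashing-pod/api/work", String.class), and the load balancer picks a concrete pod instance per request.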
  13. Idempotency
      ● A retry causes multiple calls!
      ● GET, HEAD, OPTIONS, DELETE (if the resource exists)
      ● PUT
        – Idempotent by definition
          ● Must be implemented idempotently (DuplicateKeyException)
        – Primary key must be in the payload
      ● POST
        – Idempotency key in the header
        – Idempotency key stored in a separate table
        – PUT semantics with a primary key (header vs. payload)
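      As an illustration only, a minimal JAX-RS sketch of the idempotency-key approach for POST; an in-memory map stands in for the separate key table mentioned on the slide, and the resource path, payload handling, and header name are illustrative assumptions:

      import java.util.Map;
      import java.util.concurrent.ConcurrentHashMap;
      import javax.ws.rs.Consumes;
      import javax.ws.rs.HeaderParam;
      import javax.ws.rs.POST;
      import javax.ws.rs.Path;
      import javax.ws.rs.core.MediaType;
      import javax.ws.rs.core.Response;

      @Path("/orders")
      public class OrderResource {

          // stands in for the separate idempotency-key table
          private static final Map<String, Object> PROCESSED = new ConcurrentHashMap<>();

          @POST
          @Consumes(MediaType.APPLICATION_JSON)
          public Response create(@HeaderParam("Idempotency-Key") String key, String payload) {
              if (key == null || key.isBlank()) {
                  return Response.status(Response.Status.BAD_REQUEST)
                                 .entity("Idempotency-Key header required").build();
              }
              // a retried request with the same key returns the stored result
              // instead of creating the resource a second time
              Object previous = PROCESSED.get(key);
              if (previous != null) {
                  return Response.ok(previous).build();
              }
              Object created = handleCreate(payload); // actual business logic would go here
              PROCESSED.put(key, created);
              return Response.status(Response.Status.CREATED).entity(created).build();
          }

          private Object handleCreate(String payload) {
              return payload; // placeholder
          }
      }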
  14. Istio
  15. Istio
      ● Resilience
      ● Client-side load balancing (knows the pods)
      ● Outlier detection
      ● Does its own health checks (in addition to the kubelet)
      ● The kubelet checks sidecar and workload together
  16. Istio
      apiVersion: networking.istio.io/v1alpha3
      kind: Gateway
      metadata:
        name: mesh-gateway
      spec:
        selector:
          istio: ingressgateway
        servers:
        - port:
            number: 80
            name: http
            protocol: HTTP
          hosts:
          - "*"
  17. Istio
      apiVersion: networking.istio.io/v1alpha3
      kind: VirtualService
      metadata:
        name: crashing-pod
      spec:
        gateways:
        - mesh-gateway
        hosts:
        - "*"
        http:
        - match:
          - uri:
              prefix: /
          route:
          - destination:
              port:
                number: 9080
              host: crashing-pod
              subset: v1
          retries:
            attempts: 3
            perTryTimeout: 1s

      apiVersion: networking.istio.io/v1beta1
      kind: DestinationRule
      metadata:
        name: crashing-pod
      spec:
        host: crashing-pod
        subsets:
        - name: v1
          labels:
            app: crashing-pod
        trafficPolicy:
          tls:
            mode: DISABLE
          loadBalancer:
            simple: ROUND_ROBIN
          outlierDetection:
            consecutiveGatewayErrors: 1
            interval: 1.0s
            baseEjectionTime: 30s
  18. Recap: Quick Error Recognition
      ● Interval of the health probes (liveness, readiness) run by the kubelet
      ● Other error detection by the kubelet (OOM)
      ● Problem: distributed architecture of K8s (propagation of the error event to the components)
      ● Error types:
        – Controlled error state (e.g. rolling update)
        – Fast detectable errors
        – Slow detectable errors
  19. Demo
  20. Summary
      ● Distributed architecture of K8s
      ● Controlled error state: 99.9% (see rolling update) --> 100%?!
      ● A mix of strategies is necessary for 100%:
        – Client-side load balancing
        – Outlier detection
        – Resilience
        – Idempotency
