Introducing Flagger: a progressive delivery Kubernetes operator for Istio.
Flagger automates the promotion of canary deployments, and uses Istio routing for traffic shifting and Prometheus metrics for canary analysis.
What is Progressive Delivery?
Progressive Delivery is Continuous Delivery with fine-grained control over the
blast radius.
Building blocks
● User segmentation
● Traffic management
● Observability
● Automation
https://redmonk.com/jgovernor/2018/08/06/towards-progressive-delivery/
Introducing Flagger
Flagger is a Kubernetes operator that automates the promotion of
canary deployments using Istio routing for traffic shifting and
Prometheus metrics for canary analysis.
Flagger implements a control loop that gradually shifts traffic to the
canary while measuring key performance indicators. Based on the
KPIs analysis a canary is promoted or aborted.
Key Performance Indicators
The decision to pause the traffic shift, abort or promote a canary is
based on:
● Deployment health status
● Request success rate percentage
● Request latency average value
Flagger control loop stage 1-5
● scan for canary deployments
● create the primary deployment and HPA
● create the ClusterIP services for primary and canary
● create an Istio virtual service with weighted destinations mapped to primary and
canary ClusterIP services
● check primary and canary deployments status
○ halt advancement if a rolling update is underway
○ halt advancement if pods are unhealthy
● increase canary traffic weight percentage from 0% to 5% (step weight)
● check canary HTTP request success rate and latency
○ halt advancement if any metric is under the specified threshold
○ increment the failed checks counter
Flagger control loop stage 6-7
● check if the number of failed checks reached the threshold
○ route all traffic to primary
○ scale to zero the canary deployment
○ mark canary as failed
○ wait for the canary to be updated (revision bump) and start over
● increase canary traffic by 5% (step weight)
○ halt advancement while canary request success rate is under the threshold
○ halt advancement while canary request duration P99 is over the threshold
○ halt advancement if the primary or canary deployment becomes unhealthy
○ halt advancement while canary deployment is being scaled up/down by HPA
Flagger control loop stage 8-12
● promote canary to primary when canary weight reaches max
○ copy canary deployment spec template over primary
● wait for primary rolling update to finish
○ halt advancement if pods are unhealthy
● route all traffic to primary
● scale to zero the canary deployment
● mark canary as finished
● wait for the canary to be updated (revision bump) and start over