This is an unplanned lightning talk at Kubernetes Community Days 2019. It's about how Booking.com does multi-cluster blue-green/canary deployments in its Kubernetes infrastructure.
I came here to talk briefly about how we do blue-green and canary deployments at Booking.com.
But before I start, some facts. We have 16 multitenant k8s clusters, which add up to roughly fifteen hundred nodes and host around seven hundred services.
Due to our scale we always deploy to multiple regions, and we love advanced deployment strategies because they let us ensure a seamless customer experience while shipping products at a fast pace. However, such strategies are not trivial to manage. One way to do it is to configure CI/CD to deploy to several regions at the same time. In our experience, this has proven to be an error-prone approach, because any failure in either CI/CD or a cluster leaves different regions running inconsistent versions.
That’s why we decided to go a different way. What we wanted was a setup that ensures consistency across clusters but also gives us advanced control over how we roll out changes, which Kubernetes doesn’t give you by default.
So we decided to go with a so-called “management cluster”, which hosts a set of k8s controllers that we jointly call “shipper”. A user deploys an application spec to the management cluster, and shipper coordinates the deployment of the app to the application clusters. It also continuously reconciles the state to make sure that applications in all clusters are in sync. On top of that, shipper lets us define deployment steps. For instance, we have staging, canary and full-on steps. I will talk a bit more about what these steps are in a second.
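To make the reconciliation idea concrete, here is a minimal sketch of what "keep all clusters in sync" means. It is not shipper's code: the `Cluster` class and the `reconcile` function are hypothetical stand-ins, and a real controller would talk to the Kubernetes API of each cluster instead of updating a dictionary.

```python
from dataclasses import dataclass, field


@dataclass
class Cluster:
    """Hypothetical stand-in for one application cluster."""
    name: str
    deployed: dict = field(default_factory=dict)  # app name -> deployed version


def reconcile(desired: dict, clusters: list) -> list:
    """Bring every cluster to the desired app versions.

    Returns the list of (cluster, app, version) changes that were applied.
    An empty list means all clusters were already in sync.
    """
    changes = []
    for cluster in clusters:
        for app, version in desired.items():
            if cluster.deployed.get(app) != version:
                cluster.deployed[app] = version
                changes.append((cluster.name, app, version))
    return changes


if __name__ == "__main__":
    # Cluster and app names are made up for illustration.
    clusters = [Cluster("eu-west", {"shop": "1.0"}), Cluster("us-east")]
    print(reconcile({"shop": "1.1"}, clusters))  # first pass updates both clusters
    print(reconcile({"shop": "1.1"}, clusters))  # second pass finds nothing to do
```

The point of the sketch is that the desired state lives in one place, so a failed or partial pass is simply repeated until every cluster matches.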
But before that, I would like to show you what an application spec consists of. A spec defines a link to a Helm chart with its customizations, a cluster selector that instructs shipper which regions to deploy to, and finally the rollout steps. Each rollout step controls two aspects: the capacity of the application and the traffic that reaches the application.
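The parts of a spec described above can be modeled roughly like this. This is an illustrative data model only, not shipper's actual schema: the class names, field names, chart URL and the `validate` helper are all assumptions made for the example.

```python
from dataclasses import dataclass


@dataclass
class Chart:
    """Link to a Helm chart (field names are hypothetical)."""
    name: str
    version: str
    repo_url: str


@dataclass
class Step:
    """One rollout step: how much capacity and traffic the new version gets."""
    name: str
    capacity_percent: int
    traffic_percent: int


@dataclass
class AppSpec:
    chart: Chart
    values: dict    # chart customizations
    regions: tuple  # cluster selector: which regions to deploy to
    steps: tuple    # ordered rollout steps


def validate(spec: AppSpec) -> list:
    """Return a list of human-readable problems; empty means the spec looks sane."""
    problems = []
    if not spec.regions:
        problems.append("no regions selected")
    if not spec.steps:
        problems.append("no rollout steps defined")
    for step in spec.steps:
        for label, pct in (("capacity", step.capacity_percent),
                           ("traffic", step.traffic_percent)):
            if not 0 <= pct <= 100:
                problems.append(f"{step.name}: {label} {pct} is outside 0-100")
    return problems


if __name__ == "__main__":
    spec = AppSpec(
        chart=Chart("shop", "1.1.0", "https://charts.example.com"),  # made-up chart
        values={"replicaCount": 3},
        regions=("eu-west", "us-east"),
        steps=(Step("staging", 1, 0),),
    )
    print(validate(spec))
```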
So, for instance, our default strategy consists of three steps. Step #1 is staging. In this step we create a new version of the application but route no traffic to it. So this is your last chance for any final checks. In the next step, canary, we route a portion of live traffic to the newly deployed application. If something goes wrong, we can always roll back. And when we are satisfied with canary, we go full-on.
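A three-step strategy like this one can be sketched as a small table plus a couple of helpers. The percentages below are illustrative assumptions (the talk does not give the real numbers), and `traffic_split`, `advance` and `roll_back` are hypothetical helper names, not shipper commands. Rolling back is modeled here simply as moving to the previous step.

```python
# Illustrative values only: share of capacity and traffic given to the NEW version.
DEFAULT_STRATEGY = [
    {"name": "staging", "capacity": 1, "traffic": 0},      # deployed, no live traffic
    {"name": "canary", "capacity": 10, "traffic": 10},     # a portion of live traffic
    {"name": "full on", "capacity": 100, "traffic": 100},  # new version takes over
]


def traffic_split(strategy: list, step_index: int) -> tuple:
    """Return (old version %, new version %) of traffic at the given step."""
    new = strategy[step_index]["traffic"]
    return 100 - new, new


def advance(strategy: list, step_index: int) -> int:
    """Move to the next step, staying on the last one once it is reached."""
    return min(step_index + 1, len(strategy) - 1)


def roll_back(step_index: int) -> int:
    """Move back one step, never going below the first step."""
    return max(step_index - 1, 0)


if __name__ == "__main__":
    step = 0
    for _ in DEFAULT_STRATEGY:
        print(DEFAULT_STRATEGY[step]["name"], traffic_split(DEFAULT_STRATEGY, step))
        step = advance(DEFAULT_STRATEGY, step)
```

Keeping the steps as data rather than as pipeline logic is what lets the same strategy be applied identically in every cluster.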
This is what we do, and the cool thing is that you can do the same! Shipper is an open-source project that we created and keep developing.