In one year we migrated a full set of micro-services to a new infrastructure based on Kubernetes and Docker.
I will present how we got there, describing the real-life challenges, the problems we faced and the solutions we found.
3. lastminute.com group in numbers
40 countries
17 languages
10M travellers per year*
€ 2.5B GTV*
€ 250M revenue*
43M users per month*
*data as of 31st December 2015
icons from http://www.flaticon.com
4. A tech company to the core
Tech department: 300+ people
Modules: ~100
Database: 150 schemas, 3300 tables, TB data
Instances: 1400+
Locations: Chiasso, Milan, Madrid, London, Bengaluru
9. The improvements needed
● alignment
● real pipelines
● infrastructure
● resilience
● monitoring
● remove constraints
10. A year-long endeavour
● build a new, modern infrastructure
● migrate the search (flight/hotel) product there
... without:
● impacting the business
● throwing away our whole datacenter
22. Kubernetes: our architecture and choices
[Diagram, labelled "production": APP1-PRODUCTION containing a deployment, a replica-set, a secret, a configmap and three pods (POD 1, POD 2, POD 3)]
23. Kubernetes: our architecture and choices
[Diagram, labelled "production": APP1-PRODUCTION containing a deployment, a replica-set, a secret, a configmap and three pods (POD 1, POD 2, POD 3), now with loadbalancer-app1 and the name app1.lastminute.intra added]
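The pieces named on this slide can be sketched as Kubernetes manifests built in Python. This is only an illustration under several assumptions that the slide does not confirm: that APP1-PRODUCTION is a namespace named after the application and environment, that loadbalancer-app1 is a Service of type LoadBalancer, and that the secret and configmap are injected as environment variables. The image name, the port numbers and the `apps/v1` API version (current Kubernetes, newer than the talk) are hypothetical choices.

```python
def app_manifests(app, env, replicas=3, image="registry.example/app1:1.0.0"):
    """Build a Deployment and a Service for one application in one environment.

    Every concrete value here (namespace scheme, image, ports) is an
    assumption made for illustration, not taken from the talk.
    """
    namespace = f"{app}-{env}"   # assumed "one namespace per app and environment"
    labels = {"app": app}

    deployment = {
        "apiVersion": "apps/v1",
        "kind": "Deployment",
        "metadata": {"name": app, "namespace": namespace},
        "spec": {
            "replicas": replicas,  # the Deployment's replica-set keeps this many pods
            "selector": {"matchLabels": labels},
            "template": {
                "metadata": {"labels": labels},
                "spec": {
                    "containers": [{
                        "name": app,
                        "image": image,  # hypothetical image name
                        "ports": [{"containerPort": 8080}],  # assumed port
                        # assumed: configuration comes from a configmap and a secret
                        "envFrom": [
                            {"configMapRef": {"name": app}},
                            {"secretRef": {"name": app}},
                        ],
                    }],
                },
            },
        },
    }

    service = {
        "apiVersion": "v1",
        "kind": "Service",
        "metadata": {"name": f"loadbalancer-{app}", "namespace": namespace},
        "spec": {
            "type": "LoadBalancer",
            "selector": labels,  # routes traffic to the pods of the Deployment
            "ports": [{"port": 80, "targetPort": 8080}],
        },
    }
    return deployment, service
```

Calling `app_manifests("app1", "production")` returns two dictionaries that can be serialised with `json.dumps`, since Kubernetes accepts manifests in JSON as well as YAML.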
25. Kubernetes: what’s left outside?
● datastores
● distributed caches
● distributed locking
● pub-sub
● logs and metrics storage
26. 1st try (with test app), it seemed to work
https://www.flickr.com/photos/26516072@N00/2194001232
27. Self-healing
The self-healing term describes any application, service, or a system that can discover that it is not working correctly and, without any human intervention, make the necessary changes to restore itself to the normal or designed state.
ref: https://technologyconversations.com/2016/01/26/self-healing-systems
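The definition boils down to a loop of "observe, compare with the designed state, correct". A minimal sketch of one such step, assuming a hypothetical `Pod` record and a desired replica count (this is a toy model, not how Kubernetes controllers are implemented):

```python
from dataclasses import dataclass


@dataclass
class Pod:
    name: str      # hypothetical naming scheme: "pod-<number>"
    healthy: bool


def reconcile(pods, desired):
    """Return the pod list after one self-healing step.

    Unhealthy pods are dropped and fresh ones are created until the
    number of healthy pods matches the desired count again.
    """
    survivors = [p for p in pods if p.healthy]
    next_id = max((int(p.name.rsplit("-", 1)[1]) for p in pods), default=0) + 1
    while len(survivors) < desired:
        survivors.append(Pod(f"pod-{next_id}", True))
        next_id += 1
    # Scale down if there are more healthy pods than desired.
    return survivors[:desired]
```

With three pods of which `pod-2` is unhealthy and a desired count of 3, one step keeps `pod-1` and `pod-3` and adds a new `pod-4`, with no human intervention involved.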
29. Kubernetes probes: liveness & readiness
Two questions for dev:
● when can I consider my container alive?
● when can I consider my container ready to receive traffic?
deployment.yaml:
spec:
  containers:
  - livenessProbe:
      httpGet:
        path: /liveness
      successThreshold: 1  # Kubernetes only accepts 1 for a liveness probe
      failureThreshold: 2
    readinessProbe:
      httpGet:
        path: /readiness
      successThreshold: 3
      failureThreshold: 2
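To make the two thresholds concrete, here is a simplified model of how consecutive probe results flip a container between "failing" and "passing". It is a sketch of the documented semantics (a probe fails after `failureThreshold` consecutive failures and recovers after `successThreshold` consecutive successes), not the kubelet's actual code; the function name and the `initial` parameter are made up for the example.

```python
def probe_states(results, success_threshold, failure_threshold, initial=False):
    """Replay a sequence of probe results and return the state after each one.

    results: iterable of booleans, True meaning the HTTP check succeeded.
    initial: state before the first probe (False = not yet passing).
    """
    state = initial
    successes = failures = 0
    states = []
    for ok in results:
        if ok:
            successes += 1
            failures = 0
            if not state and successes >= success_threshold:
                state = True   # enough consecutive successes to recover
        else:
            failures += 1
            successes = 0
            if state and failures >= failure_threshold:
                state = False  # enough consecutive failures to give up
        states.append(state)
    return states
```

With the readiness values from the slide (`success_threshold=3`, `failure_threshold=2`), a single failed check does not take the pod out of rotation, while two in a row do, and the pod then needs three clean checks before it receives traffic again.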
30. Our choices: framework - k8s
/liveness:
● when the Tomcat container is up
● when the ratio of active to max threads is below a threshold
/readiness:
● all the startup jobs have run
● no termination request has been received
.. ongoing never-ending research ..
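The rules above can be sketched as two small predicates. The `AppState` fields, the 0.9 ratio and the default of 200 threads are assumptions chosen for the example; the talk does not give the real threshold. The only Kubernetes fact relied on is that an HTTP probe counts any status from 200 to 399 as success.

```python
from dataclasses import dataclass


@dataclass
class AppState:
    """Hypothetical snapshot of what the health endpoints would inspect."""
    active_threads: int = 0
    max_threads: int = 200            # assumed pool size
    startup_jobs_done: bool = False
    termination_requested: bool = False


def liveness(state, max_ratio=0.9):
    """Alive while the thread pool is not saturated.

    The fact that the endpoint answers at all covers "the container is up".
    max_ratio is an assumed threshold.
    """
    return state.active_threads / state.max_threads < max_ratio


def readiness(state):
    """Ready once startup jobs have run and no shutdown has been requested."""
    return state.startup_jobs_done and not state.termination_requested


def status_code(ok):
    """Map a check result to the HTTP status returned to the probe."""
    return 200 if ok else 503
```

Keeping the two checks separate lets a pod that is shutting down fail readiness, so it stops receiving traffic, while still passing liveness, so it is not restarted in the middle of draining.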
31. 2nd try (with production traffic)
● zero downtime during rollout
● monitoring in place
● alerting
● centralized logging
● legacy infrastructure to the rescue in case of problems
32. ... failure ... the big one!
https://www.flickr.com/photos/ghost_of_kuji/2763674926
37. “Go” deep .. whatever language it takes
https://www.pexels.com/photo/sea-man-person-ocean-2859/
38. There’s a light .. at the end
https://www.pexels.com/photo/grayscale-photography-of-person-at-the-end-of-tunnel-211816/
39. ... benefits
● lead and migration time
● resilience
● root cause analysis
● speed of deployment
● instant scaling
40. Give me the numbers!
● 1300 req/sec in the new cluster
● 25 micro-services migrated in 4 months
● 1 week to migrate an application
● 10 minutes to create a new environment
● 11 min to gracefully roll-out a new version with 55 instances
● whole pipeline runs in 16 min
● 1.5M metrics/minute flows
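A few per-unit rates follow directly from these figures. The arithmetic below assumes an even spread over time, which the slide does not state (for example, it does not say whether instances are replaced one at a time or in batches).

```python
# Figures taken from the slide.
rollout_minutes = 11
instances = 55
metrics_per_minute = 1_500_000
services_migrated = 25
migration_months = 4

# Derived averages (assuming an even spread over time).
seconds_per_instance = rollout_minutes * 60 / instances   # 12.0 s per instance
metrics_per_second = metrics_per_minute / 60              # 25000.0 metrics/s
services_per_month = services_migrated / migration_months # 6.25 services/month
```

So a graceful rollout averages about 12 seconds per instance, and the metrics pipeline absorbs about 25,000 data points per second.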