2. 2
CodiLime at a glance
Ranked among TOP 50 fastest growing companies in EMEA by The Financial Times
● 8+ years in business
● 3 locations (Warsaw and Gdansk in Poland & Palo Alto in the US)
● 170+ people on board, including 90+ software engineers, 30+ DevOps engineers
● Working with market leaders including: Juniper, NTT, Nutanix, GigaSpaces, Cloudify
Areas of expertise
● SDN, NFV & Orchestrators
● Cloud native, Serverless & Multicloud
● Edge Computing
Services
● Software engineering
● DevOps
● UX/UI
● R&D
6. 6
Ways of provision
● Out of band - create pre-baked images
● Inline - live on remote systems
● Inline - live on local systems
● Inline - in runtime
8. 8
Example
● Prepare ETCD service in GCP
● Limited tunable parameters
○ Version
○ Cluster size / failure domains
○ Performance: instance type, disk size
● Auto healing
○ Periodic backup
○ Auto disaster recovery
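The short list of tunables above could be captured in one small, validated structure. This is a minimal sketch: the class name, field names and the validation rules are assumptions made for illustration, not part of the original service. Rejecting even cluster sizes is this sketch's choice, based on etcd's general recommendation of odd-sized clusters.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class EtcdServiceParams:
    """Hypothetical container for the few parameters the service exposes."""
    version: str             # etcd version baked into the image
    cluster_size: int        # number of instances in the group
    zones: Tuple[str, ...]   # failure domains to spread instances over
    machine_type: str        # performance: instance type
    disk_size_gb: int        # performance: data disk size

    def validate(self) -> None:
        # Assumption of this sketch: only odd, positive sizes are accepted.
        if self.cluster_size < 1 or self.cluster_size % 2 == 0:
            raise ValueError("cluster_size must be a positive odd number")
        if not self.zones:
            raise ValueError("at least one failure domain is required")
        if self.disk_size_gb <= 0:
            raise ValueError("disk_size_gb must be positive")

    def quorum(self) -> int:
        # Majority of members needed for the cluster to accept writes.
        return self.cluster_size // 2 + 1
```

Keeping the parameter set this small is what makes the later stages (image, instance group, scripts) manageable: every extra tunable has to be passed through each of them.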
9. 9
Limitations and Challenges
● What parts are most likely to be changed?
○ The fewer there are, the easier it is to build
● How often are we willing to change it?
○ How do we handle data migration/availability?
○ Cloud providers give us 2 states of service availability, instead of 3
10. 10
High level flow
● Git repo
● Create image with app, extra packages
● Create required resources
● Create instance group
● …
● Profit!
11. 11
Development flow
● Single instance
● A set of instances
● Set of instances from image
● Auto scaled set of instances from image
○ Integrate health checks
○ Integrate backup
○ Integrate recovery
13. 13
It’s because...
● You cannot bake in the whole config; it must be adjusted per instance
● Cluster state itself - depending on the state you must configure differently
● etcdctl - depending on the API version you use to talk to etcd, it returns different info
● … and depending on the output format it returns different data
● Nodes which are gone are not marked as dead
○ Need to periodically manually check if node is dead/alive
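Because the member list does not say whether a node is alive, each member has to be probed separately. The sketch below assumes every member exposes etcd's `/health` HTTP endpoint on its client URL over plain HTTP, and it treats any non-200 answer or connection error as "dead". The function names and the shape of the `members` mapping are illustrative assumptions.

```python
import urllib.request
from typing import Callable, Dict, List, Tuple


def http_health_probe(client_url: str, timeout: float = 2.0) -> bool:
    """Return True if GET <client_url>/health answers with HTTP 200."""
    url = client_url.rstrip("/") + "/health"
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.status == 200
    except OSError:
        # Covers refused connections, timeouts and HTTP error statuses.
        return False


def split_alive_dead(
    members: Dict[str, str],
    probe: Callable[[str], bool] = http_health_probe,
) -> Tuple[List[str], List[str]]:
    """Probe every member (name -> client URL) and sort names into two lists."""
    alive: List[str] = []
    dead: List[str] = []
    for name, url in sorted(members.items()):
        (alive if probe(url) else dead).append(name)
    return alive, dead
```

The probe is passed in as a parameter so that the same loop can be run from a timer on the instance or exercised in tests without a real cluster.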
15. 15
More than one config
● Initial provision - Ansible
● Remember about cloud-init
● Remember about instance tags/labels/meta-data
● Add script to join instance to cluster based on its state
16. 16
On instance launch
● Executes cloud-init
○ provision disks
○ Exec thin config script
● Thin config script talks to the cloud API to find which instances to connect to
○ We assume we use machine account
○ We assume instance has certain metadata keys
○ Thin config script is baked into image
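The "find which instances to connect to" step can be sketched as a filter over whatever the cloud API returns. The call to the cloud API itself is left out here. The record shape (`name`, `ip`, `metadata`) and the `etcd-cluster` metadata key are hypothetical and stand in for the metadata keys the instance is assumed to carry.

```python
from typing import Dict, Iterable, List


def find_peers(
    instances: Iterable[Dict],
    cluster: str,
    self_name: str,
    metadata_key: str = "etcd-cluster",  # hypothetical key name
) -> List[Dict]:
    """Keep instances tagged with the same cluster name, excluding ourselves."""
    peers = [
        inst
        for inst in instances
        if inst.get("metadata", {}).get(metadata_key) == cluster
        and inst["name"] != self_name
    ]
    # Stable ordering so every instance builds its config the same way.
    return sorted(peers, key=lambda inst: inst["name"])
```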
17. 17
Thin config script
● Talk to cloud API, find instances
● Try to query instances for service state
○ If cluster is alive, join cluster *
○ If no response from cluster
■ Check whether a cluster state backup already exists in the bucket
■ Restore state on instance
■ Prepare config adjusted to the instances expected in the cluster
■ Launch service
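The branching above can be sketched as a small decision function plus a helper that prepares the per-instance settings. The `ETCD_INITIAL_CLUSTER` and `ETCD_INITIAL_CLUSTER_STATE` environment variables and the default peer port 2380 are etcd's own. The function names, the bootstrap branch for "no response and no backup", plain-HTTP peer URLs, and using state `new` for both restore and bootstrap are assumptions of this sketch.

```python
from enum import Enum
from typing import Dict


class Action(Enum):
    JOIN = "join"            # cluster answers: become a member
    RESTORE = "restore"      # no answer, backup found: restore, then start
    BOOTSTRAP = "bootstrap"  # no answer, no backup: start a fresh cluster


def choose_action(cluster_responds: bool, backup_exists: bool) -> Action:
    if cluster_responds:
        return Action.JOIN
    if backup_exists:
        return Action.RESTORE
    return Action.BOOTSTRAP


def initial_cluster(peers: Dict[str, str], peer_port: int = 2380) -> str:
    """Build the name=url list from a mapping of member name -> IP address."""
    return ",".join(
        f"{name}=http://{ip}:{peer_port}" for name, ip in sorted(peers.items())
    )


def etcd_env(action: Action, peers: Dict[str, str]) -> Dict[str, str]:
    """Settings that cannot be baked into the image and differ per launch."""
    state = "existing" if action is Action.JOIN else "new"
    return {
        "ETCD_INITIAL_CLUSTER": initial_cluster(peers),
        "ETCD_INITIAL_CLUSTER_STATE": state,
    }
```

Keeping the decision separate from the side effects (restore, config write, service start) makes the script testable without a cloud account.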
18. 18
The asterisk!
● If cluster is alive, join cluster *
● Cluster member list returns nodes, but does not show their state
○ Need to check if they are dead/alive
○ Prepare config which is in sync with cluster state
○ Join cluster…
● Race conditions when more than 2 instances are joining
○ Mitigate with a random sleep ;)
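The random-sleep mitigation is only a few lines. This sketch uses hypothetical function names and an arbitrary 30-second upper bound. It makes simultaneous joins less likely but does not remove the race.

```python
import random
import time
from typing import Callable


def jittered_delay(max_seconds: float, rng=random) -> float:
    """Pick a uniformly random delay between 0 and max_seconds."""
    return rng.uniform(0.0, max_seconds)


def sleep_before_join(
    max_seconds: float = 30.0,               # arbitrary bound for this sketch
    sleep: Callable[[float], None] = time.sleep,
    rng=random,
) -> float:
    """Wait a random amount of time so instances do not all join at once."""
    delay = jittered_delay(max_seconds, rng)
    sleep(delay)
    return delay
```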
20. 20
Another try
● Make thin config script simpler
○ Just wait till current node is expected by the cluster
○ The code to bootstrap fresh cluster is left as is
○ Remove any node management
● Add script on etcd leader to run cluster node management
○ There can be only one leader in the cluster
○ The code is already there
○ etcd disallows adding/removing nodes if that would render the cluster inoperable
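The leader-side node management can be sketched as a planner that proposes one membership change at a time. The function names, the one-step-at-a-time policy and the local quorum check are assumptions of this sketch. The check only mirrors the safety rule that etcd enforces itself, so that the script waits instead of issuing a request that would be rejected.

```python
from typing import Optional, Set, Tuple


def quorum(size: int) -> int:
    """Majority of a cluster with `size` members."""
    return size // 2 + 1


def plan_step(
    members: Set[str],   # names currently in the etcd member list
    alive: Set[str],     # members that answered a health check
    expected: Set[str],  # instances the cloud says should be in the cluster
) -> Tuple[str, Optional[str]]:
    """Return the next single action: remove a dead node, add a new one, or wait."""
    dead = sorted(members - alive)
    missing = sorted(expected - members)
    if dead:
        candidate = dead[0]
        remaining = members - {candidate}
        # Only remove if the surviving members still form a majority.
        if len(alive & remaining) >= quorum(len(remaining)):
            return ("remove", candidate)
        return ("wait", None)
    if missing:
        return ("add", missing[0])
    return ("noop", None)
```

Run only on the leader, this loop has a single writer, which is what removes the start-up race between joining instances.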
21. 21
What if cluster has...
● No leader - no need to add/remove nodes
○ This usually leads to unhealthy instances
○ In edge cases this will trigger cluster destruction and recreation and a fresh restore
● Single leader - simpler cluster management
○ No more race conditions on start
● Multiple leaders
○ We can avoid that by limiting number of instances
22. 22
More than one config, again
● Initial provision - Ansible
● Remember about cloud-init
● Add scripts
○ to join instance to cluster based on its state
○ to manage cluster nodes if leader
23. 23
So now we have...
● Ansible - imperative or declarative to make image
● Cloud-init - declarative but allows imperatives
● Scripts - imperative
○ Shell
○ gcloud/awscli
○ pex + envtpl
○ Etcdctl
● Terraform - declarative
24. 24
Adding up tools
● Vagrant + Ansible
● Serverspec / Inspec / TestInfra
● Test Kitchen
○ Merges those above, but still lacks a bit in full cluster tests
25. 25
Worth seeing
● github.com/MonsantoCo/etcd-aws-cluster/ (shell)
● github.com/ocadotechnology/etcd-dynamic-cluster (python)
● etcd rpc proto
● github.com/coreos/etcd-operator
Ummm wait, that's for containers…
26. 26
Summing up
● Depending on the stage we can choose a different solution
● Passing parameters from one stage to another
● Sometimes certain solutions are forced
● Sometimes you must make your own tools