CodiLime Tech Talk - Michał Sochoń: Configuration management hell

Configuration Management Hell
2019.01.30 ver 0.1.0-alpha-2
Michał Sochoń

2
CodiLime at a glance
Ranked among TOP 50 fastest growing companies in EMEA by The Financial Times
● 8+ years in business
● 3 locations (Warsaw and Gdansk in Poland & Palo Alto in the US)
● 170+ people on board, including 90+ software engineers, 30+ DevOps engineers
● Working with market leaders including: Juniper, NTT, Nutanix, GigaSpaces, Cloudify
Areas of expertise
● SDN, NFV & Orchestrators
● Cloud native, Serverless & Multicloud
● Edge Computing
Services
● Software engineering
● DevOps
● UX/UI
● R&D

3
In general
● Provisioning of resources
● Configuration
● Data management

4
Config Paradigms
● Imperative
○ Non-idempotent
○ Defines exactly how to perform each step
● Declarative
○ Idempotent
○ Defines desired state
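
To make the difference concrete, here is a minimal sketch in Python. The function names and the sample settings are illustrative assumptions, not something from the talk. The imperative step always performs its action, while the declarative one only states what the result should contain, so it is safe to repeat.

```python
def append_line(lines, line):
    """Imperative step: always performs the action, so repeating it changes the result."""
    return lines + [line]


def ensure_line(lines, line):
    """Declarative step: describes the desired state, so repeating it is a no-op."""
    return lines if line in lines else lines + [line]


config = ["max_connections=100"]
setting = "log_level=info"

# Apply each step twice, as a re-run of a provisioning job would.
imperative = append_line(append_line(config, setting), setting)
declarative = ensure_line(ensure_line(config, setting), setting)

print(len(imperative))   # 3: the setting was appended twice
print(len(declarative))  # 2: the second run changed nothing
```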

5
Parameters
● Vendor specific
● App config
● System Tunables
● Secrets / Security
● Dependent services

6
Ways of provisioning
● Out of band - create pre-baked images
● Inline - live on remote systems
● Inline - live on local systems
● Inline - at runtime

7
Most common setups
● Execution on demand
● Scheduled
● Event based

8
Example
● Prepare an etcd service in GCP
● Limited tunable parameters
○ Version
○ Cluster size / failure domains
○ Performance: instance type, disk size
● Auto healing
○ Periodic backup
○ Auto disaster recovery
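
A small parameter set like this can be captured in one place. The sketch below is an assumption about how that might look: the class name, field names and sample values are hypothetical. The quorum arithmetic is the standard majority rule used by etcd (more than half of the members must agree).

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class EtcdClusterParams:
    """Hypothetical container for the few tunables exposed to users."""
    version: str
    cluster_size: int
    machine_type: str
    disk_size_gb: int

    def __post_init__(self):
        if self.cluster_size < 1:
            raise ValueError("cluster_size must be at least 1")
        if self.disk_size_gb < 1:
            raise ValueError("disk_size_gb must be at least 1")

    @property
    def quorum(self):
        """Members that must be healthy for the cluster to accept writes."""
        return self.cluster_size // 2 + 1

    @property
    def failure_tolerance(self):
        """Members that can be lost while the cluster keeps its quorum."""
        return self.cluster_size - self.quorum


# Sample values only; they are not taken from the talk.
params = EtcdClusterParams(
    version="3.3.11", cluster_size=3, machine_type="n1-standard-2", disk_size_gb=50
)
print(params.quorum, params.failure_tolerance)  # 2 1
```

Note that an even cluster size buys no extra tolerance: 4 members tolerate 1 failure, the same as 3.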

9
Limitations and Challenges
● Which parts are most likely to change?
○ The fewer there are, the easier it is to build
● How often are we willing to change it?
○ How do we handle data migration/availability?
○ Cloud providers give us 2 states of service availability instead of 3

10
High level flow
● Git repo
● Create image with app, extra packages
● Create required resources
● Create instance group
● …
● Profit!

11
Development flow
● Single instance
● A set of instances
● Set of instances from image
● Auto scaled set of instances from image
○ Integrate health checks
○ Integrate backup
○ Integrate recovery

12
But where is ‘the’ hell?

13
It’s because...
● You cannot bake in the whole config; it must be adjusted per instance
● Cluster state itself - depending on the state, you must configure the node differently
● etcdctl - depending on the API version you use to talk to etcd, it returns different info
● … and depending on the output format, it returns different data
● Nodes which are gone are not marked as dead
○ Need to periodically check manually whether a node is dead or alive
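
The first two points can be illustrated with a sketch of per-instance config rendering. The function name and the sample hosts are assumptions. The environment variable names are the ones etcd reads for its `--name`, `--initial-advertise-peer-urls`, `--initial-cluster` and `--initial-cluster-state` flags. The same image yields a different config depending on which node it runs on and whether the cluster already exists.

```python
def render_etcd_env(name, peers, cluster_state):
    """Build per-instance etcd settings.

    peers maps member name to peer URL; cluster_state is "new" when
    bootstrapping and "existing" when joining a running cluster.
    """
    if cluster_state not in ("new", "existing"):
        raise ValueError("cluster_state must be 'new' or 'existing'")
    if name not in peers:
        raise ValueError(f"{name} is not in the peer list")
    initial_cluster = ",".join(f"{n}={url}" for n, url in sorted(peers.items()))
    return {
        "ETCD_NAME": name,
        "ETCD_INITIAL_ADVERTISE_PEER_URLS": peers[name],
        "ETCD_INITIAL_CLUSTER": initial_cluster,
        "ETCD_INITIAL_CLUSTER_STATE": cluster_state,
    }


# Hypothetical hosts, for illustration only.
peers = {
    "etcd-a": "http://10.0.0.1:2380",
    "etcd-b": "http://10.0.0.2:2380",
}
env = render_etcd_env("etcd-b", peers, "existing")
for key, value in env.items():
    print(f"{key}={value}")
```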

15
More than one config
● Initial provisioning - Ansible
● Don't forget cloud-init
● Don't forget instance tags/labels/metadata
● Add a script to join the instance to the cluster based on its state

16
On instance launch
● Executes cloud-init
○ Provision disks
○ Execute the thin config script
● Thin config script talks to the cloud API to find which instances to connect to
○ We assume a machine account is used
○ We assume the instance has certain metadata keys
○ Thin config script is baked into the image
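
Reading one of the assumed metadata keys might look like the sketch below. It targets the GCE metadata server, whose endpoint and required `Metadata-Flavor: Google` header are part of the GCP documentation. The key name `etcd-cluster-name` and the function names are hypothetical. The request is built separately from sending it, so the code can be inspected outside of a cloud instance.

```python
import urllib.request

METADATA_ROOT = "http://metadata.google.internal/computeMetadata/v1"


def metadata_request(key):
    """Build the request for a custom instance metadata attribute."""
    url = f"{METADATA_ROOT}/instance/attributes/{key}"
    # The metadata server rejects requests that lack this header.
    return urllib.request.Request(url, headers={"Metadata-Flavor": "Google"})


def read_metadata(key, timeout=2.0):
    """Fetch the attribute value; only works from inside a GCE instance."""
    with urllib.request.urlopen(metadata_request(key), timeout=timeout) as resp:
        return resp.read().decode("utf-8")


# "etcd-cluster-name" is a hypothetical key used for illustration.
request = metadata_request("etcd-cluster-name")
print(request.full_url)
```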

17
Thin config script
● Talk to the cloud API, find instances
● Try to query the instances for service state
○ If cluster is alive, join cluster *
○ If no response from cluster
■ Check whether a cluster state backup already exists in the bucket
■ Restore state on instance
■ Adjust the config to the instances expected in the cluster
■ Launch service
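
The branching above can be sketched as a pure function, which keeps the decision testable without any cloud access. The names are hypothetical, and the last branch (bootstrapping a fresh cluster when there is neither a live peer nor a backup) is an assumption that the slide does not spell out.

```python
from enum import Enum


class Action(Enum):
    JOIN_EXISTING = "join"
    RESTORE_FROM_BACKUP = "restore"
    BOOTSTRAP_NEW = "bootstrap"


def decide(peer_responses, backup_exists):
    """Pick the startup action for this instance.

    peer_responses holds one boolean per discovered instance: True if
    its etcd service answered the state query.
    """
    if any(peer_responses):
        return Action.JOIN_EXISTING
    if backup_exists:
        return Action.RESTORE_FROM_BACKUP
    # Assumption: with nothing to join or restore, start a fresh cluster.
    return Action.BOOTSTRAP_NEW


print(decide([False, True], backup_exists=True).name)    # JOIN_EXISTING
print(decide([False, False], backup_exists=True).name)   # RESTORE_FROM_BACKUP
print(decide([], backup_exists=False).name)              # BOOTSTRAP_NEW
```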

18
The asterisk!
● If cluster is alive, join cluster *
● Cluster member list returns nodes, but does not show their state
○ Need to check if they are dead/alive
○ Prepare config which is in sync with cluster state
○ Join cluster…
● Race conditions when more than 2 instances are joining
○ Mitigate with a random sleep ;)
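
A minimal sketch of both workarounds, with hypothetical names: the member list is split by an injected liveness probe (in practice this could be a health request to each member), and the join is delayed by a random amount so that instances starting together are less likely to collide. The random delay only lowers the odds of a race; it does not remove it.

```python
import random


def split_members(members, is_alive):
    """Separate listed members into those that respond and those that do not."""
    alive = [m for m in members if is_alive(m)]
    dead = [m for m in members if m not in alive]
    return alive, dead


def join_delay(max_seconds, rng=random):
    """Random wait before joining, to spread out simultaneous joins."""
    return rng.uniform(0, max_seconds)


# Hypothetical member names; the probe result is faked with a set lookup.
responding = {"etcd-a", "etcd-c"}
alive, dead = split_members(["etcd-a", "etcd-b", "etcd-c"], lambda m: m in responding)
print(alive)  # ['etcd-a', 'etcd-c']
print(dead)   # ['etcd-b']
```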

20
Another try
● Make the thin config script simpler
○ Just wait until the current node is expected by the cluster
○ The code to bootstrap a fresh cluster is left as is
○ Remove any node management
● Add a script on the etcd leader to run cluster node management
○ There can be only one leader in the cluster
○ The code is already there
○ etcd refuses to add/remove nodes if doing so would render the cluster inoperable
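
The leader-side script can be reduced to a reconciliation step, sketched here under assumptions: all names are hypothetical, and the policy of removing only members that are both dead and no longer expected is one reasonable reading of the slide, not a confirmed detail. Non-leaders plan nothing, which is what removes the race between joining instances.

```python
def plan_membership(expected, members, alive, is_leader):
    """Return (to_add, to_remove) member names; empty plans on non-leaders.

    expected: instances the cloud API says should be in the cluster
    members:  names currently in the etcd member list
    alive:    members that answered a health probe
    """
    if not is_leader:
        return [], []
    to_add = sorted(e for e in expected if e not in members)
    # Assumed policy: drop a member only if it is dead AND no longer expected.
    to_remove = sorted(m for m in members if m not in expected and m not in alive)
    return to_add, to_remove


# Hypothetical state: etcd-b was replaced by a new instance, etcd-d.
to_add, to_remove = plan_membership(
    expected=["etcd-a", "etcd-c", "etcd-d"],
    members=["etcd-a", "etcd-b", "etcd-c"],
    alive=["etcd-a", "etcd-c"],
    is_leader=True,
)
print(to_add)     # ['etcd-d']
print(to_remove)  # ['etcd-b']
```

A new instance running the simplified thin script would then just poll the member list until its own name appears there.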

21
What if cluster has...
● No leader - no need to add/remove nodes
○ This usually leads to unhealthy instances
○ In edge cases this will trigger cluster destruction, recreation and a fresh restore
● Single leader - simpler cluster management
○ No more race conditions on start
● Multiple leaders
○ We can avoid that by limiting the number of instances

22
More than one config, again
● Initial provisioning - Ansible
● Don't forget cloud-init
● Add scripts
○ to join the instance to the cluster based on its state
○ to manage cluster nodes if leader

23
So now we have...
● Ansible - imperative or declarative, used to build the image
● cloud-init - declarative, but allows imperative steps
● Scripts - imperative
○ Shell
○ gcloud/awscli
○ pex + envtpl
○ etcdctl
● Terraform - declarative

24
Adding up tools
● Vagrant + Ansible
● Serverspec / Inspec / TestInfra
● Test Kitchen
○ Combines the above, but still falls a bit short for full-cluster tests

25
Worth seeing
● github.com/MonsantoCo/etcd-aws-cluster/ (shell)
● github.com/ocadotechnology/etcd-dynamic-cluster (Python)
● etcd rpc proto
● github.com/coreos/etcd-operator
Ummm wait, that's for containers…

26
Summing up
● Depending on the stage, we can choose a different solution
● Passing parameters from one stage to another
● Sometimes certain solutions are forced
● Sometimes you must make your own tools

Thank you
Krańcowa 5
02-493, Warsaw
Poland
+48 22 389 51 00
contact@codilime.com
