2. The Beginning
● Consul version 0.3
● A few dozen agents
● One DC
● Main usage: Internal LB
● Script checks
3. The expansion
● Growing to hundreds, then thousands of agents
● Mapping all infrastructure & services
● Creating automation for JSON + script generation (example below)
● Additional DC
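A minimal sketch of the kind of service definition this automation might emit (one service plus a legacy script check); the service name, port, tags, and script path are placeholders:

{
  "service": {
    "name": "my-backend",
    "port": 8080,
    "tags": ["prod", "lb"],
    "check": {
      "script": "/usr/local/bin/check_my_backend.sh",
      "interval": "10s"
    }
  }
}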
4. The Abuse
● Consul as Vault backend
● KV for state
● Users implement discovery over KV (sketch below)
● Consul locks for huge clusters
● Consul KV for reverse proxy
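One way the KV-based discovery and reverse-proxy pattern can be wired is a keyprefix watch that runs a handler whenever keys under a prefix change; the prefix and handler script below are hypothetical:

{
  "watches": [
    {
      "type": "keyprefix",
      "prefix": "services/my-app/",
      "handler": "/usr/local/bin/rebuild-proxy-upstreams.sh"
    }
  ]
}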
5. First signs of trouble
● IOwait on servers
● Slow KV updates (a few seconds)
● Sporadic DNS query failures
● Restarts cause Raft failures
9. More stability issues
● Network saturation on servers
● Raft failures: “No leader found”
● Service discovery / DNS failures
● Nodes fail to leave
10. The Problem
High register/deregister rates due to auto scaling
Node conflicts increase Raft traffic between Consul servers
Traffic bursts cause network saturation on the leader
Raft heartbeats from the other servers fail
A leader election is initiated
11. Stabilizing
● Better control on join & leave
● Add timestamp to node name
● Increase raft_multiplier (config sketch below)
● Upgrade consul version
● Cleaned up stuff
● Decrease reconnect_timeout
● Add TTL + allow_stale
● Add DNS caching daemon
● Migrate servers to ENA-enabled instances
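A sketch of several of these knobs in agent config JSON; the values, join addresses, and timestamped node name are illustrative only, and raft_multiplier (under performance) only matters on the servers:

{
  "node_name": "web-i-0abc123-20180115120000",
  "retry_join": ["10.0.0.10", "10.0.0.11"],
  "leave_on_terminate": true,
  "skip_leave_on_interrupt": false,
  "reconnect_timeout": "8h",
  "performance": {
    "raft_multiplier": 7
  }
}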
12. More stuff on the way
● Serf queue depth errors
● Reaping old nodes
● Bootstrap vs bootstrap-expect
● Security hole by default…
● enable-script-checks (config sketch below)
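For the last two items, a server-side sketch: bootstrap_expect instead of bootstrap, and script checks (which older releases allowed without any opt-in) left disabled unless they are really needed; the cluster size is illustrative:

{
  "server": true,
  "bootstrap_expect": 5,
  "enable_script_checks": false
}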
13. Successful upgrade to 1.0.3
Start the migration in a small region
Start with clients
Handle node-id cleanup
Gradually migrate agents, service by service
Demote the Raft protocol version for compatibility (sketch below)
Migrate one server
Migrate all servers
Migrate bigger region
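Demoting Raft here means pinning raft_protocol on the new 1.0.3 servers so they can participate alongside the old ones; a sketch, with the exact version depending on what the existing cluster speaks:

{
  "server": true,
  "raft_protocol": 2
}

consul operator raft list-peers is a convenient way to confirm the peer set between each step.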
14. Consul DNS tactics & configurations
1) Use allow_stale, set service_ttl, increase the UDP answer limit:
"dns_config": {
"allow_stale": true,
"max_stale": "28800s",
"node_ttl": "0s",
"udp_answer_limit": 30,
"service_ttl": {
"*": "5s"
}
}
2) Seed BIND with Consul service records as a fallback
3) Forward via BIND for easy fault tolerance (example below)
4) Use a DNS caching daemon (we use pdnsd) to reduce load on Consul and enable negative TTL
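For items 2 and 3, the usual BIND pattern is a forward zone pointing at the agent's DNS port (assuming the default 127.0.0.1:8600); DNSSEC validation normally has to be disabled for this zone to resolve:

// forward the "consul" TLD to the local Consul agent
zone "consul" IN {
    type forward;
    forward only;
    forwarders { 127.0.0.1 port 8600; };
};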