Presented by Ahmet Alp Balkan, Software Engineer, Microsoft
DNS-based dynamic service discovery is still an unsolved problem for Docker Swarm. This talk introduces a new open source project from Microsoft: “wagl”, a minimalist DNS service discovery solution built specifically for Docker Swarm clusters. It takes a single command to set up and works out of the box.
wagl is open source at: https://github.com/ahmetalpbalkan/wagl
wagl lets developers use domain names such as http://api.billing.swarm or memcached.swarm:11211 in their applications; these names resolve transparently into the IP addresses of containers spread across the Swarm cluster.
The session will also review other approaches to service discovery and common Swarm use cases, and will include a demo of creating Docker Swarm clusters on Azure in just a few clicks.
3. About The Speaker
I am Ahmet Alp Balkan, a software engineer at Microsoft.
I contribute to Open Source.
Follow me at @AhmetAlpBalkan.
4. About This Talk
Service Discovery
Service Discovery Methods
A peek into various solutions
Thought exercises
Service Discovery for Docker Swarm
Can a drop-in tool just do it™ for Swarm?
Where are we headed?
5. Before we begin
how many of you…
… use Docker Swarm?
… used a Service Discovery method?
… wrote or configured a DNS server?
22. Service Discovery Methods
before we begin…
SPOILER: No method is actually really good.
This is still an unsolved problem.
Thought exercise, not a comparison.
23. Service Discovery in my dreams…
it comes with the orchestrator
…or it is a “setup and forget about it”
does not infect the application code
with the service discovery concern,
uses a reliable networking stack,
does not have too many moving parts.
24. Common Approaches to Service Discovery
Overlay Networks
Mixing Tools (docker bridge + template + rev proxy)
Port Scanning
Domain Name Service
27. Overlay Networks
Good at container-to-container networking
Static port allocation not a problem
IP address per container
Seamless, does not change the application code
Introduces network latency overhead*
28. Docker 1.9 Multi-Host Networking
Container-to-container overlay network
Discovery* through /etc/hosts entries (DNS)
Lacking a load balancer (how to http://serviceA?)
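As a sketch of what /etc/hosts-based discovery gives you (the hosts entries below are made-up examples, not output from a real cluster), resolution is a plain name-to-IP lookup:

```python
# Sketch of /etc/hosts-based discovery: each container on the overlay
# network gets a line mapping its name to its overlay IP.
HOSTS_FILE = """\
127.0.0.1 localhost
10.0.0.2 serviceA
10.0.0.3 serviceB
10.0.0.4 serviceB.mynet
"""

def resolve(hosts_text, name):
    """Return the first IP mapped to `name`, like the libc resolver would."""
    for line in hosts_text.splitlines():
        fields = line.split()
        if len(fields) >= 2 and name in fields[1:]:
            return fields[0]
    return None

print(resolve(HOSTS_FILE, "serviceB"))  # -> 10.0.0.3
```

Note that a name maps to exactly one fixed address here, which is why the slide flags the lack of a load balancer.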
32. Reverse Proxies (TCP/IP Load Balancers)
HAProxy, NGINX, Træfik, kube-proxy…
Route and load-balance traffic to multiple backends
can route traffic from/to any port
list of backends can be dynamically updated
[diagram: ServiceA → Proxy → ServiceB backends at node1:32012, node2:33406, node7:32104]
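The dynamically updatable backend list above can be sketched as a small round-robin pool (illustrative Python, not any real proxy's implementation):

```python
import threading

class BackendPool:
    """What a reverse proxy keeps per service: a dynamically
    updatable backend list, rotated round-robin per request."""

    def __init__(self, backends):
        self._lock = threading.Lock()
        self._backends = list(backends)
        self._next = 0

    def add(self, backend):
        with self._lock:
            self._backends.append(backend)

    def remove(self, backend):
        with self._lock:
            self._backends.remove(backend)

    def pick(self):
        with self._lock:
            backend = self._backends[self._next % len(self._backends)]
            self._next += 1
            return backend

pool = BackendPool(["node1:32012", "node2:33406", "node7:32104"])
print([pool.pick() for _ in range(4)])  # cycles through backends, then wraps
```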
33. Reverse Proxies (TCP/IP Load Balancers) (cont.)
Load balancing
Health checks
do not route traffic to unhealthy backends
[diagram: Proxy sends health probes to ServiceB backends node1:32012, node2:33406, node7:32104; ServiceA traffic goes only to healthy ones]
34. Reverse Proxies (TCP/IP Load Balancers) (cont.)
Sticky sessions
route a client’s traffic to the same
backend between requests/connections
Origin-based access control (ACLs)
[diagram: ServiceA (172.0.1.10) and ServiceC (172.0.1.22) each routed to ServiceB through the proxy]
35. Reverse Proxies (TCP/IP Load Balancers) (cont.)
Connection draining*
wait for all connections to close before
removing the backend from the routing list
Allows blue-green deployments
flip the switch → the new version of
your service starts getting traffic
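A minimal sketch of the blue-green switch described above, assuming two hypothetical backend pools and an "active" pointer the proxy flips:

```python
class BlueGreenProxy:
    """Blue-green switch sketch: the proxy holds two backend pools
    and an active pointer; flipping the pointer sends new traffic to
    the new version while draining connections finish on the old one."""

    def __init__(self, blue, green):
        self.pools = {"blue": blue, "green": green}
        self.active = "blue"

    def route(self):
        # new requests always go to the currently active pool
        return self.pools[self.active]

    def flip(self):
        self.active = "green" if self.active == "blue" else "blue"

proxy = BlueGreenProxy(blue=["v1-node1:8080"], green=["v2-node1:8080"])
proxy.flip()          # the new version starts getting traffic
print(proxy.route())  # -> ['v2-node1:8080']
```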
36. Reverse Proxies (TCP/IP Load Balancers) (cont.)
Downside: another moving part that can fail
what if the proxy server crashes?
Downside: discovery of the proxy server itself
where do you place the proxy server(s)?
what happens when they get rescheduled
to another host?
how do you discover proxy servers?
Downside: introduces latency overhead
38. Interlock by @ehazlett
Discover new containers through the Docker Events API
on container events, update NGINX/HAProxy
Plugin model
write your own event handler
[diagram: docker engine emits start/stop/die events → interlock gets container details, updates nginx.conf, and SIGHUPs nginx]
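The Interlock-style event loop can be sketched like this (the events below are hard-coded stand-ins for the Docker Events API stream; Interlock itself is written in Go and does considerably more):

```python
# Sketch: react to Docker start/stop/die events by updating a backend
# list, from which a proxy config would be regenerated.

def handle_events(events):
    backends = set()
    for event in events:
        if event["status"] == "start":
            backends.add(event["addr"])       # "get container details"
        elif event["status"] in ("stop", "die"):
            backends.discard(event["addr"])
        # a real handler would rewrite nginx.conf and SIGHUP nginx here
    return sorted(backends)

events = [
    {"status": "start", "addr": "node1:32012"},
    {"status": "start", "addr": "node2:33406"},
    {"status": "die",   "addr": "node1:32012"},
]
print(handle_events(events))  # -> ['node2:33406']
```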
39. Registrator by @progrium
Discover new containers through the Docker Events API
Writes service definitions to consul/etcd
[diagram: docker engine emits start/stop/die events → registrator gets containers and saves service definitions to consul]
40. Registrator by @progrium (cont.)
You can then use consul-template/confd to
update the haproxy/nginx backend list
[diagram: docker engine events → registrator → consul; consul-template watches consul and updates nginx.conf]
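What consul-template does for nginx can be sketched as a render step over the current backend list (illustrative Python; real consul-template uses Go templates against consul's catalog):

```python
# Sketch: regenerate an nginx upstream block whenever the backend
# list stored in consul changes.

def render_upstream(service, backends):
    lines = [f"upstream {service} {{"]
    lines += [f"    server {b};" for b in backends]
    lines.append("}")
    return "\n".join(lines)

print(render_upstream("serviceB", ["node1:32012", "node2:33406"]))
```

After rendering, the watcher would write the file and reload nginx, closing the loop shown in the diagram.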
43. Mixin’ Tools
Far too many moving parts
How do you deploy these components HA?
You still have N points of failure & additional latency
The connection draining feature is a lie:
unless the orchestrator coordinates with the
reverse proxy, stopping the container will
just drop the connections.
44. Connection draining done right
kube-proxy handles load balancing in Kubernetes.
When you stop a pod, it is not stopped right away.
Remaining open connections stay alive for T.
(T=grace period, configurable)
Also pre-start/post-start hooks for containers in pods
“Zero downtime rolling upgrades in 1M requests/sec”
http://blog.kubernetes.io/2015/11/one-million-requests-per-second-dependable-and-dynamic-distributed-systems-at-scale.html
46. Port Scanning in Overlay Networks
by Jeff Nickoloff (github.com/allingeek/nmap-sd)
Add connected containers to a network
(such as Docker 1.9 overlay driver)
Scan open ports in the network’s subnet periodically
(as long as your subnet is small, it’s very reasonable)
Reports accessible ports to a file (bind volume)
Refresh reverse proxy config, route the traffic!
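A minimal connect-scan sketch of this approach (a Python stand-in for nmap; the demo opens a local listener so the scan has something to find):

```python
import socket

def open_ports(host, ports, timeout=0.2):
    """Try a TCP connect to each candidate port; report those that accept."""
    found = []
    for port in ports:
        with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as s:
            s.settimeout(timeout)
            if s.connect_ex((host, port)) == 0:
                found.append(port)
    return found

# demo: open one listening socket, then scan for it
server = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
server.bind(("127.0.0.1", 0))
server.listen(1)
port = server.getsockname()[1]
print(open_ports("127.0.0.1", [port]))  # the listener's port is reported
server.close()
```

A real scanner would sweep the whole overlay subnet periodically, which stays cheap only while the subnet is small, as the slide notes.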
48. Motives for DNS
Started in 1984, roughly at the same time as TCP/IP
Humans suck at remembering IP addresses
google.com → 2a00:1450:4003:806::200e
and IP addresses do not stick around forever
Can this 30-year-old tech save us?
51. Intro to DNS Resource Records
Type A/AAAA records
<hostname> → <IP>
$ dig A +short docker.com.
52.7.79.61
52.22.96.108
54.84.192.71
ugly truth: has no port information
can’t support dynamic port-assigned
containers :(
52. Intro to DNS Resource Records
Type SRV records
<hostname> → <IP, port, weight>
$ dig SRV +short _database._tcp.local.
1 1 32770 192.168.0.4
1 1 32769 192.168.0.7
1 1 32801 192.168.0.6
ugly truth: SRV is neither widely used
nor getting adopted.
new MySQLDriver(“_database._tcp.local”)
ain’t happenin'
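Since `dig SRV +short` prints "priority weight port target", the parsing step an application would have to do itself looks roughly like this (a sketch, not a library):

```python
# Parse `dig SRV +short` output into usable (host, port) pairs.

def parse_srv(short_output):
    records = []
    for line in short_output.strip().splitlines():
        priority, weight, port, target = line.split()
        records.append((int(priority), int(weight),
                        target.rstrip("."), int(port)))
    records.sort()  # lowest priority first; weight would drive a
                    # weighted random choice among equal priorities
    return [(host, port) for _, _, host, port in records]

out = """\
1 1 32770 192.168.0.4
1 1 32769 192.168.0.7
1 1 32801 192.168.0.6
"""
print(parse_srv(out))
```

This is exactly the boilerplate that drivers taking only a host and port force on every caller, which is the adoption problem the next slide describes.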
53. Bad News
SRV is cool but not getting any adoption at all.
We are left with A/AAAA records = IP addresses
Works if all your instances are on static ports
(such as docker run -p 80:80)
When you do dynamic ports (docker run -P),
you need to resolve the port from SRV rec.
host, port = resolveSRV(“_database._tcp.local”)
… = new MySQLDriver(host, port)
you don’t want to do this all the time
54. DNS
Advantage: very simple, far less moving parts
Disadvantage: goodbye dynamic port allocation
Advantage: reduces load on middleware (DNS TTL)
Disadvantage: some languages* do not obey TTLs
Advantage: uses existing network stack
Disadvantage: no resilient way to do health checks
Advantage: load balancing by shuffling IPs :)
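The TTL-caching and answer-shuffling behaviors above can be sketched in a few lines (a toy resolver callback stands in for a real DNS server):

```python
import random
import time

class DnsCache:
    """Sketch of two DNS properties: answers are cached until the TTL
    expires (reducing load on the server), and the IP list is shuffled
    on every answer, spreading clients across backends."""

    def __init__(self, resolver):
        self._resolver = resolver   # name -> (ips, ttl_seconds)
        self._cache = {}            # name -> (ips, expires_at)

    def resolve(self, name, now=None):
        now = time.monotonic() if now is None else now
        ips, expires = self._cache.get(name, ([], 0.0))
        if now >= expires:          # TTL expired: query the server
            ips, ttl = self._resolver(name)
            self._cache[name] = (ips, now + ttl)
        return random.sample(ips, len(ips))   # shuffled copy

# toy resolver: a fixed answer with a 30-second TTL
cache = DnsCache(lambda name: (["10.0.0.2", "10.0.0.3", "10.0.0.4"], 30))
print(sorted(cache.resolve("api.swarm")))
```

A client runtime that ignores TTLs would simply never call the resolver again, which is the disadvantage flagged above.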
59. SkyDNS
github.com/skynetservices/skydns2
Very similar to Mesos-DNS.
Closely coupled to etcd.
Really complicated, probably does everything.
Kinda hard to set up, too.
Embraces a plugin model, but the only plugin is etcd.
Used by Kubernetes as default DNS add-on.
62. Service Discovery in my dreams…
it comes with the orchestrator
…or it is a “setup and forget about it”
does not infect the application code
with the service discovery concern,
uses a reliable networking stack,
does not have too many moving parts.
67. More labels…
docker run -p 80:80 \
  -l dns.service=api \
  -l dns.domain=billing \
  nginx
http://api.billing.swarm.
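The label-to-name mapping can be sketched as follows (a simplified reconstruction of the scheme, not wagl's actual code; it assumes the dns.service and dns.domain labels and the default .swarm zone):

```python
# Build the DNS name a container is registered under from its labels.

def dns_name(labels, zone="swarm"):
    parts = [labels["dns.service"]]
    if "dns.domain" in labels:
        parts.append(labels["dns.domain"])
    parts.append(zone)
    return ".".join(parts) + "."

print(dns_name({"dns.service": "api", "dns.domain": "billing"}))
# -> api.billing.swarm.
```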
68. Features
Only DNS A/SRV records
Natural Load Balancing by shuffling DNS records
External DNS recursion
Works well with Docker TLS authentication
69. Deploying wagl
Just run:
docker run -d --restart=always --name=dns \
  -p 53:53/udp \
  --link=swarm-master:swarm \
  ahmet/wagl \
  wagl --swarm tcp://swarm:3376
If it can get any easier, it means I have failed.
73. Where are we headed?
These are just baby steps (expect innovation here)
We need a complete and seamless solution
The solution will not change the application code
A combination of DNS + Reverse Proxy can be it
Watch for what orchestrators are going to adopt