
How Yelp does Service Discovery



This is a talk that I gave at the San Francisco DevOps meetup on 9/29/15. I talk about how Yelp performs service discovery using SmartStack and Docker.

Published in: Technology


  1. 1. SmartStack, Docker and Yocalhost How Yelp Does Service Discovery
  2. 2. [Demo]
  3. 3. ● This works from (almost) any host in Yelp ● This works from Python, Java, command line etc. ● If a service supports HTTP or TCP then it can be made discoverable. ○ This includes third-party services such as MySQL and scribe ● It’s dynamic: for a given service, if new instances are added then they will automatically become available. Very Important Things to Note
  4. 4. ● SmartStack (nerve and synapse) was written by Airbnb ● We’ve added some features ● The work here has been carried out by many people across Yelp Credits
  5. 5. Registration
  6. 6. Architecture hacheck service_1 service_2 service_3 Service host ZK nerve
  7. 7. Nerve registers service instance in ZooKeeper: /nerve/region:myregion ├── service_1 │ └── server_1_0000013614 ├── service_2 │ └── server_1_0000000959 ├── service_3 │ ├── server_1_0000002468 │ └── server_2_0000002467 [...] ZooKeeper data
  8. 8. The data in a znode is all that is required to connect to the corresponding service instance. We’ll shortly see how this is used for discovery. { "host":"", "port":31337, "name":"server_1", "weight":10 } ZooKeeper data
  9. 9. hacheck Normally hacheck just acts as a transparent proxy for our healthchecks: $ curl -s yocalhost:6666/http/service_1/1234/status | jq . { "uptime": 5693819.315988064, "pid": 2595160, "host": "server_1", "version": "b6309e09d71da8f1e28213d251f7c3515878caca" }
  10. 10. hacheck We can also use it to fail healthchecks before we shut down a service. This allows us to shut a service down gracefully. (It also provides a 1s cache to limit the healthcheck rate.) $ hadown service_1 $ curl -v yocalhost:6666/http/service_1/1234/status Service service_1 in down state since 1443217910: billings
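The two hacheck behaviours above (forced-down state and a 1s result cache) can be sketched as a toy model. This is not hacheck's actual code: the class name, method names, and the 503 status returned for a downed service are assumptions made for illustration.

```python
import time
from typing import Callable, Dict, Tuple

Result = Tuple[int, str]  # (HTTP status code, body)


class HealthcheckProxy:
    """Toy model of a healthcheck proxy; hypothetical, not hacheck itself."""

    def __init__(self, check: Callable[[str, int], Result],
                 cache_ttl: float = 1.0,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self._check = check            # performs the real healthcheck
        self._cache_ttl = cache_ttl    # seconds a result is reused
        self._clock = clock
        self._down: Dict[str, str] = {}  # service name -> reason
        self._cache: Dict[Tuple[str, int], Tuple[float, Result]] = {}

    def mark_down(self, service: str, reason: str = "") -> None:
        """Rough equivalent of `hadown service`."""
        self._down[service] = reason

    def mark_up(self, service: str) -> None:
        self._down.pop(service, None)

    def status(self, service: str, port: int) -> Result:
        # A downed service fails its healthcheck without being contacted.
        if service in self._down:
            reason = self._down[service]
            return 503, f"Service {service} in down state: {reason}"
        key = (service, port)
        now = self._clock()
        cached = self._cache.get(key)
        if cached is not None and now - cached[0] < self._cache_ttl:
            return cached[1]  # reuse a recent result to limit check rate
        result = self._check(service, port)
        self._cache[key] = (now, result)
        return result
```

Marking a service down before stopping it lets the load balancer drain traffic away while the process is still able to finish in-flight requests.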
  11. 11. How do we know what services to advertise? Every service host periodically runs a script to regenerate the nerve configuration, reading from the following sources: ● yelpsoa-configs runs_on: server_1 server_2 ● puppet nerve_simple::puppet_service {'foo'} ● mesos slave API
  12. 12. Discovery
  13. 13. Architecture ZK client synapse haproxy nerve
  14. 14. HAProxy ● By default bind to ● Bind only to yocalhost on public servers. ● HAProxy gives us a lot of goodies for all clients: ○ Redispatch on connection failures ○ Zero-downtime restarts (once you know how :) ○ Easy to insert connection logging ● Each host also exposes an HAProxy status page for easy introspection
  15. 15. Every client host periodically runs a script to regenerate the synapse configuration, reading service definitions from yelpsoa-configs. For each service, the script reads a smartstack.yaml file. It restarts synapse if the configuration has changed.
  16. 16. smartstack.yaml main: proxy_port: 20973 mode: http healthcheck_uri: /status timeout_server_ms: 1000
  17. 17. Namespaces main: proxy_port: 20001 mode: http healthcheck_uri: /status timeout_server_ms: 1000 long_timeout: proxy_port: 20002 mode: http healthcheck_uri: /status timeout_server_ms: 3000 Same service, different ports
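As a rough sketch of how a client might use these namespaces, the snippet below turns the parsed smartstack.yaml from the slide into one local proxy endpoint per namespace. The `service.namespace` naming and the `proxy_endpoints` helper are assumptions for illustration; the slides only establish that each namespace gets its own proxy port on yocalhost.

```python
from typing import Any, Dict

# Parsed form of the smartstack.yaml shown on the slide above.
SMARTSTACK: Dict[str, Dict[str, Any]] = {
    "main": {
        "proxy_port": 20001,
        "mode": "http",
        "healthcheck_uri": "/status",
        "timeout_server_ms": 1000,
    },
    "long_timeout": {
        "proxy_port": 20002,
        "mode": "http",
        "healthcheck_uri": "/status",
        "timeout_server_ms": 3000,
    },
}


def proxy_endpoints(service: str,
                    smartstack: Dict[str, Dict[str, Any]],
                    proxy_host: str = "yocalhost") -> Dict[str, str]:
    """Map each 'service.namespace' to the local proxy address for it."""
    endpoints = {}
    for namespace, conf in smartstack.items():
        # Anything that is not HTTP is treated as plain TCP here.
        scheme = "http" if conf.get("mode") == "http" else "tcp"
        endpoints[f"{service}.{namespace}"] = (
            f"{scheme}://{proxy_host}:{conf['proxy_port']}"
        )
    return endpoints
```

A client that needs the longer server timeout would connect to the `long_timeout` port instead of `main`; both reach the same service instances.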
  18. 18. Escape hatch Some client libraries like to do their own load balancing, e.g. cassandra, memcached. Use synapse to dump the registration information to disk: $ cat /var/run/synapse/services/devops.demo.json | jq . [ { "host":"", "port":31337, "name":"server_1", "weight":10 } ]
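A client that does its own load balancing could consume a dump like this along the following lines. This is a minimal sketch: the sample data is made up (the hosts are documentation-range placeholders and the second server is hypothetical), and weighted random choice is just one possible policy, not something the slides prescribe.

```python
import json
import random
from typing import Any, Dict, List, Tuple

# Hypothetical dump contents in the same shape as the file shown above.
SAMPLE_DUMP = """
[
  {"host": "192.0.2.10", "port": 31337, "name": "server_1", "weight": 10},
  {"host": "192.0.2.11", "port": 31337, "name": "server_2", "weight": 30}
]
"""


def load_backends(text: str) -> List[Dict[str, Any]]:
    """Parse the JSON list of registered service instances."""
    return json.loads(text)


def pick_backend(backends: List[Dict[str, Any]],
                 rng: Any = random) -> Tuple[str, int]:
    """Pick one instance at random, proportionally to its weight."""
    weights = [b.get("weight", 1) for b in backends]
    chosen = rng.choices(backends, weights=weights, k=1)[0]
    return chosen["host"], chosen["port"]
```

In practice the text would come from reading the dumped file, re-read whenever it changes so that new instances are picked up.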
  19. 19. Docker + Yocalhost
  20. 20. Architecture haproxy docker container 1 lo docker container 2 lo eth0 eth0 docker0 eth0 lo:0 lo
  21. 21. yocalhost ● We’d like to run only one nerve / synapse / haproxy per host ● What address should we bind haproxy to? ● won’t work from within a container ● Instead we pick a link-local address (yocalhost) ● This also works on servers without docker
  22. 22. Locality-aware discovery
  23. 23. Overview We run services in both our own datacenters and AWS. We logically group these environments according to latency. Service authors get to decide how ‘widely’ their service instances are advertised. Everything is controlled via smartstack.yaml files.
  24. 24. Latency hierarchies ● habitat: datacenters or AZs in AWS ● region: habitats within 1ms round-trip, e.g. ‘us-west-1’ ● superregion: regions within 5ms round-trip, e.g. ‘pacific north-west’ ● Diagram label: ZooKeepers live here
  25. 25. main: proxy_port: 20973 advertise: [habitat] discover: habitat advertise / discover Synapse should look in the habitat directory in its local ZooKeeper Nerve should register this service in the habitat directory of its local ZooKeeper
  26. 26. ZooKeeper data, revisited /nerve ├── region:us-west-1 │ └── service_1 │ └── server_1_0000013614 ├── region:us-west-2 │ └── service_2 │ └── server_2_0000000959 [...]
  27. 27. Extra advertisements “Wouldn’t it be useful if we could make a service running in datacenter A available in an (arbitrary) datacenter B?” Why? ● Makes it easier to bring up a new datacenter ● Makes it easier to add more capacity to a datacenter in an emergency ● Makes it easier to keep a datacenter going in an emergency if a service fails
  28. 28. main: advertise: [region] discover: region extra_advertise: region:us-west-1: [region:us-west-2] extra_advertise
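One way to read advertise and extra_advertise together is as a rule for which ZooKeeper paths a service instance is registered under. The sketch below is an interpretation under stated assumptions, not nerve's actual logic: the `registration_paths` helper, the `location` mapping, and the habitat name are hypothetical, and the path layout follows the `/nerve/region:...` tree shown earlier.

```python
from typing import Dict, List


def registration_paths(service: str,
                       location: Dict[str, str],
                       advertise: List[str],
                       extra_advertise: Dict[str, List[str]],
                       root: str = "/nerve") -> List[str]:
    """Return the ZooKeeper directories an instance would register in.

    location maps a location type to this host's value for it,
    e.g. {"habitat": "habitat_a", "region": "us-west-1"}.
    """
    paths: List[str] = []
    # Normal advertisement: one directory per advertised location type.
    for loc_type in advertise:
        paths.append(f"{root}/{loc_type}:{location[loc_type]}/{service}")
    # Extra advertisement: if this host is in a source location,
    # also register in each listed destination location.
    for source, targets in extra_advertise.items():
        src_type, src_name = source.split(":", 1)
        if location.get(src_type) == src_name:
            for target in targets:
                path = f"{root}/{target}/{service}"
                if path not in paths:
                    paths.append(path)
    return paths
```

With the configuration on the slide, an instance running in us-west-1 would register under both `region:us-west-1` and `region:us-west-2`, while an instance in us-west-2 would register only under its own region.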
  29. 29. Design choices
  30. 30. Unix 4eva ● Lots of little components, each doing one thing well ● Very simple interface for clients and services ○ If it speaks TCP or HTTP we can register it ● Easy to independently replace components ○ HAProxy -> NGINX? ● Easy to observe behavior of components
  31. 31. It’s OK if ZooKeeper fails ● Nerve and Synapse keep retrying ● HAProxy keeps running but with no updates ● HAProxy performs its own healthchecks against service instances ○ If a service instance becomes unavailable then it will stop receiving traffic after a short period ● The website stays up :)
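The "keep retrying, keep serving the last known state" behaviour can be sketched as follows. This is a simplified model, not synapse's implementation: the `BackendWatcher` class, the `fetch` callable, and the exponential backoff values are assumptions for illustration.

```python
import time
from typing import Callable, List


class BackendWatcher:
    """Keeps the last known backend list while the registry is unreachable."""

    def __init__(self, fetch: Callable[[], List[str]],
                 sleep: Callable[[float], None] = time.sleep,
                 base_delay: float = 1.0,
                 max_delay: float = 30.0) -> None:
        self._fetch = fetch          # reads backends from the registry
        self._sleep = sleep
        self._base_delay = base_delay
        self._max_delay = max_delay
        self._delay = base_delay
        self.backends: List[str] = []  # last successfully fetched list

    def poll_once(self) -> bool:
        """Try one refresh; on failure keep the old list and back off."""
        try:
            self.backends = list(self._fetch())
        except ConnectionError:
            # Registry is down: do not clear self.backends, just wait
            # longer before the next attempt (capped at max_delay).
            self._sleep(self._delay)
            self._delay = min(self._delay * 2, self._max_delay)
            return False
        self._delay = self._base_delay
        return True
```

The proxy in front of the services keeps using `backends` throughout an outage, and its own healthchecks remove instances that stop responding, which is why a registry failure does not take traffic down.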
  32. 32. Does it scale? ● We used to have scaling issues with internal load balancers; this is not a problem with SmartStack :) ● We hit some scaling issues at tens of thousands of ZooKeeper connections ○ Addressed this by using just a single ZooKeeper connection from each nerve and synapse ● We used to have lots of HAProxy healthchecks hitting services ○ hacheck insulates services from this ○ We limit HAProxy restart rate
  33. 33. What about etcd / consul / …? ● We try to use boring components :) ● We’re already using ZooKeeper for Kafka and Elasticsearch so it’s natural to use it for our service discovery system too. ● etcd would probably also work, and is supported by SmartStack ● Conceptually similar to consul / consul-template
  34. 34. What about DNS? ● What TTL are you going to use? ● Are your clients even going to honor the TTL? ● Does the DNS resolution happen inline with requests?
  35. 35. Conclusions ● We’ve used SmartStack to create a robust service discovery system ● It’s UNIXy: lots of separate components, each doing one thing well ● It’s flexible: locality-aware discovery ● It’s reliable: new devs at Yelp view discovery as a solved problem ● It’s useful: SmartStack is the glue that holds our SOA together