Taskerman - a distributed cluster task manager

This talk is about Taskerman, a distributed cluster task manager built on top of AWS SQS, Zookeeper and Yelp PaaSTA. The talk was given at Imperial College, London as part of its 'Application of Computing in Industry' series: http://www.imperial.ac.uk/computing/industry/aci/yelp/

Published in: Software

  1. 1. Raghavendra Prabhu rprabhu@yelp.com / @randomsurfer Distributed Systems Data Team Taskerman A Distributed Cluster Task Manager
  2. 2. Yelp’s Mission Connecting people with great local businesses.
  3. 3. Some numbers! ● 30 million unique mobile app users. ● 74 million UMVs via mobile web. ● 84 million UMVs via desktop. ● More than 142 million rich, local reviews. ● 78% of all searches on Yelp came from mobile. ● 9 offices around the world. ● 4,000+ employees worldwide and 400+ engineers (as of Q3 2017)
  4. 4. Datastore Ecosystem @
  5. 5. …. ● Memcached ● Redis ● Spark ● Redshift ● DynamoDB ● S3 ● OSCON talk on data tiers at Yelp And many more!
  6. 6. Distributed Systems Team ● Several TB in prod Cassandra clusters, with tens of nodes in each. ● Half a million messages/second in our streaming pipeline ● Several TB in Elasticsearch in prod, with several hundred nodes ● All are multi-AZ multi-region ● And growing…
  7. 7. Datastores: Pets or Cattle?
  8. 8. Maintenance Cost Engineering Efficiency Scalability
  9. 9. ● Safe ● Generic and Extensible ● Distributed ● Loosely coupled ● Not ad-hoc ○ Reviewed ● Sound config management The Why I
  10. 10. ● Schedulable ● Reusable ● Cluster awareness ● Easily maintainable and observable ○ Not a black box. ○ More Ironman, less Ultron ● Prior Art ○ Downsides The Why II
  11. 11. ● Paramount* ● Serialized execution ○ ‘m’ out of ‘n’ ○ Disjoint jobs. ● Avoid cascade ● Privilege escalation ● Push-based * Unless oncall is automated too. Safety
  12. 12. Quotes “There are only two hard problems in distributed systems: 2. Exactly-once delivery 1. Guaranteed order of messages 2. Exactly-once delivery” @mathiasverraes “There are 2 hard problems in computer science: cache invalidation, naming things, and off-by-1 errors.” @secretGeek
  13. 13. ● Network is reliable ● Latency is zero ● Bandwidth is infinite ● Network is secure ● One administrator ● Transport cost is zero ● Network is homogeneous ● Topology doesn't change Fallacies of Distributed Systems
  14. 14. Taskerman
  15. 15. ● Scheduler ● Router ● Co-ordinator ● Transport ● Executor ● Error handler ● Configuration ● Monitoring ● Tooling Components
  16. 16. RouterQueue Q2 Q1 Q3 Dead Letter Queue T1 T2 T3 Lease Failure SQS queue (Qx - node queue) Flow of task Task Scheduler Nodes
  17. 17. ● Runs on Chronos ● Emits a task ● Enqueues into global queue ● Ad-hoc support ● Deployment granularities ● Task tracking ● Yelpsoa-configs Task Scheduler
  18. 18. # Anatomy of a Taskerman Task { 'action': 'heartbeat:ping', 'version': 'X.Y', 'limit': 2, # To limit executors. 'cluster_name': 'cassandra:geo_counter', 'discovery': 'aws_tags', 'owner': 'distsys', # For alerting 'task_id': 'f2f6e03f-539a-49ad-8bf0-ecf079df36f5', # For status tracking 'taskerman_params': { 'action_args': { }, 'workqueue_args': { }, }, 'nodes': [a,b,c,d], # For passthrough mode. 'destnode': '', # Mutated by router for DLQ }
  19. 19. ● AWS SQS ● Best-effort FIFO ● Reliable and cheap ● Low latency ● Properties ○ Read without delete ○ Visibility timeout ○ Retry ○ Dead Letter Queue Queue
  20. 20. ● Stateless Marathon worker ● Routes tasks from global queue ○ To node-specific queues ● At-least-once delivery ● Queue creation ○ Top-down principle ● Passthrough Task Router
  21. 21. ● ‘DNS of Taskerman’ ● Cluster => [..,Node IP,..] ● EC2 tags or Smartstack ● Liveness filtering ● Pluggable ● Challenges ○ Rate-limiting Task Router :: Discovery
  22. 22. ● The executor of Taskerman ● Dequeues a task and executes it ○ Pre-defined reviewed code. ● Scheduled on node ● Zookeeper for coordination ● Task deleted upon success ● Dead letter queue upon failed retries Taskrunner
  23. 23. class TestTaskRunner(TaskRunner): def __init__(self, task,..): .. def pre_check(self): .. def post_check(self): .. def execute_action(self): ..
  24. 24. ● Distributed Coordinator ● Non-blocking lease ○ Time-based lease ○ Mutual exclusion ○ Global lease ● Ephemeral locks Zookeeper
  25. 25. ● Atomic Counters ○ Statistics on actions ○ Circuit breakers ■ Dead man’s switch ■ Prevent failure cascade ○ Automatic reset ● What is Atomic ○ Serializability Zookeeper
  26. 26. ● Staleness ○ Nodes can go down ● Garbage collection ○ Cleanup of ZK data structures ● Composition ● Starvation ● Uptime Zookeeper: Challenges
  27. 27. ● Failure is the norm, not an exception ● Multiple vectors of failure ● Pessimistic approach ○ Job retry ○ Job Counter ● Mitigation vs Alerting Failure
  28. 28. ● Heartbeat ping ○ End-to-end monitoring ● Dead Letter Queue ○ Recycle bin of failed tasks. ○ Hooks into human side of monitoring ● Others ○ Separation of state ○ Mutability Failure handling
  29. 29. Debugging distributed systems
  30. 30. ● Observability ○ Job status logging at stages ○ End-to-end logging with Scribe ○ Metrics ○ Signalfx (Queue lengths) ○ Splunk dashboards ● Alerting ○ Sensu ○ Signalfx Monitoring
  31. 31. ● Restarts ● Reboots ● Instance Replacement ● Integration tests ● Kafka config mgmt ● Backup and restore ● Search indexing ● .. and many more. Use cases
  32. 32. Uptime management $ uptime 06:52:54 up 99 days, 19:20, 1 user, load average: 0.02, 0.03, 0.07 $ ps -eo pid,cmd,lstart | grep .. 10058 zookeeper Tue Dec 5 05:23:43 2017
  33. 33. Q & A ● Slides will also be uploaded to slideshare.net/slidunder.
  34. 34. www.yelp.com/careers/ We're Hiring!
  35. 35. @YelpEngineering fb.com/YelpEngineers engineeringblog.yelp.com github.com/yelp
  36. 36. ● https://www.elastic.co/products/elasticsearch ● https://zookeeper.apache.org/ ● https://kafka.apache.org/ ● https://www.flickr.com/photos/dapuglet/6291424431 ● http://www.alamy.com/stock-photo/cattle-penning.html ● http://www.firstcallsigns.co.uk/content/images/thumbs/0000927_EE80127.jpeg ● https://sensuapp.org/img/logo-flat-white.png ● https://thumbs.gfycat.com/FocusedCompetentEyas-max-1mb.gif ● https://www.percona.com/sites/default/files/dashboard.png ● https://www.sales-initiative.com/downloads/2856/download/resilience.jpg?cb=29f43ac82cea225ab3ee370d7580760d ● http://izquotes.com/quotes-pictures/quote-a-distributed-system-is-one-in-which-the-failure-of-a-computer-you-didn-t-even-know-existed-can-leslie-lamport-346227.jpg ● https://pbs.twimg.com/media/DRCfqaCWsAczqTz.jpg ● https://upload.wikimedia.org/wikipedia/en/thumb/e/e0/Iron_Man_bleeding_edge.jpg/220px-Iron_Man_bleeding_edge.jpg ● https://github.com/mesos/chronos ● https://github.com/mesosphere Image Credits
  37. 37. ● http://www.networknuts-web.biz/wp-content/uploads/2014/10/cron-logo.png ● http://www.pvhc.net/img195/ojfspebrvfblupftgajb.png ● https://fun-damentals.com/wp-content/uploads/2016/05/a-resilience.png ● http://www.azquotes.com/picture-quotes/quote-debugging-is-twice-as-hard-as-writing-the-code-in-the-first-place-therefore-if-you-write-brian-kernighan-66-91-06.jpg ● https://thenounproject.com/ ● https://aws.amazon.com/ ● https://www.splunk.com/ ● https://www.terraform.io/ ● http://yelp.com ● http://cloudscaling.com/blog/cloud-computing/the-history-of-pets-vs-cattle/ Image Credits
  38. 38. ● https://engineeringblog.yelp.com/2015/03/using-services-to-break-down-monoliths.html ● http://cloudscaling.com/blog/cloud-computing/the-history-of-pets-vs-cattle/ ● https://martinfowler.com/bliki/TwoHardThings.html ● https://zookeeper.apache.org/ ● https://www.terraform.io/ ● https://github.com/Yelp/service-principles ● https://en.wikipedia.org/wiki/Law_of_Demeter Further Reading
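
Sketch for slide 18 (task anatomy). The slide shows a task as a plain dictionary. The following is a minimal sketch of how such a task might be built and sanity-checked before it is enqueued. The helper names make_task and validate_task, and the choice of which fields count as mandatory, are assumptions for illustration and are not taken from Taskerman itself.

```python
import uuid

# Fields treated as mandatory in this sketch; the choice is an assumption.
REQUIRED_FIELDS = ("action", "version", "limit", "cluster_name",
                   "discovery", "owner", "task_id")


def make_task(action, version, cluster_name, owner, limit=1,
              discovery="aws_tags", nodes=None):
    """Build a task dict shaped like the one shown on the slide."""
    return {
        "action": action,
        "version": version,
        "limit": limit,                    # limits the number of executors
        "cluster_name": cluster_name,
        "discovery": discovery,
        "owner": owner,                    # used for alerting
        "task_id": str(uuid.uuid4()),      # used for status tracking
        "taskerman_params": {"action_args": {}, "workqueue_args": {}},
        "nodes": list(nodes or []),        # used in passthrough mode
        "destnode": "",                    # filled in later by the router
    }


def validate_task(task):
    """Raise ValueError if the task lacks a field this sketch requires."""
    missing = [field for field in REQUIRED_FIELDS if field not in task]
    if missing:
        raise ValueError("task is missing fields: " + ", ".join(missing))
    if not isinstance(task["limit"], int) or task["limit"] < 1:
        raise ValueError("limit must be a positive integer")
    return task


task = validate_task(
    make_task("heartbeat:ping", "X.Y", "cassandra:geo_counter", "distsys", limit=2)
)
```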
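
Sketch for slide 19 (queue properties). Read without delete, visibility timeout, retry and the dead letter queue interact in a way that is easier to see in code. The following is an in-memory simulation only, not the SQS API: the class SimulatedQueue and its parameters are hypothetical, and time is passed in explicitly so the behaviour is deterministic.

```python
class SimulatedQueue:
    """In-memory stand-in for a queue with SQS-like delivery semantics."""

    def __init__(self, visibility_timeout, max_receives, dead_letter_queue=None):
        self.visibility_timeout = visibility_timeout
        self.max_receives = max_receives
        self.dead_letter_queue = dead_letter_queue
        self._entries = []  # kept in arrival order

    def send(self, body):
        self._entries.append({"body": body, "visible_at": 0.0, "receives": 0})

    def _drop(self, entry):
        # Remove by identity so equal-looking entries are not confused.
        self._entries = [e for e in self._entries if e is not entry]

    def receive(self, now):
        """Return the next visible entry without deleting it, or None."""
        for entry in list(self._entries):
            if entry["visible_at"] > now:
                continue  # still hidden by an earlier receive
            if entry["receives"] >= self.max_receives:
                # Retries exhausted: move the message to the dead letter queue.
                self._drop(entry)
                if self.dead_letter_queue is not None:
                    self.dead_letter_queue.send(entry["body"])
                continue
            entry["receives"] += 1
            entry["visible_at"] = now + self.visibility_timeout
            return entry
        return None

    def delete(self, entry):
        """Acknowledge success; the message will not be redelivered."""
        self._drop(entry)


dlq = SimulatedQueue(visibility_timeout=0, max_receives=1)
queue = SimulatedQueue(visibility_timeout=30, max_receives=2, dead_letter_queue=dlq)
queue.send("task-1")
first = queue.receive(now=0)     # delivered, then hidden for 30 seconds
hidden = queue.receive(now=10)   # nothing visible yet
second = queue.receive(now=30)   # never deleted, so it is retried
third = queue.receive(now=60)    # retries exhausted, moved to the DLQ
dead = dlq.receive(now=60)
```

A consumer that finishes its work calls delete; a consumer that crashes simply never does, and the message reappears once the visibility timeout expires.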
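
Sketch for slides 20 and 21 (task router and discovery). The router resolves a cluster name to live nodes and fans the task out to node-specific queues, setting destnode on each copy. The following is a minimal sketch under stated assumptions: the registry dict stands in for EC2 tags or Smartstack, is_alive is a hypothetical liveness check, plain lists stand in for node queues, and passthrough is assumed to mean "use the node list carried in the task instead of discovery".

```python
import copy


def discover_live_nodes(cluster_name, registry, is_alive):
    """Resolve a cluster name to the nodes that pass the liveness check."""
    return [node for node in registry.get(cluster_name, []) if is_alive(node)]


def route_task(task, registry, is_alive, node_queues, passthrough=False):
    """Copy the task into one queue per destination node."""
    if passthrough:
        nodes = list(task.get("nodes", []))
    else:
        nodes = discover_live_nodes(task["cluster_name"], registry, is_alive)
    for node in nodes:
        routed = copy.deepcopy(task)
        routed["destnode"] = node  # recorded so a failed task is traceable
        # Queues are created on demand, from the router side.
        node_queues.setdefault(node, []).append(routed)
    return nodes


registry = {"cassandra:geo_counter": ["10.0.0.1", "10.0.0.2", "10.0.0.3"]}
down = {"10.0.0.2"}
queues = {}
task = {"action": "heartbeat:ping", "cluster_name": "cassandra:geo_counter",
        "nodes": [], "destnode": ""}
routed_to = route_task(task, registry, lambda node: node not in down, queues)
```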
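
Sketch for slides 22 and 23 (Taskrunner). The slide shows a subclass with pre_check, post_check and execute_action hooks but not the base class. The following sketch guesses at one way a base class could drive those hooks; the run method, its return values and the HeartbeatTaskRunner subclass are assumptions, not the actual Taskerman base class.

```python
class TaskRunner:
    """Hypothetical base class: runs the hooks in a fixed order."""

    def __init__(self, task):
        self.task = task

    def pre_check(self):
        return True  # e.g. confirm the node is healthy before acting

    def execute_action(self):
        raise NotImplementedError

    def post_check(self):
        return True  # e.g. confirm the node recovered after acting

    def run(self):
        if not self.pre_check():
            return "skipped"
        self.execute_action()
        return "succeeded" if self.post_check() else "failed"


class HeartbeatTaskRunner(TaskRunner):
    """Trivial runner that only records that it was executed."""

    def __init__(self, task):
        super().__init__(task)
        self.pings = 0

    def execute_action(self):
        self.pings += 1

    def post_check(self):
        return self.pings > 0


runner = HeartbeatTaskRunner({"action": "heartbeat:ping"})
status = runner.run()
```

Keeping the order of the hooks in the base class means each reviewed action only supplies its own checks and its own action body.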
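
Sketch for slides 11 and 24 (serialized execution and leases). A non-blocking, time-based lease that lets at most 'm' out of 'n' nodes act at once can be sketched as follows. This is a single-process illustration of the idea only: the class LeaseTable is hypothetical, and a real deployment would keep this state in Zookeeper rather than in a local dict.

```python
class LeaseTable:
    """Grants at most max_holders concurrent leases; leases expire by time."""

    def __init__(self, max_holders, lease_seconds):
        self.max_holders = max_holders
        self.lease_seconds = lease_seconds
        self._expiry = {}  # holder name -> time at which its lease ends

    def try_acquire(self, holder, now):
        """Return True if the lease was granted; never blocks."""
        # Expired leases are dropped first, so a dead holder cannot
        # keep a slot forever.
        self._expiry = {h: t for h, t in self._expiry.items() if t > now}
        if holder in self._expiry or len(self._expiry) < self.max_holders:
            self._expiry[holder] = now + self.lease_seconds
            return True
        return False

    def release(self, holder):
        self._expiry.pop(holder, None)


leases = LeaseTable(max_holders=2, lease_seconds=60)
got_a = leases.try_acquire("node-a", now=0)
got_b = leases.try_acquire("node-b", now=0)
got_c = leases.try_acquire("node-c", now=1)        # both slots taken
got_c_later = leases.try_acquire("node-c", now=61)  # earlier leases expired
```

A node that fails to get a lease does not wait; it leaves the task in its queue and picks it up again on a later attempt.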
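
Sketch for slide 25 (counters and circuit breakers). A failure counter with automatic reset can act as a circuit breaker that stops further executions once too many have failed, which is one way to prevent a failure cascade. The following is a local, single-process sketch; the class FailureBreaker and its thresholds are hypothetical, and the atomic, shared counter that Zookeeper would provide is replaced by a plain integer.

```python
class FailureBreaker:
    """Blocks execution after max_failures within a reset window."""

    def __init__(self, max_failures, reset_after):
        self.max_failures = max_failures
        self.reset_after = reset_after
        self._failures = 0
        self._window_start = None  # time of the first failure in the window

    def _maybe_reset(self, now):
        started = self._window_start
        if started is not None and now - started >= self.reset_after:
            self._failures = 0
            self._window_start = None

    def record_failure(self, now):
        self._maybe_reset(now)
        if self._window_start is None:
            self._window_start = now
        self._failures += 1

    def allows_execution(self, now):
        self._maybe_reset(now)
        return self._failures < self.max_failures


breaker = FailureBreaker(max_failures=2, reset_after=100)
breaker.record_failure(now=1)
open_after_one = breaker.allows_execution(now=2)
breaker.record_failure(now=2)
open_after_two = breaker.allows_execution(now=3)
open_after_reset = breaker.allows_execution(now=101)
```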
