Taskerman - a distributed cluster task manager

Raghavendra Prabhu
rprabhu@yelp.com / @randomsurfer
Distributed Systems Data Team

Yelp’s Mission
Connecting people with great
local businesses.

Some numbers!
● 30 million unique mobile app users.
● 74 million UMVs via mobile web.
● 84 million UMVs via desktop.
● More than 142 million rich, local reviews.
● 78% of all searches on Yelp came from mobile.
● 9 offices around the world.
● 4,000+ employees worldwide and 400+ engineers
(as of Q3 2017)

….
● Memcached
● Redis
● Spark
● Redshift
● DynamoDB
● S3
● OSCON talk on data tiers at Yelp
And many more!

Distributed Systems Team
● Several TB in prod cassandra clusters with tens of
nodes in each.
● Half a million messages/second in our streaming
pipeline
● Several TB in elasticsearch in prod with several hundred
nodes
● All are multi-AZ multi-region
● And growing…

Maintenance Cost
Engineering Efficiency
Scalability

The Why I
● Safe
● Generic and Extensible
● Distributed
● Loosely coupled
● Not ad-hoc
○ Reviewed
● Sound config management

The Why II
● Schedulable
● Reusable
● Cluster awareness
● Easily maintainable and observable
○ Not a black box.
○ More Ironman, less Ultron
● Prior Art
○ Downsides

Safety
● Paramount*
● Serialized execution
○ ‘m’ out of ‘n’
○ Disjoint jobs.
● Avoid cascade
● Privilege escalation
● Push-based
* Unless oncall is automated too.

Quotes
“There are only two hard problems in distributed systems:
2. Exactly-once delivery 1. Guaranteed order of messages
2. Exactly-once delivery”
@mathiasverraes
“There are 2 hard problems in computer science: cache
invalidation, naming things, and off-by-1 errors.”
@secretGeek

Fallacies of Distributed Systems
● Network is reliable
● Latency is zero
● Bandwidth is infinite
● Network is secure
● One administrator
● Transport cost is zero
● Network is homogeneous
● Topology doesn't change

Components
● Scheduler
● Router
● Co-ordinator
● Transport
● Executor
● Error handler
● Configuration
● Monitoring
● Tooling

Flow of task
[Diagram. Labels: Task Scheduler; Router; Queue; SQS queues Q1, Q2, Q3 (Qx = node queue); T1, T2, T3; Lease; Failure; Dead Letter Queue; Nodes]

Task Scheduler
● Runs on Chronos
● Emits a task
● Enqueues into global queue
● Ad-hoc support
● Deployment granularities
● Task tracking
● Yelpsoa-configs

# Anatomy of a Taskerman task
{
    'action': 'heartbeat:ping',
    'version': 'X.Y',
    'limit': 2,  # To limit executors.
    'cluster_name': 'cassandra:geo_counter',
    'discovery': 'aws_tags',
    'owner': 'distsys',  # For alerting.
    'task_id': 'f2f6e03f-539a-49ad-8bf0-ecf079df36f5',  # For status tracking.
    'taskerman_params': {
        'action_args': {},
        'workqueue_args': {},
    },
    'nodes': ['a', 'b', 'c', 'd'],  # For passthrough mode (placeholder node names).
    'destnode': '',  # Mutated by router for DLQ.
}
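
A minimal sketch of how a scheduler might build and serialize a task shaped like the example above. The helper names make_task and serialize are hypothetical, and JSON as the wire format is an assumption; the slides do not say how tasks are encoded.

```python
import json
import uuid


def make_task(action, cluster_name, owner, limit=1, discovery="aws_tags",
              action_args=None, workqueue_args=None, nodes=None):
    """Build a task dict with the fields shown on the slide (hypothetical helper)."""
    return {
        "action": action,
        "version": "X.Y",  # placeholder kept from the slide
        "limit": limit,  # to limit executors
        "cluster_name": cluster_name,
        "discovery": discovery,
        "owner": owner,  # for alerting
        "task_id": str(uuid.uuid4()),  # for status tracking
        "taskerman_params": {
            "action_args": action_args or {},
            "workqueue_args": workqueue_args or {},
        },
        "nodes": list(nodes or []),  # only filled in passthrough mode
        "destnode": "",  # left empty here; the router sets it later
    }


def serialize(task):
    """Encode the task as a JSON string (assumed wire format)."""
    return json.dumps(task, sort_keys=True)


task = make_task("heartbeat:ping", "cassandra:geo_counter", "distsys", limit=2)
body = serialize(task)
```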

Queue
● AWS SQS
● Best-effort FIFO
● Reliable and cheap
● Low latency
● Properties
○ Read without delete
○ Visibility timeout
○ Retry
○ Dead Letter Queue
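
The queue properties listed here (read without delete, visibility timeout, retry, dead letter queue) can be illustrated with a small in-memory model. This is only a sketch of the semantics, not SQS itself and not Taskerman's code; the class SketchQueue and its parameters are hypothetical, and time is passed in explicitly so the behaviour is deterministic.

```python
import itertools


class SketchQueue:
    """In-memory model of a queue with a visibility timeout and a dead letter list."""

    def __init__(self, visibility_timeout, max_receives):
        self.visibility_timeout = visibility_timeout
        self.max_receives = max_receives
        self.dead_letter = []  # bodies of messages whose retries ran out
        self._messages = {}  # handle -> [body, visible_at, receive_count]
        self._handles = itertools.count(1)

    def send(self, body, now=0):
        self._messages[next(self._handles)] = [body, now, 0]

    def receive(self, now):
        """Return (handle, body) for the first visible message, or None."""
        for handle, msg in list(self._messages.items()):
            body, visible_at, receives = msg
            if visible_at > now:
                continue  # hidden: an earlier receiver may still be working on it
            if receives >= self.max_receives:
                # Retries exhausted: park the message in the dead letter list.
                del self._messages[handle]
                self.dead_letter.append(body)
                continue
            msg[1] = now + self.visibility_timeout  # hide it, but do not delete it
            msg[2] = receives + 1
            return handle, body
        return None

    def delete(self, handle):
        """Called by the consumer only after the task succeeded."""
        self._messages.pop(handle, None)


q = SketchQueue(visibility_timeout=30, max_receives=2)
q.send("restart-node-a")
first = q.receive(now=0)    # delivered, then hidden until t=30
hidden = q.receive(now=10)  # nothing visible yet
second = q.receive(now=31)  # never deleted, so it is redelivered
third = q.receive(now=62)   # receive budget used up: moved to dead_letter
```

A consumer that calls delete(handle) after a successful run never sees the message again; one that crashes simply lets the visibility timeout expire, which is what gives retry for free.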

Task Router
● Stateless Marathon worker
● Routes tasks from global queue
○ To node-specific queues
● At-least-once delivery
● Queue creation
○ Top-down principle
● Passthrough
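
The routing step described above can be sketched as a fan-out from one global task to per-node queues. This is an illustration under assumptions: the function route, the in-memory deques standing in for node queues, and the rule that a non-empty 'nodes' list means passthrough are all hypothetical readings of the slide.

```python
from collections import defaultdict, deque


def route(task, discover, node_queues):
    """Copy one global-queue task onto the queue of every target node."""
    # Passthrough mode: the task already names its nodes, so skip discovery.
    nodes = task.get("nodes") or discover(task["cluster_name"])
    for node in nodes:
        copy = dict(task, destnode=node)  # record the destination on the copy only
        node_queues[node].append(copy)  # defaultdict creates the queue on first use
    return nodes


def fake_discovery(cluster_name):
    # Hypothetical stand-in for a real discovery backend.
    return ["10.0.0.1", "10.0.0.2"]


node_queues = defaultdict(deque)
task = {"action": "heartbeat:ping", "cluster_name": "cassandra:geo_counter", "nodes": []}
routed = route(task, fake_discovery, node_queues)
passthrough = route(dict(task, nodes=["a"]), fake_discovery, node_queues)
```

Because delivery is at-least-once, a router like this may enqueue the same task twice after a crash, so the consuming side has to tolerate duplicates.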

Task Router :: Discovery
● ‘DNS of Taskerman’
● Cluster => [..,Node IP,..]
● EC2 tags or Smartstack
● Liveness filtering
● Pluggable
● Challenges
○ Rate-limiting
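
One way to make discovery pluggable is a small registry keyed by the task's 'discovery' field, with liveness filtering applied afterwards. This is a sketch only: the registry, the "static" plugin, its inventory, and the is_alive callback are hypothetical, and a real plugin would query EC2 tags or Smartstack instead.

```python
DISCOVERY_PLUGINS = {}


def register(name):
    """Decorator that registers a discovery function under a plugin name."""
    def decorator(func):
        DISCOVERY_PLUGINS[name] = func
        return func
    return decorator


@register("static")
def static_discovery(cluster_name):
    # Hypothetical fixed inventory, used here in place of a real backend.
    inventory = {"cassandra:geo_counter": ["10.0.0.1", "10.0.0.2", "10.0.0.3"]}
    return inventory.get(cluster_name, [])


def discover(task, is_alive):
    """Resolve cluster => node IPs with the task's plugin, then drop dead nodes."""
    plugin = DISCOVERY_PLUGINS[task["discovery"]]
    return [ip for ip in plugin(task["cluster_name"]) if is_alive(ip)]


live = discover(
    {"discovery": "static", "cluster_name": "cassandra:geo_counter"},
    is_alive=lambda ip: ip != "10.0.0.2",  # pretend one node is down
)
```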

Taskrunner
● The executor of Taskerman
● Dequeues a task and executes it
○ Pre-defined, reviewed code.
● Scheduled on node
● Zookeeper for coordination
● Task deleted upon success
● Sent to the dead letter queue after failed retries

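class TestTaskRunner(TaskRunner):
    # Method bodies are elided ("..") on the slide.
    def __init__(self, task, *args, **kwargs):  # slide shows "task,.."; other arguments elided
        ...

    def pre_check(self):
        ...

    def post_check(self):
        ...

    def execute_action(self):
        ...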
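
The slide shows only the hooks a subclass fills in. A base class could sequence them roughly as below. This is a guess at the shape rather than Taskerman's actual TaskRunner: the run method, its return values, and the PingRunner subclass are all hypothetical.

```python
class TaskRunner:
    """Hypothetical base class: check, act, then verify."""

    def __init__(self, task):
        self.task = task

    def pre_check(self):
        return True  # e.g. is it safe to act on this node right now?

    def execute_action(self):
        raise NotImplementedError

    def post_check(self):
        return True  # e.g. did the node come back healthy?

    def run(self):
        if not self.pre_check():
            return "skipped"
        self.execute_action()
        return "succeeded" if self.post_check() else "failed"


class PingRunner(TaskRunner):
    """Toy action that only records that it ran."""

    def __init__(self, task):
        super().__init__(task)
        self.pinged = False

    def execute_action(self):
        self.pinged = True

    def post_check(self):
        return self.pinged


result = PingRunner({"action": "heartbeat:ping"}).run()
```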

Zookeeper
● Distributed Coordinator
● Non Blocking Lease
○ Time-based lease
○ Mutual exclusion
○ Global lease
● Ephemeral locks
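
The idea of a non-blocking, time-based lease that lets only 'm' out of 'n' nodes act at once can be sketched in memory. This does not talk to Zookeeper; the class LeaseTable and its methods are hypothetical, and time is passed in explicitly to keep the example deterministic.

```python
class LeaseTable:
    """In-memory sketch of a non-blocking, time-based lease with a holder limit."""

    def __init__(self, limit):
        self.limit = limit  # at most this many concurrent holders ('m' out of 'n')
        self._expiry = {}  # holder -> time at which its lease lapses

    def try_acquire(self, holder, duration, now):
        # Leases lapse on their own, so a crashed holder cannot block others forever.
        self._expiry = {h: t for h, t in self._expiry.items() if t > now}
        if holder in self._expiry or len(self._expiry) < self.limit:
            self._expiry[holder] = now + duration
            return True
        return False  # non-blocking: the caller gives up or retries later

    def release(self, holder):
        self._expiry.pop(holder, None)


leases = LeaseTable(limit=2)
got_a = leases.try_acquire("node-a", duration=60, now=0)
got_b = leases.try_acquire("node-b", duration=60, now=0)
got_c = leases.try_acquire("node-c", duration=60, now=1)     # limit reached
got_c_later = leases.try_acquire("node-c", duration=60, now=61)  # earlier leases lapsed
```

With limit=1 the same structure gives plain mutual exclusion.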

Zookeeper
● Atomic Counters
○ Statistics on actions
○ Circuit breakers
■ Dead man’s switch
■ Prevent failure cascade
○ Automatic reset
● What is Atomic
○ Serializability
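
A counter-driven circuit breaker with automatic reset might look like the following. It is a single-process sketch: the class CircuitBreaker and its thresholds are hypothetical, and the plain integer stands in for what the slide describes as an atomic counter kept in Zookeeper.

```python
class CircuitBreaker:
    """Stop running an action after too many failures, then reset after a quiet period."""

    def __init__(self, max_failures, reset_after):
        self.max_failures = max_failures
        self.reset_after = reset_after
        self.failures = 0  # stand-in for a shared atomic counter
        self.opened_at = None  # time the breaker tripped, or None while closed

    def allow(self, now):
        if self.opened_at is not None and now - self.opened_at >= self.reset_after:
            # Automatic reset: forget old failures once enough time has passed.
            self.failures = 0
            self.opened_at = None
        return self.opened_at is None

    def record_failure(self, now):
        self.failures += 1
        if self.failures >= self.max_failures and self.opened_at is None:
            self.opened_at = now  # trip: further runs are refused to avoid a cascade


breaker = CircuitBreaker(max_failures=2, reset_after=300)
breaker.record_failure(now=1)
still_open_for_work = breaker.allow(now=2)   # one failure is tolerated
breaker.record_failure(now=3)
tripped = not breaker.allow(now=4)           # second failure trips the breaker
recovered = breaker.allow(now=303)           # 300 time units after tripping
```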

Zookeeper: Challenges
● Staleness
○ Nodes can go down
● Garbage collection
○ Cleanup of ZK data structures
● Composition
● Starvation
● Uptime

Failure
● Failure is the norm, not an exception
● Multiple vectors of failure
● Pessimistic approach
○ Job retry
○ Job Counter
● Mitigation vs Alerting

Failure handling
● Heartbeat ping
○ End-to-end monitoring
● Dead Letter Queue
○ Recycle bin of failed tasks.
○ Hooks into human side of monitoring
● Others
○ Separation of state
○ Mutability
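
As an illustration of how a dead letter queue could hook into the human side of monitoring, the sketch below groups failed tasks by their 'owner' field and produces one message per task. The function alerts_from_dlq and the message format are hypothetical; the field names come from the task anatomy shown earlier in the deck.

```python
from collections import defaultdict


def alerts_from_dlq(dead_tasks):
    """Group dead-lettered tasks by owner so each team hears only about its own failures."""
    by_owner = defaultdict(list)
    for task in dead_tasks:
        node = task.get("destnode") or "unknown node"
        by_owner[task["owner"]].append(
            "task %s (%s) failed on %s" % (task["task_id"], task["action"], node)
        )
    return dict(by_owner)


alerts = alerts_from_dlq([
    {"owner": "distsys", "task_id": "t-1", "action": "heartbeat:ping", "destnode": "10.0.0.1"},
    {"owner": "distsys", "task_id": "t-2", "action": "heartbeat:ping", "destnode": ""},
])
```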

Monitoring
● Observability
○ Job status logging at stages
○ End-to-end logging with Scribe
○ Metrics
○ SignalFx (Queue lengths)
○ Splunk dashboards
● Alerting
○ Sensu
○ SignalFx

Use cases
● Restarts
● Reboots
● Instance Replacement
● Integration tests
● Kafka config mgmt
● Backup and restore
● Search indexing
● .. and many more.

Uptime management
$ uptime
06:52:54 up 99 days, 19:20, 1 user, load average: 0.02, 0.03, 0.07
$ ps -eo pid,cmd,lstart | grep ..
10058 zookeeper Tue Dec 5 05:23:43 2017
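
The start time printed by ps in the lstart column can be turned into a process age, which a restart task could compare against a threshold. This is a sketch under assumptions: the helper names and the 90-day threshold are hypothetical, and parsing %a and %b relies on an English (C) locale.

```python
from datetime import datetime


def process_age_days(lstart, now):
    """Parse a ps 'lstart' timestamp and return the whole days elapsed since then."""
    started = datetime.strptime(lstart, "%a %b %d %H:%M:%S %Y")
    return (now - started).days


def needs_restart(lstart, now, max_days):
    """Flag a process that has been running longer than max_days."""
    return process_age_days(lstart, now) > max_days


now = datetime(2018, 3, 15, 6, 52, 54)  # arbitrary "current time" for the example
age = process_age_days("Tue Dec 5 05:23:43 2017", now)
restart = needs_restart("Tue Dec 5 05:23:43 2017", now, max_days=90)
```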

Q & A
● Slides will also be uploaded to slideshare.net/slidunder.

www.yelp.com/careers/
We're Hiring!

@YelpEngineering
fb.com/YelpEngineers
engineeringblog.yelp.com
github.com/yelp


Further Reading
● https://engineeringblog.yelp.com/2015/03/using-services-to-break-down-monoliths.html
● http://cloudscaling.com/blog/cloud-computing/the-history-of-pets-vs-cattle/
● https://martinfowler.com/bliki/TwoHardThings.html
● https://zookeeper.apache.org/
● https://www.terraform.io/
● https://github.com/Yelp/service-principles
● https://en.wikipedia.org/wiki/Law_of_Demeter
