Raghavendra Prabhu
rprabhu@yelp.com / @randomsurfer
Distributed Systems Data Team
Taskerman
A Distributed Cluster Task Manager
Yelp’s Mission
Connecting people with great
local businesses.
Some numbers!
● 30 million unique mobile app users.
● 74 million UMVs via mobile web.
● 84 million UMVs via desktop.
● More than 142 million rich, local reviews.
● 78% of all searches on Yelp came from mobile.
● 9 offices around the world.
● 4,000+ employees worldwide and 400+ engineers
(as of Q3 2017)
Datastore Ecosystem @ Yelp
….
● Memcached
● Redis
● Spark
● Redshift
● DynamoDB
● S3
● OSCON talk on data tiers at Yelp
And many more!
Distributed Systems Team
● Several TB in prod Cassandra clusters with tens of
nodes in each.
● Half a million messages/second in our streaming
pipeline
● Several TB in Elasticsearch in prod with several hundred
nodes
● All are multi-AZ multi-region
● And growing…
Datastores: Pets or Cattle?
Maintenance Cost
Engineering Efficiency
Scalability
● Safe
● Generic and Extensible
● Distributed
● Loosely coupled
● Not ad-hoc
○ Reviewed
● Sound config management
The Why I
● Schedulable
● Reusable
● Cluster awareness
● Easily maintainable and observable
○ Not a black box.
○ More Ironman, less Ultron
● Prior Art
○ Downsides
The Why II
● Paramount*
● Serialized execution
○ ‘m’ out of ‘n’
○ Disjoint jobs.
● Avoid cascade
● Privilege escalation
● Push-based
* Unless oncall is automated too.
Safety
Quotes
“There are only two hard problems in distributed systems:
2. Exactly-once delivery 1. Guaranteed order of messages
2. Exactly-once delivery”
@mathiasverraes
“There are 2 hard problems in computer science: cache
invalidation, naming things, and off-by-1 errors.”
@secretGeek
● Network is reliable
● Latency is zero
● Bandwidth is infinite
● Network is secure
● One administrator
● Transport cost is zero
● Network is homogeneous
● Topology doesn't change
Fallacies of Distributed Computing
Taskerman
● Scheduler
● Router
● Co-ordinator
● Transport
● Executor
● Error handler
● Configuration
● Monitoring
● Tooling
Components
Flow of task
[Diagram: Task Scheduler → Queue → Router → SQS node queues Q1, Q2, Q3 (Qx = node queue) → Nodes running tasks T1, T2, T3 under a Lease; Failure → Dead Letter Queue]
● Runs on Chronos
● Emits a task
● Enqueues into global queue
● Ad-hoc support
● Deployment granularities
● Task tracking
● Yelpsoa-configs
Task Scheduler
# Anatomy of a Taskerman Task
{
    'action': 'heartbeat:ping',
    'version': 'X.Y',
    'limit': 2,  # To limit executors.
    'cluster_name': 'cassandra:geo_counter',
    'discovery': 'aws_tags',
    'owner': 'distsys',  # For alerting
    'task_id': 'f2f6e03f-539a-49ad-8bf0-ecf079df36f5',  # For status tracking
    'taskerman_params': {
        'action_args': {},
        'workqueue_args': {},
    },
    'nodes': ['a', 'b', 'c', 'd'],  # For passthrough mode.
    'destnode': '',  # Mutated by router for DLQ
}
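A minimal sketch of how a task like the one above might be checked before it is enqueued. The field names come from the slide; the `REQUIRED_FIELDS` set, the `validate_task` helper, and the `namespace:name` shape of `action` are assumptions made for illustration, not part of Taskerman as presented.

```python
# Assumption: these fields are treated as mandatory; the slides do not say which are.
REQUIRED_FIELDS = {"action", "version", "cluster_name", "owner", "task_id"}


def validate_task(task):
    """Return a list of problems found in a task dict (empty list means OK)."""
    missing = sorted(REQUIRED_FIELDS - set(task))
    problems = [f"missing field: {name}" for name in missing]

    # 'limit' caps the number of executors, so it must be a positive integer.
    limit = task.get("limit")
    if limit is not None and (not isinstance(limit, int) or limit < 1):
        problems.append("limit must be a positive integer")

    # Assumption: actions follow the 'heartbeat:ping' pattern seen on the slide.
    action = task.get("action", "")
    if action and ":" not in action:
        problems.append("action should look like 'namespace:name'")
    return problems
```

Rejecting a malformed task at the scheduler keeps it from reaching the per-node queues, where it would only fail after retries.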
● AWS SQS
● Best-effort FIFO
● Reliable and cheap
● Low latency
● Properties
○ Read without delete
○ Visibility timeout
○ Retry
○ Dead Letter Queue
Queue
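The queue properties listed above (read without delete, visibility timeout, retry, dead letter queue) can be illustrated with a small in-memory model. This is a sketch of the semantics only: `ToyQueue` and its parameters are hypothetical names, it is not the SQS API, and time is passed in explicitly so the behaviour is easy to follow.

```python
class ToyQueue:
    """In-memory stand-in for a queue with SQS-like semantics (not the SQS API)."""

    def __init__(self, visibility_timeout, max_receives, dead_letter=None):
        self.visibility_timeout = visibility_timeout
        self.max_receives = max_receives
        self.dead_letter = dead_letter if dead_letter is not None else []
        self._messages = []  # each entry: [body, visible_at, receive_count]

    def send(self, body):
        self._messages.append([body, 0.0, 0])

    def receive(self, now):
        """Return the first visible message, hiding it without deleting it."""
        for entry in list(self._messages):
            body, visible_at, count = entry
            if visible_at > now:
                continue  # still hidden: another consumer is working on it
            if count >= self.max_receives:
                # Retries exhausted: park the message in the dead letter queue.
                self._messages.remove(entry)
                self.dead_letter.append(body)
                continue
            entry[1] = now + self.visibility_timeout
            entry[2] = count + 1
            return body
        return None

    def delete(self, body):
        """Acknowledge success; the message will not be delivered again."""
        self._messages = [m for m in self._messages if m[0] != body]
```

A consumer that crashes never calls `delete`, so the message reappears once the visibility timeout passes. That reappearance is the retry, and the receive count bounds how often it happens.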
● Stateless Marathon worker
● Routes tasks from global queue
○ To node-specific queues
● At-least once delivery
● Queue creation
○ Top-down principle
● Passthrough
Task Router
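A minimal sketch of the routing step described above, assuming the task is a plain dict shaped like the anatomy slide. The `route` function and the `discover` callable are hypothetical names. The real router reads from and writes to SQS queues, which this sketch replaces with a returned dict.

```python
import copy


def route(task, discover):
    """Fan a task from the global queue out to one copy per node.

    `discover` is a hypothetical callable mapping a cluster name to node names.
    Returns {node: task_copy}, with 'destnode' set on each copy.
    """
    # Passthrough mode: an explicit 'nodes' list bypasses discovery.
    nodes = task.get("nodes") or discover(task["cluster_name"])
    routed = {}
    for node in nodes:
        per_node = copy.deepcopy(task)  # never mutate the original message
        per_node["destnode"] = node     # lets a dead-lettered task be traced to its node
        routed[node] = per_node
    return routed
```

For at-least-once delivery, a router along these lines would delete the task from the global queue only after every per-node copy has been enqueued. A crash midway then causes a redelivery rather than a lost task.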
● ‘DNS of Taskerman’
● Cluster => [..,Node IP,..]
● EC2 tags or Smartstack
● Liveness filtering
● Pluggable
● Challenges
○ Rate-limiting
Task Router :: Discovery
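One way the pluggable discovery and liveness filtering above could be structured is a registry of backends keyed by name. Everything here is hypothetical: the `discovery_backend` decorator, the `static` backend, and its fixture data are stand-ins for real lookups against EC2 tags or Smartstack.

```python
DISCOVERY_BACKENDS = {}


def discovery_backend(name):
    """Register a function under `name` so tasks can select it by string."""
    def register(func):
        DISCOVERY_BACKENDS[name] = func
        return func
    return register


@discovery_backend("static")
def static_discovery(cluster_name):
    # Hypothetical fixture data standing in for an EC2 tag or Smartstack lookup.
    clusters = {"cassandra:geo_counter": ["10.0.0.1", "10.0.0.2", "10.0.0.3"]}
    return clusters.get(cluster_name, [])


def discover(cluster_name, backend, is_alive):
    """Resolve a cluster to its nodes, keeping only those that pass the liveness check."""
    nodes = DISCOVERY_BACKENDS[backend](cluster_name)
    return [node for node in nodes if is_alive(node)]
```

The task's `discovery` field (for example `aws_tags`) would pick the backend, so a new lookup mechanism is added by registering one more function.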
● The executor of Taskerman
● Dequeues a task and executes it
○ Pre-defined reviewed code.
● Scheduled on node
● Zookeeper for coordination
● Task deleted upon success
● Dead letter queue upon failed
retries
Taskrunner
class TestTaskRunner(TaskRunner):
    def __init__(self, task, *args, **kwargs):  # remaining arguments elided on the slide
        ...

    def pre_check(self):
        ...

    def post_check(self):
        ...

    def execute_action(self):
        ...
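The slides do not show the `TaskRunner` base class, so the following is only a sketch of how the three hooks might be driven. The `run` method, the default hook implementations, and `PingRunner` are assumptions for illustration.

```python
class TaskRunner:
    """Hypothetical base class; the real one is not shown in the slides."""

    def __init__(self, task):
        self.task = task

    def pre_check(self):
        return True  # default: nothing blocks execution

    def execute_action(self):
        raise NotImplementedError

    def post_check(self):
        return True  # default: assume the action took effect

    def run(self):
        """Run the hooks in order; return True only if every stage passes."""
        if not self.pre_check():
            return False  # skip the action entirely when it is unsafe to run
        self.execute_action()
        return bool(self.post_check())


class PingRunner(TaskRunner):
    def execute_action(self):
        self.task["pinged"] = True

    def post_check(self):
        return self.task.get("pinged", False)
```

With this shape, a `False` from `run` would leave the task in the queue for a retry, while `True` would let the runner delete it.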
● Distributed Coordinator
● Non Blocking Lease
○ Time-based lease
○ Mutual exclusion
○ Global lease
● Ephemeral locks
Zookeeper
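A time-based, non-blocking lease with a cap on concurrent holders can be sketched in memory as follows. `LeaseTable` is a hypothetical name, and the state kept here in a dict would live in ZooKeeper in the real system, where each update also has to be atomic across nodes.

```python
class LeaseTable:
    """In-memory stand-in for lease state that would live in ZooKeeper."""

    def __init__(self, limit, ttl):
        self.limit = limit   # at most `limit` concurrent holders ('m' out of 'n')
        self.ttl = ttl       # seconds before an unreleased lease expires
        self._holders = {}   # node -> expiry time

    def try_acquire(self, node, now):
        """Non-blocking: return True if `node` holds a lease, else False at once."""
        # Drop expired leases so a crashed holder cannot block others forever.
        self._holders = {n: exp for n, exp in self._holders.items() if exp > now}
        if node in self._holders or len(self._holders) < self.limit:
            self._holders[node] = now + self.ttl  # a repeat call renews the lease
            return True
        return False

    def release(self, node):
        self._holders.pop(node, None)
```

A node that fails to get a lease returns immediately and leaves its task in the queue. The task becomes visible again later, so the retry loop lives in the queue rather than in a blocked process.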
● Atomic Counters
○ Statistics on actions
○ Circuit breakers
■ Dead man’s switch
■ Prevent failure cascade
○ Automatic reset
● What is Atomic
○ Serializability
Zookeeper
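The counter-backed circuit breaker with automatic reset mentioned above might look like the sketch below. This is one interpretation of the bullets, not the talk's implementation: the `CircuitBreaker` class and its window-based reset rule are assumptions, and the counter is a plain integer where the real system would use an atomic counter in ZooKeeper.

```python
class CircuitBreaker:
    """Stops further actions once failures in a time window reach a threshold."""

    def __init__(self, max_failures, reset_after):
        self.max_failures = max_failures
        self.reset_after = reset_after  # seconds after the first failure until reset
        self.failures = 0
        self.window_start = None

    def record_failure(self, now):
        self._maybe_reset(now)
        if self.window_start is None:
            self.window_start = now  # first failure opens a new window
        self.failures += 1

    def allows(self, now):
        """Return True while it is still considered safe to run more actions."""
        self._maybe_reset(now)
        return self.failures < self.max_failures

    def _maybe_reset(self, now):
        # Automatic reset: forget old failures once the window has elapsed.
        if self.window_start is not None and now - self.window_start >= self.reset_after:
            self.failures = 0
            self.window_start = None
```

Checking `allows` before each action means a bad restart on two nodes can stop the remaining nodes from attempting the same restart, which is the cascade the slide warns about.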
● Staleness
○ Nodes can go down
● Garbage collection
○ Cleanup of ZK data structures
● Composition
● Starvation
● Uptime
Zookeeper: Challenges
● Failure is the norm, not an
exception
● Multiple vectors of failure
● Pessimistic approach
○ Job retry
○ Job Counter
● Mitigation vs Alerting
Failure
● Heartbeat ping
○ End-to-end monitoring
● Dead Letter Queue
○ Recycle bin of failed tasks.
○ Hooks into human side of
monitoring
● Others
○ Separation of state
○ Mutability
Failure handling
Debugging distributed systems
● Observability
○ Job status logging at stages
○ End-to-end logging with Scribe
○ Metrics
○ Signalfx (Queue lengths)
○ Splunk dashboards
● Alerting
○ Sensu
○ Signalfx
Monitoring
● Restarts
● Reboots
● Instance Replacement
● Integration tests
● Kafka config mgmt
● Backup and restore
● Search indexing
● .. and many more.
Use cases
Uptime management
$ uptime
06:52:54 up 99 days, 19:20, 1 user,
load average: 0.02, 0.03, 0.07
ps -eo pid,cmd,lstart | grep ..
10058 zookeeper Tue Dec 5 05:23:43 2017
Q & A
● Slides will also be uploaded to slideshare.net/slidunder.
www.yelp.com/careers/
We're Hiring!
@YelpEngineering
fb.com/YelpEngineers
engineeringblog.yelp.com
github.com/yelp
● https://engineeringblog.yelp.com/2015/03/using-services-to-break-down-monoliths.html
● http://cloudscaling.com/blog/cloud-computing/the-history-of-pets-vs-cattle/
● https://martinfowler.com/bliki/TwoHardThings.html
● https://zookeeper.apache.org/
● https://www.terraform.io/
● https://github.com/Yelp/service-principles
● https://en.wikipedia.org/wiki/Law_of_Demeter
Further Reading

ACIDic Clusters: Review of current relation databases with synchronous replic...ACIDic Clusters: Review of current relation databases with synchronous replic...
ACIDic Clusters: Review of current relation databases with synchronous replic...
 

Taskerman - a distributed cluster task manager

  • 1. Raghavendra Prabhu rprabhu@yelp.com / @randomsurfer Distributed Systems Data Team Taskerman A Distributed Cluster Task Manager
  • 2. Yelp’s Mission Connecting people with great local businesses.
  • 3. Some numbers! ● 30 million unique mobile app users. ● 74 million UMVs via mobile web. ● 84 million UMVs via desktop. ● More than 142 million rich, local reviews. ● 78% of all searches on Yelp came from mobile. ● 9 offices around the world. ● 4,000+ employees worldwide and 400+ engineers (as of Q3 2017)
  • 6. …. ● Memcached ● Redis ● Spark ● Redshift ● DynamoDB ● S3 ● OSCON talk on data tiers at Yelp And many more!
  • 7. Distributed Systems Team ● Several TB in prod cassandra clusters with tens of nodes in each. ● Half a million messages/second in our streaming pipeline ● Several TB in elasticsearch in prod with several hundred nodes ● All are multi-AZ multi-region ● And growing…
  • 10. ● Safe ● Generic and Extensible ● Distributed ● Loosely coupled ● Not ad-hoc ○ Reviewed ● Sound config management The Why I
  • 11. ● Schedulable ● Reusable ● Cluster awareness ● Easily maintainable and observable ○ Not a black box. ○ More Ironman, less Ultron ● Prior Art ○ Downsides The Why II
  • 12. ● Paramount* ● Serialized execution ○ ‘m’ out of ‘n’ ○ Disjoint jobs. ● Avoid cascade ● Privilege escalation ● Push-based * Unless oncall is automated too. Safety
  • 13. Quotes “There are only two hard problems in distributed systems: 2. Exactly-once delivery 1. Guaranteed order of messages 2. Exactly-once delivery” @mathiasverraes “There are 2 hard problems in computer science: cache invalidation, naming things, and off-by-1 errors.” @secretGeek
  • 14. ● Network is reliable ● Latency is zero ● Bandwidth is infinite ● Network is secure ● One administrator ● Transport cost is zero ● Network is homogeneous ● Topology doesn't change Fallacies of Distributed Systems
  • 16. ● Scheduler ● Router ● Co-ordinator ● Transport ● Executor ● Error handler ● Configuration ● Monitoring ● Tooling Components
  • 17. [Diagram: flow of a task from the Task Scheduler through the Router to SQS node queues (Q1, Q2, Q3) and on to the Nodes; labels: T1, T2, T3, Lease, Failure, Dead Letter Queue]
  • 18. ● Runs on Chronos ● Emits a task ● Enqueues into global queue ● Ad-hoc support ● Deployment granularities ● Task tracking ● Yelpsoa-configs Task Scheduler
  • 19. # Anatomy of a Taskerman Task
    {
        'action': 'heartbeat:ping',
        'version': 'X.Y',
        'limit': 2,  # To limit executors.
        'cluster_name': 'cassandra:geo_counter',
        'discovery': 'aws_tags',
        'owner': 'distsys',  # For alerting
        'task_id': 'f2f6e03f-539a-49ad-8bf0-ecf079df36f5',  # For status tracking
        'taskerman_params': {
            'action_args': {},
            'workqueue_args': {},
        },
        'nodes': [a, b, c, d],  # For passthrough mode.
        'destnode': '',  # Mutated by router for DLQ
    }
  • 20. ● AWS SQS ● Best-effort FIFO ● Reliable and cheap ● Low latency ● Properties ○ Read without delete ○ Visibility timeout ○ Retry ○ Dead Letter Queue Queue
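The queue properties listed on this slide (read without delete, visibility timeout, retry, dead letter queue) can be illustrated with a small in-memory model. This is only a sketch of the semantics, not the SQS API and not Taskerman's code: the class `VisibilityQueue`, its parameters, and the injectable `clock` are all hypothetical names chosen for the example.

```python
import itertools
import time


class VisibilityQueue:
    """In-memory model of a queue with a visibility timeout and a DLQ (hypothetical)."""

    def __init__(self, visibility_timeout=30.0, max_receives=3, clock=time.monotonic):
        self.visibility_timeout = visibility_timeout
        self.max_receives = max_receives
        self.clock = clock
        self._ids = itertools.count(1)
        self._messages = {}     # id -> {"body", "receives", "invisible_until"}
        self.dead_letters = []  # bodies that exhausted their retries

    def send(self, body):
        msg_id = next(self._ids)
        self._messages[msg_id] = {"body": body, "receives": 0, "invisible_until": 0.0}
        return msg_id

    def receive(self):
        """Return (msg_id, body) for the first visible message, or None.

        Receiving does not delete: the message is only hidden until the
        visibility timeout expires, after which it can be received again.
        """
        now = self.clock()
        for msg_id in list(self._messages):
            msg = self._messages[msg_id]
            if msg["invisible_until"] > now:
                continue  # currently held by another consumer
            if msg["receives"] >= self.max_receives:
                # Retries exhausted: move the message to the dead letter queue.
                self.dead_letters.append(self._messages.pop(msg_id)["body"])
                continue
            msg["receives"] += 1
            msg["invisible_until"] = now + self.visibility_timeout
            return msg_id, msg["body"]
        return None

    def delete(self, msg_id):
        """Acknowledge a successfully processed message."""
        self._messages.pop(msg_id, None)
```

A consumer that crashes before calling `delete` simply lets the timeout lapse, which is what turns "read without delete" into an implicit retry.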
  • 21. ● Stateless Marathon worker ● Routes tasks from global queue ○ To node-specific queues ● At-least-once delivery ● Queue creation ○ Top-down principle ● Passthrough Task Router
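The routing step described on this slide could look roughly like the following. It is a sketch under assumptions, not the actual router: `route_task`, the `discover` callback, and the in-memory `node_queues` mapping are hypothetical stand-ins for the real SQS queues and discovery layer, and the sketch deliberately ignores the task's `limit` field.

```python
import copy
from collections import defaultdict, deque


def route_task(task, discover, node_queues):
    """Fan a task out from the global queue to node-specific queues.

    Passthrough mode: if the task already names its nodes, use them as given.
    Otherwise ask the discovery callback for the nodes of the cluster.
    """
    nodes = task.get("nodes") or discover(task["cluster_name"])
    for node in nodes:
        routed = copy.deepcopy(task)  # each node gets its own copy
        routed["destnode"] = node     # recorded so a dead-lettered task names its node
        node_queues[node].append(routed)
    return list(nodes)


# Usage with made-up node addresses:
queues = defaultdict(deque)  # a node queue is created the first time it is needed
task = {"action": "heartbeat:ping", "cluster_name": "cassandra:geo_counter", "nodes": []}
routed_to = route_task(task, lambda cluster: ["10.0.0.1", "10.0.0.2"], queues)
```

Keeping the function free of state matches the "stateless worker" bullet: everything it needs arrives in the task or through the discovery callback.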
  • 22. ● ‘DNS of Taskerman’ ● Cluster => [..,Node IP,..] ● EC2 tags or Smartstack ● Liveness filtering ● Pluggable ● Challenges ○ Rate-limiting Task Router :: Discovery
  • 23. ● The executor of Taskerman ● Dequeues a task and executes it ○ Pre-defined reviewed code. ● Scheduled on node ● Zookeeper for coordination ● Task deleted upon success ● Dead letter queue upon failed retries Taskrunner
  • 24.
    class TestTaskRunner(TaskRunner):
        def __init__(self, task, ..):
            ..
        def pre_check(self):
            ..
        def post_check(self):
            ..
        def execute_action(self):
            ..
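The slide shows only the hook names of a task runner subclass. One plausible way those hooks fit together is a template method that calls them in order; the sketch below assumes that shape. The `run` method, the return values, and `HeartbeatTaskRunner` are hypothetical and not taken from the slides.

```python
class TaskRunner:
    """Hypothetical base class: runs pre_check, execute_action, post_check in order."""

    def __init__(self, task):
        self.task = task

    def pre_check(self):
        # Return False to skip the action (for example, the node is not in a safe state).
        return True

    def execute_action(self):
        raise NotImplementedError

    def post_check(self):
        # Return False if the action did not leave the node in the expected state.
        return True

    def run(self):
        if not self.pre_check():
            return "skipped"
        self.execute_action()
        if not self.post_check():
            return "failed"
        return "succeeded"


class HeartbeatTaskRunner(TaskRunner):
    """Illustrative subclass for the 'heartbeat:ping' action."""

    def __init__(self, task):
        super().__init__(task)
        self.pinged = False

    def pre_check(self):
        return self.task.get("action") == "heartbeat:ping"

    def execute_action(self):
        self.pinged = True  # a real runner would do the actual work here

    def post_check(self):
        return self.pinged
```

Calling `HeartbeatTaskRunner({"action": "heartbeat:ping"}).run()` walks all three hooks, while a task with any other action is skipped by `pre_check`.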
  • 25. ● Distributed Coordinator ● Non Blocking Lease ○ Time-based lease ○ Mutual exclusion ○ Global lease ● Ephemeral locks Zookeeper
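A time-based, non-blocking lease with mutual exclusion can be sketched in memory as follows. This is not ZooKeeper and uses no ZooKeeper client API; `LeaseTable` and its methods are hypothetical names, and a process-local lock stands in for the coordination service.

```python
import threading
import time


class LeaseTable:
    """In-memory stand-in for a coordinator that hands out time-based leases."""

    def __init__(self, clock=time.monotonic):
        self._clock = clock
        self._lock = threading.Lock()
        self._leases = {}  # lease name -> (holder, expires_at)

    def try_acquire(self, name, holder, ttl):
        """Return True if holder now owns the lease.

        Non-blocking: if another holder has an unexpired lease, return False
        immediately instead of waiting for it to become free.
        """
        now = self._clock()
        with self._lock:
            current = self._leases.get(name)
            if current is not None and current[1] > now and current[0] != holder:
                return False
            # Free, expired, or a renewal by the current holder.
            self._leases[name] = (holder, now + ttl)
            return True

    def release(self, name, holder):
        """Give the lease back early; only the current holder may release it."""
        with self._lock:
            current = self._leases.get(name)
            if current is not None and current[0] == holder:
                del self._leases[name]
```

Because the lease expires on its own, a holder that dies without releasing does not block other nodes forever, which is the property the "time-based" bullet points at.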
  • 26. ● Atomic Counters ○ Statistics on actions ○ Circuit breakers ■ Dead man’s switch ■ Prevent failure cascade ○ Automatic reset ● What is Atomic ○ Serializability Zookeeper
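The combination of an atomic counter, a circuit breaker, and an automatic reset might be modelled like this. It is an in-memory sketch under assumptions: `FailureBreaker`, its `threshold` and `reset_after` parameters, and the fixed reset window are hypothetical, and a local lock stands in for the atomicity that a ZooKeeper-backed counter would provide.

```python
import threading
import time


class FailureBreaker:
    """Counts failures atomically and stops execution once a threshold is hit."""

    def __init__(self, threshold, reset_after, clock=time.monotonic):
        self.threshold = threshold
        self.reset_after = reset_after
        self._clock = clock
        self._lock = threading.Lock()
        self._failures = 0
        self._window_start = clock()

    def _maybe_reset(self, now):
        # Automatic reset: start a fresh window once the old one has aged out.
        if now - self._window_start >= self.reset_after:
            self._failures = 0
            self._window_start = now

    def record_failure(self):
        """Increment the failure counter and return its new value."""
        with self._lock:
            self._maybe_reset(self._clock())
            self._failures += 1
            return self._failures

    def allows_execution(self):
        """False while the breaker is open, to keep failures from cascading."""
        with self._lock:
            self._maybe_reset(self._clock())
            return self._failures < self.threshold
```

An executor would check `allows_execution()` before picking up the next task and call `record_failure()` whenever an action fails, so a run of failures across the cluster halts further work until the window resets.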
  • 27. ● Staleness ○ Nodes can go down ● Garbage collection ○ Cleanup of ZK data structures ● Composition ● Starvation ● Uptime Zookeeper: Challenges
  • 28. ● Failure is the norm, not an exception ● Multiple vectors of failure ● Pessimistic approach ○ Job retry ○ Job Counter ● Mitigation vs Alerting Failure
  • 29. ● Heartbeat ping ○ End-to-end monitoring ● Dead Letter Queue ○ Recycle bin of failed tasks. ○ Hooks into human side of monitoring ● Others ○ Separation of state ○ Mutability Failure handling
  • 32. ● Observability ○ Job status logging at stages ○ End-to-end logging with Scribe ○ Metrics ○ Signalfx (Queue lengths) ○ Splunk dashboards ● Alerting ○ Sensu ○ Signalfx Monitoring
  • 33. ● Restarts ● Reboots ● Instance Replacement ● Integration tests ● Kafka config mgmt ● Backup and restore ● Search indexing ● .. and many more. Use cases
  • 34. Uptime management
    $ uptime
    06:52:54 up 99 days, 19:20, 1 user, load average: 0.02, 0.03, 0.07
    $ ps -eo pid,cmd,lstart | grep ..
    10058 zookeeper Tue Dec 5 05:23:43 2017
  • 35. Q & A ● Slides will also be uploaded to slideshare.net/slidunder.
  • 38. ● https://www.elastic.co/products/elasticsearch ● https://zookeeper.apache.org/ ● https://kafka.apache.org/ ● https://www.flickr.com/photos/dapuglet/6291424431 ● http://www.alamy.com/stock-photo/cattle-penning.html ● http://www.firstcallsigns.co.uk/content/images/thumbs/0000927_EE80127.jpeg ● https://sensuapp.org/img/logo-flat-white.png ● https://thumbs.gfycat.com/FocusedCompetentEyas-max-1mb.gif ● https://www.percona.com/sites/default/files/dashboard.png ● https://www.sales-initiative.com/downloads/2856/download/resilience.jpg?cb=29f43ac82cea225ab3ee370d7580760d ● http://izquotes.com/quotes-pictures/quote-a-distributed-system-is-one-in-which-the-failure-of-a-computer-you-didn-t-even-know-existed-can-leslie-lamport-346227.jpg ● https://pbs.twimg.com/media/DRCfqaCWsAczqTz.jpg ● https://upload.wikimedia.org/wikipedia/en/thumb/e/e0/Iron_Man_bleeding_edge.jpg/220px-Iron_Man_bleeding_edge.jpg ● https://github.com/mesos/chronos ● https://github.com/mesosphere Image Credits
  • 39. ● http://www.networknuts-web.biz/wp-content/uploads/2014/10/cron-logo.png ● http://www.pvhc.net/img195/ojfspebrvfblupftgajb.png ● https://fun-damentals.com/wp-content/uploads/2016/05/a-resilience.png ● http://www.azquotes.com/picture-quotes/quote-debugging-is-twice-as-hard-as-writing-the-code-in-the-first-place-therefore-if-you-write-brian-kernighan-66-91-06.jpg ● https://thenounproject.com/ ● https://aws.amazon.com/ ● https://www.splunk.com/ ● https://www.terraform.io/ ● http://yelp.com ● http://cloudscaling.com/blog/cloud-computing/the-history-of-pets-vs-cattle/ Image Credits
  • 40. ● https://engineeringblog.yelp.com/2015/03/using-services-to-break-down-monoliths.html ● http://cloudscaling.com/blog/cloud-computing/the-history-of-pets-vs-cattle/ ● https://martinfowler.com/bliki/TwoHardThings.html ● https://zookeeper.apache.org/ ● https://www.terraform.io/ ● https://github.com/Yelp/service-principles ● https://en.wikipedia.org/wiki/Law_of_Demeter Further Reading