Fine-Grained Scheduling with Helix (ApacheCon NA 2014)

Talk from ApacheCon NA 2014: how Helix can integrate with provisioners like YARN or Mesos to manage the entire lifecycle of a distributed system.



  1. Fine-Grained Scheduling with Helix. Kanak Biscuitwala and Jason Zhang, Apache Helix committers @ LinkedIn. helix.apache.org | @apachehelix
  2. About Helix. Helix automates the assignment of partitioned and replicated distributed tasks in the face of node failure and recovery, cluster expansion, and reconfiguration.
  3. Helix at LinkedIn. (Diagram, all in production: user writes go to Oracle DBs; change capture feeds change consumers, a search index, and a data replicator; backup/restore; ETL into HDFS for analytics.)
  4. Helix at LinkedIn, in production: over 1,000 instances covering over 30,000 database partitions; over 1,000 instances for change-capture consumers; as many as 500 instances in a single Helix cluster. (All numbers are per datacenter.)
  5. Others Using Helix
  6. Datacenter Diversity
  7. Intersection of Job Types. (Diagram: two Oracle DBs.)
  8. Intersection of Job Types. (Diagram: the Oracle DBs, plus backup jobs.)
  9. Intersection of Job Types. (Diagram: the Oracle DBs and backups, plus ETL into HDFS.)
  10. Intersection of Job Types. Long-running and batch jobs running together!
  11. Cloud Deployment. (Diagram: applications A (online), B (nearline), and C (batch), with instances A1-A3, B1-B5, and C1-C4 spread across machines.) Applications with diverse requirements running together in a datacenter.
  12. Cloud Deployment. (Same diagram, relabeled: A is a DB, B is backup, C is ETL.) Applications with diverse requirements running together in a datacenter.
  13. Processes on Machines. (Diagram legend: machine, process, container, VM.)
  14. Processes on Machines: no isolation. (Diagram: processes running directly on a machine.)
  15. Processes on Machines: VM-based isolation. (Diagram: processes running in 128 MB VMs.)
  16. Processes on Machines: container-based isolation. (Diagram: processes running in containers of 256 MB, 128 MB, and 64 MB.)
  17. Processes on Machines
      • Run as individual processes: poor isolation or poor utilization
      • Virtual machines: better isolation (Xen, Hyper-V, ESX, KVM)
      • Containers: cgroups (YARN, Mesos); super lightweight, sized dynamically based on application requirements
  18. Processes on Machines. Virtualization and containerization significantly improve process isolation and open up possibilities for efficient utilization of physical resources.
  19. Container-Based Solution
  20. Container-Based Solution: system requirements. (Diagram: A needs three 64 MB containers, B needs two 128 MB containers, C needs one 256 MB container.)
  21. Container-Based Solution: allocation. (Diagram: containers of 64, 64, 64, 128, 128, and 256 MB placed on machines.)
  22. Container-Based Solution: allocation. (Same diagram, with processes A, A, A, B, B, and C in those containers.)
  23. Container-Based Solution: allocation. Containerization is powerful!
  24. Container-Based Solution: allocation. Containerization is powerful! But do processes always fit so nicely?
  25. Container-Based Solution: over-utilization. (Diagram: a 256 MB container.)
  26. Container-Based Solution: over-utilization. (Diagram: process 1 running in the 256 MB container.)
  27. Container-Based Solution: over-utilization. (Diagram: process 1 outgrows the 256 MB container.) Outcome: preemption and relaunch.
  28. Container-Based Solution: over-utilization. (Diagram: a new 384 MB container is allocated.) Outcome: preemption and relaunch.
  29. Container-Based Solution: over-utilization. (Diagram: process 1 relaunched in the 384 MB container.) Outcome: preemption and relaunch.
  30. Container-Based Solution: under-utilization. (Diagram: 384 MB and 128 MB containers.)
  31. Container-Based Solution: under-utilization. (Diagram: process 1 uses little of its 384 MB container; process 2 fills its 128 MB container.) Outcome: over-provisioned until restart.
  32. Container-Based Solution: failure. (Diagram: processes A, A, A, B, B, and C in containers of 64, 64, 64, 128, 128, and 256 MB across machines.)
  33. Container-Based Solution: failure. (Diagram: a machine is lost, leaving A, A, B, and B in 64, 64, 128, and 128 MB containers.)
  34. Container-Based Solution: failure. (Diagram: the lost A (64 MB) and C (256 MB) containers are launched elsewhere.) Outcome: launch containers elsewhere. But what about stateful systems?
  35. Container-Based Solution: failure, stateful case. (Diagram: the same layout, but the processes are a MASTER and two SLAVE replicas, plus B, B, and C.)
  36. Container-Based Solution: failure, stateful case. (Diagram: the machine hosting the MASTER is lost.) Without additional information, the master is unavailable until restart.
  37. Container-Based Solution: scaling. (Diagram: two 256 MB containers, each serving 50% of the workload.)
  38. Container-Based Solution: scaling. (Diagram: the containers are torn down.)
  39. Container-Based Solution: scaling. (Diagram: three 128 MB containers, each serving 33% of the workload.) Outcome: relaunch with new sharding.
  40. Container-Based Solution, summarized:
      Utilization: application requirements define container size
      Fault tolerance: a new container is started
      Scaling: workload is repartitioned and new containers are brought up
      Discovery: existence
  41. Container-Based Solution. The container model provides flexibility within machines, but assumes homogeneity of tasks within containers. We need something finer-grained.
  42. Task-Based Solution
  43. Task-Based Solution: system requirements. A: complete in less than 5 hours. B: always have 2 containers running. C: response time should be less than 50 ms.
  44. Task-Based Solution: allocation. (Diagram: tasks from A, B, and C packed together into a small set of containers across machines.)
  45. Task-Based Solution: over-utilization. (Diagram legend: machine, container, task.)
  46. Task-Based Solution: over-utilization. (Diagram: task 1 running in a container.)
  47. Task-Based Solution: over-utilization. (Diagram: task 1 outgrows its container.)
  48. Task-Based Solution: over-utilization. (Diagram: task 1 is also placed in a second container.)
  49. Task-Based Solution: over-utilization. Hide the overhead of a container restart.
  50. Task-Based Solution: under-utilization. (Diagram: 384 MB and 128 MB containers.)
  51. Task-Based Solution: under-utilization. (Diagram: task 1 in the 384 MB container, task 2 in the 128 MB container.)
  52. Task-Based Solution: under-utilization. (Diagram: both tasks packed into the 384 MB container.) Optimize container allocations based on usage.
  53. Task-Based Solution: failure. (Diagram: three containers; each hosts one task's leader plus standbys for the other two: task 1 leader with task 2 and task 3 standbys, task 2 leader with task 1 and task 3 standbys, task 3 leader with task 1 and task 2 standbys.)
  54. Task-Based Solution: failure. (Diagram: one container is lost; a surviving task 3 standby becomes the task 3 leader.)
  55. Task-Based Solution: failure. Some systems cannot wait for new containers to start.
  56. Task-Based Solution: discovery. (Diagram: nodes N1 and N2. Task 1: leader at N1, standby at N2. Task 2: leader at N2, standby at N1.)
  57. Task-Based Solution: discovery. Learn where everything runs, and what state each task is in.
  58. Task-Based Solution: scaling. (Diagram: tasks T1, T2, T3 in one container and T4, T5, T6 in another.)
  59. Task-Based Solution: scaling. (Diagram: a new, empty container is added.)
  60. Task-Based Solution: scaling. (Diagram: tasks begin moving to the new container.)
  61. Task-Based Solution: scaling. (Diagram: the tasks are spread across all three containers, with no repartitioning.)
  62. Comparing Solutions
      Utilization. Container solution: application requirements define container size. Task + container solution: tasks are distributed as needed to a minimal container set, per SLA.
      Fault tolerance. Container solution: a new container is started. Task + container solution: an existing task can assume a new state while waiting for the new container.
      Scaling. Container solution: workload is repartitioned and new containers are brought up. Task + container solution: tasks are moved across containers.
      Discovery. Container solution: existence. Task + container solution: existence and state.
  63. Comparing Solutions: benefits of a task-based solution. Container reuse; minimized overhead of container relaunch; fine-grained scheduling.
  64. Comparing Solutions: benefits of a task-based solution. Container reuse; minimized overhead of container relaunch; fine-grained scheduling. Task : Container :: Thread : Process. The task is the right level of abstraction.
  65. Comparing Solutions. Working at task granularity is powerful. We need a performance-centric approach to resource assignment.
  66. Comparing Solutions. Working at task granularity is powerful. We need a performance-centric approach to resource assignment. How can Helix help?
  67. Comparing Solutions. YARN/Mesos: containers bring flexibility within a machine. Helix: tasks bring flexibility within a container.
  68. Task Management with Helix
  69. Application Lifecycle
      Capacity planning: allocating physical resources for your load
      Provisioning: deploying and launching tasks
      Fault tolerance: staying available, ensuring success
      State management: determining what code should be running, and where
  70. Helix Overview: resources (task groups). (Diagram: a resource split into partitions, with replicas in master, slave, and offline states.) All resources can be partitioned (these are tasks). All partitions can be replicated. Each replica is in a state.
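  To make the resource/partition/replica model concrete, here is a minimal sketch of creating a cluster and a partitioned, replicated resource with the standard Helix admin API. The cluster name, resource name, ZooKeeper address, and counts are illustrative, and depending on the Helix version the MasterSlave state model may already be registered by default:

      import org.apache.helix.HelixAdmin;
      import org.apache.helix.manager.zk.ZKHelixAdmin;
      import org.apache.helix.model.StateModelDefinition;
      import org.apache.helix.tools.StateModelConfigGenerator;

      public class ClusterSetupSketch {
        public static void main(String[] args) {
          // Helix stores cluster state in ZooKeeper (address is illustrative).
          HelixAdmin admin = new ZKHelixAdmin("localhost:2181");
          admin.addCluster("MyCluster", true);

          // Register the MasterSlave state model (may already exist by default).
          admin.addStateModelDef("MyCluster", "MasterSlave",
              new StateModelDefinition(
                  StateModelConfigGenerator.generateConfigForMasterSlave()));

          // A resource with 6 partitions; each replica follows MasterSlave.
          admin.addResource("MyCluster", "MyDB", 6, "MasterSlave");

          // Compute an ideal state with 3 replicas per partition; the controller
          // then drives each replica to MASTER or SLAVE on live participants.
          admin.rebalance("MyCluster", "MyDB", 3);
        }
      }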
  71. Helix Overview: state model and constraints. (State machine: Offline, Slave (StateCount=2), Master (StateCount=1).)
      State constraints: partition: Master: [1, 1], Slave: [0, 2]; node: no more than 10 partitions.
      Transition constraints: partition: max T1 transitions in parallel; resource: max T2 transitions in parallel; node: max T3 transitions in parallel; cluster: max T4 transitions in parallel.
  72. Helix Overview: state model and constraints (as above). Helix manages task state.
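  The state machine and counts on this slide map directly onto Helix's StateModelDefinition builder. A minimal sketch, assuming the 0.6.x-era builder API; the priority values and the dynamic "R" (replica count) bound are illustrative:

      import org.apache.helix.model.StateModelDefinition;

      public class MasterSlaveModelSketch {
        public static StateModelDefinition build() {
          StateModelDefinition.Builder builder =
              new StateModelDefinition.Builder("MasterSlave");

          // States, highest priority first; every replica starts OFFLINE.
          builder.addState("MASTER", 0);
          builder.addState("SLAVE", 1);
          builder.addState("OFFLINE", 2);
          builder.initialState("OFFLINE");

          // Legal transitions, mirroring the arrows on the slide.
          builder.addTransition("OFFLINE", "SLAVE");
          builder.addTransition("SLAVE", "MASTER");
          builder.addTransition("MASTER", "SLAVE");
          builder.addTransition("SLAVE", "OFFLINE");

          // State constraints: exactly one MASTER per partition; SLAVE bounded
          // by the replica count ("R").
          builder.upperBound("MASTER", 1);
          builder.dynamicUpperBound("SLAVE", "R");

          return builder.build();
        }
      }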
  73. Helix Overview: cluster roles. (Diagram: controllers manage tasks across participants; spectators observe the cluster.)
  74. Helix Controller: high-level overview. (Diagram: the rebalancer takes constraints, e.g. "single master", "no more than 3 tasks per instance", plus the live nodes, and produces a task assignment.)
  75. Helix Controller: Rebalancer. Based on the current nodes in the cluster and the constraints, find an assignment of tasks to nodes:

      ResourceAssignment computeResourceMapping(
          RebalancerConfig rebalancerConfig,
          ResourceAssignment prevAssignment,
          Cluster cluster,
          ResourceCurrentState currentState);

  76. Helix Controller: Rebalancer (interface as above). What else do we need?
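  What a rebalancer computes is, at its core, a partition-to-node-to-state map. The slide shows the (then-evolving) Helix interface; below is a self-contained sketch of that core logic with simplified stand-in types rather than the real Helix signatures, assigning replicas round-robin and making the first replica of each partition the MASTER:

      import java.util.HashMap;
      import java.util.List;
      import java.util.Map;

      public class RoundRobinRebalancerSketch {
        /**
         * Assign each partition's replicas to live nodes round-robin; the first
         * replica of each partition is the MASTER, the rest are SLAVEs.
         * Returns partition -> (node -> state), the shape a rebalancer produces.
         */
        public static Map<String, Map<String, String>> computeResourceMapping(
            List<String> partitions, List<String> liveNodes, int replicas) {
          Map<String, Map<String, String>> assignment = new HashMap<>();
          for (int p = 0; p < partitions.size(); p++) {
            Map<String, String> replicaMap = new HashMap<>();
            for (int r = 0; r < replicas && r < liveNodes.size(); r++) {
              String node = liveNodes.get((p + r) % liveNodes.size());
              replicaMap.put(node, r == 0 ? "MASTER" : "SLAVE");
            }
            assignment.put(partitions.get(p), replicaMap);
          }
          return assignment;
        }
      }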
  77. Helix Controller: what is missing?
      • Dynamic container allocation
      • Container isolation
      • Automated service deployment
      • Resource utilization monitoring
  78. Helix Controller: Target Provider. Based on some constraints, determine how many containers are required in this system. Strategies: fixed, CPU, memory, bin packing. We're working on integrating with monitoring systems in order to query for usage information.
  79. Helix Controller: Target Provider (as above), with the interface:

      TargetProviderResponse evaluateExistingContainers(
          Cluster cluster,
          ResourceId resourceId,
          Collection<Participant> participants);

      class TargetProviderResponse {
        List<ContainerSpec> containersToAcquire;
        List<Participant> containersToRelease;
        List<Participant> containersToStop;
        List<Participant> containersToStart;
      }
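  As an example of the "fixed" strategy, here is a sketch of a target provider that steers the cluster toward a constant container count. ContainerSpec, Participant, and the response type are simplified stand-ins for the Helix types on the slide, so the snippet is self-contained:

      import java.util.ArrayList;
      import java.util.List;

      public class FixedTargetProviderSketch {
        // Simplified stand-ins for the Helix types shown on the slide.
        static class ContainerSpec { final int memoryMb; ContainerSpec(int m) { memoryMb = m; } }
        static class Participant { final String id; Participant(String id) { this.id = id; } }
        static class TargetProviderResponse {
          final List<ContainerSpec> containersToAcquire = new ArrayList<>();
          final List<Participant> containersToRelease = new ArrayList<>();
        }

        private final int targetContainers;

        public FixedTargetProviderSketch(int targetContainers) {
          this.targetContainers = targetContainers;
        }

        /** Compare running containers to the fixed target; say what to acquire or release. */
        public TargetProviderResponse evaluateExistingContainers(
            List<Participant> running, int memoryMbPerContainer) {
          TargetProviderResponse response = new TargetProviderResponse();
          // Too few containers: ask the container provider to acquire the difference.
          for (int i = running.size(); i < targetContainers; i++) {
            response.containersToAcquire.add(new ContainerSpec(memoryMbPerContainer));
          }
          // Too many: release the surplus.
          if (running.size() > targetContainers) {
            response.containersToRelease.addAll(
                running.subList(targetContainers, running.size()));
          }
          return response;
        }
      }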
  80. Helix Controller: adding a Target Provider. (Diagram: the target provider feeds the rebalancer alongside the constraints and nodes; the rebalancer produces the task assignment.)
  81. Helix Controller: adding a Target Provider (diagram as above). How do we use the target provider response?
  82. Helix Controller: Container Provider. Given the container requirements, ensure that the required number of containers is running. Implementations: YARN, Mesos, logical.
  83. Helix Controller: Container Provider (as above), with the interface:

      ListenableFuture<ContainerId> allocateContainer(ContainerSpec spec);
      ListenableFuture<Boolean> deallocateContainer(ContainerId containerId);
      ListenableFuture<Boolean> startContainer(ContainerId containerId, Participant participant);
      ListenableFuture<Boolean> stopContainer(ContainerId containerId);
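  A sketch of the "logical" flavor, which does pure in-process bookkeeping instead of talking to YARN or Mesos. It assumes Guava on the classpath for ListenableFuture; ContainerSpec, ContainerId, and Participant are stand-ins for the Helix types:

      import java.util.concurrent.ConcurrentHashMap;
      import java.util.concurrent.atomic.AtomicInteger;

      import com.google.common.util.concurrent.Futures;
      import com.google.common.util.concurrent.ListenableFuture;

      public class LogicalContainerProviderSketch {
        // Stand-ins for the Helix types on the slide.
        static class ContainerSpec { final int memoryMb; ContainerSpec(int m) { memoryMb = m; } }
        static class ContainerId { final String id; ContainerId(String id) { this.id = id; } }
        static class Participant { final String name; Participant(String n) { name = n; } }

        // Container id -> started? A "logical" provider is pure bookkeeping.
        private final ConcurrentHashMap<String, Boolean> containers = new ConcurrentHashMap<>();
        private final AtomicInteger counter = new AtomicInteger();

        public ListenableFuture<ContainerId> allocateContainer(ContainerSpec spec) {
          ContainerId id = new ContainerId("container-" + counter.incrementAndGet());
          containers.put(id.id, false); // allocated but not yet started
          return Futures.immediateFuture(id);
        }

        public ListenableFuture<Boolean> deallocateContainer(ContainerId id) {
          return Futures.immediateFuture(containers.remove(id.id) != null);
        }

        public ListenableFuture<Boolean> startContainer(ContainerId id, Participant p) {
          // A real provider would launch the participant process here (YARN and
          // Mesos do this via their own launch APIs); logically we flip a flag.
          return Futures.immediateFuture(containers.replace(id.id, true) != null);
        }

        public ListenableFuture<Boolean> stopContainer(ContainerId id) {
          return Futures.immediateFuture(containers.replace(id.id, false) != null);
        }
      }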
  84. Helix Controller: logical container provider. (Diagram.)
  85. Helix Controller: logical container provider. (Diagram: the controller uses the provider, which invokes startContainer.)
  86. Helix Controller: adding a Container Provider. (Diagram: the rebalancer, constraints, nodes, target provider, and container provider together produce the task assignment.) Target Provider + Container Provider = Provisioner.
  87. Application Lifecycle, with Helix and the task abstraction:
      Capacity planning: Target Provider
      Provisioning: Container Provider
      Fault tolerance: existing Helix controller (enhanced by the Provisioner)
      State management: existing Helix controller (enhanced by the Provisioner)
  88. System Architecture
  89. System Architecture. (Diagram: a resource provider.)
  90. System Architecture. (Diagram: a client submits a job to the resource provider.)
  91. System Architecture. (Diagram: plus a controller containing the container provisioner, the rebalancer, and an app launcher.)
  92. System Architecture. (Diagram: the controller sends a container request to the resource provider.)
  93. System Architecture. (Diagram: plus participant containers, each running a participant launcher, a Helix participant, and the app.)
  94. System Architecture. (Diagram: the controller assigns tasks to the participant containers.)
  95. Helix + YARN: YARN architecture. (Diagram: a client submits a job to the Resource Manager; Node Managers report node status; the Application Master sends container requests, receives status, and assigns work to containers; containers grab the app package from HDFS or a common area.)
  96. Helix + YARN: Helix + YARN architecture. (Same diagram, with the Helix controller and rebalancer inside the Application Master, a Helix participant alongside the app in each container, and the controller assigning tasks.)
  97. Helix + Mesos: Mesos architecture. (Diagram: the Mesos master offers resources to a scheduler, which sends offer responses; Mesos slaves on slave machines report node status and run Mesos executors, grabbing the executor package from HDFS or a common area.)
  98. Helix + Mesos: Helix + Mesos architecture. (Same diagram, with the Helix controller inside the scheduler, participants inside the executors, and the controller assigning tasks.)
  99. Example
  100. Distributed Document Store: overview. (Diagram: three Oracle instances, each holding partitions 0-2 in master or slave state; per-partition backups P0, P1, and P2; ETL into HDFS.)
  101. Distributed Document Store: overview. (Same diagram, with one instance's partitions called out.)
  102. Distributed Document Store: YARN example. (Diagram: the client submits the job to the Resource Manager; the Application Master hosts the Helix controller and rebalancer, sends container requests, and assigns work; Node Managers report node status; a container hosts the Helix participant running Oracle partitions 0 and 1, the P1 backup, and ETL.)
  103. Distributed Document Store: YAML specification.

      appConfig: { config: { k1: v1 } }
      appPackageUri: 'file://path/to/myApp-pkg.tar'
      appName: myApp
      services: [DB, ETL]  # the task containers
      serviceConfigMap: { DB: { num_containers: 3, memory: 1024 }, ...
                          ETL: { time_to_complete: 5h, ... }, ... }
      servicePackageURIMap: { DB: 'file://path/to/db-service-pkg.tar', ... }
      ...

  104. Distributed Document Store: YAML specification (as above). The service config entries (e.g. num_containers, time_to_complete) form the TargetProvider specification.
  105. Distributed Document Store: service/container implementation.

      public class MyQueuerService extends StatelessParticipantService {
        @Override
        public void init() { ... }

        @Override
        public void onOnline() { ... }

        @Override
        public void onOffline() { ... }
      }
  106. Distributed Document Store: task implementation.

      public class BackupTask extends Task {
        @Override
        public ListenableFuture<Status> start() { ... }

        @Override
        public ListenableFuture<Status> cancel() { ... }

        @Override
        public ListenableFuture<Status> pause() { ... }

        @Override
        public ListenableFuture<Status> resume() { ... }
      }
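  A minimal sketch of what a concrete start() could look like, using Guava's listening executor so the method returns a ListenableFuture immediately. Status and the chunked copy loop are stand-ins, not the talk's actual implementation:

      import java.util.concurrent.Executors;
      import java.util.concurrent.atomic.AtomicBoolean;

      import com.google.common.util.concurrent.ListenableFuture;
      import com.google.common.util.concurrent.ListeningExecutorService;
      import com.google.common.util.concurrent.MoreExecutors;

      public class BackupTaskSketch {
        enum Status { COMPLETED, CANCELED }  // stand-in for the talk's Status type

        private final ListeningExecutorService executor =
            MoreExecutors.listeningDecorator(Executors.newSingleThreadExecutor());
        private final AtomicBoolean canceled = new AtomicBoolean(false);

        /** Kick off the backup and return immediately with a future for its outcome. */
        public ListenableFuture<Status> start() {
          return executor.submit(() -> {
            for (int chunk = 0; chunk < 100; chunk++) { // stand-in for real copy work
              if (canceled.get()) {
                return Status.CANCELED;  // observe cancellation between chunks
              }
              // copyChunkToHdfs(chunk); // hypothetical helper: stream data to HDFS
            }
            return Status.COMPLETED;
          });
        }

        public void cancel() {
          canceled.set(true);
        }
      }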
  107. Distributed Document Store: state model-style callbacks.

      public class StoreStateModel extends StateModel {
        public void onBecomeMasterFromSlave() { ... }

        public void onBecomeSlaveFromMaster() { ... }

        public void onBecomeSlaveFromOffline() { ... }

        public void onBecomeOfflineFromSlave() { ... }
      }
  108. Distributed Document Store: spectator (for discovery).

      class RoutingLogic {
        public void write(Request request) {
          partition = getPartition(request.key);
          List<Participant> nodes =
              routingTableProvider.getInstance(partition, "MASTER");
          nodes.get(0).write(request);
        }

        public void read(Request request) {
          partition = getPartition(request.key);
          List<Participant> nodes =
              routingTableProvider.getInstance(partition);
          random(nodes).read(request);
        }
      }
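  Wiring such a spectator up uses the standard Helix connection API; the built-in RoutingTableProvider exposes getInstances(resource, partition, state), of which the slide's getInstance is a simplified version. The cluster, instance, and resource names below are illustrative:

      import java.util.List;

      import org.apache.helix.HelixManager;
      import org.apache.helix.HelixManagerFactory;
      import org.apache.helix.InstanceType;
      import org.apache.helix.model.InstanceConfig;
      import org.apache.helix.spectator.RoutingTableProvider;

      public class SpectatorSetupSketch {
        public static void main(String[] args) throws Exception {
          // Connect as a spectator: we observe cluster state but own no partitions.
          HelixManager manager = HelixManagerFactory.getZKHelixManager(
              "MyCluster", "router-1", InstanceType.SPECTATOR, "localhost:2181");
          manager.connect();

          // The provider keeps an up-to-date routing table from external-view changes.
          RoutingTableProvider routingTableProvider = new RoutingTableProvider();
          manager.addExternalViewChangeListener(routingTableProvider);

          // Look up which instances currently host partition MyDB_0 as MASTER.
          List<InstanceConfig> masters =
              routingTableProvider.getInstances("MyDB", "MyDB_0", "MASTER");
          System.out.println("Masters for MyDB_0: " + masters);
        }
      }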
  109. Recap
      • The container abstraction has become a huge win
      • With Helix, we can go a step further and make tasks the unit of work
      • With the TargetProvider and ContainerProvider abstractions, any popular provisioner can be plugged in
  110. Questions? Jason: zzhang@apache.org | Kanak: kanak@apache.org, @kinselofant | Website: helix.apache.org | Dev mailing list: dev@helix.apache.org | User mailing list: user@helix.apache.org | Twitter: @apachehelix
