Probabilistic Adaptive Load Balancing For Parallel Queries




  1. Probabilistic Adaptive Load Balancing for Parallel Queries
     Daniel M. Yellin (IBM Israel Software Lab)
     Jorge Buenabad-Chávez (Centro de Investigación y de Estudios Avanzados del IPN, Mexico)
     Norman W. Paton (University of Manchester)
  2. Autonomic Computing
     - Autonomic computing provides a general framework for adaptive systems (MAPE: Monitor, Analyze, Plan, Execute)
     - … but getting the details right is tough
     - When a trend is sensed, when should we adapt?
       - Not too early, not too late
     - What should we adapt to?
       - Adapting in the wrong way can make things worse!
  3. Our problem
     - Given:
       - A computational system S that can operate in one of several (possibly infinite) modes m1, m2, …
       - Each mode is optimized for a particular workload
     - Goal:
       - Monitor the existing workload and decide when to adapt S to a different mode, optimized for the current (or predicted) workload
     - Considerations:
       - Risk: no promise that the future workload will be similar to the current workload
       - Cost: each time we change the mode of S from mi to mj, we incur a cost. Switching modes can be expensive!
     [Figure: transitions among modes m1–m5; each mode switch entails a cost]
  4. Example: Pub-sub systems
     - Given:
       - S = a pub-sub system, comprising a server and a set of clients
       - Modes = {cache a particular data item on a client, store the data item on the server}
     - Goal:
       - Monitor the access patterns of clients and decide when to move a data item from server to client, or from client to server
     [Figure: a server holding items d1, d2, d3; Client 1 issues "read d3, read d3, …" while Client 2 issues "write d3, write d3, …"]
     Daniel M. Yellin: Competitive algorithms for the dynamic selection of component implementations. IBM Systems Journal 42(1): 85-97 (2003).
  5. Example: Data type implementation
     - Given:
       - S = an abstract data type with multiple implementations, each optimized for specific sorts of operations
     - Goal:
       - Monitor the operations on S and decide when to switch from one implementation to another
     [Figure: component data accessible through two implementations; requests of type X are faster using key K1 (impl 1), requests of type Y are faster using key K2 (impl 2)]
  6. Our approach
     - Monitor the existing workload and response times
     - Determine (a finite number of) modes to consider for adaptation
     - Determine the likelihood of (a finite number of) workloads in the immediate future
     - For each relevant mode, compute the expected cost (EC) of switching to that mode, based on the probability of the different workloads and the cost of processing each workload in that mode
     - Note: the cost of adaptation (SwitchCost) is included in EC
  7. Adaptive Query Processing
     - A query optimiser, given a query and information on the data involved and the environment in which the query is to be run, proposes an execution plan for that query that is predicted to yield the best response time.
     - If the information used by the optimiser is misleading (e.g. partial, incorrect, out-of-date or subject to change during query evaluation), the execution plan chosen by the optimiser may be inappropriate.
     - In Adaptive Query Processing, the execution plan is modified at query runtime, on the basis of feedback received from the environment.
  8. Adaptation for Load Balancing
     - In partitioned parallelism, a task is divided into subtasks that are run in parallel on different nodes.
     - For a join, A ⋈ B is represented as the union of the results of plan fragments Fi = Ai ⋈ Bi, for i = 1..P, where P is the level of parallelism.
     - The time taken to evaluate the join is max(evaluation_time(Fi)), for i = 1..P.
     - As a result, any delay in completing a fragment Fi delays the completion of the operator, so it is crucial to match fragment size to node capabilities.
     - Most join algorithms have state; as such, changing the size of a fragment allocated to a machine involves replicating or relocating operator state.
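The max(...) behaviour can be seen in a tiny sketch; the fragment times below are hypothetical, not measurements from the paper:

```python
def join_completion_time(fragment_times):
    # A ⋈ B completes only when every fragment Ai ⋈ Bi has completed,
    # so the slowest fragment determines the operator's evaluation time.
    return max(fragment_times)

balanced = join_completion_time([10.0, 10.0, 10.0])  # 10.0
skewed = join_completion_time([10.0, 10.0, 25.0])    # 25.0: one slow node dominates
```

Even with two of three fragments unchanged, a single overloaded node drags the whole operator from 10.0 to 25.0, which is why matching fragment sizes to node capabilities matters.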
  9. Flux*
     - When load imbalance is detected:
       - Halt query execution.
       - Compute a new distribution policy (dp).
       - Update hash tables by transferring data between nodes.
       - Update dp in the parent exchange nodes.
       - Resume query execution.
     - Many variations of this technique exist and have been compared**
     [Figure: Scan(A) feeding Join(A1,B1) and Join(A2,B2) through a distribution policy dp; each join node holds its hash table A1 / A2]
     * M. Shah, J.M. Hellerstein, S. Chandrasekaran, M.J. Franklin: Flux: An Adaptive Partitioning Operator for Continuous Query Systems. ICDE 2003.
     ** N.W. Paton, V. Raman, G. Swart, I. Narang: Autonomic Query Parallelization using Non-dedicated Computers: An Evaluation of Adaptivity Options. Proc. 3rd IEEE Intl. Conf. on Autonomic Computing, 2006.
  10. Heuristics used by Flux
     - Units of adaptation: the table is divided into partitions, and each node can gain/lose at most one partition during an adaptation
     - Scale of adaptation: at most half the partitions can be moved in any one adaptation
     - Frequency of adaptation: once an adaptation takes place and takes time s, no further adaptation occurs until a further time s has elapsed
     - Timing of adaptation: heuristics determine when to transfer a partition from an over-utilized processor to an under-utilized one
  11. A brief review …
     - Our algorithm is based upon the concept of mathematical expectation.
     - If the probabilities of obtaining the amounts a1, a2, …, ak are p1, p2, …, pk, where p1 + p2 + … + pk = 1, then the mathematical expectation is:
       E = a1*p1 + a2*p2 + … + ak*pk
     - For example, if we win $10 when a die comes up 1 or 6, and lose $5 when it comes up 2, 3, 4 or 5, our mathematical expectation is:
       E = 10*(2/6) + (-5)*(4/6) = 0
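The definition can be checked in a few lines of Python; the die example is the one from the slide:

```python
def expectation(amounts, probs):
    # E = a1*p1 + a2*p2 + ... + ak*pk, with the probabilities summing to 1.
    assert abs(sum(probs) - 1.0) < 1e-9
    return sum(a * p for a, p in zip(amounts, probs))

# Win $10 when the die comes up 1 or 6, lose $5 when it comes up 2, 3, 4 or 5.
e = expectation([10, -5], [2/6, 4/6])  # 10*(2/6) + (-5)*(4/6) = 0
```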
  12. Moving from heuristics to evidence-based decision making
     - Define the expected cost (EC) of using a particular distribution policy dp
       - The EC of dp is the cost of processing the parallel query using dp, given that in the future we will see actual workloads w1, w2, … with probabilities p1, p2, …
       - EC(dp) = cost(dp,w1)*p1 + cost(dp,w2)*p2 + … + SwitchCost(current,dp)
       - In practice, we only consider two workloads and two distribution policies: the currently used dp and the "optimal" one obtained from the monitored workloads
     - cost(dp,w1) measures how much longer it would take dp to finish processing w1 than the optimal distribution policy. See the paper for details.
     - SwitchCost is not present if dp is the currently used distribution policy.
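As a sketch, the EC formula might be coded as follows; the toy cost function and SwitchCost value here are hypothetical stand-ins for the paper's cost model:

```python
def expected_cost(dp, workloads, probs, cost, switch_cost=0.0):
    # EC(dp) = cost(dp, w1)*p1 + cost(dp, w2)*p2 + ... + SwitchCost(current, dp).
    # Pass switch_cost=0.0 when dp is the currently used distribution policy.
    return sum(cost(dp, w) * p for w, p in zip(workloads, probs)) + switch_cost

# Toy cost model: penalty 5 whenever the policy does not match the workload.
cost = lambda dp, w: 0.0 if dp == w else 5.0
ec_stay = expected_cost("cur", ["cur", "pref"], [0.3, 0.7], cost)          # 3.5
ec_switch = expected_cost("pref", ["cur", "pref"], [0.3, 0.7], cost, 1.0)  # 2.5
```

With the "preferred" workload likely (p = 0.7), the expected cost of switching, even including a SwitchCost of 1.0, is lower than the expected cost of staying put.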
  13. Probabilistic Delta Algorithm

     Initialize current_dp            // initially distribute uniformly
     TimeToSwitch = False
     while (not TimeToSwitch)
         Process next portion of query
         Compute preferred_dp         // "ideal" distribution
         ecNoChange = EC_NoChange(current_dp, preferred_dp, count)   // does not include SwitchCost
         ecChange = EC_Change(preferred_dp, current_dp, count)       // includes SwitchCost
         if ecNoChange >= ecChange
             TimeToSwitch = True
     endwhile
     current_dp = preferred_dp
     Adapt to preferred_dp
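A runnable skeleton of the loop above; the workload stream, preferred-policy computation and EC functions are stubbed out as caller-supplied callables, so only the control flow follows the slide:

```python
def probabilistic_delta(next_portion, compute_preferred, ec_no_change, ec_change):
    current_dp = "uniform"   # initially distribute uniformly
    preferred_dp = current_dp
    count = 0
    time_to_switch = False
    while not time_to_switch:
        portion = next_portion()                   # process next portion of the query
        count += 1
        preferred_dp = compute_preferred(portion)  # "ideal" distribution
        # EC_Change includes SwitchCost; EC_NoChange does not.
        if ec_no_change(current_dp, preferred_dp, count) >= ec_change(preferred_dp, current_dp, count):
            time_to_switch = True
    return preferred_dp                            # caller adapts to preferred_dp

# Hypothetical stubs: the cost of staying grows with the portions seen,
# while switching has a flat expected cost of 3, so the switch fires at count 3.
portions = iter(range(1, 100))
chosen = probabilistic_delta(
    lambda: next(portions),
    lambda portion: f"dp{portion}",
    lambda cur, pref, n: n,   # EC_NoChange
    lambda pref, cur, n: 3,   # EC_Change (includes SwitchCost)
)
```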
  14. Computing probabilities of future workloads
     - Let n_c be the number of workloads in the window that are most similar to current_dp
     - Let n_p be the number of workloads in the window that are most similar to preferred_dp
     - Let n_w be the total number of time units in the window
     - prob(preferred_dp) = n_p / n_w and prob(current_dp) = n_c / n_w
     - Note: more sophisticated techniques can be used, e.g. weighting the workloads by their "proximity" to the current time
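A minimal sketch of this estimate, labelling each time unit in the window by the policy it most resembles (the labels are hypothetical):

```python
def window_probabilities(window, current_dp, preferred_dp):
    n_w = len(window)                                  # total time units in the window
    n_c = sum(1 for w in window if w == current_dp)    # most similar to current_dp
    n_p = sum(1 for w in window if w == preferred_dp)  # most similar to preferred_dp
    return n_c / n_w, n_p / n_w

# 4 of the last 10 time units resembled the current policy, 6 the preferred one.
p_cur, p_pref = window_probabilities(["cur"] * 4 + ["pref"] * 6, "cur", "pref")
```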
  15. Experiment Setup (Simulator)
     - Cost model parameters: drawn from micro-benchmarks
     - Database from the TPC-H benchmark
     - As the number of nodes grows, the data is assumed to be striped over the available machines
     - All machines are assumed to have the same capabilities, and to share the same network
     - Experiments use Q1: P ⋈ PS (P has 200,000 tuples; PS has 800,000 tuples)
     Same as in: N.W. Paton, J. Buenabad-Chávez, M. Chen, V. Raman, G. Swart, I. Narang, D.M. Yellin and A.A.A. Fernandez: Autonomic Query Parallelization using Non-dedicated Computers: An Evaluation of Adaptivity Options. The VLDB Journal.
  16. Experiments
     - Periodic imbalance: the load on one or more of the machines comes and goes during the experiment. The level, duration, and repeat duration of the external load are varied.
     - Poisson imbalance: the arrival rate of jobs follows a Poisson distribution in which the average number of jobs starting per second varies.
     - "Cyclic Poisson" imbalance: like a Poisson distribution, except the average workload is not constant but changes over time in a cyclic fashion (like a sine wave), mimicking more realistic workloads that change over time.
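The Poisson and cyclic Poisson workloads could be generated along these lines; the rates and cycle length used below are made-up parameters, not the paper's actual settings:

```python
import math
import random

def poisson_count(rate, rng):
    # Knuth's method: sample the number of job arrivals in one time unit.
    L, k, p = math.exp(-rate), 0, 1.0
    while True:
        p *= rng.random()
        if p <= L:
            return k
        k += 1

def poisson_jobs(rate, seconds, rng):
    """Jobs starting in each second, with a constant average rate."""
    return [poisson_count(rate, rng) for _ in range(seconds)]

def cyclic_poisson_jobs(base_rate, amplitude, cycle_s, seconds, rng):
    """Like poisson_jobs, but the average rate follows a sine wave over time."""
    return [
        poisson_count(max(0.0, base_rate + amplitude * math.sin(2 * math.pi * t / cycle_s)), rng)
        for t in range(seconds)
    ]
```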
  17. Periodic load imbalance
     [Figure: response times under periodic imbalance. Parallelism level = 3; a single node is affected; duration and repeat duration of the load spike = 1s; level of imbalance = average number of external jobs introduced]
  18. PD is more conservative in deciding to adapt
     [Figure: for the previous experiment with level of imbalance = 6; "current dp = 0" marks an adaptation taking place. Each node starts with 1/3 of the workload, but nodes 2 and 3 gain workload over time. PD adjusts only once to the periodic increased load on node 1: after that, the expected cost of adaptation is greater than the expected cost of sticking with the current distribution.]
  19. Poisson load imbalance
     [Figure: response times under Poisson imbalance. Parallelism level = 3; a single node is affected; duration of load spike = 1s]
  20. "Cyclic Poisson" load imbalance
     [Figure: response times under cyclic Poisson imbalance. Parallelism level = 3; duration of cycle = 5s; duration of load spike = 1s]
  21. Future work
     - Our approach is sensitive to window size. What is the best window size to use?
     - Investigate better techniques for computing the probability of future workloads
     - Use more than just two alternative distribution policies; e.g., can we infer a trend and use a "predicted distribution policy"?
     - Test the algorithm on a real system, not only with the simulator
  22. Conclusions
     - We investigated replacing heuristics with a more principled approach for determining when to adapt the system
     - Initial experiments showed that using the Probabilistic Delta (expected cost) algorithm to decide when to adapt usually improved on existing approaches, sometimes significantly
     - The gain of this approach comes from inhibiting specious adaptations while still encouraging necessary ones
  23. Backup slides
  24. Adaptive Parallel Queries
     A distribution policy describes how we partition work between processors.