Minimizing cost in distributed multiquery processing applications

A brief overview of several methods to optimize data movement in distributed multi-query processing applications.

  • If more than one source resides on a node, we can create one node per relation and link those nodes with zero-weight edges. Replication becomes part of query plan optimization: the query plan tree given to the algorithm should then be the minimum-weight tree (considering only the edge weights in the topology). The shortest paths between every pair of nodes can be precomputed (associativity can be exploited for joins of three relations), so we only need to pick the group of replicas whose mutual distance is smallest; this choice is made only at the leaves of the query plan.
  • Min-cut can be solved in polynomial time. Max-flow algorithms: Ford-Fulkerson runs in O(|E| * f), where f is the value of the maximum flow (integer capacities); Edmonds-Karp runs in O(|V| * |E|^2); Dinitz's algorithm runs in O(|V|^2 * |E|).

    1. 1. Minimizing Communication Cost in Distributed Multi-query Processing Jian Li, Amol Deshpande, Samir Khuller Department of Computer Science, University of Maryland Presented by: Luis Galárraga Saarland University July 7th, 2010
    2. 2. Outline <ul><li>Justification and related work
    3. 3. Problem formulation
    4. 4. Proposed methods and analysis </li><ul><li>Graph theory concepts
    5. 5. Tree topology
    6. 6. Arbitrary graph topologies
    7. 7. Experimental results </li></ul><li>Conclusions </li></ul>
    13. 13. Justification <ul><li>Emergence of large-scale distributed query processing in applications like: </li><ul><li>Wireless sensor networks
    14. 14. Publish-subscribe systems
    15. 15. Distributed stream processing applications </li></ul></ul>Common need: Minimize data movement!!
    16. 16. Justification <ul><li>Data transfer cost from one node to another may be heterogeneous. </li></ul>
    17. 17. Justification
    22. 22. Problem formulation <ul><li>Minimization of data movement for multiple queries.
    23. 23. Assumptions: </li><ul><li>Non-uniform communication cost model.
    24. 24. No restrictions on data size.
    25. 25. Query plans are part of the input.
    26. 26. Intermediate result sizes are known. </li></ul></ul>
    27. 27. More formally <ul><li>Input: Set of relations or data sources
    28. 28. Topology, undirected weighted graph
    29. 29. Assignment of relations to nodes in the topology. </li></ul>
    30. 30. More formally <ul><li>Set of queries
    31. 31. Each query comes with a plan in the form of a directed tree. </li></ul>[Figure: query plan tree with nodes S_i, S_j, S_i x S_j and w, annotated "data sources involved", "destination node" and "data size", with sizes z(S_i), z(S_j) and z(S_i x S_j).]
    32. 32. More formally Given the topology graph G_c and a set of trees representing the query plans, our goal is to find a data movement plan that minimizes the total communication cost incurred while executing the queries.
    33. 33. Problem formulation [Figure: topology G_c with nodes A to F holding the sources S_1 to S_6, next to three query plans: S_1 x S_2 x S_4 with destination C, S_2 x S_6 with destination D, and S_2 x S_5 with destination B. Numbers in parentheses are data sizes.]
    34. 34. Problem formulation <ul><li>If a block of data sized S is sent along an edge e with weight w(e) then the communication cost is S * w(e)
    35. 35. For simplicity, the examples assume w(e) = 1 for all edges. </li><ul><li>The algorithm itself handles arbitrary edge weights. </li></ul></ul>
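To make the cost model concrete, here is a minimal Python sketch that charges S * w(e) for every edge a block of data crosses. The function name `path_cost`, the node names, and the edge weights are illustrative assumptions, not part of the slides.

```python
def path_cost(size, path, weights):
    """Cost of shipping `size` units of data along `path` (a list of nodes).

    Each traversed edge e contributes size * w(e), as in the cost model above.
    `weights` maps an undirected edge, stored as a frozenset, to w(e).
    """
    total = 0
    for u, v in zip(path, path[1:]):
        total += size * weights[frozenset((u, v))]
    return total


# Hypothetical two-edge path A - B - C with weights 1 and 3.
weights = {frozenset(("A", "B")): 1, frozenset(("B", "C")): 3}
print(path_cost(10, ["A", "B", "C"], weights))  # 10*1 + 10*3 = 40
```

With w(e) = 1 everywhere, as in the slide examples, the cost reduces to the data size times the number of hops.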
    41. 41. Problem analysis <ul><li>The problem has been proved NP-hard </li><ul><li>Via reduction from the Steiner Tree problem </li></ul><li>Is all lost? </li><ul><li>If the topology graph is a tree, there is a polynomial-time algorithm.
    42. 42. For general topologies, approximation algorithms are known. </li></ul></ul>
    43. 43. Steiner Tree problem <ul><li>Given an undirected graph with edge weights and a subset S of its vertices (the terminals):
    44. 44. Find a tree of minimum weight that connects all vertices in S. </li><ul><li>It can contain vertices not in S, known as Steiner points. </li></ul></ul>
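Since the Steiner Tree problem is NP-hard, small experiments usually rely on an approximation. The sketch below is not taken from the slides: it uses the classical heuristic of building a minimum spanning tree over the terminals in the shortest-path metric, whose weight is at most twice the optimum. The names `dijkstra` and `steiner_upper_bound` and the example graph are assumptions made for illustration.

```python
import heapq


def dijkstra(adj, src):
    """Shortest-path distances from src; adj is {u: {v: weight}}."""
    dist = {src: 0}
    heap = [(0, src)]
    while heap:
        d, u = heapq.heappop(heap)
        if d > dist.get(u, float("inf")):
            continue  # stale heap entry
        for v, w in adj[u].items():
            nd = d + w
            if nd < dist.get(v, float("inf")):
                dist[v] = nd
                heapq.heappush(heap, (nd, v))
    return dist


def steiner_upper_bound(adj, terminals):
    """Weight of an MST over the terminals in the shortest-path metric.

    This is an upper bound on the optimal Steiner tree weight and is
    at most twice that optimum (classical 2-approximation).
    """
    terminals = list(terminals)
    dist = {t: dijkstra(adj, t) for t in terminals}
    in_tree = {terminals[0]}
    total = 0
    while len(in_tree) < len(terminals):
        # Prim's step: cheapest terminal reachable from the current tree.
        w, t = min((dist[a][b], b)
                   for a in in_tree for b in terminals if b not in in_tree)
        total += w
        in_tree.add(t)
    return total


# Star graph: center "c" is a Steiner point, "a", "b", "d" are terminals.
adj = {
    "c": {"a": 1, "b": 1, "d": 1},
    "a": {"c": 1},
    "b": {"c": 1},
    "d": {"c": 1},
}
print(steiner_upper_bound(adj, ["a", "b", "d"]))  # 4; the optimum (via c) is 3
```

The example shows why Steiner points matter: connecting the terminals through the non-terminal `c` costs 3, while the terminal-only metric MST costs 4.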
    45. 45. Steiner Tree problem [Figure: example weighted graph with terminals and Steiner points marked.]
    46. 46. The algorithm <ul><li>It requires solving a series of min-cut problems on appropriately constructed hypergraphs. </li></ul>Umm.. Hypergraphs?
    47. 47. Hypergraphs <ul><li>Generalization of a graph. </li><ul><li>In normal graphs, edges can be seen as pairs of vertices.
    48. 48. Hyperedges can group any number of vertices. </li></ul></ul>
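    49. 49. Max-flow/Min-cut <ul><li>Given a weighted, directed graph and two nodes s and t, known as source and sink:
    50. 50. Find a flow from s to t (an assignment of values to edges that respects the edge capacities and conserves flow at every other node) of maximum value. </li></ul>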
    51. 51. Max-flow/Min-cut [Figure: example flow network with edges labeled flow / capacity.]
    52. 52. Max-flow/Min-cut <ul><li>A min-cut is a set of edges with minimum weight such that if removed from the graph, there is no path from s to t.
    53. 53. The maximum value of an s-t flow is equal to the minimum capacity of an s-t cut. </li></ul>
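The max-flow/min-cut equality can be checked on small graphs with a short implementation. The following is a minimal sketch of the Edmonds-Karp algorithm (shortest augmenting paths found by BFS); the function name `max_flow`, the dictionary-based graph encoding, and the example network are assumptions for illustration.

```python
from collections import deque


def max_flow(capacity, s, t):
    """Edmonds-Karp. capacity is {u: {v: cap}}; returns the max s-t flow value."""
    # Residual graph: every arc gets a reverse arc with capacity 0.
    res = {}
    for u, nbrs in capacity.items():
        for v, c in nbrs.items():
            res.setdefault(u, {})
            res.setdefault(v, {})
            res[u][v] = res[u].get(v, 0) + c
            res[v].setdefault(u, 0)
    res.setdefault(s, {})
    res.setdefault(t, {})

    flow = 0
    while True:
        # BFS for a shortest augmenting path in the residual graph.
        parent = {s: None}
        queue = deque([s])
        while queue and t not in parent:
            u = queue.popleft()
            for v, c in res[u].items():
                if c > 0 and v not in parent:
                    parent[v] = u
                    queue.append(v)
        if t not in parent:
            return flow  # no augmenting path left: flow is maximum

        # Bottleneck capacity along the path found.
        bottleneck = float("inf")
        v = t
        while parent[v] is not None:
            bottleneck = min(bottleneck, res[parent[v]][v])
            v = parent[v]

        # Push the bottleneck along the path, updating reverse arcs.
        v = t
        while parent[v] is not None:
            u = parent[v]
            res[u][v] -= bottleneck
            res[v][u] += bottleneck
            v = u
        flow += bottleneck


# Hypothetical network: the cut {s->a, s->b} has capacity 3 + 2 = 5.
network = {
    "s": {"a": 3, "b": 2},
    "a": {"t": 2, "b": 1},
    "b": {"t": 3},
}
print(max_flow(network, "s", "t"))  # 5, equal to the min-cut capacity
```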
    54. 54. Max-flow/Min-cut [Figure: example flow network with edges labeled flow / capacity.]
    55. 55. Max-flow/Min-cut in hypergraphs <ul><li>Problem solvable in polynomial time.
    56. 56. What about max-flow in hypergraphs? </li><ul><li>For every hyperedge, add two new nodes and a directed edge between them of capacity equal to the weight of the hyperedge. </li></ul></ul>
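The hyperedge gadget described above can be sketched in a few lines of Python. The slide only mentions the two new nodes and the edge between them; the infinite-capacity arcs that attach the hyperedge's vertices to the gadget are the usual way to complete the construction for an undirected hyperedge and are an assumption here, as are the names `hypergraph_to_flow_network`, `("in", k)` and `("out", k)`.

```python
INF = float("inf")


def hypergraph_to_flow_network(hyperedges):
    """Turn weighted hyperedges into an ordinary capacitated digraph.

    hyperedges: list of (vertices, weight) pairs.
    Returns {u: {v: capacity}}.
    """
    cap = {}
    for k, (vertices, weight) in enumerate(hyperedges):
        e_in, e_out = ("in", k), ("out", k)
        # The only finite arc: cutting it costs the hyperedge weight.
        cap.setdefault(e_in, {})[e_out] = weight
        for v in vertices:
            # Assumed attachment arcs; infinite capacity so they are never cut.
            cap.setdefault(v, {})[e_in] = INF
            cap.setdefault(e_out, {})[v] = INF
    return cap


net = hypergraph_to_flow_network([({"a", "b", "c"}, 4)])
print(net[("in", 0)][("out", 0)])  # 4
```

Any finite s-t cut in the resulting network can only remove gadget edges, so its capacity equals the total weight of the hyperedges it separates.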
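    57. 57. Max-flow/Min-cut in hypergraphs [Figure: a hyperedge of weight w and its flow-network gadget.]
    58. 58. Max-flow/Min-cut in hypergraphs [Figure: a hyperedge of weight w and its flow-network gadget.]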
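    59. 59. <ul>Weighted hypergraph partition problem. </ul><ul><li>Instead of a source and a sink we have two disjoint subsets of nodes, L_s and L_t.
    60. 60. We want to find a partition (S, T) of V, with L_s in S and L_t in T, that minimizes the total weight of the hyperedges crossing the partition. </li></ul>
    61. 61. <ul>Weighted hypergraph partition problem. </ul><ul><li>Convert it to a min-cut problem by adding an artificial source s connected to L_s and an artificial sink t connected to L_t. </li></ul>[Figure: L_s attached to s, L_t attached to t.]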
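For tiny instances the partition problem can be solved by brute force, which is handy for cross-checking a min-cut based solver. The sketch below assumes the objective is the total weight of hyperedges with vertices on both sides, with the node sets L_s and L_t pinned to opposite sides; the function names and the example weights are illustrative, loosely modeled on a single-query plan.

```python
from itertools import product


def cut_weight(hyperedges, side_s):
    """Total weight of hyperedges with vertices on both sides of the partition."""
    return sum(
        w for verts, w in hyperedges
        if any(v in side_s for v in verts) and any(v not in side_s for v in verts)
    )


def best_partition(vertices, hyperedges, L_s, L_t):
    """Exhaustive search over all placements of the unlabeled vertices."""
    free = [v for v in vertices if v not in L_s and v not in L_t]
    best = None
    for choice in product([True, False], repeat=len(free)):
        side_s = set(L_s) | {v for v, in_s in zip(free, choice) if in_s}
        w = cut_weight(hyperedges, side_s)
        if best is None or w < best[0]:
            best = (w, side_s)
    return best


# Hypothetical instance: a, b, d are pinned to the source side, c to the sink side;
# x and y play the role of internal (join) nodes whose side must be decided.
vertices = ["a", "b", "c", "d", "x", "y"]
hyperedges = [
    ({"a", "x"}, 10),
    ({"b", "x"}, 10),
    ({"x", "y"}, 7),
    ({"c", "y"}, 100),
    ({"y", "d"}, 5),
]
weight, side_s = best_partition(vertices, hyperedges, {"a", "b", "d"}, {"c"})
print(weight, sorted(side_s))  # 12 ['a', 'b', 'd', 'x']
```

Here the cheapest choice places x on the source side and y on the sink side, cutting the hyperedges of weight 7 and 5 rather than the one of weight 100.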
    66. 66. Tree topology – Step 1 <ul><li>Build a directed weighted hypergraph H_D by combining the query plan trees of all the queries. </li><ul><li>Edges are oriented from children to parent. </li></ul><li>It explicitly captures all the opportunities for sharing the movement of data sources among the queries. </li></ul>
    67. 67. Tree topology – Step 2 <ul><li>For every edge in the topology graph decide which data sources and intermediate results move across that edge by solving an instance of the weighted hypergraph partition problem . </li></ul>
    68. 68. Tree topology – Step 1 – Single query <ul><li>Here H_D has the same structure as the query plan. </li></ul>[Figure: topology G_c (nodes A to F, sources S_1 to S_6) next to H_D for the query S_1 x S_2 x S_4 with destination C.]
    69. 69. Tree topology – Step 2 – Single query <ul><li>For every edge (u, v) in G_c, solve an instance of the weighted hypergraph partition problem.
    70. 70. Removing (u, v) divides the tree into two components, G_c^u and G_c^v </li><ul><li>Label each node in the hypergraph with u or v depending on which component it lies in. </li></ul></ul>
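The labeling step can be sketched as follows: remove the tree edge (u, v), find the two resulting components, and label each data source by the component that contains its home node. The tree shape, the placement of sources, and the name `component_labels` are hypothetical; the slides do not spell out the exact topology.

```python
def component_labels(tree_adj, u, v):
    """Label every topology node 'u' or 'v' after removing tree edge (u, v).

    tree_adj is {node: [neighbors]} and must describe a tree.
    """
    labels = {}
    for root, other in ((u, v), (v, u)):
        stack = [root]
        while stack:
            x = stack.pop()
            if x in labels:
                continue
            labels[x] = root
            for y in tree_adj[x]:
                # Never cross the removed edge (root, other).
                if not (x == root and y == other) and y not in labels:
                    stack.append(y)
    return labels


# Hypothetical tree topology with the edge C-D in the middle.
tree = {
    "A": ["C"], "B": ["C"],
    "C": ["A", "B", "D"],
    "D": ["C", "E", "F"],
    "E": ["D"], "F": ["D"],
}
labels = component_labels(tree, "C", "D")

# Hypothetical placement of sources on topology nodes.
home = {"S1": "A", "S2": "B", "S4": "E"}
source_labels = {s: labels[n] for s, n in home.items()}
print(source_labels)  # {'S1': 'C', 'S2': 'C', 'S4': 'D'}
```

The labeled sources, together with the labeled destination, form the two fixed node sets of the partition instance solved for that edge.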
    71. 71. Tree topology – Step 2 – Single query [Figure: the edge C-D splits the topology into G_c^u and G_c^v; the leaves and the destination of H_D are labeled C or D accordingly.]
    72. 72. Tree topology – Step 2 – Single query <ul><li>Now solve a weighted hypergraph partition problem where L_s is the set of nodes labeled C and L_t the set of nodes labeled D. </li></ul>[Figure: H_D with its labeled nodes.]
    73. 73. Tree topology – Step 2 – Single query <ul><li>The solution induces a labeling of the internal nodes and hence the data transfers across the edge (S_1 x S_2 from C to D, S_1 x S_2 x S_4 from D to C). </li></ul>[Figure: H_D with the resulting partition.]
    74. 74. Tree topology – Step 2 – Multiple queries <ul><li>Sources appearing more than once generate hyperedges with weight equal to the source data size. </li></ul>[Figure: the three query plans (destinations C, D and B), each using S_2.]
    75. 75. Tree topology – Step 2 – Multiple queries [Figure: the combined hypergraph H_D, in which S_2 appears once and is shared by the three plans.]
    76. 76. Tree topology – Step 2 – Multiple queries <ul><li>Consider again edge C-D in the topology. </li></ul>[Figure: H_D for the three queries.]
    77. 77. Tree topology – Step 3 <ul><li>Combine the local solutions for all the edges of the topology into a single global data movement plan. </li><ul><li>Local solutions may not agree on where the internal nodes should be evaluated. </li></ul></ul>
    78. 78. Tree topology – Step 3 <ul><li>For every internal node i in H_D, construct a directed graph J_i with:
    79. 79. Then add the edges according to this rule (for every edge in G_c): </li></ul>
    80. 80. Tree topology – Step 3 <ul><li>J_i encodes where the different local plans place the evaluation of node i, and defines a path. </li></ul>[Figure: J for S_1 x S_2 x S_4 over the topology nodes A to F, with one node marked "Evaluate here!!".]
    85. 85. Arbitrary graph topologies <ul><li>For a single query, a dynamic programming approach finds an optimal solution in O(n^2 m + n^3) time, where: </li><ul><li>m = number of nodes in the query tree and n = number of nodes in the topology graph. </li></ul><li>For multiple queries, an O(log n) approximation is achieved through tree metrics. </li><ul><li>Suitable for small n! </li></ul></ul>
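The slides do not give the dynamic program itself. The sketch below is one formulation consistent with the stated O(n^2 m + n^3) bound, under the assumptions that every result travels along shortest paths and that each plan node is evaluated at a single topology node: Floyd-Warshall supplies all-pairs distances in O(n^3), and each of the m plan nodes is then processed in O(n^2). All names and the example instance are hypothetical.

```python
def all_pairs_shortest_paths(nodes, edges):
    """Floyd-Warshall on an undirected weighted graph; edges is {(a, b): w}."""
    inf = float("inf")
    d = {a: {b: (0 if a == b else inf) for b in nodes} for a in nodes}
    for (a, b), w in edges.items():
        d[a][b] = min(d[a][b], w)
        d[b][a] = min(d[b][a], w)
    for k in nodes:
        for i in nodes:
            for j in nodes:
                if d[i][k] + d[k][j] < d[i][j]:
                    d[i][j] = d[i][k] + d[k][j]
    return d


def min_query_cost(nodes, d, plan, root, dest):
    """Cheapest way to deliver the result of plan node `root` to `dest`.

    plan[i] = {"size": z, "children": [...], "home": node or None};
    leaves are data sources with a home node, internal nodes are joins.
    """
    def avail(i):
        # avail(i)[v]: minimum cost to have the result of plan node i at v.
        info = plan[i]
        if not info["children"]:
            return {v: info["size"] * d[info["home"]][v] for v in nodes}
        kids = [avail(c) for c in info["children"]]
        # Cost of gathering all inputs at u, where the join is evaluated.
        evaluate = {u: sum(k[u] for k in kids) for u in nodes}
        # Evaluate at the best u, then ship the result from u to v.
        return {v: min(evaluate[u] + info["size"] * d[u][v] for u in nodes)
                for v in nodes}

    return avail(root)[dest]


# Hypothetical path topology A - B - C with unit weights.
nodes = ["A", "B", "C"]
d = all_pairs_shortest_paths(nodes, {("A", "B"): 1, ("B", "C"): 1})
plan = {
    "S1": {"size": 10, "children": [], "home": "A"},
    "S2": {"size": 3, "children": [], "home": "C"},
    "J": {"size": 1, "children": ["S1", "S2"], "home": None},
}
print(min_query_cost(nodes, d, plan, "J", "C"))  # 8: join at A, ship result to C
```

In the example, moving the small source S2 to A and shipping the one-unit join result back costs 6 + 2 = 8, which beats moving the large source S1 to C at cost 20.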
    86. 86. Arbitrary graph topologies – The pairs problem <ul><li>Restrictions on the query structure.
    87. 87. Pairs problem: </li><ul><li>Assume each query is defined by a pair of nodes whose data should meet somewhere in the network.
    88. 88. A query-overlap graph H is a graph where the vertices correspond to the set of data items and each edge corresponds to a pair query. </li></ul></ul>
    89. 89. The pairs problem
    90. 90. The pairs problem <ul><li>Approximation algorithm when H is a star.
    91. 91. For same size data sources: </li><ul><li>Reducible to the minimum Steiner Tree problem
    92. 92. Otherwise, reducible to the Connected Facility Location problem </li></ul></ul>Isn't this problem NP-hard? Yes, but good approximation algorithms do exist!!
    93. 93. <ul><ul><li>Connected Facility Location Problem </li></ul></ul><ul><li>Input:
    94. 94. Output: </li></ul>Bought edges (Steiner cost) Rented edges (Connection cost)
    95. 95. Connected Facility Location Problem Demands, D Facilities, F Rented edges
    96. 96. The pairs problem <ul><li>Steiner Tree and SROB (single-sink rent-or-buy) have approximation algorithms with ratio ρ of 1.55 and 2.92, respectively.
    97. 97. The final approximation ratio depends on the topology of the query-overlap graph: </li><ul><li>If the star arboricity SN(H) of H can be computed in polynomial time, then there is a ρ*SN(H)-approximation. </li></ul></ul>Star arboricity: the minimum number of star forests into which the edges of the graph can be partitioned. Why?
    103. 103. Experimental results <ul><li>Topology cases: tree and arbitrary graph
    104. 104. Multi-query approach compared with isolated optimization per query.
    105. 105. Normalized with the cost of the naïve approach.
    106. 106. Identical data size vs tri-modal distribution
    107. 107. Random query overlap. </li></ul>
    108. 108. Tree topology
    109. 109. Arbitrary topology – Pairs problem
    115. 115. Conclusions <ul><li>Communication cost can be the main bottleneck in certain applications.
    116. 116. Optimizing data movement is a hard problem. </li><ul><li>Relies on assumptions about query structure and topology.
    117. 117. For certain cases efficient algorithms are available. </li></ul><li>It has to be complemented with other optimization schemes. </li></ul>
    118. 118. Thank you!!
