Minimizing Communication Cost in Distributed Multi-query Processing Jian Li, Amol Deshpande, Samir Khuller Department of Computer Science, University of Maryland Presented by: Luis Galárraga Saarland University July 7th, 2010
Outline Justification and related work
Problem formulation
Proposed methods and analysis Graph theory concepts
Tree topology
Arbitrary graph topologies
Experimental results Conclusions
Justification Emergence of large-scale distributed query processing in applications like: Wireless sensor networks
Publish-subscribe systems
Distributed stream processing applications Common need: minimize data movement.
Justification The cost of transferring data from one node to another may be heterogeneous.
Problem formulation Minimization of data movement for multiple queries.
Assumptions: Non-uniform communication cost model.
No restrictions on data size.
Query plans are part of the input.
Intermediate results sizes are known.
More formally Input: Set of relations or data sources
Topology, undirected weighted graph
Assignment of relations to nodes in the topology.
More formally Set of queries
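Each query comes with a plan in the form of a directed tree. [Figure: query plan tree. The leaves S_i and S_j are the data sources involved, with data sizes z(S_i) and z(S_j); their join S_i ⋈ S_j has size z(S_i ⋈ S_j) and is delivered to the destination node w.]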
More formally Given the topology graph G_c and a set of trees representing the query plans, the goal is to find a data movement plan that minimizes the total communication cost incurred while executing the queries.
Problem formulation [Figure: example topology G_c with nodes A to F holding the relations S_1 to S_6, next to three query plans: (S_1 ⋈ S_2) ⋈ S_4, S_2 ⋈ S_6 and S_2 ⋈ S_5, with destination nodes labeled C, D and B. Numbers in parentheses are data sizes.]
Problem formulation If a block of data of size S is sent along an edge e with weight w(e), the communication cost is S * w(e).
For simplicity, the examples assume w(e) = 1 for all edges, but the algorithm handles arbitrary edge weights.
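The cost model above can be illustrated with a minimal Python sketch. The function name `path_cost`, the toy topology and its weights are hypothetical and not taken from the slides; the sketch only assumes that shipping data over several edges costs the sum of the per-edge costs S * w(e).

```python
def path_cost(size, path, weights):
    """Cost of shipping `size` units of data along the consecutive nodes of `path`."""
    total = 0
    for u, v in zip(path, path[1:]):
        # The topology is undirected, so each edge is keyed by a frozenset.
        total += size * weights[frozenset((u, v))]
    return total


# Hypothetical toy topology: a path A - B - C with unequal edge weights.
weights = {frozenset(("A", "B")): 2, frozenset(("B", "C")): 3}
cost = path_cost(10, ["A", "B", "C"], weights)  # 10 * 2 + 10 * 3
```

With unit weights, as in the slide examples, the cost reduces to the data size times the number of hops.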
Problem analysis The problem has been proved NP-hard via a reduction from the Steiner Tree problem. Is all lost? No: if the topology graph is a tree, there is a polynomial-time algorithm.
For general topologies, approximation algorithms are known.
Steiner Tree problem Given an undirected weighted graph G = (V, E) and a subset of vertices S (the terminals):
Find a tree of minimum weight that connects all vertices in S. The tree may contain vertices not in S, known as Steiner points.
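To make the definition concrete, here is a brute-force Python sketch. It is not the method from the paper: it simply tries every subset of non-terminals as Steiner points and keeps the cheapest spanning tree, which takes exponential time and is only usable on tiny graphs. The function names and the example graph are hypothetical.

```python
from itertools import combinations


def mst_weight(nodes, edges):
    """Kruskal's algorithm on the subgraph induced by `nodes`.

    `edges` is a list of (weight, u, v). Returns None if the induced
    subgraph is not connected.
    """
    parent = {v: v for v in nodes}

    def find(v):
        while parent[v] != v:
            parent[v] = parent[parent[v]]  # path halving
            v = parent[v]
        return v

    total, used = 0, 0
    for w, u, v in sorted(edges):
        if u in parent and v in parent:
            ru, rv = find(u), find(v)
            if ru != rv:
                parent[ru] = rv
                total += w
                used += 1
    return total if used == len(nodes) - 1 else None


def steiner_tree_weight(vertices, edges, terminals):
    """Minimum Steiner tree weight by exhaustive search over Steiner points."""
    others = [v for v in vertices if v not in terminals]
    best = None
    for k in range(len(others) + 1):
        for extra in combinations(others, k):
            w = mst_weight(set(terminals) | set(extra), edges)
            if w is not None and (best is None or w < best):
                best = w
    return best


# Hypothetical example: terminals a, b, c around a cheap hub s.
vertices = ["a", "b", "c", "s"]
edges = [(1, "a", "s"), (1, "b", "s"), (1, "c", "s"), (3, "a", "b"), (3, "b", "c")]
best = steiner_tree_weight(vertices, edges, {"a", "b", "c"})
```

In this example the cheapest tree uses s as a Steiner point (weight 3), while connecting the terminals directly would cost 6.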
Steiner Tree problem [Figure: example weighted graph with a Steiner tree; legend: terminals, Steiner points.]
The algorithm It involves solving a series of min-cut problems on appropriately constructed hypergraphs. But what are hypergraphs?
Hypergraphs A generalization of graphs. In an ordinary graph, each edge is a pair of vertices.
A hyperedge can group any number of vertices.
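Max-flow/Min-cut Given a directed graph with edge capacities and two nodes s and t, known as source and sink:
Find a flow of maximum value, i.e. an assignment of non-negative values to edges that respects the edge capacities and conserves flow at every node other than s and t.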
Max-flow/Min-cut [Figure: example flow network; edge labels are flow / capacity.]
Max-flow/Min-cut An s-t min-cut is a set of edges of minimum total capacity whose removal leaves no path from s to t.
The maximum value of an s-t flow is equal to the minimum capacity of an s-t cut.
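The max-flow/min-cut equality can be checked on a small network with the following Python sketch. It uses the Edmonds-Karp algorithm (shortest augmenting paths found by breadth-first search); the function name and the example capacities are hypothetical and are not the network shown in the slide figures.

```python
from collections import deque


def max_flow_min_cut(capacity, s, t):
    """Edmonds-Karp. `capacity` maps a directed edge (u, v) to its capacity.

    Returns the maximum flow value and the edges of a minimum s-t cut.
    """
    residual = dict(capacity)
    adj = {}
    for u, v in capacity:
        residual.setdefault((v, u), 0)  # reverse edge, used to undo flow
        adj.setdefault(u, set()).add(v)
        adj.setdefault(v, set()).add(u)

    def bfs():
        # Parent pointers of every node reachable from s in the residual graph.
        parent = {s: None}
        queue = deque([s])
        while queue:
            u = queue.popleft()
            for v in adj.get(u, ()):
                if v not in parent and residual[(u, v)] > 0:
                    parent[v] = u
                    queue.append(v)
        return parent

    flow = 0
    while True:
        parent = bfs()
        if t not in parent:
            break
        # Recover the augmenting path by walking back from t to s.
        path = []
        v = t
        while parent[v] is not None:
            path.append((parent[v], v))
            v = parent[v]
        bottleneck = min(residual[e] for e in path)
        for u, v in path:
            residual[(u, v)] -= bottleneck
            residual[(v, u)] += bottleneck
        flow += bottleneck

    # Nodes still reachable from s form the source side of a minimum cut.
    reachable = set(parent)
    cut = [(u, v) for (u, v) in capacity if u in reachable and v not in reachable]
    return flow, cut


# Hypothetical example network.
capacity = {("s", "a"): 3, ("s", "b"): 2, ("a", "b"): 1, ("a", "t"): 2, ("b", "t"): 3}
flow, cut = max_flow_min_cut(capacity, "s", "t")
cut_capacity = sum(capacity[e] for e in cut)
```

On any input, the total capacity of the returned cut equals the returned flow value, as the theorem states.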
Max-flow/Min-cut [Figure: the example flow network again; edge labels are flow / capacity.]
Max-flow/Min-cut in hypergraphs The min-cut problem in hypergraphs is also solvable in polynomial time.
How is max-flow computed in a hypergraph? Reduce it to an ordinary directed graph: for every hyperedge, add two new nodes and a directed edge between them with capacity equal to the weight of the hyperedge.
Max-flow/Min-cut in hypergraphs [Figure: a hyperedge of weight w replaced by two new nodes joined by a directed edge of capacity w.]
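The reduction from hypergraphs to ordinary directed graphs can be sketched in Python as follows. The slide only mentions the two new nodes and the edge of capacity w between them; the sketch additionally assumes the usual way of wiring them up, namely infinite-capacity edges from every member vertex into the first new node and from the second new node back to every member vertex. The function name and the example hypergraph are hypothetical.

```python
import math


def hypergraph_to_digraph(hyperedges):
    """Build the capacities of an ordinary directed graph from a hypergraph.

    `hyperedges` maps a hyperedge name to (set_of_vertices, weight).
    Returns a dict mapping directed edges (u, v) to capacities.
    """
    capacity = {}
    for name, (members, weight) in hyperedges.items():
        e_in, e_out = (name, "in"), (name, "out")  # the two new nodes
        capacity[(e_in, e_out)] = weight           # only this edge can be cut
        for v in members:
            # Assumed wiring: uncuttable edges into and out of the new nodes.
            capacity[(v, e_in)] = math.inf
            capacity[(e_out, v)] = math.inf
    return capacity


# Hypothetical hypergraph with two hyperedges.
hyperedges = {
    "h1": ({"s", "a", "b"}, 4),
    "h2": ({"a", "b", "t"}, 2),
}
digraph = hypergraph_to_digraph(hyperedges)
```

Because only the edges between the new node pairs have finite capacity, a minimum s-t cut of the resulting digraph, found by any standard max-flow routine, corresponds to a set of hyperedges separating s from t.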

Editor's Notes

  • #10 If more than one source resides on a node, we can create one node per relation and link these nodes with zero-weight edges. Replication becomes part of query plan optimization: the query plan tree given to the algorithm should then be the minimum-weight tree (considering only the edge weights of the topology). The shortest paths between every pair of nodes can be precomputed (associativity could be used for joins of three relations); we then only need to pick the group of replicas whose distance is smallest. This choice is made only at the leaves of the query plan.
  • #23 Min-cut can be solved in polynomial time. Max-flow algorithms: Ford-Fulkerson runs in O(|E| * f), where f is the maximum flow value (integer capacities); Edmonds-Karp runs in O(|V| * |E|^2); Dinitz's algorithm runs in O(|V|^2 * |E|).