Cut to Fit: Tailoring the Partitioning to the Computation
Iacovos G. Kolokasis & Polyvios Pratikakis
Institute of Computer Science (ICS)
Foundation for Research and Technology – Hellas (FORTH) &
Computer Science Department, University of Crete
30 June 2019
Graph Analytics Computation Dependencies
1. Various graph datasets with different properties
• Power-law graphs (e.g. social networks)
• Grid graphs (e.g. road networks)
2. Various graph algorithms with different computation effort
• Not all algorithms perform a fixed amount of operations per edge (e.g. BFS, Connected Components)
• Many algorithms make passes over the vertices in addition to passes over the edges
3. Various partitioning strategies
• Distributed graph computing frameworks operate based on graph partitioning
kolokasis@ics.forth.gr 2 of 26
Impact of Graph Partitioning
• Data partitioning can have a significant impact on the performance of the graph computation:
• Network traffic
• Memory occupancy
• Load balance
Challenges
• There is no single optimal partitioner for all problems
• Complex partitioners result in increased partitioning time
Our goal is to study these two problems by:
• Characterizing partitioning strategies using a wide set of metrics
• Quantifying the correlation of partitioning metrics with computation performance
Spark Cluster Configuration
Instance      Count  Total Cores  Total Memory  Exec./Worker
Master        1      32           256GB         -
Workers       4      32           256GB         6
Per Executor  -      5            29GB          -
• Nodes are connected via a 40 Gb/s network
• We use 240 and 480 total partitions
• We restart Spark between runs
Graph Partitioners: Random Vertex Cut
Assigns edges to partitions by hashing together the source and destination vertex IDs, resulting in a random vertex cut.
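The strategy above can be sketched in a few lines; a minimal Python sketch, assuming integer vertex IDs (GraphX itself hashes the (src, dst) pair via Scala's tuple hashCode, so the mixing function here is illustrative):

```python
def random_vertex_cut(src: int, dst: int, num_parts: int) -> int:
    """Place an edge by hashing both endpoint IDs together.

    Mixing src and dst means neither endpoint's edges cluster in
    one partition, so the resulting vertex cut is effectively random.
    """
    # Simple multiplicative mix standing in for Scala's tuple hashCode.
    return ((src * 1_000_003) ^ dst) % num_parts
```

Because both IDs feed the hash, edge placement is deterministic but uncorrelated with either endpoint.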
Graph Partitioners: Edge Partition 1D
Assigns edges to partitions by hashing the source vertex ID. This causes all edges with the same source vertex to be collocated in the same partition.
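A minimal sketch of this 1D strategy, using the same large mixing prime that GraphX's `EdgePartition1D` uses to scatter consecutive vertex IDs:

```python
def edge_partition_1d(src: int, dst: int, num_parts: int) -> int:
    """Place an edge using only its source vertex ID.

    All out-edges of a vertex land in one partition, which helps
    per-source aggregation but can overload the partition that
    holds the out-edges of a high-degree (hub) vertex.
    """
    # Large prime scatters consecutive IDs across partitions.
    MIXING_PRIME = 1125899906842597
    return (src * MIXING_PRIME) % num_parts
```

Note that the destination ID is ignored entirely, so the partition depends only on the source.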
Graph Partitioners: Edge Partition 2D
Arranges all partitions into a square matrix and picks the column on the basis of the source vertex's hash and the row on the basis of the destination vertex's hash.
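A sketch of the 2D grid placement, following the structure of GraphX's `EdgePartition2D` (column from the source hash, row from the destination hash); this bounds each vertex's replication to roughly 2·√P copies:

```python
import math

def edge_partition_2d(src: int, dst: int, num_parts: int) -> int:
    """Place an edge on a ceil(sqrt(P)) x ceil(sqrt(P)) grid.

    A vertex's edges can only fall in one row plus one column of
    the grid, which bounds its replication by about 2 * sqrt(P).
    """
    side = math.ceil(math.sqrt(num_parts))
    MIXING_PRIME = 1125899906842597
    col = (src * MIXING_PRIME) % side   # column from source hash
    row = (dst * MIXING_PRIME) % side   # row from destination hash
    return (col * side + row) % num_parts
```

For example, with 240 partitions the grid side is 16, so all edges sharing a source can touch at most 16 partitions.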
Graph Partitioners: Canonical Random Vertex Cut
Assigns edges to partitions by hashing the source and destination vertex IDs in a canonical direction, resulting in a random vertex cut that collocates all edges between two vertices, regardless of direction.
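The canonical variant can be sketched by ordering the endpoint pair before hashing, so (u, v) and (v, u) always resolve to the same partition (the mixing function is again illustrative):

```python
def canonical_random_vertex_cut(src: int, dst: int, num_parts: int) -> int:
    """Hash the endpoints in a canonical (sorted) order.

    Sorting the pair first means both directions of an edge
    between two vertices land in the same partition.
    """
    lo, hi = (src, dst) if src < dst else (dst, src)
    return ((lo * 1_000_003) ^ hi) % num_parts
```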
Graph Partitioners: Source Cut
Assigns edges to partitions by taking the source vertex ID modulo the total number of partitions. We expect that any correlation between vertex IDs and locality is preserved.
Graph Partitioners: Destination Cut
Assigns edges to partitions by taking the destination vertex ID modulo the total number of partitions. We assume that vertex IDs may capture a metric of locality.
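Both modulo-based cuts reduce to one-line functions; unlike the hashed strategies, they deliberately avoid scrambling the IDs so that any locality encoded in consecutive IDs survives:

```python
def source_cut(src: int, dst: int, num_parts: int) -> int:
    # No hashing: consecutive source IDs go to consecutive
    # partitions, preserving any ID-based locality.
    return src % num_parts

def destination_cut(src: int, dst: int, num_parts: int) -> int:
    # Same idea, keyed on the destination vertex instead.
    return dst % num_parts
```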
Graph Partitioners: Hybrid Cut
Places edges into partitions using the Destination Cut strategy when the destination is a hub (high-degree vertex), or the Source Cut strategy when it is not.
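A minimal sketch of this degree-aware rule; the `degree` lookup and the hub `threshold` are assumed inputs (the slide does not state how hubs are detected):

```python
def hybrid_cut(src: int, dst: int, num_parts: int,
               degree: dict, threshold: int = 100) -> int:
    """Degree-aware cut: treat high-degree destinations specially.

    `degree` maps vertex ID -> degree; `threshold` is an assumed
    tuning parameter for deciding what counts as a hub.
    """
    if degree[dst] > threshold:
        return dst % num_parts   # destination is a hub: Destination Cut
    return src % num_parts       # otherwise: Source Cut
```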
Graph Partitioners: Hybrid Cut 2D
Distributes edges using the Edge Partition 2D strategy when the source and destination vertices are both hubs or both non-hubs; if only one of them is a hub, the algorithm places the edge near the non-hub vertex.
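The hub-aware 2D variant can be sketched as follows; again the `degree` map and `threshold` are assumed inputs, and the grid placement stands in for the full Edge Partition 2D logic:

```python
import math

def hybrid_cut_2d(src: int, dst: int, num_parts: int,
                  degree: dict, threshold: int = 100) -> int:
    """2D grid placement when both endpoints are alike (both hubs
    or both non-hubs); otherwise keep the edge with the non-hub."""
    src_hub = degree[src] > threshold
    dst_hub = degree[dst] > threshold
    if src_hub == dst_hub:
        # Both alike: fall back to the 2D grid strategy.
        side = math.ceil(math.sqrt(num_parts))
        return ((src % side) * side + dst % side) % num_parts
    # Exactly one hub: place the edge near the non-hub vertex.
    return (dst if src_hub else src) % num_parts
```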
Partition Metrics: Balance
The ratio of the number of edges in the biggest partition over the average number of edges per partition.
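This max-over-mean ratio is straightforward to compute; a value of 1.0 means a perfectly balanced edge partitioning:

```python
def balance(edges_per_partition: list) -> float:
    """Edges in the largest partition divided by the average
    edges per partition; 1.0 means perfectly balanced."""
    avg = sum(edges_per_partition) / len(edges_per_partition)
    return max(edges_per_partition) / avg
```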
Partition Metrics: Normalized Standard Deviation
The normalized standard deviation of the number of edges per partition; an alternative measure of imbalance in the edge partitioning.
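Normalizing by the mean makes the measure comparable across graphs of different sizes; a sketch using the standard library:

```python
import statistics

def normalized_stdev(edges_per_partition: list) -> float:
    """Population standard deviation of edges per partition,
    divided by the mean; 0.0 means perfectly balanced."""
    mean = statistics.mean(edges_per_partition)
    return statistics.pstdev(edges_per_partition) / mean
```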
Partition Metrics: Replication Factor (RF)
The ratio of the total number of vertices across all partitions, including replicated vertices, over the total number of vertices of the original graph.
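Given the vertex set of each partition, the replication factor is total vertex slots divided by distinct vertices; RF = 1 would mean no vertex is replicated at all:

```python
def replication_factor(partitions: list) -> float:
    """`partitions` is a list of per-partition vertex sets.

    RF = total vertex copies across partitions / distinct vertices.
    """
    total_copies = sum(len(p) for p in partitions)
    distinct = len(set().union(*partitions))
    return total_copies / distinct
```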
Partition Metrics: Cut Vertices (CV)
The number of vertices that exist in more than one partition, irrespective of how many copies of each cut vertex there are; these are the unique vertices copied across partitions.
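Counting cut vertices only requires knowing how many partitions each vertex appears in; note that a vertex with five copies still counts once:

```python
from collections import Counter

def cut_vertices(partitions: list) -> int:
    """Count distinct vertices appearing in 2+ partitions."""
    counts = Counter(v for p in partitions for v in p)
    return sum(1 for c in counts.values() if c > 1)
```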
Partition Metrics: Vertex Copies
The total number of copies of vertices that exist in more than one partition. This indicates the number of messages that need to be exchanged on every superstep.
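Unlike the CV count above, this metric sums all copies of each replicated vertex, which is why it tracks per-superstep communication:

```python
from collections import Counter

def vertex_copies(partitions: list) -> int:
    """Total copies held by vertices that live in 2+ partitions;
    a proxy for the messages exchanged on every superstep."""
    counts = Counter(v for p in partitions for v in p)
    return sum(c for c in counts.values() if c > 1)
```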
Characterization of Partition Metrics
• Almost all partitions produced by the partitioners are quite balanced
• The exception is web-wikipedia-link-fr, where Destination Cut (DC) produced unbalanced partitions
Characterization of Partition Metrics
• Power-law graphs result in a higher replication factor (RF)
• A low number of cut vertices (CV) usually implies a low RF
Which Metrics Can Predict Performance?
• RF correlates well with PageRank (PR) performance, except on the web-wikipedia-link-fr dataset
• RF is not correlated with Triangle Counting (TC)
Which Metrics Can Predict Performance?
• CV correlates well with Connected Components (CC) performance, except on the road-road-usa dataset
• CV is not a reliable predictor of TC performance
Dynamic Partitioner Selection
Hypothesis
Select a partitioner dynamically based on the properties of the data (e.g., size of the graph, granularity of the partitioning)
Testing
We implemented a very simple dynamic partitioner that selects between partitioning algorithms based on the granularity of the partitioning
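Such a selector can be sketched as a simple rule over the partitioning granularity; the threshold and the chosen strategies below are illustrative assumptions, not the authors' actual selection policy:

```python
def pick_partitioner(num_parts: int, num_edges: int) -> str:
    """Toy dynamic selector keyed on partitioning granularity.

    The 1M-edges-per-partition threshold and the two candidate
    strategies are assumptions for illustration only.
    """
    edges_per_part = num_edges / num_parts
    if edges_per_part < 1_000_000:
        # Fine granularity: many small partitions, so bound
        # vertex replication with the 2D grid strategy.
        return "EdgePartition2D"
    # Coarse granularity: a cheap random cut suffices.
    return "RandomVertexCut"
```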
Conclusions
• The efficiency of distributed graph analytics frameworks is highly dependent on the partitioning strategy used
• There is no single optimal partitioner for all problems
• There is no simple way to predict the performance of the computation
• Dynamic partitioners can achieve better results than static partitioners across different sets of datasets and configurations