Modeling Orderings in Asynchronous Distributed-Memory
Parallel Graph Algorithms
Abstract Graph Machine
Final Dissertation Defense
Thejaka Amila Kanewala
November 9, 2018
Agenda
• Preliminaries
• Motivation
• Abstract Graph Machine (AGM)
• Extended Abstract Graph Machine (EAGM)
• Engineering an AGM
• Engineering an EAGM
• Results
• Final Remarks
Abstract Graph Machine
Graphs
• A graph is an ordered pair G = (V, E), where V is a set of vertices and E is a set of edges.
Our Focus.
Graph Applications (Graph Problems)
• Single Source Shortest Paths
• Breadth First Search
• Connected Components
• Maximal Independent Set
• Minimum Spanning Tree
• Graph Coloring
• And many more …
Multiple algorithms to solve a
graph problem.
E.g.,
- Dijkstra’s SSSP
- Bellman-Ford
- Etc.
Why Is Performance Challenging in Parallel Graph Algorithms?
• Low compute/communication ratio (caused by irregularity)
• Synchronization overhead
• Higher amounts of work
• Many dependencies
- Input graph
- The algorithm logic
- The underlying runtime
- Etc.
Regular vs. Irregular
Communication only takes place at the data boundary.
Communication takes place everywhere.
Low communication/computation ratio.
High communication/computation ratio.
Synchronization Overhead
• Many existing parallel graph algorithms are designed with shared-memory parallel architectures in mind.
• When we "extend" these algorithms for distributed-memory execution, they end up with many synchronization phases.
→ Synchronization overhead is significant when processing a large graph on many distributed nodes.
Shared-Memory Parallel Graph Algorithms
• Shared-memory parallel algorithms can be described abstractly using the Parallel Random Access Machine (PRAM) model.
Independent processors, each with its own memory (e.g., registers)
Shared memory
Shared-Memory Parallel Graph Algorithms
.
.
.
.
.
Consists of
multiple phases
like these
Synchronize
Synchronize
These algorithms can be naturally extended to distributed-memory using
Bulk Synchronous Parallel (BSP).
Shared-Memory to Distributed-Memory
.
.
.
.
.
.
.
.
.
. Global Barrier
Global Barrier
Executes many global barriers. Synchronization overhead is
significant, especially when executing the program on many nodes.
Asynchronous Graph Algorithms
Level synchronous
BFS, using “S” as
the source.
Asynchronous Graph Algorithms
Level synchronous
BFS, using “S” as
the source.
Synchronize
Asynchronous Graph Algorithms
Level synchronous
BFS, using “S” as
the source.
Synchronize
Once a label is set, no corrections → no additional work.
Asynchronous Graph Algorithms
Asynchronous
BFS, using “S” as
the source.
Asynchronous Graph Algorithms
Asynchronous
BFS, using “S” as
the source. Label correction
Need to correct
these as well.
Generates a lot of redundant work.
Asynchronous Graph Algorithms
• Asynchronous graph algorithms are better in the sense that they avoid synchronization overhead.
• BUT they tend to generate a lot of redundant work, leading to high runtimes.
How much Synchronization?
Low
diameter
Have enough
parallel work in
each level
Fewer barriers
(diameter ~ barriers)
E.g., Twitter, ER.
Level synchronous execution is
better.
.
.
.
High
diameter
Not enough
parallel work in a
level
E.g., Road NW.
Asynchronous execution
is better.
We need a way to control the amount of synchronization a graph algorithm needs → the Abstract Graph Machine (AGM) is a model that can control the level of synchronization of a graph algorithm using an ordering.
SSSP as an Example
• In an undirected weighted graph G = (V, E), the Single Source Shortest Paths problem is to find a path from the source to every vertex such that the sum of edge weights along that path is minimized.
Algorithms
void Chaotic(Graph g, Vertex source) {
  for each Vertex v in g {
    state[v] <- INFINITY;
  }
  Relax(source, 0);
}
void Relax(Vertex v, Distance d) {
  if (d < state[v]) {
    state[v] <- d;
    for each edge e of v {
      Vertex u = target_vertex(e);
      Relax(u, d + weight[e]);
    }
  }
}

void Dijkstra(Graph g, Vertex source) {
  for each Vertex v in g {
    state[v] <- INFINITY;
  }
  pq.insert(<source, 0>);
  while pq is not empty {
    <v, d> = vertex-distance pair with minimum distance in pq;
    Relax(v, d);
  }
}
void Relax(Vertex v, Distance d) {
  if (d < state[v]) {
    state[v] <- d;
    for each edge e of v {
      Vertex u = target_vertex(e);
      pq.insert(<u, d + weight[e]>);
    }
  }
}

void DeltaStepping(Graph g, Vertex source) {
  for each Vertex v in g {
    state[v] <- INFINITY;
  }
  insert <source, 0> into the appropriate bucket;
  while not all buckets are empty {
    bucket = smallest non-empty bucket;
    for each <v, d> in bucket {
      Relax(v, d);
    }
  }
}
void Relax(Vertex v, Distance d) {
  if (d < state[v]) {
    state[v] <- d;
    for each edge e of v {
      Vertex u = target_vertex(e);
      insert <u, d + weight[e]> into the appropriate bucket;
    }
  }
}
Relax
Relax
Relax
Process a random vertex
Put vertex in a priority queue
Put vertex in a ∆ bucket
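The three variants above share the same Relax step and differ only in how the worklist is ordered. As an illustrative sketch (not the dissertation's framework code), the Dijkstra variant can be written in C++ with a `std::priority_queue` standing in for the ordering; the graph representation and the names `WorkItem` and `dijkstra` are assumptions made here:

```cpp
#include <cassert>
#include <functional>
#include <limits>
#include <queue>
#include <utility>
#include <vector>

using Vertex = int;
using Distance = int;
using Edge = std::pair<Vertex, Distance>;      // (target, weight)
using WorkItem = std::pair<Distance, Vertex>;  // distance first, so the queue orders by it

constexpr Distance INF = std::numeric_limits<Distance>::max();

// Dijkstra expressed as the generic relax step plus a min-ordered worklist.
std::vector<Distance> dijkstra(const std::vector<std::vector<Edge>>& g,
                               Vertex source) {
  std::vector<Distance> state(g.size(), INF);
  std::priority_queue<WorkItem, std::vector<WorkItem>,
                      std::greater<WorkItem>> pq;
  pq.push({0, source});
  while (!pq.empty()) {
    auto [d, v] = pq.top();
    pq.pop();
    if (d < state[v]) {          // the relax test
      state[v] = d;              // state update
      for (auto [u, w] : g[v])   // work generation
        pq.push({d + w, u});
    }
  }
  return state;
}
```

Swapping the priority queue for an unordered bag gives the Chaotic algorithm; grouping workitems into buckets of width ∆ gives ∆-Stepping. The relax logic itself never changes.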
SSSP Algorithms Processing Work
Chaotic: Vertex–distance pairs generated by the Relax function are fed back to it, unordered.
∆-Stepping: Vertex–distance pairs generated by the Relax function are ordered based on ∆. Each generated pair is inserted into the appropriate bucket, and pairs in the first bucket are processed first.
KLA: Vertex–distance pairs are ordered based on the level from the source vertex. Buckets are created based on the level interval k.
Dijkstra's: Vertex–distance pairs are ordered based on the distance. The pair with the smallest distance is processed first.
Relax (state update): same across all four algorithms.
Ordering of updates: different across algorithms.
Ordering of SSSP Algorithms
Chaotic: Generated vertex–distance pairs are selected randomly and not ordered → a single partition.
∆-Stepping: Generated vertex–distance pairs are separated into several partitions based on a relation defined on the distance and ∆ → multiple partitions based on ∆.
KLA: Generated vertex–distance pairs are separated into partitions based on the level and k → multiple partitions based on the level.
Dijkstra: Generated vertex–distance pairs are separated into partitions based on distinct distance values → multiple partitions based on the distance.
[Figure: partition diagrams. Chaotic: one unordered partition of all workitems. ∆-Stepping: workitems unordered within a ∆ bucket, ordered between buckets. KLA: workitems unordered within k levels, ordered between them. Dijkstra: fully ordered, one workitem per partition.]
Distributed Memory Implementations
Chaotic | ∆-Stepping | KLA | Dijkstra's
Barriers are executed at
the beginning of the
algorithm and at the end
of the algorithm.
A barrier is executed after
processing a ∆ bucket.
A barrier is executed after
processing k levels.
A barrier is executed after processing the vertex–distance pairs at each distinct distance value.
[Figure: workitem distribution across two ranks R0 and R1 for each algorithm, with barrier points marked.]
• Two ranks (R0 and R1)
• Vertices are distributed (e.g., Even vertices in R0 and odd vertices in R1).
• Barrier execution points are marked in the figure.
Abstract Graph Machine for SSSP
A strict weak ordering relation is defined on work units.
The strict weak ordering relation creates equivalence classes and induces an ordering on the generated equivalence classes.
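For concreteness, here is a minimal C++ sketch of a ∆-Stepping-style strict weak ordering; the names `DeltaOrdering` and `equivalent` are illustrative assumptions, not the framework's API. Two workitems are equivalent, i.e. fall into the same equivalence class (bucket), exactly when neither precedes the other:

```cpp
#include <cassert>
#include <utility>

using WorkItem = std::pair<int, int>;  // (vertex, distance)

// Delta-Stepping strict weak ordering: w1 precedes w2 when w1's distance
// falls in an earlier bucket of width delta. Workitems in the same bucket
// are mutually incomparable, so they form one equivalence class.
struct DeltaOrdering {
  int delta;
  bool operator()(const WorkItem& a, const WorkItem& b) const {
    return a.second / delta < b.second / delta;
  }
};

// Two workitems are equivalent iff neither precedes the other.
bool equivalent(const DeltaOrdering& lt, const WorkItem& a, const WorkItem& b) {
  return !lt(a, b) && !lt(b, a);
}
```

With delta = 5, workitems at distances 2 and 4 land in the same class, while a workitem at distance 7 belongs to a strictly later class.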
Abstract Graph Machine
• An abstract model that represents graph algorithms as a processing function and a strict weak ordering.
• In the AGM for SSSP, the (vertex, distance) pair is called a workitem.
• An AGM consists of:
  - A definition of a graph (vertices, edges, vertex attributes, edge attributes)
  - A set of states (of the computation, e.g., distances)
  - A definition of workitems
  - A processing function
  - A strict weak ordering relation
  - An initial workitem set (to start the algorithm)
The Abstract Graph Machine (AGM)
Definition of
graph
Definition of
workitems
Set of states
Processing
function
Strict weak
ordering
Initial set of
workitems
Including vertex and
edge attributes
SSSP: distance
from source
SSSP: relax
SSSP: edge weight
SSSP: starting vertex
SSSP: depends on the algorithm!
SSSP Algorithms in AGM
• In general, an SSSP algorithm's workitem consists of a vertex and a distance. The set of workitems for the AGM is defined accordingly.
• A workitem is a pair (e.g., w = <v, d>). The "[]" operator retrieves values associated with w; e.g., w[0] returns the vertex associated with the workitem and w[1] returns the distance.
• The state of the algorithm (the shortest distance calculated at a point in time) is stored in a state we call "distance".
The Processing Function for SSSP
Input workitem. The workitem is a pair: w[0] returns the vertex associated with the workitem and w[1] the distance.
1. Check this condition: if the input workitem's distance is better than the stored distance, go to 2.
2. Update the distance and go to 3.
3. Generate new workitems.
relax() in AGM notation
State Update & Work Generation
State update logic.
Work generation
logic.
Strict weak ordering relations (SSSP)
• Chaotic SSSP strict weak ordering relation ( )
• Dijkstra’s SSSP strict weak ordering relation ( )
• ∆-Stepping SSSP strict weak ordering relation ( )
• A similar ordering relation can be derived for KLA.
Algorithm Family
• All algorithms share the same processing function but with different
orderings.
• The collection of algorithms that share the same processing function
is called an algorithm family.
Distributed ∆–Stepping Algorithm
Partitions are created by the global ordering relation.
First bucket created
by delta-stepping
algorithm
Work within a bucket
is not ordered
Partitions are ordered
∆–Stepping Algorithm
on two ranks (R0, R1)
Huge opportunity for
acceleration
Extended AGM (EAGM): ∆–Stepping + thread level ordering
For a specific algorithm, work
within a bucket is not ordered
AGM creates buckets with the global ordering relation.
And we can do
these in parallel
Each rank can process its
bucket with a better ordering!
But why be
unordered?
Extended AGM (EAGM): Hybrid Hierarchical Ordering
As we distribute/parallelize, we
create different spatial domains
for processing
Extended AGM: Use
different orderings in each
domain
Extended AGM in General
AGM takes a
processing function
and a strict weak
ordering relation
EAGM takes a processing function and a hierarchy of strict weak ordering relations, one attached to each spatial domain.
Global
Ordering
Node
Ordering
NUMA
Ordering
Thread
Ordering
Extended AGM (EAGM): E.g., ∆–Stepping with NUMA level
ordering.
• Work within each bucket is further ordered using NUMA local priority
queues.
ordering for B1
bucket and NUMA domain
1 in Rank 0
Extended AGM (EAGM) - ∆–Stepping with node level
ordering.
• Work within each bucket is further ordered using node local priority
queues.
ordering for B1
bucket in Rank 0
Engineering an AGM
template<typename buckets>
void PF(const WorkItem& wi,
int tid,
buckets& outset) {
Vertex v = std::get<0>(wi);
Distance d = std::get<1>(wi);
Distance old_dist = vdistance[v];
while (d < old_dist) {
Distance din = CAS (d, &vdistance[v]);
if (din == old_dist) {
FORALL_OUTEDGES_T(v, e, g, Graph) {
Vertex u = boost::target(e, g);
Weight we = boost::get(vweight, e);
WorkItem wigen(u, (d+we));
outset.push(wigen, tid);
}
break;
} else {
old_dist = din;
}
}
}
AGM Graph Processing Framework
//================== Dijkstra Ordering =====================
template<int index>
struct dijkstra : public base_ordering {
public:
static const eagm_ordering ORDERING_NAME = eagm_ordering::enum_dijkstra;
template <typename T>
bool operator()(T i, T j) {
return (std::get<index>(i) < std::get<index>(j));
}
eagm_ordering name() {
return ORDERING_NAME;
}
};
The strict weak ordering partitions WorkItems; in this case the ordering is based on the distance.
The processing
function for
SSSP.
For SSSP a
”WorkItem” is a
vertex and distance.
CAS = Atomic
compare & swap
After executing wi, the processing function generates new work items, which are pushed for ordering.
Invoking the Framework
// SSSP (AGM) algorithm
typedef agm<Graph,
WorkItem,
ProcessingFunction,
StrictWeakOrdering,
RuntimeModelGen> sssp_agm_t;
sssp_agm_t ssspalgo(rtmodelgen,
ordering,
pf,
initial);
Creating the
AGM for SSSP.
The WorkItem type.
The processing function
(preorder) type.
The strict weak ordering
relation type.
The underlying runtime.
(AGM framework abstracts
out the runtime).
The underlying
runtime.
The specific ordering
function (e.g., Delta
Ordering).
The processing
function instance.
The initial work item
set (e.g., source
vertex).
Mapping AGM to an Implementation
Abstract Graph Machine 40
Q2. How should we arrange the processing function and the ordering?
Q1. How should we do the ordering?
Abstract
Actual implementation
Communication method
Ordering Implementation
• Every compute node keeps a set of equivalence classes locally in a data structure.
• Equivalence classes are ordered according to the strict weak ordering (swo).
• Every equivalence class has:
  - A representative work item
  - An append buffer that holds work items for the equivalence class.
Representation
Shared-memory
parallel thread
Holds all the work
items that are not
comparable to the
representative.
Strict weak ordering
is transitive.
Will discuss later.
An equivalence
class.
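A sequential sketch of this per-node structure follows, ignoring the concurrency handling discussed later; `EqClass`, `insert`, and the list layout are illustrative assumptions. An incoming workitem joins the class whose representative it is incomparable with, or founds a new class at the right position in the ordered list:

```cpp
#include <cassert>
#include <list>
#include <utility>
#include <vector>

using WorkItem = std::pair<int, int>;  // (vertex, distance)

struct EqClass {
  WorkItem representative;       // first workitem seen for this class
  std::vector<WorkItem> buffer;  // append buffer of equivalent workitems
};

// Classes are kept sorted by the strict weak ordering. Scan until either
// (a) w precedes a representative (found the sorted position for a new
// class) or (b) w is incomparable with a representative (same class).
template <typename Swo>
void insert(std::list<EqClass>& classes, const WorkItem& w, Swo lt) {
  for (auto it = classes.begin(); it != classes.end(); ++it) {
    if (lt(w, it->representative)) {   // w belongs before this class
      classes.insert(it, {w, {w}});
      return;
    }
    if (!lt(it->representative, w)) {  // incomparable: same class
      it->buffer.push_back(w);
      return;
    }
  }
  classes.push_back({w, {w}});         // w belongs after every class
}
```

Incomparability (`!lt(a, b) && !lt(b, a)`) is transitive for a strict weak ordering, which is what makes the single representative per class sufficient.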
Ordering Implementation: Inserting into an Equivalence Class
• Find the equivalence class by comparing with the representative work items.
• If there is no matching equivalence class, create one with the inserted work item as its representative.
  - Must deal with concurrency.
The inserted element is not comparable with the representative element.
Ordering Implementation: Processing an Equivalence
Class
• Find the equivalence class with the minimum representative workitem.
  - Every rank sends the representative work item of the first equivalence class in its list (an all-reduce communicates the minimum work items).
  - The smallest equivalence class is selected.
Equivalence
classes are not
distributed
uniformly
If a particular node does not have an equivalence class matching the global minimum representative, one is inserted.
Ordering Implementation: Termination of an Equivalence
Class
• Termination of a single equivalence class:
  - While processing an equivalence class, more work can be added to the same equivalence class.
  - Every node keeps track of the number of work items pushed into the equivalence class and the number of work items processed from it.
  - The difference between those two numbers is globally reduced (sum); when the sum is zero, the framework decides that the equivalence class is fully processed.
Counts are reduced after making sure all messages have been exchanged (i.e., after an epoch).
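The counting scheme can be sketched as follows, with `std::accumulate` standing in for the MPI all-reduce; `ClassCounters` and `terminated` are illustrative names, not the framework's API:

```cpp
#include <cassert>
#include <numeric>
#include <vector>

// Per-rank counters for one equivalence class.
struct ClassCounters {
  long pushed = 0;     // workitems inserted into the class on this rank
  long processed = 0;  // workitems taken out and executed on this rank
};

// Stand-in for the all-reduce: sum (pushed - processed) over all ranks.
// The class has terminated exactly when the global sum is zero, i.e.
// every workitem ever pushed has also been processed on some rank.
bool terminated(const std::vector<ClassCounters>& ranks) {
  long diff = std::accumulate(
      ranks.begin(), ranks.end(), 0L,
      [](long acc, const ClassCounters& c) {
        return acc + (c.pushed - c.processed);
      });
  return diff == 0;
}
```

Note that an individual rank's difference may be negative (it processed items pushed on another rank); only the global sum is meaningful, which is why the reduction must run after an epoch, once all in-flight messages have been delivered.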
AGM Implementation: Data Structure
• A data structure is needed to hold the equivalence classes.
  - Later this data structure will be further extended for use at EAGM levels.
• Requirements of the data structure:
  - Should order nodes by representative work items
  - Fast lookups are important
  - Ability to concurrently insert/delete and lookup/search data
Sounds like a
Dictionary ADT.
Data Structure
• Binary Search Trees (BST), with Locks
• Linked List, with Locks and Atomics
• Concurrent SkipList
• Partitioning Scheme
E.g., red-black trees. Rebalancing often requires locking the whole data structure for insert/delete.
Not much difference between the locked and lock-free versions. No need to lock the whole data structure (only the node being modified), but lookup time is linear.
An alternative probabilistic data structure that gives the same complexity guarantees as a BST and avoids rebalancing, but contention is much higher.
Explained in the next slide.
Data Structure: Partitioning Scheme
Every node maintains a set of vectors, one per thread.
Every thread maintains the
min work item seen so far.
Push back - O(1)
(assuming no resize)
Find the global min
representative
Partition by the global min using the strict weak ordering relation and move the partitioned work to a new vector (O(n)).
Thread buckets is a data structure with one vector per thread; it represents the next equivalence class to be processed.
Data Structure: Partitioning Scheme
• Processing thread buckets avoids load-imbalance issues.
• But work items arriving for the same bucket cannot be inserted (we cannot afford vector resizes; if needed, locks would be required).
• Work items pushed for the same bucket are sent directly instead.
[Figure: per-thread bucket vectors T0, T1, T2, …]
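One partitioning round can be sketched with `std::partition`; `take_bucket` and the ∆-based comparator are illustrative assumptions, not the framework's API. Every workitem equivalent to the global minimum representative moves into the thread's processing bucket, leaving later work in place:

```cpp
#include <algorithm>
#include <cassert>
#include <utility>
#include <vector>

using WorkItem = std::pair<int, int>;  // (vertex, distance)

// One partitioning round on a thread-local vector: keep at the front
// every workitem NOT preceded by the global minimum (i.e. equivalent to
// it under the swo), move it out as the next bucket. O(n), no per-item
// locking, push_back into `local` stays O(1) between rounds.
template <typename Swo>
std::vector<WorkItem> take_bucket(std::vector<WorkItem>& local,
                                  const WorkItem& global_min, Swo lt) {
  auto mid = std::partition(local.begin(), local.end(),
                            [&](const WorkItem& w) {
                              return !lt(global_min, w);  // w in min's class
                            });
  std::vector<WorkItem> bucket(local.begin(), mid);
  local.erase(local.begin(), mid);
  return bucket;
}
```

Only `!lt(global_min, w)` needs testing: since `global_min` is the global minimum, no workitem precedes it, so "not preceded by the minimum" already means "equivalent to the minimum".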
Data Structure: Summary
Data Structure | Execution Time (sec.)
Linked List (Atomics) | ~61
Binary Search Tree with Locks | ~82
Concurrent SkipList | ~89
Partitioning Scheme | ~44
• Pre-order processing function
• SSSP ∆ ordering
• Two nodes, 16 shared-memory parallel threads, Graph500 scale 24.
Mapping AGM to an Implementation
Q2. How should we arrange the processing function and the ordering?
Q1. How should we do the ordering?
Abstract
Actual implementation
Communication method
Communication Paradigms
[Figure: three communication patterns among ranks R0–R3: push, pull, and gather/apply/scatter (GAS).]
Push communication: R1 pushes workitems to R0. Pull communication: R0 pulls workitems from R1. GAS communication: gather/apply/scatter.
Placing Processing Function: Pre-Order & Post-Order
Pre-Order: after receiving a workitem, the processing function is executed and the generated workitems are inserted into the data structure.
Post-Order: after receiving a workitem, it is inserted into the data structure for ordering (and processed later).
Placing Processing Function: Split-Order
• The processing function contains logic for updating state(s) and for generating new work.
  - We split the processing function into two functions: a state update function and a work-generation function.
workitem
Insert the
workitem for
ordering if state
is updated
Generate new
work only if state is
not changed
Allows us to
prune work
further
Placing Processing Function: Summary
Ordering | Run Time (sec.) | Total Inserts | Additional WorkItems Processed
Split-Order | 14.54 | 17,139,284 | 21,334
Pre-Order | 62.55 | 409,388,039 | 2,469,025
Post-Order | 44.09 | 398,344,581 | 2,284,918
Split-Order has the best timing: the total work items pushed into the data structure are far fewer than in the other two implementations, and the additional relaxes are much fewer because of pruning. Post-Order is slightly better than Pre-Order because of reduced contention.
Engineering an EAGM
EAGM Spatial Levels
T0 T1 T2 T3 T4 T5
N0
Global
Spatial Level
Node
Spatial Level
NUMA
Spatial Level
Thread
Spatial Level
numa0 numa1
T0 T1 T2 T3 T4 T5
N1
numa0 numa1
T0 T1 T2 T3 T4 T5
N2
numa0 numa1
EAGM Execution
// SSSP (AGM) algorithm
typedef agm<Graph,
WorkItem,
ProcessingFunction,
StrictWeakOrdering,
RuntimeModelGen> sssp_agm_t;
sssp_agm_t ssspalgo(rtmodelgen,
ordering,
pf,
initial);
// SSSP (EAGM) algorithm
typedef eagm<Graph,
WorkItem,
ProcessingFunction,
EAGMConfig,
RuntimeModelGen> sssp_eagm_t;
sssp_eagm_t ssspalgo(rtmodelgen,
config,
pf,
initial);
AGM takes a strict weak ordering relation.
EAGM takes a hierarchy of orderings; the hierarchy corresponds to the memory hierarchy.
Globally: ∆ (=5) ordering
Node level: chaotic ordering
NUMA level: chaotic ordering
Thread level: chaotic ordering
EAGMConfig
EAGMConfig
// strict weak orderings
//================== Chaotic =====================
struct chaotic : public base_ordering {
public:
static const eagm_ordering ORDERING_NAME = eagm_ordering::enum_chaotic;
template <typename T>
bool operator()(T i, T j) {
return false;
}
eagm_ordering name() {
return ORDERING_NAME;
}
};
No two workitems are comparable to each other.
CHAOTIC_ORDERING_T ch;
DELTA_ORDERING_T delta(agm_params.delta);
auto config = boost::graph::agm::create_eagm_config(delta,
ch,
ch,
ch);
Globally: ∆ ordering
Node level: chaotic ordering
NUMA level: chaotic ordering
Thread level: chaotic ordering
EAGM Execution:
T0 T1 T2 T3 T4 T5
N0
numa0 numa1
T0 T1 T2 T3 T4 T5
N1
numa0 numa1
T0 T1 T2 T3 T4 T5
N2
numa0 numa1
Time
An equivalence
class
Barrier
synchronization
Execution same
as AGM
Execution is
globally
synchronous
Execution is asynchronous at the process, NUMA, and thread levels.
EAGM Execution:
Execution is globally asynchronous → a single global equivalence class.
Execution is node-level synchronous: equivalence classes at the process level.
Execution is asynchronous at the NUMA and thread levels.
Local thread
barriers
EAGM Execution:
T0 T1 T2 T3 T4 T5
N0
numa0 numa1
T0 T1 T2 T3 T4 T5
N1
numa0 numa1
T0 T1 T2 T3 T4 T5
N2
numa0 numa1
Time
Global
Equivalence
Class
Global
Barrier
Node
Equivalence
Class
NUMA
Equivalence
Class
Thread
Equivalence
Class
NUMA
Barrier
Thread
Barrier
Spatial ordering vs. temporal ordering: temporal ordering arises if we have "non-chaotic" orderings defined for every memory level.
EAGM Implementation
• The heart of the EAGM implementation is the nested data structure that holds
equivalence classes.
[Figure: nested data structure; global-level classes contain node-level classes, which contain NUMA-level classes, ordered over time.]
EAGM Implementation: Data Structure
E.g., nested orderings: order by ∆=10, order by k=2, order by ∆=5.
EAGM Implementation: Static Optimizations
Chaotic ordering always creates a single equivalence class. Therefore, we exclude levels with chaotic orderings.
Optimized data structure.
Optimizations are performed statically at compile time with the help of template metaprogramming and template specialization.
Remove nested levels
for chaotic orderings.
EAGM Implementation: Static Optimizations
Remove nested levels for chaotic orderings. This is really an AGM.
EAGM Implementation: Static Optimizations
Results
Single Source Shortest Paths: Pre/Post/Split Orderings
Weak scaling results carried out on two Broadwell 16-core Intel Xeon processors.
2^30 vertices & 2^34 edges.
Shared-memory
execution.
Split-order is ~5 times faster than pre-order and post-order.
Single Source Shortest Paths: Weak Scaling
Graph500 graph.
Graph500
proposed SSSP
graph.
Erdős–Rényi graph.
Framework algorithms show better scaling compared to PowerGraph. Thread-level Dijkstra ordering shows better scaling behavior for power-law graphs. Globally synchronous delta-stepping scales better for ER graphs.
Single Source Shortest Paths: Strong Scaling (BRII+)
Relative speedup = (time for the fastest sequential algorithm) / (parallel execution time on n PEs).
Synchronization overhead becomes significant when executing on a larger number of nodes.
Just ordering by level does not reduce the work, but if we apply Dijkstra ordering at the thread level we see better performance.
SSSP More Results: Weak Scaling (BR II)
Neither PowerGraph nor PBGL-1 scales well at larger scales.
Breadth First Search: Weak Scaling
K(2) global ordering and level global ordering reduce more redundant work compared to thread-level ordering.
BFS: In-node Performance (Weak Scaling)
Global level
synchronous
Node level
synchronous
Within a node, both configurations execute in the same way; therefore their performance is similar.
BFS: Strong Scaling
Globally level-synchronous versions show better speed-up than thread-level synchronous versions.
BFS: Road Networks
Strong scaling with road networks. Road networks have a high diameter (~850–900); therefore, globally level-synchronous versions show poor scaling because of synchronization overhead.
Global level
synchronous
Globally
asynchronous
and thread level
synchronous
Other Graph Applications
• Connected Components
• Maximal Independent Set
• Triangle Counting
Conclusions
• Using the AGM abstraction, we showed that we can generate families of algorithms by keeping the processing function intact and changing the ordering.
• With the EAGM we can achieve spatial and temporal orderings.
• We discussed the challenges in mapping the AGM framework to an implementation and described how we addressed them.
• We developed an efficient implementation of the AGM and EAGM frameworks.
  - Weak and strong scaling results show that AGM/EAGM algorithms outperform graph algorithms in other frameworks.
Thank You!
 

Similar to ABSTRACT GRAPH MACHINE: MODELING ORDERINGS IN ASYNCHRONOUS DISTRIBUTED-MEMORY PARALLEL GRAPH ALGORITHMS

141205 graphulo ingraphblas
141205 graphulo ingraphblas141205 graphulo ingraphblas
141205 graphulo ingraphblas
graphulo
 
141222 graphulo ingraphblas
141222 graphulo ingraphblas141222 graphulo ingraphblas
141222 graphulo ingraphblas
MIT
 
NVIDIA HPC ソフトウエア斜め読み
NVIDIA HPC ソフトウエア斜め読みNVIDIA HPC ソフトウエア斜め読み
NVIDIA HPC ソフトウエア斜め読み
NVIDIA Japan
 
Svm map reduce_slides
Svm map reduce_slidesSvm map reduce_slides
Svm map reduce_slides
Sara Asher
 
Scala + WattzOn, sitting in a tree....
Scala + WattzOn, sitting in a tree....Scala + WattzOn, sitting in a tree....
Scala + WattzOn, sitting in a tree....
Raffi Krikorian
 
Twopi.1
Twopi.1Twopi.1
Twopi.1
fhhangtuah
 
Advanced Data Science with Apache Spark-(Reza Zadeh, Stanford)
Advanced Data Science with Apache Spark-(Reza Zadeh, Stanford)Advanced Data Science with Apache Spark-(Reza Zadeh, Stanford)
Advanced Data Science with Apache Spark-(Reza Zadeh, Stanford)
Spark Summit
 
CS 354 Transformation, Clipping, and Culling
CS 354 Transformation, Clipping, and CullingCS 354 Transformation, Clipping, and Culling
CS 354 Transformation, Clipping, and Culling
Mark Kilgard
 
Greg Hogan – To Petascale and Beyond- Apache Flink in the Clouds
Greg Hogan – To Petascale and Beyond- Apache Flink in the CloudsGreg Hogan – To Petascale and Beyond- Apache Flink in the Clouds
Greg Hogan – To Petascale and Beyond- Apache Flink in the Clouds
Flink Forward
 
An Introduction to NV_path_rendering
An Introduction to NV_path_renderingAn Introduction to NV_path_rendering
An Introduction to NV_path_rendering
Mark Kilgard
 
MATLABgraphPlotting.pptx
MATLABgraphPlotting.pptxMATLABgraphPlotting.pptx
MATLABgraphPlotting.pptx
PrabhakarSingh646829
 
Time-evolving Graph Processing on Commodity Clusters: Spark Summit East talk ...
Time-evolving Graph Processing on Commodity Clusters: Spark Summit East talk ...Time-evolving Graph Processing on Commodity Clusters: Spark Summit East talk ...
Time-evolving Graph Processing on Commodity Clusters: Spark Summit East talk ...
Spark Summit
 
Ppt 1
Ppt 1Ppt 1
Topological Sort
Topological SortTopological Sort
Topological Sort
Dr Sandeep Kumar Poonia
 
Computation Assignment Help
Computation Assignment Help Computation Assignment Help
Computation Assignment Help
Programming Homework Help
 
MapReduceAlgorithms.ppt
MapReduceAlgorithms.pptMapReduceAlgorithms.ppt
MapReduceAlgorithms.ppt
CheeWeiTan10
 
Introduction to matlab
Introduction to matlabIntroduction to matlab
Introduction to matlab
VidhyaSenthil
 
Scalable and Adaptive Graph Querying with MapReduce
Scalable and Adaptive Graph Querying with MapReduceScalable and Adaptive Graph Querying with MapReduce
Scalable and Adaptive Graph Querying with MapReduce
Kyong-Ha Lee
 
Computer graphics
Computer graphicsComputer graphics
Computer graphics
amitsarda3
 
Graph Algorithms, Sparse Algebra, and the GraphBLAS with Janice McMahon
Graph Algorithms, Sparse Algebra, and the GraphBLAS with Janice McMahonGraph Algorithms, Sparse Algebra, and the GraphBLAS with Janice McMahon
Graph Algorithms, Sparse Algebra, and the GraphBLAS with Janice McMahon
Christopher Conlan
 

Similar to ABSTRACT GRAPH MACHINE: MODELING ORDERINGS IN ASYNCHRONOUS DISTRIBUTED-MEMORY PARALLEL GRAPH ALGORITHMS (20)

141205 graphulo ingraphblas
141205 graphulo ingraphblas141205 graphulo ingraphblas
141205 graphulo ingraphblas
 
141222 graphulo ingraphblas
141222 graphulo ingraphblas141222 graphulo ingraphblas
141222 graphulo ingraphblas
 
NVIDIA HPC ソフトウエア斜め読み
NVIDIA HPC ソフトウエア斜め読みNVIDIA HPC ソフトウエア斜め読み
NVIDIA HPC ソフトウエア斜め読み
 
Svm map reduce_slides
Svm map reduce_slidesSvm map reduce_slides
Svm map reduce_slides
 
Scala + WattzOn, sitting in a tree....
Scala + WattzOn, sitting in a tree....Scala + WattzOn, sitting in a tree....
Scala + WattzOn, sitting in a tree....
 
Twopi.1
Twopi.1Twopi.1
Twopi.1
 
Advanced Data Science with Apache Spark-(Reza Zadeh, Stanford)
Advanced Data Science with Apache Spark-(Reza Zadeh, Stanford)Advanced Data Science with Apache Spark-(Reza Zadeh, Stanford)
Advanced Data Science with Apache Spark-(Reza Zadeh, Stanford)
 
CS 354 Transformation, Clipping, and Culling
CS 354 Transformation, Clipping, and CullingCS 354 Transformation, Clipping, and Culling
CS 354 Transformation, Clipping, and Culling
 
Greg Hogan – To Petascale and Beyond- Apache Flink in the Clouds
Greg Hogan – To Petascale and Beyond- Apache Flink in the CloudsGreg Hogan – To Petascale and Beyond- Apache Flink in the Clouds
Greg Hogan – To Petascale and Beyond- Apache Flink in the Clouds
 
An Introduction to NV_path_rendering
An Introduction to NV_path_renderingAn Introduction to NV_path_rendering
An Introduction to NV_path_rendering
 
MATLABgraphPlotting.pptx
MATLABgraphPlotting.pptxMATLABgraphPlotting.pptx
MATLABgraphPlotting.pptx
 
Time-evolving Graph Processing on Commodity Clusters: Spark Summit East talk ...
Time-evolving Graph Processing on Commodity Clusters: Spark Summit East talk ...Time-evolving Graph Processing on Commodity Clusters: Spark Summit East talk ...
Time-evolving Graph Processing on Commodity Clusters: Spark Summit East talk ...
 
Ppt 1
Ppt 1Ppt 1
Ppt 1
 
Topological Sort
Topological SortTopological Sort
Topological Sort
 
Computation Assignment Help
Computation Assignment Help Computation Assignment Help
Computation Assignment Help
 
MapReduceAlgorithms.ppt
MapReduceAlgorithms.pptMapReduceAlgorithms.ppt
MapReduceAlgorithms.ppt
 
Introduction to matlab
Introduction to matlabIntroduction to matlab
Introduction to matlab
 
Scalable and Adaptive Graph Querying with MapReduce
Scalable and Adaptive Graph Querying with MapReduceScalable and Adaptive Graph Querying with MapReduce
Scalable and Adaptive Graph Querying with MapReduce
 
Computer graphics
Computer graphicsComputer graphics
Computer graphics
 
Graph Algorithms, Sparse Algebra, and the GraphBLAS with Janice McMahon
Graph Algorithms, Sparse Algebra, and the GraphBLAS with Janice McMahonGraph Algorithms, Sparse Algebra, and the GraphBLAS with Janice McMahon
Graph Algorithms, Sparse Algebra, and the GraphBLAS with Janice McMahon
 

Recently uploaded

molar-distalization in orthodontics-seminar.pptx
molar-distalization in orthodontics-seminar.pptxmolar-distalization in orthodontics-seminar.pptx
molar-distalization in orthodontics-seminar.pptx
Anagha Prasad
 
Deep Software Variability and Frictionless Reproducibility
Deep Software Variability and Frictionless ReproducibilityDeep Software Variability and Frictionless Reproducibility
Deep Software Variability and Frictionless Reproducibility
University of Rennes, INSA Rennes, Inria/IRISA, CNRS
 
mô tả các thí nghiệm về đánh giá tác động dòng khí hóa sau đốt
mô tả các thí nghiệm về đánh giá tác động dòng khí hóa sau đốtmô tả các thí nghiệm về đánh giá tác động dòng khí hóa sau đốt
mô tả các thí nghiệm về đánh giá tác động dòng khí hóa sau đốt
HongcNguyn6
 
Compexometric titration/Chelatorphy titration/chelating titration
Compexometric titration/Chelatorphy titration/chelating titrationCompexometric titration/Chelatorphy titration/chelating titration
Compexometric titration/Chelatorphy titration/chelating titration
Vandana Devesh Sharma
 
Shallowest Oil Discovery of Turkiye.pptx
Shallowest Oil Discovery of Turkiye.pptxShallowest Oil Discovery of Turkiye.pptx
Shallowest Oil Discovery of Turkiye.pptx
Gokturk Mehmet Dilci
 
在线办理(salfor毕业证书)索尔福德大学毕业证毕业完成信一模一样
在线办理(salfor毕业证书)索尔福德大学毕业证毕业完成信一模一样在线办理(salfor毕业证书)索尔福德大学毕业证毕业完成信一模一样
在线办理(salfor毕业证书)索尔福德大学毕业证毕业完成信一模一样
vluwdy49
 
Immersive Learning That Works: Research Grounding and Paths Forward
Immersive Learning That Works: Research Grounding and Paths ForwardImmersive Learning That Works: Research Grounding and Paths Forward
Immersive Learning That Works: Research Grounding and Paths Forward
Leonel Morgado
 
Applied Science: Thermodynamics, Laws & Methodology.pdf
Applied Science: Thermodynamics, Laws & Methodology.pdfApplied Science: Thermodynamics, Laws & Methodology.pdf
Applied Science: Thermodynamics, Laws & Methodology.pdf
University of Hertfordshire
 
Unlocking the mysteries of reproduction: Exploring fecundity and gonadosomati...
Unlocking the mysteries of reproduction: Exploring fecundity and gonadosomati...Unlocking the mysteries of reproduction: Exploring fecundity and gonadosomati...
Unlocking the mysteries of reproduction: Exploring fecundity and gonadosomati...
AbdullaAlAsif1
 
aziz sancar nobel prize winner: from mardin to nobel
aziz sancar nobel prize winner: from mardin to nobelaziz sancar nobel prize winner: from mardin to nobel
aziz sancar nobel prize winner: from mardin to nobel
İsa Badur
 
3D Hybrid PIC simulation of the plasma expansion (ISSS-14)
3D Hybrid PIC simulation of the plasma expansion (ISSS-14)3D Hybrid PIC simulation of the plasma expansion (ISSS-14)
3D Hybrid PIC simulation of the plasma expansion (ISSS-14)
David Osipyan
 
Oedema_types_causes_pathophysiology.pptx
Oedema_types_causes_pathophysiology.pptxOedema_types_causes_pathophysiology.pptx
Oedema_types_causes_pathophysiology.pptx
muralinath2
 
Authoring a personal GPT for your research and practice: How we created the Q...
Authoring a personal GPT for your research and practice: How we created the Q...Authoring a personal GPT for your research and practice: How we created the Q...
Authoring a personal GPT for your research and practice: How we created the Q...
Leonel Morgado
 
20240520 Planning a Circuit Simulator in JavaScript.pptx
20240520 Planning a Circuit Simulator in JavaScript.pptx20240520 Planning a Circuit Simulator in JavaScript.pptx
20240520 Planning a Circuit Simulator in JavaScript.pptx
Sharon Liu
 
Bob Reedy - Nitrate in Texas Groundwater.pdf
Bob Reedy - Nitrate in Texas Groundwater.pdfBob Reedy - Nitrate in Texas Groundwater.pdf
Bob Reedy - Nitrate in Texas Groundwater.pdf
Texas Alliance of Groundwater Districts
 
原版制作(carleton毕业证书)卡尔顿大学毕业证硕士文凭原版一模一样
原版制作(carleton毕业证书)卡尔顿大学毕业证硕士文凭原版一模一样原版制作(carleton毕业证书)卡尔顿大学毕业证硕士文凭原版一模一样
原版制作(carleton毕业证书)卡尔顿大学毕业证硕士文凭原版一模一样
yqqaatn0
 
Equivariant neural networks and representation theory
Equivariant neural networks and representation theoryEquivariant neural networks and representation theory
Equivariant neural networks and representation theory
Daniel Tubbenhauer
 
THEMATIC APPERCEPTION TEST(TAT) cognitive abilities, creativity, and critic...
THEMATIC  APPERCEPTION  TEST(TAT) cognitive abilities, creativity, and critic...THEMATIC  APPERCEPTION  TEST(TAT) cognitive abilities, creativity, and critic...
THEMATIC APPERCEPTION TEST(TAT) cognitive abilities, creativity, and critic...
Abdul Wali Khan University Mardan,kP,Pakistan
 
Thornton ESPP slides UK WW Network 4_6_24.pdf
Thornton ESPP slides UK WW Network 4_6_24.pdfThornton ESPP slides UK WW Network 4_6_24.pdf
Thornton ESPP slides UK WW Network 4_6_24.pdf
European Sustainable Phosphorus Platform
 
Randomised Optimisation Algorithms in DAPHNE
Randomised Optimisation Algorithms in DAPHNERandomised Optimisation Algorithms in DAPHNE
Randomised Optimisation Algorithms in DAPHNE
University of Maribor
 

Recently uploaded (20)

molar-distalization in orthodontics-seminar.pptx
molar-distalization in orthodontics-seminar.pptxmolar-distalization in orthodontics-seminar.pptx
molar-distalization in orthodontics-seminar.pptx
 
Deep Software Variability and Frictionless Reproducibility
Deep Software Variability and Frictionless ReproducibilityDeep Software Variability and Frictionless Reproducibility
Deep Software Variability and Frictionless Reproducibility
 
mô tả các thí nghiệm về đánh giá tác động dòng khí hóa sau đốt
mô tả các thí nghiệm về đánh giá tác động dòng khí hóa sau đốtmô tả các thí nghiệm về đánh giá tác động dòng khí hóa sau đốt
mô tả các thí nghiệm về đánh giá tác động dòng khí hóa sau đốt
 
Compexometric titration/Chelatorphy titration/chelating titration
Compexometric titration/Chelatorphy titration/chelating titrationCompexometric titration/Chelatorphy titration/chelating titration
Compexometric titration/Chelatorphy titration/chelating titration
 
Shallowest Oil Discovery of Turkiye.pptx
Shallowest Oil Discovery of Turkiye.pptxShallowest Oil Discovery of Turkiye.pptx
Shallowest Oil Discovery of Turkiye.pptx
 
在线办理(salfor毕业证书)索尔福德大学毕业证毕业完成信一模一样
在线办理(salfor毕业证书)索尔福德大学毕业证毕业完成信一模一样在线办理(salfor毕业证书)索尔福德大学毕业证毕业完成信一模一样
在线办理(salfor毕业证书)索尔福德大学毕业证毕业完成信一模一样
 
Immersive Learning That Works: Research Grounding and Paths Forward
Immersive Learning That Works: Research Grounding and Paths ForwardImmersive Learning That Works: Research Grounding and Paths Forward
Immersive Learning That Works: Research Grounding and Paths Forward
 
Applied Science: Thermodynamics, Laws & Methodology.pdf
Applied Science: Thermodynamics, Laws & Methodology.pdfApplied Science: Thermodynamics, Laws & Methodology.pdf
Applied Science: Thermodynamics, Laws & Methodology.pdf
 
Unlocking the mysteries of reproduction: Exploring fecundity and gonadosomati...
Unlocking the mysteries of reproduction: Exploring fecundity and gonadosomati...Unlocking the mysteries of reproduction: Exploring fecundity and gonadosomati...
Unlocking the mysteries of reproduction: Exploring fecundity and gonadosomati...
 
aziz sancar nobel prize winner: from mardin to nobel
aziz sancar nobel prize winner: from mardin to nobelaziz sancar nobel prize winner: from mardin to nobel
aziz sancar nobel prize winner: from mardin to nobel
 
3D Hybrid PIC simulation of the plasma expansion (ISSS-14)
3D Hybrid PIC simulation of the plasma expansion (ISSS-14)3D Hybrid PIC simulation of the plasma expansion (ISSS-14)
3D Hybrid PIC simulation of the plasma expansion (ISSS-14)
 
Oedema_types_causes_pathophysiology.pptx
Oedema_types_causes_pathophysiology.pptxOedema_types_causes_pathophysiology.pptx
Oedema_types_causes_pathophysiology.pptx
 
Authoring a personal GPT for your research and practice: How we created the Q...
Authoring a personal GPT for your research and practice: How we created the Q...Authoring a personal GPT for your research and practice: How we created the Q...
Authoring a personal GPT for your research and practice: How we created the Q...
 
20240520 Planning a Circuit Simulator in JavaScript.pptx
20240520 Planning a Circuit Simulator in JavaScript.pptx20240520 Planning a Circuit Simulator in JavaScript.pptx
20240520 Planning a Circuit Simulator in JavaScript.pptx
 
Bob Reedy - Nitrate in Texas Groundwater.pdf
Bob Reedy - Nitrate in Texas Groundwater.pdfBob Reedy - Nitrate in Texas Groundwater.pdf
Bob Reedy - Nitrate in Texas Groundwater.pdf
 
原版制作(carleton毕业证书)卡尔顿大学毕业证硕士文凭原版一模一样
原版制作(carleton毕业证书)卡尔顿大学毕业证硕士文凭原版一模一样原版制作(carleton毕业证书)卡尔顿大学毕业证硕士文凭原版一模一样
原版制作(carleton毕业证书)卡尔顿大学毕业证硕士文凭原版一模一样
 
Equivariant neural networks and representation theory
Equivariant neural networks and representation theoryEquivariant neural networks and representation theory
Equivariant neural networks and representation theory
 
THEMATIC APPERCEPTION TEST(TAT) cognitive abilities, creativity, and critic...
THEMATIC  APPERCEPTION  TEST(TAT) cognitive abilities, creativity, and critic...THEMATIC  APPERCEPTION  TEST(TAT) cognitive abilities, creativity, and critic...
THEMATIC APPERCEPTION TEST(TAT) cognitive abilities, creativity, and critic...
 
Thornton ESPP slides UK WW Network 4_6_24.pdf
Thornton ESPP slides UK WW Network 4_6_24.pdfThornton ESPP slides UK WW Network 4_6_24.pdf
Thornton ESPP slides UK WW Network 4_6_24.pdf
 
Randomised Optimisation Algorithms in DAPHNE
Randomised Optimisation Algorithms in DAPHNERandomised Optimisation Algorithms in DAPHNE
Randomised Optimisation Algorithms in DAPHNE
 

ABSTRACT GRAPH MACHINE: MODELING ORDERINGS IN ASYNCHRONOUS DISTRIBUTED-MEMORY PARALLEL GRAPH ALGORITHMS

  • 9. Shared-Memory Parallel Graph Algorithms [Diagram: the computation consists of multiple parallel phases, each followed by a synchronize step.] These algorithms can be naturally extended to distributed memory using Bulk Synchronous Parallel (BSP). 9Abstract Graph Machine
  • 10. Shared-Memory to Distributed-Memory [Diagram: BSP execution with phases separated by global barriers.] Executes many global barriers; synchronization overhead is significant, especially when executing the program on many nodes. 10Abstract Graph Machine
  • 11. Asynchronous Graph Algorithms Level synchronous BFS, using “S” as the source. 11Abstract Graph Machine
  • 12. Asynchronous Graph Algorithms Level synchronous BFS, using “S” as the source. 12Abstract Graph Machine Synchronize
  • 13. Asynchronous Graph Algorithms Level synchronous BFS, using “S” as the source. Synchronize Once a label is set, no corrections → 0 additional work. 13Abstract Graph Machine
  • 14. Asynchronous Graph Algorithms Asynchronous BFS, using “S” as the source. 14Abstract Graph Machine
  • 15. Asynchronous Graph Algorithms Asynchronous BFS, using “S” as the source. Label correction Need to correct these as well. Generates a lot of redundant work. 15Abstract Graph Machine
  • 16. Asynchronous Graph Algorithms • Asynchronous graph algorithms are better in the sense that they avoid the overhead of synchronization. • BUT, they tend to generate a lot of redundant work → high runtime. 16Abstract Graph Machine
  • 17. How much Synchronization? Low diameter: enough parallel work in each level, fewer barriers (diameter ~ barriers), e.g., Twitter, ER → level synchronous execution is better. High diameter: not enough parallel work in a level, e.g., Road NW → asynchronous execution is better. We need a way to control the amount of synchronization needed by a graph algorithm → the Abstract Graph Machine (AGM) is a model that can control the level of synchronization of a graph algorithm using an ordering. 17Abstract Graph Machine
  • 18. SSSP as an Example • In an undirected weighted graph (G = (V, E)), the Single Source Shortest Path problem is to find a path from the source to every vertex such that the sum of edge weights along that path is minimized. 18Abstract Graph Machine
  • 19. Algorithms

void Chaotic(Graph g, Vertex source) {
  For each Vertex v in G { state[v] <- INFINITY; }
  Relax(source, 0);   // process vertices in arbitrary (random) order
}
void Relax(Vertex v, Distance d) {
  If (d < state[v]) {
    state[v] <- d;
    For each edge e of v {
      Vertex u = target_vertex(e);
      Relax(u, d + weight[e]);
    }
  }
}

void Dijkstra(Graph g, Vertex source) {
  For each Vertex v in G { state[v] <- INFINITY; }
  pq.insert(<source, 0>);   // put vertex in priority queue
  While pq is not empty {
    <v, d> = vertex-distance pair with minimum distance in pq;
    Relax(v, d);
  }
}
void Relax(Vertex v, Distance d) {
  If (d < state[v]) {
    state[v] <- d;
    For each edge e of v {
      Vertex u = target_vertex(e);
      pq.insert(<u, d + weight[e]>);
    }
  }
}

void ∆-Stepping(Graph g, Vertex source) {
  For each Vertex v in G { state[v] <- INFINITY; }
  insert <source, 0> into the appropriate bucket;   // put vertex in a ∆ bucket
  While some bucket is non-empty {
    bucket = smallest non-empty bucket in buckets;
    For each <v, d> in bucket { Relax(v, d); }
  }
}
void Relax(Vertex v, Distance d) {
  If (d < state[v]) {
    state[v] <- d;
    For each edge e of v {
      Vertex u = target_vertex(e);
      insert <u, d + weight[e]> into the appropriate bucket;
    }
  }
}

19Abstract Graph Machine
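The three variants above differ only in how generated workitems are ordered. As a concrete illustration, here is a minimal sequential C++ sketch of the priority-queue (Dijkstra-style) variant; the Graph and WorkItem types here are our own simplifications, not the dissertation's framework types:

```cpp
#include <cassert>
#include <functional>
#include <limits>
#include <queue>
#include <utility>
#include <vector>

using Vertex = int;
using Distance = int;
// Adjacency list: g[v] is a vector of (target, weight) pairs.
using Graph = std::vector<std::vector<std::pair<Vertex, Distance>>>;

constexpr Distance INF = std::numeric_limits<Distance>::max();

// Dijkstra-style SSSP: relax the smallest-distance (vertex, distance)
// pair first, so each state update is final (no label corrections).
std::vector<Distance> dijkstra(const Graph& g, Vertex source) {
  std::vector<Distance> state(g.size(), INF);
  using WorkItem = std::pair<Distance, Vertex>;  // ordered by distance
  std::priority_queue<WorkItem, std::vector<WorkItem>, std::greater<>> pq;
  pq.push({0, source});
  while (!pq.empty()) {
    auto [d, v] = pq.top();
    pq.pop();
    if (d < state[v]) {           // the relax condition
      state[v] = d;               // state update
      for (auto [u, w] : g[v])    // work generation
        pq.push({d + w, u});
    }
  }
  return state;
}
```

Replacing the priority queue with an unordered bag gives the chaotic variant, and replacing it with buckets of width ∆ gives ∆-Stepping — which is exactly the ordering distinction the AGM formalizes.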
  • 20. SSSP Algorithms: Processing vs. Ordering of Work All four algorithms share the same Relax (state update) logic; only the ordering of updates differs. - Chaotic: vertex-distance pairs generated by the Relax function are fed back to it, unordered. - ∆-Stepping: generated vertex-distance pairs are ordered based on ∆; each pair is inserted into the appropriate bucket, and pairs in the first bucket are processed first. - KLA: vertex-distance pairs are ordered based on the level from the source vertex; buckets are created based on the level interval k. - Dijkstra’s: vertex-distance pairs are ordered based on the distance; the smallest-distance pairs are processed first. 20Abstract Graph Machine
  • 21. Ordering of SSSP Algorithms - Chaotic: generated vertex-distance pairs are selected randomly and not ordered → a single partition, unordered. - ∆-Stepping: generated pairs are separated into partitions based on a relation defined on the distance and ∆ → multiple partitions, unordered within a ∆ partition, ordered between ∆ partitions. - KLA: generated pairs are separated into partitions based on the level and k → multiple partitions, unordered within k, ordered between k. - Dijkstra: generated pairs are separated into partitions based on the distinct distance values → multiple partitions, fully ordered by distance. 21Abstract Graph Machine
  • 22. Distributed-Memory Implementations - Chaotic: barriers are executed only at the beginning and at the end of the algorithm. - ∆-Stepping: a barrier is executed after processing each ∆ bucket. - KLA: a barrier is executed after processing k levels. - Dijkstra’s: a barrier is executed after processing the vertex-distance pairs of each distinct distance. Setup: two ranks (R0 and R1); vertices are distributed (e.g., even vertices in R0 and odd vertices in R1). 22Abstract Graph Machine
  • 23. Abstract Graph Machine for SSSP Strict weak ordering relation defined on work units. The strict weak ordering relation creates equivalence classes and induces an ordering on generated equivalence classes. 23Abstract Graph Machine
  • 24. Abstract Graph Machine • Abstract model that represents graph algorithms as a processing function and strict weak ordering • In AGM for SSSP, the (vertex, distance) pair is called a workitem • An AGM consists of: - A definition of a graph (vertices, edges, vertex attributes, edge attributes) - A set of states (of the computation, e.g., distances) - A definition of workitems - A processing function - A strict weak ordering relation - An initial workitem set (to start the algorithm) 24Abstract Graph Machine
  • 25. The Abstract Graph Machine (AGM) • Definition of graph, including vertex and edge attributes (SSSP: edge weight) • Definition of workitems • Set of states (SSSP: distance from source) • Processing function (SSSP: relax) • Strict weak ordering (SSSP: depends on the algorithm!) • Initial set of workitems (SSSP: starting vertex) 25Abstract Graph Machine
  • 26. SSSP Algorithms in AGM • In general, an SSSP algorithm’s workitem consists of a vertex and a distance. Therefore, the set of workitems for the AGM is defined as follows: • is a pair (e.g., w = <v, d>). The “[]” operator retrieves values associated with “w”; e.g., w[0] returns the vertex associated with the workitem and w[1] returns the distance. • The state of the algorithm (the shortest distance calculated at a point in time) is stored in a state we call “distance”. 26Abstract Graph Machine
  • 27. The Processing Function for SSSP Input workitem. The workitem is a pair: w[0] returns the vertex associated with the workitem and w[1] returns the distance. 1. Check this condition: if the input workitem’s distance is better than the stored distance, go to 2. 2. Update the distance and go to 3. 3. Generate new workitems. relax() in AGM notation 27Abstract Graph Machine
  • 28. State Update & Work Generation 28Abstract Graph Machine State update logic. Work generation logic.
  • 29. Strict weak ordering relations (SSSP) • Chaotic SSSP strict weak ordering relation ( ) • Dijkstra’s SSSP strict weak ordering relation ( ) • ∆-Stepping SSSP strict weak ordering relation ( ) • A similar ordering relation can be derived for KLA. 29Abstract Graph Machine
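The relation symbols were dropped from the slide, but the three orderings can be reconstructed as ordinary C++ comparators over (vertex, distance) workitems; the struct names and the ∆ = 10 parameter below are illustrative, not the framework's:

```cpp
#include <cassert>
#include <tuple>

using WorkItem = std::tuple<int, int>;  // (vertex, distance)

// Chaotic: no workitem precedes any other -> one big equivalence class.
struct ChaoticOrdering {
  bool operator()(const WorkItem&, const WorkItem&) const { return false; }
};

// Dijkstra: strictly smaller distance comes first -> one equivalence
// class per distinct distance value.
struct DijkstraOrdering {
  bool operator()(const WorkItem& a, const WorkItem& b) const {
    return std::get<1>(a) < std::get<1>(b);
  }
};

// Delta-Stepping: compare bucket indices floor(d / delta) -> workitems
// in the same bucket are incomparable (unordered within a bucket).
struct DeltaOrdering {
  int delta;
  bool operator()(const WorkItem& a, const WorkItem& b) const {
    return std::get<1>(a) / delta < std::get<1>(b) / delta;
  }
};
```

Two workitems fall into the same equivalence class exactly when neither precedes the other under the relation; with DeltaOrdering that happens precisely when they land in the same ∆ bucket.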
  • 30. Algorithm Family • All algorithms share the same processing function but with different orderings. • The collection of algorithms that share the same processing function is called an algorithm family. 30Abstract Graph Machine
  • 31. Distributed ∆–Stepping Algorithm Partitions created by the global ordering, based on the ordering relation. First bucket created by the delta-stepping algorithm. Work within a bucket is not ordered; partitions are ordered. ∆–Stepping algorithm on two ranks (R0, R1). Huge opportunity for acceleration. 31Abstract Graph Machine
  • 32. Extended AGM (EAGM): ∆–Stepping + thread-level ordering For a specific algorithm, work within a bucket is not ordered. The AGM creates buckets with the global ordering, based on the relation, and we can process these buckets in parallel. But why stay unordered? Each rank can process its bucket with a better ordering! 32Abstract Graph Machine E.g., .
  • 33. Extended AGM (EAGM): Hybrid Hierarchical Ordering As we distribute/parallelize, we create different spatial domains for processing Extended AGM: Use different orderings in each domain 33 E.g., Abstract Graph Machine
  • 34. Extended AGM in General Abstract Graph Machine 34 AGM takes a processing function and a strict weak ordering relation EAGM takes a processing function and a hierarchy of strict weak ordering relations attached to each spatial domain. Global Ordering Node Ordering NUMA Ordering Thread Ordering
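The hierarchy of orderings can be pictured as comparator refinement: an inner (e.g., thread-level) relation applies only where the outer (e.g., global) relation leaves two workitems in the same equivalence class. The types below are our own illustrative sketch, not the framework's API:

```cpp
#include <cassert>
#include <tuple>

using WorkItem = std::tuple<int, int>;  // (vertex, distance)

// Refines an outer strict weak ordering with an inner one: the inner
// ordering applies only where the outer treats two workitems as
// equivalent -- the essence of an EAGM ordering hierarchy.
template <typename Outer, typename Inner>
struct Refine {
  Outer outer;
  Inner inner;
  bool operator()(const WorkItem& a, const WorkItem& b) const {
    if (outer(a, b)) return true;
    if (outer(b, a)) return false;
    return inner(a, b);  // outer sees them as equivalent
  }
};

// Illustrative level orderings (names are ours, not the framework's).
struct GlobalDelta {     // global level: Delta buckets of width 10
  bool operator()(const WorkItem& a, const WorkItem& b) const {
    return std::get<1>(a) / 10 < std::get<1>(b) / 10;
  }
};
struct ThreadDijkstra {  // thread level: order by exact distance
  bool operator()(const WorkItem& a, const WorkItem& b) const {
    return std::get<1>(a) < std::get<1>(b);
  }
};

// A thread's effective order under "Delta globally, Dijkstra per thread".
using Hierarchy = Refine<GlobalDelta, ThreadDijkstra>;
```

Chaining Refine for node and NUMA levels would model the full global → node → NUMA → thread hierarchy in the same way.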
  • 35. Extended AGM (EAGM): E.g., ∆–Stepping with NUMA level ordering. • Work within each bucket is further ordered using NUMA local priority queues. 35Abstract Graph Machine ordering for B1 bucket and NUMA domain 1 in Rank 0
  • 36. Extended AGM (EAGM) - ∆–Stepping with node level ordering. • Work within each bucket is further ordered using node local priority queues. 36Abstract Graph Machine ordering for B1 bucket in Rank 0
  • 37. Engineering an AGM Abstract Graph Machine 37
  • 38. AGM Graph Processing Framework Abstract Graph Machine 38

// The processing function for SSSP. For SSSP a WorkItem is a vertex
// and a distance. CAS = atomic compare & swap. After executing wi, the
// processing function generates new workitems and pushes them for ordering.
template<typename buckets>
void PF(const WorkItem& wi, int tid, buckets& outset) {
  Vertex v = std::get<0>(wi);
  Distance d = std::get<1>(wi);
  Distance old_dist = vdistance[v];
  while (d < old_dist) {
    Distance din = CAS(d, &vdistance[v]);
    if (din == old_dist) {
      FORALL_OUTEDGES_T(v, e, g, Graph) {
        Vertex u = boost::target(e, g);
        Weight we = boost::get(vweight, e);
        WorkItem wigen(u, (d + we));
        outset.push(wigen, tid);
      }
      break;
    } else {
      old_dist = din;
    }
  }
}

//================== Dijkstra Ordering =====================
// The strict weak ordering partitions WorkItems; in this case the
// ordering is based on the distance.
template<int index>
struct dijkstra : public base_ordering {
public:
  static const eagm_ordering ORDERING_NAME = eagm_ordering::enum_dijkstra;
  template <typename T>
  bool operator()(T i, T j) {
    return (std::get<index>(i) < std::get<index>(j));
  }
  eagm_ordering name() { return ORDERING_NAME; }
};
  • 39. Invoking the Framework Abstract Graph Machine 39

// Creating the AGM for SSSP. Template parameters: the WorkItem type,
// the processing function type, the strict weak ordering relation type,
// and the underlying runtime (the AGM framework abstracts out the runtime).
typedef agm<Graph, WorkItem, ProcessingFunction,
            StrictWeakOrdering, RuntimeModelGen> sssp_agm_t;
// Constructor arguments: the runtime instance, the specific ordering
// function (e.g., Delta ordering), the processing function instance,
// and the initial workitem set (e.g., the source vertex).
sssp_agm_t ssspalgo(rtmodelgen, ordering, pf, initial);
  • 40. Mapping AGM to an Implementation Abstract Graph Machine 40 Q1. How should we do ordering? Q2. How should we arrange the processing function and the ordering? Abstract → actual implementation (communication method).
  • 41. Ordering Implementation • Every compute node keeps a set of equivalence classes locally in a data structure. • Equivalence classes are ordered according to the strict weak ordering (swo), which is transitive. • Every equivalence class has: - A representative work-item - An append buffer that holds the work items for the equivalence class, i.e., all work items that are not comparable to the representative. Each buffer is processed by shared-memory parallel threads; details discussed later. 41Abstract Graph Machine
• 42. Ordering Implementation: Inserts to an Equivalence Class • Find the equivalence class by comparing the inserting work item with the representative work items (the inserting element is not comparable with the representative of its class). • If there is no such equivalence class, create one with the inserting work item as the representative work item. • Concurrency must be handled during insertion.
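The find-or-create insert described above can be sketched as follows. This is a minimal, single-threaded illustration (names such as DeltaOrdering and EquivalenceClass are illustrative, not the framework's actual types); work items are (vertex, distance) pairs and the strict weak ordering is a delta bucketing, so two items are equivalent when neither orders before the other.

```cpp
#include <list>
#include <utility>
#include <vector>

// Hypothetical stand-ins for the framework's types.
using WorkItem = std::pair<int, int>; // (vertex, distance)

struct DeltaOrdering {
  int delta;
  bool operator()(const WorkItem& a, const WorkItem& b) const {
    return (a.second / delta) < (b.second / delta);
  }
};

struct EquivalenceClass {
  WorkItem representative;
  std::vector<WorkItem> buffer; // append buffer of equivalent work items
};

// Walk the ordered list of classes; insert into the first class whose
// representative is equivalent to wi (neither swo(wi, rep) nor swo(rep, wi)
// holds), or create a new class at the correct ordered position.
void insert(std::list<EquivalenceClass>& classes, const WorkItem& wi,
            const DeltaOrdering& swo) {
  auto it = classes.begin();
  for (; it != classes.end(); ++it) {
    if (swo(wi, it->representative)) break; // wi orders before this class
    if (!swo(it->representative, wi)) {     // not comparable: same class
      it->buffer.push_back(wi);
      return;
    }
  }
  classes.insert(it, EquivalenceClass{wi, {wi}}); // wi becomes representative
}
```

A real implementation must additionally deal with concurrent inserts, as the slide notes; this sketch only shows the ordering logic.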
• 43. Ordering Implementation: Processing an Equivalence Class • Find the equivalence class with the minimum representative work item. Equivalence classes are not distributed uniformly, so: every rank sends the representative work item of the first equivalence class in its list (an all-reduce communicates the minimum work items), and the globally smallest representative is selected. If a particular node does not have an equivalence class matching the global minimum representative, it inserts one.
• 44. Ordering Implementation: Termination of an Equivalence Class • While processing an equivalence class, more work can be added to the same equivalence class. • Every node keeps track of the number of work items pushed into the equivalence class and the number of work items processed in it. • The difference between those two numbers is globally reduced (sum); when the sum is zero, the framework decides the equivalence class is fully processed. Counts are reduced only after making sure all messages have been exchanged (i.e., after an epoch).
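The counting argument above can be sketched as a small simulation. This is an illustrative sketch only (no MPI here): the per-rank counters are modeled with a vector, and all_reduce_sum stands in for the global sum reduction that would run after an epoch.

```cpp
#include <cstdint>
#include <numeric>
#include <vector>

// Per-rank counters for one equivalence class (illustrative names).
struct RankCounters {
  std::int64_t pushed = 0;     // work items inserted into the class
  std::int64_t processed = 0;  // work items taken out and executed
};

// Stand-in for a global sum reduction (e.g., an allreduce) over the
// per-rank difference pushed - processed.
std::int64_t all_reduce_sum(const std::vector<RankCounters>& ranks) {
  return std::accumulate(ranks.begin(), ranks.end(), std::int64_t{0},
                         [](std::int64_t acc, const RankCounters& r) {
                           return acc + (r.pushed - r.processed);
                         });
}

// The equivalence class is done only when, after all in-flight messages
// have been delivered, the global pushed/processed difference is zero.
bool equivalence_class_done(const std::vector<RankCounters>& ranks) {
  return all_reduce_sum(ranks) == 0;
}
```

Reducing only after an epoch matters: a zero sum observed while messages are still in flight could be transient.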
• 45. AGM Implementation: Data Structure • A data structure is needed to hold the equivalence classes; later this data structure will be further extended for use at EAGM levels. • Requirements: it should order nodes by representative work items, fast lookups are important, and it must support concurrent insert/delete and lookup/search. • Sounds like a Dictionary ADT.
• 46. Data Structure • Binary search trees (BST) with locks (e.g., RB-trees): inserts/deletes often rebalance the tree, so the whole data structure must be locked. • Linked list with locks and atomics: no need to lock the whole data structure (only the node being modified), but lookup time is linear; there is not much difference between the locked and lock-free versions. • Concurrent SkipList: a probabilistic alternative with the same complexity guarantees as a BST that avoids rebalancing, but contention is much higher. • Partitioning scheme: explained in the next slide.
• 47. Data Structure: Partitioning Scheme • Every node maintains a set of vectors, one per thread; "thread buckets" is a data structure with a vector per thread that represents the next equivalence class to process (e.g., the ordering for bucket B1 in rank 0). • Every thread maintains the minimum work item it has seen so far; push back is O(1) (assuming no resize). • Find the global minimum representative, then partition by the global minimum with the strict weak ordering relation and move the partitioned work to a new vector (O(n)).
• 48. Data Structure: Partitioning Scheme • Thread buckets are processed in round-robin order (T0, T1, T2, T0, T1, T2, T0, …), which avoids load-imbalance issues. • But work items arriving for the bucket currently being processed cannot be inserted (we cannot afford a vector resize; supporting that would require locks). • Therefore, work items pushed for the current bucket are sent directly.
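The partition step of the scheme can be sketched as below. This is an illustrative, single-threaded sketch (names are not the framework's): plain ints stand in for work items, plain vectors for the per-thread buffers, and a delta bucketing for the strict weak ordering. Items equivalent to the global minimum move into fresh per-thread vectors that form the next equivalence class.

```cpp
#include <algorithm>
#include <cstddef>
#include <vector>

// Illustrative delta strict weak ordering over distances.
struct DeltaSWO {
  int delta;
  bool operator()(int a, int b) const { return (a / delta) < (b / delta); }
};

// Move every item equivalent to global_min (same bucket under the swo)
// out of the per-thread buffers into new per-thread vectors; items for
// later buckets stay behind. O(n) over all buffered items.
std::vector<std::vector<int>>
partition_next_class(std::vector<std::vector<int>>& per_thread,
                     int global_min, const DeltaSWO& swo) {
  std::vector<std::vector<int>> next(per_thread.size());
  for (std::size_t t = 0; t < per_thread.size(); ++t) {
    auto& buf = per_thread[t];
    // Predicate true: wi is strictly after the minimum bucket, keep it.
    // Predicate false: wi is equivalent to the minimum, move it out.
    auto mid = std::partition(buf.begin(), buf.end(), [&](int wi) {
      return swo(global_min, wi);
    });
    next[t].assign(mid, buf.end());
    buf.erase(mid, buf.end());
  }
  return next;
}
```

Since global_min is the minimum, nothing orders strictly before it, so the two-way partition is enough.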
• 49. Data Structure: Summary (pre-order processing function, SSSP delta ordering, two nodes, 16 shared-memory parallel threads, Graph500 scale 24)

  Data Structure                  | Execution Time (sec.)
  Linked List (Atomics)           | ~61
  Binary Search Trees with Locks  | ~82
  Concurrent SkipList             | ~89
  Partitioning Scheme             | ~44
• 50. Mapping AGM to an Implementation. Q1. How should we do the ordering? Q2. How should we arrange the processing function and the ordering? (Abstract model → actual implementation, via a communication method.)
• 51. Communication Paradigms • Push communication: ranks push work items to their owners (e.g., R1 pushes work items to R0). • Pull communication: ranks pull work items from other ranks (e.g., R0 pulls work items from R1). • GAS (Gather-Apply-Scatter) communication: each rank gathers, applies, and then scatters.
• 52. Placing the Processing Function: Pre-Order & Post-Order • Pre-order: after receiving a work item, the processing function is executed and the generated work items are inserted into the data structure for ordering. • Post-order: after receiving a work item, it is first inserted into the data structure for ordering and processed later.
• 53. Placing the Processing Function: Split-Order • The processing function contains logic for updating state and for generating new work. • We split the processing function into two functions: a state-update function and a work-generation function. • On receiving a work item, insert it for ordering only if the state was updated; generate new work only if the state has not changed since. • This allows us to prune work further.
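The split-order idea can be sketched for SSSP as below. This is a minimal illustration, not the framework's actual code (the struct and method names are invented): the state-update part runs when a work item arrives, and the work-generation part runs when the item is popped from the ordering data structure.

```cpp
#include <utility>
#include <vector>

// Illustrative split-order processing function for SSSP.
using WorkItem = std::pair<int, int>; // (vertex, distance)

struct SplitSSSP {
  std::vector<int> dist; // per-vertex tentative distances

  // Runs on receipt: returns true (i.e., insert the item for ordering)
  // only if it improved the distance. Stale work items are pruned
  // before they are even ordered.
  bool update_state(const WorkItem& wi) {
    if (wi.second < dist[wi.first]) {
      dist[wi.first] = wi.second;
      return true;
    }
    return false;
  }

  // Runs when the item is popped: generate new work only if the stored
  // state still matches the work item, i.e., no better distance arrived
  // while the item was waiting in the ordering data structure.
  bool should_generate(const WorkItem& wi) const {
    return dist[wi.first] == wi.second;
  }
};
```

The pruning happens twice: once on arrival (no insert for stale items) and once on pop (no edge relaxations for superseded items), which is why split-order inserts and processes far fewer work items in the summary table that follows.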
• 54. Placing the Processing Function: Summary

  Ordering    | Run Time | Total Inserts | Additional WorkItems Processed
  Split-order | 14.54    | 17139284      | 21334
  Pre-order   | 62.55    | 409388039     | 2469025
  Post-order  | 44.09    | 398344581     | 2284918

  Split-order has the best timing: the total number of work items pushed into the data structure is much smaller than in the other two implementations, and the additional relaxations are far fewer because of pruning. Post-order is slightly better than pre-order because of reduced contention.
• 55. Engineering an EAGM
• 56. EAGM Spatial Levels [Diagram: memory hierarchy with four spatial levels -- Global, Node (N0-N2), NUMA (numa0/numa1 per node), and Thread (T0-T5 per node).]
• 57. EAGM Execution

    // SSSP (AGM) algorithm
    typedef agm<Graph, WorkItem, ProcessingFunction,
                StrictWeakOrdering, RuntimeModelGen> sssp_agm_t;
    sssp_agm_t ssspalgo(rtmodelgen, ordering, pf, initial);

    // SSSP (EAGM) algorithm
    typedef eagm<Graph, WorkItem, ProcessingFunction,
                 EAGMConfig, RuntimeModelGen> sssp_eagm_t;
    sssp_eagm_t ssspalgo(rtmodelgen, config, pf, initial);

  The AGM takes a single strict weak ordering relation; the EAGM takes a hierarchy of orderings (EAGMConfig) that corresponds to the memory hierarchy -- e.g., global Delta (=5) ordering, node-level chaotic ordering, NUMA-level chaotic ordering, and thread-level chaotic ordering.
• 58. EAGMConfig

    // strict weak orderings
    //================== Chaotic =====================
    struct chaotic : public base_ordering {
    public:
      static const eagm_ordering ORDERING_NAME = eagm_ordering::enum_chaotic;
      template <typename T>
      bool operator()(T i, T j) { return false; }
      eagm_ordering name() { return ORDERING_NAME; }
    };

  Under the chaotic ordering, no two work items are comparable to each other.

    CHAOTIC_ORDERING_T ch;
    DELTA_ORDERING_T delta(agm_params.delta);
    auto config = boost::graph::agm::create_eagm_config(delta, ch, ch, ch);

  The arguments are, in order: the global delta ordering, the node-level chaotic ordering, the NUMA-level chaotic ordering, and the thread-level chaotic ordering.
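Why chaotic means "no ordering at all" follows directly from the definition of equivalence under a strict weak ordering: two items are equivalent when neither orders before the other, and chaotic's comparator returns false for every pair. A quick check of that property, with integers standing in for work items:

```cpp
// Chaotic comparator as on the slide: every comparison is false.
struct Chaotic {
  template <typename T>
  bool operator()(T, T) const { return false; }
};

// Equivalence under a strict weak ordering: neither orders before the other.
template <typename SWO, typename T>
bool equivalent(const SWO& swo, const T& a, const T& b) {
  return !swo(a, b) && !swo(b, a);
}
```

Since every pair is equivalent, a chaotic level always collapses into a single equivalence class, which is exactly what the static optimizations later exploit.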
• 59. EAGM Execution [Diagram: execution timeline over nodes N0-N2 (each with numa0/numa1 and threads T0-T5), showing equivalence classes separated by barrier synchronizations.] With this configuration, execution is the same as the AGM: globally synchronous, and asynchronous at the process, NUMA, and thread levels.
• 60. EAGM Execution • Globally asynchronous execution: a single equivalence class at the global level. • Node-level synchronous execution: equivalence classes at the process level, with local thread barriers. • Execution is asynchronous at the NUMA and thread levels.
• 61. EAGM Execution [Diagram: timeline over nodes N0-N2 showing nested global, node, NUMA, and thread equivalence classes with global, NUMA, and thread barriers; spatial ordering on one axis, temporal ordering on the other.] • If we have non-chaotic orderings defined for every memory level, we obtain both a spatial ordering and a temporal ordering.
• 62. EAGM Implementation • The heart of the EAGM implementation is a nested data structure that holds equivalence classes: global-level classes contain node-level classes, which in turn contain NUMA-level classes.
  • 63. . . . . . . . . . . . . . . . . . . Time EAGM Implementation: Data Structure Abstract Graph Machine 64 Order by Delta=10 Order by K=2 Order by Delta=5 E.g.,
• 64. EAGM Implementation: Static Optimizations • A chaotic ordering always creates a single equivalence class; therefore we remove the nested levels for chaotic orderings, yielding an optimized data structure. • These optimizations are performed statically at compile time with the help of template meta-programming and template specialization.
• 65. EAGM Implementation: Static Optimizations • Remove nested levels for chaotic orderings. (After this reduction, the configuration is really an AGM.)
• 66. EAGM Implementation: Static Optimizations
• 68. Single Source Shortest Paths: Pre/Post/Split Orderings • Weak scaling results carried out on 2 Broadwell 16-core Intel Xeon processors; 2^30 vertices and 2^34 edges; shared-memory execution. • Split-order is ~5 times faster than pre-order and post-order.
• 69. Single Source Shortest Paths: Weak Scaling • Inputs: a Graph500 graph, the Graph500-proposed SSSP graph, and an Erdos-Renyi graph. • Framework algorithms show better scaling than PowerGraph. • Thread-level Dijkstra ordering shows better scaling behavior for power-law graphs, while globally synchronous delta-stepping scales better for ER graphs.
• 70. Single Source Shortest Paths: Strong Scaling (BR II+) • Relative speedup = time for the fastest sequential algorithm / parallel execution time on n PEs. • Synchronization overhead becomes significant when executing on a higher number of nodes. • Just ordering by level does not reduce the work, but applying Dijkstra ordering at the thread level gives better performance.
• 71. SSSP More Results: Weak Scaling (BR II) • Both PowerGraph and PBGL-1 fail to scale well at higher scales.
• 72. Breadth First Search: Weak Scaling • K(2) global ordering and level global ordering reduce more redundant work compared to thread-level ordering.
• 73. BFS: In-node Performance (Weak Scaling) • Comparing global-level synchronous vs. node-level synchronous: within a node both configurations execute in the same way, so their performance is similar.
• 74. BFS: Strong Scaling • Global-level synchronous versions show better speedup than thread-level synchronous versions.
• 75. BFS: Road Networks • Strong scaling with road networks. Road networks have a high diameter (~850-900); therefore, the global-level synchronous version shows poor scaling because of synchronization overhead, compared to the globally asynchronous, thread-level synchronous version.
• 76. Other Graph Applications • Connected Components • Maximal Independent Set • Triangle Counting
• 77. Conclusions • Using the AGM abstraction, we showed that we can generate families of algorithms by keeping the processing function intact and changing only the ordering. • With the EAGM we can achieve both spatial and temporal orderings. • We discussed the challenges in mapping the AGM framework to an implementation and described how to avoid them. • We produced an efficient implementation of the AGM and EAGM frameworks; weak and strong scaling results show that AGM/EAGM algorithms outperform graph algorithms in other frameworks.
• 78. Thank You!

Editor's Notes

  1. Work within each bucket is further ordered using thread local priority queues (threadq).
  2. Interesting duality: expressing the algorithm with the AGM lets us pick an ordering which exposes parallelizability, but once parallelized, we can order in different ways. Usually what makes sense is to start as loosely as possible and then add more ordering.