Galois: A System for Parallel
Execution of Irregular Algorithms
Donald Nguyen
The University of Texas at Austin
Parallel Programming is Changing
• Data
  – Structured (vector, matrix) versus unstructured (graphs)
• Platforms
  – Dedicated clusters versus cloud, mobile
• People
  – Small number of highly trained scientists versus large number of self-trained parallel programmers
The Search for Good Programming Models
• Tension between productivity and performance
  – Support a large number of application programmers ("Joes") with a small number of expert parallel programmers ("Stephanies")
  – Performance comparable to hand-written codes
• Galois project
Outline
• Parallel Program = Operator + Schedule + Parallel Data Structure
• Galois system implementation
  – Operator, data structures, scheduling
• Galois studies
  – Intel study
  – Deterministic scheduling
  – Adding a scheduler: sparse tiling
  – Comparison with transactional memory
  – Implementing other programming models
SSSP
• Find the shortest distance from the source node to all other nodes in a graph
  – Label nodes with a tentative distance
  – Assume non-negative edge weights
  – BFS is a special case
• Algorithms
  – Chaotic relaxation O(2^V)
  – Bellman-Ford O(VE)
  – Dijkstra's algorithm O(E log V)
    • Uses a priority queue
  – Δ-stepping
    • Uses a sequence of bags to prioritize work
    • Δ=1: Dijkstra
    • Δ=∞: Chaotic relaxation
• Different algorithms are different schedules for applying relaxations
  – Improved work efficiency
  – Input sensitivity
  – Locality
[Figure: small weighted graph with the active node and its neighborhood highlighted; operator: relaxation]
Algorithm = Operator + Schedule
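To make the "algorithm = operator + schedule" point concrete, here is a minimal sequential sketch (plain C++, not the Galois API): the relaxation operator is fixed, and swapping the worklist container from a FIFO to a priority queue keyed on distance switches the schedule from Bellman-Ford-style to Dijkstra-style processing.

#include <climits>
#include <queue>
#include <vector>

struct Edge { int dst, weight; };

// Relaxation operator is fixed across all SSSP variants; the container
// holding active nodes determines the schedule.
void sssp(const std::vector<std::vector<Edge>>& g, int source,
          std::vector<int>& dist) {
  dist.assign(g.size(), INT_MAX);
  dist[source] = 0;
  // FIFO schedule (Bellman-Ford flavor); replacing this queue with a
  // priority queue ordered by dist[] gives Dijkstra's schedule.
  std::queue<int> wl;
  wl.push(source);
  while (!wl.empty()) {
    int n = wl.front(); wl.pop();
    for (const Edge& e : g[n]) {
      if (dist[n] + e.weight < dist[e.dst]) { // relax edge (n, e.dst)
        dist[e.dst] = dist[n] + e.weight;
        wl.push(e.dst);                       // newly active node
      }
    }
  }
}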
Stencil Computation
• Finite difference computation
  – Active nodes: nodes in A_{t+1}
  – Operator: 5-point stencil
  – Different schedules have different locality
• Regular application
  – Grid structure and active nodes known statically
  – Can be parallelized by a compiler
[Figure: grids A_t and A_{t+1}]
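As a concrete illustration (a minimal sketch, not taken from the slides), one time step of a 5-point stencil computes each cell of A_{t+1} from the corresponding cell and its four grid neighbors in A_t; whether the loop nest sweeps row-by-row or tile-by-tile changes locality but not the operator.

#include <vector>

// One time step: every interior cell of 'next' (A_{t+1}) is the average of
// its own value and its four neighbors in 'cur' (A_t).
void step(const std::vector<std::vector<double>>& cur,
          std::vector<std::vector<double>>& next) {
  const size_t R = cur.size(), C = cur[0].size();
  for (size_t i = 1; i + 1 < R; ++i)      // schedule: row-major sweep;
    for (size_t j = 1; j + 1 < C; ++j)    // a tiled loop nest applies the
      next[i][j] = 0.2 * (cur[i][j] +     // same operator with better locality
                          cur[i - 1][j] + cur[i + 1][j] +
                          cur[i][j - 1] + cur[i][j + 1]);
}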
Abstraction of Algorithms
• Operator formulation
  – Active elements: nodes or edges where there is work to be done
  – Operator: computation at an active element
    • Activity: application of the operator to an active element
    • Neighborhood: graph elements read or written by an activity
  – Ordering: order in which active elements must appear to have been processed
    • Unordered algorithms: any order is fine (e.g., chaotic relaxation) but may have soft priorities
    • Ordered algorithms: algorithm-specific order (e.g., discrete event simulation)
  – Parallelism: process activities in parallel subject to neighborhood and ordering constraints
    • Data parallelism is a special case (all nodes active, no conflicts)
[Figure: graph with active nodes and their neighborhoods highlighted]
TAO of Parallelism
Algorithms are classified along three dimensions:
• Topology: structured, semi-structured, unstructured
• Active nodes
  – Location: topology-driven, data-driven
  – Ordering: unordered, ordered
• Operator: morph, local computation, reader
Pingali, Nguyen, et al. The tao of parallelism in algorithms. PLDI 2011.
TAO of Parallelism
[Figure: binding time of schedules — Static, Inspector-Executor, Bulk-Synchronous (round-based), Optimistic (Galois)]
Pingali, Nguyen, et al. The tao of parallelism in algorithms. PLDI 2011.
Galois Project
Program = Algorithm + Data Structure
Parallel Program = Operator + Schedule + Parallel Data Structure
• Joe: Operator + Schedule
• Stephanie: Parallel Data Structures
My Contribution
Design and implementation of a programming model for high-performance parallelism by factoring programs into operator + schedule + data structure
Themes
• Joe
  – Program = Operator + Schedule + Data Structure
• Stephanie
  – Scalability principle 1: disjoint access
    • Accesses to distinct logical entities should be disjoint physically as well
    • "Do no harm"
  – Scalability principle 2: virtualize tasks
    • Execution of tasks should not be eagerly tied to particular physical resources
    • "Be flexible"
• Challenges
  – Easy to introduce inadvertent sharing if not looking at the entire system stack
  – Small task sizes (~100s of instructions)
Publications
• M. Amber Hassaan, Donald Nguyen, Keshav Pingali. Kinetic dependence graphs (to appear). ASPLOS 2015.
• Konstantinos Karantasis, Andrew Lenharth, Donald Nguyen, et al. Parallelization of reordering algorithms for bandwidth and wavefront reduction. SC 2014.
• Damian Goik, Konrad Jopek, Maciej Paszyński, Andrew Lenharth, Donald Nguyen, and Keshav Pingali. Graph grammar based multi-thread multi-frontal direct solver with Galois scheduler. ICCS 2014.
• Donald Nguyen, Andrew Lenharth, Keshav Pingali. Deterministic Galois: On-demand, parameterless and portable. ASPLOS 2014.
• Donald Nguyen, Andrew Lenharth, Keshav Pingali. A lightweight infrastructure for graph analytics. SOSP 2013.
• Keshav Pingali, Donald Nguyen, Milind Kulkarni, et al. The tao of parallelism in algorithms. PLDI 2011.
• Donald Nguyen, Keshav Pingali. Synthesizing concurrent schedulers for irregular algorithms. ASPLOS 2011.
• Milind Kulkarni, Donald Nguyen, Dimitrios Prountzos, et al. Exploiting the commutativity lattice. PLDI 2011.
• Xin Sui, Donald Nguyen, Martin Burtscher, Keshav Pingali. Parallel graph partitioning on multicore architectures. LCPC 2010.
• Mario Méndez-Lojo, Donald Nguyen, Dimitrios Prountzos, et al. Structure-driven optimizations for amorphous data-parallel programs. PPoPP 2010.
• Shih-wei Liao, Tzu-Han Hung, Donald Nguyen, et al. Machine learning-based prefetch optimization for data center applications. SC 2009.
Galois Programming Model
• Galois system (Stephanie)
  – Runtime and library of concurrent data structures
• Galois system (Joe)
  – Set iterators express parallelism
  – Operator: body of the iterator
  – ADT (e.g., graph, set)
• Unordered execution
  – Could have conflicts
  – Some serialization of iterations
  – New iterations can be added
• Ordered execution
  – See Hassaan, Nguyen and Pingali. Kinetic dependence graphs (to appear). ASPLOS 2015.

Graph g
for_each (Edge e : wl)
  Node n = g.getEdgeDst(e)
  …
Hello Graph

#include "Galois/Galois.h"
#include "Galois/Graph/Graph.h"

// Data structure declarations
typedef Galois::Graph::LC_CSR_Graph<int, int> Graph;
typedef Graph::GraphNode GNode;
Graph graph;

// Operator: the neighborhood is defined by the graph API calls it makes
struct P {
  void operator()(GNode& n, Galois::UserContext<GNode>& c) {
    int& d = graph.getData(n);
    if (d++ < 5)
      c.push(n); // new work can be added during execution
  }
};

int main(int argc, char** argv) {
  Galois::Graph::readGraph(graph, argv[1]);
  …
  // Galois iterator: sequential program semantics
  Galois::for_each(initial, P(), Galois::wl<WL>());
  return 0;
}

Scheduling DSL (see the composed example below):

struct TaskIndexer {
  int operator()(GNode& n) {
    return graph.getData(n) >> shift;
  }
};
using namespace Galois::WorkList;
typedef dChunkedFIFO<> WL;                      // chunked FIFO policy, or:
typedef OrderedByIntegerMetric<TaskIndexer> WL; // soft priorities
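The scheduling DSL composes policies. As a hedged illustration (the exact template parameters are an assumption based on the Galois 2.x worklist API, not shown on the slide), a Δ-stepping-like schedule can be built by nesting a distributed chunked FIFO inside OrderedByIntegerMetric:

using namespace Galois::WorkList;
// Assumed composition: OBIM buckets tasks by TaskIndexer and serves each
// bucket with a distributed chunked FIFO.
typedef OrderedByIntegerMetric<TaskIndexer, dChunkedFIFO<64>> DeltaWL;
Galois::for_each(initial, P(), Galois::wl<DeltaWL>());

The operator P is unchanged; only the worklist type, and hence the schedule, differs.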
Galois Operator

Graph g
for_each (Node n : wl)
  for (Node m : g.edges(n))
    …

• All data accesses go through the API
  – Galois data structure implementations maintain marks
  – When marks overlap, roll back the activity and try later
• Cautious
  – Commonly, operators read their entire neighborhood before modifying any element
  – Failsafe point: point in the operator before the first write
• Non-deterministic
• Difference from Thread-Level Speculation [Rauchwerger and Padua, PLDI 1995]
  – Iterations are unordered
  – Operators are cautious
  – Conflicts are detected at the ADT level rather than the memory level
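A minimal sketch of the mark mechanism (plain C++ with atomics; the actual Galois implementation works through the data-structure API): an activity claims every element in its neighborhood before writing, and if some element is already claimed by another activity, it releases its marks and retries later.

#include <atomic>
#include <vector>

constexpr int FREE = -1;

struct Element { std::atomic<int> mark{FREE}; };

// Try to claim every element in the activity's neighborhood for 'task'.
// Returns false (after releasing everything) if another task holds a mark.
bool acquireNeighborhood(std::vector<Element*>& nbr, int task) {
  for (size_t i = 0; i < nbr.size(); ++i) {
    int expected = FREE;
    if (!nbr[i]->mark.compare_exchange_strong(expected, task) &&
        expected != task) {          // conflict: another activity owns it
      for (size_t j = 0; j < i; ++j) // roll back: release what we took
        nbr[j]->mark.store(FREE);
      return false;                  // retry this activity later
    }
  }
  return true; // safe to proceed past the failsafe point
}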
Galois Data Structures
• Graphs
  – May require multiple loads to access the data for a node
  – Exploit local computation structure
  – Simple allocation schemes may be imbalanced in physical memory
• CSR representation: node data[N], edge idx[N], edge data[E], edge dst[E]
  (nd: node data, ed: edge data, dst: edge destination, len: # of edges)
• Inlining (Andrew Lenharth): store a node's edge data and destinations adjacent to its node data (nd, len, ed, dst, …) to reduce loads per node (a CSR sketch follows below)
• Placement: spread virtual memory pages across physical memories (DRAM 0, DRAM 1)
[Figure: CSR layout, edge-inlined layout, and page placement across memories]
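For reference, a minimal CSR graph in plain C++ (an illustration of the layout described above, not the Galois classes):

#include <cstdint>
#include <vector>

// Compressed sparse row graph: the edges of node n occupy the half-open
// range [edgeIdx[n], edgeIdx[n+1]) of the edge arrays.
struct CSRGraph {
  std::vector<int>      nodeData; // nd: one entry per node
  std::vector<uint64_t> edgeIdx;  // N+1 offsets into the edge arrays
  std::vector<uint32_t> edgeDst;  // dst: destination node of each edge
  std::vector<int>      edgeData; // ed: one entry per edge

  // Visit all out-edges of n; the inlined layout on the slide keeps this
  // data next to nodeData[n] instead of in separate arrays.
  template <typename F>
  void forEachEdge(uint32_t n, F f) const {
    for (uint64_t e = edgeIdx[n]; e < edgeIdx[n + 1]; ++e)
      f(edgeDst[e], edgeData[e]);
  }
};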
Scheduling in Galois
• Users write high-level scheduling policies
  – DSL for specifying policies
• Separation of concerns
  – Users understand their operator but have only a vague understanding of scheduling
  – Schedulers are difficult to implement efficiently
• The Galois system synthesizes a scheduler by composing elements from a scheduler library
Nguyen and Pingali. Synthesizing concurrent schedulers for irregular algorithms. ASPLOS 2011.
Scheduling Components and Policies
• Components: FIFO, LIFO, BulkSynchronous, ChunkedFIFO, OrderedByMetric, …
• A worked example of one metric follows the table.

Algorithm | Scheduling Policy                                    | Used By
DMR       | OrderedByMetric(λt.minangle(t)) FIFO                 | Shewchuk96
          | Global: ChunkedFIFO(k), Local: LIFO                  | Kulkarni08
DT        | OrderedByMetric(λp.round(p)) FIFO                    | Amenta03
          | Random                                               | Clarkson89
PFP       | FIFO                                                 | Goldberg88
          | OrderedByMetric(λn.height(n))                        | Cherkassy95
PTA       | FIFO                                                 | Pearce03
          | BulkSynchronous                                      | Nielson99
SSSP      | FIFO                                                 | Bellman-Ford
          | OrderedByMetric(λn.w(n)/Δ + (light(n) ? 0 : ½)) FIFO | Δ-stepping
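As a small worked example of the Δ-stepping metric from the table (an illustrative sketch; the function name is hypothetical): the metric maps each node to an integer bucket, and doubling the index keeps the half-bucket offset for heavy requests in integer arithmetic.

// Bucket index for Δ-stepping: λn.w(n)/Δ + (light(n) ? 0 : ½),
// scaled by 2 so the ½ offset stays integral.
int deltaSteppingIndex(int distance, bool light, int Delta) {
  return 2 * (distance / Delta) + (light ? 0 : 1);
}

Lower indices are drained first, so light requests in a bucket are processed before heavy ones.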
Scheduling Bag
• Abstraction: an unordered bag of tasks
• Naïve implementation: one shared list with a single Head pointer → serialization
• Per-thread lists remove the serialization but can suffer load imbalance
• Per-package bag (Andrew Lenharth): one list per processor package, shared by the cores (c1, c2) on that package, balancing the serialization of a single list against the imbalance of per-thread lists (a simplified sketch follows)
[Figure: bag abstraction; single-headed list; per-thread lists; per-package lists]
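A hedged sketch of the per-package idea in plain C++ (the real Galois bag is lock-free and chunk-based; this simplification uses one mutex-protected deque per package):

#include <deque>
#include <mutex>
#include <optional>
#include <vector>

// One deque per processor package; cores on a package share its deque,
// so contention is limited to package-local cores.
template <typename T>
class PerPackageBag {
  struct Shard { std::mutex lock; std::deque<T> items; };
  std::vector<Shard> shards;
public:
  explicit PerPackageBag(size_t packages) : shards(packages) {}

  void push(size_t pkg, T v) {
    std::lock_guard<std::mutex> g(shards[pkg].lock);
    shards[pkg].items.push_back(std::move(v));
  }

  // Try the local package first; fall back to other packages if empty.
  std::optional<T> pop(size_t pkg) {
    for (size_t i = 0; i < shards.size(); ++i) {
      Shard& s = shards[(pkg + i) % shards.size()];
      std::lock_guard<std::mutex> g(s.lock);
      if (!s.items.empty()) {
        T v = std::move(s.items.back());
        s.items.pop_back();
        return v;
      }
    }
    return std::nullopt;
  }
};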
OrderedByIntegerMetric
• Implements soft priorities
• Array of scheduling bags indexed by priority, with a shared cursor at the current priority
• Drawbacks
  – Dense range of priorities
  – Contention on the cursor location
[Figure: priority array 0–5 with the cursor at 3, polled by two cores]
Improved OrderedByIntegerMetric (implementation by Andrew Lenharth)
• Global map from priority values to scheduling bags
• Per-core local maps cache priority-to-bag lookups, reducing contention on the global map and cursor
[Figure: animation of cores consulting their local maps and the global map as work at priorities 0, 1, 3, 5 arrives]
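A simplified single-map sketch of the idea (plain C++; the real implementation is concurrent, with the per-core local maps described above):

#include <map>
#include <optional>
#include <vector>

// Soft-priority scheduler: a map from priority to a bag of tasks, drained
// from the lowest priority. Priorities are only a hint, so tasks may be
// executed out of strict priority order.
template <typename T>
class OBIM {
  std::map<int, std::vector<T>> bags; // priority -> bag
public:
  void push(int prio, T v) { bags[prio].push_back(std::move(v)); }

  std::optional<T> pop() {
    while (!bags.empty()) {
      auto it = bags.begin();         // lowest priority first
      if (it->second.empty()) { bags.erase(it); continue; }
      T v = std::move(it->second.back());
      it->second.pop_back();
      return v;
    }
    return std::nullopt;
  }
};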
Galois System
• Joe
  – Writes sequential code
  – Uses Galois data structures and scheduling hints
• Stephanie
  – Writes concurrent code
  – Provides a variety of data structures and scheduling policies for Joe to choose from
• cf. frameworks that only support one scheduler or one graph implementation
Intel Study
[Figure (Nadathur et al., SIGMOD 2014, Figure 3): performance of Native, CombBLAS, GraphLab, SociaLite, Giraph, and Galois on workloads small enough to run on a single node — (a) PageRank, (b) breadth-first search, (c) collaborative filtering, (d) triangle counting — over the LiveJournal, Facebook, Wikipedia, Netflix, and synthetic graphs. The y-axis is runtime in log scale, so lower is better.]
Nadathur et al. Navigating the maze of graph analytics frameworks. SIGMOD 2014.
Deterministic Interference Graph (DIG) Scheduling
• Traditional Galois execution
  – Asynchronous, non-deterministic
• Deterministic Galois execution
  – Round-based execution
  – Construct an interference graph
  – Execute an independent set
  – Repeat
[Figure: tasks A, B, C over a data structure and the interference graph they induce]
Nguyen, Lenharth and Pingali. Deterministic Galois: on-demand, parameterless and portable. ASPLOS 2014.
DIG Scheduling in Practice
• Just find roots in the interference graph
• Sequence of rounds; each round has two phases (a sketch of the two phases follows)
  • Phase 1: mark neighborhoods
    – Acquire marks by writing the task id atomically, until the failsafe point
    – Overwrite an id only if replacing a greater value
    – Final mark values are deterministic
  • Phase 2: execute roots
    – Execute tasks whose neighborhoods contain only their own marks
[Figure: tasks with ids A < B < C writing marks into overlapping neighborhoods; the smaller id wins]
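A schematic sketch of one round (plain C++ with atomics; the real system marks through the data-structure API). Phase 1 writes the minimum task id into every element of each task's neighborhood; phase 2 executes exactly the tasks that own all of their marks, which is a deterministic set:

#include <atomic>
#include <climits>
#include <vector>

struct Elem { std::atomic<int> mark{INT_MAX}; };
struct Task { int id; std::vector<Elem*> nbr; };

// Phase 1: each element ends up marked with the smallest id that touches it.
void markPhase(std::vector<Task>& tasks) {
  for (Task& t : tasks)            // run in parallel in the real system
    for (Elem* e : t.nbr) {
      int cur = e->mark.load();
      while (t.id < cur &&         // overwrite only greater values
             !e->mark.compare_exchange_weak(cur, t.id)) {}
    }
}

// Phase 2: a task is a root iff it holds every mark in its neighborhood.
bool isRoot(const Task& t) {
  for (Elem* e : t.nbr)
    if (e->mark.load() != t.id) return false;
  return true;
}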
Optimizations
• Resuming tasks
  – Avoid re-executing tasks
  – Suspend and resume execution at the failsafe point
  – Provide buffers to save local state
• Windowing
  – Inspect only a subset of tasks at a time
  – An adaptive algorithm varies the window size
Evaluation of Deterministic Scheduling
Platform
• Intel 4x10-core (2.27 GHz)
Deterministic systems
• RCDC [Devietti11]
  – Deterministic by scheduling
  – Implementation of the Kendo algorithm [Olszewski09]
• PBBS [Blelloch11]
  – Deterministic by construction
  – Handwritten deterministic implementations
• Deterministic Galois
  – Non-deterministic programs (g-n) automatically made deterministic (g-d)
Applications (from PBBS)
• Breadth-first search (bfs)
• Delaunay mesh refinement (dmr)
• Delaunay triangulation (dt)
• Maximal independent set (mis)
Evaluation
[Figure: speedup versus threads (1–40) for bfs, dmr, dt, and mis, comparing the non-deterministic Galois program (g-n), the deterministic Galois program (g-d), PBBS (deterministic by construction), and RCDC (deterministic by scheduling)]
Third-party Evaluation
• In FPGA routing it is useful to have deterministic results

Maze router speedup (averages):
                    Deterministic Scheduler | Non-Deterministic Scheduler
VPR 5.0                       1.00x         |           1.00x
Galois (1 Thread)             0.79x         |           0.89x
Galois (2 Threads)            1.21x         |           1.47x
Galois (4 Threads)            2.14x         |           2.99x
Galois (8 Threads)            3.67x         |           5.46x

Extended results from: Moctar and Brisk. Parallel FPGA Routing based on the Operator Formulation. DAC 2014.
Matrix Completion
• Matrix completion problem
  – Netflix problem
• An optimization problem
  – Find an m×k matrix W and a k×n matrix H (k << min(m,n)) such that A ≈ WH
  – Low-rank approximation
• Algorithms
  – Stochastic gradient descent (SGD)
  – …
[Figure: matrix view (A ≈ WH, with rating A_ij predicted by W_i* · H_*j) and graph view (bipartite graph between users and movies)]
Naïve SGD Implementation
• Row-wise operator

for (User u : users)
  for (Item m : items(u))
    op(u, m, Aij);

• Poor cache behavior
  – With k = 100, W_i* ≈ 1KB
  – degree(user) > 1000
2D Tiled SGD
• Apply the operator to small 2D tiles (a sketch of the tiled loop structure follows)
• Additional concerns
  – Conflict-free scheduling of tiles
• Future optimizations
  – Adaptive, non-uniform tiling
[Figure: A partitioned into tiles, each touching a block of W's rows and H's columns]
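A schematic of the tiled loop structure (plain C++; a sketch of the idea, not the Galois implementation). Tiles on the same anti-diagonal touch disjoint rows of W and disjoint columns of H, so they can run conflict-free in parallel:

// Process ratings tile by tile; one SGD update per rating (u, m).
// Tiles (ti, tj) and (ti', tj') conflict only if ti == ti' (shared W rows)
// or tj == tj' (shared H columns), so each anti-diagonal of the tile grid
// is a set of independent tasks.
void tiledSGD(int userTiles, int itemTiles, int tileSize) {
  for (int d = 0; d < userTiles + itemTiles - 1; ++d) // anti-diagonals
    for (int ti = 0; ti < userTiles; ++ti) {
      int tj = d - ti;
      if (tj < 0 || tj >= itemTiles) continue;
      // Tile (ti, tj) is an independent task within this diagonal:
      // apply the SGD update to every rating A(u, m) with
      // u in [ti*tileSize, (ti+1)*tileSize) and
      // m in [tj*tileSize, (tj+1)*tileSize).
    }
}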
Evaluation of Tiling
Parameters
• Same SGD parameters and initial latent vectors across implementations
• Step size: ε(n) = α / (1 + β·n^1.5)
• Hand-tuned tile size
Implementations
• Galois
  – 2D scheduling (2 weeks to implement)
  – Synchronized updates
• GraphLab
  – Standard GraphLab only supports GD
  – Implements 1D SGD with unsynchronized updates
• Nomad
  – From-scratch distributed code (~1 year to implement)
  – 2D scheduling; tile size is a function of #threads
  – Synchronized updates
Datasets
          Items | Users | Ratings | Sparsity
netflix   17K   | 480K  | 99M     | 1e-2
yahoo     624K  | 1M    | 253M    | 4e-4
Yun, Yu, Hsieh, Vishwanathan, Dhillon. NOMAD: Non-locking stOchastic Multi-machine algorithm for Asynchronous and Decentralized matrix completion. VLDB 2013.
Throughput
[Figure: GFLOPS versus threads (up to 40) on netflix and yahoo for Galois, GraphLab, and Nomad; theoretical peak: 363.2 GFLOPS]
Convergence
[Figure: error versus time on netflix (0–200 s) and yahoo (0–600 s) for Galois, GraphLab, and Nomad]
Transactional Memory
• Alternative to adopting a new programming model
  – Parallelization reusing existing sequential programs
• Transactional memory
  – Add threads and transactions but keep existing sequential code
  – TM system ensures correct execution of atomic regions
    • Software and hardware implementations
  – Speculative execution

Sequential Program:
while (begin != end)
  begin += 1
  while (head)
    ...
    head = next
  root.size += 1

TM Program:
fork()
while (begin != end)
  atomic
    begin += 1
  atomic
    while (head)
      ...
      head = next
    root.size += 1
join()
Stampede
• STAMP [CaoMinh08]
  – Set of real-world applications using TM
  – Historically, performance has been poor
• Stampede benchmarks
  – Re-implemented STAMP in Galois
  – Compared to TinySTM
  – Median improvement ratio over STAMP of 4.4X at 64 threads (max: 37X, geomean: 3.9X)
• Blue Gene/Q; median of 5 runs
[Figure: speedup versus threads (up to 64) for genome, kmeans-low, ssca2, vacation-low, and yada; Galois versus STM]
Nguyen and Pingali. What scalable transactional programs need from transactional memory. In submission to PLDI 2015.
Stampede Performance
• Controlling communication is paramount
  – The chief performance cost is moving bits around
• Scalability principle 1: disjoint access
  – Recall the implementations of the Bag and the OrderedByIntegerMetric scheduler
• Scalability principle 2: virtualize tasks
  – Be as flexible as possible under dynamic conditions
• These changes are beyond the capability of TM
  – Use Galois data structures (disjoint access)
  – Use the Galois scheduler (virtualized tasks)
DSLs
• Layer other DSLs on top of Galois
  – Other programming models are more restrictive than the operator formulation
• Bulk-Synchronous Vertex Programs (see the sketch after this list)
  – Nodes are active
  – Operator is over a node and its neighbors
  – Execution is in rounds
    • Read values are taken from the previous round
    • Overlapping writes are reduced to a single value
  – Ex: Pregel [Malewicz10], Ligra [Shun13]
• Gather-Apply-Scatter
  – Refinement of bulk-synchronous into subrounds
  – Ex: GraphLab [Low10] / PowerGraph [Gonzalez12]
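A minimal sketch of the bulk-synchronous vertex-program pattern (plain C++; an illustration of the model, not any framework's API). Reads come from the previous round's array and writes go to the next round's, so all vertices in a round are trivially parallel:

#include <vector>

struct VGraph { std::vector<std::vector<int>> inNbrs; };

// One bulk-synchronous round: each node's new value is reduced (here: min)
// over its own old value and its in-neighbors' old values.
std::vector<int> round(const VGraph& g, const std::vector<int>& prev) {
  std::vector<int> next(prev);             // writes target the next round
  for (size_t v = 0; v < prev.size(); ++v) // every vertex active
    for (int u : g.inNbrs[v])
      if (prev[u] < next[v]) next[v] = prev[u]; // reduce overlapping writes

  return next;
}

With values initialized to distinct node ids, iterating this round computes connected components by label propagation; on a chain of length d it needs d rounds, which is the high-diameter overhead discussed below.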
Insufficiency of Vertex Programs
• Connected components problem
  – Construct a labeling function such that color(x) = color(neighbor(x))
• Algorithms
  – Union-Find-based
  – Label propagation
  – Hybrids
[Figure: pointer jumping shortening a chain a → b → c]
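For contrast, a minimal union-find with path compression (plain C++; an illustrative sketch): the find step's pointer jumping mutates shared parent pointers across arbitrary nodes, which does not fit the one-vertex-and-its-neighbors shape of a vertex program:

#include <numeric>
#include <vector>

struct UnionFind {
  std::vector<int> parent;
  explicit UnionFind(int n) : parent(n) {
    std::iota(parent.begin(), parent.end(), 0); // each node its own root
  }
  int find(int x) {
    while (parent[x] != x) {
      parent[x] = parent[parent[x]]; // pointer jumping / path compression
      x = parent[x];
    }
    return x;
  }
  void unite(int a, int b) { parent[find(a)] = find(b); }
};

Calling unite over every edge and then find over every node labels each component by its root.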
Cost to Implement Other Models

Feature                                      | LoC
GraphLab [Low10]                             | 0
PowerGraph [Gonzalez12] (synchronous engine) | 200
PowerGraph (asynchronous engine)             | 200
GraphChi [Kyrola12]                          | 400
Ligra [Shun13]                               | 300
GraphChi + Ligra (additional)                | 100

PowerGraph-g: the PowerGraph model implemented on top of Galois (used in the comparison below).
Comparison with Other Models
Comparisons
• PowerGraph (distributed)
  – Runtimes with 64 16-core machines (1024 cores) do not beat one 40-core machine using Galois
Inputs
• twitter50 (50M nodes, 2B edges, low-diameter)
• road (20M nodes, 60M edges, high-diameter)
Applications
• Breadth-first search (bfs)
• Connected components (cc)
• Approximate diameter (dia)
• PageRank (pr)
• Single-source shortest paths (sssp)
Nguyen, Lenharth and Pingali. A lightweight infrastructure for graph analytics. SOSP 2013.
133
PowerGraph runtime
Galois runtime
1
10
100
1000
10000
1
10
100
1000
10000
bfs cc dia pr sssp
PowerGraph runtime
PowerGraph⇠g runtime
1
2
1
2
4
6
bfs cc dia pr sssp
134
PowerGraph runtime
Galois runtime
1
10
100
1000
10000
1
10
100
1000
10000
bfs cc dia pr sssp
PowerGraph runtime
PowerGraph⇠g runtime
1
2
1
2
4
6
bfs cc dia pr sssp
PowerGraph runtime
Galois runtime
1
10
100
1000
10000
1
10
100
1000
10000
bfs cc dia pr sssp
PowerGraph runtime
PowerGraph⇠g runtime
1
2
1
2
4
6
bfs cc dia pr sssp
135
• The best algorithm may require application-specific scheduling
– Priority scheduling for SSSP (a bucketed sketch follows this slide)
• The best algorithm may not be expressible as a vertex program
– Connected components with union-find
• Speculative execution is required for high-diameter graphs
– Bulk-synchronous scheduling uses many rounds and has too much overhead
136
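To make the priority-scheduling point concrete, below is a sequential, Δ-stepping-flavored SSSP sketch in plain C++ (a toy illustration, not the Galois scheduler, and it omits Δ-stepping's light/heavy edge split). Work items are binned into buckets of width DELTA and drained in increasing bucket order, the soft-priority policy that a fixed bulk-synchronous engine cannot express:

#include <cstddef>
#include <cstdint>
#include <limits>
#include <vector>

// Weighted CSR graph: edges of u are (dst[k], w[k]) for k in [idx[u], idx[u+1]).
struct Graph {
  std::vector<std::size_t> idx, dst;
  std::vector<std::uint32_t> w;
};

// Bucketed SSSP: buckets of width DELTA (DELTA >= 1) approximate a
// priority queue. DELTA = 1 behaves like Dijkstra on integer weights;
// a huge DELTA degenerates to chaotic relaxation in a single bucket.
std::vector<std::uint64_t> sssp(const Graph& g, std::size_t src,
                                std::uint64_t DELTA) {
  const std::uint64_t INF = std::numeric_limits<std::uint64_t>::max();
  std::vector<std::uint64_t> dist(g.idx.size() - 1, INF);
  std::vector<std::vector<std::size_t>> buckets(1);
  dist[src] = 0;
  buckets[0].push_back(src);
  for (std::size_t b = 0; b < buckets.size(); ++b) {       // priority order
    for (std::size_t i = 0; i < buckets[b].size(); ++i) {  // drain bucket b
      std::size_t u = buckets[b][i];
      if (dist[u] / DELTA != b) continue;                  // stale entry
      for (std::size_t e = g.idx[u]; e < g.idx[u + 1]; ++e) {
        std::size_t v = g.dst[e];
        std::uint64_t nd = dist[u] + g.w[e];
        if (nd < dist[v]) {                                // relax edge u -> v
          dist[v] = nd;
          std::size_t nb = nd / DELTA;                     // nb >= b always
          if (nb >= buckets.size()) buckets.resize(nb + 1);
          buckets[nb].push_back(v);
        }
      }
    }
  }
  return dist;
}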
Outline
• Parallel Program = Operator + Schedule + Parallel Data Structure
• Galois system implementation
– Operator, data structures, scheduling
• Galois studies
– Intel study
– Deterministic scheduling
– Adding a scheduler: sparse tiling
– Comparison with transactional memory
– Implementing other programming models
139
Third-party Evaluations
140
Nadathur et al. Navigating the maze of graph analytics frameworks. SIGMOD 2014.
“Galois performs better than other frameworks, and is close to native performance”
“This paper demonstrates that speculative parallelization and the operator formalism, central to Galois’ programming model and philosophy, is the best choice for irregular CAD algorithms that operate on graph-based data structures.”
Moctar and Brisk. Parallel FPGA Routing based on the Operator Formulation. DAC 2014.
141
Galois: A System for Parallel Execution of Irregular Algorithms

  • 1. Galois: A System for Parallel Execution of Irregular Algorithms Donald Nguyen The University of Texas at Austin
  • 2. Parallel Programming is Changing • Data – Structured (vector, matrix) versus unstructured (graphs) • Platforms – Dedicated clusters versus cloud, mobile • People – Small number of highly trained scientists versus large number of self- trained parallel programmers Old World New World 2
  • 3. The Search for Good Programming Models • Tension between productivity and performance – Support large number of application programmers with small number of expert parallel programmers – Performance comparable to hand-written codes • Galois project 3 Joe Stephanie
  • 4. The Search for Good Programming Models • Tension between productivity and performance – Support large number of application programmers with small number of expert parallel programmers – Performance comparable to hand-written codes • Galois project 4 Joe Stephanie
  • 5. Outline • Parallel Program = Operator + Schedule + e • Galois system implementation – Operator, data structures, scheduling • Galois studies – Intel study – Deterministic scheduling – Adding a scheduler: sparse tiling – Comparison with transactional memory – Implementing other programming models 5 Parallel Data Structure
  • 6. SSSP • Find the shortest distance from source node to all other nodes in a graph – Label nodes with tentative distance – Assume non-negative edge weights – BFS is a special case • Algorithms – Chaoticrelaxation O(2V) – Bellman-Ford O(VE) – Dijkstra’s algorithm O(E log V) • Uses priority queue – Δ-stepping • Uses sequence of bags to prioritize work • Δ=1: Dijkstra • Δ=∞: Chaotic relaxation • Different algorithms are different schedules for applying relaxations – Improved work efficiency – Input sensitivity – Locality 5 3 1 9 ∞∞ ∞ 0 1 9 3 Operator Relaxation 2 6
  • 7. SSSP • Find the shortest distance from source node to all other nodes in a graph – Label nodes with tentative distance – Assume non-negative edge weights – BFS is a special case • Algorithms – Chaoticrelaxation O(2V) – Bellman-Ford O(VE) – Dijkstra’s algorithm O(E log V) • Uses priority queue – Δ-stepping • Uses sequence of bags to prioritize work • Δ=1: Dijkstra • Δ=∞: Chaotic relaxation • Different algorithms are different schedules for applying relaxations – Improved work efficiency – Input sensitivity – Locality 5 3 1 9 ∞∞ ∞ 0 Active node 1 9 3 Operator Relaxation 2 7
  • 8. SSSP • Find the shortest distance from source node to all other nodes in a graph – Label nodes with tentative distance – Assume non-negative edge weights – BFS is a special case • Algorithms – Chaoticrelaxation O(2V) – Bellman-Ford O(VE) – Dijkstra’s algorithm O(E log V) • Uses priority queue – Δ-stepping • Uses sequence of bags to prioritize work • Δ=1: Dijkstra • Δ=∞: Chaotic relaxation • Different algorithms are different schedules for applying relaxations – Improved work efficiency – Input sensitivity – Locality 5 3 1 9 ∞∞ ∞ 0 Active node 1 9 4 3 1 3 Operator Relaxation 2 8
  • 9. SSSP • Find the shortest distance from source node to all other nodes in a graph – Label nodes with tentative distance – Assume non-negative edge weights – BFS is a special case • Algorithms – Chaoticrelaxation O(2V) – Bellman-Ford O(VE) – Dijkstra’s algorithm O(E log V) • Uses priority queue – Δ-stepping • Uses sequence of bags to prioritize work • Δ=1: Dijkstra • Δ=∞: Chaotic relaxation • Different algorithms are different schedules for applying relaxations – Improved work efficiency – Input sensitivity – Locality 5 3 1 9 ∞∞ ∞ 0 Active node Neighborhood Activity 1 9 4 3 1 3 Operator Relaxation 2 9
  • 10. SSSP • Find the shortest distance from source node to all other nodes in a graph – Label nodes with tentative distance – Assume non-negative edge weights – BFS is a special case • Algorithms – Chaoticrelaxation O(2V) – Bellman-Ford O(VE) – Dijkstra’s algorithm O(E log V) • Uses priority queue – Δ-stepping • Uses sequence of bags to prioritize work • Δ=1: Dijkstra • Δ=∞: Chaotic relaxation • Different algorithms are different schedules for applying relaxations – Improved work efficiency – Input sensitivity – Locality 5 3 1 9 ∞ 0 Active node 1 9 4 3 1 3 Operator Relaxation 2 1 5 10
  • 11. SSSP • Find the shortest distance from source node to all other nodes in a graph – Label nodes with tentative distance – Assume non-negative edge weights – BFS is a special case • Algorithms – Chaoticrelaxation O(2V) – Bellman-Ford O(VE) – Dijkstra’s algorithm O(E log V) • Uses priority queue – Δ-stepping • Uses sequence of bags to prioritize work • Δ=1: Dijkstra • Δ=∞: Chaotic relaxation • Different algorithms are different schedules for applying relaxations – Improved work efficiency – Input sensitivity – Locality 5 3 1 9 ∞ 0 Active node 1 9 4 3 1 3 Operator Relaxation 2 1 5 11
  • 12. SSSP • Find the shortest distance from source node to all other nodes in a graph – Label nodes with tentative distance – Assume non-negative edge weights – BFS is a special case • Algorithms – Chaoticrelaxation O(2V) – Bellman-Ford O(VE) – Dijkstra’s algorithm O(E log V) • Uses priority queue – Δ-stepping • Uses sequence of bags to prioritize work • Δ=1: Dijkstra • Δ=∞: Chaotic relaxation • Different algorithms are different schedules for applying relaxations – Improved work efficiency – Input sensitivity – Locality 5 3 1 9 ∞ 0 Active node 1 9 4 3 1 3 Operator Relaxation 2 Algorithm = Operator + Schedule 1 5 12
  • 13. Stencil Computation • Finite difference computation – Active nodes: nodes in At+1 – Operator: 5 point stencil – Different schedules have different locality • Regular application – Grid structure and active nodes known statically – Can be parallelized by a compiler 13 At At+1
  • 14. • Operator formulation – Active elements: nodes or edges where there is work to be done – Operator: computation at active element • Activity: application of operator to active element • Neighborhood: graph elements read or written by activity – Ordering: order in which active elements must appear to have been processed • Unordered algorithms: any order is fine (e.g., chaotic relaxation) but may have soft priorities • Ordered algorithms: algorithm-specific order (e.g., discrete event simulation) – Parallelism: process activities in parallel subject to neighborhood and ordering constraints • Data parallelism is a special case (all nodes active, no conflicts) : active node : neighborhood Abstraction of Algorithms 14
  • 15. • Operator formulation – Active elements: nodes or edges where there is work to be done – Operator: computation at active element • Activity: application of operator to active element • Neighborhood: graph elements read or written by activity – Ordering: order in which active elements must appear to have been processed • Unordered algorithms: any order is fine (e.g., chaotic relaxation) but may have soft priorities • Ordered algorithms: algorithm-specific order (e.g., discrete event simulation) – Parallelism: process activities in parallel subject to neighborhood and ordering constraints • Data parallelism is a special case (all nodes active, no conflicts) : active node : neighborhood Abstraction of Algorithms 15
  • 16. • Operator formulation – Active elements: nodes or edges where there is work to be done – Operator: computation at active element • Activity: application of operator to active element • Neighborhood: graph elements read or written by activity – Ordering: order in which active elements must appear to have been processed • Unordered algorithms: any order is fine (e.g., chaotic relaxation) but may have soft priorities • Ordered algorithms: algorithm-specific order (e.g., discrete event simulation) – Parallelism: process activities in parallel subject to neighborhood and ordering constraints • Data parallelism is a special case (all nodes active, no conflicts) : active node : neighborhood Abstraction of Algorithms 16
  • 17. • Operator formulation – Active elements: nodes or edges where there is work to be done – Operator: computation at active element • Activity: application of operator to active element • Neighborhood: graph elements read or written by activity – Ordering: order in which active elements must appear to have been processed • Unordered algorithms: any order is fine (e.g., chaotic relaxation) but may have soft priorities • Ordered algorithms: algorithm-specific order (e.g., discrete event simulation) – Parallelism: process activities in parallel subject to neighborhood and ordering constraints • Data parallelism is a special case (all nodes active, no conflicts) : active node : neighborhood Abstraction of Algorithms 17
  • 18. • Operator formulation – Active elements: nodes or edges where there is work to be done – Operator: computation at active element • Activity: application of operator to active element • Neighborhood: graph elements read or written by activity – Ordering: order in which active elements must appear to have been processed • Unordered algorithms: any order is fine (e.g., chaotic relaxation) but may have soft priorities • Ordered algorithms: algorithm-specific order (e.g., discrete event simulation) – Parallelism: process activities in parallel subject to neighborhood and ordering constraints • Data parallelism is a special case (all nodes active, no conflicts) : active node : neighborhood Abstraction of Algorithms 18
  • 19. • Operator formulation – Active elements: nodes or edges where there is work to be done – Operator: computation at active element • Activity: application of operator to active element • Neighborhood: graph elements read or written by activity – Ordering: order in which active elements must appear to have been processed • Unordered algorithms: any order is fine (e.g., chaotic relaxation) but may have soft priorities • Ordered algorithms: algorithm-specific order (e.g., discrete event simulation) – Parallelism: process activities in parallel subject to neighborhood and ordering constraints • Data parallelism is a special case (all nodes active, no conflicts) : active node : neighborhood Abstraction of Algorithms 19 B A C D
  • 20. • Operator formulation – Active elements: nodes or edges where there is work to be done – Operator: computation at active element • Activity: application of operator to active element • Neighborhood: graph elements read or written by activity – Ordering: order in which active elements must appear to have been processed • Unordered algorithms: any order is fine (e.g., chaotic relaxation) but may have soft priorities • Ordered algorithms: algorithm-specific order (e.g., discrete event simulation) – Parallelism: process activities in parallel subject to neighborhood and ordering constraints • Data parallelism is a special case (all nodes active, no conflicts) : active node : neighborhood Abstraction of Algorithms 20 B A C D
  • 21. TAO of Parallelism Algorithms Topology Active Nodes Operator Structured Semi-structured Unstructured Location Ordering Morph Local computation Reader Topology-driven Data-driven Unordered Ordered 21 Pingali, Nguyen, et al. The tao of parallelism in algorithms. PLDI 2011.
  • 22. TAO of Parallelism 22 Pingali, Nguyen, et al. The tao of parallelism in algorithms. PLDI 2011. Galois Static Inspector-Executor Bulk-Synchronous Optimistic Round-based
  • 23. Galois Project Stephanie: Parallel Data Structures Joe: Operator + Schedule Program = Algorithm + Data Structure
  • 24. Galois Project Parallel Program = Operator + Schedule + Parallel Data Structure Stephanie: Parallel Data Structures Joe: Operator + Schedule Program = Algorithm + Data Structure
  • 25. My Contribution Design and implementation of a programming model for high-performance parallelism by factoring programs into operator + schedule + data structure 25
  • 26. Themes • Joe – Program = Operator + Schedule + Data Structure • Stephanie – Scalability principle1: disjoint access • Accesses to distinct logical entities should be disjoint physically as well • “Do no harm” – Scalability principle2: virtualizetasks • Execution of tasks should not be eagerly tied to particular physical resources • “Be flexible” • Challenges – Easy to introduce inadvertentsharing if not looking at entire system stack – Small task sizes (~ 100s of instructions) 26
  • 27. Publications • M. Amber Hassaan, Donald Nguyen, Keshav Pingali. Kinetic Dependence Graphs (to appear). ASPLOS 2015. • Konstantions Karantasis, Andrew Lenharth, Donald Nguyen, et al. Parallelization of reordering algorithms for bandwidth and wavefront reduction. SC 2014. • Damian Goik, Konrad Jopek, Maciej Paszyński, Andrew Lenharth, Donald Nguyen, and Keshav Pingali. Graph grammar based multi-thread multi-frontal direct solver with galois scheduler. ICCS 2014. • Donald Nguyen, Andrew Lenharth, Keshav Pingali. Deterministic Galois: On-demand, parameterless and portable. ASPLOS 2014. • Donald Nguyen, Andrew Lenharth, Keshav Pingali. A lightweight infrastructure for graph analytics. SOSP 2013. • Keshav Pingali, Donald Nguyen, Milind Kulkarni, et al. The tao of parallelism in algorithms. PLDI 2011. • Donald Nguyen, Keshav Pingali. Synthesizing concurrent schedulers for irregular algorithms. ASPLOS 2011. • Milind Kulkarni, Donald Nguyen, Dimitrios Prountzos, et al. Exploiting the commutativity lattice. PLDI 2011. • Xin Sui, Donald Nguyen, Martin Burtscher, Keshav Pingali. Parallel Graph Partitioning on Multicore Architectures. LCPC 2010. • Mario Méndez-Lojo, Donald Nguyen, Dimitrios Prountzos, et al. Structure-driven optimizations for amorphous data-parallel programs. PPOPP 2010. • Shih-wei Liao, Tzu-Han Hung, Donald Nguyen, et al. Machine learning-based prefetch optimization for data center applications. SC 2009. 27
  • 28. Publications • M. Amber Hassaan, Donald Nguyen, Keshav Pingali. Kinetic Dependence Graphs (to appear). ASPLOS 2015. • Konstantions Karantasis, Andrew Lenharth, Donald Nguyen, et al. Parallelization of reordering algorithms for bandwidth and wavefront reduction. SC 2014. • Damian Goik, Konrad Jopek, Maciej Paszyński, Andrew Lenharth, Donald Nguyen, and Keshav Pingali. Graph grammar based multi-thread multi-frontal direct solver with galois scheduler. ICCS 2014. • Donald Nguyen, Andrew Lenharth, Keshav Pingali. Deterministic Galois: On-demand, parameterless and portable. ASPLOS 2014. • Donald Nguyen, Andrew Lenharth, Keshav Pingali. A lightweight infrastructure for graph analytics. SOSP 2013. • Keshav Pingali, Donald Nguyen, Milind Kulkarni, et al. The tao of parallelism in algorithms. PLDI 2011. • Donald Nguyen, Keshav Pingali. Synthesizing concurrent schedulers for irregular algorithms. ASPLOS 2011. • Milind Kulkarni, Donald Nguyen, Dimitrios Prountzos, et al. Exploiting the commutativity lattice. PLDI 2011. • Xin Sui, Donald Nguyen, Martin Burtscher, Keshav Pingali. Parallel Graph Partitioning on Multicore Architectures. LCPC 2010. • Mario Méndez-Lojo, Donald Nguyen, Dimitrios Prountzos, et al. Structure-driven optimizations for amorphous data-parallel programs. PPOPP 2010. • Shih-wei Liao, Tzu-Han Hung, Donald Nguyen, et al. Machine learning-based prefetch optimization for data center applications. SC 2009. 28
  • 29. Outline • Parallel Program = Operator + Schedule + e • Galois system implementation – Operator, data structures, scheduling • Galois studies – Intel study – Deterministic scheduling – Adding a scheduler: sparse tiling – Comparison with transactional memory – Implementing other programming models 29 Parallel Data Structure
  • 30. Galois Programming Model • Galois system (Stephanie) – Runtimeand libraryof concurrent data structures • Galois system (Joe) – Set iteratorsexpress parallelism – Operator: body of iterator – ADT (e.g., graph, set) • Unordered execution – Could haveconflicts – Some serialization of iterations – New iterations can be added • Ordered execution – See Hassaan, Nguyen and Pingali. Kinetic Dependence Graphs (to appear). ASPLOS 2015. Graph g for_each (Edge e : wl) Node n = g.getEdgeDst(e) …
  • 31. Hello Graph 31 #include “Galois/Galois.h” #include “Galois/Graph/Graph.h” typedef Galois::Graph::LC_CSR_Graph<int, int> Graph; typedef Graph::GraphNode GNode; Graph graph; struct P { void operator()(GNode& n, Galois::UserContext<GNode>& c) { int& d = graph.getData(n); if (d.value++ < 5) c.push(n); } }; int main(int argc, char** argv) { Galois::Graph::readGraph(graph, argv[1]); … Galois::for_each(initial, P(), Galois::wl<WL>()); return 0; }
  • 32. Hello Graph 32 #include “Galois/Galois.h” #include “Galois/Graph/Graph.h” typedef Galois::Graph::LC_CSR_Graph<int, int> Graph; typedef Graph::GraphNode GNode; Graph graph; struct P { void operator()(GNode& n, Galois::UserContext<GNode>& c) { int& d = graph.getData(n); if (d.value++ < 5) c.push(n); } }; int main(int argc, char** argv) { Galois::Graph::readGraph(graph, argv[1]); … Galois::for_each(initial, P(), Galois::wl<WL>()); return 0; } Data structure declarations Galois Iterator Operator
  • 33. Hello Graph 33 #include “Galois/Galois.h” #include “Galois/Graph/Graph.h” typedef Galois::Graph::LC_CSR_Graph<int, int> Graph; typedef Graph::GraphNode GNode; Graph graph; struct P { void operator()(GNode& n, Galois::UserContext<GNode>& c) { int& d = graph.getData(n); if (d.value++ < 5) c.push(n); } }; int main(int argc, char** argv) { Galois::Graph::readGraph(graph, argv[1]); … Galois::for_each(initial, P(), Galois::wl<WL>()); return 0; } Data structure declarations Galois Iterator Operator Sequential program semantics
  • 34. Hello Graph 34 #include “Galois/Galois.h” #include “Galois/Graph/Graph.h” typedef Galois::Graph::LC_CSR_Graph<int, int> Graph; typedef Graph::GraphNode GNode; Graph graph; struct P { void operator()(GNode& n, Galois::UserContext<GNode>& c) { int& d = graph.getData(n); if (d.value++ < 5) c.push(n); } }; int main(int argc, char** argv) { Galois::Graph::readGraph(graph, argv[1]); … Galois::for_each(initial, P(), Galois::wl<WL>()); return 0; } Data structure declarations Galois Iterator Operator Neighborhood defined by graph API calls Sequential program semantics
  • 35. Hello Graph 35 #include “Galois/Galois.h” #include “Galois/Graph/Graph.h” typedef Galois::Graph::LC_CSR_Graph<int, int> Graph; typedef Graph::GraphNode GNode; Graph graph; struct P { void operator()(GNode& n, Galois::UserContext<GNode>& c) { int& d = graph.getData(n); if (d.value++ < 5) c.push(n); } }; int main(int argc, char** argv) { Galois::Graph::readGraph(graph, argv[1]); … Galois::for_each(initial, P(), Galois::wl<WL>()); return 0; } struct TaskIndexer { int operator()(GNode& n) { return graph.getData(n) >> shift; } }; using namespace Galois::WorkList; typedef dChunkedFIFO<> WL; typedef OrderedByIntegerMetric<TaskIndexer> WL; Neighborhood defined by graph API calls Sequential program semantics Scheduling DSL
  • 36. Galois Operator 36 B A Graph g for_each (Node n : wl) for (Node m : g.edges(n)) … A B
  • 37. Galois Operator 37 B A T1 Graph g for_each (Node n : wl) for (Node m : g.edges(n)) … B
  • 38. Galois Operator 38 B A T1 T2 Graph g for_each (Node n : wl) for (Node m : g.edges(n)) …
  • 39. Galois Operator 39 B A T1 T2 Graph g for_each (Node n : wl) for (Node m : g.edges(n)) … T1
  • 40. Galois Operator 40 B A T1 T2 Graph g for_each (Node n : wl) for (Node m : g.edges(n)) … T1
  • 41. Galois Operator 41 B A T1 T2 Graph g for_each (Node n : wl) for (Node m : g.edges(n)) … T1 T2
  • 42. Galois Operator 42 B A T1 T2 Graph g for_each (Node n : wl) for (Node m : g.edges(n)) … T1 T2
  • 43. Galois Operator • All data accesses go through API – Galois data implementations maintainmarks – When marks overlap, rollback activity and try later • Cautious – Commonlyoperators read their entire neighborhood before modifying any element – Failsafe point: point in operator before first write • Non-deterministic • Difference from Thread Level Speculation [Rauchwerger and Padua, PLDI 1995] – Iterations are unordered – Operators are cautious – Conflicts are detected at ADT level rather than memory level 43 B A T1 T2 Graph g for_each (Node n : wl) for (Node m : g.edges(n)) … T1 T2
  • 44. Galois Data Structures • Graphs – May require multiple loads to access data for a node – Exploit local computation structure – Simple allocation schemes may be imbalanced in physical memory
  • 45. Galois Data Structures • Graphs – May require multiple loads to access data for a node – Exploit local computation structure – Simple allocation schemes may be imbalanced in physical memory nd ed dst node data[N] edge idx[N] edge data[E] edge dst[E] CSR nd: node data ed: edge data dst: edge destination len: # of edges
  • 46. Galois Data Structures • Graphs – May require multiple loads to access data for a node – Exploit local computation structure – Simple allocation schemes may be imbalanced in physical memory nd ed dst Inlining Inlining: Andrew Lenharth nd: node data ed: edge data dst: edge destination len: # of edges
  • 47. Galois Data Structures • Graphs – May require multiple loads to access data for a node – Exploit local computation structure – Simple allocation schemes may be imbalanced in physical memory nd ed dst Inlining Inlining: Andrew Lenharth nd: node data ed: edge data dst: edge destination len: # of edges
  • 48. Galois Data Structures • Graphs – May require multiple loads to access data for a node – Exploit local computation structure – Simple allocation schemes may be imbalanced in physical memory nd len ed ed Inlining Inlining: Andrew Lenharth nd: node data ed: edge data dst: edge destination len: # of edges
  • 49. Galois Data Structures • Graphs – May require multiple loads to access data for a node – Exploit local computation structure – Simple allocation schemes may be imbalanced in physical memory nd len ed ed Inlining Inlining: Andrew Lenharth
  • 50. Galois Data Structures • Graphs – May require multiple loads to access data for a node – Exploit local computation structure – Simple allocation schemes may be imbalanced in physical memory nd len ed ed Inlining Placement DRAM 0 DRAM 1 Virtual Memory Physical Memory Inlining: Andrew Lenharth c1 c2
  • 51. Scheduling in Galois • Users write high-level scheduling policies – DSL for specifying policies • Separation of concerns – Users understand their operator but have vague understanding of scheduling – Schedulers are difficult to implement efficiently • Galois system synthesizes scheduler by composing elements from a scheduler library 51 Nguyen and Pingali. Synthesizing concurrent schedulers for irregular algorithms. ASPLOS 2011.
  • 52. Scheduling Components and Policies • Components: FIFO, LIFO, BulkSynchronous, ChunkedFIFO, OrderedByMetric, … Algorithm Scheduling Policy Used By DMR OrderedByMetric(λt.minangle(t)) FIFO Shewchuk96 Global: ChunkedFIFO(k) Local: LIFO Kulkarni08 DT OrderedByMetric(λp.round(p)) FIFO Amenta03 Random Clarkson89 PFP FIFO Goldberg88 OrderedByMetric(λn.height(n)) Cherkassy95 PTA FIFO Pearce03 BulkSynchronous Nielson99 SSSP FIFO Bellman- Ford OrderedByMetric(λn.w(n)/Δ + (light(n) ? 0 : ½)) FIFO Δ-stepping 52
  • 58. Scheduling Bag c1 c2 Head Package 1 58 Bag c1 c2 Head Package 2 Per-PackageBag: Andrew Lenharth
  • 59. Scheduling Bag c1 c2 Head Package 1 59 Bag c1 c2 Head Package 2 Per-PackageBag: Andrew Lenharth
  • 60. Scheduling Bag c1 c2 Head Package 1 60 Bag c1 c2 Head Package 2 Per-PackageBag: Andrew Lenharth
  • 61. Scheduling Bag c1 c2 Head Package 1 61 Bag c1 c2 Head Package 2 Per-PackageBag: Andrew Lenharth
  • 62. Scheduling Bag c1 c2 Head Package 1 62 Bag c1 c2 Head Package 2 Per-PackageBag: Andrew Lenharth
  • 63. Scheduling Bag c1 c2 Head Package 1 63 Bag c1 c2 Head Package 2 Per-PackageBag: Andrew Lenharth
  • 64. Scheduling Bag c1 c2 Head Package 1 64 Bag c1 c2 Head Package 2 Per-PackageBag: Andrew Lenharth
  • 65. Scheduling Bag c1 c2 Head Package 1 65 Bag c1 c2 Head Package 2
  • 66. OrderedByIntegerMetric 66 0 1 2 3 4 5 Core 1 Core 2 3 Cursor
  • 67. OrderedByIntegerMetric • Implements soft priorities • Drawbacks – Dense range of priorities – Contention on cursor location 67 0 1 2 3 4 5 Core 1 Core 2 3 Cursor
  • 68. Improved OrderedByIntegerMetric Global Map Scheduling Bags 1 30 68 Implementation byAndrew Lenharth
  • 69. Improved OrderedByIntegerMetric 0 1 Global Map Local Maps Core 1 Core 2 Scheduling Bags 1 3 1 30 69 Implementation byAndrew Lenharth
  • 70. Improved OrderedByIntegerMetric 0 1 Global Map Local Maps Core 1 Core 2 Scheduling Bags 1 3 1 30 13 70 Implementation byAndrew Lenharth
  • 71. Improved OrderedByIntegerMetric 0 1 Global Map Local Maps Core 1 Core 2 Scheduling Bags 1 3 1 30 13 71 Implementation byAndrew Lenharth
  • 72. Improved OrderedByIntegerMetric 0 1 Global Map Local Maps Core 1 Core 2 Scheduling Bags 1 3 1 30 13 3 72 Implementation byAndrew Lenharth
  • 73. Improved OrderedByIntegerMetric 0 1 Global Map Local Maps Core 1 Core 2 Scheduling Bags 1 3 1 30 1 3 73 Implementation byAndrew Lenharth
  • 74. Improved OrderedByIntegerMetric 0 1 Global Map Local Maps Core 1 Core 2 Scheduling Bags 1 3 1 30 1 3 15 74 Implementation byAndrew Lenharth
  • 75. Improved OrderedByIntegerMetric 0 1 Global Map Local Maps Core 1 Core 2 Scheduling Bags 1 3 1 30 1 3 15 75 Implementation byAndrew Lenharth
  • 76. Improved OrderedByIntegerMetric 0 1 Global Map Local Maps Core 1 Core 2 Scheduling Bags 1 3 1 30 1 3 15 5 5 76 Implementation byAndrew Lenharth
  • 77. Improved OrderedByIntegerMetric 0 1 Global Map Local Maps Core 1 Core 2 Scheduling Bags 1 3 1 30 1 3 1 5 5 77 Implementation byAndrew Lenharth
  • 78. Galois System • Joe – Writes sequential code – Uses Galois data structures and scheduling hints • Stephanie – Writes concurrent code – Provides variety of data structures and scheduling policies for Joe to choose from • cf., Frameworks that only support one scheduler or one graph implementation 78 Stephanie: Parallel Data Structures Joe: Operator + Schedule
  • 79. Outline • Parallel Program = Operator + Schedule + e • Galois system implementation – Operator, data structures, scheduling • Galois studies – Intel study – Deterministic scheduling – Adding a scheduler: sparse tiling – Comparison with transactional memory – Implementing other programming models 79 Parallel Data Structure
  • 80. Intel Study 80 0.1 1 10 tion(seconds) Native Combblas Graphlab Socialite Giraph Galois 0.01 0.1 Livejournal Facebook Wikipedia Synthetic Timeperiterat (a) PageRank 10 100 e(seconds) Native Combblas Graphlab Socialite Giraph Galois 0 1 Livejournal Facebook Wikipedia Synthetic Overalltime (b) Breadth-First Search 10 100 1000 ation(seconds) Native Combblas Socialite Giraph 1 10 Netflix Timeperitera (c) Collaborative Fil Figure 3: Performance results for different algorithms on real-world and synthetic graphs that are s represents runtime (in log-scale), therefore lower numbers are better. FDR interconnect. The cluster runs on the Red Hat Enterprise Linux Server OS release 6.4. We use a mix of OpenMP directives to parallelize within the node and MPI code to parallelize across nodes in native code. We use the Intel® C++ Composer XE 2013 SP1 Compiler3 and the Intel® MPI library to compile the code. GraphLab uses a similar combination of OpenMP and MPI for best performance. CombBLAS runs best as a pure MPI program. We use multiple MPI processes per node to take advantage of the mul- ideal results. Given gorithms, and also be tions, we expect that i excel at all scales in te 5.2 Single nod We now show the r orative Filtering andNadathur et al. Navigating the maze of graph analytics frameworks. SIGMOD 2014. Combblas Graphlab Giraph Galois Livejournal Facebook Wikipedia Synthetic readth-First Search 10 100 1000 ation(seconds) Native Combblas Graphlab Socialite Giraph Galois 1 10 Netflix Synthetic Timeperitera (c) Collaborative Filtering 10 100 1000 10000 e(seconds) Native Combblas Graphlab Socialite Giraph Galois 0.1 1 Livejournal Facebook Wikipedia Synthetic OverallTim (d) Triangle counting rithms on real-world and synthetic graphs that are small enough to run on a single node. The y-axis
  • 81. Intel Study 81 0.1 1 10 tion(seconds) Native Combblas Graphlab Socialite Giraph Galois 0.01 0.1 Livejournal Facebook Wikipedia Synthetic Timeperiterat (a) PageRank 10 100 e(seconds) Native Combblas Graphlab Socialite Giraph Galois 0 1 Livejournal Facebook Wikipedia Synthetic Overalltime (b) Breadth-First Search 10 100 1000 ation(seconds) Native Combblas Socialite Giraph 1 10 Netflix Timeperitera (c) Collaborative Fil Figure 3: Performance results for different algorithms on real-world and synthetic graphs that are s represents runtime (in log-scale), therefore lower numbers are better. FDR interconnect. The cluster runs on the Red Hat Enterprise Linux Server OS release 6.4. We use a mix of OpenMP directives to parallelize within the node and MPI code to parallelize across nodes in native code. We use the Intel® C++ Composer XE 2013 SP1 Compiler3 and the Intel® MPI library to compile the code. GraphLab uses a similar combination of OpenMP and MPI for best performance. CombBLAS runs best as a pure MPI program. We use multiple MPI processes per node to take advantage of the mul- ideal results. Given gorithms, and also be tions, we expect that i excel at all scales in te 5.2 Single nod We now show the r orative Filtering andNadathur et al. Navigating the maze of graph analytics frameworks. SIGMOD 2014. Combblas Graphlab Giraph Galois Livejournal Facebook Wikipedia Synthetic readth-First Search 10 100 1000 ation(seconds) Native Combblas Graphlab Socialite Giraph Galois 1 10 Netflix Synthetic Timeperitera (c) Collaborative Filtering 10 100 1000 10000 e(seconds) Native Combblas Graphlab Socialite Giraph Galois 0.1 1 Livejournal Facebook Wikipedia Synthetic OverallTim (d) Triangle counting rithms on real-world and synthetic graphs that are small enough to run on a single node. The y-axis
  • 82. Intel Study 82 0.1 1 10 tion(seconds) Native Combblas Graphlab Socialite Giraph Galois 0.01 0.1 Livejournal Facebook Wikipedia Synthetic Timeperiterat (a) PageRank 10 100 e(seconds) Native Combblas Graphlab Socialite Giraph Galois 0 1 Livejournal Facebook Wikipedia Synthetic Overalltime (b) Breadth-First Search 10 100 1000 ation(seconds) Native Combblas Socialite Giraph 1 10 Netflix Timeperitera (c) Collaborative Fil Figure 3: Performance results for different algorithms on real-world and synthetic graphs that are s represents runtime (in log-scale), therefore lower numbers are better. FDR interconnect. The cluster runs on the Red Hat Enterprise Linux Server OS release 6.4. We use a mix of OpenMP directives to parallelize within the node and MPI code to parallelize across nodes in native code. We use the Intel® C++ Composer XE 2013 SP1 Compiler3 and the Intel® MPI library to compile the code. GraphLab uses a similar combination of OpenMP and MPI for best performance. CombBLAS runs best as a pure MPI program. We use multiple MPI processes per node to take advantage of the mul- ideal results. Given gorithms, and also be tions, we expect that i excel at all scales in te 5.2 Single nod We now show the r orative Filtering andNadathur et al. Navigating the maze of graph analytics frameworks. SIGMOD 2014. Combblas Graphlab Giraph Galois Livejournal Facebook Wikipedia Synthetic readth-First Search 10 100 1000 ation(seconds) Native Combblas Graphlab Socialite Giraph Galois 1 10 Netflix Synthetic Timeperitera (c) Collaborative Filtering 10 100 1000 10000 e(seconds) Native Combblas Graphlab Socialite Giraph Galois 0.1 1 Livejournal Facebook Wikipedia Synthetic OverallTim (d) Triangle counting rithms on real-world and synthetic graphs that are small enough to run on a single node. The y-axis
  • 83. Intel Study 83 0.1 1 10 tion(seconds) Native Combblas Graphlab Socialite Giraph Galois 0.01 0.1 Livejournal Facebook Wikipedia Synthetic Timeperiterat (a) PageRank 10 100 e(seconds) Native Combblas Graphlab Socialite Giraph Galois 0 1 Livejournal Facebook Wikipedia Synthetic Overalltime (b) Breadth-First Search 10 100 1000 ation(seconds) Native Combblas Socialite Giraph 1 10 Netflix Timeperitera (c) Collaborative Fil Figure 3: Performance results for different algorithms on real-world and synthetic graphs that are s represents runtime (in log-scale), therefore lower numbers are better. FDR interconnect. The cluster runs on the Red Hat Enterprise Linux Server OS release 6.4. We use a mix of OpenMP directives to parallelize within the node and MPI code to parallelize across nodes in native code. We use the Intel® C++ Composer XE 2013 SP1 Compiler3 and the Intel® MPI library to compile the code. GraphLab uses a similar combination of OpenMP and MPI for best performance. CombBLAS runs best as a pure MPI program. We use multiple MPI processes per node to take advantage of the mul- ideal results. Given gorithms, and also be tions, we expect that i excel at all scales in te 5.2 Single nod We now show the r orative Filtering andNadathur et al. Navigating the maze of graph analytics frameworks. SIGMOD 2014. Combblas Graphlab Giraph Galois Livejournal Facebook Wikipedia Synthetic readth-First Search 10 100 1000 ation(seconds) Native Combblas Graphlab Socialite Giraph Galois 1 10 Netflix Synthetic Timeperitera (c) Collaborative Filtering 10 100 1000 10000 e(seconds) Native Combblas Graphlab Socialite Giraph Galois 0.1 1 Livejournal Facebook Wikipedia Synthetic OverallTim (d) Triangle counting rithms on real-world and synthetic graphs that are small enough to run on a single node. The y-axis
• 87. Outline
• Parallel Program = Operator + Schedule + Parallel Data Structure
• Galois system implementation
– Operator, data structures, scheduling
• Galois studies
– Intel study
– Deterministic scheduling
– Adding a scheduler: sparse tiling
– Comparison with transactional memory
– Implementing other programming models
• 88. Deterministic Interference Graph (DIG) Scheduling
• Traditional Galois execution
– Asynchronous, non-deterministic
• Deterministic Galois execution
– Round-based execution
– Construct an interference graph over the active tasks
– Execute an independent set
– Repeat
(Slides 88–94 animate one round: tasks A, B, and C are placed in the interference graph, an independent set executes, and the remaining tasks carry over to the next round.)
Nguyen, Lenharth and Pingali. Deterministic Galois: on-demand, parameterless and portable. ASPLOS 2014.
• 95. DIG Scheduling in Practice
• Just find the roots of the interference graph
• Sequence of rounds; each round has two phases
• Phase 1: mark neighborhoods
– Each task atomically writes its task id over its neighborhood, up to the failsafe point
– A mark is overwritten only by a smaller id (here A < B < C)
– Final mark values are therefore deterministic, regardless of interleaving
• Phase 2: execute roots
– Execute the tasks whose neighborhoods contain only their own marks
(Slides 95–101 animate the marking: A, B, and C race to mark overlapping neighborhoods, and the smaller id wins each contested node. A minimal sketch of the two phases follows.)
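To make the two phases concrete, here is a minimal C++ sketch of one marking round. The names (TaskId, Node, markNeighborhood, ownsNeighborhood) and the atomic-minimum loop are illustrative assumptions, not the actual Galois implementation.

```cpp
#include <atomic>
#include <limits>
#include <vector>

using TaskId = unsigned;
constexpr TaskId kNoTask = std::numeric_limits<TaskId>::max();

struct Node { std::atomic<TaskId> mark{kNoTask}; };

// Phase 1: task `id` claims every node in its neighborhood, keeping the
// smallest id. Because min is commutative and associative, the final marks
// are deterministic no matter how the threads interleave.
void markNeighborhood(TaskId id, const std::vector<Node*>& nbhd) {
  for (Node* n : nbhd) {
    TaskId cur = n->mark.load(std::memory_order_relaxed);
    // Atomic-min loop: compare_exchange reloads `cur` on failure, so we
    // retry only while our id is still the smaller one.
    while (id < cur &&
           !n->mark.compare_exchange_weak(cur, id, std::memory_order_relaxed)) {
    }
  }
}

// Phase 2: a task is a root of the interference graph, and may execute,
// exactly when it owns every mark in its neighborhood.
bool ownsNeighborhood(TaskId id, const std::vector<Node*>& nbhd) {
  for (Node* n : nbhd)
    if (n->mark.load(std::memory_order_relaxed) != id) return false;
  return true;
}
```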
  • 102. Optimizations • Resuming tasks – Avoid re-executing tasks – Suspend and resume execution at failsafe point – Provide buffers to save local state • Windowing – Inspect only a subset of tasks at a time – Adaptive algorithm varies window size 102
• 103. Evaluation of Deterministic Scheduling
Platform
• Intel 4 × 10-core (2.27 GHz)
Deterministic systems
• RCDC [Devietti11]
– Deterministic by scheduling
– Implementation of the Kendo algorithm [Olszewski09]
• PBBS [Blelloch11]
– Deterministic by construction
– Handwritten deterministic implementations
• Deterministic Galois
– Non-deterministic programs (g-n) automatically made deterministic (g-d)
Applications (from PBBS)
• Breadth-first search (bfs)
• Delaunay mesh refinement (dmr)
• Delaunay triangulation (dt)
• Maximal independent set (mis)
• 105. Evaluation
Figure (slides 105–108): self-relative speedup versus thread count (up to 40) for bfs, dmr, dt, and mis, comparing the non-deterministic Galois program (g-n), RCDC (deterministic by scheduling), the deterministic Galois program (g-d), and PBBS (deterministic by construction).
• 110. Third-party Evaluation: Maze Router Speedup
• In FPGA routing it is useful to have deterministic results

  Deterministic Scheduler           Non-Deterministic Scheduler
  VPR 5.0              1.00x        VPR 5.0              1.00x
  Galois (1 Thread)    0.79x        Galois (1 Thread)    0.89x
  Galois (2 Threads)   1.21x        Galois (2 Threads)   1.47x
  Galois (4 Threads)   2.14x        Galois (4 Threads)   2.99x
  Galois (8 Threads)   3.67x        Galois (8 Threads)   5.46x

Extended results from: Moctar and Brisk. Parallel FPGA Routing based on the Operator Formulation. DAC 2014.
• 111. Outline
• Parallel Program = Operator + Schedule + Parallel Data Structure
• Galois system implementation
– Operator, data structures, scheduling
• Galois studies
– Intel study
– Deterministic scheduling
– Adding a scheduler: sparse tiling
– Comparison with transactional memory
– Implementing other programming models
• 112. Matrix Completion
• Matrix completion problem
– The Netflix problem
• An optimization problem
– Find an m×k matrix W and a k×n matrix H (k << min(m, n)) such that A ≈ WH
– Low-rank approximation
• Algorithms
– Stochastic gradient descent (SGD)
– …
(Figure: matrix view, A ≈ WH with latent vectors Wi* and H*j; graph view, a bipartite graph of users and movies with ratings Aij on the edges.)
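Written out, the problem is a least-squares objective over the known ratings. The Frobenius regularization term with weight λ is a common addition and an assumption here, not something stated on the slide:

```latex
\[
\min_{W \in \mathbb{R}^{m \times k},\; H \in \mathbb{R}^{k \times n}}
  \sum_{(i,j)\,:\,A_{ij}\ \text{known}}
    \bigl(A_{ij} - W_{i*}\,H_{*j}\bigr)^2
  \;+\; \lambda\bigl(\lVert W\rVert_F^2 + \lVert H\rVert_F^2\bigr)
\]
```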
• 113. Naïve SGD Implementation
• Row-wise operator:

  for (User u : users)
    for (Item m : items(u))
      op(u, m, A[u][m]);

• Poor cache behavior
– With k = 100, each latent vector Wi* is about 1 KB
– degree(user) > 1000, so one row sweep touches thousands of distinct columns of H
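As a concrete example of what op does on one rating, a single SGD step nudges both latent vectors along the gradient of the squared error. This is a hedged sketch: the function name, step size, and regularization parameter lambda are illustrative assumptions, not the talk's code.

```cpp
// One SGD relaxation for A ~= W * H on a single known rating a = A[u][m].
// w and h are the length-k latent vectors for user u and item m.
void sgdUpdate(float* w, float* h, float a, int k, float step, float lambda) {
  float err = -a;                          // error = dot(w, h) - a
  for (int i = 0; i < k; ++i) err += w[i] * h[i];
  for (int i = 0; i < k; ++i) {            // gradient step on both factors
    float wi = w[i];                       // save w[i] before it is updated
    w[i] -= step * (err * h[i] + lambda * wi);
    h[i] -= step * (err * wi + lambda * h[i]);
  }
}
```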
• 114. 2D Tiled SGD
• Apply the operator to small 2D tiles of A, so each tile reuses one block of W rows and one block of H columns
• Additional concerns
– Conflict-free scheduling of tiles (a diagonal schedule is sketched below)
• Future optimizations
– Adaptive, non-uniform tiling
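A hedged sketch of one conflict-free schedule: walk a T×T grid of tiles along diagonals so that, within a round, no two tiles share a block of W rows or of H columns. The grid shape and the processTile stub are assumptions for illustration.

```cpp
#include <cstdio>

// Hypothetical per-tile SGD sweep over one block of W rows and one block of
// H columns; a stub here so the sketch is self-contained.
void processTile(int rowBlock, int colBlock) {
  std::printf("tile (%d,%d)\n", rowBlock, colBlock);
}

int main() {
  const int T = 4;  // T x T grid of tiles
  for (int r = 0; r < T; ++r) {
    // In round r, worker t processes tile (t, (t + r) % T). Each tile in a
    // round touches a distinct row block and a distinct column block, so
    // all T tiles of a round could run in parallel with no conflicts.
    for (int t = 0; t < T; ++t)
      processTile(t, (t + r) % T);
    // a barrier between rounds keeps successive diagonals conflict-free
  }
}
```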
• 115. Evaluation of Tiling
Parameters
• Same SGD parameters and initial latent vectors across implementations
• Step size ε(n) = α / (1 + β · n^1.5)
• Hand-tuned tile size
Implementations
• Galois
– 2D scheduling (2 weeks to implement)
– Synchronized updates
• GraphLab
– Standard GraphLab only supports GD
– Implemented 1D SGD with unsynchronized updates
• Nomad
– From-scratch distributed code (~1 year to implement)
– 2D scheduling; tile size is a function of #threads
– Synchronized updates
Datasets
            Items   Users   Ratings   Sparsity
  netflix   17K     480K    99M       1e-2
  yahoo     624K    1M      253M      4e-4
Yun, Yu, Hsieh, Vishwanathan, Dhillon. NOMAD: Non-locking stOchastic Multi-machine algorithm for Asynchronous and Decentralized matrix completion. VLDB 2013.
• 116. Throughput
Figure: GFLOPS versus thread count (up to 40) on netflix and yahoo for Galois, GraphLab, and Nomad. Theoretical peak: 363.2 GFLOPS.
• 117. Convergence
Figure: error versus wall-clock time on netflix and yahoo for Galois, GraphLab, and Nomad, shown alongside the throughput curves from the previous slide.
• 118. Outline
• Parallel Program = Operator + Schedule + Parallel Data Structure
• Galois system implementation
– Operator, data structures, scheduling
• Galois studies
– Intel study
– Deterministic scheduling
– Adding a scheduler: sparse tiling
– Comparison with transactional memory
– Implementing other programming models
• 119. Transactional Memory
• Alternative to adopting a new programming model
– Parallelization reusing existing sequential programs
• Transactional memory
– Add threads and transactions but keep the existing sequential code
– The TM system ensures correct execution of atomic regions
• Software and hardware implementations
– Speculative execution

Sequential Program:
  while (begin != end)
    begin += 1
  while (head)
    ...
    head = next
  root.size += 1

TM Program:
  fork()
  while (begin != end)
    atomic
      begin += 1
  atomic
    while (head)
      ...
      head = next
    root.size += 1
  join()
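As one concrete (assumed) way to write such atomic regions, GCC's -fgnu-tm extension provides __transaction_atomic blocks; the slide's atomic keyword is abstract, and the data layout below is illustrative.

```cpp
// A minimal sketch of the TM program above using GCC transactional memory.
// Compile with: g++ -fgnu-tm tm_sketch.cpp -c
struct Node { Node* next; };
struct Root { long size; Node* head; };

void process(long* begin, long* end, Root& root) {
  while (begin != end) {
    __transaction_atomic { *begin += 1; }  // each shared update is a transaction
    ++begin;
  }
  __transaction_atomic {                   // traversal and counter update
    for (Node* p = root.head; p; p = p->next) { /* visit p */ }
    root.size += 1;
  }
}
```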
• 122. Stampede
• STAMP [CaoMinh08]
– A set of real-world applications using TM
– Historically, its performance has been poor
• Stampede benchmarks
– Re-implemented STAMP in Galois
– Compared to TinySTM
– Median improvement ratio over STAMP of 4.4X at 64 threads (max: 37X, geomean: 3.9X)
Figure: speedup versus thread count (up to 64) for genome, kmeans-low, ssca2, vacation-low, and yada; Galois versus STM on Blue Gene/Q, median of 5 runs.
Nguyen and Pingali. What scalable transactional programs need from transactional memory. In submission to PLDI 2015.
• 123. Stampede Performance
TM Program:
  fork()
  while (begin != end)
    atomic
      begin += 1
  atomic
    while (head)
      ...
      head = next
    root.size += 1
  join()
• 124. Stampede Performance
• Controlling communication is paramount
– The chief performance cost is moving bits around
• Scalable principle 1: disjoint access
– Recall the implementation of the Bag and the OrderedByIntegerMetric scheduler (a sketch follows)
• Scalable principle 2: virtualize tasks
– Be as flexible as possible to dynamic conditions
• These changes are beyond the capability of TM alone
– Use Galois data structures (disjoint access)
– Use the Galois scheduler (virtualized tasks)
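As a hedged illustration of principle 1, a disjoint-access bag can keep per-thread storage so that pushes and pops touch shared state only when a thread runs dry and steals. This sketch is illustrative, not the actual Galois Bag implementation.

```cpp
#include <deque>
#include <mutex>
#include <optional>
#include <vector>

// Per-thread bag: the common case (push/pop on your own deque) never
// contends with other threads; stealing happens only when a thread is empty.
template <typename T>
class Bag {
  struct PerThread { std::deque<T> items; std::mutex lock; };
  std::vector<PerThread> local_;

public:
  explicit Bag(unsigned threads) : local_(threads) {}

  void push(unsigned tid, T v) {
    std::lock_guard<std::mutex> g(local_[tid].lock);
    local_[tid].items.push_back(std::move(v));
  }

  std::optional<T> pop(unsigned tid) {
    for (unsigned i = 0; i < local_.size(); ++i) {
      auto& p = local_[(tid + i) % local_.size()];  // self first, then steal
      std::lock_guard<std::mutex> g(p.lock);
      if (!p.items.empty()) {
        T v = std::move(p.items.back());
        p.items.pop_back();
        return v;
      }
    }
    return std::nullopt;  // globally empty
  }
};
```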
• 125. Outline
• Parallel Program = Operator + Schedule + Parallel Data Structure
• Galois system implementation
– Operator, data structures, scheduling
• Galois studies
– Intel study
– Deterministic scheduling
– Adding a scheduler: sparse tiling
– Comparison with transactional memory
– Implementing other programming models
• 126. DSLs
• Layer other DSLs on top of Galois
– Other programming models are more restrictive than the operator formulation
• Bulk-synchronous vertex programs
– Nodes are active
– The operator is over a node and its neighbors
– Execution proceeds in rounds
• Reads take values from the previous round
• Overlapping writes are reduced to a single value
– Ex: Pregel [Malewicz10], Ligra [Shun13]
• Gather-Apply-Scatter
– A refinement of bulk-synchronous execution into subrounds
– Ex: GraphLab [Low10] / PowerGraph [Gonzalez12]
(Figure: a node and its neighbors a–e forming one vertex-program neighborhood. A sketch of one round follows.)
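As a hedged sketch of how a bulk-synchronous vertex program can sit on top of an operator-style loop: one round reads every node's value from the previous round and writes a new value. The graph layout and the simplified PageRank-style update are illustrative assumptions, not the Galois or Pregel API.

```cpp
#include <vector>

struct Graph {
  std::vector<std::vector<int>> in_edges;  // in-neighbors of each node
};

// One bulk-synchronous round: all nodes are active, each gathers from its
// neighbors' previous-round values and applies a local update. Double
// buffering (oldVal/newVal) gives the round its deterministic semantics.
void round(const Graph& g, const std::vector<double>& oldVal,
           std::vector<double>& newVal) {
  for (int v = 0; v < static_cast<int>(g.in_edges.size()); ++v) {
    double sum = 0;
    for (int u : g.in_edges[v]) sum += oldVal[u];  // gather (previous round)
    // apply: simplified PageRank-style update; real PageRank would divide
    // each contribution by the out-degree of u
    newVal[v] = 0.15 + 0.85 * sum;
  }
}
```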
• 129. Insufficiency of Vertex Programs
• Connected components problem
– Construct a labeling such that color(x) = color(y) exactly when x and y are in the same component
• Algorithms
– Union-Find-based
– Label propagation
– Hybrids
(Figure: nodes a, b, c before and after pointer jumping. A union-find sketch follows.)
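For contrast with label propagation, here is a minimal union-find sketch with pointer jumping (path halving); this is the standard textbook structure, not code from the talk, and it is exactly the kind of algorithm that does not fit the vertex-program mold.

```cpp
#include <numeric>
#include <vector>

struct UnionFind {
  std::vector<int> parent;

  explicit UnionFind(int n) : parent(n) {
    std::iota(parent.begin(), parent.end(), 0);  // each node is its own root
  }

  int find(int x) {
    while (parent[x] != x) {
      parent[x] = parent[parent[x]];  // pointer jumping / path halving
      x = parent[x];
    }
    return x;
  }

  void unite(int a, int b) { parent[find(a)] = find(b); }
};

// Connected components: call unite(u, v) for every edge (u, v);
// afterwards color(x) = find(x) labels each component consistently.
```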
• 130. Cost to Implement Other Models

  Feature                                        LoC
  GraphLab [Low10]                               0
  PowerGraph [Gonzalez12] (synchronous engine)   200
  PowerGraph (asynchronous engine)               200
  GraphChi [Kyrola12]                            400
  Ligra [Shun13]                                 300
  GraphChi + Ligra (additional)                  100

(Slide 131 highlights the rows that make up PowerGraph-g, the PowerGraph model implemented on top of Galois.)
• 132. Comparison with Other Models
Comparisons
• PowerGraph (distributed)
– Its runtimes on 64 16-core machines (1024 cores) do not beat one 40-core machine running Galois
Inputs
• twitter50 (50M nodes, 2B edges, low diameter)
• road (20M nodes, 60M edges, high diameter)
Applications
• Breadth-first search (bfs)
• Connected components (cc)
• Approximate diameter (dia)
• PageRank (pr)
• Single-source shortest paths (sssp)
Nguyen, Lenharth and Pingali. A lightweight infrastructure for graph analytics. SOSP 2013.
• 133. Figure (slides 133–135): per-application runtime comparison for bfs, cc, dia, pr, and sssp. Left: PowerGraph runtime versus Galois runtime on a log scale (1 to 10000 seconds); right: ratio of PowerGraph runtime to PowerGraph-g runtime (roughly 1x to 6x).
• 136. (Figure repeated from the previous slides.)
• The best algorithm may require application-specific scheduling
– Priority scheduling for SSSP
• The best algorithm may not be expressible as a vertex program
– Connected components with union-find
• Speculative execution is required for high-diameter graphs
– Bulk-synchronous scheduling uses many rounds and has too much overhead
• 139. Outline
• Parallel Program = Operator + Schedule + Parallel Data Structure
• Galois system implementation
– Operator, data structures, scheduling
• Galois studies
– Intel study
– Deterministic scheduling
– Adding a scheduler: sparse tiling
– Comparison with transactional memory
– Implementing other programming models
• 140. Third-party Evaluations
“Galois performs better than other frameworks, and is close to native performance”
— Nadathur et al. Navigating the maze of graph analytics frameworks. SIGMOD 2014.
“This paper demonstrates that speculative parallelization and the operator formalism, central to Galois’ programming model and philosophy, is the best choice for irregular CAD algorithms that operate on graph-based data structures.”
— Moctar and Brisk. Parallel FPGA Routing based on the Operator Formulation. DAC 2014.