Galois: A System for Parallel
Execution of Irregular Algorithms
Donald Nguyen
The University of Texas at Austin
Parallel Programming is Changing
• Data
  – Structured (vector, matrix) versus unstructured (graphs)
• Platforms
  – Dedicated clusters versus cloud, mobile
• People
  – Small number of highly trained scientists versus large number of self-trained parallel programmers
The Search for Good Programming Models
• Tension between productivity and performance
  – Support a large number of application programmers ("Joes") with a small number of expert parallel programmers ("Stephanies")
  – Performance comparable to hand-written codes
• Galois project
Outline
• Parallel Program = Operator + Schedule + Parallel Data Structure
• Galois system implementation
  – Operator, data structures, scheduling
• Galois studies
  – Intel study
  – Deterministic scheduling
  – Adding a scheduler: sparse tiling
  – Comparison with transactional memory
  – Implementing other programming models
SSSP
• Find the shortest distance from the source node to all other nodes in a graph
  – Label nodes with a tentative distance
  – Assume non-negative edge weights
  – BFS is a special case
• Algorithms
  – Chaotic relaxation O(2^V)
  – Bellman-Ford O(VE)
  – Dijkstra's algorithm O(E log V)
    • Uses a priority queue
  – Δ-stepping
    • Uses a sequence of bags to prioritize work
    • Δ=1: Dijkstra
    • Δ=∞: Chaotic relaxation
• Different algorithms are different schedules for applying relaxations
  – Improved work efficiency
  – Input sensitivity
  – Locality
[Figure: small weighted graph with the active node and its neighborhood highlighted; operator: relaxation]
Algorithm = Operator + Schedule
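To make the "algorithm = operator + schedule" point concrete, here is a minimal sequential sketch (plain C++, not the Galois API): the relaxation operator is fixed, and swapping the worklist container from a FIFO to a priority queue keyed on distance switches the schedule from Bellman-Ford-style to Dijkstra-style processing.

#include <climits>
#include <queue>
#include <vector>

struct Edge { int dst, weight; };

// Relaxation operator is fixed across all SSSP variants; the container
// holding active nodes determines the schedule.
void sssp(const std::vector<std::vector<Edge>>& g, int source,
          std::vector<int>& dist) {
  dist.assign(g.size(), INT_MAX);
  dist[source] = 0;
  // FIFO schedule (Bellman-Ford flavor); replacing this queue with a
  // priority queue ordered by dist[] gives Dijkstra's schedule.
  std::queue<int> wl;
  wl.push(source);
  while (!wl.empty()) {
    int n = wl.front(); wl.pop();
    for (const Edge& e : g[n]) {
      if (dist[n] + e.weight < dist[e.dst]) { // relax edge (n, e.dst)
        dist[e.dst] = dist[n] + e.weight;
        wl.push(e.dst);                       // newly active node
      }
    }
  }
}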
Stencil Computation
• Finite difference computation
  – Active nodes: nodes in A_{t+1}
  – Operator: 5-point stencil
  – Different schedules have different locality
• Regular application
  – Grid structure and active nodes known statically
  – Can be parallelized by a compiler
[Figure: grids A_t and A_{t+1}]
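As a concrete illustration (a minimal sketch, not taken from the slides), one time step of a 5-point stencil computes each cell of A_{t+1} from the corresponding cell and its four grid neighbors in A_t; whether the loop nest sweeps row-by-row or tile-by-tile changes locality but not the operator.

#include <vector>

// One time step: every interior cell of 'next' (A_{t+1}) is the average of
// its own value and its four neighbors in 'cur' (A_t).
void step(const std::vector<std::vector<double>>& cur,
          std::vector<std::vector<double>>& next) {
  const size_t R = cur.size(), C = cur[0].size();
  for (size_t i = 1; i + 1 < R; ++i)      // schedule: row-major sweep;
    for (size_t j = 1; j + 1 < C; ++j)    // a tiled loop nest applies the
      next[i][j] = 0.2 * (cur[i][j] +     // same operator with better locality
                          cur[i - 1][j] + cur[i + 1][j] +
                          cur[i][j - 1] + cur[i][j + 1]);
}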
Abstraction of Algorithms
• Operator formulation
  – Active elements: nodes or edges where there is work to be done
  – Operator: computation at an active element
    • Activity: application of the operator to an active element
    • Neighborhood: graph elements read or written by an activity
  – Ordering: order in which active elements must appear to have been processed
    • Unordered algorithms: any order is fine (e.g., chaotic relaxation) but may have soft priorities
    • Ordered algorithms: algorithm-specific order (e.g., discrete event simulation)
  – Parallelism: process activities in parallel subject to neighborhood and ordering constraints
    • Data parallelism is a special case (all nodes active, no conflicts)
[Figure: graph with active nodes and their neighborhoods highlighted]
TAO of Parallelism
Algorithms are classified along three dimensions:
• Topology: structured, semi-structured, unstructured
• Active nodes
  – Location: topology-driven, data-driven
  – Ordering: unordered, ordered
• Operator: morph, local computation, reader
Pingali, Nguyen, et al. The tao of parallelism in algorithms. PLDI 2011.
TAO of Parallelism
[Figure: binding time of schedules — Static, Inspector-Executor, Bulk-Synchronous (round-based), Optimistic (Galois)]
Pingali, Nguyen, et al. The tao of parallelism in algorithms. PLDI 2011.
Galois Project
Program = Algorithm + Data Structure
Parallel Program = Operator + Schedule + Parallel Data Structure
• Joe: Operator + Schedule
• Stephanie: Parallel Data Structures
My Contribution
Design and implementation of a programming model for high-performance parallelism by factoring programs into operator + schedule + data structure
Themes
• Joe
  – Program = Operator + Schedule + Data Structure
• Stephanie
  – Scalability principle 1: disjoint access
    • Accesses to distinct logical entities should be disjoint physically as well
    • "Do no harm"
  – Scalability principle 2: virtualize tasks
    • Execution of tasks should not be eagerly tied to particular physical resources
    • "Be flexible"
• Challenges
  – Easy to introduce inadvertent sharing if not looking at the entire system stack
  – Small task sizes (~100s of instructions)
Publications
• M. Amber Hassaan, Donald Nguyen, Keshav Pingali. Kinetic dependence graphs (to appear). ASPLOS 2015.
• Konstantinos Karantasis, Andrew Lenharth, Donald Nguyen, et al. Parallelization of reordering algorithms for bandwidth and wavefront reduction. SC 2014.
• Damian Goik, Konrad Jopek, Maciej Paszyński, Andrew Lenharth, Donald Nguyen, and Keshav Pingali. Graph grammar based multi-thread multi-frontal direct solver with Galois scheduler. ICCS 2014.
• Donald Nguyen, Andrew Lenharth, Keshav Pingali. Deterministic Galois: On-demand, parameterless and portable. ASPLOS 2014.
• Donald Nguyen, Andrew Lenharth, Keshav Pingali. A lightweight infrastructure for graph analytics. SOSP 2013.
• Keshav Pingali, Donald Nguyen, Milind Kulkarni, et al. The tao of parallelism in algorithms. PLDI 2011.
• Donald Nguyen, Keshav Pingali. Synthesizing concurrent schedulers for irregular algorithms. ASPLOS 2011.
• Milind Kulkarni, Donald Nguyen, Dimitrios Prountzos, et al. Exploiting the commutativity lattice. PLDI 2011.
• Xin Sui, Donald Nguyen, Martin Burtscher, Keshav Pingali. Parallel graph partitioning on multicore architectures. LCPC 2010.
• Mario Méndez-Lojo, Donald Nguyen, Dimitrios Prountzos, et al. Structure-driven optimizations for amorphous data-parallel programs. PPoPP 2010.
• Shih-wei Liao, Tzu-Han Hung, Donald Nguyen, et al. Machine learning-based prefetch optimization for data center applications. SC 2009.
Galois Programming Model
• Galois system (Stephanie)
  – Runtime and library of concurrent data structures
• Galois system (Joe)
  – Set iterators express parallelism
  – Operator: body of the iterator
  – ADT (e.g., graph, set)
• Unordered execution
  – Could have conflicts
  – Some serialization of iterations
  – New iterations can be added
• Ordered execution
  – See Hassaan, Nguyen and Pingali. Kinetic dependence graphs (to appear). ASPLOS 2015.

Graph g
for_each (Edge e : wl)
  Node n = g.getEdgeDst(e)
  …
Hello Graph

#include "Galois/Galois.h"
#include "Galois/Graph/Graph.h"

// Data structure declarations
typedef Galois::Graph::LC_CSR_Graph<int, int> Graph;
typedef Graph::GraphNode GNode;
Graph graph;

// Operator: the neighborhood is defined by the graph API calls it makes
struct P {
  void operator()(GNode& n, Galois::UserContext<GNode>& c) {
    int& d = graph.getData(n);
    if (d++ < 5)
      c.push(n); // new work can be added during execution
  }
};

int main(int argc, char** argv) {
  Galois::Graph::readGraph(graph, argv[1]);
  …
  // Galois iterator: sequential program semantics
  Galois::for_each(initial, P(), Galois::wl<WL>());
  return 0;
}

Scheduling DSL (see the composed example below):

struct TaskIndexer {
  int operator()(GNode& n) {
    return graph.getData(n) >> shift;
  }
};
using namespace Galois::WorkList;
typedef dChunkedFIFO<> WL;                      // chunked FIFO policy, or:
typedef OrderedByIntegerMetric<TaskIndexer> WL; // soft priorities
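The scheduling DSL composes policies. As a hedged illustration (the exact template parameters are an assumption based on the Galois 2.x worklist API, not shown on the slide), a Δ-stepping-like schedule can be built by nesting a distributed chunked FIFO inside OrderedByIntegerMetric:

using namespace Galois::WorkList;
// Assumed composition: OBIM buckets tasks by TaskIndexer and serves each
// bucket with a distributed chunked FIFO.
typedef OrderedByIntegerMetric<TaskIndexer, dChunkedFIFO<64>> DeltaWL;
Galois::for_each(initial, P(), Galois::wl<DeltaWL>());

The operator P is unchanged; only the worklist type, and hence the schedule, differs.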
Galois Operator

Graph g
for_each (Node n : wl)
  for (Node m : g.edges(n))
    …

• All data accesses go through the API
  – Galois data structure implementations maintain marks
  – When marks overlap, roll back the activity and try later
• Cautious
  – Commonly, operators read their entire neighborhood before modifying any element
  – Failsafe point: point in the operator before the first write
• Non-deterministic
• Difference from Thread-Level Speculation [Rauchwerger and Padua, PLDI 1995]
  – Iterations are unordered
  – Operators are cautious
  – Conflicts are detected at the ADT level rather than the memory level
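A minimal sketch of the mark mechanism (plain C++ with atomics; the actual Galois implementation works through the data-structure API): an activity claims every element in its neighborhood before writing, and if some element is already claimed by another activity, it releases its marks and retries later.

#include <atomic>
#include <vector>

constexpr int FREE = -1;

struct Element { std::atomic<int> mark{FREE}; };

// Try to claim every element in the activity's neighborhood for 'task'.
// Returns false (after releasing everything) if another task holds a mark.
bool acquireNeighborhood(std::vector<Element*>& nbr, int task) {
  for (size_t i = 0; i < nbr.size(); ++i) {
    int expected = FREE;
    if (!nbr[i]->mark.compare_exchange_strong(expected, task) &&
        expected != task) {          // conflict: another activity owns it
      for (size_t j = 0; j < i; ++j) // roll back: release what we took
        nbr[j]->mark.store(FREE);
      return false;                  // retry this activity later
    }
  }
  return true; // safe to proceed past the failsafe point
}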
Galois Data Structures
• Graphs
  – May require multiple loads to access the data for a node
  – Exploit local computation structure
  – Simple allocation schemes may be imbalanced in physical memory
• CSR representation: node data[N], edge idx[N], edge data[E], edge dst[E]
  (nd: node data, ed: edge data, dst: edge destination, len: # of edges)
• Inlining (Andrew Lenharth): store a node's edge data and destinations adjacent to its node data (nd, len, ed, dst, …) to reduce loads per node (a CSR sketch follows below)
• Placement: spread virtual memory pages across physical memories (DRAM 0, DRAM 1)
[Figure: CSR layout, edge-inlined layout, and page placement across memories]
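For reference, a minimal CSR graph in plain C++ (an illustration of the layout described above, not the Galois classes):

#include <cstdint>
#include <vector>

// Compressed sparse row graph: the edges of node n occupy the half-open
// range [edgeIdx[n], edgeIdx[n+1]) of the edge arrays.
struct CSRGraph {
  std::vector<int>      nodeData; // nd: one entry per node
  std::vector<uint64_t> edgeIdx;  // N+1 offsets into the edge arrays
  std::vector<uint32_t> edgeDst;  // dst: destination node of each edge
  std::vector<int>      edgeData; // ed: one entry per edge

  // Visit all out-edges of n; the inlined layout on the slide keeps this
  // data next to nodeData[n] instead of in separate arrays.
  template <typename F>
  void forEachEdge(uint32_t n, F f) const {
    for (uint64_t e = edgeIdx[n]; e < edgeIdx[n + 1]; ++e)
      f(edgeDst[e], edgeData[e]);
  }
};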
Scheduling in Galois
• Users write high-level scheduling policies
  – DSL for specifying policies
• Separation of concerns
  – Users understand their operator but have only a vague understanding of scheduling
  – Schedulers are difficult to implement efficiently
• The Galois system synthesizes a scheduler by composing elements from a scheduler library
Nguyen and Pingali. Synthesizing concurrent schedulers for irregular algorithms. ASPLOS 2011.
Scheduling Components and Policies
• Components: FIFO, LIFO, BulkSynchronous, ChunkedFIFO, OrderedByMetric, …
• A worked example of one metric follows the table.

Algorithm | Scheduling Policy                                    | Used By
DMR       | OrderedByMetric(λt.minangle(t)) FIFO                 | Shewchuk96
          | Global: ChunkedFIFO(k), Local: LIFO                  | Kulkarni08
DT        | OrderedByMetric(λp.round(p)) FIFO                    | Amenta03
          | Random                                               | Clarkson89
PFP       | FIFO                                                 | Goldberg88
          | OrderedByMetric(λn.height(n))                        | Cherkassy95
PTA       | FIFO                                                 | Pearce03
          | BulkSynchronous                                      | Nielson99
SSSP      | FIFO                                                 | Bellman-Ford
          | OrderedByMetric(λn.w(n)/Δ + (light(n) ? 0 : ½)) FIFO | Δ-stepping
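As a small worked example of the Δ-stepping metric from the table (an illustrative sketch; the function name is hypothetical): the metric maps each node to an integer bucket, and doubling the index keeps the half-bucket offset for heavy requests in integer arithmetic.

// Bucket index for Δ-stepping: λn.w(n)/Δ + (light(n) ? 0 : ½),
// scaled by 2 so the ½ offset stays integral.
int deltaSteppingIndex(int distance, bool light, int Delta) {
  return 2 * (distance / Delta) + (light ? 0 : 1);
}

Lower indices are drained first, so light requests in a bucket are processed before heavy ones.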
Scheduling Bag
• Abstraction: an unordered bag of tasks
• Naïve implementation: one shared list with a single Head pointer → serialization
• Per-thread lists remove the serialization but can suffer load imbalance
• Per-package bag (Andrew Lenharth): one list per processor package, shared by the cores (c1, c2) on that package, balancing the serialization of a single list against the imbalance of per-thread lists (a simplified sketch follows)
[Figure: bag abstraction; single-headed list; per-thread lists; per-package lists]
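A hedged sketch of the per-package idea in plain C++ (the real Galois bag is lock-free and chunk-based; this simplification uses one mutex-protected deque per package):

#include <deque>
#include <mutex>
#include <optional>
#include <vector>

// One deque per processor package; cores on a package share its deque,
// so contention is limited to package-local cores.
template <typename T>
class PerPackageBag {
  struct Shard { std::mutex lock; std::deque<T> items; };
  std::vector<Shard> shards;
public:
  explicit PerPackageBag(size_t packages) : shards(packages) {}

  void push(size_t pkg, T v) {
    std::lock_guard<std::mutex> g(shards[pkg].lock);
    shards[pkg].items.push_back(std::move(v));
  }

  // Try the local package first; fall back to other packages if empty.
  std::optional<T> pop(size_t pkg) {
    for (size_t i = 0; i < shards.size(); ++i) {
      Shard& s = shards[(pkg + i) % shards.size()];
      std::lock_guard<std::mutex> g(s.lock);
      if (!s.items.empty()) {
        T v = std::move(s.items.back());
        s.items.pop_back();
        return v;
      }
    }
    return std::nullopt;
  }
};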
OrderedByIntegerMetric
• Implements soft priorities
• Array of scheduling bags indexed by priority, with a shared cursor at the current priority
• Drawbacks
  – Dense range of priorities
  – Contention on the cursor location
[Figure: priority array 0–5 with the cursor at 3, polled by two cores]
Improved OrderedByIntegerMetric (implementation by Andrew Lenharth)
• Global map from priority values to scheduling bags
• Per-core local maps cache priority-to-bag lookups, reducing contention on the global map and cursor
[Figure: animation of cores consulting their local maps and the global map as work at priorities 0, 1, 3, 5 arrives]
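A simplified single-map sketch of the idea (plain C++; the real implementation is concurrent, with the per-core local maps described above):

#include <map>
#include <optional>
#include <vector>

// Soft-priority scheduler: a map from priority to a bag of tasks, drained
// from the lowest priority. Priorities are only a hint, so tasks may be
// executed out of strict priority order.
template <typename T>
class OBIM {
  std::map<int, std::vector<T>> bags; // priority -> bag
public:
  void push(int prio, T v) { bags[prio].push_back(std::move(v)); }

  std::optional<T> pop() {
    while (!bags.empty()) {
      auto it = bags.begin();         // lowest priority first
      if (it->second.empty()) { bags.erase(it); continue; }
      T v = std::move(it->second.back());
      it->second.pop_back();
      return v;
    }
    return std::nullopt;
  }
};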
Galois System
• Joe
  – Writes sequential code
  – Uses Galois data structures and scheduling hints
• Stephanie
  – Writes concurrent code
  – Provides a variety of data structures and scheduling policies for Joe to choose from
• cf. frameworks that only support one scheduler or one graph implementation
Intel Study
[Figure (Nadathur et al., SIGMOD 2014, Figure 3): performance of Native, CombBLAS, GraphLab, SociaLite, Giraph, and Galois on workloads small enough to run on a single node — (a) PageRank, (b) breadth-first search, (c) collaborative filtering, (d) triangle counting — over the LiveJournal, Facebook, Wikipedia, Netflix, and synthetic graphs. The y-axis is runtime in log scale, so lower is better.]
Nadathur et al. Navigating the maze of graph analytics frameworks. SIGMOD 2014.
Deterministic Interference Graph (DIG) Scheduling
• Traditional Galois execution
  – Asynchronous, non-deterministic
• Deterministic Galois execution
  – Round-based execution
  – Construct an interference graph
  – Execute an independent set
  – Repeat
[Figure: tasks A, B, C over a data structure and the interference graph they induce]
Nguyen, Lenharth and Pingali. Deterministic Galois: on-demand, parameterless and portable. ASPLOS 2014.
DIG Scheduling in Practice
• Just find roots in the interference graph
• Sequence of rounds; each round has two phases (a sketch of the two phases follows)
  • Phase 1: mark neighborhoods
    – Acquire marks by writing the task id atomically, until the failsafe point
    – Overwrite an id only if replacing a greater value
    – Final mark values are deterministic
  • Phase 2: execute roots
    – Execute tasks whose neighborhoods contain only their own marks
[Figure: tasks with ids A < B < C writing marks into overlapping neighborhoods; the smaller id wins]
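A schematic sketch of one round (plain C++ with atomics; the real system marks through the data-structure API). Phase 1 writes the minimum task id into every element of each task's neighborhood; phase 2 executes exactly the tasks that own all of their marks, which is a deterministic set:

#include <atomic>
#include <climits>
#include <vector>

struct Elem { std::atomic<int> mark{INT_MAX}; };
struct Task { int id; std::vector<Elem*> nbr; };

// Phase 1: each element ends up marked with the smallest id that touches it.
void markPhase(std::vector<Task>& tasks) {
  for (Task& t : tasks)            // run in parallel in the real system
    for (Elem* e : t.nbr) {
      int cur = e->mark.load();
      while (t.id < cur &&         // overwrite only greater values
             !e->mark.compare_exchange_weak(cur, t.id)) {}
    }
}

// Phase 2: a task is a root iff it holds every mark in its neighborhood.
bool isRoot(const Task& t) {
  for (Elem* e : t.nbr)
    if (e->mark.load() != t.id) return false;
  return true;
}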
Optimizations
• Resuming tasks
  – Avoid re-executing tasks
  – Suspend and resume execution at the failsafe point
  – Provide buffers to save local state
• Windowing
  – Inspect only a subset of tasks at a time
  – An adaptive algorithm varies the window size
Evaluation of Deterministic Scheduling
Platform
• Intel 4x10-core (2.27 GHz)
Deterministic systems
• RCDC [Devietti11]
  – Deterministic by scheduling
  – Implementation of the Kendo algorithm [Olszewski09]
• PBBS [Blelloch11]
  – Deterministic by construction
  – Handwritten deterministic implementations
• Deterministic Galois
  – Non-deterministic programs (g-n) automatically made deterministic (g-d)
Applications (from PBBS)
• Breadth-first search (bfs)
• Delaunay mesh refinement (dmr)
• Delaunay triangulation (dt)
• Maximal independent set (mis)
Evaluation
[Figure: speedup versus threads (1–40) for bfs, dmr, dt, and mis, comparing the non-deterministic Galois program (g-n), the deterministic Galois program (g-d), PBBS (deterministic by construction), and RCDC (deterministic by scheduling)]
Third-party Evaluation
• In FPGA routing it is useful to have deterministic results

Maze router speedup (averages):
                    Deterministic Scheduler | Non-Deterministic Scheduler
VPR 5.0                       1.00x         |           1.00x
Galois (1 Thread)             0.79x         |           0.89x
Galois (2 Threads)            1.21x         |           1.47x
Galois (4 Threads)            2.14x         |           2.99x
Galois (8 Threads)            3.67x         |           5.46x

Extended results from: Moctar and Brisk. Parallel FPGA Routing based on the Operator Formulation. DAC 2014.
Matrix Completion
• Matrix completion problem
  – Netflix problem
• An optimization problem
  – Find an m×k matrix W and a k×n matrix H (k << min(m,n)) such that A ≈ WH
  – Low-rank approximation
• Algorithms
  – Stochastic gradient descent (SGD)
  – …
[Figure: matrix view (A ≈ WH, with rating A_ij predicted by W_i* · H_*j) and graph view (bipartite graph between users and movies)]
Naïve SGD Implementation
• Row-wise operator

for (User u : users)
  for (Item m : items(u))
    op(u, m, Aij);

• Poor cache behavior
  – With k = 100, W_i* ≈ 1KB
  – degree(user) > 1000
2D Tiled SGD
• Apply the operator to small 2D tiles (a sketch of the tiled loop structure follows)
• Additional concerns
  – Conflict-free scheduling of tiles
• Future optimizations
  – Adaptive, non-uniform tiling
[Figure: A partitioned into tiles, each touching a block of W's rows and H's columns]
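A schematic of the tiled loop structure (plain C++; a sketch of the idea, not the Galois implementation). Tiles on the same anti-diagonal touch disjoint rows of W and disjoint columns of H, so they can run conflict-free in parallel:

// Process ratings tile by tile; one SGD update per rating (u, m).
// Tiles (ti, tj) and (ti', tj') conflict only if ti == ti' (shared W rows)
// or tj == tj' (shared H columns), so each anti-diagonal of the tile grid
// is a set of independent tasks.
void tiledSGD(int userTiles, int itemTiles, int tileSize) {
  for (int d = 0; d < userTiles + itemTiles - 1; ++d) // anti-diagonals
    for (int ti = 0; ti < userTiles; ++ti) {
      int tj = d - ti;
      if (tj < 0 || tj >= itemTiles) continue;
      // Tile (ti, tj) is an independent task within this diagonal:
      // apply the SGD update to every rating A(u, m) with
      // u in [ti*tileSize, (ti+1)*tileSize) and
      // m in [tj*tileSize, (tj+1)*tileSize).
    }
}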
Evaluation of Tiling
Parameters
• Same SGD parameters and initial latent vectors across implementations
• Step size: ε(n) = α / (1 + β·n^1.5)
• Hand-tuned tile size
Implementations
• Galois
  – 2D scheduling (2 weeks to implement)
  – Synchronized updates
• GraphLab
  – Standard GraphLab only supports GD
  – Implements 1D SGD with unsynchronized updates
• Nomad
  – From-scratch distributed code (~1 year to implement)
  – 2D scheduling; tile size is a function of #threads
  – Synchronized updates
Datasets
          Items | Users | Ratings | Sparsity
netflix   17K   | 480K  | 99M     | 1e-2
yahoo     624K  | 1M    | 253M    | 4e-4
Yun, Yu, Hsieh, Vishwanathan, Dhillon. NOMAD: Non-locking stOchastic Multi-machine algorithm for Asynchronous and Decentralized matrix completion. VLDB 2013.
Throughput
[Figure: GFLOPS versus threads (up to 40) on netflix and yahoo for Galois, GraphLab, and Nomad; theoretical peak: 363.2 GFLOPS]
Convergence
[Figure: error versus time on netflix (0–200 s) and yahoo (0–600 s) for Galois, GraphLab, and Nomad]
Transactional Memory
• Alternative to adopting a new programming model
  – Parallelization reusing existing sequential programs
• Transactional memory
  – Add threads and transactions but keep existing sequential code
  – TM system ensures correct execution of atomic regions
    • Software and hardware implementations
  – Speculative execution

Sequential Program:
while (begin != end)
  begin += 1
  while (head)
    ...
    head = next
  root.size += 1

TM Program:
fork()
while (begin != end)
  atomic
    begin += 1
  atomic
    while (head)
      ...
      head = next
    root.size += 1
join()
Stampede
• STAMP [CaoMinh08]
  – Set of real-world applications using TM
  – Historically, performance has been poor
• Stampede benchmarks
  – Re-implemented STAMP in Galois
  – Compared to TinySTM
  – Median improvement ratio over STAMP of 4.4X at 64 threads (max: 37X, geomean: 3.9X)
• Blue Gene/Q; median of 5 runs
[Figure: speedup versus threads (up to 64) for genome, kmeans-low, ssca2, vacation-low, and yada; Galois versus STM]
Nguyen and Pingali. What scalable transactional programs need from transactional memory. In submission to PLDI 2015.
Stampede Performance
• Controlling communication is paramount
  – The chief performance cost is moving bits around
• Scalability principle 1: disjoint access
  – Recall the implementations of the Bag and the OrderedByIntegerMetric scheduler
• Scalability principle 2: virtualize tasks
  – Be as flexible as possible under dynamic conditions
• These changes are beyond the capability of TM
  – Use Galois data structures (disjoint access)
  – Use the Galois scheduler (virtualized tasks)
DSLs
• Layer other DSLs on top of Galois
  – Other programming models are more restrictive than the operator formulation
• Bulk-Synchronous Vertex Programs (see the sketch after this list)
  – Nodes are active
  – Operator is over a node and its neighbors
  – Execution is in rounds
    • Read values are taken from the previous round
    • Overlapping writes are reduced to a single value
  – Ex: Pregel [Malewicz10], Ligra [Shun13]
• Gather-Apply-Scatter
  – Refinement of bulk-synchronous into subrounds
  – Ex: GraphLab [Low10] / PowerGraph [Gonzalez12]
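A minimal sketch of the bulk-synchronous vertex-program pattern (plain C++; an illustration of the model, not any framework's API). Reads come from the previous round's array and writes go to the next round's, so all vertices in a round are trivially parallel:

#include <vector>

struct VGraph { std::vector<std::vector<int>> inNbrs; };

// One bulk-synchronous round: each node's new value is reduced (here: min)
// over its own old value and its in-neighbors' old values.
std::vector<int> round(const VGraph& g, const std::vector<int>& prev) {
  std::vector<int> next(prev);             // writes target the next round
  for (size_t v = 0; v < prev.size(); ++v) // every vertex active
    for (int u : g.inNbrs[v])
      if (prev[u] < next[v]) next[v] = prev[u]; // reduce overlapping writes

  return next;
}

With values initialized to distinct node ids, iterating this round computes connected components by label propagation; on a chain of length d it needs d rounds, which is the high-diameter overhead discussed below.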
Insufficiency of Vertex Programs
• Connected components problem
  – Construct a labeling function such that color(x) = color(neighbor(x))
• Algorithms
  – Union-Find-based
  – Label propagation
  – Hybrids
[Figure: pointer jumping shortening a chain a → b → c]
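For contrast, a minimal union-find with path compression (plain C++; an illustrative sketch): the find step's pointer jumping mutates shared parent pointers across arbitrary nodes, which does not fit the one-vertex-and-its-neighbors shape of a vertex program:

#include <numeric>
#include <vector>

struct UnionFind {
  std::vector<int> parent;
  explicit UnionFind(int n) : parent(n) {
    std::iota(parent.begin(), parent.end(), 0); // each node its own root
  }
  int find(int x) {
    while (parent[x] != x) {
      parent[x] = parent[parent[x]]; // pointer jumping / path compression
      x = parent[x];
    }
    return x;
  }
  void unite(int a, int b) { parent[find(a)] = find(b); }
};

Calling unite over every edge and then find over every node labels each component by its root.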
Cost to Implement Other Models

Feature                                      | LoC
GraphLab [Low10]                             | 0
PowerGraph [Gonzalez12] (synchronous engine) | 200
PowerGraph (asynchronous engine)             | 200
GraphChi [Kyrola12]                          | 400
Ligra [Shun13]                               | 300
GraphChi + Ligra (additional)                | 100

PowerGraph-g: the PowerGraph model implemented on top of Galois (used in the comparison below).
Comparison with Other Models
Comparisons
• PowerGraph (distributed)
  – Runtimes with 64 16-core machines (1024 cores) do not beat one 40-core machine using Galois
Inputs
• twitter50 (50M nodes, 2B edges, low-diameter)
• road (20M nodes, 60M edges, high-diameter)
Applications
• Breadth-first search (bfs)
• Connected components (cc)
• Approximate diameter (dia)
• PageRank (pr)
• Single-source shortest paths (sssp)
Nguyen, Lenharth and Pingali. A lightweight infrastructure for graph analytics. SOSP 2013.
133
PowerGraph runtime
Galois runtime
1
10
100
1000
10000
1
10
100
1000
10000
bfs cc dia pr sssp
PowerGraph runtime
PowerGraph⇠g runtime
1
2
1
2
4
6
bfs cc dia pr sssp
134
PowerGraph runtime
Galois runtime
1
10
100
1000
10000
1
10
100
1000
10000
bfs cc dia pr sssp
PowerGraph runtime
PowerGraph⇠g runtime
1
2
1
2
4
6
bfs cc dia pr sssp
PowerGraph runtime
Galois runtime
1
10
100
1000
10000
1
10
100
1000
10000
bfs cc dia pr sssp
PowerGraph runtime
PowerGraph⇠g runtime
1
2
1
2
4
6
bfs cc dia pr sssp
135
• The best algorithm may require application-specific scheduling
– Priority scheduling for SSSP (a bucketed sketch follows this slide)
• The best algorithm may not be expressible as a vertex program
– Connected components with union-find
• Speculative execution is required for high-diameter graphs
– Bulk-synchronous scheduling uses many rounds and has too much overhead
136
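To make the priority-scheduling point concrete, below is a sequential, Δ-stepping-flavored SSSP sketch in plain C++ (a toy illustration, not the Galois scheduler, and it omits Δ-stepping's light/heavy edge split). Work items are binned into buckets of width DELTA and drained in increasing bucket order, the soft-priority policy that a fixed bulk-synchronous engine cannot express:

#include <cstddef>
#include <cstdint>
#include <limits>
#include <vector>

// Weighted CSR graph: edges of u are (dst[k], w[k]) for k in [idx[u], idx[u+1]).
struct Graph {
  std::vector<std::size_t> idx, dst;
  std::vector<std::uint32_t> w;
};

// Bucketed SSSP: buckets of width DELTA (DELTA >= 1) approximate a
// priority queue. DELTA = 1 behaves like Dijkstra on integer weights;
// a huge DELTA degenerates to chaotic relaxation in a single bucket.
std::vector<std::uint64_t> sssp(const Graph& g, std::size_t src,
                                std::uint64_t DELTA) {
  const std::uint64_t INF = std::numeric_limits<std::uint64_t>::max();
  std::vector<std::uint64_t> dist(g.idx.size() - 1, INF);
  std::vector<std::vector<std::size_t>> buckets(1);
  dist[src] = 0;
  buckets[0].push_back(src);
  for (std::size_t b = 0; b < buckets.size(); ++b) {       // priority order
    for (std::size_t i = 0; i < buckets[b].size(); ++i) {  // drain bucket b
      std::size_t u = buckets[b][i];
      if (dist[u] / DELTA != b) continue;                  // stale entry
      for (std::size_t e = g.idx[u]; e < g.idx[u + 1]; ++e) {
        std::size_t v = g.dst[e];
        std::uint64_t nd = dist[u] + g.w[e];
        if (nd < dist[v]) {                                // relax edge u -> v
          dist[v] = nd;
          std::size_t nb = nd / DELTA;                     // nb >= b always
          if (nb >= buckets.size()) buckets.resize(nb + 1);
          buckets[nb].push_back(v);
        }
      }
    }
  }
  return dist;
}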
Outline
• Parallel Program = Operator + Schedule + Parallel Data Structure
• Galois system implementation
– Operator, data structures, scheduling
• Galois studies
– Intel study
– Deterministic scheduling
– Adding a scheduler: sparse tiling
– Comparison with transactional memory
– Implementing other programming models
139
Third-party Evaluations
140
Nadathur et al. Navigating the maze of graph analytics frameworks. SIGMOD 2014.
“Galois performs better than other frameworks, and is close to native performance”
“This paper demonstrates that speculative parallelization and the operator formalism, central to Galois’ programming model and philosophy, is the best choice for irregular CAD algorithms that operate on graph-based data structures.”
Moctar and Brisk. Parallel FPGA Routing based on the Operator Formulation. DAC 2014.
141
Galois: A System for Parallel Execution of Irregular Algorithms

  • 1. Galois: A System for Parallel Execution of Irregular Algorithms Donald Nguyen The University of Texas at Austin
  • 2. Parallel Programming is Changing • Data – Structured (vector, matrix) versus unstructured (graphs) • Platforms – Dedicated clusters versus cloud, mobile • People – Small number of highly trained scientists versus large number of self- trained parallel programmers Old World New World 2
  • 3. The Search for Good Programming Models • Tension between productivity and performance – Support large number of application programmers with small number of expert parallel programmers – Performance comparable to hand-written codes • Galois project 3 Joe Stephanie
  • 4. The Search for Good Programming Models • Tension between productivity and performance – Support large number of application programmers with small number of expert parallel programmers – Performance comparable to hand-written codes • Galois project 4 Joe Stephanie
  • 5. Outline • Parallel Program = Operator + Schedule + e • Galois system implementation – Operator, data structures, scheduling • Galois studies – Intel study – Deterministic scheduling – Adding a scheduler: sparse tiling – Comparison with transactional memory – Implementing other programming models 5 Parallel Data Structure
  • 6. SSSP • Find the shortest distance from source node to all other nodes in a graph – Label nodes with tentative distance – Assume non-negative edge weights – BFS is a special case • Algorithms – Chaoticrelaxation O(2V) – Bellman-Ford O(VE) – Dijkstra’s algorithm O(E log V) • Uses priority queue – Δ-stepping • Uses sequence of bags to prioritize work • Δ=1: Dijkstra • Δ=∞: Chaotic relaxation • Different algorithms are different schedules for applying relaxations – Improved work efficiency – Input sensitivity – Locality 5 3 1 9 ∞∞ ∞ 0 1 9 3 Operator Relaxation 2 6
  • 7. SSSP • Find the shortest distance from source node to all other nodes in a graph – Label nodes with tentative distance – Assume non-negative edge weights – BFS is a special case • Algorithms – Chaoticrelaxation O(2V) – Bellman-Ford O(VE) – Dijkstra’s algorithm O(E log V) • Uses priority queue – Δ-stepping • Uses sequence of bags to prioritize work • Δ=1: Dijkstra • Δ=∞: Chaotic relaxation • Different algorithms are different schedules for applying relaxations – Improved work efficiency – Input sensitivity – Locality 5 3 1 9 ∞∞ ∞ 0 Active node 1 9 3 Operator Relaxation 2 7
  • 8. SSSP • Find the shortest distance from source node to all other nodes in a graph – Label nodes with tentative distance – Assume non-negative edge weights – BFS is a special case • Algorithms – Chaoticrelaxation O(2V) – Bellman-Ford O(VE) – Dijkstra’s algorithm O(E log V) • Uses priority queue – Δ-stepping • Uses sequence of bags to prioritize work • Δ=1: Dijkstra • Δ=∞: Chaotic relaxation • Different algorithms are different schedules for applying relaxations – Improved work efficiency – Input sensitivity – Locality 5 3 1 9 ∞∞ ∞ 0 Active node 1 9 4 3 1 3 Operator Relaxation 2 8
  • 9. SSSP • Find the shortest distance from source node to all other nodes in a graph – Label nodes with tentative distance – Assume non-negative edge weights – BFS is a special case • Algorithms – Chaoticrelaxation O(2V) – Bellman-Ford O(VE) – Dijkstra’s algorithm O(E log V) • Uses priority queue – Δ-stepping • Uses sequence of bags to prioritize work • Δ=1: Dijkstra • Δ=∞: Chaotic relaxation • Different algorithms are different schedules for applying relaxations – Improved work efficiency – Input sensitivity – Locality 5 3 1 9 ∞∞ ∞ 0 Active node Neighborhood Activity 1 9 4 3 1 3 Operator Relaxation 2 9
  • 10. SSSP • Find the shortest distance from source node to all other nodes in a graph – Label nodes with tentative distance – Assume non-negative edge weights – BFS is a special case • Algorithms – Chaoticrelaxation O(2V) – Bellman-Ford O(VE) – Dijkstra’s algorithm O(E log V) • Uses priority queue – Δ-stepping • Uses sequence of bags to prioritize work • Δ=1: Dijkstra • Δ=∞: Chaotic relaxation • Different algorithms are different schedules for applying relaxations – Improved work efficiency – Input sensitivity – Locality 5 3 1 9 ∞ 0 Active node 1 9 4 3 1 3 Operator Relaxation 2 1 5 10
  • 11. SSSP • Find the shortest distance from source node to all other nodes in a graph – Label nodes with tentative distance – Assume non-negative edge weights – BFS is a special case • Algorithms – Chaoticrelaxation O(2V) – Bellman-Ford O(VE) – Dijkstra’s algorithm O(E log V) • Uses priority queue – Δ-stepping • Uses sequence of bags to prioritize work • Δ=1: Dijkstra • Δ=∞: Chaotic relaxation • Different algorithms are different schedules for applying relaxations – Improved work efficiency – Input sensitivity – Locality 5 3 1 9 ∞ 0 Active node 1 9 4 3 1 3 Operator Relaxation 2 1 5 11
  • 12. SSSP • Find the shortest distance from source node to all other nodes in a graph – Label nodes with tentative distance – Assume non-negative edge weights – BFS is a special case • Algorithms – Chaoticrelaxation O(2V) – Bellman-Ford O(VE) – Dijkstra’s algorithm O(E log V) • Uses priority queue – Δ-stepping • Uses sequence of bags to prioritize work • Δ=1: Dijkstra • Δ=∞: Chaotic relaxation • Different algorithms are different schedules for applying relaxations – Improved work efficiency – Input sensitivity – Locality 5 3 1 9 ∞ 0 Active node 1 9 4 3 1 3 Operator Relaxation 2 Algorithm = Operator + Schedule 1 5 12
  • 13. Stencil Computation • Finite difference computation – Active nodes: nodes in At+1 – Operator: 5 point stencil – Different schedules have different locality • Regular application – Grid structure and active nodes known statically – Can be parallelized by a compiler 13 At At+1
  • 14. • Operator formulation – Active elements: nodes or edges where there is work to be done – Operator: computation at active element • Activity: application of operator to active element • Neighborhood: graph elements read or written by activity – Ordering: order in which active elements must appear to have been processed • Unordered algorithms: any order is fine (e.g., chaotic relaxation) but may have soft priorities • Ordered algorithms: algorithm-specific order (e.g., discrete event simulation) – Parallelism: process activities in parallel subject to neighborhood and ordering constraints • Data parallelism is a special case (all nodes active, no conflicts) : active node : neighborhood Abstraction of Algorithms 14
  • 15. • Operator formulation – Active elements: nodes or edges where there is work to be done – Operator: computation at active element • Activity: application of operator to active element • Neighborhood: graph elements read or written by activity – Ordering: order in which active elements must appear to have been processed • Unordered algorithms: any order is fine (e.g., chaotic relaxation) but may have soft priorities • Ordered algorithms: algorithm-specific order (e.g., discrete event simulation) – Parallelism: process activities in parallel subject to neighborhood and ordering constraints • Data parallelism is a special case (all nodes active, no conflicts) : active node : neighborhood Abstraction of Algorithms 15
  • 16. • Operator formulation – Active elements: nodes or edges where there is work to be done – Operator: computation at active element • Activity: application of operator to active element • Neighborhood: graph elements read or written by activity – Ordering: order in which active elements must appear to have been processed • Unordered algorithms: any order is fine (e.g., chaotic relaxation) but may have soft priorities • Ordered algorithms: algorithm-specific order (e.g., discrete event simulation) – Parallelism: process activities in parallel subject to neighborhood and ordering constraints • Data parallelism is a special case (all nodes active, no conflicts) : active node : neighborhood Abstraction of Algorithms 16
  • 17. • Operator formulation – Active elements: nodes or edges where there is work to be done – Operator: computation at active element • Activity: application of operator to active element • Neighborhood: graph elements read or written by activity – Ordering: order in which active elements must appear to have been processed • Unordered algorithms: any order is fine (e.g., chaotic relaxation) but may have soft priorities • Ordered algorithms: algorithm-specific order (e.g., discrete event simulation) – Parallelism: process activities in parallel subject to neighborhood and ordering constraints • Data parallelism is a special case (all nodes active, no conflicts) : active node : neighborhood Abstraction of Algorithms 17
  • 18. • Operator formulation – Active elements: nodes or edges where there is work to be done – Operator: computation at active element • Activity: application of operator to active element • Neighborhood: graph elements read or written by activity – Ordering: order in which active elements must appear to have been processed • Unordered algorithms: any order is fine (e.g., chaotic relaxation) but may have soft priorities • Ordered algorithms: algorithm-specific order (e.g., discrete event simulation) – Parallelism: process activities in parallel subject to neighborhood and ordering constraints • Data parallelism is a special case (all nodes active, no conflicts) : active node : neighborhood Abstraction of Algorithms 18
  • 19. • Operator formulation – Active elements: nodes or edges where there is work to be done – Operator: computation at active element • Activity: application of operator to active element • Neighborhood: graph elements read or written by activity – Ordering: order in which active elements must appear to have been processed • Unordered algorithms: any order is fine (e.g., chaotic relaxation) but may have soft priorities • Ordered algorithms: algorithm-specific order (e.g., discrete event simulation) – Parallelism: process activities in parallel subject to neighborhood and ordering constraints • Data parallelism is a special case (all nodes active, no conflicts) : active node : neighborhood Abstraction of Algorithms 19 B A C D
  • 20. • Operator formulation – Active elements: nodes or edges where there is work to be done – Operator: computation at active element • Activity: application of operator to active element • Neighborhood: graph elements read or written by activity – Ordering: order in which active elements must appear to have been processed • Unordered algorithms: any order is fine (e.g., chaotic relaxation) but may have soft priorities • Ordered algorithms: algorithm-specific order (e.g., discrete event simulation) – Parallelism: process activities in parallel subject to neighborhood and ordering constraints • Data parallelism is a special case (all nodes active, no conflicts) : active node : neighborhood Abstraction of Algorithms 20 B A C D
  • 21. TAO of Parallelism Algorithms Topology Active Nodes Operator Structured Semi-structured Unstructured Location Ordering Morph Local computation Reader Topology-driven Data-driven Unordered Ordered 21 Pingali, Nguyen, et al. The tao of parallelism in algorithms. PLDI 2011.
  • 22. TAO of Parallelism 22 Pingali, Nguyen, et al. The tao of parallelism in algorithms. PLDI 2011. Galois Static Inspector-Executor Bulk-Synchronous Optimistic Round-based
  • 23. Galois Project Stephanie: Parallel Data Structures Joe: Operator + Schedule Program = Algorithm + Data Structure
  • 24. Galois Project Parallel Program = Operator + Schedule + Parallel Data Structure Stephanie: Parallel Data Structures Joe: Operator + Schedule Program = Algorithm + Data Structure
  • 25. My Contribution Design and implementation of a programming model for high-performance parallelism by factoring programs into operator + schedule + data structure 25
  • 26. Themes • Joe – Program = Operator + Schedule + Data Structure • Stephanie – Scalability principle1: disjoint access • Accesses to distinct logical entities should be disjoint physically as well • “Do no harm” – Scalability principle2: virtualizetasks • Execution of tasks should not be eagerly tied to particular physical resources • “Be flexible” • Challenges – Easy to introduce inadvertentsharing if not looking at entire system stack – Small task sizes (~ 100s of instructions) 26
  • 27. Publications • M. Amber Hassaan, Donald Nguyen, Keshav Pingali. Kinetic Dependence Graphs (to appear). ASPLOS 2015. • Konstantions Karantasis, Andrew Lenharth, Donald Nguyen, et al. Parallelization of reordering algorithms for bandwidth and wavefront reduction. SC 2014. • Damian Goik, Konrad Jopek, Maciej Paszyński, Andrew Lenharth, Donald Nguyen, and Keshav Pingali. Graph grammar based multi-thread multi-frontal direct solver with galois scheduler. ICCS 2014. • Donald Nguyen, Andrew Lenharth, Keshav Pingali. Deterministic Galois: On-demand, parameterless and portable. ASPLOS 2014. • Donald Nguyen, Andrew Lenharth, Keshav Pingali. A lightweight infrastructure for graph analytics. SOSP 2013. • Keshav Pingali, Donald Nguyen, Milind Kulkarni, et al. The tao of parallelism in algorithms. PLDI 2011. • Donald Nguyen, Keshav Pingali. Synthesizing concurrent schedulers for irregular algorithms. ASPLOS 2011. • Milind Kulkarni, Donald Nguyen, Dimitrios Prountzos, et al. Exploiting the commutativity lattice. PLDI 2011. • Xin Sui, Donald Nguyen, Martin Burtscher, Keshav Pingali. Parallel Graph Partitioning on Multicore Architectures. LCPC 2010. • Mario Méndez-Lojo, Donald Nguyen, Dimitrios Prountzos, et al. Structure-driven optimizations for amorphous data-parallel programs. PPOPP 2010. • Shih-wei Liao, Tzu-Han Hung, Donald Nguyen, et al. Machine learning-based prefetch optimization for data center applications. SC 2009. 27
  • 28. Publications • M. Amber Hassaan, Donald Nguyen, Keshav Pingali. Kinetic Dependence Graphs (to appear). ASPLOS 2015. • Konstantions Karantasis, Andrew Lenharth, Donald Nguyen, et al. Parallelization of reordering algorithms for bandwidth and wavefront reduction. SC 2014. • Damian Goik, Konrad Jopek, Maciej Paszyński, Andrew Lenharth, Donald Nguyen, and Keshav Pingali. Graph grammar based multi-thread multi-frontal direct solver with galois scheduler. ICCS 2014. • Donald Nguyen, Andrew Lenharth, Keshav Pingali. Deterministic Galois: On-demand, parameterless and portable. ASPLOS 2014. • Donald Nguyen, Andrew Lenharth, Keshav Pingali. A lightweight infrastructure for graph analytics. SOSP 2013. • Keshav Pingali, Donald Nguyen, Milind Kulkarni, et al. The tao of parallelism in algorithms. PLDI 2011. • Donald Nguyen, Keshav Pingali. Synthesizing concurrent schedulers for irregular algorithms. ASPLOS 2011. • Milind Kulkarni, Donald Nguyen, Dimitrios Prountzos, et al. Exploiting the commutativity lattice. PLDI 2011. • Xin Sui, Donald Nguyen, Martin Burtscher, Keshav Pingali. Parallel Graph Partitioning on Multicore Architectures. LCPC 2010. • Mario Méndez-Lojo, Donald Nguyen, Dimitrios Prountzos, et al. Structure-driven optimizations for amorphous data-parallel programs. PPOPP 2010. • Shih-wei Liao, Tzu-Han Hung, Donald Nguyen, et al. Machine learning-based prefetch optimization for data center applications. SC 2009. 28
  • 29. Outline • Parallel Program = Operator + Schedule + e • Galois system implementation – Operator, data structures, scheduling • Galois studies – Intel study – Deterministic scheduling – Adding a scheduler: sparse tiling – Comparison with transactional memory – Implementing other programming models 29 Parallel Data Structure
  • 30. Galois Programming Model • Galois system (Stephanie) – Runtimeand libraryof concurrent data structures • Galois system (Joe) – Set iteratorsexpress parallelism – Operator: body of iterator – ADT (e.g., graph, set) • Unordered execution – Could haveconflicts – Some serialization of iterations – New iterations can be added • Ordered execution – See Hassaan, Nguyen and Pingali. Kinetic Dependence Graphs (to appear). ASPLOS 2015. Graph g for_each (Edge e : wl) Node n = g.getEdgeDst(e) …
  • 31. Hello Graph 31 #include “Galois/Galois.h” #include “Galois/Graph/Graph.h” typedef Galois::Graph::LC_CSR_Graph<int, int> Graph; typedef Graph::GraphNode GNode; Graph graph; struct P { void operator()(GNode& n, Galois::UserContext<GNode>& c) { int& d = graph.getData(n); if (d.value++ < 5) c.push(n); } }; int main(int argc, char** argv) { Galois::Graph::readGraph(graph, argv[1]); … Galois::for_each(initial, P(), Galois::wl<WL>()); return 0; }
  • 32. Hello Graph 32 #include “Galois/Galois.h” #include “Galois/Graph/Graph.h” typedef Galois::Graph::LC_CSR_Graph<int, int> Graph; typedef Graph::GraphNode GNode; Graph graph; struct P { void operator()(GNode& n, Galois::UserContext<GNode>& c) { int& d = graph.getData(n); if (d.value++ < 5) c.push(n); } }; int main(int argc, char** argv) { Galois::Graph::readGraph(graph, argv[1]); … Galois::for_each(initial, P(), Galois::wl<WL>()); return 0; } Data structure declarations Galois Iterator Operator
  • 33. Hello Graph 33 #include “Galois/Galois.h” #include “Galois/Graph/Graph.h” typedef Galois::Graph::LC_CSR_Graph<int, int> Graph; typedef Graph::GraphNode GNode; Graph graph; struct P { void operator()(GNode& n, Galois::UserContext<GNode>& c) { int& d = graph.getData(n); if (d.value++ < 5) c.push(n); } }; int main(int argc, char** argv) { Galois::Graph::readGraph(graph, argv[1]); … Galois::for_each(initial, P(), Galois::wl<WL>()); return 0; } Data structure declarations Galois Iterator Operator Sequential program semantics
  • 34. Hello Graph 34 #include “Galois/Galois.h” #include “Galois/Graph/Graph.h” typedef Galois::Graph::LC_CSR_Graph<int, int> Graph; typedef Graph::GraphNode GNode; Graph graph; struct P { void operator()(GNode& n, Galois::UserContext<GNode>& c) { int& d = graph.getData(n); if (d.value++ < 5) c.push(n); } }; int main(int argc, char** argv) { Galois::Graph::readGraph(graph, argv[1]); … Galois::for_each(initial, P(), Galois::wl<WL>()); return 0; } Data structure declarations Galois Iterator Operator Neighborhood defined by graph API calls Sequential program semantics
  • 35. Hello Graph 35 #include “Galois/Galois.h” #include “Galois/Graph/Graph.h” typedef Galois::Graph::LC_CSR_Graph<int, int> Graph; typedef Graph::GraphNode GNode; Graph graph; struct P { void operator()(GNode& n, Galois::UserContext<GNode>& c) { int& d = graph.getData(n); if (d.value++ < 5) c.push(n); } }; int main(int argc, char** argv) { Galois::Graph::readGraph(graph, argv[1]); … Galois::for_each(initial, P(), Galois::wl<WL>()); return 0; } struct TaskIndexer { int operator()(GNode& n) { return graph.getData(n) >> shift; } }; using namespace Galois::WorkList; typedef dChunkedFIFO<> WL; typedef OrderedByIntegerMetric<TaskIndexer> WL; Neighborhood defined by graph API calls Sequential program semantics Scheduling DSL
  • 36. Galois Operator 36 B A Graph g for_each (Node n : wl) for (Node m : g.edges(n)) … A B
  • 37. Galois Operator 37 B A T1 Graph g for_each (Node n : wl) for (Node m : g.edges(n)) … B
  • 38. Galois Operator 38 B A T1 T2 Graph g for_each (Node n : wl) for (Node m : g.edges(n)) …
  • 39. Galois Operator 39 B A T1 T2 Graph g for_each (Node n : wl) for (Node m : g.edges(n)) … T1
  • 40. Galois Operator 40 B A T1 T2 Graph g for_each (Node n : wl) for (Node m : g.edges(n)) … T1
  • 41. Galois Operator 41 B A T1 T2 Graph g for_each (Node n : wl) for (Node m : g.edges(n)) … T1 T2
  • 42. Galois Operator 42 B A T1 T2 Graph g for_each (Node n : wl) for (Node m : g.edges(n)) … T1 T2
  • 43. Galois Operator • All data accesses go through API – Galois data implementations maintainmarks – When marks overlap, rollback activity and try later • Cautious – Commonlyoperators read their entire neighborhood before modifying any element – Failsafe point: point in operator before first write • Non-deterministic • Difference from Thread Level Speculation [Rauchwerger and Padua, PLDI 1995] – Iterations are unordered – Operators are cautious – Conflicts are detected at ADT level rather than memory level 43 B A T1 T2 Graph g for_each (Node n : wl) for (Node m : g.edges(n)) … T1 T2
  • 44. Galois Data Structures • Graphs – May require multiple loads to access data for a node – Exploit local computation structure – Simple allocation schemes may be imbalanced in physical memory
  • 45. Galois Data Structures • Graphs – May require multiple loads to access data for a node – Exploit local computation structure – Simple allocation schemes may be imbalanced in physical memory nd ed dst node data[N] edge idx[N] edge data[E] edge dst[E] CSR nd: node data ed: edge data dst: edge destination len: # of edges
  • 46. Galois Data Structures • Graphs – May require multiple loads to access data for a node – Exploit local computation structure – Simple allocation schemes may be imbalanced in physical memory nd ed dst Inlining Inlining: Andrew Lenharth nd: node data ed: edge data dst: edge destination len: # of edges
  • 47. Galois Data Structures • Graphs – May require multiple loads to access data for a node – Exploit local computation structure – Simple allocation schemes may be imbalanced in physical memory nd ed dst Inlining Inlining: Andrew Lenharth nd: node data ed: edge data dst: edge destination len: # of edges
  • 48. Galois Data Structures • Graphs – May require multiple loads to access data for a node – Exploit local computation structure – Simple allocation schemes may be imbalanced in physical memory nd len ed ed Inlining Inlining: Andrew Lenharth nd: node data ed: edge data dst: edge destination len: # of edges
  • 49. Galois Data Structures • Graphs – May require multiple loads to access data for a node – Exploit local computation structure – Simple allocation schemes may be imbalanced in physical memory nd len ed ed Inlining Inlining: Andrew Lenharth
  • 50. Galois Data Structures • Graphs – May require multiple loads to access data for a node – Exploit local computation structure – Simple allocation schemes may be imbalanced in physical memory nd len ed ed Inlining Placement DRAM 0 DRAM 1 Virtual Memory Physical Memory Inlining: Andrew Lenharth c1 c2
  • 51. Scheduling in Galois • Users write high-level scheduling policies – DSL for specifying policies • Separation of concerns – Users understand their operator but have vague understanding of scheduling – Schedulers are difficult to implement efficiently • Galois system synthesizes scheduler by composing elements from a scheduler library 51 Nguyen and Pingali. Synthesizing concurrent schedulers for irregular algorithms. ASPLOS 2011.
  • 52. Scheduling Components and Policies • Components: FIFO, LIFO, BulkSynchronous, ChunkedFIFO, OrderedByMetric, … Algorithm Scheduling Policy Used By DMR OrderedByMetric(λt.minangle(t)) FIFO Shewchuk96 Global: ChunkedFIFO(k) Local: LIFO Kulkarni08 DT OrderedByMetric(λp.round(p)) FIFO Amenta03 Random Clarkson89 PFP FIFO Goldberg88 OrderedByMetric(λn.height(n)) Cherkassy95 PTA FIFO Pearce03 BulkSynchronous Nielson99 SSSP FIFO Bellman- Ford OrderedByMetric(λn.w(n)/Δ + (light(n) ? 0 : ½)) FIFO Δ-stepping 52
  • 58. Scheduling Bag c1 c2 Head Package 1 58 Bag c1 c2 Head Package 2 Per-PackageBag: Andrew Lenharth
  • 59. Scheduling Bag c1 c2 Head Package 1 59 Bag c1 c2 Head Package 2 Per-PackageBag: Andrew Lenharth
  • 60. Scheduling Bag c1 c2 Head Package 1 60 Bag c1 c2 Head Package 2 Per-PackageBag: Andrew Lenharth
  • 61. Scheduling Bag c1 c2 Head Package 1 61 Bag c1 c2 Head Package 2 Per-PackageBag: Andrew Lenharth
  • 62. Scheduling Bag c1 c2 Head Package 1 62 Bag c1 c2 Head Package 2 Per-PackageBag: Andrew Lenharth
  • 63. Scheduling Bag c1 c2 Head Package 1 63 Bag c1 c2 Head Package 2 Per-PackageBag: Andrew Lenharth
  • 64. Scheduling Bag c1 c2 Head Package 1 64 Bag c1 c2 Head Package 2 Per-PackageBag: Andrew Lenharth
  • 65. Scheduling Bag c1 c2 Head Package 1 65 Bag c1 c2 Head Package 2
  • 66. OrderedByIntegerMetric 66 0 1 2 3 4 5 Core 1 Core 2 3 Cursor
  • 67. OrderedByIntegerMetric • Implements soft priorities • Drawbacks – Dense range of priorities – Contention on cursor location 67 0 1 2 3 4 5 Core 1 Core 2 3 Cursor
  • 68. Improved OrderedByIntegerMetric Global Map Scheduling Bags 1 30 68 Implementation byAndrew Lenharth
  • 69. Improved OrderedByIntegerMetric 0 1 Global Map Local Maps Core 1 Core 2 Scheduling Bags 1 3 1 30 69 Implementation byAndrew Lenharth
  • 70. Improved OrderedByIntegerMetric 0 1 Global Map Local Maps Core 1 Core 2 Scheduling Bags 1 3 1 30 13 70 Implementation byAndrew Lenharth
  • 71. Improved OrderedByIntegerMetric 0 1 Global Map Local Maps Core 1 Core 2 Scheduling Bags 1 3 1 30 13 71 Implementation byAndrew Lenharth
  • 72. Improved OrderedByIntegerMetric 0 1 Global Map Local Maps Core 1 Core 2 Scheduling Bags 1 3 1 30 13 3 72 Implementation byAndrew Lenharth
  • 73. Improved OrderedByIntegerMetric 0 1 Global Map Local Maps Core 1 Core 2 Scheduling Bags 1 3 1 30 1 3 73 Implementation byAndrew Lenharth
  • 74. Improved OrderedByIntegerMetric 0 1 Global Map Local Maps Core 1 Core 2 Scheduling Bags 1 3 1 30 1 3 15 74 Implementation byAndrew Lenharth
  • 75. Improved OrderedByIntegerMetric 0 1 Global Map Local Maps Core 1 Core 2 Scheduling Bags 1 3 1 30 1 3 15 75 Implementation byAndrew Lenharth
  • 76. Improved OrderedByIntegerMetric 0 1 Global Map Local Maps Core 1 Core 2 Scheduling Bags 1 3 1 30 1 3 15 5 5 76 Implementation byAndrew Lenharth
  • 77. Improved OrderedByIntegerMetric 0 1 Global Map Local Maps Core 1 Core 2 Scheduling Bags 1 3 1 30 1 3 1 5 5 77 Implementation byAndrew Lenharth
  • 78. Galois System • Joe – Writes sequential code – Uses Galois data structures and scheduling hints • Stephanie – Writes concurrent code – Provides variety of data structures and scheduling policies for Joe to choose from • cf., Frameworks that only support one scheduler or one graph implementation 78 Stephanie: Parallel Data Structures Joe: Operator + Schedule
  • 79. Outline • Parallel Program = Operator + Schedule + e • Galois system implementation – Operator, data structures, scheduling • Galois studies – Intel study – Deterministic scheduling – Adding a scheduler: sparse tiling – Comparison with transactional memory – Implementing other programming models 79 Parallel Data Structure
  • 80. Intel Study 80 0.1 1 10 tion(seconds) Native Combblas Graphlab Socialite Giraph Galois 0.01 0.1 Livejournal Facebook Wikipedia Synthetic Timeperiterat (a) PageRank 10 100 e(seconds) Native Combblas Graphlab Socialite Giraph Galois 0 1 Livejournal Facebook Wikipedia Synthetic Overalltime (b) Breadth-First Search 10 100 1000 ation(seconds) Native Combblas Socialite Giraph 1 10 Netflix Timeperitera (c) Collaborative Fil Figure 3: Performance results for different algorithms on real-world and synthetic graphs that are s represents runtime (in log-scale), therefore lower numbers are better. FDR interconnect. The cluster runs on the Red Hat Enterprise Linux Server OS release 6.4. We use a mix of OpenMP directives to parallelize within the node and MPI code to parallelize across nodes in native code. We use the Intel® C++ Composer XE 2013 SP1 Compiler3 and the Intel® MPI library to compile the code. GraphLab uses a similar combination of OpenMP and MPI for best performance. CombBLAS runs best as a pure MPI program. We use multiple MPI processes per node to take advantage of the mul- ideal results. Given gorithms, and also be tions, we expect that i excel at all scales in te 5.2 Single nod We now show the r orative Filtering andNadathur et al. Navigating the maze of graph analytics frameworks. SIGMOD 2014. Combblas Graphlab Giraph Galois Livejournal Facebook Wikipedia Synthetic readth-First Search 10 100 1000 ation(seconds) Native Combblas Graphlab Socialite Giraph Galois 1 10 Netflix Synthetic Timeperitera (c) Collaborative Filtering 10 100 1000 10000 e(seconds) Native Combblas Graphlab Socialite Giraph Galois 0.1 1 Livejournal Facebook Wikipedia Synthetic OverallTim (d) Triangle counting rithms on real-world and synthetic graphs that are small enough to run on a single node. The y-axis
  • 81. Intel Study 81 0.1 1 10 tion(seconds) Native Combblas Graphlab Socialite Giraph Galois 0.01 0.1 Livejournal Facebook Wikipedia Synthetic Timeperiterat (a) PageRank 10 100 e(seconds) Native Combblas Graphlab Socialite Giraph Galois 0 1 Livejournal Facebook Wikipedia Synthetic Overalltime (b) Breadth-First Search 10 100 1000 ation(seconds) Native Combblas Socialite Giraph 1 10 Netflix Timeperitera (c) Collaborative Fil Figure 3: Performance results for different algorithms on real-world and synthetic graphs that are s represents runtime (in log-scale), therefore lower numbers are better. FDR interconnect. The cluster runs on the Red Hat Enterprise Linux Server OS release 6.4. We use a mix of OpenMP directives to parallelize within the node and MPI code to parallelize across nodes in native code. We use the Intel® C++ Composer XE 2013 SP1 Compiler3 and the Intel® MPI library to compile the code. GraphLab uses a similar combination of OpenMP and MPI for best performance. CombBLAS runs best as a pure MPI program. We use multiple MPI processes per node to take advantage of the mul- ideal results. Given gorithms, and also be tions, we expect that i excel at all scales in te 5.2 Single nod We now show the r orative Filtering andNadathur et al. Navigating the maze of graph analytics frameworks. SIGMOD 2014. Combblas Graphlab Giraph Galois Livejournal Facebook Wikipedia Synthetic readth-First Search 10 100 1000 ation(seconds) Native Combblas Graphlab Socialite Giraph Galois 1 10 Netflix Synthetic Timeperitera (c) Collaborative Filtering 10 100 1000 10000 e(seconds) Native Combblas Graphlab Socialite Giraph Galois 0.1 1 Livejournal Facebook Wikipedia Synthetic OverallTim (d) Triangle counting rithms on real-world and synthetic graphs that are small enough to run on a single node. The y-axis
  • 82. Intel Study 82 0.1 1 10 tion(seconds) Native Combblas Graphlab Socialite Giraph Galois 0.01 0.1 Livejournal Facebook Wikipedia Synthetic Timeperiterat (a) PageRank 10 100 e(seconds) Native Combblas Graphlab Socialite Giraph Galois 0 1 Livejournal Facebook Wikipedia Synthetic Overalltime (b) Breadth-First Search 10 100 1000 ation(seconds) Native Combblas Socialite Giraph 1 10 Netflix Timeperitera (c) Collaborative Fil Figure 3: Performance results for different algorithms on real-world and synthetic graphs that are s represents runtime (in log-scale), therefore lower numbers are better. FDR interconnect. The cluster runs on the Red Hat Enterprise Linux Server OS release 6.4. We use a mix of OpenMP directives to parallelize within the node and MPI code to parallelize across nodes in native code. We use the Intel® C++ Composer XE 2013 SP1 Compiler3 and the Intel® MPI library to compile the code. GraphLab uses a similar combination of OpenMP and MPI for best performance. CombBLAS runs best as a pure MPI program. We use multiple MPI processes per node to take advantage of the mul- ideal results. Given gorithms, and also be tions, we expect that i excel at all scales in te 5.2 Single nod We now show the r orative Filtering andNadathur et al. Navigating the maze of graph analytics frameworks. SIGMOD 2014. Combblas Graphlab Giraph Galois Livejournal Facebook Wikipedia Synthetic readth-First Search 10 100 1000 ation(seconds) Native Combblas Graphlab Socialite Giraph Galois 1 10 Netflix Synthetic Timeperitera (c) Collaborative Filtering 10 100 1000 10000 e(seconds) Native Combblas Graphlab Socialite Giraph Galois 0.1 1 Livejournal Facebook Wikipedia Synthetic OverallTim (d) Triangle counting rithms on real-world and synthetic graphs that are small enough to run on a single node. The y-axis
  • 83. Intel Study 83 0.1 1 10 tion(seconds) Native Combblas Graphlab Socialite Giraph Galois 0.01 0.1 Livejournal Facebook Wikipedia Synthetic Timeperiterat (a) PageRank 10 100 e(seconds) Native Combblas Graphlab Socialite Giraph Galois 0 1 Livejournal Facebook Wikipedia Synthetic Overalltime (b) Breadth-First Search 10 100 1000 ation(seconds) Native Combblas Socialite Giraph 1 10 Netflix Timeperitera (c) Collaborative Fil Figure 3: Performance results for different algorithms on real-world and synthetic graphs that are s represents runtime (in log-scale), therefore lower numbers are better. FDR interconnect. The cluster runs on the Red Hat Enterprise Linux Server OS release 6.4. We use a mix of OpenMP directives to parallelize within the node and MPI code to parallelize across nodes in native code. We use the Intel® C++ Composer XE 2013 SP1 Compiler3 and the Intel® MPI library to compile the code. GraphLab uses a similar combination of OpenMP and MPI for best performance. CombBLAS runs best as a pure MPI program. We use multiple MPI processes per node to take advantage of the mul- ideal results. Given gorithms, and also be tions, we expect that i excel at all scales in te 5.2 Single nod We now show the r orative Filtering andNadathur et al. Navigating the maze of graph analytics frameworks. SIGMOD 2014. Combblas Graphlab Giraph Galois Livejournal Facebook Wikipedia Synthetic readth-First Search 10 100 1000 ation(seconds) Native Combblas Graphlab Socialite Giraph Galois 1 10 Netflix Synthetic Timeperitera (c) Collaborative Filtering 10 100 1000 10000 e(seconds) Native Combblas Graphlab Socialite Giraph Galois 0.1 1 Livejournal Facebook Wikipedia Synthetic OverallTim (d) Triangle counting rithms on real-world and synthetic graphs that are small enough to run on a single node. The y-axis
• 87. Outline
• Parallel Program = Operator + Schedule + Parallel Data Structure
• Galois system implementation
– Operator, data structures, scheduling
• Galois studies
– Intel study
– Deterministic scheduling
– Adding a scheduler: sparse tiling
– Comparison with transactional memory
– Implementing other programming models
• 88. Deterministic Interference Graph (DIG) Scheduling
• Traditional Galois execution
– Asynchronous, non-deterministic
• Deterministic Galois execution
– Round-based execution
– Construct an interference graph over the active tasks
– Execute an independent set
– Repeat
(Slides 88–94 animate one round: tasks A, B, and C are placed in the interference graph, an independent set executes, and the remaining tasks carry over to the next round.)
Nguyen, Lenharth and Pingali. Deterministic Galois: on-demand, parameterless and portable. ASPLOS 2014.
• 95. DIG Scheduling in Practice
• Just find the roots of the interference graph
• Sequence of rounds; each round has two phases
• Phase 1: mark neighborhoods
– Each task atomically writes its task id over its neighborhood, up to the failsafe point
– A mark is overwritten only by a smaller id (here A < B < C)
– Final mark values are therefore deterministic, regardless of interleaving
• Phase 2: execute roots
– Execute the tasks whose neighborhoods contain only their own marks
(Slides 95–101 animate the marking: A, B, and C race to mark overlapping neighborhoods, and the smaller id wins each contested node. A minimal sketch of the two phases follows.)
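To make the two phases concrete, here is a minimal C++ sketch of one marking round. The names (TaskId, Node, markNeighborhood, ownsNeighborhood) and the atomic-minimum loop are illustrative assumptions, not the actual Galois implementation.

```cpp
#include <atomic>
#include <limits>
#include <vector>

using TaskId = unsigned;
constexpr TaskId kNoTask = std::numeric_limits<TaskId>::max();

struct Node { std::atomic<TaskId> mark{kNoTask}; };

// Phase 1: task `id` claims every node in its neighborhood, keeping the
// smallest id. Because min is commutative and associative, the final marks
// are deterministic no matter how the threads interleave.
void markNeighborhood(TaskId id, const std::vector<Node*>& nbhd) {
  for (Node* n : nbhd) {
    TaskId cur = n->mark.load(std::memory_order_relaxed);
    // Atomic-min loop: compare_exchange reloads `cur` on failure, so we
    // retry only while our id is still the smaller one.
    while (id < cur &&
           !n->mark.compare_exchange_weak(cur, id, std::memory_order_relaxed)) {
    }
  }
}

// Phase 2: a task is a root of the interference graph, and may execute,
// exactly when it owns every mark in its neighborhood.
bool ownsNeighborhood(TaskId id, const std::vector<Node*>& nbhd) {
  for (Node* n : nbhd)
    if (n->mark.load(std::memory_order_relaxed) != id) return false;
  return true;
}
```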
  • 102. Optimizations • Resuming tasks – Avoid re-executing tasks – Suspend and resume execution at failsafe point – Provide buffers to save local state • Windowing – Inspect only a subset of tasks at a time – Adaptive algorithm varies window size 102
• 103. Evaluation of Deterministic Scheduling
Platform
• Intel 4 × 10-core (2.27 GHz)
Deterministic systems
• RCDC [Devietti11]
– Deterministic by scheduling
– Implementation of the Kendo algorithm [Olszewski09]
• PBBS [Blelloch11]
– Deterministic by construction
– Handwritten deterministic implementations
• Deterministic Galois
– Non-deterministic programs (g-n) automatically made deterministic (g-d)
Applications (from PBBS)
• Breadth-first search (bfs)
• Delaunay mesh refinement (dmr)
• Delaunay triangulation (dt)
• Maximal independent set (mis)
• 105. Evaluation
Figure (slides 105–108): self-relative speedup versus thread count (up to 40) for bfs, dmr, dt, and mis, comparing the non-deterministic Galois program (g-n), RCDC (deterministic by scheduling), the deterministic Galois program (g-d), and PBBS (deterministic by construction).
• 110. Third-party Evaluation: Maze Router Speedup
• In FPGA routing it is useful to have deterministic results

  Deterministic Scheduler           Non-Deterministic Scheduler
  VPR 5.0              1.00x        VPR 5.0              1.00x
  Galois (1 Thread)    0.79x        Galois (1 Thread)    0.89x
  Galois (2 Threads)   1.21x        Galois (2 Threads)   1.47x
  Galois (4 Threads)   2.14x        Galois (4 Threads)   2.99x
  Galois (8 Threads)   3.67x        Galois (8 Threads)   5.46x

Extended results from: Moctar and Brisk. Parallel FPGA Routing based on the Operator Formulation. DAC 2014.
• 111. Outline
• Parallel Program = Operator + Schedule + Parallel Data Structure
• Galois system implementation
– Operator, data structures, scheduling
• Galois studies
– Intel study
– Deterministic scheduling
– Adding a scheduler: sparse tiling
– Comparison with transactional memory
– Implementing other programming models
• 112. Matrix Completion
• Matrix completion problem
– The Netflix problem
• An optimization problem
– Find an m×k matrix W and a k×n matrix H (k << min(m, n)) such that A ≈ WH
– Low-rank approximation
• Algorithms
– Stochastic gradient descent (SGD)
– …
(Figure: matrix view, A ≈ WH with latent vectors Wi* and H*j; graph view, a bipartite graph of users and movies with ratings Aij on the edges.)
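Written out, the problem is a least-squares objective over the known ratings. The Frobenius regularization term with weight λ is a common addition and an assumption here, not something stated on the slide:

```latex
\[
\min_{W \in \mathbb{R}^{m \times k},\; H \in \mathbb{R}^{k \times n}}
  \sum_{(i,j)\,:\,A_{ij}\ \text{known}}
    \bigl(A_{ij} - W_{i*}\,H_{*j}\bigr)^2
  \;+\; \lambda\bigl(\lVert W\rVert_F^2 + \lVert H\rVert_F^2\bigr)
\]
```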
• 113. Naïve SGD Implementation
• Row-wise operator:

  for (User u : users)
    for (Item m : items(u))
      op(u, m, A[u][m]);

• Poor cache behavior
– With k = 100, each latent vector Wi* is about 1 KB
– degree(user) > 1000, so one row sweep touches thousands of distinct columns of H
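As a concrete example of what op does on one rating, a single SGD step nudges both latent vectors along the gradient of the squared error. This is a hedged sketch: the function name, step size, and regularization parameter lambda are illustrative assumptions, not the talk's code.

```cpp
// One SGD relaxation for A ~= W * H on a single known rating a = A[u][m].
// w and h are the length-k latent vectors for user u and item m.
void sgdUpdate(float* w, float* h, float a, int k, float step, float lambda) {
  float err = -a;                          // error = dot(w, h) - a
  for (int i = 0; i < k; ++i) err += w[i] * h[i];
  for (int i = 0; i < k; ++i) {            // gradient step on both factors
    float wi = w[i];                       // save w[i] before it is updated
    w[i] -= step * (err * h[i] + lambda * wi);
    h[i] -= step * (err * wi + lambda * h[i]);
  }
}
```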
• 114. 2D Tiled SGD
• Apply the operator to small 2D tiles of A, so each tile reuses one block of W rows and one block of H columns
• Additional concerns
– Conflict-free scheduling of tiles (a diagonal schedule is sketched below)
• Future optimizations
– Adaptive, non-uniform tiling
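A hedged sketch of one conflict-free schedule: walk a T×T grid of tiles along diagonals so that, within a round, no two tiles share a block of W rows or of H columns. The grid shape and the processTile stub are assumptions for illustration.

```cpp
#include <cstdio>

// Hypothetical per-tile SGD sweep over one block of W rows and one block of
// H columns; a stub here so the sketch is self-contained.
void processTile(int rowBlock, int colBlock) {
  std::printf("tile (%d,%d)\n", rowBlock, colBlock);
}

int main() {
  const int T = 4;  // T x T grid of tiles
  for (int r = 0; r < T; ++r) {
    // In round r, worker t processes tile (t, (t + r) % T). Each tile in a
    // round touches a distinct row block and a distinct column block, so
    // all T tiles of a round could run in parallel with no conflicts.
    for (int t = 0; t < T; ++t)
      processTile(t, (t + r) % T);
    // a barrier between rounds keeps successive diagonals conflict-free
  }
}
```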
• 115. Evaluation of Tiling
Parameters
• Same SGD parameters and initial latent vectors across implementations
• Step size ε(n) = α / (1 + β · n^1.5)
• Hand-tuned tile size
Implementations
• Galois
– 2D scheduling (2 weeks to implement)
– Synchronized updates
• GraphLab
– Standard GraphLab only supports GD
– Implemented 1D SGD with unsynchronized updates
• Nomad
– From-scratch distributed code (~1 year to implement)
– 2D scheduling; tile size is a function of #threads
– Synchronized updates
Datasets
            Items   Users   Ratings   Sparsity
  netflix   17K     480K    99M       1e-2
  yahoo     624K    1M      253M      4e-4
Yun, Yu, Hsieh, Vishwanathan, Dhillon. NOMAD: Non-locking stOchastic Multi-machine algorithm for Asynchronous and Decentralized matrix completion. VLDB 2013.
• 116. Throughput
Figure: GFLOPS versus thread count (up to 40) on netflix and yahoo for Galois, GraphLab, and Nomad. Theoretical peak: 363.2 GFLOPS.
• 117. Convergence
Figure: error versus wall-clock time on netflix and yahoo for Galois, GraphLab, and Nomad, shown alongside the throughput curves from the previous slide.
• 118. Outline
• Parallel Program = Operator + Schedule + Parallel Data Structure
• Galois system implementation
– Operator, data structures, scheduling
• Galois studies
– Intel study
– Deterministic scheduling
– Adding a scheduler: sparse tiling
– Comparison with transactional memory
– Implementing other programming models
• 119. Transactional Memory
• Alternative to adopting a new programming model
– Parallelization reusing existing sequential programs
• Transactional memory
– Add threads and transactions but keep the existing sequential code
– The TM system ensures correct execution of atomic regions
• Software and hardware implementations
– Speculative execution

Sequential Program:
  while (begin != end)
    begin += 1
  while (head)
    ...
    head = next
  root.size += 1

TM Program:
  fork()
  while (begin != end)
    atomic
      begin += 1
  atomic
    while (head)
      ...
      head = next
    root.size += 1
  join()
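As one concrete (assumed) way to write such atomic regions, GCC's -fgnu-tm extension provides __transaction_atomic blocks; the slide's atomic keyword is abstract, and the data layout below is illustrative.

```cpp
// A minimal sketch of the TM program above using GCC transactional memory.
// Compile with: g++ -fgnu-tm tm_sketch.cpp -c
struct Node { Node* next; };
struct Root { long size; Node* head; };

void process(long* begin, long* end, Root& root) {
  while (begin != end) {
    __transaction_atomic { *begin += 1; }  // each shared update is a transaction
    ++begin;
  }
  __transaction_atomic {                   // traversal and counter update
    for (Node* p = root.head; p; p = p->next) { /* visit p */ }
    root.size += 1;
  }
}
```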
• 122. Stampede
• STAMP [CaoMinh08]
– A set of real-world applications using TM
– Historically, its performance has been poor
• Stampede benchmarks
– Re-implemented STAMP in Galois
– Compared to TinySTM
– Median improvement ratio over STAMP of 4.4X at 64 threads (max: 37X, geomean: 3.9X)
Figure: speedup versus thread count (up to 64) for genome, kmeans-low, ssca2, vacation-low, and yada; Galois versus STM on Blue Gene/Q, median of 5 runs.
Nguyen and Pingali. What scalable transactional programs need from transactional memory. In submission to PLDI 2015.
• 123. Stampede Performance
TM Program:
  fork()
  while (begin != end)
    atomic
      begin += 1
  atomic
    while (head)
      ...
      head = next
    root.size += 1
  join()
• 124. Stampede Performance
• Controlling communication is paramount
– The chief performance cost is moving bits around
• Scalable principle 1: disjoint access
– Recall the implementation of the Bag and the OrderedByIntegerMetric scheduler (a sketch follows)
• Scalable principle 2: virtualize tasks
– Be as flexible as possible to dynamic conditions
• These changes are beyond the capability of TM alone
– Use Galois data structures (disjoint access)
– Use the Galois scheduler (virtualized tasks)
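As a hedged illustration of principle 1, a disjoint-access bag can keep per-thread storage so that pushes and pops touch shared state only when a thread runs dry and steals. This sketch is illustrative, not the actual Galois Bag implementation.

```cpp
#include <deque>
#include <mutex>
#include <optional>
#include <vector>

// Per-thread bag: the common case (push/pop on your own deque) never
// contends with other threads; stealing happens only when a thread is empty.
template <typename T>
class Bag {
  struct PerThread { std::deque<T> items; std::mutex lock; };
  std::vector<PerThread> local_;

public:
  explicit Bag(unsigned threads) : local_(threads) {}

  void push(unsigned tid, T v) {
    std::lock_guard<std::mutex> g(local_[tid].lock);
    local_[tid].items.push_back(std::move(v));
  }

  std::optional<T> pop(unsigned tid) {
    for (unsigned i = 0; i < local_.size(); ++i) {
      auto& p = local_[(tid + i) % local_.size()];  // self first, then steal
      std::lock_guard<std::mutex> g(p.lock);
      if (!p.items.empty()) {
        T v = std::move(p.items.back());
        p.items.pop_back();
        return v;
      }
    }
    return std::nullopt;  // globally empty
  }
};
```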
• 125. Outline
• Parallel Program = Operator + Schedule + Parallel Data Structure
• Galois system implementation
– Operator, data structures, scheduling
• Galois studies
– Intel study
– Deterministic scheduling
– Adding a scheduler: sparse tiling
– Comparison with transactional memory
– Implementing other programming models
• 126. DSLs
• Layer other DSLs on top of Galois
– Other programming models are more restrictive than the operator formulation
• Bulk-synchronous vertex programs
– Nodes are active
– The operator is over a node and its neighbors
– Execution proceeds in rounds
• Reads take values from the previous round
• Overlapping writes are reduced to a single value
– Ex: Pregel [Malewicz10], Ligra [Shun13]
• Gather-Apply-Scatter
– A refinement of bulk-synchronous execution into subrounds
– Ex: GraphLab [Low10] / PowerGraph [Gonzalez12]
(Figure: a node and its neighbors a–e forming one vertex-program neighborhood. A sketch of one round follows.)
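As a hedged sketch of how a bulk-synchronous vertex program can sit on top of an operator-style loop: one round reads every node's value from the previous round and writes a new value. The graph layout and the simplified PageRank-style update are illustrative assumptions, not the Galois or Pregel API.

```cpp
#include <vector>

struct Graph {
  std::vector<std::vector<int>> in_edges;  // in-neighbors of each node
};

// One bulk-synchronous round: all nodes are active, each gathers from its
// neighbors' previous-round values and applies a local update. Double
// buffering (oldVal/newVal) gives the round its deterministic semantics.
void round(const Graph& g, const std::vector<double>& oldVal,
           std::vector<double>& newVal) {
  for (int v = 0; v < static_cast<int>(g.in_edges.size()); ++v) {
    double sum = 0;
    for (int u : g.in_edges[v]) sum += oldVal[u];  // gather (previous round)
    // apply: simplified PageRank-style update; real PageRank would divide
    // each contribution by the out-degree of u
    newVal[v] = 0.15 + 0.85 * sum;
  }
}
```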
• 129. Insufficiency of Vertex Programs
• Connected components problem
– Construct a labeling such that color(x) = color(y) exactly when x and y are in the same component
• Algorithms
– Union-Find-based
– Label propagation
– Hybrids
(Figure: nodes a, b, c before and after pointer jumping. A union-find sketch follows.)
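For contrast with label propagation, here is a minimal union-find sketch with pointer jumping (path halving); this is the standard textbook structure, not code from the talk, and it is exactly the kind of algorithm that does not fit the vertex-program mold.

```cpp
#include <numeric>
#include <vector>

struct UnionFind {
  std::vector<int> parent;

  explicit UnionFind(int n) : parent(n) {
    std::iota(parent.begin(), parent.end(), 0);  // each node is its own root
  }

  int find(int x) {
    while (parent[x] != x) {
      parent[x] = parent[parent[x]];  // pointer jumping / path halving
      x = parent[x];
    }
    return x;
  }

  void unite(int a, int b) { parent[find(a)] = find(b); }
};

// Connected components: call unite(u, v) for every edge (u, v);
// afterwards color(x) = find(x) labels each component consistently.
```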
• 130. Cost to Implement Other Models

  Feature                                        LoC
  GraphLab [Low10]                               0
  PowerGraph [Gonzalez12] (synchronous engine)   200
  PowerGraph (asynchronous engine)               200
  GraphChi [Kyrola12]                            400
  Ligra [Shun13]                                 300
  GraphChi + Ligra (additional)                  100

(Slide 131 highlights the rows that make up PowerGraph-g, the PowerGraph model implemented on top of Galois.)
• 132. Comparison with Other Models
Comparisons
• PowerGraph (distributed)
– Its runtimes on 64 16-core machines (1024 cores) do not beat one 40-core machine running Galois
Inputs
• twitter50 (50M nodes, 2B edges, low diameter)
• road (20M nodes, 60M edges, high diameter)
Applications
• Breadth-first search (bfs)
• Connected components (cc)
• Approximate diameter (dia)
• PageRank (pr)
• Single-source shortest paths (sssp)
Nguyen, Lenharth and Pingali. A lightweight infrastructure for graph analytics. SOSP 2013.
• 133. Figure (slides 133–135): per-application runtime comparison for bfs, cc, dia, pr, and sssp. Left: PowerGraph runtime versus Galois runtime on a log scale (1 to 10000 seconds); right: ratio of PowerGraph runtime to PowerGraph-g runtime (roughly 1x to 6x).
• 136. (Figure repeated from the previous slides.)
• The best algorithm may require application-specific scheduling
– Priority scheduling for SSSP
• The best algorithm may not be expressible as a vertex program
– Connected components with union-find
• Speculative execution is required for high-diameter graphs
– Bulk-synchronous scheduling uses many rounds and has too much overhead
• 139. Outline
• Parallel Program = Operator + Schedule + Parallel Data Structure
• Galois system implementation
– Operator, data structures, scheduling
• Galois studies
– Intel study
– Deterministic scheduling
– Adding a scheduler: sparse tiling
– Comparison with transactional memory
– Implementing other programming models
• 140. Third-party Evaluations
“Galois performs better than other frameworks, and is close to native performance”
— Nadathur et al. Navigating the maze of graph analytics frameworks. SIGMOD 2014.
“This paper demonstrates that speculative parallelization and the operator formalism, central to Galois’ programming model and philosophy, is the best choice for irregular CAD algorithms that operate on graph-based data structures.”
— Moctar and Brisk. Parallel FPGA Routing based on the Operator Formulation. DAC 2014.