Slide 1: A Scalable Implementation of a MapReduce-based Graph Processing Algorithm for Large-scale Heterogeneous Supercomputers
Koichi Shirahata*1, Hitoshi Sato*1,*2, Toyotaro Suzumura*1,*2,*3, Satoshi Matsuoka*1
*1 Tokyo Institute of Technology
*2 CREST, Japan Science and Technology Agency
*3 IBM Research - Tokyo
Slide 2: Emergence of Large Scale Graphs
•  Example scale: 900 million vertices, 100 billion edges
→ Need fast and scalable analysis using HPC
Slide 3 (Motivation): GPU-based Heterogeneous Supercomputers
•  Fast large graph processing with GPGPU
  –  High peak performance
  –  High memory bandwidth
•  TSUBAME 2.0: 1408 compute nodes (3 GPUs / node)
Slide 4: Problems of Large Scale Graph Processing with GPGPU
•  How much do GPUs accelerate large scale graph processing?
  –  Applicability to graph applications
    •  Computation patterns of the graph algorithm affect performance
    •  Tradeoff between computation and CPU-GPU data transfer overhead
  –  How to distribute graph data to each GPU in order to exploit multiple GPUs
[Diagram: GPU memory vs. CPU memory; key concerns: scalability, load balancing, communication]
Slide 5: Motivating Example: CPU-based Graph Processing
•  How much is the graph application accelerated using GPUs?
  –  Simple computation patterns benefit from the high memory bandwidth
  –  Complex computation patterns suffer from PCI-E transfer overhead
[Chart: elapsed time (ms) of the CPU-based run vs. number of compute nodes (1 to 128), broken down into Map, Copy, Sort, and Reduce phases]
Slide 6: Contributions
•  Implemented a scalable multi-GPU-based PageRank application
  –  Extend Mars (an existing GPU MapReduce framework) using the MPI library
  –  Implement GIM-V (a graph processing algorithm) on multi-GPU MapReduce
  –  Load balance optimization between GPU devices for large-scale graphs
    •  Task scheduling-based graph partitioning
•  Performance on the TSUBAME 2.0 supercomputer
  –  Scales well up to 256 nodes (768 GPUs)
  –  1.52x speedup compared with CPUs
Slides 7-8: Proposal: Multi-GPU GIM-V with Load Balance Optimization
•  Graph application: PageRank
•  Graph algorithm: Multi-GPU GIM-V
  –  Implement GIM-V on multi-GPU MapReduce: optimization for GIM-V, load balance optimization
•  MapReduce framework: Multi-GPU Mars
  –  Extend an existing GPU MapReduce framework (Mars) for multi-GPU
•  Platform: CUDA, MPI
Slides 9-10: Structure of Mars
•  Mars*1: an existing GPU-based MapReduce framework
  –  CPU-GPU data transfer (Map)
  –  GPU-based bitonic sort (Shuffle)
  –  Allocates one CUDA thread per key (Map, Reduce)
•  Pipeline: Scheduler -> Preprocess -> GPU processing (Map -> Sort -> Reduce)
→ We extend Mars for multi-GPU support
*1: Bingsheng He et al. Mars: A MapReduce Framework on Graphics Processors. PACT 2008.
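
As a rough illustration of the one-thread-per-key allocation above, a minimal CUDA sketch of a Map launch could look like the following (illustrative only; the kernel name, record layout, and identity map body are assumptions, not Mars's actual code):

    // Minimal sketch: one CUDA thread processes one key-value record.
    struct KeyVal { int key; float val; };

    __global__ void map_kernel(const KeyVal* in, KeyVal* out, int n) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;  // one thread per record
        if (i >= n) return;
        // A user-defined map() would go here; the identity map is a placeholder.
        out[i] = in[i];
    }

    // Host-side launch covering all n records with 256-thread blocks:
    // map_kernel<<<(n + 255) / 256, 256>>>(d_in, d_out, n);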
Slide 11: Proposal: Mars Extension for Multi-GPU using MPI
•  Inter-GPU communication in Shuffle
  –  GPU-to-CPU download -> MPI_Alltoallv -> CPU-to-GPU upload -> local Sort
•  Parallel I/O feature using MPI-IO
  –  Improve I/O throughput between memory and storage
[Diagram: per-GPU pipeline of Upload (CPU -> GPU), Map, Copy, Sort, Reduce, and Download (GPU -> CPU), coordinated by the Scheduler]
Slide 12: Proposal: Multi-GPU GIM-V with Load Balance Optimization (roadmap repeated from Slides 7-8; this part covers implementing GIM-V on multi-GPU MapReduce, with optimization for GIM-V and load balance optimization)
Slides 13-17: Large Graph Processing Algorithm GIM-V
•  Generalized Iterative Matrix-Vector multiplication*1
  –  Graph applications are implemented by defining 3 functions
  –  v' = M ×G v, where
       v'_i = Assign(v_i, CombineAll_i({x_j | j = 1..n, x_j = Combine2(m_ij, v_j)}))   (i = 1..n)
[Diagram, built up across these slides: M ×G V, with Combine2 applied to each pair (m_ij, v_j), CombineAll aggregating the results for row i, and Assign writing the new value of V]
•  GIM-V can be implemented as a 2-stage MapReduce
→ Implement it on a multi-GPU environment
*1: Kang, U. et al., "PEGASUS: A Peta-Scale Graph Mining System - Implementation and Observations", IEEE International Conference on Data Mining, 2009.
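
As a concrete example of the three functions, the PageRank instantiation described in the PEGASUS paper can be sketched roughly as follows (the names and signatures are illustrative assumptions, not the code used in this work):

    #include <vector>

    const float C = 0.85f;                      // damping factor

    // Combine2: contribution of one matrix element and one vector element
    float combine2(float m_ij, float v_j) {
        return C * m_ij * v_j;
    }

    // CombineAll: sum the contributions for row i and add the random-jump term
    float combineAll(const std::vector<float>& x, int n_vertices) {
        float sum = 0.0f;
        for (float xj : x) sum += xj;
        return (1.0f - C) / n_vertices + sum;
    }

    // Assign: PageRank simply replaces the old value with the combined one
    float assign(float /*v_old*/, float v_new) {
        return v_new;
    }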
  
Slide 18: Proposal: GIM-V Implementation on Multi-GPU
•  Continuous execution feature for iterations
  –  2 MapReduce stages per iteration (Stage 1: Combine2, Stage 2: CombineAll)
  –  Graph partition at pre-processing
    •  Divide the input graph vertices/edges among GPUs
  –  Parallel convergence test at post-processing
    •  Locally on each process -> globally using MPI_Allreduce
[Diagram: the Scheduler / GPU processing drives Pre-process (graph partition), Multi-GPU GIM-V (Stage 1, Stage 2), and Post-process (convergence test)]
Slides 19-20: Optimizations for Multi-GPU GIM-V
•  Data structure
  –  Mars handles metadata and payload -> our implementation eliminates the metadata and uses a fixed-size payload
•  Thread allocation
  –  Mars handles one key per thread -> in the Reduce stage, our implementation allocates multiple CUDA threads to a single key according to its value size
•  Load balance optimization
  –  Scale-free property: a small number of vertices have many edges -> minimize load imbalance among GPUs
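
One way the multiple-threads-per-key idea can be pictured is a block-per-key reduction kernel like the sketch below (an illustrative assumption about the data layout, with a sum standing in for the user-defined reduce; not the actual implementation):

    // One thread block reduces the values of one key; threads stride over the
    // key's values, then combine their partial sums in shared memory.
    __global__ void reduce_block_per_key(const float* values, const int* key_offsets,
                                         const int* key_counts, float* out) {
        int k = blockIdx.x;
        const float* v = values + key_offsets[k];
        int cnt = key_counts[k];

        __shared__ float partial[256];
        float local = 0.0f;
        for (int i = threadIdx.x; i < cnt; i += blockDim.x)
            local += v[i];                               // strided accumulation
        partial[threadIdx.x] = local;
        __syncthreads();

        for (int s = blockDim.x / 2; s > 0; s >>= 1) {   // tree reduction in the block
            if (threadIdx.x < s) partial[threadIdx.x] += partial[threadIdx.x + s];
            __syncthreads();
        }
        if (threadIdx.x == 0) out[k] = partial[0];
    }
    // Launch with exactly 256 threads per block, one block per key:
    // reduce_block_per_key<<<num_keys, 256>>>(d_vals, d_offsets, d_counts, d_out);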
  
Slide 21: Apply Load Balancing Optimization
•  Partition the graph so as to minimize load imbalance among GPUs
  –  Apply a task scheduling algorithm
    •  Regard a vertex and its edges as a task
    •  TaskSize_i = 1 + (number of outgoing edges of vertex i); e.g. a vertex with 3 outgoing edges has TaskSize = 1 + 3
  –  LPT (Longest Processing Time) schedule*1
    •  Assign tasks in decreasing order of task size
    •  Minimizes the maximum amount assigned to any GPU
[Diagram: tasks {8, 5, 4, 3, 1} distributed over processors P1-P3]
*1: R. L. Graham, "Bounds on multiprocessing anomalies and related packing algorithms," in Proceedings of the May 16-18, 1972, Spring Joint Computer Conference, ser. AFIPS '72 (Spring).
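
A compact sketch of the LPT assignment itself, with task sizes computed as above (a generic illustration of the scheduling rule, not the paper's partitioner):

    #include <algorithm>
    #include <cstdint>
    #include <functional>
    #include <queue>
    #include <utility>
    #include <vector>

    // Returns, for each task (vertex), the GPU it is assigned to.
    std::vector<int> lpt_schedule(const std::vector<uint64_t>& task_sizes, int num_gpus) {
        std::vector<int> order(task_sizes.size());
        for (size_t i = 0; i < order.size(); ++i) order[i] = static_cast<int>(i);
        std::sort(order.begin(), order.end(),
                  [&](int a, int b) { return task_sizes[a] > task_sizes[b]; });  // decreasing size

        using Load = std::pair<uint64_t, int>;                  // (current load, gpu id)
        std::priority_queue<Load, std::vector<Load>, std::greater<Load>> heap;
        for (int g = 0; g < num_gpus; ++g) heap.push({0, g});

        std::vector<int> assignment(task_sizes.size());
        for (int t : order) {
            Load least = heap.top(); heap.pop();
            assignment[t] = least.second;                            // biggest remaining task goes
            heap.push({least.first + task_sizes[t], least.second});  // to the least-loaded GPU
        }
        return assignment;
    }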
Slide 22: Experiments
•  Goal: study the performance of our multi-GPU GIM-V
  –  Scalability
  –  Comparison with a CPU-based implementation
  –  Validity of the load balance optimization
•  Methods
  –  A single round of iterations (without preprocessing)
  –  PageRank application: measures the relative importance of web pages
  –  Input data: artificial Kronecker graphs, generated by the Graph500 generator
•  Parameters
  –  SCALE: log2 of #vertices (#vertices = 2^SCALE)
  –  Edge_factor: 16 (#edges = Edge_factor × #vertices)
[Diagram: Kronecker product example, G2 = G1 ⊗ G1]
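
For reference, the graph sizes implied by these parameters follow directly from the two formulas; a trivial sketch (the Kronecker generator itself is the standard Graph500 one and is not reproduced here):

    #include <cstdint>
    #include <cstdio>

    int main() {
        int scale = 30;                                  // largest SCALE used in the experiments
        uint64_t edge_factor = 16;
        uint64_t n_vertices = 1ULL << scale;             // 2^SCALE vertices
        uint64_t n_edges = edge_factor * n_vertices;     // Edge_factor x #vertices edges
        std::printf("SCALE %d: %llu vertices, %llu edges\n", scale,
                    (unsigned long long)n_vertices, (unsigned long long)n_edges);
        return 0;
    }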
  
Slide 23: Experimental Environments
•  TSUBAME 2.0 supercomputer
  –  We use 256 nodes (768 GPUs)
    •  CPU-GPU: PCI-E 2.0 x16
    •  Internode: QDR InfiniBand (40 Gbps), dual rail
•  Mars configurations
  –  MarsGPU-n: n GPUs per node (n = 1, 2, 3)
  –  MarsCPU: 12 threads per node, MPI and pthreads, parallel quicksort

               CPU                  GPU
  Model        Intel Xeon X5670     Tesla M2050
  # Cores      6                    448
  Frequency    2.93 GHz             1.15 GHz
  Memory       54 GB                2.7 GB
  Compiler     gcc 4.3.4            nvcc 4.0
Slides 24-25: Weak Scaling Performance: MarsGPU vs. MarsCPU
•  Without load balance optimization
•  87.04 MEdges/s at 256 nodes; 1.52x speedup (3 GPUs vs. CPU)
[Chart: MEdges/sec (higher is better) vs. number of compute nodes (up to 300), for MarsGPU-1, MarsGPU-2, MarsGPU-3, and MarsCPU, at SCALE 27 through 30]
Slides 26-28: Performance Breakdown: MarsGPU and MarsCPU (SCALE 28)
•  Speedups over MarsCPU: 8.93x in Map, 2.53x in Sort
•  PCI-E transfer (PCI-Comm) is a visible overhead in the GPU configurations
[Chart: elapsed time (ms, lower is better) for MarsCPU and MarsGPU-1/2/3, broken down into Map, MPI-Comm, PCI-Comm, Hash, Sort, and Reduce]
Slide 29: Efficiency of GIM-V Optimizations
•  Data structure optimization (affects Map, Sort, Reduce)
•  Thread allocation optimization (affects Reduce)
•  Naive vs. optimized (SCALE 26, 128 nodes on MarsGPU-3): 1.92x in Map, 1.64x in Sort, 66.8x in Reduce
[Chart: elapsed time (ms, log scale, lower is better) of Map, Sort, and Reduce for the naive and optimized versions]
Slides 30-31: Round Robin vs. LPT Schedule: Weak Scaling Performance
•  Similar except on 128 nodes, where LPT gives a 1.16x speedup
  –  Input graphs are relatively well balanced (Graph500)
[Chart: MEdges/sec (higher is better) vs. number of compute nodes (up to 140), MarsGPU-3 vs. MarsGPU-3 LPT]
Slide 32: Performance Breakdown: Round Robin vs. LPT Schedule
•  Bitonic sort operates on a power-of-two number of key-value pairs
  –  Load balancing reduced the number of sorting elements, giving the speedup in Sort
[Chart: elapsed time (ms, lower is better) for MarsGPU-3 and MarsGPU-3 LPT, broken down into Map, MPI-Comm, PCI-Comm, Hash, Sort, and Reduce]
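
Since bitonic sort rounds the element count up to the next power of two, even a small reduction in per-GPU elements can avoid a doubling of the sorted array; the padding rule can be sketched as follows (a generic illustration, assuming padding to the next power of two as described above):

    #include <cstdint>

    // Effective bitonic sort size: element count rounded up to the next power of two.
    uint64_t bitonic_sort_size(uint64_t n_elements) {
        uint64_t p = 1;
        while (p < n_elements) p <<= 1;
        return p;
    }
    // Example: 2,100,000 pairs sort as 4,194,304 slots, while 1,900,000 pairs sort
    // as only 2,097,152, so balancing that keeps each GPU below a power-of-two
    // boundary directly shrinks the sorting work.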
Slide 33: Outperforming a Hadoop-based Implementation
•  PEGASUS: a Hadoop-based GIM-V implementation
  –  Hadoop 0.21.0
  –  Lustre as the underlying Hadoop file system
•  MarsGPU-3 achieves a 186.8x speedup over PEGASUS (SCALE 27, 128 nodes)
[Chart: KEdges/sec (log scale, higher is better) for PEGASUS, MarsCPU, and MarsGPU-3]
Slide 34: Related Work
•  Graph processing using GPUs
  –  Shortest path algorithms for the GPU (BFS, SSSP, and APSP)*1
    → Do not achieve competitive performance
•  MapReduce implementations on GPUs
  –  GPMR*2: a MapReduce implementation on multiple GPUs
    → Does not show scalability for large-scale processing
•  Graph processing with load balancing
  –  Load balancing while keeping communication low on R-MAT graphs*3
    → We show task scheduling-based load balancing
*1: Harish, P. et al., "Accelerating Large Graph Algorithms on the GPU using CUDA", HiPC 2007.
*2: Stuart, J.A. et al., "Multi-GPU MapReduce on GPU Clusters", IPDPS 2011.
*3: J. Chhugani, N. Satish, C. Kim, J. Sewall, and P. Dubey, "Fast and Efficient Graph Traversal Algorithm for CPUs: Maximizing Single-node Efficiency," in Parallel & Distributed Processing Symposium (IPDPS), 2012.
Slide 35: Conclusions
•  A scalable MapReduce-based GIM-V implementation using multiple GPUs
  –  Methodology
    •  Extend Mars to support multi-GPU
    •  Implement GIM-V on multi-GPU MapReduce
    •  Load balance optimization
  –  Performance
    •  87.04 ME/s at SCALE 30 (256 nodes, 768 GPUs)
    •  1.52x speedup over the CPU-based implementation
•  Future work
  –  Optimization of our implementation
    •  Improve communication and locality
  –  Handling data larger than GPU memory capacity
    •  Memory hierarchy management (GPU, DRAM, NVM, SSD)
Slide 36: Comparison with the Load Balance Algorithm (Simulation, Weak Scaling)
•  Compare naive (round robin) partitioning against the load balancing optimization (LPT schedule)
•  Similar except on 128 nodes (3.98% on SCALE 25, 64 nodes)
  –  Performance improvement: 13.8% (SCALE 26, 128 nodes)
[Chart: load imbalance [%] (lower is better) vs. number of compute nodes (2 to 128), Round Robin vs. LPT; largest gap 1.67x]
Slide 37: Large-scale Graphs in the Real World
•  Graphs in the real world
  –  Health care, social networking services, biology, electric power grids, etc.
  –  Millions to trillions of vertices and 100 million to 100 trillion edges
  –  Similar properties
    •  Scale-free (power-law degree distribution)
    •  Small diameter
•  Kronecker graphs
  –  Similar properties to real-world graphs
  –  Widely used (e.g., the Graph500 benchmark*1), since they are obtained easily by applying iterative Kronecker products to a base matrix
*1: D. A. Bader et al. The Graph500 list. Graph500.org. http://www.graph500.org/
[Diagram: Kronecker product example, G2 = G1 ⊗ G1]
