Slide 1: A Scalable Implementation of a MapReduce-based Graph Processing Algorithm for Large-scale Heterogeneous Supercomputers
Koichi Shirahata*1, Hitoshi Sato*1,*2, Toyotaro Suzumura*1,*2,*3, Satoshi Matsuoka*1
*1 Tokyo Institute of Technology
*2 CREST, Japan Science and Technology Agency
*3 IBM Research - Tokyo
Slide 2: Emergence of Large Scale Graphs
•  Example scale: 900 million vertices, 100 billion edges
→ Need fast and scalable analysis using HPC
Slide 3 (Motivation): GPU-based Heterogeneous Supercomputers
•  Fast large graph processing with GPGPU
  –  High peak performance
  –  High memory bandwidth
•  TSUBAME 2.0: 1408 compute nodes (3 GPUs / node)
Slide 4: Problems of Large Scale Graph Processing with GPGPU
•  How much do GPUs accelerate large scale graph processing?
  –  Applicability to graph applications
    •  Computation patterns of the graph algorithm affect performance
    •  Tradeoff between computation and CPU-GPU data transfer overhead
  –  How to distribute graph data to each GPU in order to exploit multiple GPUs
[Diagram: GPU memory vs. CPU memory; key concerns: scalability, load balancing, communication]
Slide 5: Motivating Example: CPU-based Graph Processing
•  How much is the graph application accelerated using GPUs?
  –  Simple computation patterns benefit from the high memory bandwidth
  –  Complex computation patterns suffer from PCI-E transfer overhead
[Chart: elapsed time (ms) of the CPU-based run vs. number of compute nodes (1 to 128), broken down into Map, Copy, Sort, and Reduce phases]
Slide 6: Contributions
•  Implemented a scalable multi-GPU-based PageRank application
  –  Extend Mars (an existing GPU MapReduce framework) using the MPI library
  –  Implement GIM-V (a graph processing algorithm) on multi-GPU MapReduce
  –  Load balance optimization between GPU devices for large-scale graphs
    •  Task scheduling-based graph partitioning
•  Performance on the TSUBAME 2.0 supercomputer
  –  Scales well up to 256 nodes (768 GPUs)
  –  1.52x speedup compared with CPUs
Slides 7-8: Proposal: Multi-GPU GIM-V with Load Balance Optimization
•  Graph application: PageRank
•  Graph algorithm: Multi-GPU GIM-V
  –  Implement GIM-V on multi-GPU MapReduce: optimization for GIM-V, load balance optimization
•  MapReduce framework: Multi-GPU Mars
  –  Extend an existing GPU MapReduce framework (Mars) for multi-GPU
•  Platform: CUDA, MPI
Slides 9-10: Structure of Mars
•  Mars*1: an existing GPU-based MapReduce framework
  –  CPU-GPU data transfer (Map)
  –  GPU-based bitonic sort (Shuffle)
  –  Allocates one CUDA thread per key (Map, Reduce)
•  Pipeline: Scheduler -> Preprocess -> GPU processing (Map -> Sort -> Reduce)
→ We extend Mars for multi-GPU support
*1: Bingsheng He et al. Mars: A MapReduce Framework on Graphics Processors. PACT 2008.
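
As a rough illustration of the one-thread-per-key allocation above, a minimal CUDA sketch of a Map launch could look like the following (illustrative only; the kernel name, record layout, and identity map body are assumptions, not Mars's actual code):

    // Minimal sketch: one CUDA thread processes one key-value record.
    struct KeyVal { int key; float val; };

    __global__ void map_kernel(const KeyVal* in, KeyVal* out, int n) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;  // one thread per record
        if (i >= n) return;
        // A user-defined map() would go here; the identity map is a placeholder.
        out[i] = in[i];
    }

    // Host-side launch covering all n records with 256-thread blocks:
    // map_kernel<<<(n + 255) / 256, 256>>>(d_in, d_out, n);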
Slide 11: Proposal: Mars Extension for Multi-GPU using MPI
•  Inter-GPU communication in Shuffle
  –  GPU-to-CPU download -> MPI_Alltoallv -> CPU-to-GPU upload -> local Sort
•  Parallel I/O feature using MPI-IO
  –  Improve I/O throughput between memory and storage
[Diagram: per-GPU pipeline of Upload (CPU -> GPU), Map, Copy, Sort, Reduce, and Download (GPU -> CPU), coordinated by the Scheduler]
Slide 12: Proposal: Multi-GPU GIM-V with Load Balance Optimization (roadmap repeated from Slides 7-8; this part covers implementing GIM-V on multi-GPU MapReduce, with optimization for GIM-V and load balance optimization)
Slides 13-17: Large Graph Processing Algorithm GIM-V
•  Generalized Iterative Matrix-Vector multiplication*1
  –  Graph applications are implemented by defining 3 functions
  –  v' = M ×G v, where
       v'_i = Assign(v_i, CombineAll_i({x_j | j = 1..n, x_j = Combine2(m_ij, v_j)}))   (i = 1..n)
[Diagram, built up across these slides: M ×G V, with Combine2 applied to each pair (m_ij, v_j), CombineAll aggregating the results for row i, and Assign writing the new value of V]
•  GIM-V can be implemented as a 2-stage MapReduce
→ Implement it on a multi-GPU environment
*1: Kang, U. et al., "PEGASUS: A Peta-Scale Graph Mining System - Implementation and Observations", IEEE International Conference on Data Mining, 2009.
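
As a concrete example of the three functions, the PageRank instantiation described in the PEGASUS paper can be sketched roughly as follows (the names and signatures are illustrative assumptions, not the code used in this work):

    #include <vector>

    const float C = 0.85f;                      // damping factor

    // Combine2: contribution of one matrix element and one vector element
    float combine2(float m_ij, float v_j) {
        return C * m_ij * v_j;
    }

    // CombineAll: sum the contributions for row i and add the random-jump term
    float combineAll(const std::vector<float>& x, int n_vertices) {
        float sum = 0.0f;
        for (float xj : x) sum += xj;
        return (1.0f - C) / n_vertices + sum;
    }

    // Assign: PageRank simply replaces the old value with the combined one
    float assign(float /*v_old*/, float v_new) {
        return v_new;
    }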
  
Slide 18: Proposal: GIM-V Implementation on Multi-GPU
•  Continuous execution feature for iterations
  –  2 MapReduce stages per iteration (Stage 1: Combine2, Stage 2: CombineAll)
  –  Graph partition at pre-processing
    •  Divide the input graph vertices/edges among GPUs
  –  Parallel convergence test at post-processing
    •  Locally on each process -> globally using MPI_Allreduce
[Diagram: the Scheduler / GPU processing drives Pre-process (graph partition), Multi-GPU GIM-V (Stage 1, Stage 2), and Post-process (convergence test)]
Slides 19-20: Optimizations for Multi-GPU GIM-V
•  Data structure
  –  Mars handles metadata and payload -> our implementation eliminates the metadata and uses a fixed-size payload
•  Thread allocation
  –  Mars handles one key per thread -> in the Reduce stage, our implementation allocates multiple CUDA threads to a single key according to its value size
•  Load balance optimization
  –  Scale-free property: a small number of vertices have many edges -> minimize load imbalance among GPUs
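
One way the multiple-threads-per-key idea can be pictured is a block-per-key reduction kernel like the sketch below (an illustrative assumption about the data layout, with a sum standing in for the user-defined reduce; not the actual implementation):

    // One thread block reduces the values of one key; threads stride over the
    // key's values, then combine their partial sums in shared memory.
    __global__ void reduce_block_per_key(const float* values, const int* key_offsets,
                                         const int* key_counts, float* out) {
        int k = blockIdx.x;
        const float* v = values + key_offsets[k];
        int cnt = key_counts[k];

        __shared__ float partial[256];
        float local = 0.0f;
        for (int i = threadIdx.x; i < cnt; i += blockDim.x)
            local += v[i];                               // strided accumulation
        partial[threadIdx.x] = local;
        __syncthreads();

        for (int s = blockDim.x / 2; s > 0; s >>= 1) {   // tree reduction in the block
            if (threadIdx.x < s) partial[threadIdx.x] += partial[threadIdx.x + s];
            __syncthreads();
        }
        if (threadIdx.x == 0) out[k] = partial[0];
    }
    // Launch with exactly 256 threads per block, one block per key:
    // reduce_block_per_key<<<num_keys, 256>>>(d_vals, d_offsets, d_counts, d_out);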
  
Slide 21: Apply Load Balancing Optimization
•  Partition the graph so as to minimize load imbalance among GPUs
  –  Apply a task scheduling algorithm
    •  Regard a vertex and its edges as a task
    •  TaskSize_i = 1 + (number of outgoing edges of vertex i); e.g. a vertex with 3 outgoing edges has TaskSize = 1 + 3
  –  LPT (Longest Processing Time) schedule*1
    •  Assign tasks in decreasing order of task size
    •  Minimizes the maximum amount assigned to any GPU
[Diagram: tasks {8, 5, 4, 3, 1} distributed over processors P1-P3]
*1: R. L. Graham, "Bounds on multiprocessing anomalies and related packing algorithms," in Proceedings of the May 16-18, 1972, Spring Joint Computer Conference, ser. AFIPS '72 (Spring).
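
A compact sketch of the LPT assignment itself, with task sizes computed as above (a generic illustration of the scheduling rule, not the paper's partitioner):

    #include <algorithm>
    #include <cstdint>
    #include <functional>
    #include <queue>
    #include <utility>
    #include <vector>

    // Returns, for each task (vertex), the GPU it is assigned to.
    std::vector<int> lpt_schedule(const std::vector<uint64_t>& task_sizes, int num_gpus) {
        std::vector<int> order(task_sizes.size());
        for (size_t i = 0; i < order.size(); ++i) order[i] = static_cast<int>(i);
        std::sort(order.begin(), order.end(),
                  [&](int a, int b) { return task_sizes[a] > task_sizes[b]; });  // decreasing size

        using Load = std::pair<uint64_t, int>;                  // (current load, gpu id)
        std::priority_queue<Load, std::vector<Load>, std::greater<Load>> heap;
        for (int g = 0; g < num_gpus; ++g) heap.push({0, g});

        std::vector<int> assignment(task_sizes.size());
        for (int t : order) {
            Load least = heap.top(); heap.pop();
            assignment[t] = least.second;                            // biggest remaining task goes
            heap.push({least.first + task_sizes[t], least.second});  // to the least-loaded GPU
        }
        return assignment;
    }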
Slide 22: Experiments
•  Goal: study the performance of our multi-GPU GIM-V
  –  Scalability
  –  Comparison with a CPU-based implementation
  –  Validity of the load balance optimization
•  Methods
  –  A single round of iterations (without preprocessing)
  –  PageRank application: measures the relative importance of web pages
  –  Input data: artificial Kronecker graphs, generated by the Graph500 generator
•  Parameters
  –  SCALE: log2 of #vertices (#vertices = 2^SCALE)
  –  Edge_factor: 16 (#edges = Edge_factor × #vertices)
[Diagram: Kronecker product example, G2 = G1 ⊗ G1]
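
For reference, the graph sizes implied by these parameters follow directly from the two formulas; a trivial sketch (the Kronecker generator itself is the standard Graph500 one and is not reproduced here):

    #include <cstdint>
    #include <cstdio>

    int main() {
        int scale = 30;                                  // largest SCALE used in the experiments
        uint64_t edge_factor = 16;
        uint64_t n_vertices = 1ULL << scale;             // 2^SCALE vertices
        uint64_t n_edges = edge_factor * n_vertices;     // Edge_factor x #vertices edges
        std::printf("SCALE %d: %llu vertices, %llu edges\n", scale,
                    (unsigned long long)n_vertices, (unsigned long long)n_edges);
        return 0;
    }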
  
Slide 23: Experimental Environments
•  TSUBAME 2.0 supercomputer
  –  We use 256 nodes (768 GPUs)
    •  CPU-GPU: PCI-E 2.0 x16
    •  Internode: QDR InfiniBand (40 Gbps), dual rail
•  Mars configurations
  –  MarsGPU-n: n GPUs per node (n = 1, 2, 3)
  –  MarsCPU: 12 threads per node, MPI and pthreads, parallel quicksort

               CPU                  GPU
  Model        Intel Xeon X5670     Tesla M2050
  # Cores      6                    448
  Frequency    2.93 GHz             1.15 GHz
  Memory       54 GB                2.7 GB
  Compiler     gcc 4.3.4            nvcc 4.0
Slides 24-25: Weak Scaling Performance: MarsGPU vs. MarsCPU
•  Without load balance optimization
•  87.04 MEdges/s at 256 nodes; 1.52x speedup (3 GPUs vs. CPU)
[Chart: MEdges/sec (higher is better) vs. number of compute nodes (up to 300), for MarsGPU-1, MarsGPU-2, MarsGPU-3, and MarsCPU, at SCALE 27 through 30]
Slides 26-28: Performance Breakdown: MarsGPU and MarsCPU (SCALE 28)
•  Speedups over MarsCPU: 8.93x in Map, 2.53x in Sort
•  PCI-E transfer (PCI-Comm) is a visible overhead in the GPU configurations
[Chart: elapsed time (ms, lower is better) for MarsCPU and MarsGPU-1/2/3, broken down into Map, MPI-Comm, PCI-Comm, Hash, Sort, and Reduce]
Slide 29: Efficiency of GIM-V Optimizations
•  Data structure optimization (affects Map, Sort, Reduce)
•  Thread allocation optimization (affects Reduce)
•  Naive vs. optimized (SCALE 26, 128 nodes on MarsGPU-3): 1.92x in Map, 1.64x in Sort, 66.8x in Reduce
[Chart: elapsed time (ms, log scale, lower is better) of Map, Sort, and Reduce for the naive and optimized versions]
Slides 30-31: Round Robin vs. LPT Schedule: Weak Scaling Performance
•  Similar except on 128 nodes, where LPT gives a 1.16x speedup
  –  Input graphs are relatively well balanced (Graph500)
[Chart: MEdges/sec (higher is better) vs. number of compute nodes (up to 140), MarsGPU-3 vs. MarsGPU-3 LPT]
Slide 32: Performance Breakdown: Round Robin vs. LPT Schedule
•  Bitonic sort operates on a power-of-two number of key-value pairs
  –  Load balancing reduced the number of sorting elements, giving the speedup in Sort
[Chart: elapsed time (ms, lower is better) for MarsGPU-3 and MarsGPU-3 LPT, broken down into Map, MPI-Comm, PCI-Comm, Hash, Sort, and Reduce]
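
Since bitonic sort rounds the element count up to the next power of two, even a small reduction in per-GPU elements can avoid a doubling of the sorted array; the padding rule can be sketched as follows (a generic illustration, assuming padding to the next power of two as described above):

    #include <cstdint>

    // Effective bitonic sort size: element count rounded up to the next power of two.
    uint64_t bitonic_sort_size(uint64_t n_elements) {
        uint64_t p = 1;
        while (p < n_elements) p <<= 1;
        return p;
    }
    // Example: 2,100,000 pairs sort as 4,194,304 slots, while 1,900,000 pairs sort
    // as only 2,097,152, so balancing that keeps each GPU below a power-of-two
    // boundary directly shrinks the sorting work.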
Slide 33: Outperforming a Hadoop-based Implementation
•  PEGASUS: a Hadoop-based GIM-V implementation
  –  Hadoop 0.21.0
  –  Lustre as the underlying Hadoop file system
•  MarsGPU-3 achieves a 186.8x speedup over PEGASUS (SCALE 27, 128 nodes)
[Chart: KEdges/sec (log scale, higher is better) for PEGASUS, MarsCPU, and MarsGPU-3]
Slide 34: Related Work
•  Graph processing using GPUs
  –  Shortest path algorithms for the GPU (BFS, SSSP, and APSP)*1
    → Do not achieve competitive performance
•  MapReduce implementations on GPUs
  –  GPMR*2: a MapReduce implementation on multiple GPUs
    → Does not show scalability for large-scale processing
•  Graph processing with load balancing
  –  Load balancing while keeping communication low on R-MAT graphs*3
    → We show task scheduling-based load balancing
*1: Harish, P. et al., "Accelerating Large Graph Algorithms on the GPU using CUDA", HiPC 2007.
*2: Stuart, J.A. et al., "Multi-GPU MapReduce on GPU Clusters", IPDPS 2011.
*3: J. Chhugani, N. Satish, C. Kim, J. Sewall, and P. Dubey, "Fast and Efficient Graph Traversal Algorithm for CPUs: Maximizing Single-node Efficiency," in Parallel & Distributed Processing Symposium (IPDPS), 2012.
Slide 35: Conclusions
•  A scalable MapReduce-based GIM-V implementation using multiple GPUs
  –  Methodology
    •  Extend Mars to support multi-GPU
    •  Implement GIM-V on multi-GPU MapReduce
    •  Load balance optimization
  –  Performance
    •  87.04 ME/s at SCALE 30 (256 nodes, 768 GPUs)
    •  1.52x speedup over the CPU-based implementation
•  Future work
  –  Optimization of our implementation
    •  Improve communication and locality
  –  Handling data larger than GPU memory capacity
    •  Memory hierarchy management (GPU, DRAM, NVM, SSD)
Slide 36: Comparison with the Load Balance Algorithm (Simulation, Weak Scaling)
•  Compare naive (round robin) partitioning against the load balancing optimization (LPT schedule)
•  Similar except on 128 nodes (3.98% on SCALE 25, 64 nodes)
  –  Performance improvement: 13.8% (SCALE 26, 128 nodes)
[Chart: load imbalance [%] (lower is better) vs. number of compute nodes (2 to 128), Round Robin vs. LPT; largest gap 1.67x]
Slide 37: Large-scale Graphs in the Real World
•  Graphs in the real world
  –  Health care, social networking services, biology, electric power grids, etc.
  –  Millions to trillions of vertices and 100 million to 100 trillion edges
  –  Similar properties
    •  Scale-free (power-law degree distribution)
    •  Small diameter
•  Kronecker graphs
  –  Similar properties to real-world graphs
  –  Widely used (e.g., the Graph500 benchmark*1), since they are obtained easily by applying iterative Kronecker products to a base matrix
*1: D. A. Bader et al. The Graph500 list. Graph500.org. http://www.graph500.org/
[Diagram: Kronecker product example, G2 = G1 ⊗ G1]
