Bryan Thompson, Chief Scientist and Founder at SYSTAP, LLC at MLconf NYC

SYSTAP LLC | © 2006-2015 All Rights Reserved
MLConf @ NYC: March 27, 2015

Graph Databases Grew at Over 500% in the Last Two Years
Popularity changes per category – March 2015
PopularityChanges
Graph Databases

The Amount of Graph Data is Exploding
Billion+ Edges

Graphs are different. You need the
right paradigm and hardware to scale.
https://
datatake.files.wordpress.com/
Graph Cache Thrash
The CPU just waits for
graph data from main
memory...
TypeofCacheorMemory
Access Latency Per Clock Cycle

SYSTAP Solves the Graph Scaling Problem Using GPUs
●  100s of Times Faster than CPU
main memory-based systems
●  Up to 40X Cheaper
●  10,000X Faster than disk-based
technologies
●  High Level API

Uncovering inﬂuence links in molecular knowledge networks to streamline personalized medicine | Shin, Dmitriy et al.Journal of
Biomedical Informatics , Volume 52 , 394 - 405
Finding the Next Cure for Cancer is a
Billion+ Edge Graph Challenge

Graph is BIG and changing 
(Trillion+ Edges)

Graphs Enable People to Find Knowledge
A Bunch of Pages An Answer

http://BlazeGraph.com
http://MapGraph.io
Parallel&Breadth&First&Search&on&
GPU&Clusters&
SYSTAP™, LLC
© 2006-2015 All Rights Reserved
1
3/15/15
This work was (partially) funded by the DARPA XDATA program under AFRL Contract #FA8750-13-C-0002. This material is
based upon work supported by the Defense Advanced Research Projects Agency (DARPA) under Contract No. D14PC00029.
The authors would like to thank Dr. White, NVIDIA, and the MVAPICH group at Ohio State University for their support of this
work.
Z. Fu, H.K. Dasari, B. Bebee, M. Berzins, B. Thompson. Parallel Breadth First Search on GPU Clusters. IEEE Big Data.
Bethesda, MD. 2014.
h6p://mapgraph.io&

http://MapGraph.io
Many?Core&is&Your&Future&
SYSTAP™, LLC
2
3/10/15

http://MapGraph.io
MapGraph:&Extreme&performance&
•  GTEPS&is&Billions&(10^9)&of&Traversed&Edges&per&Second.&
–  This&is&the&basic&measure&of&performance&for&graph&traversal.&
Configura)on* Cost* GTEPS* $/GTEPS*
4?Core&CPU& $4,000& 0.2& $5,333&
45Core*CPU*+**K20*GPU* $7,000* 3.0* $2,333*
XMT?2&(rumored&price)& $1,800,000& 10.0& $188,000&
64*GPUs*(32*nodes*with*2x*K20*GPUs*per*
node*and*InfiniBand*DDRx4*–*today)*
$500,000* 30.0* $16,666*
16*GPUs*(2*nodes*with*8x*Pascal*GPUs*per*
node*and*InfiniBand*DDRx4*–*Q1,*2016)*
$125,000* >30.0* <$4,166*
SYSTAP™, LLC
3
3/10/15

http://MapGraph.io
SYSTAP’s&MapGraph&APIs&Roadmap&
•  Vertex&Centric&API&
–  Same&performance&as&CUDA.&
•  Schema?ﬂexible&data?model&
–  Property&Graph&/&RDF&
–  Reduce&import&hassles&
–  Opera_ons&over&merged&graphs&
•  Graph&pa6ern&matching&language&
•  DSL/Scala&=>&CUDA&code&genera_on&
SYSTAP™, LLC
Ease&of&Use&
4
3/10/15

http://MapGraph.io
Breadth&First&Search&
•  Fundamental&building&block&for&graph&algorithms&
–  Including&SPARQL&(JOINs)&
•  In&level&synchronous&steps&
–  Label&visited&verDces&
•  For&this&graph&
–  IteraDon&0&
–  IteraDon&1&
–  IteraDon&2&
•  Hard&problem!&
–  Basis&for&Graph&500&benchmark&
SYSTAP™, LLC
10
3/10/15
0&
1&
1&
2&
1&
1&
2&
2&
2&
2&
1&
3&
2&
3&
2&
2& 3&
2&

http://MapGraph.io
GPUs&–&A&Game&Changer&for&Graph&AnalyDcs&
0
500
1000
1500
2000
2500
3000
3500
1 10 100 1000 10000 100000
MillionTraversedEdgesperSecond
Average Traversal Depth
NVIDIA Tesla C2050
Multicore per socket
Sequential
Breadth-First Search on Graphs
10x Speedup on GPUs
3 GTEPS
•  Graphs&are&a&hard&problem&
•  Non?locality&
•  Data&dependent&parallelism&
•  Memory&bus&and&
communicaDon&bo6lenecks&
•  GPUs&deliver&eﬀecDve&
parallelism&
•  10x+&memory&bandwidth&
•  Dynamic&parallelism&
SYSTAP™, LLC
11
3/10/15
Graphic: Merrill, Garland, and Grimshaw, “GPU Sparse Graph
Traversal”, GPU Technology Conference, 2012

http://MapGraph.io
2D&Par__oning&(aka&Vertex&Cuts)&
•  p&x&p&compute&grid&
–  Edges&in&rows/cols&
–  Minimize&messages&
•  log(p)&(versus&p2)&
–  One&par__on&per&GPU&
•  Batch&parallel&opera_on&
–  Grid&row:&out?edges&
–  Grid&column:&in?edges&
•  Representa_ve&fron_ers&
SYSTAP™, LLC
5
3/10/15
G11
G22
G33
G44
C1
C2
C3
C4
R1
In2
t
Out1
t
Out2
t
Out3
t
Out4
t
In1
t
In3
t
In4
t
R2
R3
R4
Source&Ver_ces&
Target&Ver_ces&
out-edges
in-edges
BFS& PR&Query&
•  Parallelism&–&work&must&be&distributed&
and&balanced.&
•  Memory&bandwidth&–&memory,&not&disk,&
is&the&bo6leneck&

http://MapGraph.io
1
10
100
1,000
10,000
100,000
1,000,000
10,000,000
0 1 2 3 4 5 6
FrontierSize
Iteration
Scale&25&Traversal&
•  Work&spans&mul_ple&orders&of&magnitude.&
SYSTAP™, LLC
6
3/10/15

http://MapGraph.io
Distributed&BFS&Algorithm&
SYSTAP™, LLC
7
3/10/15
1: procedure BFS(Root, Predecessor)
2: In0
ij LocalVertex(Root)
3: for t 0 do
4: Expand(Int
i, Outt
ij)
5: LocalFrontiert Count(Outt
ij)
6: GlobalFrontiert Reduce(LocalFrontiert)
7: if GlobalFrontiert > 0 then
8: Contract(Outt
ij, Outt
j, Int+1
i , Assignij)
9: UpdateLevels(Outt
j, t, level)
10: else
11: UpdatePreds(Assignij,Predsij, level)
12: break
13: end if
14: t++
15: end for
16: end procedure
Figure 3. Distributed Breadth First Search algorithm
1: procedure UPDATEPRED
2: for all v 2 Assignedij
3: Pred[v] 1
4: for i ColOff[v]
5: if level[v] ==
6: Pred[v]
7: end if
8: end for
9: end for
10: end procedure
Figure 7. Pre
for ﬁnding the Preﬁxt
ij, whic
performing Bitwise-OR Exclu
across each row j as shown
are several tree-based algori
Since the bitmap union opera
! Data parallel 1-hop expand on all GPUs
! Termination check
! Local frontier size
! Global frontier size
! Compute predecessors from local
In / Out levels (no communication)! Done
! Starting vertex
! Global Frontier contraction (“wave”)
! Record level of vertices in the In /
Out frontier
! Next iteration

http://MapGraph.io
Distributed&BFS&Algorithm&
SYSTAP™, LLC
8
3/10/15
1: procedure BFS(Root, Predecessor)
2: In0
ij LocalVertex(Root)
3: for t 0 do
4: Expand(Int
i, Outt
ij)
5: LocalFrontiert Count(Outt
ij)
6: GlobalFrontiert Reduce(LocalFrontiert)
7: if GlobalFrontiert > 0 then
8: Contract(Outt
ij, Outt
j, Int+1
i , Assignij)
9: UpdateLevels(Outt
j, t, level)
10: else
11: UpdatePreds(Assignij,Predsij, level)
12: break
13: end if
14: t++
15: end for
16: end procedure
Figure 3. Distributed Breadth First Search algorithm
1: procedure UPDATEPRED
2: for all v 2 Assignedij
3: Pred[v] 1
4: for i ColOff[v]
5: if level[v] ==
6: Pred[v]
7: end if
8: end for
9: end for
10: end procedure
Figure 7. Pre
for finding the Prefixt
ij, whic
performing Bitwise-OR Exclu
across each row j as shown
are several tree-based algori
Since the bitmap union opera
•  Key&differences&
•  log(p)&parallel&scan&(vs&sequen_al&wave)&
•  GPU?local&computa_on&of&predecessors&
•  1&par__on&per&GPU&
•  GPUDirect&(vs&RDMA)&
! Data parallel 1-hop expand on all GPUs
! Termination check
! Local frontier size
! Global frontier size
! Compute predecessors from local
In / Out levels (no communication)! Done
! Starting vertex
! Global Frontier contraction (“wave”)
! Record level of vertices in the In /
Out frontier
! Next iteration

http://MapGraph.io
Global&Fron_er&Contrac_on&and&Communica_on&
SYSTAP™, LLC
9
3/10/15
10: Outij convert(Lout)
11: end procedure
Figure 4. Expand algorithm
1: procedure CONTRACT(Outt
ij, Outt
j, Int+1
i , Assignij)
2: Prefixt
ij ExclusiveScanj(Outt
ij)
3: Assignedij Assignedij [ (Outt
ij Prefixt
ij)
4: if i = p then
5: Outt
j Outt
ij [ Prefixt
ij
6: end if
7: Broadcast(Outt
j, p , ROW)
8: if i = j then
9: Int+1
i Outt
j
10: end if
11: Broadcast(Int+1
i , i, COL)
12: end procedure
Figure 5. Global Frontier contraction and communication algorithm
the row. This is done to update t
vertices so that they do not prod
those vertices in the future iteratio
updating the levels of the vertice
Int+1
j . Since both Outt
j and Int+1
i
in the diagonal (i = j), in line 9,
the diagonal GPUs and then broa
Ci using MPI_Broadcast in line
setup along every row and colum
facilitate these parallel scan and b
During each BFS iteration, we
of the algorithm by checking if
zero as in line 6 of Figure 3. We
frontier using a MPI_Allreduce op
edge expansion and before the g
phase. However, an unexplored o
the termination check with the g
and communication phase in orde
! Distributed prefix sum. Prefix is vertices
discovered by your left neighbors (log p)
! Vertices first discovered by this GPU.
! Right most column has global frontier for
the row
! Broadcast frontier over row.
! Broadcast frontier over column.

http://MapGraph.io
SYSTAP™, LLC
10
3/10/15
ij, Outt
j, Int+1
i , Assignij)
2: Prefixt
ij)
ij Prefixt
ij)
4: if i = p then
5: Outt
j Outt
ij [ Prefixt
ij
6: end if
7: Broadcast(Outt
j, p , ROW)
8: if i = j then
9: Int+1
i Outt
j
10: end if
11: Broadcast(Int+1
i , i, COL)
12: end procedure
!&Distributed&prefix&sum.&Prefix&is&ver_ces&
discovered&by&your&lel&neighbors&(log&p)&
the row
G11
G22
G33
G44
C1
C2
C3
C4
R1
In2
t
Out1
t
Out2
t
Out3
t
Out4
t
In1
t
In3
t
In4
t
R2
R3
R4
Source&Ver_ces&
Target&Ver_ces&

http://MapGraph.io
SYSTAP™, LLC
11
3/10/15
ij, Outt
j, Int+1
i , Assignij)
2: Prefixt
ij)
ij Prefixt
ij)
4: if i = p then
5: Outt
j Outt
ij [ Prefixt
ij
6: end if
7: Broadcast(Outt
j, p , ROW)
8: if i = j then
9: Int+1
i Outt
j
10: end if
11: Broadcast(Int+1
i , i, COL)
12: end procedure
•  We&use&a&parallel&scan&that&minimizes&communica_on&steps&(vs&work).&
•  This&improves&the&overall&scaling&efficiency&by&30%.&
!*Distributed*prefix*sum.*Prefix*is*ver)ces*
discovered*by*your*leZ*neighbors*(log*p)*
the row
G11
G22
G33
G44
C1
C2
C3
C4
R1
In2
t
Out1
t
Out2
t
Out3
t
Out4
t
In1
t
In3
t
In4
t
R2
R3
R4
Source&Ver_ces&
Target&Ver_ces&

http://MapGraph.io
SYSTAP™, LLC
12
3/15/15
ij, Outt
j, Int+1
i , Assignij)
2: Prefixt
ij)
ij Prefixt
ij)
4: if i = p then
5: Outt
j Outt
ij [ Prefixt
ij
6: end if
7: Broadcast(Outt
j, p , ROW)
8: if i = j then
9: Int+1
i Outt
j
10: end if
11: Broadcast(Int+1
i , i, COL)
12: end procedure
the row
G11
G22
G33
G44
C1
C2
C3
C4
R1
In2
t
Out1
t
Out2
t
Out3
t
Out4
t
In1
t
In3
t
In4
t
R2
R3
R4
Source&Ver_ces&
Target&Ver_ces&
!*Distributed*prefix*sum.*Prefix*is*ver)ces*
discovered*by*your*leZ*neighbors*(log*p)*

http://MapGraph.io
SYSTAP™, LLC
13
3/15/15
ij, Outt
j, Int+1
i , Assignij)
2: Prefixt
ij)
ij Prefixt
ij)
4: if i = p then
5: Outt
j Outt
ij [ Prefixt
ij
6: end if
7: Broadcast(Outt
j, p , ROW)
8: if i = j then
9: Int+1
i Outt
j
10: end if
11: Broadcast(Int+1
i , i, COL)
12: end procedure
! Right most column has global frontier
for the row
G11
G22
G33
G44
C1
C2
C3
C4
R1
In2
t
Out1
t
Out2
t
Out3
t
Out4
t
In1
t
In3
t
In4
t
R2
R3
R4
Source&Ver_ces&
Target&Ver_ces&
!&Distributed&prefix&sum.&Prefix&is&ver_ces&
discovered&by&your&lel&neighbors&(log&p)&

http://MapGraph.io
SYSTAP™, LLC
14
3/15/15
ij, Outt
j, Int+1
i , Assignij)
2: Prefixt
ij)
ij Prefixt
ij)
4: if i = p then
5: Outt
j Outt
ij [ Prefixt
ij
6: end if
7: Broadcast(Outt
j, p , ROW)
8: if i = j then
9: Int+1
i Outt
j
10: end if
11: Broadcast(Int+1
i , i, COL)
12: end procedure
the row
G11
G22
G33
G44
C1
C2
C3
C4
R1
In2
t
Out1
t
Out2
t
Out3
t
Out4
t
In1
t
In3
t
In4
t
R2
R3
R4
Source&Ver_ces&
Target&Ver_ces&

http://MapGraph.io
SYSTAP™, LLC
15
3/15/15
ij, Outt
j, Int+1
i , Assignij)
2: Prefixt
ij)
ij Prefixt
ij)
4: if i = p then
5: Outt
j Outt
ij [ Prefixt
ij
6: end if
7: Broadcast(Outt
j, p , ROW)
8: if i = j then
9: Int+1
i Outt
j
10: end if
11: Broadcast(Int+1
i , i, COL)
12: end procedure
the row
G11
G22
G33
G44
C1
C2
C3
C4
R1
In2
t
Out1
t
Out2
t
Out3
t
Out4
t
In1
t
In3
t
In4
t
R2
R3
R4
Source&Ver_ces&
Target&Ver_ces&
•  All&GPUs&now&have&the&new&fron_er.&

http://MapGraph.io
Weak&Scaling&
•  Scaling&the&problem&size&with&
more&GPUs&
–  Fixed&problem&size&per&GPU.&
•  Maximum&scale&27&(4.3B&edges)&
–  64&K20&GPUs&=>&.147s&=>&29&GTEPS&
–  64&K40&GPUs&=>&.135s&=>&32&GTEPS&
•  K40&has&faster&memory&bus.&
0
5
10
15
20
25
30
35
1 4 16 64
GTEPS
#GPUs
GPUs& Scale& Ver)ces& Edges& Time*(s)& GTEPS&
1& 21& 2,097,152& 67,108,864& 0.0254* 2.5&
4& 23& 8,388,608& 268,435,456& 0.0429* 6.3&
16& 25& 33,554,432& 1,073,741,824& 0.0715* 15.0&
64& 27& 134,217,728& 4,294,967,296& 0.1478* 29.1&
SYSTAP™, LLC
16
3/10/15
Scaling the problem size with
more GPUs.

http://MapGraph.io
Central&Itera_on&Costs&(Weak&Scaling)&
SYSTAP™, LLC
17
3/10/15
0.0000
0.0020
0.0040
0.0060
0.0080
0.0100
0.0120
4 16 64
Time(Seconds)
#GPUs
Computation
Communication
Termination
•  Communica_on&
costs&are&not&
constant.&
•  2D&design&implies&
cost&grows&as&
2*log(2p)/log(p)*
•  How&to&scale?&
–  Overlapping&
–  Compression&
•  Graph&
•  Message&
–  Hybrid&par__oning&
–  Heterogeneous&
compu_ng&

http://MapGraph.io
Costs&During&BFS&Traversal&
SYSTAP™, LLC
18
3/10/15
•  Chart&shows&the&diﬀerent&costs&for&each&GPU&in&each&itera_on&(64&GPUs).&
•  Wave&_me&is&essen_ally&constant,&as&expected.&
•  Compute&_me&peaks&during&the&central&itera_ons.&
•  Costs&are&reasonably&well&balanced&across&all&GPUs&aler&the&2nd&itera_on.&

http://MapGraph.io
Strong&Scaling&
•  Speedup&on&a&constant&problem&size&with&more&GPUs&
•  Problem&scale&25&
–  2^25&ver_ces&(33,554,432)&
–  2^26&directed&edges&(1,073,741,824)&
•  Strong&scaling&eﬃciency&of&48%&
–  Versus&44%&for&BG/Q&
GPUs* GTEPS* Time*(s)*
16* 15.2* 0.071*
25* 18.2* 0.059*
36* 20.5* 0.053*
49& 21.8& 0.049*
64* 22.7* 0.047*
SYSTAP™, LLC
19
3/10/15
Speedup on a constant
problem size with more GPUs.

http://MapGraph.io
MapGraph&Beta&Customer?&
&
Contact&
Bradley&Bebee&
beebs@systap.com&
20
3/10/15
SYSTAP™, LLC

Bryan Thompson, Chief Scientist and Founder at SYSTAP, LLC at MLconf NYC

Recommended

Recommended

More Related Content

What's hot

What's hot (19)

Similar to Bryan Thompson, Chief Scientist and Founder at SYSTAP, LLC at MLconf NYC

Similar to Bryan Thompson, Chief Scientist and Founder at SYSTAP, LLC at MLconf NYC (20)

More from MLconf

More from MLconf (20)

Recently uploaded

Recently uploaded (20)

Bryan Thompson, Chief Scientist and Founder at SYSTAP, LLC at MLconf NYC