Bryan Thompson, Chief Scientist and Founder, SYSTAP, LLC at MLconf ATL

http://bigdata.com
http://mapgraph.io
SYSTAP, LLC
Graphs
Graph Databases
Graph Analytics on GPUs
SYSTAP™, LLC
© 2006-2014 All Rights Reserved
1
9/19/2014

http://bigdata.com
http://mapgraph.io
Graphs
• This talk is about recent advances in large scale graph processing on
GPUs.
– The motivation is extreme performance.
– Everything we do (as a company) is focused on graphs.
• Graph Database
• Graph processing
• Common characteristics:
– irregular data shape, irregular access patterns, and irregular parallelism.
• A lot of data can be mapped onto graphs
– Sparse matrices and graphs are very close data structures
– Graphs, as we deal with them, have attributes on vertices and edges.
• A lot of algorithms can be mapped onto graphs
– Including many machine learning algorithms.
SYSTAP™, LLC
2
http://www.bigdata.com/blog

Small Business, Founded 2006 100% Employee Owned
http://bigdata.com
http://mapgraph.io
SYSTAP, LLC
Graph Database
• High performance, Scalable
– 50B edges/node
– High level query language
– Efficient Graph Traversal
– High 9s solution
• Open Source
– Subscriptions
GPU Analytics
• Extreme Performance
– 5-100x faster than graphlab
– 10,000x faster than graphdbs
• DARPA funding
• Disruptive technology
– Early adopters
– Huge ROIs
• Open Source
SYSTAP™, LLC
3
9/19/2014

Related “Graph” Technologies
Redpoint repositions existing technology,
adding interoperability for blueprints and
gremlin.
http://bigdata.com
http://mapgraph.io
STTR
Pair up bigdata
and MapGraph
MapGraph compares favorably with high end
hardware solutions from YARC, Oracle, and
SAP, but is open source and uses commodity
hardware.
SYSTAP™, LLC
4
9/19/2014

Embedded, Single Server, HA, Scale-out
• RDF/SPARQL
• Property graphs
– Blueprints, gremlin, rexter
• REST API (NSS)
• Extension points
– Stored queries for custom
application logic on the
server.
– Custom services & indices
– Custom functions
– Vertex-centric programs
http://bigdata.com
http://mapgraph.io
• Embedded Server
Journal
JVM
• Standalone Server
Journal
WAR
SYSTAP™, LLC
5
9/19/2014

http://bigdata.com
http://mapgraph.io
High Availability
• Shared nothing architecture
– Same data on each node
– Coordinate only at commit
– Transparent load balancing
• Scaling
– 50 billion triples or quads
– Query throughput scales linearly
• Self healing
– Automatic failover
– Automatic resync after disconnect
– Online single node disaster recovery
• Online Backup
– Online snapshots (full backups)
– HA Logs (incremental backups)
• Point in time recovery (offline)
HAService
Quorum
k=3
size=3
leader
follower
HAService
HAService
SYSTAP™, LLC
6
9/19/2014

Embedded, Single Server, HA, Scale-out
http://bigdata.com
http://mapgraph.io
RDF Data and SPARQL Query
Client Service
Distributed Index Management and Query
Management Functions
Client Service
Registrar
Data Service
Client Service
Data Service Data Service Data Service
Data Service Data Service Data Service
Zookeeper
Shard Locator
Transaction Mgr
Load Balancer
Unified API
Application
Client
Application
Client
Application
Client
Application
Client
Application
Client
Client Service
SPARQL XML
SPARQL JSON
RDF/XML
N-Triples
N-Quads
Turtle
TriG
RDF/JSON
SYSTAP™, LLC
7
9/19/2014

http://bigdata.com
http://mapgraph.io
And now on GPUs
SYSTAP™, LLC
8
9/19/2014

Similar models, different problems
• Graph query and graph analytics (traversal/mining)
– Related data models
– Very different computational requirements
• Many technologies are a bad match or limited solution
– Key-value stores (bigtable, Accumulo, Cassandra, HBase)
– Map-reduce
• Anti-pattern
– Dump all data into “big bucket”
http://bigdata.com
http://mapgraph.io
SYSTAP™, LLC
9
9/19/2014

Similar models, different problems
• Graph query and graph analytics (traversal/mining)
– Related data models
– Very different computational requirements
• Many technologies are a bad match or limited solution
– Key-value stores (bigtable, Accumulo, Cassandra, HBase)
– Map-reduce
• Anti-pattern
– Dump all data into “big bucket”
Storage and computation patterns must be
correctly matched for high performance.
http://bigdata.com
http://mapgraph.io
SYSTAP™, LLC
10
9/19/2014

Optimize for the right problem
• Graph analytics
– Parallelism – work must be distributed and balanced.
– Memory bandwidth – memory, not disk, is the bottleneck
– 2D partitioning – O(log(N)) communications pattern (versus O(N*N))
• 1D design looses locality when updating link weights for reverse indices.
BFS PR
• Storage and computation patterns must be correctly matched
for high performance.
http://bigdata.com
http://mapgraph.io
SYSTAP™, LLC
11
9/19/2014

GPUs – A Game Changer for Graph Analytics
• Graphs are a hard problem
• Non-locality
• Data dependent parallelism
• Memory, PCIe bus and network are
bottlenecks
• Recent performance gains driven by
innovations in bottom-up search,
data layout, and partitioning.
• GPUs deliver effective parallelism
• 10x CPU FLOPS
• 10x CPU/RAM bandwidth
• Significant speeds up over CPU
• 3 GTEPS on one GPU
• 32 GTEPS on 64 GPU cluster
http://bigdata.com
http://mapgraph.io
3500
3000
2500
2000
1500
1000
500
0
Breadth-First Search on Graphs
NVIDIA Tesla C2050
Multicore per socket
Sequential
0
2
1 10 100 1000 10000 100000
Million Traversed Edges per
Second
Average Traversal Depth
1
1
2
1
1
2
2
2
1
3
2
3
2
1 2
2
10x Speedup on GPUs

http://bigdata.com
http://mapgraph.io
GPU Hardware Trends
• K40 GPU (today)
• 12G RAM/GPU
• 288 GB/s bandwidth
• PCIe Gen 3
• Pascal GPU (Q1 2016)
• 24G RAM/GPU
• 1 TB/s bandwidth
• Unified memory
across CPU, GPUs
SYSTAP™, LLC
13
9/19/2014

Full Bandwidth Access to CPU RAM
http://bigdata.com
http://mapgraph.io
SYSTAP™, LLC
14
9/19/2014

Architecture shapes performance
• The data was a scale-free graph with 2.7M vertices and 5.6M
– MapGraph used a larger version of the graph (24M vertices, 25M edges)
• The query was a 5-degree subgraph (depth-limited BFS)
• Two main takeaways
– Horizontal scaling for titan is very expensive – wrong abstraction.
– GPUs are ridiculously fast.
platform load (s)
http://bigdata.com
http://mapgraph.io
query
(ms) comments
titan 497.00 935 4 node cluster using Cassandra
neo4j 608.00 668 single node community edition
bigdata 396.00 281 single node (open source)
MapGraph 0.08 27 NVIDIA K20 GPU
SYSTAP™, LLC
15
9/19/2014

http://bigdata.com
http://mapgraph.io
MapGraph
Graph Processing on GPUs
http://MapGraph.io
SYSTAP™, LLC
16
9/19/2014

http://bigdata.com
http://mapgraph.io
Think Like a Vertex
• Simple APIs
pageRank(Message m) {
total = m.value();
vertex.val = .15 * .85 + total;
for(nbr : out_neighbors) {
SendMsg(nbr, vertex.val/num_out_nbrs);
}
}
• Lots of algorithms
– BFS, SSSP, Page Rank, Connected Components, Louvain Modularity,
Jaccard Distance, k-means clustering, Betweenness-Centrality,
Personalized Page Rank, Loopy Belief Propagation, Graph search (crisp
and approximate), etc.
SYSTAP™, LLC
17
9/19/2014

GAS – a Graph-Parallel Abstraction
• Graph-Parallel Vertex-Centric API ala GraphLab
• “Think like a vertex”
• Gather: collect information
about my neighborhood
• Apply: update my value
• Scatter: signal adjacent vertices
• Can write all sorts of graph algorithms this way
– BFS, PageRank, Connected Component, Triangle Counting, Max Flow,
etc.
http://bigdata.com
http://mapgraph.io
SYSTAP™, LLC
18
9/19/2014

http://bigdata.com
http://mapgraph.io
MapGraph
• High-level graph processing framework
• High programmability
GPU architecture
Optimization techniques
CUDA
• High performance
Comparable to low-level approach
SYSTAP™, LLC
19
9/19/2014

http://bigdata.com
http://mapgraph.io
MapGraph
• High-level graph processing framework
• High programmability
GPU architecture
Optimization techniques
CUDA
• High performance
Comparable to low-level approach
SYSTAP™, LLC
20
9/19/2014

http://bigdata.com
http://mapgraph.io
Single GPU MapGraph (BFS)
Dataset #vertices #edges Max Degree Milliseconds
Webbase 1,000,005 3,105,536 23 1.2
Delaunay 2,097,152 6,291,408 4,700 24.5
Bitcoin 6,297,539 28,143,065 4,075,472 345.3
Wiki 3,566,907 45,030,389 7,061 51.0
Kron 1,048,576 89,239,674 131,505 47.7
154.0
513.6
74.8
821.3
1870.9
2,000
1,800
1,600
1,400
1,200
1,000
800
600
400
200
0
Webbase Delaunay Bitcoin Wiki Kron
MTEPS
SYSTAP™, LLC
21
9/19/2014

BFS Results : MapGraph vs GraphLab
1,000.00
100.00
10.00
http://bigdata.com
http://mapgraph.io
1.00
0.10
Speedup
MapGraph Speedup vs GraphLab (BFS)
GL-2
GL-4
GL-8
GL-12
MPG
SYSTAP™, LLC
22
9/19/2014

PageRank : MapGraph vs GraphLab
100.00
10.00
http://bigdata.com
http://mapgraph.io
1.00
0.10
Speedup
MapGraph Speedup vs GraphLab (Page Rank)
GL-2
GL-4
GL-8
GL-12
MPG
SYSTAP™, LLC
23
9/19/2014

Graph Mining on GPU Clusters
• 2D partitioning (aka vertex cuts)
• Minimizes the communication volume.
• Batch parallel Gather in row, Scatter in
column.
http://bigdata.com
http://mapgraph.io
SYSTAP™, LLC
24
9/19/2014

Accelerated Graph Analytics
http://bigdata.com
http://mapgraph.io
SYSTAP™, LLC
25
9/19/2014

• Work spans multiple orders of magnitude.
10,000,000
1,000,000
100,000
10,000
1,000
100
http://bigdata.com
http://mapgraph.io
10
1
Scale 25 Traversal
0 1 2 3 4 5 6
Frontier Size
Iteration
SYSTAP™, LLC
26
9/19/2014

http://bigdata.com
http://mapgraph.io
Strong Scaling
• Speedup on a constant problem size with more GPUs
• Problem scale 25
– 2^25 vertices (33,554,432)
– 2^26 directed edges (1,073,741,824)
23
22
21
20
19
18
17
16
15
14
13
10 20 30 40 50 60 70
GTEPS
#GPUs
Strong scaling
GPUs GTEPS Time (s)
16 14.3 0.075
25 16.4 0.066
36 18.1 0.059
64 22.7 0.047
SYSTAP™, LLC
27
9/19/2014

http://bigdata.com
http://mapgraph.io
Weak Scaling
• Scaling the problem size with more GPUs
35
30
25
20
15
10
5
0
1 4 16 64
GTEPS
#GPUs
Weak scaling
GPUs Scale Vertices Edges Time (s) GTEPS
1 21 2,097,152 67,108,864 0.0254 3
4 23 8,388,608 268,435,456 0.0429 6
16 25 33,554,432 1,073,741,824 0.0715 15
64 27 134,217,728 4,294,967,296 0.1478 29
SYSTAP™, LLC
28
9/19/2014

http://bigdata.com
http://mapgraph.io
Highlights
• For algorithms on large graphs
– Memory is the bottleneck
• CPUs quickly saturate the memory bus.
• CPU cache thrashing limits scaling for graph traversal.
• Continued performance gains for CPUs focus on reducing the #of visited edges to reduce
bandwidth.
• Hybrid CPU/GPU architectures offload either small degree vertices (reduce cache
thrashing) or high degree vertices (if the algorithm is FLOPS bound on the CPU, e.g., BC)
– Many core is the future.
– GPUs are primarily known for their FLOPS, but they have high memory bandwidth
and can deliver effective parallelism on parallel graph problems (with sophisticated
kernels).
• Scaling to very large graphs on large compute clusters
– Communications bound.
• Communication must be constant for perfect scaling
– Hybrid partitioning seeks to reduce #of messages, size of messages, and optimize
for asynchronous communications and degree-aware layouts for bottom-up search
to reduce memory bandwidth.
SYSTAP™, LLC
29
http://www.bigdata.com/blog

http://bigdata.com
http://mapgraph.io
Bryan Thompson
SYSTAP, LLC
bryan@systap.com
http://bigdata.com http://mapgraph.io
SYSTAP™, LLC
30
9/19/2014

Bryan Thompson, Chief Scientist and Founder, SYSTAP, LLC at MLconf ATL

Recommended

Recommended

More Related Content

What's hot

What's hot (20)

Similar to Bryan Thompson, Chief Scientist and Founder, SYSTAP, LLC at MLconf ATL

Similar to Bryan Thompson, Chief Scientist and Founder, SYSTAP, LLC at MLconf ATL (20)

More from MLconf

More from MLconf (20)

Recently uploaded

Recently uploaded (20)

Bryan Thompson, Chief Scientist and Founder, SYSTAP, LLC at MLconf ATL