UC 
BERKELEY 
GraphX 
Graph Analytics in Spark 
Ankur Dave 
Graduate Student, UC Berkeley AMPLab 
Joint work with Joseph Gonzalez, Reynold Xin, Daniel 
Crankshaw, Michael Franklin, and Ion Stoica
Graphs are Everywhere
Social Networks
Web Graphs
Web Graphs 
“The political blogosphere and the 2004 U.S. election: divided they blog.” 
Adamic and Glance, 2005.
User-Item Graphs 
⋆⋆⋆⋆⋆ 
⋆ 
⋆⋆⋆⋆ 
⋆⋆⋆⋆ 
⋆⋆⋆⋆⋆
Graph Algorithms
PageRank
Triangle Counting
Collaborative Filtering 
⋆⋆⋆⋆⋆ 
⋆ 
⋆⋆⋆⋆ 
⋆⋆⋆⋆ 
⋆⋆⋆⋆⋆ 
⚙ 
♫ 
⚙ 
♫
The Graph-Parallel Pattern
The Graph-Parallel Pattern
The Graph-Parallel Pattern
Many Graph-Parallel Algorithms 
Collaborative Filtering 
» Alternating Least Squares 
» Stochastic Gradient Descent 
» Tensor Factorization 
Structured Prediction 
» Loopy Belief Propagation 
» Max-Product Linear 
Programs 
» Gibbs Sampling 
Semi-supervised ML 
» Graph SSL 
» CoEM 
Community Detection 
» Triangle-Counting 
» K-core Decomposition 
» K-Truss 
Graph Analytics 
» PageRank 
» Personalized PageRank 
» Shortest Path 
» Graph Coloring 
Classification 
» Neural Networks
Raw 
Wikipedia 
 / /   /  
XML 
Hyperlinks PageRank Top 20 Pages 
Title PR 
Link Table 
Title Link 
Editor Graph 
Community 
Detection 
User 
Community 
User Com. 
Editor 
Table 
Editor Title 
Top Communities 
Com. PR.. 
Modern Analytics
Tables 
Raw 
Wikipedia 
 / /   /  
XML 
Hyperlinks PageRank Top 20 Pages 
Title PR 
Link Table 
Title Link 
Editor Graph 
Community 
Detection 
User 
Community 
User Com. 
Top Communities 
Com. PR.. 
Editor 
Table 
Editor Title
Editor 
Table 
Editor Title 
Raw 
Wikipedia 
 / /   /  
XML 
Hyperlinks PageRank Top 20 Pages 
Title PR 
Link Table 
Title Link 
Editor Graph 
Community 
Detection 
User 
Community 
User Com. 
Top Communities 
Com. PR.. 
Graphs
The GraphX API
Property Graphs 
Vertex Property: 
• User Profile 
• Current PageRank Value 
Edge Property: 
• Weights 
• Relationships 
• Timestamps
type 
Creating a Graph (Scala) 
VertexId 
= 
Long 
Graph 
val 
vertices: 
RDD[(VertexId, 
String)] 
= 
sc.parallelize(List( 
(1L, 
“Alice”), 
(2L, 
“Bob”), 
(3L, 
“Charlie”))) 
class 
Edge[ED]( 
val 
srcId: 
VertexId, 
val 
dstId: 
VertexId, 
val 
attr: 
ED) 
val 
edges: 
RDD[Edge[String]] 
= 
sc.parallelize(List( 
Edge(1L, 
2L, 
“coworker”), 
Edge(2L, 
3L, 
“friend”))) 
val 
graph 
= 
Graph(vertices, 
edges) 
1 
2 
3 
Alice 
Bob 
Charlie 
coworker 
friend
Graph Operations (Scala) 
class 
Graph[VD, 
ED] 
{ 
// 
Table 
Views 
-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐ 
def 
vertices: 
RDD[(VertexId, 
VD)] 
def 
edges: 
RDD[Edge[ED]] 
def 
triplets: 
RDD[EdgeTriplet[VD, 
ED]] 
// 
Transformations 
-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐ 
def 
mapVertices[VD2](f: 
(VertexId, 
VD) 
= 
VD2): 
Graph[VD2, 
ED] 
def 
mapEdges[ED2](f: 
Edge[ED] 
= 
ED2): 
Graph[VD2, 
ED] 
def 
reverse: 
Graph[VD, 
ED] 
def 
subgraph(epred: 
EdgeTriplet[VD, 
ED] 
= 
Boolean, 
vpred: 
(VertexId, 
VD) 
= 
Boolean): 
Graph[VD, 
ED] 
// 
Joins 
-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐ 
def 
outerJoinVertices[U, 
VD2] 
(tbl: 
RDD[(VertexId, 
U)]) 
(f: 
(VertexId, 
VD, 
Option[U]) 
= 
VD2): 
Graph[VD2, 
ED] 
// 
Computation 
-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐ 
def 
mapReduceTriplets[A]( 
sendMsg: 
EdgeTriplet[VD, 
ED] 
= 
Iterator[(VertexId, 
A)], 
mergeMsg: 
(A, 
A) 
= 
A): 
RDD[(VertexId, 
A)]
// 
Continued 
from 
previous 
slide 
def 
pageRank(tol: 
Double): 
Graph[Double, 
Double] 
def 
triangleCount(): 
Graph[Int, 
ED] 
def 
connectedComponents(): 
Graph[VertexId, 
ED] 
// 
...and 
more: 
org.apache.spark.graphx.lib 
} 
Built-in Algorithms (Scala) 
PageRank Triangle Count Connected 
Components
The triplets view 
} 
class 
EdgeTriplet[VD, 
ED]( 
val 
srcId: 
VertexId, 
val 
dstId: 
VertexId, 
val 
attr: 
ED, 
val 
srcAttr: 
VD, 
val 
dstAttr: 
VD) 
RDD 
class 
Graph[VD, 
ED] 
{ 
def 
triplets: 
RDD[EdgeTriplet[VD, 
ED]] 
Graph 
1 
2 
3 
Alice 
Bob 
Charlie 
coworker 
friend 
srcAttr dstAttr attr 
Alice coworker Bob 
Bob friend Charlie 
triplets
The subgraph transformation 
class 
Graph[VD, 
ED] 
{ 
def 
subgraph(epred: 
EdgeTriplet[VD, 
ED] 
= 
Boolean, 
vpred: 
(VertexId, 
VD) 
= 
Boolean): 
Graph[VD, 
ED] 
} 
graph.subgraph(epred 
= 
(edge) 
= 
edge.attr 
!= 
“relative”) 
subgraph 
Graph 
Alice Bob 
relative 
Charlie 
coworker 
friend 
relative David 
Graph 
Alice Bob 
Charlie 
coworker 
friend 
David
The subgraph transformation 
class 
Graph[VD, 
ED] 
{ 
def 
subgraph(epred: 
EdgeTriplet[VD, 
ED] 
= 
Boolean, 
vpred: 
(VertexId, 
VD) 
= 
Boolean): 
Graph[VD, 
ED] 
} 
graph.subgraph(vpred 
= 
(id, 
name) 
= 
name 
!= 
“Bob”) 
subgraph 
Graph 
Alice Bob 
relative 
Charlie 
coworker 
friend 
relative David 
Graph 
Alice 
relative 
Charlie 
relative David
Computation with mapReduceTriplets 
class 
Graph[VD, 
ED] 
{ 
def 
mapReduceTriplets[A]( 
upgrade to aggregateMessages 
sendMsg: 
EdgeTriplet[VD, 
ED] 
= 
Iterator[(VertexId, 
A)], 
mergeMsg: 
(A, 
A) 
= 
A): 
RDD[(VertexId, 
A)] 
} 
graph.mapReduceTriplets( 
edge 
= 
Iterator( 
(edge.srcId, 
1), 
(edge.dstId, 
1)), 
_ 
+ 
_) 
Graph 
Alice Bob 
relative 
Charlie 
friend 
coworker 
relative David 
mapReduceTriplets 
RDD 
vertex id degree 
Alice 2 
Bob 2 
Charlie 3 
David 1 
in Spark 1.2.0
How GraphX Works
Encoding Property Graphs as RDDs 
Part. 1 
A D 
Part. 2 
Vertex 
Table 
(RDD) 
C 
A D 
F 
E 
D 
Property Graph 
B A 
Machine 1 Machine 2 
Edge Table 
(RDD) 
A B 
A C 
B C 
C D 
A E 
A F 
E D 
E F 
A 
B 
C 
D 
E 
F 
Routing 
Table 
(RDD) 
A 
B 
C 
D 
E 
F 
1 
2 
1 
1 
1 
2 
2 
2 
Vertex Cut
Graph System Optimizations 
Specialized 
Vertex-Cuts 
Remote 
Data-Structures 
Partitioning 
Caching / Mirroring 
Message Combiners Active Set Tracking
3500 
3000 
EC2 Cluster of 16 x m2.4xLarge (8 cores) + 1GigE 
Seconds) 
2500 
2000 
(Runtime 1500 
1000 
500 
0 
state-of-the-art graph processing systems. PageRank Benchmark 
Twitter Graph (42M Vertices,1.5B Edges) UK-Graph (106M Vertices, 3.7B Edges) 
9000 
8000 
7000 
6000 
5000 
7x 18x 
4000 
3000 
2000 
1000 
0 
GraphX performs comparably to
Future of GraphX 
1. Language support 
a) Java API: PR #3234 
b) Python API: collaborating with Intel, SPARK-3789 
2. More algorithms 
a) LDA (topic modeling): PR #2388 
b) Correlation clustering 
c) Your algorithm here? 
3. Speculative 
a) Streaming/time-varying graphs 
b) Graph database–like queries
Raw 
Wikipedia 
 / /   /  
Try GraphX Today 
XML Hyperlinks 
PageRank 
Top 20 Pages 
Title PR 
Text 
Table 
Title Body 
Berkeley 
subgraph
Thanks! 
http://spark.apache.org/graphx 
ankurd@eecs.berkeley.edu 
jegonzal@eecs.berkeley.edu 
rxin@eecs.berkeley.edu 
crankshaw@eecs.berkeley.edu

GraphX: Graph Analytics in Apache Spark (AMPCamp 5, 2014-11-20)

  • 1.
    UC BERKELEY GraphX Graph Analytics in Spark Ankur Dave Graduate Student, UC Berkeley AMPLab Joint work with Joseph Gonzalez, Reynold Xin, Daniel Crankshaw, Michael Franklin, and Ion Stoica
  • 2.
  • 3.
  • 4.
  • 5.
    Web Graphs “Thepolitical blogosphere and the 2004 U.S. election: divided they blog.” Adamic and Glance, 2005.
  • 6.
    User-Item Graphs ⋆⋆⋆⋆⋆ ⋆ ⋆⋆⋆⋆ ⋆⋆⋆⋆ ⋆⋆⋆⋆⋆
  • 7.
  • 8.
  • 9.
  • 10.
    Collaborative Filtering ⋆⋆⋆⋆⋆ ⋆ ⋆⋆⋆⋆ ⋆⋆⋆⋆ ⋆⋆⋆⋆⋆ ⚙ ♫ ⚙ ♫
  • 11.
  • 12.
  • 13.
  • 14.
    Many Graph-Parallel Algorithms Collaborative Filtering » Alternating Least Squares » Stochastic Gradient Descent » Tensor Factorization Structured Prediction » Loopy Belief Propagation » Max-Product Linear Programs » Gibbs Sampling Semi-supervised ML » Graph SSL » CoEM Community Detection » Triangle-Counting » K-core Decomposition » K-Truss Graph Analytics » PageRank » Personalized PageRank » Shortest Path » Graph Coloring Classification » Neural Networks
  • 15.
    Raw Wikipedia / / / XML Hyperlinks PageRank Top 20 Pages Title PR Link Table Title Link Editor Graph Community Detection User Community User Com. Editor Table Editor Title Top Communities Com. PR.. Modern Analytics
  • 16.
    Tables Raw Wikipedia / / / XML Hyperlinks PageRank Top 20 Pages Title PR Link Table Title Link Editor Graph Community Detection User Community User Com. Top Communities Com. PR.. Editor Table Editor Title
  • 17.
    Editor Table EditorTitle Raw Wikipedia / / / XML Hyperlinks PageRank Top 20 Pages Title PR Link Table Title Link Editor Graph Community Detection User Community User Com. Top Communities Com. PR.. Graphs
  • 18.
  • 19.
    Property Graphs VertexProperty: • User Profile • Current PageRank Value Edge Property: • Weights • Relationships • Timestamps
  • 20.
    type Creating aGraph (Scala) VertexId = Long Graph val vertices: RDD[(VertexId, String)] = sc.parallelize(List( (1L, “Alice”), (2L, “Bob”), (3L, “Charlie”))) class Edge[ED]( val srcId: VertexId, val dstId: VertexId, val attr: ED) val edges: RDD[Edge[String]] = sc.parallelize(List( Edge(1L, 2L, “coworker”), Edge(2L, 3L, “friend”))) val graph = Graph(vertices, edges) 1 2 3 Alice Bob Charlie coworker friend
  • 21.
    Graph Operations (Scala) class Graph[VD, ED] { // Table Views -­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐ def vertices: RDD[(VertexId, VD)] def edges: RDD[Edge[ED]] def triplets: RDD[EdgeTriplet[VD, ED]] // Transformations -­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐ def mapVertices[VD2](f: (VertexId, VD) = VD2): Graph[VD2, ED] def mapEdges[ED2](f: Edge[ED] = ED2): Graph[VD2, ED] def reverse: Graph[VD, ED] def subgraph(epred: EdgeTriplet[VD, ED] = Boolean, vpred: (VertexId, VD) = Boolean): Graph[VD, ED] // Joins -­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐ def outerJoinVertices[U, VD2] (tbl: RDD[(VertexId, U)]) (f: (VertexId, VD, Option[U]) = VD2): Graph[VD2, ED] // Computation -­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐ def mapReduceTriplets[A]( sendMsg: EdgeTriplet[VD, ED] = Iterator[(VertexId, A)], mergeMsg: (A, A) = A): RDD[(VertexId, A)]
  • 22.
    // Continued from previous slide def pageRank(tol: Double): Graph[Double, Double] def triangleCount(): Graph[Int, ED] def connectedComponents(): Graph[VertexId, ED] // ...and more: org.apache.spark.graphx.lib } Built-in Algorithms (Scala) PageRank Triangle Count Connected Components
  • 23.
    The triplets view } class EdgeTriplet[VD, ED]( val srcId: VertexId, val dstId: VertexId, val attr: ED, val srcAttr: VD, val dstAttr: VD) RDD class Graph[VD, ED] { def triplets: RDD[EdgeTriplet[VD, ED]] Graph 1 2 3 Alice Bob Charlie coworker friend srcAttr dstAttr attr Alice coworker Bob Bob friend Charlie triplets
  • 24.
    The subgraph transformation class Graph[VD, ED] { def subgraph(epred: EdgeTriplet[VD, ED] = Boolean, vpred: (VertexId, VD) = Boolean): Graph[VD, ED] } graph.subgraph(epred = (edge) = edge.attr != “relative”) subgraph Graph Alice Bob relative Charlie coworker friend relative David Graph Alice Bob Charlie coworker friend David
  • 25.
    The subgraph transformation class Graph[VD, ED] { def subgraph(epred: EdgeTriplet[VD, ED] = Boolean, vpred: (VertexId, VD) = Boolean): Graph[VD, ED] } graph.subgraph(vpred = (id, name) = name != “Bob”) subgraph Graph Alice Bob relative Charlie coworker friend relative David Graph Alice relative Charlie relative David
  • 26.
    Computation with mapReduceTriplets class Graph[VD, ED] { def mapReduceTriplets[A]( upgrade to aggregateMessages sendMsg: EdgeTriplet[VD, ED] = Iterator[(VertexId, A)], mergeMsg: (A, A) = A): RDD[(VertexId, A)] } graph.mapReduceTriplets( edge = Iterator( (edge.srcId, 1), (edge.dstId, 1)), _ + _) Graph Alice Bob relative Charlie friend coworker relative David mapReduceTriplets RDD vertex id degree Alice 2 Bob 2 Charlie 3 David 1 in Spark 1.2.0
  • 27.
  • 28.
    Encoding Property Graphsas RDDs Part. 1 A D Part. 2 Vertex Table (RDD) C A D F E D Property Graph B A Machine 1 Machine 2 Edge Table (RDD) A B A C B C C D A E A F E D E F A B C D E F Routing Table (RDD) A B C D E F 1 2 1 1 1 2 2 2 Vertex Cut
  • 29.
    Graph System Optimizations Specialized Vertex-Cuts Remote Data-Structures Partitioning Caching / Mirroring Message Combiners Active Set Tracking
  • 30.
    3500 3000 EC2Cluster of 16 x m2.4xLarge (8 cores) + 1GigE Seconds) 2500 2000 (Runtime 1500 1000 500 0 state-of-the-art graph processing systems. PageRank Benchmark Twitter Graph (42M Vertices,1.5B Edges) UK-Graph (106M Vertices, 3.7B Edges) 9000 8000 7000 6000 5000 7x 18x 4000 3000 2000 1000 0 GraphX performs comparably to
  • 31.
    Future of GraphX 1. Language support a) Java API: PR #3234 b) Python API: collaborating with Intel, SPARK-3789 2. More algorithms a) LDA (topic modeling): PR #2388 b) Correlation clustering c) Your algorithm here? 3. Speculative a) Streaming/time-varying graphs b) Graph database–like queries
  • 32.
    Raw Wikipedia / / / Try GraphX Today XML Hyperlinks PageRank Top 20 Pages Title PR Text Table Title Body Berkeley subgraph
  • 33.
    Thanks! http://spark.apache.org/graphx ankurd@eecs.berkeley.edu jegonzal@eecs.berkeley.edu rxin@eecs.berkeley.edu crankshaw@eecs.berkeley.edu