Graphs in data structures are non-linear data structures made up of a finite number of nodes or vertices and the edges that connect them

GraphX:
Unifying Data-Parallel and
Graph-Parallel Analytics
*These slides are best viewed in PowerPoint with anima

Graphs are Central to Analytics
Raw
Wikipedia
< / >
< / >
< / >
XML
Hyperlinks PageRank Top 20 Page
Title PR
Text
Table
Title Body
Topic Model
(LDA) Word Topics
Word Topic
Editor Graph
Community
Detection
User
Community
User Com.
Term-Doc
Graph
Discussion
Table
User Disc.
Community
Topic
Topic Com.

Update ranks in parallel
Iterate until convergence
Rank of
user i Weighted sum of
neighbors’ ranks
3
PageRank: Identifying Leaders

The Graph-Parallel Pattern
4
Model / Alg.
State
Computation depends
only on the neighbors

Many Graph-Parallel Algorithms
• Collaborative Filtering
– Alternating Least Squares
– Stochastic Gradient
Descent
– Tensor Factorization
• Structured Prediction
– Loopy Belief Propagation
– Max-Product Linear
Programs
– Gibbs Sampling
• Semi-supervised ML
– Graph SSL
– CoEM
• Community Detection
– Triangle-Counting
– K-core Decomposition
– K-Truss
• Graph Analytics
– PageRank
– Personalized PageRank
– Shortest Path
– Graph Coloring
• Classification
– Neural Networks
5

Graph-Parallel Systems
6
oogle
Expose specialized APIs to simplify
graph programming.
Exploit graph structure to achieve
orders-of-magnitude performance gains
over more general

PageRank on the Live-Journal Graph
22
354
1340
0 200 400 600 800 1000 1200 1400 1600
GraphLab
Naïve Spark
Mahout/Hadoop
Runtime (in seconds, PageRank for 10 iterations)
GraphLab is 60x faster than Hadoop
GraphLab is 16x faster than Spark

Separate Systems to Support Each
View
Table View Graph View
Dependency
Graph
Table
Resul
t
Row
Row
Row
Row

Having separate systems
for each view is
difficult to use and inefficient
10

Difficult to Program and Use
Users must Learn, Deploy, and
Manage multiple systems
Leads to brittle and often
complex interfaces
11

Inefficient
12
Extensive data movement and duplication across
the network and file system
< / >
< / >
< / >
XML
HDFS HDFS HDFS HDFS
Limited reuse internal data-structures
across stages

Solution: The GraphX Unified Approach
Enabling users to easily and efficiently
express the entire graph analytics
pipeline
New API
Blurs the distinction
between Tables and
Graphs
New System
Combines Data-Parallel
Graph-Parallel Systems

Tables and Graphs are composable
views of the same physical data
GraphX Unified
Representation
Graph View
Table View
Each view has its own operators that
exploit the semantics of the view
to achieve efficient execution

View a Graph as a Table
Id
Rxin
Jegonzal
Franklin
Istoica
SrcId DstId
rxin jegonzal
franklin rxin
istoica franklin
franklin jegonzal
Property (E)
Friend
Advisor
Coworker
PI
Property (V)
(Stu., Berk.)
(PstDoc, Berk.)
(Prof., Berk)
(Prof., Berk)
R
J
F
I
Property Graph
Vertex Property Table
Edge Property Table

Table Operators
Table (RDD) operators are inherited from
Spark:
16
map
filter
groupBy
sort
union
join
leftOuterJoin
rightOuterJoin
reduce
count
fold
reduceByKey
groupByKey
cogroup
cross
zip
sample
take
first
partitionBy
mapWith
pipe
save
...

class Graph [ V, E ] {
def Graph(vertices: Table[ (Id, V) ],
edges: Table[ (Id, Id, E) ])
// Table Views -----------------
def vertices: Table[ (Id, V) ]
def edges: Table[ (Id, Id, E) ]
def triplets: Table [ ((Id, V), (Id, V), E) ]
// Transformations ------------------------------
def reverse: Graph[V, E]
def subgraph(pV: (Id, V) => Boolean,
pE: Edge[V,E] => Boolean): Graph[V,E]
def mapV(m: (Id, V) => T ): Graph[T,E]
def mapE(m: Edge[V,E] => T ): Graph[V,T]
// Joins ----------------------------------------
def joinV(tbl: Table [(Id, T)]): Graph[(V, T), E ]
def joinE(tbl: Table [(Id, Id, T)]): Graph[V, (E, T)]
// Computation ----------------------------------
def mrTriplets(mapF: (Edge[V,E]) => List[(Id, T)],
reduceF: (T, T) => T): Graph[T, E]
}
Graph Operators
17

Triplets Join Vertices and
Edges
The triplets operator joins vertices and
edges:
The mrTriplets operator sums adjacent
triplets.
SELECT t.dstId, reduceUDF( mapUDF(t) ) AS sum
FROM triplets AS t GROUPBY t.dstId
Triplets
Vertices Edges
B
A
C
D
A B
A C
B C
C D
A B
A
B A C
B C
C D

F
E
Map Reduce Triplets
Map-Reduce for each vertex
D
B
A
C
mapF( )
A B
mapF( )
A C
A1
A2
reduceF( , )
A1 A2 A
19

F
E
Example: Oldest Follower
D
B
A
C
What is the age of the oldest
follower for each user?
val oldestFollowerAge = graph
.mrTriplets(
e=> (e.dst.id, e.src.age),//Map
(a,b)=> max(a, b) //Reduce
)
.vertices
23 42
30
19 75
16
20

We express the Pregel and GraphLab
abstractions using the GraphX operators
in less than 50 lines of code!
21
By composing these operators we can
construct entire graph-analytics pipelines.

Part. 2
Part. 1
Vertex
Table
(RDD)
B C
A D
F E
A D
Distributed Graphs as Tables
(RDDs)
D
Property Graph
B C
D
E
A
A
F
Edge
Table
(RDD)
A B
A C
C D
B C
A E
A F
E F
E D
B
C
D
E
A
F
Routing
Table
(RDD)
B
C
D
E
A
F
1
2
1 2
1 2
1
2
2D Vertex Cut Heuristic

Vertex
Table
(RDD)
Caching for Iterative mrTriplets
Edge Table
(RDD)
A B
A C
C D
B C
A E
A F
E F
E D
Mirror
Cache
B
C
D
A
Mirror
Cache
D
E
F
A
B
C
D
E
A
F
B
C
D
E
A
F
A
D

Vertex
Table
(RDD)
Edge Table
(RDD)
A B
A C
C D
B C
A E
A F
E F
E D
Mirror
Cache
B
C
D
A
Mirror
Cache
D
E
F
A
Incremental Updates for Iterative
mrTriplets
B
C
D
E
A
F
Change A
A
Change E
Scan

Vertex
Table
(RDD)
Edge Table
(RDD)
A B
A C
C D
B C
A E
A F
E F
E D
Mirror
Cache
B
C
D
A
Mirror
Cache
D
E
F
A
Aggregation for Iterative mrTriplets
B
C
D
E
A
F
Change
Change
Scan
Change
Change
Change
Change
Local
Aggregate
Local
Aggregate
B
C
D
F

Reduction in
Communication Due to
Cached Updates
0.1
1
10
100
1000
10000
0 2 4 6 8 10 12 14 16
Network
Comm.
(MB)
Iteration
Connected Components on Twitter Graph
Most vertices are within 8 hops
of all vertices in their comp.

Benefit of Indexing Active
Edges
0
5
10
15
20
25
30
0 2 4 6 8 10 12 14 16
Runtime
(Seconds)
Iteration
Connected Components on Twitter Graph
Scan
Indexed
Scan All Edges
Index of “Active” Edges

Join Elimination
Identify and bypass joins for unused triplets
fields
»Example: PageRank only accesses source
attribute
30
0
2000
4000
6000
8000
10000
12000
14000
0 5 10 15 20
Communication
(MB)
Iteration
PageRank on Twitter Three Way Join
Join Elimination
Factor of 2 reduction in communication

Additional Query
Optimizations
Indexing and Bitmaps:
»To accelerate joins across graphs
»To efficiently construct sub-graphs
Substantial Index and Data Reuse:
»Reuse routing tables across graphs and sub-
graphs
»Reuse edge adjacency information and indices
31

Performance Comparisons
22
68
207
354
1340
0 200 400 600 800 1000 1200 1400 1600
GraphLab
GraphX
Giraph
Naïve Spark
Mahout/Hadoop
GraphX is roughly 3x slower than GraphLab
Live-Journal: 69 Million Edges

GraphX scales to larger
graphs
203
451
749
0 200 400 600 800
GraphLab
GraphX
Giraph
GraphX is roughly 2x slower than GraphLab
»Scala + Java overhead: Lambdas, GC time, …
»No shared memory parallelism: 2x increase in comm.
Twitter Graph: 1.5 Billion Edges

PageRank is just one
stage….
What about a pipeline?

HDFS
HDFS
Compute
Spark Preprocess Spark Post.
A Small Pipeline in GraphX
Timed end-to-end GraphX is faster than
Raw Wikipedia
< / >
< / >
< / >
XML
Hyperlinks PageRank Top 20 Pages
342
1492
0 200 400 600 800 1000 1200 1400 1600
GraphLab + Spark
GraphX
Giraph + Spark
Spark
Total Runtime (in Seconds)
605
375

The GraphX Stack
(Lines of Code)
GraphX (3575)
Spark
Pregel (28) + GraphLab (50)
PageRan
k (5)
Connected
Comp. (10)
Shortest
Path
(10)
ALS
(40) LDA
(120)
K-core
(51)
Triangl
e
Count
(45)
SVD
(40)

Status
Alpha release as part of Spark 0.9
Seeking collaborators and feedback

Conclusion and
Observations
Domain specific views: Tables and Graphs
»tables and graphs are first-class composable
objects
»specialized operators which exploit view
semantics
Single system that efficiently spans the
pipeline
»minimize data movement and duplication
»eliminates need to learn and manage multiple
systems
Graphs through the lens of database 38

Active Research
Static Data  Dynamic Data
»Apply GraphX unified approach to time evolving
data
»Model and analyze relationships over time
Serving Graph Structured Data
»Allow external systems to interact with GraphX
»Unify distributed graph databases with relational
database technology
39

Thanks!
ankurd@eecs.berkeley.edu
crankshaw@eecs.berkeley.edu
rxin@eecs.berkeley.edu
jegonzal@eecs.berkeley.edu
http://amplab.github.io/graphx/

Graph Property 1
Real-World Graphs
41
10
0
10
2
10
4
10
6
10
8
10
0
10
2
10
4
10
6
10
8
10
10
degree
count
Top 1% of vertices
are adjacent to
50% of the edges!
Number
of
Vertices
AltaVista WebGraph1.4B Vertices, 6.6B
Edges
Degree
More than 108
vertices
have one neighbor.
0
20
40
60
80
100
120
140
160
180
200
2008 2010 2012
Ratio
of
Edges
to
Vertices
Year
Facebook
Power-Law Degree Distribution Edges >> Vertices

Graph Property 2
Active Vertices
1
10
100
1000
10000
100000
1000000
10000000
100000000
0 10 20 30 40 50 60 70
Num-Vertices
Number of Updates
51% updated only once!
PageRank on Web Graph

Graphs are Essential to
Data Mining and Machine
Learning
Identify influential people and information
Find communities
Understand people’s shared interests
Model complex data dependencies

Ratings Item
s
Recommending Products
Users

Low-Rank Matrix Factorization:
45
r13
r14
r24
r25
f(1)
f(2)
f(3)
f(4)
f(5)
User
Factors
(U)
Movie
Factors
(M)
User
s
Movie
s
Netflix
User
s
≈
x
Movie
s
f(i)
f(j)
Iterate:
Recommending Products

Liberal Conservative
Post
Post
Post
Post
Post
Post
Post
Post
Predicting User Behavior
Post
Post
Post
Post
Post
Post
Post
Post
Post
Post
Post
Post
Post
Post
?
?
?
?
?
?
?
? ?
?
?
?
?
?
?
?
?
?
?
?
?
?
?
?
?
?
?
? ?
?
46
Conditional Random Field
Belief Propagation

Count triangles passing through each
vertex:
Measures “cohesiveness” of local
community
More Triangles
Stronger Community
Fewer Triangles
Weaker Community
1
2 3
4
Finding Communities

Preprocessing Compute Post Proc.
Example Graph Analytics
Pipeline
48
< / >
< / >
< / >
XML
Raw
Data
ETL Slice Compute
Repeat
Subgraph PageRank
Initial
Graph
Analyz
e
Top
Users

Graphs in data structures are non-linear data structures made up of a finite number of nodes or vertices and the edges that connect them

Recommended

Recommended

More Related Content

Similar to Graphs in data structures are non-linear data structures made up of a finite number of nodes or vertices and the edges that connect them

Similar to Graphs in data structures are non-linear data structures made up of a finite number of nodes or vertices and the edges that connect them (20)

More from bhargavi804095

More from bhargavi804095 (16)

Recently uploaded

Recently uploaded (20)

Graphs in data structures are non-linear data structures made up of a finite number of nodes or vertices and the edges that connect them

Editor's Notes