UC Berkeley
GraphX: Graph Analytics in Spark
Ankur Dave 
Graduate Student, UC Berkeley AMPLab 
Joint work with Joseph Gonzalez, Reynold Xin, Daniel Crankshaw, Michael Franklin, and Ion Stoica
Graphs are Everywhere
Social Networks
Web Graphs
“The political blogosphere and the 2004 U.S. election: divided they blog.” Adamic and Glance, 2005.
User-Item Graphs 
[Figure: user-item graph with star ratings from 1 to 5]
Graph Algorithms
PageRank
Triangle Counting
Collaborative Filtering 
The Graph-Parallel Pattern
Many Graph-Parallel Algorithms 
Collaborative Filtering 
» Alternating Least Squares 
» Stochastic Gradient Descent 
» Tensor Factorization 
Structured Prediction 
» Loopy Belief Propagation 
» Max-Product Linear Programs
» Gibbs Sampling 
Semi-supervised ML 
» Graph SSL 
» CoEM 
Community Detection 
» Triangle-Counting 
» K-core Decomposition 
» K-Truss 
Graph Analytics 
» PageRank 
» Personalized PageRank 
» Shortest Path 
» Graph Coloring 
Classification 
» Neural Networks
Modern Analytics
[Figure, shown on three slides captioned "Modern Analytics", "Tables", and "Graphs": a Wikipedia analytics pipeline with the stages Raw Wikipedia (XML), Link Table (Title, Link), Editor Table (Editor, Title), Hyperlinks, PageRank, Top 20 Pages (Title, PR), Editor Graph, Community Detection, User Community (User, Com.), and Top Communities (Com., PR).]
The GraphX API
Property Graphs 
Vertex Property: 
• User Profile 
• Current PageRank Value 
Edge Property: 
• Weights 
• Relationships 
• Timestamps
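Creating a Graph (Scala)

  // Vertex identifiers are 64-bit longs (type provided by GraphX).
  type VertexId = Long

  val vertices: RDD[(VertexId, String)] =
    sc.parallelize(List(
      (1L, "Alice"),
      (2L, "Bob"),
      (3L, "Charlie")))

  // Edge type provided by GraphX (simplified).
  class Edge[ED](
    val srcId: VertexId,
    val dstId: VertexId,
    val attr: ED)

  val edges: RDD[Edge[String]] =
    sc.parallelize(List(
      Edge(1L, 2L, "coworker"),
      Edge(2L, 3L, "friend")))

  val graph = Graph(vertices, edges)

[Figure: the resulting graph has vertices 1 (Alice), 2 (Bob), and 3 (Charlie), with edge 1 -> 2 labeled "coworker" and edge 2 -> 3 labeled "friend".]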
Graph Operations (Scala)

  class Graph[VD, ED] {
    // Table Views -----------------------
    def vertices: RDD[(VertexId, VD)]
    def edges: RDD[Edge[ED]]
    def triplets: RDD[EdgeTriplet[VD, ED]]
    // Transformations -------------------
    def mapVertices[VD2](f: (VertexId, VD) => VD2): Graph[VD2, ED]
    def mapEdges[ED2](f: Edge[ED] => ED2): Graph[VD, ED2]
    def reverse: Graph[VD, ED]
    def subgraph(epred: EdgeTriplet[VD, ED] => Boolean,
                 vpred: (VertexId, VD) => Boolean): Graph[VD, ED]
    // Joins -----------------------------
    def outerJoinVertices[U, VD2]
        (tbl: RDD[(VertexId, U)])
        (f: (VertexId, VD, Option[U]) => VD2): Graph[VD2, ED]
    // Computation -----------------------
    def mapReduceTriplets[A](
        sendMsg: EdgeTriplet[VD, ED] => Iterator[(VertexId, A)],
        mergeMsg: (A, A) => A): RDD[(VertexId, A)]
Built-in Algorithms (Scala)

    // Continued from previous slide
    def pageRank(tol: Double): Graph[Double, Double]
    def triangleCount(): Graph[Int, ED]
    def connectedComponents(): Graph[VertexId, ED]
    // ...and more: org.apache.spark.graphx.lib
  }

[Figures: PageRank, Triangle Count, Connected Components]
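As a rough illustration of how these calls fit together, the sketch below runs the three built-in algorithms on a small graph and uses outerJoinVertices to attach PageRank scores to vertex names. The object name, the helper names, the four-person graph, and the edge directions are all assumptions made for the example; it assumes Spark with GraphX on the classpath and a local master.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.graphx.{Edge, Graph, PartitionStrategy, VertexId}

object BuiltInAlgorithmsSketch {
  // Hypothetical four-person graph; the edge directions are an assumption.
  def exampleGraph(sc: SparkContext): Graph[String, String] = {
    val vertices = sc.parallelize(Seq[(VertexId, String)](
      (1L, "Alice"), (2L, "Bob"), (3L, "Charlie"), (4L, "David")))
    val edges = sc.parallelize(Seq(
      Edge(1L, 2L, "coworker"), Edge(2L, 3L, "friend"),
      Edge(1L, 3L, "relative"), Edge(3L, 4L, "relative")))
    Graph(vertices, edges)
  }

  // Run PageRank to tolerance `tol`, then join the ranks back onto the names.
  def ranksByName(graph: Graph[String, String], tol: Double): Map[String, Double] = {
    val ranks = graph.pageRank(tol).vertices
    graph
      .outerJoinVertices(ranks) { (_, name, rank) => (name, rank.getOrElse(0.0)) }
      .vertices.values.collect().toMap
  }

  // Each vertex is labeled with the smallest vertex id in its component.
  def components(graph: Graph[String, String]): Map[VertexId, VertexId] =
    graph.connectedComponents().vertices.collect().toMap

  // Number of triangles through each vertex. Older Spark releases required
  // canonical edges (srcId < dstId) and a partitioned graph, so the example
  // graph satisfies the first and partitionBy covers the second.
  def triangles(graph: Graph[String, String]): Map[VertexId, Int] =
    graph.partitionBy(PartitionStrategy.RandomVertexCut)
      .triangleCount().vertices.collect().toMap

  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(
      new SparkConf().setAppName("BuiltInAlgorithmsSketch").setMaster("local[*]"))
    try {
      val graph = exampleGraph(sc)
      ranksByName(graph, 0.001).toSeq.sortBy(-_._2).foreach(println)
      println(components(graph))
      println(triangles(graph))
    } finally sc.stop()
  }
}
```

The join step is the part worth noting: the algorithms return graphs keyed by vertex id, so joining back to the original vertex table is how results become readable.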
The triplets view

  class Graph[VD, ED] {
    def triplets: RDD[EdgeTriplet[VD, ED]]
  }

  class EdgeTriplet[VD, ED](
    val srcId: VertexId,
    val dstId: VertexId,
    val attr: ED,
    val srcAttr: VD,
    val dstAttr: VD)

Graph: 1 (Alice) --coworker--> 2 (Bob) --friend--> 3 (Charlie)

triplets (RDD):
  srcAttr   attr       dstAttr
  Alice     coworker   Bob
  Bob       friend     Charlie
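A minimal sketch of reading the triplets view, assuming the three-person graph from the earlier slide and a local Spark master. The object and method names are hypothetical. Each triplet is mapped to a string before collecting, so only plain values leave the RDD.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.graphx.{Edge, Graph, VertexId}

object TripletsSketch {
  // Returns one sentence per edge, built from the edge attribute and the
  // attributes of both endpoints.
  def describe(sc: SparkContext): Seq[String] = {
    val vertices = sc.parallelize(Seq[(VertexId, String)](
      (1L, "Alice"), (2L, "Bob"), (3L, "Charlie")))
    val edges = sc.parallelize(Seq(
      Edge(1L, 2L, "coworker"),
      Edge(2L, 3L, "friend")))
    val graph = Graph(vertices, edges)

    graph.triplets
      .map(t => s"${t.srcAttr} is the ${t.attr} of ${t.dstAttr}")
      .collect().toSeq.sorted
  }

  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(
      new SparkConf().setAppName("TripletsSketch").setMaster("local[*]"))
    try describe(sc).foreach(println)
    finally sc.stop()
  }
}
```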
The subgraph transformation

  class Graph[VD, ED] {
    def subgraph(epred: EdgeTriplet[VD, ED] => Boolean,
                 vpred: (VertexId, VD) => Boolean): Graph[VD, ED]
  }

  graph.subgraph(epred = (edge) => edge.attr != "relative")

[Figure: the input graph has vertices Alice, Bob, Charlie, and David, with edges labeled coworker, friend, and relative (twice). The result keeps all four vertices and only the coworker and friend edges.]
The subgraph transformation

  class Graph[VD, ED] {
    def subgraph(epred: EdgeTriplet[VD, ED] => Boolean,
                 vpred: (VertexId, VD) => Boolean): Graph[VD, ED]
  }

  graph.subgraph(vpred = (id, name) => name != "Bob")

[Figure: the same input graph. The result keeps Alice, Charlie, and David and the two relative edges; Bob, the coworker edge, and the friend edge are gone.]
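Both subgraph calls can be tried end to end with a sketch like the following. The edge endpoints and directions are assumptions chosen to be consistent with the figures (the slides show only the labels), and the object and helper names are hypothetical.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.graphx.{Edge, Graph, VertexId}

object SubgraphSketch {
  // Four-person graph; endpoints and directions are assumed, not from the slides.
  def exampleGraph(sc: SparkContext): Graph[String, String] = {
    val vertices = sc.parallelize(Seq[(VertexId, String)](
      (1L, "Alice"), (2L, "Bob"), (3L, "Charlie"), (4L, "David")))
    val edges = sc.parallelize(Seq(
      Edge(1L, 2L, "coworker"), Edge(2L, 3L, "friend"),
      Edge(1L, 3L, "relative"), Edge(3L, 4L, "relative")))
    Graph(vertices, edges)
  }

  // Edge predicate: drop "relative" edges, keep every vertex.
  def withoutRelatives(g: Graph[String, String]): Graph[String, String] =
    g.subgraph(epred = edge => edge.attr != "relative")

  // Vertex predicate: drop Bob; edges that touch Bob are dropped with him.
  def withoutBob(g: Graph[String, String]): Graph[String, String] =
    g.subgraph(vpred = (id, name) => name != "Bob")

  // Vertex names and "src-attr-dst" edge labels, for easy comparison.
  def summarize(g: Graph[String, String]): (Set[String], Set[String]) =
    (g.vertices.values.collect().toSet,
     g.triplets.map(t => s"${t.srcAttr}-${t.attr}-${t.dstAttr}").collect().toSet)

  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(
      new SparkConf().setAppName("SubgraphSketch").setMaster("local[*]"))
    try {
      val g = exampleGraph(sc)
      println(summarize(withoutRelatives(g)))
      println(summarize(withoutBob(g)))
    } finally sc.stop()
  }
}
```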
Computation with mapReduceTriplets
(Slide annotation: upgrade to aggregateMessages in Spark 1.2.0)

  class Graph[VD, ED] {
    def mapReduceTriplets[A](
        sendMsg: EdgeTriplet[VD, ED] => Iterator[(VertexId, A)],
        mergeMsg: (A, A) => A): RDD[(VertexId, A)]
  }

  graph.mapReduceTriplets(
    edge => Iterator(
      (edge.srcId, 1),
      (edge.dstId, 1)),
    _ + _)

[Figure: the four-vertex graph (Alice, Bob, Charlie, David; edges coworker, friend, and two relative) is reduced by mapReduceTriplets to an RDD of per-vertex degrees.]

Result (RDD):
  vertex id   degree
  Alice       2
  Bob         2
  Charlie     3
  David       1
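Since the slide points to aggregateMessages as the upgrade path, here is a sketch of the same degree count written against that API. The graph's edge endpoints and directions are assumptions chosen so that the degrees match the slide's table; the object and method names are hypothetical.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.graphx.{Edge, Graph, VertexId}

object DegreeCountSketch {
  // Returns each person's degree (in-edges plus out-edges).
  def degrees(sc: SparkContext): Map[String, Int] = {
    val vertices = sc.parallelize(Seq[(VertexId, String)](
      (1L, "Alice"), (2L, "Bob"), (3L, "Charlie"), (4L, "David")))
    // Assumed endpoints and directions; the slide shows only the labels.
    val edges = sc.parallelize(Seq(
      Edge(1L, 2L, "coworker"), Edge(2L, 3L, "friend"),
      Edge(1L, 3L, "relative"), Edge(3L, 4L, "relative")))
    val graph = Graph(vertices, edges)

    // Send the message 1 to both endpoints of every edge, then sum per vertex.
    val counts = graph.aggregateMessages[Int](
      ctx => { ctx.sendToSrc(1); ctx.sendToDst(1) },
      _ + _)

    // Join the counts back to the vertex names for readable output.
    graph.vertices.join(counts)
      .map { case (_, (name, degree)) => (name, degree) }
      .collect().toMap
  }

  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(
      new SparkConf().setAppName("DegreeCountSketch").setMaster("local[*]"))
    try degrees(sc).toSeq.sorted.foreach(println)
    finally sc.stop()
  }
}
```

Compared with mapReduceTriplets, the send function writes messages through the edge context instead of returning an iterator of (vertex id, message) pairs; the merge function is unchanged.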
How GraphX Works
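Encoding Property Graphs as RDDs
[Figure: a property graph on vertices A to F is split by a vertex cut across Machine 1 and Machine 2 and stored as three RDDs.]

Edge Table (RDD)
  Part. 1: A-B, A-C, B-C, C-D
  Part. 2: A-E, A-F, E-D, E-F

Vertex Table (RDD)
  A, B, C, D, E, F

Routing Table (RDD): vertex to edge partitions
  A: 1, 2
  B: 1
  C: 1
  D: 1, 2
  E: 2
  F: 2

Vertex Cut: vertices A and D appear in both edge partitions.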
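The relationship between the edge table and the routing table can be sketched in plain Scala, without Spark. This is only an illustration of the idea, not GraphX's internal implementation: the split of edges into two partitions is read off the slide, and the object and function names are hypothetical.

```scala
object RoutingTableSketch {
  // Edge partitions as listed in the slide's Edge Table (assumed 4 + 4 split).
  val edgePartitions: Map[Int, Seq[(String, String)]] = Map(
    1 -> Seq("A" -> "B", "A" -> "C", "B" -> "C", "C" -> "D"),
    2 -> Seq("A" -> "E", "A" -> "F", "E" -> "D", "E" -> "F"))

  // For each vertex, collect the edge partitions holding an edge that touches it.
  // A vertex that maps to more than one partition is cut by the vertex cut.
  def routingTable(parts: Map[Int, Seq[(String, String)]]): Map[String, Set[Int]] =
    parts.toSeq
      .flatMap { case (pid, edges) =>
        edges.flatMap { case (src, dst) => Seq(src -> pid, dst -> pid) }
      }
      .groupBy(_._1)
      .map { case (vertex, pairs) => vertex -> pairs.map(_._2).toSet }

  def main(args: Array[String]): Unit =
    routingTable(edgePartitions).toSeq.sortBy(_._1).foreach {
      case (vertex, pids) => println(s"$vertex -> ${pids.toSeq.sorted.mkString(", ")}")
    }
}
```

With a table like this, a vertex's attribute only has to be shipped to the edge partitions that reference it, which is why A and D are sent to both partitions and the other vertices to one.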
Graph System Optimizations
• Specialized Data-Structures
• Vertex-Cuts Partitioning
• Remote Caching / Mirroring
• Message Combiners
• Active Set Tracking
PageRank Benchmark
EC2 Cluster of 16 x m2.4xLarge (8 cores) + 1GigE
[Figure: two bar charts of runtime (seconds), one for the Twitter Graph (42M Vertices, 1.5B Edges) and one for the UK-Graph (106M Vertices, 3.7B Edges); annotations on the charts read "7x" and "18x".]
GraphX performs comparably to state-of-the-art graph processing systems.
Future of GraphX 
1. Language support 
a) Java API: PR #3234 
b) Python API: collaborating with Intel, SPARK-3789 
2. More algorithms 
a) LDA (topic modeling): PR #2388 
b) Correlation clustering 
c) Your algorithm here? 
3. Speculative 
a) Streaming/time-varying graphs 
b) Graph database–like queries
Try GraphX Today
[Figure: Wikipedia pipeline with the stages Raw Wikipedia (XML), Text Table (Title, Body), Hyperlinks, a "Berkeley" subgraph, PageRank, and Top 20 Pages (Title, PR).]
Thanks! 
http://spark.apache.org/graphx 
ankurd@eecs.berkeley.edu 
jegonzal@eecs.berkeley.edu 
rxin@eecs.berkeley.edu 
crankshaw@eecs.berkeley.edu

GraphX: Graph Analytics in Apache Spark (AMPCamp 5, 2014-11-20)

  • 1. UC BERKELEY GraphX Graph Analytics in Spark Ankur Dave Graduate Student, UC Berkeley AMPLab Joint work with Joseph Gonzalez, Reynold Xin, Daniel Crankshaw, Michael Franklin, and Ion Stoica
  • 5. Web Graphs “The political blogosphere and the 2004 U.S. election: divided they blog.” Adamic and Glance, 2005.
  • 6. User-Item Graphs ⋆⋆⋆⋆⋆ ⋆ ⋆⋆⋆⋆ ⋆⋆⋆⋆ ⋆⋆⋆⋆⋆
  • 10. Collaborative Filtering ⋆⋆⋆⋆⋆ ⋆ ⋆⋆⋆⋆ ⋆⋆⋆⋆ ⋆⋆⋆⋆⋆ ⚙ ♫ ⚙ ♫
  • 14. Many Graph-Parallel Algorithms Collaborative Filtering » Alternating Least Squares » Stochastic Gradient Descent » Tensor Factorization Structured Prediction » Loopy Belief Propagation » Max-Product Linear Programs » Gibbs Sampling Semi-supervised ML » Graph SSL » CoEM Community Detection » Triangle-Counting » K-core Decomposition » K-Truss Graph Analytics » PageRank » Personalized PageRank » Shortest Path » Graph Coloring Classification » Neural Networks
  • 15. Raw Wikipedia / / / XML Hyperlinks PageRank Top 20 Pages Title PR Link Table Title Link Editor Graph Community Detection User Community User Com. Editor Table Editor Title Top Communities Com. PR.. Modern Analytics
  • 16. Tables Raw Wikipedia / / / XML Hyperlinks PageRank Top 20 Pages Title PR Link Table Title Link Editor Graph Community Detection User Community User Com. Top Communities Com. PR.. Editor Table Editor Title
  • 17. Editor Table Editor Title Raw Wikipedia / / / XML Hyperlinks PageRank Top 20 Pages Title PR Link Table Title Link Editor Graph Community Detection User Community User Com. Top Communities Com. PR.. Graphs
  • 19. Property Graphs Vertex Property: • User Profile • Current PageRank Value Edge Property: • Weights • Relationships • Timestamps
  • 20. type Creating a Graph (Scala) VertexId = Long Graph val vertices: RDD[(VertexId, String)] = sc.parallelize(List( (1L, “Alice”), (2L, “Bob”), (3L, “Charlie”))) class Edge[ED]( val srcId: VertexId, val dstId: VertexId, val attr: ED) val edges: RDD[Edge[String]] = sc.parallelize(List( Edge(1L, 2L, “coworker”), Edge(2L, 3L, “friend”))) val graph = Graph(vertices, edges) 1 2 3 Alice Bob Charlie coworker friend
  • 21. Graph Operations (Scala)
        class Graph[VD, ED] {
          // Table Views
          def vertices: RDD[(VertexId, VD)]
          def edges: RDD[Edge[ED]]
          def triplets: RDD[EdgeTriplet[VD, ED]]
          // Transformations
          def mapVertices[VD2](f: (VertexId, VD) => VD2): Graph[VD2, ED]
          def mapEdges[ED2](f: Edge[ED] => ED2): Graph[VD, ED2]
          def reverse: Graph[VD, ED]
          def subgraph(epred: EdgeTriplet[VD, ED] => Boolean,
                       vpred: (VertexId, VD) => Boolean): Graph[VD, ED]
          // Joins
          def outerJoinVertices[U, VD2]
              (tbl: RDD[(VertexId, U)])
              (f: (VertexId, VD, Option[U]) => VD2): Graph[VD2, ED]
          // Computation
          def mapReduceTriplets[A](
              sendMsg: EdgeTriplet[VD, ED] => Iterator[(VertexId, A)],
              mergeMsg: (A, A) => A): RDD[(VertexId, A)]
  • 22. Built-in Algorithms (Scala)
          // Continued from previous slide
          def pageRank(tol: Double): Graph[Double, Double]
          def triangleCount(): Graph[Int, ED]
          def connectedComponents(): Graph[VertexId, ED]
          // ...and more: org.apache.spark.graphx.lib
        }
        [figures: PageRank, Triangle Count, Connected Components]
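To make the `connectedComponents()` result type on slide 22 concrete: GraphX labels each vertex with the lowest vertex ID in its component, which is why the vertex attribute type is `VertexId`. The following is a minimal sketch of that labeling on plain Scala collections. It is not the library's distributed implementation; the object name, the helper signature, and the sample graph are all hypothetical.

```scala
object ConnectedComponentsSketch {
  type VertexId = Long

  // Repeatedly push the smaller label across every edge until nothing changes.
  // Each vertex starts labeled with its own ID, so the fixpoint assigns every
  // vertex the lowest ID in its component.
  def connectedComponents(
      vertices: Set[VertexId],
      edges: Seq[(VertexId, VertexId)]): Map[VertexId, VertexId] = {
    @annotation.tailrec
    def loop(labels: Map[VertexId, VertexId]): Map[VertexId, VertexId] = {
      val next = edges.foldLeft(labels) { case (acc, (s, d)) =>
        val m = math.min(acc(s), acc(d))
        acc.updated(s, m).updated(d, m)
      }
      if (next == labels) labels else loop(next)
    }
    loop(vertices.map(v => v -> v).toMap)
  }

  // Hypothetical sample: components {1, 2, 3}, {4}, and {5, 6}.
  val sampleVertices: Set[VertexId] = Set(1L, 2L, 3L, 4L, 5L, 6L)
  val sampleEdges: Seq[(VertexId, VertexId)] = Seq((2L, 3L), (1L, 2L), (6L, 5L))

  def main(args: Array[String]): Unit =
    connectedComponents(sampleVertices, sampleEdges).toSeq.sorted.foreach(println)
}
```

Grouping the resulting map by label yields one group per component.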
  • 23. The triplets view
        class Graph[VD, ED] {
          def triplets: RDD[EdgeTriplet[VD, ED]]
        }
        class EdgeTriplet[VD, ED](
          val srcId: VertexId, val dstId: VertexId, val attr: ED,
          val srcAttr: VD, val dstAttr: VD)
        [figure: Graph with vertices 1 (Alice), 2 (Bob), 3 (Charlie) and edges "coworker", "friend", mapped by triplets to an RDD]
        srcAttr   attr       dstAttr
        Alice     coworker   Bob
        Bob       friend     Charlie
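The triplets view on slide 23 amounts to joining each edge with the attributes of its two endpoints. Here is a minimal sketch of that join on plain Scala collections rather than RDDs. The case classes are hypothetical stand-ins that mirror the slide's field names, not the GraphX classes themselves.

```scala
object TripletsSketch {
  type VertexId = Long

  // Hypothetical stand-ins mirroring the fields shown on the slide.
  final case class Edge[ED](srcId: VertexId, dstId: VertexId, attr: ED)
  final case class EdgeTriplet[VD, ED](
      srcId: VertexId, dstId: VertexId, attr: ED, srcAttr: VD, dstAttr: VD)

  // Look up both endpoint attributes for every edge.
  def triplets[VD, ED](
      vertices: Seq[(VertexId, VD)],
      edges: Seq[Edge[ED]]): Seq[EdgeTriplet[VD, ED]] = {
    val attrOf: Map[VertexId, VD] = vertices.toMap
    edges.map(e => EdgeTriplet(e.srcId, e.dstId, e.attr, attrOf(e.srcId), attrOf(e.dstId)))
  }

  // The graph from slide 20.
  val vertices: Seq[(VertexId, String)] = Seq((1L, "Alice"), (2L, "Bob"), (3L, "Charlie"))
  val edges: Seq[Edge[String]] = Seq(Edge(1L, 2L, "coworker"), Edge(2L, 3L, "friend"))

  def main(args: Array[String]): Unit =
    triplets(vertices, edges).foreach(t => println(s"${t.srcAttr} ${t.attr} ${t.dstAttr}"))
}
```

On the slide 20 graph this produces the two rows shown in the slide's table.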
  • 24. The subgraph transformation
        class Graph[VD, ED] {
          def subgraph(epred: EdgeTriplet[VD, ED] => Boolean,
                       vpred: (VertexId, VD) => Boolean): Graph[VD, ED]
        }
        graph.subgraph(epred = (edge) => edge.attr != "relative")
        [figure: input Graph over Alice, Bob, Charlie, David with edges "relative", "coworker", "friend", "relative"; the result keeps all four vertices and only the "coworker" and "friend" edges]
  • 25. The subgraph transformation
        class Graph[VD, ED] {
          def subgraph(epred: EdgeTriplet[VD, ED] => Boolean,
                       vpred: (VertexId, VD) => Boolean): Graph[VD, ED]
        }
        graph.subgraph(vpred = (id, name) => name != "Bob")
        [figure: input Graph over Alice, Bob, Charlie, David with edges "relative", "coworker", "friend", "relative"; the result keeps Alice, Charlie, David and the two "relative" edges]
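The two subgraph slides can be read as a single rule: keep the vertices that pass `vpred`, then keep the edges that pass `epred` and whose endpoints both survived. This is a minimal sketch of that rule on plain Scala collections. Several things here are assumptions: the `Graph` and `Edge` classes are hypothetical stand-ins, `epred` takes a plain edge rather than a triplet to stay short, and the endpoints of the two "relative" edges are a reconstruction, since the flattened figure does not state them.

```scala
object SubgraphSketch {
  type VertexId = Long

  final case class Edge[ED](srcId: VertexId, dstId: VertexId, attr: ED)

  final case class Graph[VD, ED](vertices: Map[VertexId, VD], edges: Seq[Edge[ED]]) {
    // Both predicates default to "keep everything", so callers can pass just one.
    def subgraph(
        epred: Edge[ED] => Boolean = (_: Edge[ED]) => true,
        vpred: (VertexId, VD) => Boolean = (_: VertexId, _: VD) => true): Graph[VD, ED] = {
      val keptVertices = vertices.filter { case (id, attr) => vpred(id, attr) }
      // An edge survives only if both endpoints survived and it passes epred.
      val keptEdges = edges.filter(e =>
        keptVertices.contains(e.srcId) && keptVertices.contains(e.dstId) && epred(e))
      Graph(keptVertices, keptEdges)
    }
  }

  // Assumed endpoints for the "relative" edges: Alice-Charlie and Charlie-David.
  val graph: Graph[String, String] = Graph(
    Map(1L -> "Alice", 2L -> "Bob", 3L -> "Charlie", 4L -> "David"),
    Seq(Edge(1L, 2L, "coworker"), Edge(2L, 3L, "friend"),
        Edge(1L, 3L, "relative"), Edge(3L, 4L, "relative")))

  val noRelatives: Graph[String, String] = graph.subgraph(epred = e => e.attr != "relative")
  val noBob: Graph[String, String] = graph.subgraph(vpred = (id, name) => name != "Bob")

  def main(args: Array[String]): Unit = {
    println(noRelatives.edges.map(_.attr))
    println(noBob.edges.map(_.attr))
  }
}
```

Dropping Bob also drops the "coworker" and "friend" edges, because each of them loses an endpoint.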
  • 26. Computation with mapReduceTriplets
        class Graph[VD, ED] {
          def mapReduceTriplets[A](
              sendMsg: EdgeTriplet[VD, ED] => Iterator[(VertexId, A)],
              mergeMsg: (A, A) => A): RDD[(VertexId, A)]
        }
        graph.mapReduceTriplets(
          edge => Iterator((edge.srcId, 1), (edge.dstId, 1)),
          _ + _)
        [slide annotation: upgrade to aggregateMessages in Spark 1.2.0]
        [figure: Graph over Alice, Bob, Charlie, David with edges "relative", "friend", "coworker", "relative", mapped by mapReduceTriplets to an RDD]
        vertex id   degree
        Alice       2
        Bob         2
        Charlie     3
        David       1
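The degree computation on slide 26 has two phases: `sendMsg` emits `(vertex id, message)` pairs for each triplet, and `mergeMsg` combines the messages that arrive at the same vertex. The sketch below models both phases on plain Scala collections. It follows the older `mapReduceTriplets` shape shown on the slide, not `aggregateMessages`. The `EdgeTriplet` class is a hypothetical stand-in, and the endpoints of the "relative" edges are an assumption chosen to be consistent with the slide's degree table.

```scala
object DegreeSketch {
  type VertexId = Long

  final case class EdgeTriplet[VD, ED](
      srcId: VertexId, dstId: VertexId, attr: ED, srcAttr: VD, dstAttr: VD)

  // Emit messages per triplet, then merge messages addressed to the same vertex.
  def mapReduceTriplets[VD, ED, A](triplets: Seq[EdgeTriplet[VD, ED]])(
      sendMsg: EdgeTriplet[VD, ED] => Iterator[(VertexId, A)],
      mergeMsg: (A, A) => A): Map[VertexId, A] =
    triplets.iterator
      .flatMap(sendMsg)
      .foldLeft(Map.empty[VertexId, A]) { case (acc, (id, msg)) =>
        acc.updated(id, acc.get(id).fold(msg)(mergeMsg(_, msg)))
      }

  val names: Map[VertexId, String] =
    Map(1L -> "Alice", 2L -> "Bob", 3L -> "Charlie", 4L -> "David")

  // Assumed endpoints for the "relative" edges: Alice-Charlie and Charlie-David.
  val triplets: Seq[EdgeTriplet[String, String]] = Seq(
    EdgeTriplet(1L, 2L, "coworker", "Alice", "Bob"),
    EdgeTriplet(2L, 3L, "friend", "Bob", "Charlie"),
    EdgeTriplet(1L, 3L, "relative", "Alice", "Charlie"),
    EdgeTriplet(3L, 4L, "relative", "Charlie", "David"))

  // Each edge sends a 1 to both endpoints; summing the 1s gives the degree.
  val degrees: Map[VertexId, Int] =
    mapReduceTriplets[String, String, Int](triplets)(
      t => Iterator((t.srcId, 1), (t.dstId, 1)),
      _ + _)

  val degreeByName: Map[String, Int] = degrees.map { case (id, d) => names(id) -> d }

  def main(args: Array[String]): Unit =
    degreeByName.toSeq.sorted.foreach { case (n, d) => println(s"$n $d") }
}
```

Under the assumed edge endpoints, the result matches the degree table on the slide.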
  • 28. Encoding Property Graphs as RDDs [figure: a Property Graph over vertices A to F stored with a Vertex Cut across Machine 1 and Machine 2. Vertex Table (RDD): A, B, C, D, E, F, split into Part. 1 and Part. 2. Edge Table (RDD): A-B, A-C, B-C, C-D, A-E, A-F, E-D, E-F. Routing Table (RDD): vertices A to F mapped to partitions 1 and 2]
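The routing table in slide 28 records, for each vertex, which edge partitions refer to it. A vertex that appears in more than one partition is one the vertex cut has split. This is a minimal sketch of deriving such a table on plain Scala collections. The edge list is taken from the slide, but the placement rule (first half of the edge table to partition 1, the rest to partition 2) and all names are assumptions made for illustration.

```scala
object VertexCutSketch {
  final case class Edge(src: String, dst: String)

  // Edge table as listed on the slide.
  val edges: Seq[Edge] = Seq(
    Edge("A", "B"), Edge("A", "C"), Edge("B", "C"), Edge("C", "D"),
    Edge("A", "E"), Edge("A", "F"), Edge("E", "D"), Edge("E", "F"))

  // Assumed placement rule: split the edge table into contiguous, equal-sized runs.
  def partitionEdges(es: Seq[Edge], numParts: Int): Map[Int, Seq[Edge]] =
    es.zipWithIndex
      .groupBy { case (_, i) => i * numParts / es.size + 1 }
      .map { case (p, indexed) => p -> indexed.map(_._1) }

  // For every vertex, collect the edge partitions that mention it.
  def routingTable(parts: Map[Int, Seq[Edge]]): Map[String, Set[Int]] =
    parts.toSeq
      .flatMap { case (p, es) => es.flatMap(e => Seq(e.src -> p, e.dst -> p)) }
      .groupBy(_._1)
      .map { case (v, pairs) => v -> pairs.map(_._2).toSet }

  val routing: Map[String, Set[Int]] = routingTable(partitionEdges(edges, 2))

  // Vertices referenced from more than one partition are the ones the cut splits.
  val cutVertices: Set[String] = routing.collect { case (v, ps) if ps.size > 1 => v }.toSet

  def main(args: Array[String]): Unit = {
    routing.toSeq.sortBy(_._1).foreach { case (v, ps) =>
      println(s"$v -> ${ps.toSeq.sorted.mkString(", ")}")
    }
    println(s"cut vertices: ${cutVertices.toSeq.sorted.mkString(", ")}")
  }
}
```

With this placement, only A and D have edges in both partitions, so they are the only vertices that would need to be present on both machines.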
  • 29. Graph System Optimizations: Specialized Data-Structures; Vertex-Cuts Partitioning; Remote Caching / Mirroring; Message Combiners; Active Set Tracking
  • 30. PageRank Benchmark. GraphX performs comparably to state-of-the-art graph processing systems. [figure: two bar charts of Runtime (Seconds) on an EC2 cluster of 16 x m2.4xLarge (8 cores) + 1GigE. Panels: Twitter Graph (42M Vertices, 1.5B Edges) and UK-Graph (106M Vertices, 3.7B Edges). Chart annotations: 7x, 18x]
  • 31. Future of GraphX
        1. Language support
           a) Java API: PR #3234
           b) Python API: collaborating with Intel, SPARK-3789
        2. More algorithms
           a) LDA (topic modeling): PR #2388
           b) Correlation clustering
           c) Your algorithm here?
        3. Speculative
           a) Streaming/time-varying graphs
           b) Graph database-like queries
  • 32. Try GraphX Today [figure: Wikipedia pipeline. Elements: Raw Wikipedia XML; Text Table (Title, Body); Berkeley subgraph; Hyperlinks; PageRank; Top 20 Pages (Title, PR)]
  • 33. Thanks! http://spark.apache.org/graphx ankurd@eecs.berkeley.edu jegonzal@eecs.berkeley.edu rxin@eecs.berkeley.edu crankshaw@eecs.berkeley.edu