Introduction to
Graph Analytics
CS194-16 Introduction to Data Science
*These slides are best viewed in PowerPoint with anima
Joseph E. Gonzalez
Post-doc, AMPLab
jegonzal@cs.berkeley.edu
Outline
1. Graph structured data
2. Common properties of graph data
3. Graph algorithms
4. Systems for large-scale graph
computation
5. GraphX: Graph Computation in Spark
6. Summary of other graph frameworks
Graph structured data is
everywhere …
Social Network
Vertices
• Users
• Posts / Images
Edges
• Social Relationships
• Directed: Twitter
• Undirected: Facebook
• Likes
[Figure 1.7: the social network of friendships in the karate club, with clues to the latent schism that eventually split the group]
Actual Social Graph
Karate Club Network
Web Graphs
• Vertices: Web-pages
• Edges: Links (Directed)
Generated Content:
• Click-streams
Wikipedia restricted to
1000 climate change
pages
2004 Political Blogs
Semantic Networks
Organize Knowledge
Vertices: Subject,
Object
Edges: Predicates
Example:
Google Knowledge Graph
• 570M Vertices
• 18B Edges
http://wiki.dbpedia.org
Transaction Networks
Supply Chain:
Vertices: Suppliers/Consumers
Edges: Exchange of Goods
Transaction Networks (e.g., Bitcoin):
Vertices: Users
Edges: Exchange of Currency
http://anonymity-in-bitcoin.blogspot.com/2011/07/bitcoin-is-not-
Biological Networks
Protein-Protein Interaction Networks (Interactomes)
Vertices: Proteins
Edges: Interactions
Biological Networks
Regulatory Networks
(Bipartite)
Vertices: Regulators, targets
Edges: Regulates target
Email
Call records
Communication Networks
Vertices: Devices, Routers
Directed Edges: Network Flows
Who Talks to Whom Graph
Enron Email Graphs
Vertices: Users
Directed Edges: Email From → To
User - Item Graphs
(Recommender Systems)
Bipartite Graphs
Vertices: Users and Items
Edges: Ratings
Graphical Models
Vertices: Random Variables, Factors
Edges: Statistical Dependencies
[Figure: LDA example with the words Cat, Apple, Growth, Hat, Plant]
Co-Authorship Network
Vertices: Authors
Edges: Co-authorship
Example: Erdős Number
http://academic.research.microsoft.com/VisualExplorer#2952384&1112639
Others?
Common properties of
graphs derived from
natural phenomena
Power-Law Degree
Distribution
[Plot: log-log histogram of count versus degree]
Top 1% of vertices are
adjacent to
50% of the edges!
High-Degree Vertices
AltaVista WebGraph: 1.4B Vertices, 6.6B Edges
[Plot: Number of Vertices versus Degree]
More than 10^8 vertices have one neighbor.
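The degree skew on these slides can be measured directly from an edge list. A minimal sketch in Scala, assuming undirected edges given as pairs of integer ids with no duplicates; the names `DegreeStats`, `degrees`, `degreeHistogram` and `topShare` are illustrative and do not come from the slides.

```scala
object DegreeStats {
  type Edge = (Int, Int)

  // Degree of every vertex that appears in at least one edge.
  def degrees(edges: Seq[Edge]): Map[Int, Int] =
    edges.flatMap { case (u, v) => Seq(u, v) }
      .groupBy(identity)
      .map { case (v, occurrences) => (v, occurrences.size) }

  // Histogram: degree -> number of vertices with that degree.
  def degreeHistogram(edges: Seq[Edge]): Map[Int, Int] =
    degrees(edges).values.groupBy(identity).map { case (d, vs) => (d, vs.size) }

  // Fraction of edges adjacent to at least one of the top `fraction` highest-degree vertices.
  def topShare(edges: Seq[Edge], fraction: Double): Double = {
    val deg = degrees(edges)
    val k = math.max(1, math.ceil(deg.size * fraction).toInt)
    val top = deg.toSeq.sortBy { case (_, d) => -d }.take(k).map(_._1).toSet
    edges.count { case (u, v) => top(u) || top(v) }.toDouble / edges.size
  }

  def main(args: Array[String]): Unit = {
    // A star graph: vertex 0 is adjacent to every edge.
    val star = (1 to 9).map(i => (0, i))
    println(degreeHistogram(star))
    println(topShare(star, 0.1))
  }
}
```

On a real web or social graph one would plot `degreeHistogram` on log-log axes and evaluate `topShare(edges, 0.01)` to check the "top 1%" claim.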
Giant Connected
Component
Densification
Average distance between nodes reduces over time.
[Plot: Ratio of Edges to Vertices by Year (2008 to 2012); panels: Facebook, US Patent Citations]
Community Structure
Linked-In Messenger
Graph Algorithms
“Think Globally, Act Locally”
Identifying Leaders
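PageRank (Centrality Measures)
Recursive Relationship:
$$ R[i] = \alpha + (1 - \alpha) \sum_{(j,i) \in E} \frac{R[j]}{L[j]} $$
Where:
»α is the random reset probability (typically 0.15)
»L[j] is the number of links on page j
[Figure: example graph on six pages, numbered 1 to 6]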
http://ilpubs.stanford.edu:8090/422/1/1999-66.pdf
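The recursive relationship can be evaluated by repeated substitution. A minimal sketch in Scala, assuming the common un-normalized form in which each page receives α plus (1 − α) times the sum of R[j] / L[j] over the pages j that link to it; `PageRankSketch`, `pageRank` and `iters` are illustrative names, and a fixed iteration count stands in for a convergence test.

```scala
object PageRankSketch {
  // edges are directed links (src, dst); alpha is the random reset probability.
  def pageRank(edges: Seq[(Int, Int)], alpha: Double = 0.15, iters: Int = 50): Map[Int, Double] = {
    val vertices = edges.flatMap { case (s, d) => Seq(s, d) }.distinct
    // L[j]: number of links on page j
    val outDeg: Map[Int, Int] = edges.groupBy(_._1).map { case (s, es) => (s, es.size) }
    var rank: Map[Int, Double] = vertices.map(v => (v, 1.0)).toMap
    for (_ <- 1 to iters) {
      // Each page j contributes R[j] / L[j] along each of its links.
      val contrib: Map[Int, Double] = edges
        .map { case (s, d) => (d, rank(s) / outDeg(s)) }
        .groupBy(_._1)
        .map { case (d, cs) => (d, cs.map(_._2).sum) }
      rank = vertices.map(v => (v, alpha + (1 - alpha) * contrib.getOrElse(v, 0.0))).toMap
    }
    rank
  }

  def main(args: Array[String]): Unit = {
    // On a directed cycle every page keeps rank 1.
    println(pageRank(Seq((1, 2), (2, 3), (3, 1))))
  }
}
```

A page with no incoming links settles at α, since its sum is empty.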
Predicting Behavior
[Figure: network of Posts and Profiles; some are labeled Liberal or Conservative and the rest are unknown (?)]
Label Propagation (Structured Prediction)
Social Arithmetic:
  50% What I list on my profile (50% Cameras, 50% Biking)
+ 40% Sue Ann Likes (80% Cameras, 20% Biking)
+ 10% Carlos Likes (30% Cameras, 70% Biking)
= I Like: 60% Cameras, 40% Biking
Recurrence Algorithm:
$$ Likes[i] = \sum_{j \in Friends[i]} W_{ij} \times Likes[j] $$
» iterate until convergence
http://www.cs.cmu.edu/~zhuxj/pub/CMU-CALD-02-107.pdf
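One step of the recurrence is a weighted sum of interest distributions. The sketch below, in Scala, reproduces the "social arithmetic" numbers from the slide; `LabelPropagationSketch` and `combine` are illustrative names, and the full algorithm would repeat this update for every vertex until the values stop changing.

```scala
object LabelPropagationSketch {
  type Likes = Map[String, Double]

  // One update: Likes[i] = sum over j of W_ij * Likes[j]
  def combine(weighted: Seq[(Double, Likes)]): Likes =
    weighted
      .flatMap { case (w, likes) => likes.toSeq.map { case (topic, p) => (topic, w * p) } }
      .groupBy(_._1)
      .map { case (topic, ps) => (topic, ps.map(_._2).sum) }

  def main(args: Array[String]): Unit = {
    val profile = Map("Cameras" -> 0.5, "Biking" -> 0.5)
    val sueAnn  = Map("Cameras" -> 0.8, "Biking" -> 0.2)
    val carlos  = Map("Cameras" -> 0.3, "Biking" -> 0.7)
    // 50% profile + 40% Sue Ann + 10% Carlos
    val me = combine(Seq((0.5, profile), (0.4, sueAnn), (0.1, carlos)))
    println(me) // Cameras ≈ 0.6, Biking ≈ 0.4
  }
}
```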
Recommending Products
[Figure: bipartite graph of Users and Items; the edges carry Ratings (r13, r14, r24, r25) and the vertices carry factors f(1) to f(5)]
Low-Rank Matrix Factorization:
Netflix ratings matrix (Users × Movies) ≈ User Factors (U) × Movie Factors (M)
User i has factor f(i); Movie j has factor f(j)
Iterate:
Finding Communities
Count triangles passing through each vertex:
Measure “cohesiveness” of local community
[Figure: example graph on vertices 1 to 4]
$$ ClusterCoeff[i] = \frac{2 \times \#Triangles[i]}{Deg[i] \times (Deg[i] - 1)} $$
Counting Triangles
Count triangles passing through each vertex by counting triangles on each edge:
[Figure: example graph on vertices A to G with per-edge triangle counts]
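Per-edge triangle counting and the clustering coefficient can be sketched as follows in Scala. This assumes an undirected graph with no self-loops or duplicate edges; `TriangleSketch`, `neighbors`, `trianglesPerVertex` and `clusterCoeff` are illustrative names. For each edge, every common neighbor of its two endpoints closes one triangle.

```scala
object TriangleSketch {
  // Undirected adjacency sets built from an edge list.
  def neighbors(edges: Seq[(Int, Int)]): Map[Int, Set[Int]] =
    edges.flatMap { case (u, v) => Seq((u, v), (v, u)) }
      .groupBy(_._1)
      .map { case (u, ps) => (u, ps.map(_._2).toSet) }

  // Triangles through each vertex, obtained from the per-edge counts.
  def trianglesPerVertex(edges: Seq[(Int, Int)]): Map[Int, Int] = {
    val nbrs = neighbors(edges)
    val perEdge = edges.map { case (u, v) => ((u, v), (nbrs(u) & nbrs(v)).size) }
    nbrs.keys.map { x =>
      // Each triangle through x is seen on two of the edges adjacent to x.
      val total = perEdge.collect { case ((u, v), c) if u == x || v == x => c }.sum
      (x, total / 2)
    }.toMap
  }

  // ClusterCoeff[i] = 2 * #Triangles[i] / (Deg[i] * (Deg[i] - 1)); 0 when Deg[i] < 2.
  def clusterCoeff(edges: Seq[(Int, Int)]): Map[Int, Double] = {
    val nbrs = neighbors(edges)
    trianglesPerVertex(edges).map { case (v, t) =>
      val deg = nbrs(v).size
      (v, if (deg < 2) 0.0 else 2.0 * t / (deg * (deg - 1)))
    }
  }

  def main(args: Array[String]): Unit = {
    // Triangle 1-2-3 with a pendant vertex 4 attached to 3.
    val edges = Seq((1, 2), (1, 3), (2, 3), (3, 4))
    println(trianglesPerVertex(edges))
    println(clusterCoeff(edges))
  }
}
```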
Connected Components
Every vertex starts out with a unique component id (typically its vertex id):
[Figure: on a six-vertex example the ids 1 to 6 converge over successive iterations to the smallest id in each component (1 and 4)]
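The propagation of the smallest id can be sketched in Scala as below, assuming an undirected edge list; `ConnectedComponentsSketch` and `components` are illustrative names. Each round, every edge offers both endpoints the smaller of their two current ids, and the loop stops when no id changes.

```scala
object ConnectedComponentsSketch {
  def components(vertices: Seq[Int], edges: Seq[(Int, Int)]): Map[Int, Int] = {
    // Every vertex starts with its own id as component id.
    var comp: Map[Int, Int] = vertices.map(v => (v, v)).toMap
    var changed = true
    while (changed) {
      // Every edge offers each endpoint the smaller of the two current ids.
      val offers = edges.flatMap { case (u, v) =>
        val m = math.min(comp(u), comp(v))
        Seq((u, m), (v, m))
      }
      val next = offers.foldLeft(comp) { case (acc, (v, id)) =>
        if (id < acc(v)) acc.updated(v, id) else acc
      }
      changed = next != comp
      comp = next
    }
    comp
  }

  def main(args: Array[String]): Unit = {
    // Two components: {1, 2, 3} and {4, 5, 6}.
    println(components(1 to 6, Seq((1, 2), (2, 3), (4, 5), (5, 6))))
  }
}
```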
Putting it All Together
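[Figure: Wikipedia analytics pipeline]
Raw Wikipedia (XML) → Text Table (Title, Body) → Term-Doc Graph → Topic Model (LDA) → Word Topics (Word, Topic)
Raw Wikipedia (XML) → Hyperlinks → PageRank → Top 20 Pages (Title, PR)
Raw Wikipedia (XML) → Discussion Table (User, Disc.) → Editor Graph → Community Detection → User Community (User, Com.)
Word Topics + User Community → Community Topic (Topic, Com.)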
Many Other Graph Algorithms
• Collaborative Filtering
– Alternating Least Squares
– Stochastic Gradient Descent
– Tensor Factorization
• Structured Prediction
– Loopy Belief Propagation
– Max-Product Linear Programs
– Gibbs Sampling
• Semi-supervised ML
– Graph SSL
– CoEM
• Community Detection
– Triangle-Counting
– K-core Decomposition
– K-Truss
• Graph Analytics
– PageRank
– Personalized PageRank
– Shortest Path
– Graph Coloring
• Classification
– Neural Networks
The Graph-Parallel Pattern
Model / Alg.
State
Fundamental Pattern
Graph-Parallel Systems
Expose specialized APIs to simplify
graph programming.
The Vertex Program Abstraction
Vertex-Programs interact by sending messages.
Malewicz et al. [PODC’09, SIGMOD’10]
Pregel_PageRank(i, messages):
  // Receive all the messages
  total = 0
  foreach (msg in messages):
    total = total + msg
  // Update the rank of this vertex
  R[i] = 0.15 + total
  // Send new messages to neighbors
  foreach (j in out_neighbors[i]):
    Send msg(R[i]) to vertex j
Iterative Bulk Synchronous Execution
Compute → Communicate → Barrier
Graph-Parallel Systems
Exploit graph structure to achieve
orders-of-magnitude performance gains
over more general data-parallel
systems.
Graph System
Optimizations
Specialized
Data-Structures
Vertex-Cuts
Partitioning
Remote
Caching / Mirroring
Message
Combiners
Active Set Tracking
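Split High-Degree vertices
New Abstraction → Equivalence on Split
[Figure: program this (a single high-degree vertex); run on this (the same vertex split across Machine 1 and Machine 2)]
GAS Decomposition
[Figure: vertex Y is split across Machine 1 to Machine 4, with one Master copy and three Mirror copies]
Gather: each machine computes a partial sum (Σ1, Σ2, Σ3, Σ4), combined as Σ = Σ1 + Σ2 + Σ3 + Σ4
Apply: the Master uses Σ to update Y to Y’, and Y’ is copied to the Mirrors
Scatter: each machine then updates its local edges using Y’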
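2D Partitioning
[Figure: the adjacency matrix (Vertices × Vertices) is divided into a 4 × 4 grid of blocks, one block per machine (16 Machines, numbered 1 to 16 row by row)]
The edges of vertex vi fall in one row and one column of the grid (machines 2, 5, 6, 7, 8, 10, 14), so vi only has neighbors on 7 machines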
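The bound of 7 machines follows from the grid layout: a vertex touches one row and one column, which share one block, so at most 2 × 4 − 1 = 7 machines. A minimal sketch in Scala, assuming a plain modulus is used to map vertex ids to grid rows and columns (a real system would likely use a hash); `Grid2DSketch`, `machineFor` and `machinesFor` are illustrative names.

```scala
object Grid2DSketch {
  // Machines are numbered 1 .. side*side, row by row, as in a 4 x 4 grid.
  def machineFor(src: Int, dst: Int, side: Int): Int = {
    val row = math.floorMod(src, side) // grid row chosen by the source vertex
    val col = math.floorMod(dst, side) // grid column chosen by the destination vertex
    row * side + col + 1
  }

  // All machines that can hold an edge adjacent to v, as source or as destination.
  def machinesFor(v: Int, side: Int): Set[Int] = {
    val asSrc = (0 until side).map(c => machineFor(v, c, side))
    val asDst = (0 until side).map(r => machineFor(r, v, side))
    (asSrc ++ asDst).toSet
  }

  def main(args: Array[String]): Unit = {
    // Vertex 5 on a 4 x 4 grid: row {5, 6, 7, 8} and column {2, 6, 10, 14}.
    println(machinesFor(5, 4).toSeq.sorted)
  }
}
```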
Triangle Counting on Twitter
40M Users, 1.4 Billion Links
Counted: 34.8 Billion Triangles
Hadoop [WWW’11]: 1536 Machines, 423 Minutes
64 Machines, 15 Seconds: 1000 x Faster
S. Suri and S. Vassilvitskii, “Counting triangles and the curse of the last reducer,” WWW’11
Break!
jegonzal@eecs.berkeley.edu
http://tinyurl.com/ampgraphx
Graph Analytics Pipeline
[Pipeline diagram from “Putting it All Together”, shown three times: in full, with the Tables highlighted, and with the Graphs highlighted]
Separate Systems
Tables: Dataflow Systems (Table, Row, Result)
Graphs: Graph Systems (Dependency Graph)
Separate systems for each view can be difficult to use and inefficient
Difficult to Program and Use
Users must Learn, Deploy, and
Manage multiple systems
Leads to brittle and often
complex interfaces
Inefficient
Extensive data movement and duplication across the network and file system
[Figure: each stage of the XML pipeline reads from and writes to HDFS]
Limited reuse of internal data-structures across stages
The GraphX Unified Approach
Enabling users to easily and efficiently
express the entire graph analytics
pipeline
New API
Blurs the distinction
between Tables and
Graphs
New System
Combines Data-Parallel and Graph-Parallel Systems
Representation
Distributed Graphs → Horizontally Partitioned Tables
Vertex Programs → Join + Dataflow Operators
Optimizations
Advances in Graph Processing Systems → Distributed Join Optimization, Materialized View Maintenance
View a Graph as a Table
Property Graph: vertices R, J, F, I
Vertex Property Table
Id          Property (V)
rxin        (Stu., Berk.)
jegonzal    (PstDoc, Berk.)
franklin    (Prof., Berk)
istoica     (Prof., Berk)
Edge Property Table
SrcId       DstId       Property (E)
rxin        jegonzal    Friend
franklin    rxin        Advisor
istoica     franklin    Coworker
franklin    jegonzal    PI
Spark Table Operators
Table (RDD) operators are inherited from
Spark:
map
filter
groupBy
sort
union
join
leftOuterJoin
rightOuterJoin
reduce
count
fold
reduceByKey
groupByKey
cogroup
cross
zip
sample
take
first
partitionBy
mapWith
pipe
save
...
class Graph [ V, E ] {
  def Graph(vertices: Table[ (Id, V) ],
            edges: Table[ (Id, Id, E) ])
  // Table Views -----------------
  def vertices: Table[ (Id, V) ]
  def edges: Table[ (Id, Id, E) ]
  def triplets: Table[ ((Id, V), (Id, V), E) ]
  // Transformations ------------------------------
  def reverse: Graph[V, E]
  def subgraph(pV: (Id, V) => Boolean,
               pE: Edge[V,E] => Boolean): Graph[V,E]
  def mapV[T](m: (Id, V) => T): Graph[T,E]
  def mapE[T](m: Edge[V,E] => T): Graph[V,T]
  // Joins ----------------------------------------
  def joinV[T](tbl: Table[(Id, T)]): Graph[(V, T), E]
  def joinE[T](tbl: Table[(Id, Id, T)]): Graph[V, (E, T)]
  // Computation ----------------------------------
  def mrTriplets[T](mapF: (Edge[V,E]) => List[(Id, T)],
                    reduceF: (T, T) => T): Graph[T, E]
}
Graph Operators (Scala)
def mrTriplets(mapF: (Edge[V,E]) => List[(Id, T)],
               reduceF: (T, T) => T): Graph[T, E]
mrTriplets captures the Gather-Scatter pattern from specialized graph-processing systems
Triplets Join Vertices and Edges
The triplets operator joins vertices and edges:
Vertices: A, B, C, D
Edges: A→B, A→C, B→C, C→D
Triplets: (A, B), (A, C), (B, C), (C, D), each carrying the properties of both vertices and of the edge
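The join behind the triplets view can be sketched with plain Scala collections. This is not the GraphX implementation: ordinary `Map` and `Seq` stand in for distributed tables, `TripletsSketch` and `triplets` are illustrative names, and the data is the vertex and edge property tables from “View a Graph as a Table”.

```scala
object TripletsSketch {
  type Id = String

  // Attach the source and destination vertex properties to every edge.
  def triplets[V, E](vertices: Map[Id, V], edges: Seq[(Id, Id, E)]): Seq[((Id, V), (Id, V), E)] =
    edges.map { case (src, dst, e) => ((src, vertices(src)), (dst, vertices(dst)), e) }

  val vertexTable: Map[Id, String] = Map(
    "rxin"     -> "(Stu., Berk.)",
    "jegonzal" -> "(PstDoc, Berk.)",
    "franklin" -> "(Prof., Berk)",
    "istoica"  -> "(Prof., Berk)")

  val edgeTable: Seq[(Id, Id, String)] = Seq(
    ("rxin", "jegonzal", "Friend"),
    ("franklin", "rxin", "Advisor"),
    ("istoica", "franklin", "Coworker"),
    ("franklin", "jegonzal", "PI"))

  def main(args: Array[String]): Unit =
    triplets(vertexTable, edgeTable).foreach(println)
}
```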
Map-Reduce Triplets
Map-Reduce triplets collects information about the neighborhood of each vertex:
Map: each triplet emits a message keyed by its Src. or Dst. vertex
MapFunction(A→B) → (B, msg)
MapFunction(A→C) → (C, msg)
MapFunction(B→C) → (C, msg)
MapFunction(C→D) → (D, msg)
Reduce (Message Combiners):
(B, msg)
(C, msg + msg)
(D, msg)
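The map and reduce steps can be sketched over an in-memory list of triplets. This is a simplification under stated assumptions: it returns a `Map` of combined messages rather than a new graph, it curries `mapF` and `reduceF` into separate parameter lists to help type inference, and `MrTripletsSketch`, `example` and `inDegrees` are illustrative names.

```scala
object MrTripletsSketch {
  type Id = String
  type Triplet[V, E] = ((Id, V), (Id, V), E)

  // Map every triplet to a list of (vertex id, message) pairs, then combine messages per vertex.
  def mrTriplets[V, E, T](triplets: Seq[Triplet[V, E]])(
      mapF: Triplet[V, E] => List[(Id, T)])(
      reduceF: (T, T) => T): Map[Id, T] =
    triplets.flatMap(mapF)
      .groupBy(_._1)
      .map { case (id, msgs) => (id, msgs.map(_._2).reduce(reduceF)) }

  // The slide's example graph, with no vertex or edge properties.
  val example: Seq[Triplet[Unit, Unit]] =
    Seq(("A", "B"), ("A", "C"), ("B", "C"), ("C", "D"))
      .map { case (s, d) => ((s, ()), (d, ()), ()) }

  // In-degree: every edge sends the message 1 to its destination; messages are summed.
  def inDegrees[V, E](ts: Seq[Triplet[V, E]]): Map[Id, Int] =
    mrTriplets(ts)(t => List((t._2._1, 1)))(_ + _)

  def main(args: Array[String]): Unit =
    println(inDegrees(example)) // B receives 1 message, C receives 2, D receives 1
}
```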
Using these basic GraphX operators
we implemented Pregel and GraphLab
in under 50 lines of code!
The GraphX Stack (Lines of Code)
PageRank (20), Connected Comp. (20), K-core (60), Triangle Count (50), LDA (220), SVD++ (110)
Pregel API (34)
GraphX (2,500)
Spark (30,000)
Some algorithms are more naturally expressed
using the GraphX primitive operators
We express enhanced Pregel and
GraphLab
abstractions using the GraphX operators
in less than 50 lines of code!
Enhanced Pregel in GraphX
Malewicz et al. [PODC’09, SIGMOD’10]
pregelPR(i, messageList):
  // Receive all the messages
  total = 0
  foreach (msg in messageList):
    total = total + msg
  // Update the rank of this vertex
  R[i] = 0.15 + total
  // Send new messages to neighbors
  foreach (j in out_neighbors[i]):
    Send msg(R[i]/E[i,j]) to vertex j
Require Message Combiners: messageList is replaced by messageSum
Remove Message Computation from the Vertex Program:
sendMsg(i→j, R[i], R[j], E[i,j]):
  // Compute single message
  return msg(R[i]/E[i,j])
combineMsg(a, b):
  // Compute sum of two messages
  return a + b
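The three pieces (vertex program, `sendMsg`, `combineMsg`) can be wired into a small superstep loop. A minimal sketch in Scala over in-memory collections, not the GraphX Pregel API: `PregelSketch`, `vprog` and `iters` are illustrative names, initial ranks of 1.0 are an assumption, and a vertex that receives no messages is given a message sum of 0 so that it still runs.

```scala
object PregelSketch {
  type Id = Int

  // edges: (i, j, E[i,j]); returns R after `iters` supersteps.
  def pregelPR(vertices: Seq[Id], edges: Seq[(Id, Id, Double)], iters: Int): Map[Id, Double] = {
    // sendMsg: compute a single message along edge i -> j
    def sendMsg(ri: Double, eij: Double): Double = ri / eij
    // combineMsg: compute the sum of two messages
    def combineMsg(a: Double, b: Double): Double = a + b
    // vertex program: update the rank from the combined message sum
    def vprog(messageSum: Double): Double = 0.15 + messageSum

    var rank: Map[Id, Double] = vertices.map(v => (v, 1.0)).toMap
    for (_ <- 1 to iters) {
      val sums: Map[Id, Double] = edges
        .map { case (i, j, eij) => (j, sendMsg(rank(i), eij)) }
        .groupBy(_._1)
        .map { case (j, ms) => (j, ms.map(_._2).reduce(combineMsg)) }
      rank = vertices.map(v => (v, vprog(sums.getOrElse(v, 0.0)))).toMap
    }
    rank
  }

  def main(args: Array[String]): Unit =
    println(pregelPR(Seq(1, 2), Seq((1, 2, 2.0)), iters = 2))
}
```

Because messages are only ever combined with `combineMsg`, the system can sum them as they arrive instead of materializing a message list per vertex.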
GraphX System Design
Distributed Graphs as Tables (RDDs)
[Figure: a Property Graph on vertices A to F with edges A–B, A–C, C–D, B–C, A–E, A–F, E–F, E–D is stored as three RDDs split over Part. 1 and Part. 2: a Vertex Table, an Edge Table partitioned with the 2D Vertex Cut Heuristic, and a Routing Table listing the edge partitions (1, 2) in which each vertex appears]
Caching for Iterative mrTriplets
[Figure: the Vertex Table (RDD) ships vertex properties to a Mirror Cache next to each Edge Table (RDD) partition; the cache for edges A–B, A–C, C–D, B–C holds A, B, C, D and the cache for edges A–E, A–F, E–F, E–D holds A, D, E, F]
Incremental Updates for Iterative mrTriplets
[Figure: only the changed vertices (A and E) are re-sent from the Vertex Table (RDD) to the Mirror Caches before the Edge Table (RDD) is scanned]
Aggregation for Iterative mrTriplets
[Figure: after the scan, each Edge Table (RDD) partition computes a Local Aggregate of the changes before they are returned to the Vertex Table (RDD)]
Performance Comparisons
Live-Journal: 69 Million Edges
Runtime (in seconds, PageRank for 10 iterations)
GraphLab         22
GraphX           68
Giraph           207
Naïve Spark      354
Mahout/Hadoop    1340
GraphX is roughly 3x slower than GraphLab
GraphX scales to larger graphs
Twitter Graph: 1.5 Billion Edges
Runtime (in seconds, PageRank for 10 iterations)
GraphLab    203
GraphX      451
Giraph      749
GraphX is roughly 2x slower than GraphLab
»Scala + Java overhead: Lambdas, GC time, …
»No shared memory parallelism: 2x increase in comm.
PageRank is just one
stage….
What about a pipeline?
[Figure: Spark Preprocess → HDFS → Compute → HDFS → Spark Post.]
A Small Pipeline in GraphX
Raw Wikipedia (XML) → Hyperlinks → PageRank → Top 20 Pages
Timed end-to-end GraphX is faster than
Total Runtime (in Seconds)
Spark               1492
Giraph + Spark      605
GraphLab + Spark    375
GraphX              342
Open Source Project
Alpha release since Spark 0.9
Contributors? Python Bindings?
Graph Processing Systems
• Apache Giraph: Java Pregel implementation
• GraphLab.org: C++ GraphLab implementation
• NetworkX: Python API for small graphs
• GraphLab Create: commercial GraphLab Python framework for large graphs and ML
• Gephi: graph visualization framework
Graph Database Technologies
Property graph data-model for storing and retrieving graph structured data.
• Neo4j: popular commercial graph database
• Titan: open-source distributed graph database
About Scala
High-level language for the Java VM
»Object-oriented + functional programming
Statically typed
»Comparable in speed to Java
»But often no need to write types due to type
inference
Interoperates with Java
»Can use any Java class, inherit from it, etc; can
also call Scala code from Java
Quick Tour
Declaring variables:
var x: Int = 7
var x = 7 // type inferred
val y = "hi" // read-only
Java equivalent:
int x = 7;
final String y = "hi";
Functions:
def square(x: Int): Int = x*x
def min(a: Int, b: Int): Int = {
  if (a < b) a else b
}
def announce(text: String): Unit = {
  println(text)
}
Java equivalent:
int square(int x) {
  return x*x;
}
void announce(String text) {
  System.out.println(text);
}
Quick Tour
Generic types:
var arr = new Array[Int](8)
var lst = List(1, 2, 3)
// type of lst is List[Int]
Java equivalent:
int[] arr = new int[8];
List<Integer> lst =
  new ArrayList<Integer>();
lst.add(...)
Indexing:
arr(5) = 7
println(lst(2)) // index must be within the 3-element list
Java equivalent:
arr[5] = 7;
System.out.println(lst.get(2));
Quick Tour
Processing collections with functional programming:
val list = List(1, 2, 3)
list.foreach(x => println(x)) // prints 1, 2, 3
list.foreach(println) // same
list.map(x => x + 2) // => List(3, 4, 5)
list.map(_ + 2) // same, with placeholder notation
list.filter(x => x % 2 == 1) // => List(1, 3)
list.filter(_ % 2 == 1) // => List(1, 3)
list.reduce((x, y) => x + y) // => 6
list.reduce(_ + _) // => 6
x => println(x) is a function expression (closure)
All of these leave the list unchanged (List is immutable)
Other Collection Methods
Scala collections provide many other functional methods; for example, Google for “Scala Seq”
Method on Seq[T]                      Explanation
map(f: T => U): Seq[U]                Pass each element through f
flatMap(f: T => Seq[U]): Seq[U]       One-to-many map
filter(f: T => Boolean): Seq[T]       Keep elements passing f
exists(f: T => Boolean): Boolean      True if one element passes
forall(f: T => Boolean): Boolean      True if all elements pass
reduce(f: (T, T) => T): T             Merge elements using f
groupBy(f: T => K): Map[K,List[T]]    Group elements by f(element)
sortBy(f: T => K): Seq[T]             Sort elements by f(element)
. . .
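A short sketch exercising each method in the table on one small sequence; `SeqMethodsDemo` and the sample words are illustrative and not from the slides.

```scala
object SeqMethodsDemo {
  val words = Seq("graph", "table", "join", "gather")

  val lengths   = words.map(_.length)                     // Seq(5, 5, 4, 6)
  val doubled   = words.flatMap(w => Seq(w, w.toUpperCase)) // two outputs per input
  val startsG   = words.filter(_.startsWith("g"))         // Seq("graph", "gather")
  val anyLong   = words.exists(_.length > 5)              // true: "gather"
  val allLong   = words.forall(_.length > 4)              // false: "join" has 4 letters
  val total     = lengths.reduce(_ + _)                   // 20
  val byInitial = words.groupBy(_.head)                   // keys 'g', 't', 'j'
  val byLength  = words.sortBy(_.length)                  // Seq("join", "graph", "table", "gather")

  def main(args: Array[String]): Unit = {
    println(lengths)
    println(byInitial)
    println(byLength)
  }
}
```

`sortBy` is stable, so "graph" stays ahead of "table" even though both have five letters.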

More Related Content

What's hot

software engineering
software engineeringsoftware engineering
software engineering
Abinaya B
 
Object Oriented Approach for Software Development
Object Oriented Approach for Software DevelopmentObject Oriented Approach for Software Development
Object Oriented Approach for Software Development
Rishabh Soni
 
Object oriented methodologies
Object oriented methodologiesObject oriented methodologies
Object oriented methodologies
naina-rani
 
Data Mining
Data MiningData Mining
Data Mining
IIIT ALLAHABAD
 
Hate Speech Recognition System through NLP and Deep Learning
Hate Speech Recognition System through NLP and Deep LearningHate Speech Recognition System through NLP and Deep Learning
Hate Speech Recognition System through NLP and Deep Learning
IRJET Journal
 
Data Mining: Classification and analysis
Data Mining: Classification and analysisData Mining: Classification and analysis
Data Mining: Classification and analysis
DataminingTools Inc
 

What's hot (6)

software engineering
software engineeringsoftware engineering
software engineering
 
Object Oriented Approach for Software Development
Object Oriented Approach for Software DevelopmentObject Oriented Approach for Software Development
Object Oriented Approach for Software Development
 
Object oriented methodologies
Object oriented methodologiesObject oriented methodologies
Object oriented methodologies
 
Data Mining
Data MiningData Mining
Data Mining
 
Hate Speech Recognition System through NLP and Deep Learning
Hate Speech Recognition System through NLP and Deep LearningHate Speech Recognition System through NLP and Deep Learning
Hate Speech Recognition System through NLP and Deep Learning
 
Data Mining: Classification and analysis
Data Mining: Classification and analysisData Mining: Classification and analysis
Data Mining: Classification and analysis
 

Similar to F14 lec12graphs

Joey gonzalez, graph lab, m lconf 2013
Joey gonzalez, graph lab, m lconf 2013Joey gonzalez, graph lab, m lconf 2013
Joey gonzalez, graph lab, m lconf 2013
MLconf
 
GraphLab: Large-Scale Machine Learning on Graphs (BDT204) | AWS re:Invent 2013
GraphLab: Large-Scale Machine Learning on Graphs (BDT204) | AWS re:Invent 2013GraphLab: Large-Scale Machine Learning on Graphs (BDT204) | AWS re:Invent 2013
GraphLab: Large-Scale Machine Learning on Graphs (BDT204) | AWS re:Invent 2013
Amazon Web Services
 
Towards an Incremental Schema-level Index for Distributed Linked Open Data G...
Towards an Incremental Schema-level Index  for Distributed Linked Open Data G...Towards an Incremental Schema-level Index  for Distributed Linked Open Data G...
Towards an Incremental Schema-level Index for Distributed Linked Open Data G...
Till Blume
 
Ling liu part 02:big graph processing
Ling liu part 02:big graph processingLing liu part 02:big graph processing
Ling liu part 02:big graph processing
jins0618
 
Measurement and modeling of the web and related data sets
Measurement and modeling of the web and related data setsMeasurement and modeling of the web and related data sets
Measurement and modeling of the web and related data sets
Mark J. Feldman
 
Scalable and Efficient Algorithms for Analysis of Massive, Streaming Graphs
Scalable and Efficient Algorithms for Analysis of Massive, Streaming GraphsScalable and Efficient Algorithms for Analysis of Massive, Streaming Graphs
Scalable and Efficient Algorithms for Analysis of Massive, Streaming Graphs
Jason Riedy
 
High-Performance Graph Analysis and Modeling
High-Performance Graph Analysis and ModelingHigh-Performance Graph Analysis and Modeling
High-Performance Graph Analysis and Modeling
Nesreen K. Ahmed
 
Democratizing Data Science in the Cloud
Democratizing Data Science in the CloudDemocratizing Data Science in the Cloud
Democratizing Data Science in the Cloud
University of Washington
 
Machine Learning in the Cloud with GraphLab
Machine Learning in the Cloud with GraphLabMachine Learning in the Cloud with GraphLab
Machine Learning in the Cloud with GraphLab
Danny Bickson
 
2009 Node XL Overview: Social Network Analysis in Excel 2007
2009 Node XL Overview: Social Network Analysis in Excel 20072009 Node XL Overview: Social Network Analysis in Excel 2007
2009 Node XL Overview: Social Network Analysis in Excel 2007
Marc Smith
 
Poster
PosterPoster
Design Patterns for Efficient Graph Algorithms in MapReduce__HadoopSummit2010
Design Patterns for Efficient Graph Algorithms in MapReduce__HadoopSummit2010Design Patterns for Efficient Graph Algorithms in MapReduce__HadoopSummit2010
Design Patterns for Efficient Graph Algorithms in MapReduce__HadoopSummit2010
Yahoo Developer Network
 
Big Graph : Tools, Techniques, Issues, Challenges and Future Directions
Big Graph : Tools, Techniques, Issues, Challenges and Future Directions Big Graph : Tools, Techniques, Issues, Challenges and Future Directions
Big Graph : Tools, Techniques, Issues, Challenges and Future Directions
csandit
 
BIG GRAPH: TOOLS, TECHNIQUES, ISSUES, CHALLENGES AND FUTURE DIRECTIONS
BIG GRAPH: TOOLS, TECHNIQUES, ISSUES, CHALLENGES AND FUTURE DIRECTIONSBIG GRAPH: TOOLS, TECHNIQUES, ISSUES, CHALLENGES AND FUTURE DIRECTIONS
BIG GRAPH: TOOLS, TECHNIQUES, ISSUES, CHALLENGES AND FUTURE DIRECTIONS
cscpconf
 
Shark SQL and Rich Analytics at Scale
Shark SQL and Rich Analytics at ScaleShark SQL and Rich Analytics at Scale
Shark SQL and Rich Analytics at Scale
DataWorks Summit
 
Cyberinfrastructure and Applications Overview: Howard University June22
Cyberinfrastructure and Applications Overview: Howard University June22Cyberinfrastructure and Applications Overview: Howard University June22
Cyberinfrastructure and Applications Overview: Howard University June22
marpierc
 
PresentationTest
PresentationTestPresentationTest
PresentationTest
bolu804
 
Graph Analytics: Graph Algorithms Inside Neo4j
Graph Analytics: Graph Algorithms Inside Neo4jGraph Analytics: Graph Algorithms Inside Neo4j
Graph Analytics: Graph Algorithms Inside Neo4j
Neo4j
 
Data Structure Graph DMZ #DMZone
Data Structure Graph DMZ #DMZoneData Structure Graph DMZ #DMZone
Data Structure Graph DMZ #DMZone
Doug Needham
 
The Future is Big Graphs: A Community View on Graph Processing Systems
The Future is Big Graphs: A Community View on Graph Processing SystemsThe Future is Big Graphs: A Community View on Graph Processing Systems
The Future is Big Graphs: A Community View on Graph Processing Systems
Neo4j
 

Similar to F14 lec12graphs (20)

Joey gonzalez, graph lab, m lconf 2013
Joey gonzalez, graph lab, m lconf 2013Joey gonzalez, graph lab, m lconf 2013
Joey gonzalez, graph lab, m lconf 2013
 
GraphLab: Large-Scale Machine Learning on Graphs (BDT204) | AWS re:Invent 2013
GraphLab: Large-Scale Machine Learning on Graphs (BDT204) | AWS re:Invent 2013GraphLab: Large-Scale Machine Learning on Graphs (BDT204) | AWS re:Invent 2013
GraphLab: Large-Scale Machine Learning on Graphs (BDT204) | AWS re:Invent 2013
 
Towards an Incremental Schema-level Index for Distributed Linked Open Data G...
Towards an Incremental Schema-level Index  for Distributed Linked Open Data G...Towards an Incremental Schema-level Index  for Distributed Linked Open Data G...
Towards an Incremental Schema-level Index for Distributed Linked Open Data G...
 
Ling liu part 02:big graph processing
Ling liu part 02:big graph processingLing liu part 02:big graph processing
Ling liu part 02:big graph processing
 
Measurement and modeling of the web and related data sets
Measurement and modeling of the web and related data setsMeasurement and modeling of the web and related data sets
Measurement and modeling of the web and related data sets
 
Scalable and Efficient Algorithms for Analysis of Massive, Streaming Graphs
Scalable and Efficient Algorithms for Analysis of Massive, Streaming GraphsScalable and Efficient Algorithms for Analysis of Massive, Streaming Graphs
Scalable and Efficient Algorithms for Analysis of Massive, Streaming Graphs
 
High-Performance Graph Analysis and Modeling
High-Performance Graph Analysis and ModelingHigh-Performance Graph Analysis and Modeling
High-Performance Graph Analysis and Modeling
 
Democratizing Data Science in the Cloud
Democratizing Data Science in the CloudDemocratizing Data Science in the Cloud
Democratizing Data Science in the Cloud
 
Machine Learning in the Cloud with GraphLab
Machine Learning in the Cloud with GraphLabMachine Learning in the Cloud with GraphLab
Machine Learning in the Cloud with GraphLab
 
2009 Node XL Overview: Social Network Analysis in Excel 2007
2009 Node XL Overview: Social Network Analysis in Excel 20072009 Node XL Overview: Social Network Analysis in Excel 2007
2009 Node XL Overview: Social Network Analysis in Excel 2007
 
Poster
PosterPoster
Poster
 
Design Patterns for Efficient Graph Algorithms in MapReduce__HadoopSummit2010
Design Patterns for Efficient Graph Algorithms in MapReduce__HadoopSummit2010Design Patterns for Efficient Graph Algorithms in MapReduce__HadoopSummit2010
Design Patterns for Efficient Graph Algorithms in MapReduce__HadoopSummit2010
 
Big Graph : Tools, Techniques, Issues, Challenges and Future Directions
Big Graph : Tools, Techniques, Issues, Challenges and Future Directions Big Graph : Tools, Techniques, Issues, Challenges and Future Directions
Big Graph : Tools, Techniques, Issues, Challenges and Future Directions
 
BIG GRAPH: TOOLS, TECHNIQUES, ISSUES, CHALLENGES AND FUTURE DIRECTIONS
BIG GRAPH: TOOLS, TECHNIQUES, ISSUES, CHALLENGES AND FUTURE DIRECTIONSBIG GRAPH: TOOLS, TECHNIQUES, ISSUES, CHALLENGES AND FUTURE DIRECTIONS
BIG GRAPH: TOOLS, TECHNIQUES, ISSUES, CHALLENGES AND FUTURE DIRECTIONS
 
Shark SQL and Rich Analytics at Scale
Shark SQL and Rich Analytics at ScaleShark SQL and Rich Analytics at Scale
Shark SQL and Rich Analytics at Scale
 
Cyberinfrastructure and Applications Overview: Howard University June22
Cyberinfrastructure and Applications Overview: Howard University June22Cyberinfrastructure and Applications Overview: Howard University June22
Cyberinfrastructure and Applications Overview: Howard University June22
 
PresentationTest
PresentationTestPresentationTest
PresentationTest
 
Graph Analytics: Graph Algorithms Inside Neo4j
Graph Analytics: Graph Algorithms Inside Neo4jGraph Analytics: Graph Algorithms Inside Neo4j
Graph Analytics: Graph Algorithms Inside Neo4j
 
Data Structure Graph DMZ #DMZone
Data Structure Graph DMZ #DMZoneData Structure Graph DMZ #DMZone
Data Structure Graph DMZ #DMZone
 
The Future is Big Graphs: A Community View on Graph Processing Systems
The Future is Big Graphs: A Community View on Graph Processing SystemsThe Future is Big Graphs: A Community View on Graph Processing Systems
The Future is Big Graphs: A Community View on Graph Processing Systems
 

Recently uploaded

Question paper of renewable energy sources
Question paper of renewable energy sourcesQuestion paper of renewable energy sources
Question paper of renewable energy sources
mahammadsalmanmech
 
New techniques for characterising damage in rock slopes.pdf
New techniques for characterising damage in rock slopes.pdfNew techniques for characterising damage in rock slopes.pdf
New techniques for characterising damage in rock slopes.pdf
wisnuprabawa3
 
5214-1693458878915-Unit 6 2023 to 2024 academic year assignment (AutoRecovere...
5214-1693458878915-Unit 6 2023 to 2024 academic year assignment (AutoRecovere...5214-1693458878915-Unit 6 2023 to 2024 academic year assignment (AutoRecovere...
5214-1693458878915-Unit 6 2023 to 2024 academic year assignment (AutoRecovere...
ihlasbinance2003
 
Low power architecture of logic gates using adiabatic techniques
Low power architecture of logic gates using adiabatic techniquesLow power architecture of logic gates using adiabatic techniques
Low power architecture of logic gates using adiabatic techniques
nooriasukmaningtyas
 
ML Based Model for NIDS MSc Updated Presentation.v2.pptx
ML Based Model for NIDS MSc Updated Presentation.v2.pptxML Based Model for NIDS MSc Updated Presentation.v2.pptx
ML Based Model for NIDS MSc Updated Presentation.v2.pptx
JamalHussainArman
 
Generative AI leverages algorithms to create various forms of content
Generative AI leverages algorithms to create various forms of contentGenerative AI leverages algorithms to create various forms of content
Generative AI leverages algorithms to create various forms of content
Hitesh Mohapatra
 
Harnessing WebAssembly for Real-time Stateless Streaming Pipelines
Harnessing WebAssembly for Real-time Stateless Streaming PipelinesHarnessing WebAssembly for Real-time Stateless Streaming Pipelines
Harnessing WebAssembly for Real-time Stateless Streaming Pipelines
Christina Lin
 
Understanding Inductive Bias in Machine Learning
Understanding Inductive Bias in Machine LearningUnderstanding Inductive Bias in Machine Learning
Understanding Inductive Bias in Machine Learning
SUTEJAS
 
F14 lec12graphs

  • 1. Introduction to Graph Analytics CS194-16 Introduction to Data Science *These slides are best viewed in PowerPoint with anima Joseph E. Gonzalez Post-doc, AMPLab jegonzal@cs.berkeley.edu
  • 2. Outline 1. Graph structured data 2. Common properties of graph data 3. Graph algorithms 4. Systems for large-scale graph computation 5. GraphX: Graph Computation in Spark 6. Summary of other graph frameworks
  • 3. Graph structured data is everywhere …
  • 4. Social Network Vertices • Users • Posts / Images Edges • Social Relationships • Directed: Twitter • Undirected: Facebook • Likes
  • 5. Actual Social Graph: Karate Club Network. [Figure 1.7: From the social network of friendships in the karate club from Figure 1.1, we can find clues to the latent schism that eventually split the group into two separate clubs. Vertices are numbered 1 to 34.]
  • 6. Web Graphs • Vertices: Web-pages • Edges: Links (Directed) Generated Content: • Click-streams Wikipedia restricted to 1000 climate change pages
  • 7. Web Graphs • Vertices: Web-pages • Edges: Links (Directed) Generated Content: • Click-streams 2004 Political Blogs
  • 8. Semantic Networks Organize Knowledge Vertices: Subject, Object Edges: Predicates Example: Google Knowledge Graph • 570M Vertices • 18B Edges http://wiki.dbpedia.org
  • 9. Transaction Networks Supply Chain: Vertices: Suppliers/Consumers Edges: Exchange of Goods Transaction Networks (e.g., Bitcoin): Vertices: Users Edges: Exchange of Currency http://anonymity-in-bitcoin.blogspot.com/2011/07/bitcoin-is-not-
  • 10. Transaction Networks Supply Chain: Vertices: Suppliers/Consumers Edges: Exchange of Goods Transaction Networks (e.g., Bitcoin): Vertices: Users Edges: Exchange of Currency
  • 11. Biological Networks Protein-Protein Interaction Networks (Interactomes) Vertices: Proteins Edges: Interactions
  • 12. Biological Networks Regulatory Networks (Bipartite) Vertices: Regulators, targets Edges: Regulates target
  • 13. Email Call records Communication Networks Vertices: Devices, Routers Directed Edges: Network Flows
  • 14. Who Talks to Whom Graph. Enron Email Graphs. Vertices: Users. Directed Edges: Email From → To
  • 15. User - Item Graphs (Recommender Systems) Bipartite Graphs Vertices: Users and Items Edges: Ratings
  • 16. Graphical Models Vertices: Random Variables, Factors Edges: Statistical Dependencies LDA Cat Apple Growth Hat Plant
  • 17. Co-Authorship Network Vertices: Authors Edges: Co-authorship Example: Erdos Number http://academic.research.microsoft.com/VisualExplorer#2952384&1112639
  • 19. Common properties of graphs derived from natural phenomena
  • 20. Power-Law Degree Distribution. AltaVista WebGraph: 1.4B Vertices, 6.6B Edges. [Log-log plot: number of vertices (count) against degree.] High-Degree Vertices: Top 1% of vertices are adjacent to 50% of the edges! More than 10^8 vertices have one neighbor.
  • 22. Densification. Average distance between nodes reduces over time. [Plots: ratio of edges to vertices by year; panels: Facebook (2008 to 2012), US Patent Citations.]
  • 26. PageRank (Centrality Measures) Recursive Relationship: Where: » α is the random reset probability (typically 0.15) » L[j] is the number of links on page j. http://ilpubs.stanford.edu:8090/422/1/1999-66.pdf
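A common way to write the recursive relationship using the two symbols defined on this slide is sketched below. The edge set E and the exact normalization are assumptions; variants differ (for example, some divide the reset term by the number of pages).

```latex
R[i] \;=\; \alpha \;+\; (1 - \alpha) \sum_{j \,:\, (j \to i) \in E} \frac{R[j]}{L[j]}
```

Read as: the rank of page i is the reset probability plus the damped sum of the ranks of the pages j linking to i, each split evenly over the L[j] links on page j.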
  • 28. Profile Label Propagation (Structured Prediction). Social Arithmetic: I Like = 50% What I list on my profile (50% Cameras, 50% Biking) + 40% Sue Ann Likes (80% Cameras, 20% Biking) + 10% Carlos Likes (30% Cameras, 70% Biking) = 60% Cameras, 40% Biking. Recurrence Algorithm: $Likes[i] = \sum_{j \in Friends[i]} W_{ij} \times Likes[j]$ » iterate until convergence. http://www.cs.cmu.edu/~zhuxj/pub/CMU-CALD-02-107.pdf
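The social arithmetic on this slide is one step of the recurrence: a weighted sum of interest distributions. A minimal Scala sketch, using the weights and percentages from the slide's example; the object, type, and value names are illustrative and not from the source.

```scala
object SocialArithmetic {
  // An interest profile maps a topic to a probability.
  type Likes = Map[String, Double]

  // One step of Likes[i] = sum over j of W_ij * Likes[j].
  def combine(weighted: Seq[(Double, Likes)]): Likes =
    weighted
      .flatMap { case (w, likes) => likes.toSeq.map { case (topic, p) => (topic, w * p) } }
      .groupBy(_._1)
      .map { case (topic, parts) => topic -> parts.map(_._2).sum }

  val profile: Likes = Map("Cameras" -> 0.5, "Biking" -> 0.5)
  val sueAnn: Likes  = Map("Cameras" -> 0.8, "Biking" -> 0.2)
  val carlos: Likes  = Map("Cameras" -> 0.3, "Biking" -> 0.7)

  // 50% own profile, 40% Sue Ann, 10% Carlos.
  val me: Likes = combine(Seq(0.5 -> profile, 0.4 -> sueAnn, 0.1 -> carlos))
}
```

Up to floating-point rounding, `SocialArithmetic.me` works out to 60% Cameras and 40% Biking, matching the slide. In the full algorithm this step would be repeated for every user until the distributions stop changing.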
  • 31. Finding Communities. Count triangles passing through each vertex: $ClusterCoeff[i] = \frac{2 \times \#Triangles[i]}{Deg[i] \times (Deg[i] - 1)}$. Measure “cohesiveness” of local community.
  • 32. Counting Triangles. Count triangles passing through each vertex by counting triangles on each edge. [Figure: worked example on vertices A to G.]
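A small sketch of the two slides above, assuming an undirected graph held as adjacency sets. Taking the triangles on an edge to be the common neighbors of its endpoints is the usual approach and is assumed here; all names are illustrative.

```scala
object TriangleSketch {
  // Undirected graph: vertex -> set of neighbors.
  type Graph = Map[Int, Set[Int]]

  def fromEdges(edges: Seq[(Int, Int)]): Graph =
    edges
      .flatMap { case (a, b) => Seq(a -> b, b -> a) }
      .groupBy(_._1)
      .map { case (v, es) => v -> es.map(_._2).toSet }

  // Triangles on edge (a, b) = number of common neighbors of a and b.
  def trianglesOnEdge(g: Graph, a: Int, b: Int): Int =
    (g(a) intersect g(b)).size

  // Each triangle through v is counted once per incident edge, i.e. twice.
  def trianglesAt(g: Graph, v: Int): Int =
    g(v).toSeq.map(u => trianglesOnEdge(g, v, u)).sum / 2

  // ClusterCoeff[i] = 2 * #Triangles[i] / (Deg[i] * (Deg[i] - 1)).
  def clusterCoeff(g: Graph, v: Int): Double = {
    val d = g(v).size
    if (d < 2) 0.0 else 2.0 * trianglesAt(g, v) / (d * (d - 1))
  }

  // Toy graph: one triangle (1, 2, 3) plus a pendant vertex 4.
  val example: Graph = fromEdges(Seq((1, 2), (2, 3), (1, 3), (3, 4)))
}
```

In the toy graph, vertex 1 has two neighbors that are themselves connected, so its coefficient is 1.0; vertex 3 has three neighbors but only one triangle, giving 1/3.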
  • 33. Connected Components. Every vertex starts out with a unique component id (typically its vertex id). [Figure: component ids over successive iterations: (4, 5, 6, 1, 3, 2) → (4, 4, 4, 1, 2, 1) → (4, 4, 4, 1, 1, 1).]
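One way to realize this is minimum-label propagation: each vertex repeatedly adopts the smallest component id among itself and its neighbors. The sketch below assumes that rule and a hypothetical edge list chosen so that the labels evolve as in the slide's figure; it is an illustration, not the deck's implementation.

```scala
object ComponentsSketch {
  def connectedComponents(vertices: Set[Int], edges: Seq[(Int, Int)]): Map[Int, Int] = {
    // Every vertex starts with its own id as component id.
    var label: Map[Int, Int] = vertices.map(v => v -> v).toMap
    var changed = true
    while (changed) {
      // Each undirected edge offers each endpoint the other endpoint's label.
      val offers = edges.flatMap { case (a, b) => Seq(a -> label(b), b -> label(a)) }
      val best = offers.groupBy(_._1).map { case (v, os) => v -> os.map(_._2).min }
      val next = label.map { case (v, l) => v -> math.min(l, best.getOrElse(v, l)) }
      changed = next != label
      label = next
    }
    label
  }

  // Assumed edges: two components, {4, 5, 6} and {1, 2, 3}.
  val result: Map[Int, Int] =
    connectedComponents(Set(1, 2, 3, 4, 5, 6), Seq((4, 5), (4, 6), (1, 2), (2, 3)))
}
```

The loop stops when a full pass changes no label, so the number of passes is bounded by the longest shortest path inside a component.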
  • 34. Putting it All Together Raw Wikipedia < / >< / >< / > XML Hyperlinks PageRank Top 20 Page Title PR Text Table Title Body Topic Model (LDA) Word Topics Word Topic Editor Graph Community Detection User Community User Com. Term-Doc Graph Discussion Table User Disc. Community Topic Topic Com.
  • 35. Many Other Graph Algorithms • Collaborative Filtering: Alternating Least Squares, Stochastic Gradient Descent, Tensor Factorization • Structured Prediction: Loopy Belief Propagation, Max-Product Linear Programs, Gibbs Sampling • Semi-supervised ML: Graph SSL, CoEM • Community Detection: Triangle-Counting, K-core Decomposition, K-Truss • Graph Analytics: PageRank, Personalized PageRank, Shortest Path, Graph Coloring • Classification: Neural Networks
  • 36. The Graph-Parallel Pattern. Model / Alg. State. Fundamental Pattern
  • 37. Graph-Parallel Systems. Expose specialized APIs to simplify graph programming.
  • 38. The Vertex Program Abstraction. Vertex-Programs interact by sending messages. Malewicz et al. [PODC’09, SIGMOD’10]
      Pregel_PageRank(i, messages) :
        // Receive all the messages
        total = 0
        foreach( msg in messages) :
          total = total + msg
        // Update the rank of this vertex
        R[i] = 0.15 + total
        // Send new messages to neighbors
        foreach(j in out_neighbors[i]) :
          Send msg(R[i]) to vertex j
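The vertex program can be simulated on one machine by running supersteps over an inbox of messages. The sketch below is an assumption-laden illustration: unlike the slide's pseudocode, it damps the incoming sum by 0.85 and divides each outgoing rank by the sender's out-degree so that the iteration converges. The object, type, and method names are hypothetical.

```scala
object VertexProgramSketch {
  // Toy directed graph: vertex id -> out-neighbors.
  type OutNbrs = Map[Int, Seq[Int]]
  type Inbox = Map[Int, Seq[Double]]

  // One superstep: receive, update the rank, send to out-neighbors.
  def superstep(g: OutNbrs, inbox: Inbox): (Map[Int, Double], Inbox) = {
    val ranks: Map[Int, Double] = g.keys.map { i =>
      val total = inbox.getOrElse(i, Seq.empty[Double]).sum // receive all messages
      i -> (0.15 + 0.85 * total)                            // assumed damping of 0.85
    }.toMap
    val outbox: Inbox = g.toSeq
      .flatMap { case (i, nbrs) => nbrs.map(j => j -> (ranks(i) / nbrs.size)) }
      .groupBy(_._1)
      .map { case (j, msgs) => j -> msgs.map(_._2) }
    (ranks, outbox)
  }

  def run(g: OutNbrs, steps: Int): Map[Int, Double] = {
    var inbox: Inbox = Map.empty
    var ranks: Map[Int, Double] = Map.empty
    for (_ <- 1 to steps) {
      val (r, out) = superstep(g, inbox)
      ranks = r
      inbox = out
    }
    ranks
  }

  // Directed 3-cycle: by symmetry every vertex converges to rank 1.0.
  val cycle: OutNbrs = Map(1 -> Seq(2), 2 -> Seq(3), 3 -> Seq(1))
}
```

A call such as `VertexProgramSketch.run(VertexProgramSketch.cycle, 100)` returns ranks very close to 1.0 for all three vertices, since r = 0.15 + 0.85 r has the fixed point r = 1.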
  • 40. Graph-Parallel Systems. Exploit graph structure to achieve orders-of-magnitude performance gains over more general data-parallel systems.
  • 42. Split High-Degree vertices. New Abstraction → Equivalence on Split. Program This, Run on This. [Figure: one high-degree vertex split across Machine 1 and Machine 2.]
  • 43. GAS Decomposition: Gather, Apply, Scatter. [Figure: vertex Y spans Machines 1 to 4 as one Master and three Mirrors; partial sums Σ1 + Σ2 + Σ3 + Σ4 combine into Σ, Y becomes Y’, and Y’ is copied to the mirrors.]
  • 44. 2D Partitioning. [Figure: adjacency matrix (vertices × vertices) divided among 16 machines numbered 1 to 16; vertex vi appears on machines 2, 5, 6, 7, 8, 10, 14.] vi only has neighbors on 7 machines
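The count of 7 follows from laying the 16 machines out as a 4 × 4 grid over the adjacency matrix: all edges touching a vertex fall in one grid row or one grid column, which share one cell, so 4 + 4 - 1 = 7. A sketch, assuming machines are numbered row by row and a hypothetical modulo assignment of vertices to blocks:

```scala
object Partition2DSketch {
  val side = 4 // 4 x 4 grid = 16 machines, numbered 1..16 row by row

  // Hypothetical assignment of a (non-negative) vertex id to a block.
  def block(v: Int): Int = v % side

  // Edge (src, dst) lives on the machine at row block(src), column block(dst).
  def machine(src: Int, dst: Int): Int = block(src) * side + block(dst) + 1

  // Machines that can hold an edge touching v: one grid row plus one grid column.
  def machinesFor(v: Int): Set[Int] = {
    val asSrc = (0 until side).map(c => block(v) * side + c + 1)
    val asDst = (0 until side).map(r => r * side + block(v) + 1)
    (asSrc ++ asDst).toSet
  }
}
```

For a vertex in block 1 (for example id 5), `machinesFor` yields machines 2, 5, 6, 7, 8, 10 and 14, the same set shown on the slide. In general a vertex touches at most 2√p - 1 of p machines.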
  • 45. Triangle Counting on Twitter: 40M Users, 1.4 Billion Links. Counted: 34.8 Billion Triangles. Hadoop [WWW’11]: 1536 Machines, 423 Minutes. 64 Machines: 15 Seconds. 1000 x Faster. S. Suri and S. Vassilvitskii, “Counting triangles and the curse of the last reducer,” WWW’11
  • 47. Graph Analytics Pipeline Raw Wikipedia < / >< / >< / > XML Hyperlinks PageRank Top 20 Page Title PR Text Table Title Body Topic Model (LDA) Word Topics Word Topic Editor Graph Community Detection User Community User Com. Term-Doc Graph Discussion Table User Disc. Community Topic Topic Com.
  • 48. Tables Raw Wikipedia < / >< / >< / > XML Hyperlinks PageRank Top 20 Page Title PR Text Table Title Body Topic Model (LDA) Word Topics Word Topic Editor Graph Community Detection User Community User Com. Term-Doc Graph Discussion Table User Disc. Community Topic Topic Com.
  • 49. Graphs Raw Wikipedia < / >< / >< / > XML Hyperlinks PageRank Top 20 Page Title PR Text Table Title Body Topic Model (LDA) Word Topics Word Topic Editor Graph Community Detection User Community User Com. Term-Doc Graph Discussion Table User Disc. Community Topic Topic Com.
  • 52. Separate Systems. Dataflow Systems: Table, Rows, Result. Graph Systems: Dependency Graph
  • 53. Separate systems for each view can be difficult to use and inefficient
  • 54. Difficult to Program and Use. Users must Learn, Deploy, and Manage multiple systems. Leads to brittle and often complex interfaces
  • 55. Inefficient. Extensive data movement and duplication across the network and file system (XML → HDFS → HDFS → HDFS → HDFS). Limited reuse of internal data structures across stages
  • 56. The GraphX Unified Approach. Enabling users to easily and efficiently express the entire graph analytics pipeline. New API: Blurs the distinction between Tables and Graphs. New System: Combines Data-Parallel and Graph-Parallel Systems
  • 57. Representation Optimizations Distributed Graphs Horizontally Partitioned Tables Join Vertex Programs Dataflow Operators Advances in Graph Processing Systems Distributed Join Optimization Materialized View Maintenance
  • 58. View a Graph as a Table. Property Graph (vertices R, J, F, I). Vertex Property Table (Id: Property (V)): rxin: (Stu., Berk.); jegonzal: (PstDoc, Berk.); franklin: (Prof., Berk); istoica: (Prof., Berk). Edge Property Table (SrcId → DstId: Property (E)): rxin → jegonzal: Friend; franklin → rxin: Advisor; istoica → franklin: Coworker; franklin → jegonzal: PI
  • 59. Spark Table Operators. Table (RDD) operators are inherited from Spark: map, filter, groupBy, sort, union, join, leftOuterJoin, rightOuterJoin, reduce, count, fold, reduceByKey, groupByKey, cogroup, cross, zip, sample, take, first, partitionBy, mapWith, pipe, save, ...
  • 60. Graph Operators (Scala)
      class Graph [ V, E ] {
        def Graph(vertices: Table[ (Id, V) ], edges: Table[ (Id, Id, E) ])
        // Table Views -----------------
        def vertices: Table[ (Id, V) ]
        def edges: Table[ (Id, Id, E) ]
        def triplets: Table [ ((Id, V), (Id, V), E) ]
        // Transformations ------------------------------
        def reverse: Graph[V, E]
        def subgraph(pV: (Id, V) => Boolean, pE: Edge[V,E] => Boolean): Graph[V,E]
        def mapV(m: (Id, V) => T ): Graph[T,E]
        def mapE(m: Edge[V,E] => T ): Graph[V,T]
        // Joins ----------------------------------------
        def joinV(tbl: Table [(Id, T)]): Graph[(V, T), E ]
        def joinE(tbl: Table [(Id, Id, T)]): Graph[V, (E, T)]
        // Computation ----------------------------------
        def mrTriplets(mapF: (Edge[V,E]) => List[(Id, T)], reduceF: (T, T) => T): Graph[T, E]
      }
  • 61. Graph Operators (Scala), same class as on the previous slide, with a callout: def mrTriplets(mapF: (Edge[V,E]) => List[(Id, T)], reduceF: (T, T) => T): Graph[T, E] captures the Gather-Scatter pattern from specialized graph-processing systems
  • 62. Triplets Join Vertices and Edges. The triplets operator joins vertices and edges. Vertices: A, B, C, D. Edges: A→B, A→C, B→C, C→D. Triplets: A→B, A→C, B→C, C→D, each edge joined with its source and destination vertices
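On ordinary in-memory collections the triplets view is a lookup of both endpoints for every edge. The sketch below uses plain Scala `Map` and `Seq` in place of distributed tables and reuses the vertex and edge data from the "View a Graph as a Table" slide; the `Triplet` case class is a hypothetical stand-in, not the GraphX type.

```scala
object TripletsSketch {
  type Id = String

  // One row of the triplets view: ((srcId, srcProp), (dstId, dstProp), edgeProp).
  final case class Triplet[V, E](src: (Id, V), dst: (Id, V), attr: E)

  // Join every edge with the properties of its two endpoint vertices.
  def triplets[V, E](vertices: Map[Id, V], edges: Seq[(Id, Id, E)]): Seq[Triplet[V, E]] =
    edges.map { case (s, d, e) => Triplet((s, vertices(s)), (d, vertices(d)), e) }

  val vertices: Map[Id, String] = Map(
    "rxin" -> "Stu.", "jegonzal" -> "PstDoc", "franklin" -> "Prof.", "istoica" -> "Prof.")

  val edges: Seq[(Id, Id, String)] = Seq(
    ("rxin", "jegonzal", "Friend"), ("franklin", "rxin", "Advisor"),
    ("istoica", "franklin", "Coworker"), ("franklin", "jegonzal", "PI"))

  val joined: Seq[Triplet[String, String]] = triplets(vertices, edges)
}
```

There is exactly one triplet per edge, so the view has as many rows as the edge table. A distributed system has to ship vertex properties to wherever the edges are stored, which is where the join cost comes from.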
  • 63. Map-Reduce Triplets. Map-Reduce triplets collects information about the neighborhood of each vertex. Map: MapFunction(A→B) → (B, msg); MapFunction(A→C) → (C, msg); MapFunction(B→C) → (C, msg); MapFunction(C→D) → (D, msg). Messages may be addressed to the Src. or Dst. vertex. Reduce (Message Combiners): (B, msg); (C, msg + msg); (D, msg)
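The map and reduce phases can be mimicked on local collections. This is a simplified sketch under stated assumptions: it returns a `Map` from vertex id to the combined message rather than a new graph, splits the arguments into separate parameter lists to help Scala's type inference, and uses hypothetical vertex values. The `Edge` case class here is a stand-in for a triplet.

```scala
object MrTripletsSketch {
  type Id = String

  // A triplet: source vertex, destination vertex, edge attribute.
  final case class Edge[V, E](src: (Id, V), dst: (Id, V), attr: E)

  // Map each triplet to messages keyed by vertex id, then combine per vertex.
  def mrTriplets[V, E, T](triplets: Seq[Edge[V, E]])(
      mapF: Edge[V, E] => List[(Id, T)])(
      reduceF: (T, T) => T): Map[Id, T] =
    triplets
      .flatMap(mapF)
      .groupBy(_._1)
      .map { case (id, msgs) => id -> msgs.map(_._2).reduce(reduceF) }

  // Edges from the slide (A→B, A→C, B→C, C→D) with made-up vertex values.
  val value: Map[Id, Double] = Map("A" -> 1.0, "B" -> 2.0, "C" -> 3.0, "D" -> 4.0)
  val ts: Seq[Edge[Double, Unit]] =
    Seq("A" -> "B", "A" -> "C", "B" -> "C", "C" -> "D").map {
      case (s, d) => Edge((s, value(s)), (d, value(d)), ())
    }

  // In-degree: every edge sends 1 to its destination.
  val inDegree: Map[Id, Int] = mrTriplets(ts)(e => List(e.dst._1 -> 1))(_ + _)

  // Sum of source values arriving at each destination.
  val srcSum: Map[Id, Double] = mrTriplets(ts)(e => List(e.dst._1 -> e.src._2))(_ + _)
}
```

Vertex C receives two messages (from A and B) that the reduce function merges, mirroring the `(C, msg + msg)` row on the slide; vertex A receives none and is therefore absent from the result.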
  • 64. Using these basic GraphX operators we implemented Pregel and GraphLab in under 50 lines of code!
  • 65. The GraphX Stack (Lines of Code): Spark (30,000); GraphX (2,500); Pregel API (34); PageRank (20); Connected Comp. (20); K-core (60); Triangle Count (50); LDA (220); SVD++ (110). Some algorithms are more naturally expressed using the GraphX primitive operators
  • 66. We express enhanced Pregel and GraphLab abstractions using the GraphX operators in less than 50 lines of code!
  • 67. Enhanced Pregel in GraphX. Malewicz et al. [PODC’09, SIGMOD’10]
      pregelPR(i, messageList ):
        // Receive all the messages
        total = 0
        foreach( msg in messageList) :
          total = total + msg
        // Update the rank of this vertex
        R[i] = 0.15 + total
        // Send new messages to neighbors
        foreach(j in out_neighbors[i]) :
          Send msg(R[i]/E[i,j]) to vertex j
      Require Message Combiners (messageList becomes messageSum):
      combineMsg(a, b):
        // Compute sum of two messages
        return a + b
      Remove Message Computation from the Vertex Program:
      sendMsg(i→j, R[i], R[j], E[i,j]):
        // Compute single message
        return msg(R[i]/E[i,j])
  • 69. Distributed Graphs as Tables (RDDs). [Figure: a Property Graph on vertices A to F is stored as a Vertex Table (RDD), a Routing Table (RDD), and an Edge Table (RDD) holding edges A→B, A→C, C→D, B→C, A→E, A→F, E→F, E→D, split into Part. 1 and Part. 2 by a 2D Vertex Cut Heuristic.]
  • 70. Caching for Iterative mrTriplets. [Figure: Vertex Table (RDD) and Edge Table (RDD); each edge partition keeps a Mirror Cache of the vertices it references (A, B, C, D and A, D, E, F).]
  • 71. Incremental Updates for Iterative mrTriplets. [Figure: same layout; only the changed vertices (A and E) are sent to the Mirror Caches before the edge Scan.]
  • 72. Aggregation for Iterative mrTriplets. [Figure: same layout; after the Scan, each edge partition performs a Local Aggregate and sends the changes back to the Vertex Table.]
  • 73. Performance Comparisons. Live-Journal: 69 Million Edges. Runtime (in seconds, PageRank for 10 iterations): GraphLab 22; GraphX 68; Giraph 207; Naïve Spark 354; Mahout/Hadoop 1340. GraphX is roughly 3x slower than GraphLab
  • 74. GraphX scales to larger graphs. Twitter Graph: 1.5 Billion Edges. Runtime (in seconds, PageRank for 10 iterations): GraphLab 203; GraphX 451; Giraph 749. GraphX is roughly 2x slower than GraphLab » Scala + Java overhead: Lambdas, GC time, … » No shared memory parallelism: 2x increase in comm.
  • 75. PageRank is just one stage…. What about a pipeline?
  • 76. HDFSHDFS ComputeSpark Preprocess Spark Post. A Small Pipeline in GraphX Timed end-to-end GraphX is faster than Raw Wikipedia < / >< / >< / > XML Hyperlinks PageRank Top 20 Pages 342 1492 0 200 400 600 800 1000 1200 1400 1600 GraphLab + Spark GraphX Giraph + Spark Spark Total Runtime (in Seconds) 605 375
  • 77. Open Source Project Alpha release since Spark 0.9 Contributors? Python Bindings?
  • 78. Graph Processing Systems • Apache Giraph: Java Pregel implementation • GraphLab.org: C++ GraphLab implementation • NetworkX: Python API for small graphs • GraphLab Create: commercial GraphLab Python framework for large graphs and ML • Gephi: graph visualization framework
  • 79. Graph Database Technologies Property graph data-model for storing and retrieving graph structured data. • Neo4j: popular commercial graph database • Titan: open-source distributed graph database
  • 81. About Scala High-level language for the Java VM »Object-oriented + functional programming Statically typed »Comparable in speed to Java »But often no need to write types due to type inference Interoperates with Java »Can use any Java class, inherit from it, etc; can also call Scala code from Java
  • 82. Quick Tour
      Declaring variables:
        var x: Int = 7
        var x = 7      // type inferred
        val y = "hi"   // read-only
      Java equivalent:
        int x = 7;
        final String y = "hi";
      Functions:
        def square(x: Int): Int = x*x
        def min(a: Int, b: Int): Int = { if (a < b) a else b }
        def announce(text: String): Unit = { println(text) }
      Java equivalent:
        int square(int x) { return x*x; }
        void announce(String text) { System.out.println(text); }
  • 83. Quick Tour
      Generic types:
        var arr = new Array[Int](8)
        var lst = List(1, 2, 3)   // type of lst is List[Int]
      Java equivalent:
        int[] arr = new int[8];
        List<Integer> lst = new ArrayList<Integer>();
        lst.add(...)
      Indexing:
        arr(5) = 7
        println(lst(2))   // index must be within the 3-element list
      Java equivalent:
        arr[5] = 7;
        System.out.println(lst.get(2));
  • 84. Quick Tour. Processing collections with functional programming:
        val list = List(1, 2, 3)
        list.foreach(x => println(x))   // prints 1, 2, 3
        list.foreach(println)           // same
        list.map(x => x + 2)            // => List(3, 4, 5)
        list.map(_ + 2)                 // same, with placeholder notation
        list.filter(x => x % 2 == 1)    // => List(1, 3)
        list.filter(_ % 2 == 1)         // => List(1, 3)
        list.reduce((x, y) => x + y)    // => 6
        list.reduce(_ + _)              // => 6
      An argument such as x => x + 2 is a function expression (closure). All of these leave the list unchanged (List is immutable)
  • 85. Other Collection Methods. Scala collections provide many other functional methods; for example, Google for “Scala Seq”
      Method on Seq[T] | Explanation
      map(f: T => U): Seq[U] | Pass each element through f
      flatMap(f: T => Seq[U]): Seq[U] | One-to-many map
      filter(f: T => Boolean): Seq[T] | Keep elements passing f
      exists(f: T => Boolean): Boolean | True if one element passes
      forall(f: T => Boolean): Boolean | True if all elements pass
      reduce(f: (T, T) => T): T | Merge elements using f
      groupBy(f: T => K): Map[K,List[T]] | Group elements by f(element)
      sortBy(f: T => K): Seq[T] | Sort elements by f(element)
      . . .