Web-Scale Graph Analytics with Apache Spark with Tim Hunter

Graph analytics has a wide range of applications, from information propagation and network flow optimization to fraud and anomaly detection. The rise of social networks and the Internet of Things has given us complex web-scale graphs with billions of vertices and edges. However, in order to extract the hidden gems of understanding and information within those graphs, you need tools to analyze the graphs easily and efficiently.

At Spark Summit 2016, Databricks introduced GraphFrames, which implements graph queries and pattern matching on top of Spark SQL to simplify graph analytics. In this talk, we’ll discuss the work that has made graph algorithms in GraphFrames faster and more scalable. For example, new implementations of connected components have received algorithm improvements based on recent research, as well as performance improvements from Spark DataFrames. Discover lessons learned from scaling the implementation from millions to billions of nodes; see its performance in the context of other popular graph libraries; and hear about real-world applications.

Published in: Data & Analytics

  1. Web-Scale Graph Analytics with Apache Spark. Tim Hunter, Databricks. #EUds6
  2. About Me • Tim Hunter • Software engineer @ Databricks • Ph.D. from UC Berkeley in Machine Learning • Very early Spark user • Contributor to MLlib • Co-author of TensorFrames, GraphFrames, Deep Learning Pipelines
  3. Outline • Why GraphFrames • Writing scalable graph algorithms with Spark – Where is my vertex? Indexing data – Connected Components: implementing complex algorithms with Spark and GraphFrames – The Social Network: real-world issues • Future of GraphFrames
  4. Graphs are everywhere. Example: airports (JFK, IAD, LAX, SFO, SEA, DFW) & flights between them. Vertices (id, City, State): ("JFK", "New York", NY). Edges (src, dst, delay, tripID): ("JFK", "SEA", 45, 1058923).
  5. Apache Spark’s GraphX library • General-purpose graph processing library • Built into Spark • Optimized for fast distributed computing • Library of algorithms: PageRank, Connected Components, etc. Issues: • No Java, Python APIs • Lower-level RDD-based API (vs. DataFrames) • Cannot use recent Spark optimizations: Catalyst query optimizer, Tungsten memory management
  6. The GraphFrames Spark Package. Brings DataFrames API for Spark • Simplifies interactive queries • Benefits from DataFrames optimizations • Integrates with the rest of Spark ecosystem. Collaboration between Databricks, UC Berkeley & MIT
  7. DataFrame-based representation. Airports: JFK, IAD, LAX, SFO, SEA, DFW. vertices DataFrame (id, City, State): ("JFK", "New York", NY), ("SEA", "Seattle", WA). edges DataFrame (src, dst, delay, tripID): ("JFK", "SEA", 45, 1058923), ("DFW", "SFO", -7, 4100224).
  8. Supported graph algorithms • Find Vertices: – PageRank – Motif finding • Communities: – Connected Components – Strongly Connected Components – Label propagation (LPA) • Paths: – Breadth-first search – Shortest paths • Other: – Triangle count – SVD++ (Bold: native DataFrame implementation)
  9. Assigning integral vertex IDs … lessons learned
  10. Pros of integer vertex IDs. GraphFrames take arbitrary vertex IDs → convenient for users. Algorithms prefer integer vertex IDs → optimize in-memory storage → reduce communication. Our task: map unique vertex IDs to unique (long) integers.
  11. The hashing trick? • Possible solution: hash vertex ID to long integer • What is the chance of collision? – 1 - (N-1)/N * (N-2)/N * … – seems unlikely with long range N = 2^64 – with 1 billion nodes, the chance is ~5.4% • Problem: collisions change graph topology. Example (Name, Hash): (Sue Ann, 84088), (Joseph, -2070372689), (Xiangrui, 264245405), (Felix, 67762524).
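The collision estimate on this slide can be checked with the standard birthday-problem approximation, 1 - exp(-k(k-1)/(2N)) for k IDs hashed uniformly into N values. The sketch below is plain Python and assumes a uniform hash; the function name is illustrative. Under that approximation, 1 billion IDs in a 2^64 range give roughly 2.7%, and a 63-bit (non-negative long) range gives roughly 5.3%, so the slide's ~5.4% looks closer to the 63-bit case. That reading is an assumption, not something the slides state.

```python
import math


def collision_probability(num_ids: int, hash_space: int) -> float:
    """Birthday-problem approximation: 1 - exp(-k(k-1) / (2N))."""
    exponent = -num_ids * (num_ids - 1) / (2 * hash_space)
    # expm1 keeps precision when the exponent is close to zero.
    return -math.expm1(exponent)


one_billion = 10**9
p_64 = collision_probability(one_billion, 2**64)  # full 64-bit range
p_63 = collision_probability(one_billion, 2**63)  # non-negative longs only

print(f"64-bit range: {p_64:.2%}")
print(f"63-bit range: {p_63:.2%}")
```

Either way the probability is far from negligible at this scale, which is the slide's point: a single collision merges two vertices and changes the graph.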
  12. Generating unique IDs. Spark has built-in methods to generate unique IDs. • RDD: zipWithUniqueId(), zipWithIndex() • DataFrame: monotonically_increasing_id(). Possible solution: just use these methods
  13. How it works. Partition 1 (Vertex, ID): (Sue Ann, 0), (Joseph, 1). Partition 2 (Vertex, ID): (Xiangrui, 100 + 0), (Felix, 100 + 1). Partition 3 (Vertex, ID): (Veronica, 200 + 0), (…, 200 + 1).
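The scheme on this slide gives every partition its own block of IDs, so no coordination between partitions is needed. Here is a minimal plain-Python sketch of that idea, with partitions modeled as lists; `assign_ids` and `stride` are illustrative names. The slide uses a block size of 100 for readability; Spark's documentation for `monotonically_increasing_id()` describes the partition ID in the upper 31 bits and the record number in the lower 33 bits, which corresponds to a stride of 2^33.

```python
from typing import Dict, List


def assign_ids(partitions: List[List[str]], stride: int) -> Dict[str, int]:
    """Give each record the ID partition_index * stride + position_in_partition."""
    ids: Dict[str, int] = {}
    for p_index, records in enumerate(partitions):
        for offset, vertex in enumerate(records):
            ids[vertex] = p_index * stride + offset
    return ids


partitions = [["Sue Ann", "Joseph"], ["Xiangrui", "Felix"], ["Veronica"]]

slide_ids = assign_ids(partitions, stride=100)          # block size used on the slide
spark_like_ids = assign_ids(partitions, stride=1 << 33)  # layout documented for Spark

print(slide_ids)
```

The IDs are unique as long as no partition holds more than `stride` records, but they depend on which partition a record lands in and on its position there.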
  14. … but not always • DataFrames/RDDs are immutable and reproducible by design. • However, records do not always have stable orderings. – distinct – repartition • cache() does not help. Example: after a repartition or distinct shuffle, Partition 1 (Vertex, ID) may hold (Xiangrui, 0), (Joseph, 1) in one evaluation and (Joseph, 0), (Xiangrui, 1) in another.
  15. Our implementation. We implemented (v0.5.0) an expensive but correct version: 1. (hash) re-partition + distinct vertex IDs 2. sort vertex IDs within each partition 3. generate unique integer IDs
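The three steps can be sketched in plain Python to show why the result no longer depends on record order. This is an illustration of the idea, not the GraphFrames code: `stable_partition` and `index_vertices` are hypothetical names, and `zlib.crc32` stands in for whatever hash partitioner is actually used.

```python
import zlib
from typing import Dict, Iterable, List, Set


def stable_partition(vertex: str, num_partitions: int) -> int:
    # crc32 is deterministic across runs, unlike Python's built-in hash() for str.
    return zlib.crc32(vertex.encode("utf-8")) % num_partitions


def index_vertices(vertices: Iterable[str], num_partitions: int, stride: int) -> Dict[str, int]:
    # Step 1: hash-partition and de-duplicate the vertex IDs.
    buckets: List[Set[str]] = [set() for _ in range(num_partitions)]
    for v in vertices:
        buckets[stable_partition(v, num_partitions)].add(v)

    ids: Dict[str, int] = {}
    for p_index, bucket in enumerate(buckets):
        # Step 2: sort within the partition so positions are deterministic.
        for offset, v in enumerate(sorted(bucket)):
            # Step 3: derive the integer ID from partition index and position.
            ids[v] = p_index * stride + offset
    return ids


names = ["Sue Ann", "Joseph", "Xiangrui", "Felix", "Veronica", "Joseph"]
first = index_vertices(names, num_partitions=3, stride=100)
second = index_vertices(list(reversed(names)), num_partitions=3, stride=100)
print(first == second)
```

Because both the partition assignment and the order within a partition are functions of the vertex ID alone, re-evaluating the pipeline yields the same mapping. The cost is the extra shuffle and sort, which matches the slide's "expensive but correct".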
  16. Connected Components
  17. Connected Components. Assign each vertex a component ID such that vertices receive the same component ID iff they are connected. Applications: – fraud detection • Spark Summit 2016 keynote from Capital One – clustering – entity resolution
  18. Naive implementation (GraphX) 1. Assign each vertex a unique component ID. 2. Iterate until convergence: – For each vertex v, update: component ID of v ← smallest component ID in neighborhood of v. Pro: easy to implement. Con: slow convergence on large-diameter graphs
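A small plain-Python version of this naive scheme makes the convergence problem concrete. It is a single-machine stand-in, not the GraphX API, and `naive_components` is an illustrative name. On a path graph the smallest label moves one hop per round, so the number of rounds grows with the diameter.

```python
from typing import Dict, List, Set, Tuple


def naive_components(vertices: List[int], edges: List[Tuple[int, int]]) -> Tuple[Dict[int, int], int]:
    """Min-label propagation; returns (component labels, number of rounds that changed a label)."""
    neighbors: Dict[int, Set[int]] = {v: set() for v in vertices}
    for a, b in edges:
        neighbors[a].add(b)
        neighbors[b].add(a)

    component = {v: v for v in vertices}  # step 1: unique component ID per vertex
    rounds = 0
    while True:
        # Step 2: every vertex takes the smallest label in its closed neighborhood.
        updated = {
            v: min([component[v]] + [component[n] for n in neighbors[v]])
            for v in vertices
        }
        if updated == component:
            return component, rounds
        component = updated
        rounds += 1


path_vertices = list(range(8))
path_edges = [(i, i + 1) for i in range(7)]  # 0 - 1 - 2 - ... - 7
labels, rounds = naive_components(path_vertices, path_edges)
print(rounds)
```

For this 8-vertex path the loop needs 7 label-changing rounds, one per hop from vertex 0 to vertex 7.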
  19. Small-/large-star algorithm. Kiveris et al. "Connected Components in MapReduce and Beyond." 1. Assign each vertex a unique ID. 2. Iterate until convergence: – (small-star) for each vertex, connect smaller neighbors to smallest neighbor – (big-star) for each vertex, connect bigger neighbors to smallest neighbor (or itself)
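The two operations can be sketched on an in-memory edge set. This is a plain-Python reading of the slide and of the cited paper, not the GraphFrames implementation; the function names are illustrative. One assumption worth calling out: in small-star the vertex itself is also connected to the minimum, so that connectivity is preserved. Edges are stored as (larger, smaller) pairs.

```python
from typing import Dict, Iterable, List, Set, Tuple

Edge = Tuple[int, int]


def neighbors_of(edges: Set[Edge]) -> Dict[int, Set[int]]:
    nbrs: Dict[int, Set[int]] = {}
    for a, b in edges:
        nbrs.setdefault(a, set()).add(b)
        nbrs.setdefault(b, set()).add(a)
    return nbrs


def large_star(edges: Set[Edge]) -> Set[Edge]:
    """For each vertex, connect its larger neighbors to its smallest neighbor (or itself)."""
    out: Set[Edge] = set()
    for u, nbrs in neighbors_of(edges).items():
        m = min(nbrs | {u})
        out.update((v, m) for v in nbrs if v > u)
    return out


def small_star(edges: Set[Edge]) -> Set[Edge]:
    """For each vertex, connect its smaller neighbors and itself to its smallest neighbor."""
    out: Set[Edge] = set()
    for u, nbrs in neighbors_of(edges).items():
        smaller = {v for v in nbrs if v < u}
        m = min(smaller | {u})
        out.update((v, m) for v in smaller | {u} if v != m)
    return out


def star_components(vertices: Iterable[int], edges: List[Edge]) -> Dict[int, int]:
    current = {(max(a, b), min(a, b)) for a, b in edges if a != b}
    while True:
        updated = small_star(large_star(current))
        if updated == current:  # nothing changed in a full round
            break
        current = updated
    labels = {v: v for v in vertices}  # isolated vertices keep their own ID
    for child, root in current:
        labels[child] = root
    return labels


edges = [(1, 5), (5, 7), (7, 8), (8, 9), (2, 3), (3, 4)]
labels = star_components(range(1, 10), edges)
print(labels)
```

On this input each component ends up as a star centered on its smallest vertex: the path 1, 5, 7, 8, 9 collapses onto vertex 1, the path 2, 3, 4 onto vertex 2, and the isolated vertex 6 keeps its own ID.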
  20. Small-star operation. Kiveris et al., Connected Components in MapReduce and Beyond.
  21. Big-star operation. Kiveris et al., Connected Components in MapReduce and Beyond.
  22. Another interpretation: adjacency matrix. [Figure: adjacency matrix over vertices 1, 5, 7, 8, 9]
  23. Small-star operation: rotate & lift. [Figure: adjacency matrices over vertices 1, 5, 7, 8, 9 before and after the operation]
  24. Big-star operation: lift. [Figure: adjacency matrices over vertices 1, 5, 7, 8, 9 before and after the operation]
  25. Convergence. [Figure: adjacency matrix over vertices 1, 5, 7, 8, 9 in which only row 1 has entries]
  26. Properties of the algorithm • Small-/big-star operations do not change graph connectivity. • Extra edges are pruned during iterations. • Each connected component converges to a star graph. • Converges in log2(#nodes) iterations
  27. Implementation. Iterate: • filter • self-join. Challenge: handle these operations at scale.
  28. Skewed joins. Real-world graphs contain big components. The “Justin Bieber problem” at Twitter → data skew during connected components iterations. Example join: edges (src, dst): (0, 1), (0, 2), (0, 3), (0, 4), …, (0, 2,000,000), (1, 3), (2, 5), joined with (src, Component id, neighbors): (0, 0, 2,000,000), (1, 0, 10), (2, 3, 5).
  29. Skewed joins. Edges (src, dst) for the heavy key, (0, 1), …, (0, 2,000,000), go through a broadcast join (#nbrs > 1,000,000); the remaining edges, (1, 3), (2, 5), go through a hash join; the two results are combined with a union. Joined table (src, Component id, neighbors): (0, 0, 2,000,000), (1, 0, 10), (2, 3, 5).
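The split-and-union pattern from this slide can be imitated in plain Python to show that it returns the same rows as an ordinary join. This is only a structural sketch under stated assumptions: `skew_aware_join` is a hypothetical name, dictionaries stand in for Spark's broadcast and hash-join machinery, and the threshold is scaled down from the slide's 1,000,000.

```python
from collections import Counter, defaultdict
from typing import Dict, List, Tuple


def skew_aware_join(
    edges: List[Tuple[int, int]],
    components: Dict[int, int],
    threshold: int,
) -> List[Tuple[int, int, int]]:
    """Join edges with components on src, treating high-degree keys separately."""
    degree = Counter(src for src, _ in edges)
    heavy = {key for key, count in degree.items() if count > threshold}

    # "Broadcast" side: the few component rows for heavy keys, small enough
    # to copy to every worker, so the heavy edges never have to be shuffled.
    broadcast_table = {key: components[key] for key in heavy if key in components}
    broadcast_part = [(s, d, broadcast_table[s]) for s, d in edges if s in broadcast_table]

    # "Hash join" side: bucket the remaining edges by key, then probe.
    buckets: Dict[int, List[int]] = defaultdict(list)
    for s, d in edges:
        if s not in heavy:
            buckets[s].append(d)
    hash_part = [
        (s, d, components[s])
        for s, dsts in buckets.items()
        if s in components
        for d in dsts
    ]

    return broadcast_part + hash_part  # union of both results


components = {0: 0, 1: 0, 2: 3}
edges = [(0, d) for d in range(1, 21)] + [(1, 3), (2, 5)]
joined = skew_aware_join(edges, components, threshold=10)
plain = [(s, d, components[s]) for s, d in edges if s in components]
print(sorted(joined) == sorted(plain))
```

In a distributed setting the benefit is that the single heavy key no longer lands on one overloaded partition; the result set itself is unchanged.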
  30. Checkpointing. We checkpoint every 2 iterations to avoid: • query plan explosion (exponential growth) • optimizer slowdown • disk out of shuffle space • unexpected node failures
  31. Experiments. twitter-2010 from WebGraph datasets (small diameter) – 42 million vertices, 1.5 billion edges. 16 r3.4xlarge workers on Databricks – GraphX: 4 minutes – GraphFrames: 6 minutes • algorithm difference, checkpointing, checking skewness
  32. Experiments. uk-2007-05 from WebGraph datasets – 105 million vertices, 3.7 billion edges. 16 r3.4xlarge workers on Databricks – GraphX: 25 minutes • slow convergence – GraphFrames: 4.5 minutes
  33. Experiments. Regular grid 32,000 x 32,000 (large diameter) – 1 billion nodes, 4 billion edges. 32 r3.8xlarge workers on Databricks – GraphX: failed – GraphFrames: 1 hour
  34. Future improvements. GraphFrames • better graph partitioning • letting Spark SQL handle skewed joins and iterations • graph compression. Connected Components • local iterations • node pruning and better stopping criteria
  35. Thank you! • http://graphframes.github.io • https://docs.databricks.com
  37. 2 types of graph representations. Algorithm-based: standard & custom algorithms; optimized for batch processing. Query-based: motif finding; point queries & updates. GraphFrames: both algorithms & queries (but not point updates)
  38. Graph analysis with GraphFrames: simple queries, motif finding, graph algorithms
  39. Simple queries. SQL queries on vertices & edges. Simple graph queries (e.g., vertex degrees)
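A vertex-degree query is a group-and-count over the edge table. The plain-Python sketch below shows the computation on a small edge list; GraphFrames exposes the same information as DataFrames (`g.inDegrees`, `g.outDegrees`, `g.degrees`). Only the JFK to SEA and DFW to SFO flights come from the slides; the other two edges are made up for illustration.

```python
from collections import Counter

# (src, dst) pairs; the last two flights are hypothetical.
edges = [("JFK", "SEA"), ("DFW", "SFO"), ("SEA", "SFO"), ("SFO", "JFK")]

out_degree = Counter(src for src, _ in edges)  # departures per airport
in_degree = Counter(dst for _, dst in edges)   # arrivals per airport
degree = out_degree + in_degree                # total incident edges

for airport, count in sorted(degree.items()):
    print(airport, count)
```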
  40. Motif finding. Airports: JFK, IAD, LAX, SFO, SEA, DFW. Search for structural patterns within a graph. val paths: DataFrame = g.find("(a)-[e1]->(b); (b)-[e2]->(c); !(c)-[]->(a)")
  44. Motif finding. Airports: JFK, IAD, LAX, SFO, SEA, DFW, with three of them labeled (a), (b), (c). Search for structural patterns within a graph. val paths: DataFrame = g.find("(a)-[e1]->(b); (b)-[e2]->(c); !(c)-[]->(a)"). Then filter using vertex & edge data. paths.filter("e1.delay > 20")
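To make the pattern concrete, here is a plain-Python sketch that evaluates the same motif on a tiny edge list: a flight a to b, a flight b to c, and no flight from c back to a, followed by the `e1.delay > 20` filter. It is not how GraphFrames executes `find`; the function name is illustrative, the sketch does not require a, b, and c to be distinct, and all delays except JFK to SEA (45) and DFW to SFO (-7) are invented.

```python
from typing import Dict, List, NamedTuple


class Edge(NamedTuple):
    src: str
    dst: str
    delay: int


def find_two_hop_without_return(edges: List[Edge]) -> List[dict]:
    """Match (a)-[e1]->(b); (b)-[e2]->(c); !(c)-[]->(a)."""
    by_src: Dict[str, List[Edge]] = {}
    for e in edges:
        by_src.setdefault(e.src, []).append(e)
    existing = {(e.src, e.dst) for e in edges}

    matches = []
    for e1 in edges:
        for e2 in by_src.get(e1.dst, []):
            if (e2.dst, e1.src) not in existing:  # negated term: no edge c -> a
                matches.append({"a": e1.src, "e1": e1, "b": e1.dst, "e2": e2, "c": e2.dst})
    return matches


edges = [
    Edge("JFK", "SEA", 45),
    Edge("SEA", "SFO", 10),  # hypothetical
    Edge("SFO", "JFK", 5),   # hypothetical
    Edge("DFW", "SFO", -7),
    Edge("SEA", "DFW", 30),  # hypothetical
]

paths = find_two_hop_without_return(edges)
delayed = [p for p in paths if p["e1"].delay > 20]
for p in delayed:
    print(p["a"], "->", p["b"], "->", p["c"])
```

The filter step mirrors `paths.filter("e1.delay > 20")`: matching finds the structure first, and conditions on vertex or edge attributes are applied to the resulting rows.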
  45. Graph algorithms. Find important vertices • PageRank. Find paths between sets of vertices • Breadth-first search (BFS) • Shortest paths. Find groups of vertices (components, communities) • Connected components • Strongly connected components • Label Propagation Algorithm (LPA). Other • Triangle counting • SVDPlusPlus
  46. Saving & loading graphs. Save & load the DataFrames. Load: vertices = sqlContext.read.parquet(...); edges = sqlContext.read.parquet(...); g = GraphFrame(vertices, edges). Save: g.vertices.write.parquet(...); g.edges.write.parquet(...)
