Challenging Web-Scale Graph
Analytics with Apache Spark
Xiangrui Meng
Spark Summit 2017
About me
• Software Engineer at Databricks
• machine learning and data science/engineering
• Committer and PMC member of Apache Spark
• MLlib, SparkR, PySpark, Spark Packages, etc
2
GraphFrames
3
GraphFrames
• A Spark package introduced in 2016 (graphframes.github.io)
• collaboration between Databricks, UC Berkeley, and MIT
• GraphFrames are to DataFrames as GraphX is to RDDs
• Python, Java, and Scala APIs,
• expressive graph queries,
• query plan optimizers from Spark SQL,
• graph algorithms.
4
Quick examples
Launch a Spark shell with GraphFrames:
spark-shell --packages graphframes:graphframes:0.5.0-spark2.1-s_2.11
Or try it on Databricks Community Edition (databricks.com/try).
5
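For the examples that follow, g is a GraphFrame built from a vertex DataFrame and an edge DataFrame. A minimal PySpark sketch (the vertex and edge values are made up for illustration; assumes a pyspark session launched with the same --packages flag):

from graphframes import GraphFrame

vertices = spark.createDataFrame(
    [("a", "Alice"), ("b", "Bob"), ("c", "Carol")], ["id", "name"])
edges = spark.createDataFrame(
    [("a", "b", "follows"), ("b", "c", "follows")], ["src", "dst", "relationship"])

g = GraphFrame(vertices, edges)  # vertices need an "id" column; edges need "src" and "dst"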
Quick examples
Find 2nd-degree followers:
g.find("(A)-[]->(B); (B)-[]->(C); !(A)-[]->(C)")
 .filter("A.id != C.id")
 .select("A", "C")
6
Quick examples
Compute PageRank:
g.pageRank(resetProbability=0.15, maxIter=20)
7
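pageRank returns a new GraphFrame whose vertices carry a pagerank column and whose edges carry a weight column; a short sketch of reading the results:

results = g.pageRank(resetProbability=0.15, maxIter=20)
results.vertices.select("id", "pagerank").orderBy("pagerank", ascending=False).show(5)
results.edges.select("src", "dst", "weight").show(5)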
Supported graph algorithms
• breadth-first search (BFS)
• connected components
• strongly connected components
• label propagation algorithm (LPA)
• PageRank and personalized PageRank
• shortest paths
• triangle count
8
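A few hedged Python invocations of the algorithms above (vertex IDs and landmarks are illustrative; connectedComponents additionally requires a checkpoint directory to be set):

paths = g.bfs(fromExpr="id = 'a'", toExpr="id = 'c'")   # breadth-first search
cc = g.connectedComponents()                            # connected components
communities = g.labelPropagation(maxIter=5)             # label propagation (LPA)
sp = g.shortestPaths(landmarks=["a", "b"])              # shortest paths to landmarks
tc = g.triangleCount()                                  # triangles per vertex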
Moving implementations to DataFrames
• Several algorithms in GraphFrames are simple wrappers over
GraphX RDD implementations, which do not scale very well.
• DataFrames are optimized for a huge number of small records.
• columnar storage
• code generation
• query optimization
9
Assigning integral vertex IDs
… lessons learned
10
Pros of having integral vertex IDs
GraphFrames take string vertex identifiers, whose values are not
used in graph algorithms. Having integral vertex IDs can help
• optimize in-memory storage,
• save communication.
So the task is to map unique vertex identifiers to unique (long) integers.
11
The hashing trick?
• It is easy to hash the vertex identifier to a long integer.
• What is the chance of collision?
• 1 - (1 - 1/N)(1 - 2/N) ... (1 - (k-1)/N), for k keys hashed into a range of size N
• seems unlikely with a 64-bit range, N = 2^64
• with 1 billion nodes, the chance is ~5.4%
• And collisions change graph topology.
12
Name Hash
Tim 84088
Joseph -2070372689
Xiangrui 264245405
Felix 67762524
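The hash values in the table are consistent with Java/Scala's String.hashCode (for example, "Tim".hashCode == 84088). A small Python sketch of that 32-bit hash, just to make the collision discussion concrete:

def java_string_hash(s):
    # Java/Scala String.hashCode: h = 31*h + char, wrapping in signed 32-bit arithmetic
    h = 0
    for ch in s:
        h = (31 * h + ord(ch)) & 0xFFFFFFFF
    return h - 0x100000000 if h >= 0x80000000 else h

java_string_hash("Tim")    # 84088
java_string_hash("Felix")  # 67762524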
Generating unique IDs
Spark has built-in methods to generate unique IDs.
• RDD: zipWithUniqueId()/zipWithIndex()
• DataFrame: monotonically_increasing_id()
So given a DataFrame of distinct vertex identifiers, we can add a new column
with generated unique long IDs. Simple?
13
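A minimal sketch of that "simple" approach (the vertices DataFrame and its id column are assumptions):

from pyspark.sql import functions as F

vertex_ids = vertices.select("id").distinct()
with_long_ids = vertex_ids.withColumn("long_id", F.monotonically_increasing_id())
# long_id encodes (partition index, record position), so the IDs are unique only
# as long as the partitioning and record order stay stable across re-computations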
How does it work?
14
Partition 1
  Vertex    ID
  Tim       0
  Joseph    1
Partition 2
  Vertex    ID
  Xiangrui  100 + 0
  Felix     100 + 1
Partition 3
  Vertex    ID
  …         200 + 0
  …         200 + 1
… but it does not always work
• DataFrames/RDDs are immutable and reproducible by design.
• However, records do not always have stable order.
• distinct
• repartition
• And cache doesn’t help.
15
Partition 1 (first computation)
  Vertex    ID
  Tim       0
  Joseph    1
re-compute ->
Partition 1 (after re-computation)
  Vertex    ID
  Joseph    0
  Tim       1
Our implementation
In v0.5.0 we implemented an expensive but correct version:
1. (hash) re-partition + distinct vertex identifiers,
2. sort vertex identifiers within each partition,
3. generate unique integral IDs
16
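A rough sketch of those three steps (not the actual GraphFrames source; the column name "id" and the partition count are assumptions):

def assign_long_ids(vertices, num_partitions=200):
    distinct_ids = vertices.select("id").distinct()               # step 1: distinct identifiers
    partitioned = distinct_ids.repartition(num_partitions, "id")  # step 1: hash re-partition
    ordered = partitioned.sortWithinPartitions("id")              # step 2: sort within each partition
    # step 3: with a deterministic order, zipWithIndex yields stable unique long IDs
    return ordered.rdd.map(lambda row: row[0]).zipWithIndex().toDF(["id", "long_id"])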
Connected Components
17
Connected Components
• Assign each vertex a component ID such that vertices receive the
same component ID iff they are connected.
• Applications:
• fraud detection
• Spark Summit 2016 keynote from Capital One
• clustering
18
[Figure: a small example graph with vertices 1, 2, and 3.]
A naive implementation
1. Assign each vertex a unique component ID.
2. Run in batches until convergence:
• For each vertex v, update its component ID to the smallest
component ID among its neighbors’ and its own.
• easy to implement
• slow convergence on large-diameter graphs
19
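A minimal DataFrame sketch of one round of this naive update (not the GraphFrames code; assumes cc has columns (id, component) and edges has (src, dst)):

from pyspark.sql import functions as F

def naive_cc_round(cc, edges):
    # treat edges as undirected: each endpoint sends its component ID to the other side
    msgs = (edges.join(cc, edges.src == cc.id).select(F.col("dst").alias("id"), "component")
            .union(edges.join(cc, edges.dst == cc.id).select(F.col("src").alias("id"), "component")))
    mins = msgs.groupBy("id").agg(F.min("component").alias("nbr_min"))
    # keep the smaller of the vertex's own component ID and its neighbors' minimum
    return cc.join(mins, "id", "left").select(
        "id", F.least(F.col("component"), F.coalesce(F.col("nbr_min"), F.col("component"))).alias("component"))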
Small-/large-star algorithm [Kiveris14]
Kiveris et al., Connected Components in MapReduce and Beyond.
1. Assign each vertex a unique ID.
2. Alternately update edges in batches until convergence:
• (small-star) for each vertex, connect its smaller neighbors to the
smallest neighbor vertex
• (big-star) for each vertex, connect its bigger neighbors to the
smallest neighbor vertex (or itself)
20
Small-star operation
21
Kiveris et al., Connected Components in MapReduce and Beyond.
Big-star operation
22
Kiveris et al., Connected Components in MapReduce and Beyond.
Another interpretation
23
[Figure: the edge list shown as an adjacency matrix over vertices 1, 5, 7, 8, 9, with an x marking each edge.]
Small-star operation
24
[Figure: before/after adjacency matrices showing the small-star operation as a "rotate & lift" of the matrix entries.]
Big-star operation
25
[Figure: before/after adjacency matrices showing the big-star operation as a "lift" of the matrix entries.]
Convergence
26
[Figure: at convergence, all entries sit in the first row of the adjacency matrix, i.e., every vertex in the component points directly to vertex 1 (a star graph).]
Small-/big-star algorithm
• Small-/big-star operations do not change graph connectivity.
• Extra edges are pruned during iterations.
• Each connected component converges to a star graph.
Kiveris et al. proved that one variation of the algorithm converges in
log²(#nodes) iterations. In GraphFrames we chose a variation that
alternates small- and big-star operations.
27
Implementation
Essentially, the small-/big-star operations map to a sequence of
filters and self-joins on DataFrames. So we need to handle the
following operations at scale:
• joins
• iterations
28
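As an illustration of how these operations reduce to filters and self-joins, here is a hedged sketch of one big-star round (not the GraphFrames implementation; it ignores pruning and edge-orientation details):

from pyspark.sql import functions as F

def big_star(edges):
    # list every (vertex, neighbor) pair once with the vertex as the "center"
    nbrs = (edges.select(F.col("src").alias("v"), F.col("dst").alias("nbr"))
                 .union(edges.select(F.col("dst").alias("v"), F.col("src").alias("nbr"))))
    # smallest neighbor of each vertex, including the vertex itself
    min_nbr = (nbrs.union(nbrs.select("v", F.col("v").alias("nbr")))
                   .groupBy("v").agg(F.min("nbr").alias("min_nbr")))
    # connect each neighbor larger than v to min_nbr(v)
    return (nbrs.join(min_nbr, "v")
                .where(F.col("nbr") > F.col("v"))
                .select(F.col("nbr").alias("src"), F.col("min_nbr").alias("dst"))
                .distinct())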
Skewed joins
A real-world graph usually contains a big component, which leads to
data skew in the connected components iterations.
29
[Figure: joining the edge table (src, dst) with a per-vertex table of neighbor counts; vertex 0 has 2,000,000 neighbors, so the join keys are heavily skewed.]
Skewed joins
30
[Figure: the join is split by key: rows whose key has more than 1,000,000 neighbors are handled with a broadcast join, the remaining rows with a regular hash join, and the two results are unioned.]
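A hedged sketch of splitting a skewed join by key frequency (the threshold and column names are illustrative, not the exact GraphFrames logic):

from pyspark.sql import functions as F

def skew_aware_join(edges, vertices, threshold=1000000):
    heavy_keys = (edges.groupBy("src").count()
                       .where(F.col("count") > threshold).select("src"))
    heavy_edges = edges.join(F.broadcast(heavy_keys), "src", "left_semi")
    light_edges = edges.join(F.broadcast(heavy_keys), "src", "left_anti")
    heavy_vertices = vertices.join(
        F.broadcast(heavy_keys.withColumnRenamed("src", "id")), "id", "left_semi")
    # broadcast join for the few heavy keys, regular shuffle join for the rest
    joined_heavy = heavy_edges.join(F.broadcast(heavy_vertices), heavy_edges.src == heavy_vertices.id)
    joined_light = light_edges.join(vertices, light_edges.src == vertices.id)
    return joined_heavy.union(joined_light)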
Checkpointing
We checkpoint every 2 iterations to avoid:
• the query plan getting too big (exponential growth)
• the optimizer taking too long
• running out of shuffle disk space
• unexpected node failures
31
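A minimal sketch of that pattern in an iterative DataFrame job (load_edges, one_round, converged, and max_iter are hypothetical stand-ins, and the checkpoint path is an assumption):

spark.sparkContext.setCheckpointDir("/tmp/spark-checkpoints")

edges = load_edges()                # hypothetical loader returning an edge DataFrame
for i in range(max_iter):
    edges = one_round(edges)        # one small-/big-star round (hypothetical)
    if (i + 1) % 2 == 0:
        edges = edges.checkpoint()  # materialize and truncate the query plan / lineage
    if converged(edges):            # hypothetical convergence check
        break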
Experiments
• twitter-2010 from WebGraph datasets (small diameter)
• 42 million vertices, 1.5 billion edges
• 16 r3.4xlarge workers on Databricks
• GraphX: 4 minutes
• GraphFrames: 6 minutes
• slower due to algorithm differences, checkpointing, and skew checking
32
Experiments
• uk-2007-05 from WebGraph datasets
• 105 million vertices, 3.7 billion edges
• 16 r3.4xlarge workers on Databricks
• GraphX: 25 minutes
• slow convergence
• GraphFrames: 4.5 minutes
33
Experiments
• regular grid 32,000 x 32,000 (large diameter)
• 1 billion nodes, 4 billion edges
• 32 r3.8xlarge workers on Databricks
• GraphX: failed
• GraphFrames: 1 hour
34
Experiments
• regular grid 50,000 x 50,000 (large diameter)
• 2.5 billion nodes, 10 billion edges
• 32 r3.8xlarge workers on Databricks
• GraphX: failed
• GraphFrames: 1.6 hours
35
Future improvements
• update inefficient code (due to Spark 1.6 compatibility)
• better graph partitioning
• local iterations
• node pruning and better stop criterion
• letting Spark SQL handle skewed joins and iterations
• graph compression
• prove log(N) iterations or maybe a better algorithm?
36
Thank You
• graphframes.github.io
• docs.databricks.com
