1. GraphFrames: Graph Queries in
Apache Spark SQL
Ankur Dave
UC Berkeley AMPLab
Joint work with Alekh Jindal (Microsoft), Li Erran Li
(Uber), Reynold Xin (Databricks), Joseph Gonzalez (UC
Berkeley), and Matei Zaharia (MIT and Databricks)
7. Problem: Mixed Graph Analysis
[Figure: analysis pipeline over Raw Wikipedia (XML). Tables and graphs shown: Text Table (Article Text), Edit Table (User, Article), Edit Graph (User, Article), Hyperlinks, User-User graph. Results shown: PageRank, Vandalism Suspects, Frequent Collaborators.]
12. Materialized View Selection
GraphX: Triplet view enabled efficient message-passing algorithms
Vertices: A, B, C, D
Edges: A → B, A → C, B → C, C → D
Triplet View: (A, B), (A, C), (B, C), (C, D)
[Figure: the graph and its triplet view, used to produce updated PageRanks]
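A minimal sketch of why a triplet view suits message passing, using the vertex and edge tables from this slide. The initial rank of 1.0, the damping factor of 0.85, and the update rule are assumptions chosen for illustration, and the helper names `triplets` and `pagerank_step` are hypothetical, not GraphX API.

```python
# Vertex table (id -> rank) and edge table from the slide.
# The initial rank of 1.0 is an assumption for illustration.
vertices = {"A": 1.0, "B": 1.0, "C": 1.0, "D": 1.0}
edges = [("A", "B"), ("A", "C"), ("B", "C"), ("C", "D")]

# Out-degree of each vertex, needed to split a rank across outgoing edges.
out_degree = {v: 0 for v in vertices}
for src, _ in edges:
    out_degree[src] += 1


def triplets(vertices, edges):
    """Join each edge with the attributes of both endpoints."""
    return [(src, vertices[src], dst, vertices[dst]) for src, dst in edges]


def pagerank_step(vertices, edges, damping=0.85):
    """One message-passing round: each triplet sends rank/out_degree to dst."""
    incoming = {v: 0.0 for v in vertices}
    for src, src_rank, dst, _ in triplets(vertices, edges):
        incoming[dst] += src_rank / out_degree[src]
    return {v: (1 - damping) + damping * incoming[v] for v in vertices}


updated = pagerank_step(vertices, edges)
```

Each message needs only the fields of a single triplet, so a system that materializes the triplet view once can reuse it across iterations instead of rejoining edges with vertices every round.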
13. Materialized View Selection
GraphFrames: User-defined views enable efficient graph queries
Vertices: A, B, C, D
Edges: A → B, A → C, B → C, C → D
Triplet View: (A, B), (A, C), (B, C), (C, D)
[Figure: the graph with its triplet view and user-defined views, serving graph queries: PageRank, Community Detection, …]
14. Join Elimination
Edges
Src  Dst
1    2
1    3
2    3
2    5

Vertices
ID  Attr
1   A
2   B
3   C
4   D
SELECT src, dst
FROM edges INNER JOIN vertices ON src = id;
The join is unnecessary and can be eliminated if the tables satisfy referential
integrity, simplifying graph–relational translation:
SELECT src, dst FROM edges;
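As a quick check of the equivalence, the following sketch loads the two tables from this slide into an in-memory SQLite database and runs both queries. SQLite is used only as a convenient stand-in for Spark SQL; the primary key on `id` and the foreign-key declaration on `src` are assumptions that encode the referential-integrity condition.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- id is unique, and every edges.src refers to an existing vertex:
    -- the two conditions that make the join on src = id redundant.
    CREATE TABLE vertices (id INTEGER PRIMARY KEY, attr TEXT);
    CREATE TABLE edges (src INTEGER REFERENCES vertices(id), dst INTEGER);
    INSERT INTO vertices VALUES (1, 'A'), (2, 'B'), (3, 'C'), (4, 'D');
    INSERT INTO edges VALUES (1, 2), (1, 3), (2, 3), (2, 5);
""")

with_join = sorted(conn.execute(
    "SELECT src, dst FROM edges INNER JOIN vertices ON src = id").fetchall())
without_join = sorted(conn.execute(
    "SELECT src, dst FROM edges").fetchall())

# Both queries return the same four rows.
print(with_join == without_join)
```

The rewrite is safe here because the join neither filters rows (every `src` has a matching `id`) nor duplicates them (`id` is unique), and the query selects no vertex columns.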
15. Join Reordering
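Example query (pattern edges): A → B, B → A, B → D, C → B, B → E, C → D, C → E
Left-deep plan: ((((((A → B ⋈[A, B] B → A) ⋈[B] B → D) ⋈[B] C → B) ⋈[B] B → E) ⋈[C, D] C → D) ⋈[C, E] C → E)
Bushy plan: [Figure: join tree over the leaves A → B, B → A, B → D, C → B, B → E, and a user-defined view; join keys shown: (A, B), B, B, B, (B, C)]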
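The following sketch evaluates the example pattern with a left-deep order and with one possible bushy order, to illustrate that reordering changes only the intermediate results, not the answer. The data graph, the helper names `scan`, `join`, and `canonical`, and the exact bushy shape are assumptions for illustration; the bushy tree here is not claimed to be the one on the slide, and `view` merely stands in for a user-defined view over the C → D and C → E edges.

```python
from functools import reduce

# Hypothetical data graph (not from the slides).
EDGES = [(1, 2), (2, 1), (2, 3), (4, 2), (2, 5), (4, 3), (4, 5)]


def scan(src_var, dst_var):
    """Bind one pattern edge src_var -> dst_var to every data edge."""
    return [{src_var: s, dst_var: d} for s, d in EDGES]


def join(left, right):
    """Natural join: merge bindings that agree on all shared variables."""
    return [{**l, **r}
            for l in left for r in right
            if all(l[k] == r[k] for k in l.keys() & r.keys())]


def canonical(bindings):
    """Order-independent form of a result, for comparing plans."""
    return sorted(tuple(sorted(b.items())) for b in bindings)


# Left-deep plan: fold the pattern edges in one at a time.
left_deep = reduce(join, [
    scan("A", "B"), scan("B", "A"), scan("B", "D"), scan("C", "B"),
    scan("B", "E"), scan("C", "D"), scan("C", "E"),
])

# One bushy alternative: join independent subplans, then combine them.
view = join(scan("C", "D"), scan("C", "E"))  # stand-in for a precomputed view
bushy = join(
    join(join(scan("A", "B"), scan("B", "A")),
         join(scan("B", "D"), scan("C", "B"))),
    join(scan("B", "E"), view),
)

same_result = canonical(left_deep) == canonical(bushy)
```

If `view` were materialized ahead of time, the bushy plan could read it directly instead of recomputing the join over the C edges for every query.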
16. Evaluation
Faster than Neo4j for unanchored pattern queries
[Figure: two bar charts of query latency (s), GraphFrames vs. Neo4j. Anchored Pattern Query: axis 0 to 2.5 s. Unanchored Pattern Query: axis 0 to 80 s.]
Triangle query on 1M edge subgraph of web-Google. Each system configured to use a single core.
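For readers unfamiliar with the workload, a triangle query can be written relationally as a three-way self-join of the edge table. The sketch below runs it in SQLite on a tiny hypothetical graph; it illustrates the query shape only and is not the benchmark code, and treating a triangle as a directed 3-cycle is an assumption.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE edges (src INTEGER, dst INTEGER);
    -- Hypothetical data: one directed 3-cycle (1 -> 2 -> 3 -> 1) plus a stray edge.
    INSERT INTO edges VALUES (1, 2), (2, 3), (3, 1), (3, 4);
""")

# Unanchored: no vertex is fixed, so every edge is a possible starting point.
triangles = sorted(conn.execute("""
    SELECT e1.src, e2.src, e3.src
    FROM edges e1
    JOIN edges e2 ON e1.dst = e2.src
    JOIN edges e3 ON e2.dst = e3.src AND e3.dst = e1.src
""").fetchall())
```

The single cycle appears three times, once per starting vertex. An anchored variant would add a predicate such as `WHERE e1.src = 1`, restricting the search to one starting vertex.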
17. Evaluation
Approaches performance of GraphX for graph algorithms using Spark SQL
whole-stage code generation
[Figure: bar chart "PageRank Performance", per-iteration runtime (s, axis 0 to 7) for GraphFrames, GraphX, and Naïve Spark]
Per-iteration performance on web-Google, single 8-core machine. Naïve Spark uses Scala RDD API.
18. Evaluation
Registering the right views can greatly improve performance for some queries
Workload: J. Huang, K. Venkatraman, and D.J. Abadi. Query optimization of distributed pattern matching. In ICDE 2014.
19. Future Work
• Suggest views automatically
• Exploit attribute-based partitioning in optimizer
• Code generation for single node
20. Try It Out!
Released as a Spark Package at:
https://github.com/graphframes/graphframes
Thanks to Joseph Bradley, Xiangrui Meng, and Timothy Hunter.
ankurd@eecs.berkeley.edu