Successfully reported this slideshow.
We use your LinkedIn profile and activity data to personalize ads and to show you more relevant ads. You can change your ad preferences anytime.
GraphFrames: Graph Queries in
Apache Spark SQL
Ankur Dave
UC Berkeley AMPLab
Joint work with Alekh Jindal (Microsoft), Li ...
+ Graph
Queries
2016
Apache Spark +
GraphFrames
GraphFrames (2016)
+ Graph
Algorithms
2013
Apache Spark +
GraphX
Relationa...
Graph Algorithms vs. Graph Queries
≈
x
PageRank
Alternating Least Squares
Graph Algorithms Graph Queries
Graph Algorithms vs. Graph Queries
Graph Algorithm: PageRank Graph Query: Wikipedia Collaborators
Editor 1 Editor 2 Articl...
Graph Algorithms vs. Graph Queries
Graph Algorithm: PageRank
// Iterate until convergence
wikipedia.pregel(
sendMsg = { e ...
Separate Systems
Graph Algorithms Graph Queries
Raw Wikipedia
< / >< / >< / >
XML
Text Table
Edit Graph
Edit Table
Frequent
Collaborators
Problem: Mixed Graph Analysis
Hy...
Solution: GraphFrames
Graph Algorithms Graph Queries
Spark SQL
GraphFramesAPI
Pattern Query
Optimizer
GraphFrames API
• Unifies graph algorithms, graph queries, and DataFrames
• Available in Scala,Java, and Python
class Grap...
Implementation
Parsed
Pattern
Logical Plan
Materialized
Views
Optimized
Logical Plan
DataFrame
Result
Query String
Graph–R...
Graph–Relational Translation
B
D
A
C
Existing
Logical Plan
Output: A,B,C
Src Dst
⋈C=Src
Edge Table
ID Attr
VertexTable
⋈D=...
Materialized View Selection
GraphX: Triplet view enabled efficient message-passing algorithms
Vertices
B
A
C
D
Edges
A B
A...
Materialized View Selection
GraphFrames: User-defined views enable efficient graph queries
Vertices
B
A
C
D
Edges
A B
A C
...
Join Elimination
Src Dst
1 2
1 3
2 3
2 5
Edges
ID Attr
1 A
2 B
3 C
4 D
Vertices
SELECT src, dst
FROM edges INNER JOIN vert...
Join Reordering
A → B B → A
⋈A, B
B → D
C → B⋈B
B → E⋈B
C → D⋈B
C → E⋈C, D
⋈C, E
Example Query
Left-Deep Plan BushyPlan
A ...
Evaluation
Faster than Neo4j for unanchored patternqueries
0
0.5
1
1.5
2
2.5
GraphFrames Neo4j
Querylatency,s
AnchoredPatt...
Evaluation
Approaches performance of GraphX for graph algorithms using Spark SQL
whole-stage code generation
0
1
2
3
4
5
6...
Evaluation
Registering the right views cangreatlyimprove performance for some queries
Workload: J. Huang, K. Venkatraman, ...
Future Work
• Suggest views automatically
• Exploit attribute-based partitioning in optimizer
• Code generationfor single ...
Try It Out!
Releasedas a Spark Package at:
https://github.com/graphframes/graphframes
Thanks to Joseph Bradley,Xiangrui Me...
Upcoming SlideShare
Loading in …5
×

GraphFrames: Graph Queries In Spark SQL

3,697 views

Published on

Spark Summit talk by Ankur Dave (UC Berkeley)

Published in: Data & Analytics
  • Writing good research paper is quite easy and very difficult simultaneously. It depends on the individual skill set also. You can get help from research paper writing. Check out, please ⇒ www.HelpWriting.net ⇐
       Reply 
    Are you sure you want to  Yes  No
    Your message goes here
  • I pasted a website that might be helpful to you: ⇒ HelpWriting.net ⇐ Good luck!
       Reply 
    Are you sure you want to  Yes  No
    Your message goes here

GraphFrames: Graph Queries In Spark SQL

  1. 1. GraphFrames: Graph Queries in Apache Spark SQL Ankur Dave UC Berkeley AMPLab Joint work with Alekh Jindal (Microsoft), Li Erran Li (Uber), Reynold Xin (Databricks), Joseph Gonzalez (UC Berkeley), and Matei Zaharia (MIT and Databricks)
  2. 2. + Graph Queries 2016 Apache Spark + GraphFrames GraphFrames (2016) + Graph Algorithms 2013 Apache Spark + GraphX Relational Queries 2009 Spark
  3. 3. Graph Algorithms vs. Graph Queries ≈ x PageRank Alternating Least Squares Graph Algorithms Graph Queries
  4. 4. Graph Algorithms vs. Graph Queries Graph Algorithm: PageRank Graph Query: Wikipedia Collaborators Editor 1 Editor 2 Article 1 Article 2 ⇓ Article 1 Article 2 Editor 1 Editor 2 same day} same day}
  5. 5. Graph Algorithms vs. Graph Queries Graph Algorithm: PageRank // Iterate until convergence wikipedia.pregel( sendMsg = { e => e.sendToDst(e.srcRank * e.weight) }, mergeMsg = _ + _, vprog = { (id, oldRank, msgSum) => 0.15 + 0.85 * msgSum }) Graph Query: Wikipedia Collaborators wikipedia.find( "(u1)-[e11]->(article1); (u2)-[e21]->(article1); (u1)-[e12]->(article2); (u2)-[e22]->(article2)") .select( "*", "e11.date – e21.date".as("d1"), "e12.date – e22.date".as("d2")) .sort("d1 + d2".desc).take(10)
  6. 6. Separate Systems Graph Algorithms Graph Queries
  7. 7. Raw Wikipedia < / >< / >< / > XML Text Table Edit Graph Edit Table Frequent Collaborators Problem: Mixed Graph Analysis Hyperlinks PageRank Article Text User Article Vandalism Suspects User User User Article
  8. 8. Solution: GraphFrames Graph Algorithms Graph Queries Spark SQL GraphFramesAPI Pattern Query Optimizer
  9. 9. GraphFrames API • Unifies graph algorithms, graph queries, and DataFrames • Available in Scala,Java, and Python class GraphFrame { def vertices: DataFrame def edges: DataFrame def find(pattern: String): DataFrame def registerView(pattern: String, df: DataFrame): Unit def degrees(): DataFrame def pageRank(): GraphFrame def connectedComponents(): GraphFrame ... }
  10. 10. Implementation Parsed Pattern Logical Plan Materialized Views Optimized Logical Plan DataFrame Result Query String Graph–Relational Translation Join Elimination and Reordering Spark SQL View Selection Graph Algorithms GraphX
  11. 11. Graph–Relational Translation B D A C Existing Logical Plan Output: A,B,C Src Dst ⋈C=Src Edge Table ID Attr VertexTable ⋈D=ID
  12. 12. Materialized View Selection GraphX: Triplet view enabled efficient message-passing algorithms Vertices B A C D Edges A B A C B C C D A B Triplet View A C B C C D Graph + Updated PageRanks B A C D A
  13. 13. Materialized View Selection GraphFrames: User-defined views enable efficient graph queries Vertices B A C D Edges A B A C B C C D A B Triplet View A C B C C D Graph User-Defined Views PageRank Community Detection … Graph Queries
  14. 14. Join Elimination Src Dst 1 2 1 3 2 3 2 5 Edges ID Attr 1 A 2 B 3 C 4 D Vertices SELECT src, dst FROM edges INNER JOIN vertices ON src = id; Unnecessaryjoin can be eliminated if tables satisfy referential integrity, simplifying graph–relational translation: SELECT src, dst FROM edges;
  15. 15. Join Reordering A → B B → A ⋈A, B B → D C → B⋈B B → E⋈B C → D⋈B C → E⋈C, D ⋈C, E Example Query Left-Deep Plan BushyPlan A → B B → A ⋈A, B B → D C → B ⋈B B → E⋈B ⋈B ⋈B, C User-Defined View
  16. 16. Evaluation Faster than Neo4j for unanchored patternqueries 0 0.5 1 1.5 2 2.5 GraphFrames Neo4j Querylatency,s AnchoredPatternQuery 0 10 20 30 40 50 60 70 80 GraphFrames Neo4j Querylatency,s UnanchoredPatternQuery Triangle query on 1M edge subgraph of web-Google. Each system configured touse a single core.
  17. 17. Evaluation Approaches performance of GraphX for graph algorithms using Spark SQL whole-stage code generation 0 1 2 3 4 5 6 7 GraphFrames GraphX Naïve Spark Per-iterationruntime,s PageRankPerformance Per-iteration performance on web-Google, single 8-core machine. Naïve SparkusesScala RDD API.
  18. 18. Evaluation Registering the right views cangreatlyimprove performance for some queries Workload: J. Huang, K. Venkatraman, and D.J. Abadi.Query optimization of distributed pattern matching. In ICDE 2014.
  19. 19. Future Work • Suggest views automatically • Exploit attribute-based partitioning in optimizer • Code generationfor single node
  20. 20. Try It Out! Releasedas a Spark Package at: https://github.com/graphframes/graphframes Thanks to Joseph Bradley,Xiangrui Meng,and Timothy Hunter. ankurd@eecs.berkeley.edu

×