GraphFrames: Graph Queries In Spark SQL

724 views

Published on

Spark Summit talk by Ankur Dave (UC Berkeley)

Published in: Data & Analytics
0 Comments
1 Like
Statistics
Notes
  • Be the first to comment

No Downloads
Views
Total views
724
On SlideShare
0
From Embeds
0
Number of Embeds
4
Actions
Shares
0
Downloads
82
Comments
0
Likes
1
Embeds 0
No embeds

No notes for slide

GraphFrames: Graph Queries In Spark SQL

  1. 1. GraphFrames: Graph Queries in Apache Spark SQL Ankur Dave UC Berkeley AMPLab Joint work with Alekh Jindal (Microsoft), Li Erran Li (Uber), Reynold Xin (Databricks), Joseph Gonzalez (UC Berkeley), and Matei Zaharia (MIT and Databricks)
  2. 2. + Graph Queries 2016 Apache Spark + GraphFrames GraphFrames (2016) + Graph Algorithms 2013 Apache Spark + GraphX Relational Queries 2009 Spark
  3. 3. Graph Algorithms vs. Graph Queries ≈ x PageRank Alternating Least Squares Graph Algorithms Graph Queries
  4. 4. Graph Algorithms vs. Graph Queries Graph Algorithm: PageRank Graph Query: Wikipedia Collaborators Editor 1 Editor 2 Article 1 Article 2 ⇓ Article 1 Article 2 Editor 1 Editor 2 same day} same day}
  5. 5. Graph Algorithms vs. Graph Queries Graph Algorithm: PageRank // Iterate until convergence wikipedia.pregel( sendMsg = { e => e.sendToDst(e.srcRank * e.weight) }, mergeMsg = _ + _, vprog = { (id, oldRank, msgSum) => 0.15 + 0.85 * msgSum }) Graph Query: Wikipedia Collaborators wikipedia.find( "(u1)-[e11]->(article1); (u2)-[e21]->(article1); (u1)-[e12]->(article2); (u2)-[e22]->(article2)") .select( "*", "e11.date – e21.date".as("d1"), "e12.date – e22.date".as("d2")) .sort("d1 + d2".desc).take(10)
  6. 6. Separate Systems Graph Algorithms Graph Queries
  7. 7. Raw Wikipedia < / >< / >< / > XML Text Table Edit Graph Edit Table Frequent Collaborators Problem: Mixed Graph Analysis Hyperlinks PageRank Article Text User Article Vandalism Suspects User User User Article
  8. 8. Solution: GraphFrames Graph Algorithms Graph Queries Spark SQL GraphFramesAPI Pattern Query Optimizer
  9. 9. GraphFrames API • Unifies graph algorithms, graph queries, and DataFrames • Available in Scala,Java, and Python class GraphFrame { def vertices: DataFrame def edges: DataFrame def find(pattern: String): DataFrame def registerView(pattern: String, df: DataFrame): Unit def degrees(): DataFrame def pageRank(): GraphFrame def connectedComponents(): GraphFrame ... }
  10. 10. Implementation Parsed Pattern Logical Plan Materialized Views Optimized Logical Plan DataFrame Result Query String Graph–Relational Translation Join Elimination and Reordering Spark SQL View Selection Graph Algorithms GraphX
  11. 11. Graph–Relational Translation B D A C Existing Logical Plan Output: A,B,C Src Dst ⋈C=Src Edge Table ID Attr VertexTable ⋈D=ID
  12. 12. Materialized View Selection GraphX: Triplet view enabled efficient message-passing algorithms Vertices B A C D Edges A B A C B C C D A B Triplet View A C B C C D Graph + Updated PageRanks B A C D A
  13. 13. Materialized View Selection GraphFrames: User-defined views enable efficient graph queries Vertices B A C D Edges A B A C B C C D A B Triplet View A C B C C D Graph User-Defined Views PageRank Community Detection … Graph Queries
  14. 14. Join Elimination Src Dst 1 2 1 3 2 3 2 5 Edges ID Attr 1 A 2 B 3 C 4 D Vertices SELECT src, dst FROM edges INNER JOIN vertices ON src = id; Unnecessaryjoin can be eliminated if tables satisfy referential integrity, simplifying graph–relational translation: SELECT src, dst FROM edges;
  15. 15. Join Reordering A → B B → A ⋈A, B B → D C → B⋈B B → E⋈B C → D⋈B C → E⋈C, D ⋈C, E Example Query Left-Deep Plan BushyPlan A → B B → A ⋈A, B B → D C → B ⋈B B → E⋈B ⋈B ⋈B, C User-Defined View
  16. 16. Evaluation Faster than Neo4j for unanchored patternqueries 0 0.5 1 1.5 2 2.5 GraphFrames Neo4j Querylatency,s AnchoredPatternQuery 0 10 20 30 40 50 60 70 80 GraphFrames Neo4j Querylatency,s UnanchoredPatternQuery Triangle query on 1M edge subgraph of web-Google. Each system configured touse a single core.
  17. 17. Evaluation Approaches performance of GraphX for graph algorithms using Spark SQL whole-stage code generation 0 1 2 3 4 5 6 7 GraphFrames GraphX Naïve Spark Per-iterationruntime,s PageRankPerformance Per-iteration performance on web-Google, single 8-core machine. Naïve SparkusesScala RDD API.
  18. 18. Evaluation Registering the right views cangreatlyimprove performance for some queries Workload: J. Huang, K. Venkatraman, and D.J. Abadi.Query optimization of distributed pattern matching. In ICDE 2014.
  19. 19. Future Work • Suggest views automatically • Exploit attribute-based partitioning in optimizer • Code generationfor single node
  20. 20. Try It Out! Releasedas a Spark Package at: https://github.com/graphframes/graphframes Thanks to Joseph Bradley,Xiangrui Meng,and Timothy Hunter. ankurd@eecs.berkeley.edu

×