Assessing Graph Solutions for Apache Spark

Users have several options for running graph algorithms with Apache Spark. To support a graph data architecture on top of its linear-oriented DataFrames, the Spark platform offers GraphFrames. However, because GraphFrames are immutable and are not a native graph store, they may not offer the features or performance that some use cases need. Another option is to connect Spark to a real-time, scalable, distributed native graph database such as TigerGraph.

In this session, we compare three options — GraphX, Cypher for Apache Spark, and TigerGraph — for different types of workload requirements and data sizes, to help users select the right solution for their needs. We also look at the data transfer and loading time for TigerGraph.

  2. Songting Chen, Victor Lee (TigerGraph): Assessing Graph Solutions for Apache Spark. #UnifiedDataAnalytics #SparkAISummit
  3. Graph is HOW WE THINK
  4. We Use Graph Every Day
  5. The Evolution of Graph Analysis
     • Early days
       – PageRank etc., focus on graph algorithms
       – Pregel programming API
     • Nowadays
       – Query language, more declarative without losing expressive power
       – AI + graph data: graph features, training, predictions
       – More real time (updates, queries)
       – Scale, scale, scale
       – Gartner: Graph DB market grows 100% YOY through 2022
  6. Typical Workload / Use Cases
     • Batch / offline processing
       – Web Search/PageRank, etc.
     • Real time graph queries / updates
       – Graph feature extraction for AI training and prediction, e.g., spam phone call detection
       – Data center monitoring (server, router, apps, rack)
       – Entire big data industry moves towards real time
     • Scalability: large data volume, high QPS
  7. This Talk
     • Spark: General scalable big data / ML platform
       – GraphX: Spark-based Graph Platform
     • TigerGraph: Scalable Native Graph Platform
     • How they differ, pros and cons for graph applications
     • How they work together to provide end-to-end solutions
  8. Comparing GraphX and TigerGraph
  9. Areas of Focus
     • Graph Data Storage
     • Query Expressiveness
     • Supported Workload
     • Scalability and Performance
  10. Graph Data Storage
     • TigerGraph
       – ETL preload / optimized storage
     • GraphX
       – Data is stored elsewhere and loaded on the fly
     • Pros and cons
       – Load data once (TigerGraph): initial cost, good for repeated analysis
       – Load data many times (GraphX): minimal initial cost, good for initial exploratory analysis
  11. Query Expressiveness: GraphX is API-based, for creating graph algorithms

      PageRank(...) …
      while (iteration < numIter) {
        rankGraph.cache()
        // Send each source's weighted rank along its out-edges and sum at the destination.
        val rankUpdates = rankGraph.aggregateMessages[Double](
          ctx => ctx.sendToDst(ctx.srcAttr * ctx.attr), _ + _, TripletFields.Src)
        prevRankGraph = rankGraph
        // Vertices that received no message fall back to 0.0.
        rankGraph = rankGraph.outerJoinVertices(rankUpdates) {
          (id, oldRank, msgSumOpt) => resetProb + (1.0 - resetProb) * msgSumOpt.getOrElse(0.0)
        }.cache()
        rankGraph.edges.foreachPartition(x => {})
        prevRankGraph.vertices.unpersist()
        prevRankGraph.edges.unpersist()
        iteration += 1
      }

      (Diagram labels: 1; msg: 1/4 = 0.25; +: {msg}.)
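
The send-then-merge pattern in the loop above can be tried without a Spark cluster. The following is a minimal plain-Scala sketch on an in-memory edge list; it does not use GraphX, and the names `PageRankSketch`, `pageRankSketch`, and the default reset probability of 0.15 are assumptions made for illustration. It assumes every edge carries the weight 1/outDegree(src), which matches the "msg: 1/4 = 0.25" label on the slide's diagram.

```scala
object PageRankSketch {
  // Hypothetical helper, not part of GraphX: runs numIter rounds of
  // "send rank along each edge, sum at the destination".
  def pageRankSketch(
      edges: Seq[(Int, Int)],
      numIter: Int,
      resetProb: Double = 0.15): Map[Int, Double] = {
    val vertices = edges.flatMap { case (s, d) => Seq(s, d) }.distinct
    val outDegree = edges.groupBy(_._1).map { case (v, es) => v -> es.size }
    // Every vertex starts with rank 1.0 (an assumption of this sketch).
    var ranks: Map[Int, Double] = vertices.map(v => v -> 1.0).toMap

    for (_ <- 0 until numIter) {
      // Like sendToDst: each edge carries srcRank * (1 / outDegree(src)).
      val messages = edges.map { case (s, d) => d -> ranks(s) / outDegree(s) }
      // Like the `_ + _` merge: sum all messages arriving at one destination.
      val msgSum = messages.groupBy(_._1).map { case (v, ms) => v -> ms.map(_._2).sum }
      // Like outerJoinVertices: vertices with no incoming message get 0.0.
      ranks = vertices.map { v =>
        v -> (resetProb + (1.0 - resetProb) * msgSum.getOrElse(v, 0.0))
      }.toMap
    }
    ranks
  }
}
```

On a three-vertex cycle every vertex keeps its starting rank, which makes a convenient sanity check for the update rule.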
  12. TigerGraph’s GSQL: Declarative Graph Algorithm Design

      SumAccum @received_score = 0;
      SumAccum @score = 1;
      people = {People.*};
      WHILE True LIMIT maxIter DO
        people = SELECT src
                 FROM people:src -(follow:e)-> people:tgt
                 ACCUM tgt.@received_score += src.@score / src.outdegree()
                 POST-ACCUM src.@score = (1 - resetProb) + resetProb * src.@received_score,
                            src.@received_score = 0;
      END;

      (Diagram labels: src, tgt, @score, @received_score; tgt.@received_score += src.@score/src.outdegree().)
  13. TigerGraph’s GSQL, cont.: Simultaneously compute many metrics in a declarative way for complex algorithms

      SumAccum @received_score = 0;
      SumAccum @score = 1;
      MaxAccum @received_max_neighbor_score = 0;
      MaxAccum @max_neighbor_score = 1;
      people = {People.*};
      WHILE True LIMIT maxIter DO
        Start = SELECT src
                FROM people:src -(follow:e)-> people:tgt
                ACCUM tgt.@received_score += src.@score / src.outdegree(),
                      tgt.@received_max_neighbor_score += src.@score
                POST-ACCUM src.@score = (1 - resetProb) + resetProb * src.@received_score,
                           src.@received_score = 0,
                           src.@max_neighbor_score = src.@received_max_neighbor_score,
                           src.@received_max_neighbor_score = 0;
      END;

      (Diagram labels: src, src, src, tgt, annotated with the two ACCUM updates.)
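
To make the two-phase accumulator idea concrete, here is a rough plain-Scala sketch of one iteration: a single pass over the edges feeds both a sum and a max per target vertex, then a per-vertex pass folds the received values into the vertex state. This is not GSQL and does not reproduce TigerGraph's execution model; `AccumSketch`, `State`, and `step` are hypothetical names, the update formula is copied from the slide as written, and the sketch applies the second phase to every vertex for simplicity.

```scala
object AccumSketch {
  // Per-vertex state mirroring the slide's @score and @max_neighbor_score.
  final case class State(score: Double, maxNeighborScore: Double)

  // One iteration: an ACCUM-like pass over edges, then a POST-ACCUM-like
  // pass over vertices.
  def step(
      edges: Seq[(String, String)],
      state: Map[String, State],
      resetProb: Double): Map[String, State] = {
    val outDegree = edges.groupBy(_._1).map { case (v, es) => v -> es.size }

    // ACCUM-like phase: one fold over the edges updates a sum and a max
    // for each target vertex at the same time.
    val received = edges.foldLeft(Map.empty[String, (Double, Double)]) {
      case (acc, (src, tgt)) =>
        val (sum, max) = acc.getOrElse(tgt, (0.0, 0.0))
        val s = state(src).score
        acc.updated(tgt, (sum + s / outDegree(src), math.max(max, s)))
    }

    // POST-ACCUM-like phase: each vertex folds what it received into its
    // own state; the "received" values are then discarded.
    state.map { case (v, _) =>
      val (sum, max) = received.getOrElse(v, (0.0, 0.0))
      v -> State((1 - resetProb) + resetProb * sum, max)
    }
  }
}
```

The point of the design is that adding a second metric only adds one more component to the fold; the traversal over the edges is not repeated.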
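  14. GraphFrame: Declarative Pattern Query

      import org.apache.spark.sql.Column
      import org.apache.spark.sql.functions.{col, lit, when}

      // g is an existing GraphFrame whose edges have a "relationship" column.
      val chain4 = g.find("(a)-[ab]->(b); (b)-[bc]->(c); (c)-[cd]->(d)")
      // (a) Define how the running count changes for each edge in the chain.
      def sumFriends(cnt: Column, relationship: Column): Column = {
        when(relationship === "friend", cnt + 1).otherwise(cnt)
      }
      // (b) Fold the update over the three edges of the motif.
      val condition = Seq("ab", "bc", "cd").
        foldLeft(lit(0))((cnt, e) => sumFriends(cnt, col(e)("relationship")))
      // (c) Apply filter to DataFrame.
      val chainWith2Friends2 = chain4.where(condition >= 2)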
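
The fold in the query above builds a column expression, which can obscure what is being counted. As an illustration only, the same logic on ordinary Scala values looks like this; `ChainSketch`, `Chain`, and the two functions are hypothetical names, and each `Chain` stands in for one row returned by the motif search.

```scala
object ChainSketch {
  // Relationship labels on the three edges of a chain a -> b -> c -> d.
  final case class Chain(ab: String, bc: String, cd: String)

  // Same shape as the slide's foldLeft: start at 0, add 1 per "friend" edge.
  def friendCount(c: Chain): Int =
    Seq(c.ab, c.bc, c.cd).foldLeft(0) { (cnt, rel) =>
      if (rel == "friend") cnt + 1 else cnt
    }

  // Keep only the chains in which at least two of the three edges are "friend".
  def chainsWithTwoFriends(chains: Seq[Chain]): Seq[Chain] =
    chains.filter(friendCount(_) >= 2)
}
```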
  15. TigerGraph’s GSQL: declarative pattern matching + algorithm

      A simple recommendation algorithm:

      SumAccum @common_buys;
      OrAccum @already_bought;
      SumAccum @product_rank;
      other_people = SELECT g
                     FROM seed_people:s -(buy)-> product:t <-(buy)- people:g
                     ACCUM g.@common_buys += 1,
                           t.@already_bought += true;
      recommended_products = SELECT t
                             FROM other_people:s -(buy:e)-> product:t
                             WHERE t.@already_bought == false
                             ACCUM t.@product_rank += log(1 + s.@common_buys)
                             ORDER BY t.@product_rank DESC
                             LIMIT 20;

      (Diagram labels: @common_buys, @rank.)

      Real-time updates and queries could significantly improve the effectiveness of the recommendation algorithm.
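
The two-hop logic of this recommendation query can be sketched in plain Scala on a small in-memory purchase map. This is an illustration under simplifying assumptions, not the GSQL semantics: it takes a single seed person instead of a seed set, and `RecommendSketch`, `recommend`, and the input shape are hypothetical.

```scala
object RecommendSketch {
  // purchases: person -> set of products that person bought.
  def recommend(
      purchases: Map[String, Set[String]],
      seed: String,
      limit: Int = 20): Seq[(String, Double)] = {
    val seedProducts = purchases.getOrElse(seed, Set.empty[String])

    // Step 1: for every other person, count the products shared with the seed.
    val commonBuys: Map[String, Int] = purchases
      .filter { case (person, _) => person != seed }
      .map { case (person, products) => person -> (products & seedProducts).size }
      .filter { case (_, n) => n > 0 }

    // Step 2: score each product the seed has not bought by summing
    // log(1 + common_buys) over the similar people who bought it.
    val scores = for {
      (person, n) <- commonBuys.toSeq
      product <- purchases(person).toSeq if !seedProducts.contains(product)
    } yield product -> math.log(1.0 + n)

    scores
      .groupBy(_._1)
      .map { case (product, xs) => product -> xs.map(_._2).sum }
      .toSeq
      // Highest score first; ties broken by product name for determinism.
      .sortWith { (a, b) => a._2 > b._2 || (a._2 == b._2 && a._1 < b._1) }
      .take(limit)
  }
}
```

The logarithm dampens the influence of people who share very many purchases with the seed, so a single heavy buyer does not dominate the ranking.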
  16. Query Expressiveness: Summary
     • GraphX (API for designing graph algorithms) + GraphFrame (declarative pattern queries)
     • GSQL (SQL-procedure query language, declarative on both graph algorithm and pattern matching)
     • Both provide powerful graph analytics capabilities
  17. Query Workload: GraphX (OLAP), TigerGraph (HTAP)

      Workload                               GraphX   TigerGraph
      Big Analytics Query                    ✓        ✓
      High QPS, Sub-second Query Workload             ✓
      Real Time Transactional Updates                 ✓

  18. Scalability
     • Spark/GraphX is well known for its scalability and MPP capabilities.
     • TigerGraph is also designed from the ground up with MPP and scalability in mind.
  19. TigerGraph: Analytics Query Scalability
     • Twitter dataset (41M vertices, 1.4B edges)
     • AWS, 16 r5.2xlarge servers (8 cores, 64GB memory)
     (Chart: latency in seconds vs. number of servers.)
  20. TigerGraph: Point Query Scalability
     • Point query: 3-step graph traversals from a seed vertex
     • Application: real time ML prediction based on graph features
     (Chart: QPS vs. number of servers.)
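
For readers unfamiliar with the term, a point query of this kind can be pictured as a bounded breadth-first expansion from one vertex. The sketch below is plain Scala on an in-memory adjacency map and says nothing about how TigerGraph executes such queries; `TraversalSketch` and `withinHops` are hypothetical names, and collecting every vertex within three hops is one assumed reading of "3-step traversal".

```scala
object TraversalSketch {
  // Returns every vertex reachable from `seed` in at most `maxHops` steps
  // (seed included), expanding one frontier per hop.
  def withinHops(adj: Map[Int, Seq[Int]], seed: Int, maxHops: Int = 3): Set[Int] = {
    var visited = Set(seed)
    var frontier = Set(seed)
    for (_ <- 1 to maxHops) {
      // Next frontier: neighbors of the current frontier not seen before.
      frontier = frontier.flatMap(v => adj.getOrElse(v, Seq.empty[Int])).diff(visited)
      visited = visited ++ frontier
    }
    visited
  }
}
```

Because the work is bounded by the neighborhood of the seed and not by the size of the graph, queries of this shape are the ones where high QPS is a meaningful metric.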
  21. Performance Comparison
     • GraphX: EdgePartition2D; AWS, 16 r5.2xlarge servers (8 cores, 64GB memory)
     (Chart: latency in seconds.)
  22. Performance Comparison, cont.
     • GraphX: EdgePartition2D; AWS, 16 r5.2xlarge servers (8 cores, 64GB memory)
     (Chart: latency in seconds.)
  23. Summary / Recommendations
     • GraphX: Quick-to-result exploratory analysis without having to preload the graph data
     • TigerGraph: High performance graph analytics, real time transactional updates, high QPS sub-second query workload
  24. How Spark and TigerGraph Work Together
  25. Reference Architecture: Spark + TigerGraph for AI
  26. Connect Spark-TigerGraph through JDBC
     • Support Read and Write bi-directional data flow to/from TigerGraph
     • Read: Convert graph query results to DataFrame
     • Write: Load DataFrame/Files to Vertex/Edges in TigerGraph
     • Open Source
       – https://github.com/tigergraph/ecosystem/tree/master/etl/tg_jdbc_driver
  27. Benefits of Spark + TigerGraph
     • Take full advantage of the value from graph data in real time
     • Combine them with all other data for deep insights and AI
     • Scalable in every step
     • Already have actual use cases running in this architecture