Successfully reported this slideshow.
We use your LinkedIn profile and activity data to personalize ads and to show you more relevant ads. You can change your ad preferences anytime.

Web-Scale Graph Analytics with Apache® Spark™

2,041 views

Published on

Graph analytics has a wide range of applications, from information propagation and network flow optimization to fraud and anomaly detection. The rise of social networks and the Internet of Things has given us complex web-scale graphs with billions of vertices and edges. However, in order to extract the hidden gems within those graphs, you need tools to analyze the graphs easily and efficiently.

At Spark Summit 2016, Databricks introduced GraphFrames, which implemented graph queries and pattern matching on top of Spark SQL to simplify graph analytics. In this talk, you’ll learn about work that has made graph algorithms in GraphFrames faster and more scalable. For example, new implementations like connected components have received algorithm improvements based on recent research, as well as performance improvements from Spark DataFrames. Discover lessons learned from scaling the implementation from millions to billions of nodes; compare its performance with other popular graph libraries; and hear about real-world applications.

Published in: Software
  • Hello! Get Your Professional Job-Winning Resume Here - Check our website! https://vk.cc/818RFv
       Reply 
    Are you sure you want to  Yes  No
    Your message goes here

Web-Scale Graph Analytics with Apache® Spark™

  1. 1. Web-Scale Graph Analytics with Apache Spark Joseph K Bradley Bay Area Apache Spark Meetup September 7, 2017
  2. 2. 2 About me Software engineer at Databricks Apache Spark committer & PMC member Ph.D. Carnegie Mellon in Machine Learning
  3. 3. 3 TEAM About Databricks Started Spark project (now Apache Spark) at UC Berkeley in 2009 3 3 PRODUCT Unified Analytics Platform MISSION Making Big Data Simple
  4. 4. 4 UNIFIED ANALYTICS PLATFORM Try Apache Spark in Databricks! •  Collaborative cloud environment •  Free version (community edition) 4 4 DATABRICKS RUNTIME 3.2 •  Apache Spark - optimized for the cloud •  Caching and optimization layer - DBIO •  Enterprise security - DBES Try for free today. databricks.com
  5. 5. 5 Apache Spark Engine … Spark Core Spark Streaming Spark SQL MLlib GraphX Unified engine across diverse workloads & environments Scale out, fault tolerant Python, Java, Scala, & R APIs Standard libraries
  6. 6. 6
  7. 7. 7 Spark Packages 340+ packages written for Spark 80+ packages for ML and Graphs E.g.: • GraphFrames: DataFrame-based graphs • Bisecting K-Means: now part of MLlib • Stanford CoreNLP integration: UDFs for NLP spark-packages.org
  8. 8. 8 Outline Intro to GraphFrames Moving implementations to DataFrames •  Vertex indexing •  Scaling Connected Components •  Other challenges: skewed joins and checkpoints Future of GraphFrames
  9. 9. 9 Graphs vertex edge id City State “JFK” “New York” NY Example: airports & flights between them JFK IAD LAX SFO SEA DFW src dst delay tripID “JFK” “SEA” 45 1058923
  10. 10. 10 Apache Spark’s GraphX library Overview •  General-purpose graph processing library •  Optimized for fast distributed computing •  Library of algorithms: PageRank, Connected Components, etc. 10 Challenges •  No Java, Python APIs •  Lower-level RDD-based API (vs. DataFrames) •  Cannot use recent Spark optimizations: Catalyst query optimizer, Tungsten memory management
  11. 11. 11 The GraphFrames Spark Package Goal: DataFrame-based graphs on Apache Spark •  Simplify interactive queries •  Support motif-finding for structural pattern search •  Benefit from DataFrame optimizations Collaboration between Databricks, UC Berkeley & MIT + Now with community contributors & committers! 11
  12. 12. 12 Graphs vertex edge JFK IAD LAX SFO SEA DFW id City State “JFK” “New York” NY src dst delay tripID “JFK” “SEA” 45 1058923
  13. 13. 13 GraphFrames 13 id City State “JFK” “New York” NY “SEA” “Seattle” WA src dst delay tripID “JFK” “SEA” 45 1058923 “DFW” “SFO” -7 4100224 vertices DataFrame edges DataFrame
  14. 14. 14 Graph analysis with GraphFrames Simple queries Motif finding Graph algorithms 14
  15. 15. 15 Simple queries SQL queries on vertices & edges 15 Simple graph queries (e.g., vertex degrees)
  16. 16. 16 Motif finding 16 JFK IAD LAX SFO SEA DFW Search for structural patterns within a graph. val paths: DataFrame = g.find(“(a)-[e1]->(b); (b)-[e2]->(c); !(c)-[]->(a)”)
  17. 17. 17 Motif finding 17 JFK IAD LAX SFO SEA DFW (b) (a)Search for structural patterns within a graph. val paths: DataFrame = g.find(“(a)-[e1]->(b); (b)-[e2]->(c); !(c)-[]->(a)”)
  18. 18. 18 Motif finding 18 JFK IAD LAX SFO SEA DFW (b) (a) (c) Search for structural patterns within a graph. val paths: DataFrame = g.find(“(a)-[e1]->(b); (b)-[e2]->(c); !(c)-[]->(a)”)
  19. 19. 19 Motif finding 19 JFK IAD LAX SFO SEA DFW (b) (a) (c) Search for structural patterns within a graph. val paths: DataFrame = g.find(“(a)-[e1]->(b); (b)-[e2]->(c); !(c)-[]->(a)”)
  20. 20. 20 Motif finding 20 JFK IAD LAX SFO SEA DFW Search for structural patterns within a graph. val paths: DataFrame = g.find(“(a)-[e1]->(b); (b)-[e2]->(c); !(c)-[]->(a)”) (b) (a) (c) Then filter using vertex & edge data. paths.filter(“e1.delay > 20”)
  21. 21. 21 Graph algorithms Find important vertices •  PageRank 21 Find paths between sets of vertices •  Breadth-first search (BFS) •  Shortest paths Find groups of vertices (components, communities) •  Connected components •  Strongly connected components •  Label Propagation Algorithm (LPA) Other •  Triangle counting •  SVDPlusPlus
  22. 22. 22 Saving & loading graphs Save & load the DataFrames. vertices = sqlContext.read.parquet(...) edges = sqlContext.read.parquet(...) g = GraphFrame(vertices, edges) g.vertices.write.parquet(...) g.edges.write.parquet(...) 22
  23. 23. 23 GraphFrames vs. GraphX 23 GraphFrames GraphX Built on DataFrames RDDs Languages Scala, Java, Python Scala Use cases Queries & algorithms Algorithms Vertex IDs Any type (in Catalyst) Long Vertex/edge attributes Any number of DataFrame columns Any type (VD, ED)
  24. 24. 24 2 types of graph libraries Graph algorithms Graph queries Standard & custom algorithms Optimized for batch processing Motif finding Point queries & updates GraphFrames: Both algorithms & queries (but not point updates)
  25. 25. 25 Outline Intro to GraphFrames Moving implementations to DataFrames •  Vertex indexing •  Scaling Connected Components •  Other challenges: skewed joins and checkpoints Future of GraphFrames
  26. 26. 26 Algorithm implementations Mostly wrappers for GraphX •  PageRank •  Shortest paths •  Strongly connected components •  Label Propagation Algorithm (LPA) •  SVDPlusPlus 26 Some algorithms implemented using DataFrames •  Breadth-first search •  Connected components •  Triangle counting •  Motif finding
  27. 27. 27 Moving implementations to DataFrames DataFrames are optimized for a huge number of small records. •  columnar storage •  code generation (“Project Tungsten”) •  query optimization (“Project Catalyst”) 27
  28. 28. 28 Outline Intro to GraphFrames Moving implementations to DataFrames •  Vertex indexing •  Scaling Connected Components •  Other challenges: skewed joins and checkpoints Future of GraphFrames
  29. 29. 29 Pros of integer vertex IDs GraphFrames take arbitrary vertex IDs. à convenient for users Algorithms prefer integer vertex IDs. à optimize in-memory storage à reduce communication Our task: Map unique vertex IDs to unique (long) integers.
  30. 30. 30 The hashing trick? • Possible solution: hash vertex ID to long integer • What is the chance of collision? •  1 - (k-1)/N * (k-2)/N * … •  seems unlikely with long range N=264 •  with 1 billion nodes, the chance is ~5.4% • Problem: collisions change graph topology. Name Hash Tim 84088 Joseph -2070372689 Xiangrui 264245405 Felix 67762524
  31. 31. 31 Generating unique IDs Spark has built-in methods to generate unique IDs. •  RDD: zipWithUniqueId(), zipWithIndex() •  DataFrame: monotonically_increasing_id() ! Possible solution: just use these methods
  32. 32. 32 How it works ParCCon 1 Vertex ID Tim 0 Joseph 1 ParCCon 2 Vertex ID Xiangrui 100 + 0 Felix 100 + 1 ParCCon 3 Vertex ID … 200 + 0 … 200 + 1
  33. 33. 33 … but not always • DataFrames/RDDs are immutable and reproducible by design. • However, records do not always have stable orderings. •  distinct •  repartition • cache() does not help. ParCCon 1 Vertex ID Tim 0 Joseph 1 ParCCon 1 Vertex ID Joseph 0 Tim 1 re-compute
  34. 34. 34 Our implementation We implemented (v0.5.0) an expensive but correct version: 1.  (hash) re-partition + distinct vertex IDs 2.  sort vertex IDs within each partition 3.  generate unique integer IDs
  35. 35. 35 Outline Intro to GraphFrames Moving implementations to DataFrames •  Vertex indexing •  Scaling Connected Components •  Other challenges: skewed joins and checkpoints Future of GraphFrames
  36. 36. 36 Connected Components Assign each vertex a component ID such that vertices receive the same component ID iff they are connected. Applications: •  fraud detection • Spark Summit 2016 keynote from Capital One •  clustering •  entity resolution 1 3 2
  37. 37. 37 Naive implementation (GraphX) 1.  Assign each vertex a unique component ID. 2.  Iterate until convergence: •  For each vertex v, update: component ID of v ß Smallest component ID in neighborhood of v Pro: easy to implement Con: slow convergence on large-diameter graphs
  38. 38. 38 Small-/large-star algorithm Kiveris et al. "Connected Components in MapReduce and Beyond." 1.  Assign each vertex a unique ID. 2.  Iterate until convergence: • (small-star) for each vertex, connect smaller neighbors to smallest neighbor • (big-star) for each vertex, connect bigger neighbors to smallest neighbor (or itself)
  39. 39. 39 Small-star operation Kiveris et al., Connected Components in MapReduce and Beyond.
  40. 40. 40 Big-star operation Kiveris et al., Connected Components in MapReduce and Beyond.
  41. 41. 41 Another interpretation 1 5 7 8 9 1 x 5 x 7 x 8 x 9 adjacency matrix
  42. 42. 42 Small-star operation 1 5 7 8 9 1 x x x 5 7 8 x 9 1 5 7 8 9 1 x 5 x 7 x 8 x 9 rotate & liK
  43. 43. 43 Big-star operation liK 1 5 7 8 9 1 x x 5 x 7 x 8 9 1 5 7 8 9 1 x 5 x 7 x 8 x 9
  44. 44. 44 Convergence 1 5 7 8 9 1 x x x x x 5 7 8 9
  45. 45. 45 Properties of the algorithm • Small-/big-star operations do not change graph connectivity. • Extra edges are pruned during iterations. • Each connected component converges to a star graph. • Converges in log2(#nodes) iterations
  46. 46. 46 Implementation Iterate: • filter • self-join Challenge: handle these operations at scale.
  47. 47. 47 Outline Intro to GraphFrames Moving implementations to DataFrames •  Vertex indexing •  Scaling Connected Components •  Other challenges: skewed joins and checkpoints Future of GraphFrames
  48. 48. 48 Skewed joins Real-world graphs contain big components. à data skew during connected components iterations src dst 0 1 0 2 0 3 0 4 … … 0 2,000,000 1 3 2 5 src Component id neighbors 0 0 2,000,000 1 0 10 2 3 5 join
  49. 49. 49 Skewed joins 4 src dst 0 1 0 2 0 3 0 4 … … 0 2,000,000 hash join 1 3 2 5 broadcast join (#nbrs > 1,000,000) union src Component id neighbors 0 0 2,000,000 1 0 10 2 3 5
  50. 50. 50 Checkpointing We checkpoint every 2 iterations to avoid: •  query plan explosion (exponential growth) •  optimizer slowdown •  disk out of shuffle space •  unexpected node failures 5
  51. 51. 51 Experiments twitter-2010 from WebGraph datasets (small diameter) •  42 million vertices, 1.5 billion edges 16 r3.4xlarge workers on Databricks •  GraphX: 4 minutes •  GraphFrames: 6 minutes –  algorithm difference, checkpointing, checking skewness 5
  52. 52. 52 Experiments uk-2007-05 from WebGraph datasets •  105 million vertices, 3.7 billion edges 16 r3.4xlarge workers on Databricks •  GraphX: 25 minutes –  slow convergence •  GraphFrames: 4.5 minutes 5
  53. 53. 53 Experiments regular grid 32,000 x 32,000 (large diameter) •  1 billion nodes, 4 billion edges 32 r3.8xlarge workers on Databricks •  GraphX: failed •  GraphFrames: 1 hour 5
  54. 54. 54 Experiments regular grid 50,000 x 50,000 (large diameter) •  2.5 billion nodes, 10 billion edges 32 r3.8xlarge workers on Databricks •  GraphX: failed •  GraphFrames: 1.6 hours 5
  55. 55. 55 Future improvements GraphFrames •  update inefficient code (due to Spark 1.6 compatibility) •  better graph partitioning •  letting Spark SQL handle skewed joins and iterations •  graph compression Connected Components •  local iterations •  node pruning and better stopping criteria
  56. 56. 56 https://spark-summit.org/eu-2017/ 15% Discount code: Databricks
  57. 57. hRp://dbricks.co/2sK35XT
  58. 58. https://databricks.com/company/careers
  59. 59. Thank you! Get started with GraphFrames Docs, downloads & tutorials http://graphframes.github.io https://docs.databricks.com Dev community Github issues & PRs Twitter: @jkbatcmu à I’ll share my slides.

×