
Interactive Graph Analytics with Spark (Daniel Darabos, Lynx Analytics)


Presentation at Spark Summit 2015

  • Two things I forgot to mention: The sorted join is basically the same idea as in sort-based shuffle, just applied at a different step (merging two partitions, instead of merging the shuffle blocks in the reducer). The other thing is that all this is implemented via the standard Spark API. Thanks to the powerful API we didn't have to modify any internals. Sorted join is implemented with zipPartitions.


  1. Interactive Graph Analytics with Spark. A talk by Daniel Darabos about the design and implementation of the LynxKite analytics application.
  2. About LynxKite: an analytics web application built with AngularJS + Play! Framework + Apache Spark. Each LynxKite “project” is a graph. Graph operations mutate state (a typical big data workload, taking minutes to hours); visualizations take a few seconds and are the topic of this talk.
  3. Idea 1: Avoid processing unused attributes
  4. Column-based attributes:
        type ID = Long
        case class Edge(src: ID, dst: ID)
        type VertexRDD = RDD[(ID, Unit)]
        type EdgeRDD = RDD[(ID, Edge)]
        type AttributeRDD[T] = RDD[(ID, T)]  // Vertex attribute or edge attribute? Could be either!
  5. Column-based attributes: we can process just the attributes we need, and adding an attribute is easy. The scheme is simple and flexible (it even allows edges between the vertices of two different graphs, or between the edges of two different graphs), but it implies a lot of joining; a small sketch of such a join follows below.
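A minimal sketch, in spark-shell style, of what that joining looks like. The attribute names (age, income) and their values are made up; the type aliases mirror the previous slide:

        type ID = Long
        type AttributeRDD[T] = org.apache.spark.rdd.RDD[(ID, T)]
        // Two vertex attributes stored as separate columns (hypothetical data).
        val age: AttributeRDD[Double] = sc.parallelize(Seq(1L -> 30.0, 2L -> 40.0))
        val income: AttributeRDD[Double] = sc.parallelize(Seq(1L -> 60000.0, 2L -> 80000.0))
        // Deriving a new attribute joins exactly the two columns involved;
        // every other attribute of the graph stays untouched (and unread).
        val incomePerYear: AttributeRDD[Double] =
          age.join(income).mapValues { case (a, i) => i / a }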
  6. Idea 2: Make joins fast
  7. Co-located loading: a join is faster for co-partitioned RDDs, because Spark only has to fetch one partition, and faster still for co-located RDDs, where the partition is already in the right place. When loading attributes we therefore make a seemingly useless join that causes the two RDDs to be co-located.
  8. Co-located loading:
        val attributeRDD = sc.loadObjectFile[(ID, T)](path)
  9. Co-located loading:
        val rawRDD = sc.loadObjectFile[(ID, T)](path)
        val attributeRDD = vertexRDD.join(rawRDD).mapValues(_._2)
  10. Co-located loading:
        val rawRDD = sc.loadObjectFile[(ID, T)](path)
        val attributeRDD = vertexRDD.join(rawRDD).mapValues(_._2)
        attributeRDD.cache
      (A sketch of why this join co-locates the data follows below.)
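A sketch of why the seemingly useless join helps, again in spark-shell style and with made-up data standing in for an attribute freshly loaded from disk. The key assumption, as on the earlier slides, is that vertexRDD is hash-partitioned and cached:

        import org.apache.spark.HashPartitioner
        val vertexRDD = sc.parallelize(0L until 100L).map(id => (id, ()))
          .partitionBy(new HashPartitioner(4)).cache()
        // A freshly loaded attribute has no partitioner yet.
        val rawRDD = sc.parallelize(0L until 100L).map(id => (id, id * 2.0))
        val attributeRDD = vertexRDD.join(rawRDD).mapValues(_._2).cache()
        // join() reuses vertexRDD's partitioner and mapValues preserves it, so the
        // attribute ends up co-partitioned with the vertex set; and because the
        // join runs where vertexRDD's cached partitions already live, the cached
        // attribute partitions land on the same executors (co-located).
        assert(attributeRDD.partitioner == vertexRDD.partitioner)

Later joins between vertexRDD, attributeRDD and other attributes loaded the same way then need no shuffle at all.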
  11. Scheduler delay: what is the ideal number of partitions for speed? At least N partitions for N cores, otherwise some cores are wasted; but any more than that just wastes time on scheduling tasks. (A rough timing sketch follows after the chart below.)
  12. [Chart: scheduler delay for sc.parallelize(1 to 1000, n).count, milliseconds (0 to 400) against the number of partitions (20 to 100).]
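A rough way to reproduce this kind of measurement in spark-shell; this is plain wall-clock timing, a simplification of whatever benchmark produced the chart:

        // Count a tiny dataset with an increasing number of partitions; with so
        // little data, the time is dominated by per-task scheduling overhead.
        for (n <- Seq(20, 40, 60, 80, 100)) {
          val start = System.nanoTime
          sc.parallelize(1 to 1000, n).count()
          println(s"$n partitions: ${(System.nanoTime - start) / 1000000} ms")
        }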
  13. GC pauses can be tens of seconds on high-memory machines, which is bad for the interactive experience, so we need to avoid creating big objects. (A note on making GC pauses visible follows after the chart below.)
  14. [Chart: GC pauses for sc.parallelize(1 to 10000000, 1).cache.count, milliseconds (0 to 1000).]
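One hypothetical way to make such pauses visible while investigating, assuming standard JVM GC logging on the executors is acceptable; the flags below are ordinary JDK 8 options set through Spark's stock spark.executor.extraJavaOptions setting, nothing LynxKite-specific:

        import org.apache.spark.SparkConf
        // Turn on GC logging for the executors of this application.
        val conf = new SparkConf()
          .setAppName("gc-visibility-sketch")
          .set("spark.executor.extraJavaOptions",
               "-verbose:gc -XX:+PrintGCDetails -XX:+PrintGCTimeStamps")
        // Per-task GC time is also visible as the "GC Time" column in the Spark web UI.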
  15. Sorted RDDs speed up the last step of the join: the insides of the partitions are kept sorted, and merging sorted sequences is fast and needs no large hashmap. This gives a 10× speedup plus a GC benefit (2× if the cost of sorting is included, but the sorting cost is amortized across many joins), and it benefits other operations too (e.g. distinct). A zipPartitions-based sketch of such a merge join follows after the chart below.
  16. [Chart: Sorted RDDs, join performance on 4 partitions, milliseconds (0 to 10,000) against RDD rows (1.5M to 7.5M), comparing a normal join with a sorted RDD join.]
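As the speaker's comment above notes, the sorted join is built on zipPartitions. The following is a minimal sketch of the idea, not LynxKite's actual code; it assumes the two RDDs are co-partitioned, each partition is sorted by key, and keys are unique:

        import org.apache.spark.rdd.RDD

        def sortedJoin[A, B](left: RDD[(Long, A)], right: RDD[(Long, B)]): RDD[(Long, (A, B))] =
          left.zipPartitions(right, preservesPartitioning = true) { (leftIt, rightIt) =>
            val l = leftIt.buffered
            val r = rightIt.buffered
            new Iterator[(Long, (A, B))] {
              // Skip keys present on only one side until the two heads match.
              private def align(): Unit =
                while (l.hasNext && r.hasNext && l.head._1 != r.head._1) {
                  if (l.head._1 < r.head._1) l.next() else r.next()
                }
              def hasNext: Boolean = { align(); l.hasNext && r.hasNext }
              def next(): (Long, (A, B)) = {
                align()
                val (k, a) = l.next()
                (k, (a, r.next()._2))
              }
            }
          }

The merge is a single pass over each partition and allocates no hashmap, which is where the GC benefit mentioned two slides earlier comes from.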
  17. Idea 3: Do not read/compute all the data
  18. Are all numbers positive?
        def allPositive(rdd: RDD[Double]): Boolean =
          rdd.filter(_ > 0).count == rdd.count
        // Terrible. It executes the RDD twice.
  19. Are all numbers positive?
        def allPositive(rdd: RDD[Double]): Boolean =
          rdd.filter(_ <= 0).count == 0
        // A bit better, but it still executes the whole RDD.
  20. Are all numbers positive?
        def allPositive(rdd: RDD[Double]): Boolean =
          rdd.mapPartitions { p => Iterator(p.forall(_ > 0)) }
            .collect.forall(_ == true)
        // Each partition is only processed up to the first negative value.
  21. Are all numbers positive?
        def allPositive(rdd: RDD[Double]): Boolean =
          rdd.mapPartitions { p => Iterator(p.forall(_ > 0)) }
            .collect.forall(identity)
        // Each partition is only processed up to the first negative value.
  22. Prefix sampling: partitions are sorted by the randomly assigned ID, so taking the first N elements is an unbiased sample, and lazy evaluation means the rest are not even computed. Used for histograms and bucketed views; a small sketch follows below.
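A minimal sketch of the idea, assuming (as in the slides) that vertex IDs were assigned randomly and each partition is kept sorted by ID; the helper name and the per-partition sample size are made up:

        import org.apache.spark.rdd.RDD

        // Take the first elements of every partition. Because IDs are random,
        // this is an unbiased sample, and because iterators are lazy, the rest
        // of each partition is never computed.
        def prefixSample[T](attribute: RDD[(Long, T)], perPartition: Int): RDD[(Long, T)] =
          attribute.mapPartitions(_.take(perPartition), preservesPartitioning = true)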
  23. Idea 4: Lookup instead of filtering for small key sets
  24. Restricted ID sets: we cannot use sampling when showing 5 vertices, and it is hard to explain why showing 5 million is faster. But the partitions are already sorted, so we can use binary search to look up attributes; putting the partitions into arrays gives us random access. A sketch follows below.
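A minimal sketch of such a lookup, under the assumption that each partition has been cached as an array of (id, value) pairs sorted by id; the representation (RDD[Array[(Long, T)]]) and the helper name are made up for illustration, not LynxKite's actual types:

        import org.apache.spark.rdd.RDD

        // Look up a small set of IDs with binary search inside each sorted partition.
        def lookup[T](sortedPartitions: RDD[Array[(Long, T)]], ids: Set[Long]): Map[Long, T] =
          sortedPartitions.flatMap { partition =>
            val keys: Array[Long] = partition.map(_._1)
            ids.iterator.flatMap { id =>
              val i = java.util.Arrays.binarySearch(keys, id)
              if (i >= 0) Some(id -> partition(i)._2) else None
            }
          }.collect().toMap

In practice the partitioner can also tell us which partitions can contain the requested IDs, so only those partitions need to be visited at all.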
  25. Restricted ID sets
  26. Summary: column-oriented attributes; a small number of co-located, cached partitions; sorted RDDs; prefix sampling; binary search-based lookup.
  27. Backup slides
  28. Comparison with GraphX: we benchmarked connected components, a big data payload (not interactive). Speed is dominated by the number of shuffle stages: the same number of shuffles means the same speed, despite the simpler data structures in LynxKite, and the better algorithm in LynxKite (from “A Model of Computation for MapReduce”) means fewer shuffles. Benchmarked without the short-circuit optimization.
  29. [Chart: comparison with GraphX, connected component search time in milliseconds (0 to 16,000) against the number of edges (200 to 1,200), for GraphX, LynxKite without the shortcut, and LynxKite.]
  30. [Chart: comparison with GraphX, connected component search time in milliseconds (0 to 60,000) against the number of edges (10,000 to 110,000), for GraphX, LynxKite without the shortcut, and LynxKite.]
