Dynamic Draph / Iterative Computation on Apache Giraph

639
-1

Published on

Published in: Technology
0 Comments
1 Like
Statistics
Notes
  • Be the first to comment

No Downloads
Views
Total Views
639
On Slideshare
0
From Embeds
0
Number of Embeds
1
Actions
Shares
0
Downloads
0
Comments
0
Likes
1
Embeds 0
No embeds

No notes for slide

Dynamic Draph / Iterative Computation on Apache Giraph

  1. 1. Dynamic graph / iterative computation on Apache Giraph 6/3/2014 Avery Ching Hadoop Summit
  2. 2. Motivation
  3. 3. Apache Giraph • Inspired by Google’s Pregel but runs on Hadoop • “Think like a vertex” • Maximum value vertex example Processor 1 Processor 2 Time 5 5 5 5 2 5 5 5 2 1 5 5 2 1
  4. 4. Giraph on Hadoop / Yarn MapReduce YARN Giraph Hadoop 0.20.x Hadoop 0.20.203 Hadoop 2.0.x Hadoop 1.x
  5. 5. Send page rank value to neighbors for 30 iterations Calculate updated page rank value from neighbors Page rank in Giraph ! ! public class PageRankComputation extends BasicComputation<LongWritable, DoubleWritable, FloatWritable, DoubleWritable> { public void compute(Vertex<LongWritable, DoubleWritable, FloatWritable> vertex, Iterable<DoubleWritable> messages) { if (getSuperstep() >= 1) { double sum = 0; for (DoubleWritable message : messages) { sum += message.get(); } vertex.getValue().set(DoubleWritable((0.15d / getTotalNumVertices()) + 0.85d * sum); } if (getSuperstep() < 30) { sendMsgToAllEdges(new DoubleWritable(getVertexValue().get() / getNumOutEdges())); } else { voteToHalt(); } } }
  6. 6. Apache Giraph data flow Loading the graph Input
 format Split0 Split1 Split2 Split3 Master Load/ Send Graph Worker0 Load/ Send Graph Worker1 Storing the graph Worker0Worker1 Output
 format Part0 Part1 Part2 Part3 Part0 Part1 Part2 Part3 Compute / Iterate Compute/ Send Messages Compute/ Send Messages In-memory graph Part0 Part1 Part2 Part3Master Worker0Worker1 Sendstats/iterate!
  7. 7. Pipelined computation Master “computes” • Sets computation, in/out message, combiner for next super step •Can set/modify aggregator values Time Worker 0 Worker 1 Master phase 1a phase 1a phase 1b phase 1b phase 2 phase 2 phase 3 phase 3
  8. 8. Use case
  9. 9. Affinity propagation Frey and Dueck “Clustering by passing messages between data points” Science 2007 Organically discover exemplars based on similarity Initialization Intermediate Convergence
  10. 10. Responsibility r(i,k) • How well suited is k to be an exemplar for i? Availability a(i,k) • How appropriate for point i to choose point k as an exemplar given all of i’s responsibilities? Update exemplars • Based on known responsibilities/availabilities, which vertex should be my exemplar? ! * Dampen responsibility, availability 3 stages
  11. 11. Responsibility Every vertex i with an edge to k maintains responsibility of k for i Sends responsibility to k in ResponsibilityMessage (senderid, responsibility(i,k)) C A D B r(c,a) r(d,a) r(b,d) r(b,a)
  12. 12. Availability Vertex sums positive messages Sends availability to i in AvailabilityMessage (senderid, availability(i,k)) C A D B a(c,a) a(d,a) a(b,d) a(b,a)
  13. 13. Update exemplars Dampens availabilities and scans edges to find exemplar k Updates self-exemplar C A D Bupdate update update update exemplar=a exemplar=d exemplar=a exemplar=a
  14. 14. Master logic calculate responsibility calculate availability update exemplars initial state halt if (exemplars agree they are exemplars && changed exemplars < ∆) then halt, otherwise continue
  15. 15. Performance
  16. 16. Faster than Hive? Application Graph Size CPU Time Speedup Elapsed Time Speedup Page rank
 (single iteration) 400B+ edges 26x 120x Friends of friends score
 71B+ edges 12.5x 48x
  17. 17. Apache Giraph scalability Scalability of workers (200B edges) Seconds 0 125 250 375 500 # of Workers 50 100 150 200 250 300 Giraph Ideal Scalability of edges (50 workers) Seconds 0 125 250 375 500 # of Edges 1E+09 7E+10 1E+11 2E+11 Giraph Ideal
  18. 18. Trillion social edges page rank Minutesperiteration 0 1 2 3 4 6/30/2013 6/2/2014 Improvements • GIRAPH-840 - Netty 4 upgrade • G1 Collector / tuning
  19. 19. Graph partitioning
  20. 20. Why balanced partitioning Random partitioning == good balance BUT ignores entity affinity 0 1 2 3 4 5 6 7 8 9 10 11
  21. 21. Balanced partitioning application Results from one service: Cache hit rate grew from 70% to 85%, bandwidth cut in 1/2 ! ! 0 2 3 5 6 9 11 1 4 7 8 10
  22. 22. Balanced label propagation results * Loosely based on Ugander and Backstrom. Balanced label propagation for partitioning massive graphs, WSDM '13
  23. 23. Partitioning experimentsSecondsperiteration 0 40 80 120 160 Random 47% Local Edges 345B edge page rank Improvements • 56% faster! • Native vertex remapping
  24. 24. Power-law graphs
  25. 25. Avoiding out-of-core Example: Mutual friends calculation between neighbors 1. Send your friends a list of your friends 2. Intersect with your friend list ! 1.23B (as of 1/2014) 200+ average friends (2011 S1) 8-byte ids (longs) = 394 TB / 100 GB machines 3,940 machines (not including the graph) A B C D E A:{D} D:{A,E} E:{D} B:{} C:{D} D:{C} A:{C} C:{A,E} E:{C} ! C:{D} D:{C} ! ! E:{}
  26. 26. Superstep splitting Subsets of sources/destinations edges per superstep * Currently manual - future work automatic! A Sources: A (on), B (off) Destinations: A (on), B (off) B B B A A A Sources: A (on), B (off) Destinations: A (off), B (on) B B B A A A Sources: A (off), B (on) Destinations: A (on), B (off) B B B A A A Sources: A (off), B (on) Destinations: A (off), B (on) B B B A A
  27. 27. Giraph in production Over 1.5 years in production Over 100 jobs processed a week 30+ applications in our internal application repository Sample production job - 700B+ edge
  28. 28. GiraphicJam demo
  29. 29. Giraph related projects Graft: The distributed Giraph debugger
  30. 30. Giraph roadmap 2/12 - 0.1 6/14 - 1.15/13 - 1.0
  31. 31. Future work Automatic checkpointing make scheduling more fair Investigate alternative computing models •Giraph++ (IBM research) •Giraphx (University at Buffalo, SUNY) Performance Lower the barrier to entry Applications
  32. 32. Our team ! Maja Kabiljo Sergey Edunov Pavan Athivarapu Avery Ching Sambavi Muthukrishnan

×