Giraph at Hadoop Summit 2014

Presentation of Apache Giraph at Hadoop Summit 2014 Amsterdam

  1. Apache Giraph: Large-scale Graph Processing on Hadoop
     Claudio Martella <claudio@apache.org>, @claudiomartella
     Hadoop Summit @ Amsterdam, 3 April 2014
  2. (no text)
  3. Graphs are simple
  4. A computer network
  5. A social network
  6. A semantic network
  7. A map
  8. Graphs are huge
     • Google’s index contains 50B pages
     • Facebook has around 1.1B users
     • Google+ has around 570M users
     • Twitter has around 530M users
     VERY rough estimates!
  9. (no text)
  10. Graphs aren’t easy
  11. Graphs are nasty.
  12. Each vertex depends on its neighbours, recursively.
  13. Recursive problems are nicely solved iteratively.
  14. PageRank in MapReduce
     • Record: < v_i, pr, [ v_j, ..., v_k ] >
     • Mapper: emits < v_j, pr / #neighbours >
     • Reducer: sums the partial values
  15. MapReduce dataflow
  16. Drawbacks
     • Each job is executed N times
     • Job bootstrap
     • Mappers send PR values and structure
     • Extensive IO at input, shuffle & sort, output
  17. (no text)
  18. Timeline
     • Inspired by Google Pregel (2010)
     • Donated to ASF by Yahoo! in 2011
     • Top-level project in 2012
     • 1.0 release in January 2013
     • 1.1 release in days (2014)
  19. Plays well with Hadoop
  20. Vertex-centric API
  21. BSP machine
  22. BSP & Giraph
  23. Advantages
     • No locks: message-based communication
     • No semaphores: global synchronization
     • Iteration isolation: massively parallelizable
  24. Architecture
  25. Giraph job lifetime
  26. Designed for iterations
     • Stateful (in-memory)
     • Only intermediate values (messages) sent
     • Hits the disk at input, output, checkpoint
     • Can go out-of-core
  27. A bunch of other things
     • Combiners (minimise messages)
     • Aggregators (global aggregations)
     • MasterCompute (executed on master)
     • WorkerContext (executed per worker)
     • PartitionContext (executed per partition)
  28. Shortest Paths
  29. Shortest Paths
  30. Shortest Paths
  31. Shortest Paths
  32. Shortest Paths
  33. Composable API
  34. Checkpointing
  35. No SPoFs
  36. Giraph scales
     ref: https://www.facebook.com/notes/facebook-engineering/scaling-apache-giraph-to-a-trillion-edges/10151617006153920
  37. Giraph is fast
     • 100x over MR (PR)
     • Jobs run within minutes
     • Given you have the resources ;-)
  38. Serialised objects
  39. Primitive types
     • Autoboxing is expensive
     • Object overhead (JVM)
     • Use primitive types on your own
     • Use primitive type-based libs (e.g. fastutil)
  40. Sharded aggregators
  41. Many stores with Gora
  42. And graph databases
  43. Current and next steps
     • Out-of-core graph and messages
     • Jython interface
     • Remove Writable from < I V E M >
     • Partitioned supernodes
     • More documentation
  44. Giraph in Action
     • Published by Manning
     • MEAP now
     • Complete Q3 2014 (well...)
     • Part 1: Graphs and Algorithms
     • Part 2: Giraph Basic Topics
     • Part 3: Giraph Advanced Topics
     • http://www.manning.com/martella
  45. Okapi
     • Apache Mahout for graphs
     • Graph-based recommenders: ALS, SGD, SVD++, etc.
     • Graph analytics: Graph partitioning, Community Detection, K-Core, etc.
  46. Thank you
     Claudio Martella <claudio@apache.org>, @claudiomartella
     http://giraph.apache.org
     Some figures gently borrowed from Nitay Joffe: http://www.slideshare.net/nitayj/20130910-giraph-at-london-hadoop-users-group
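
The "PageRank in MapReduce" slide describes a record format, a mapper and a reducer. The following is a minimal sketch of that scheme in plain Python, not Hadoop code: the function names (`mapper`, `reducer`, `run_iteration`, `pagerank`), the tagged tuples and the three-vertex graph are all assumptions made for illustration. It follows the slide literally, so the reducer only sums partial values and no damping factor is applied. It also makes the "Drawbacks" slide visible: every iteration is a full job, and the mapper has to re-send the graph structure alongside the PR values.

```python
from collections import defaultdict


def mapper(vertex, pr, neighbours):
    """Emit the adjacency list (structure) plus one PR share per neighbour."""
    # The structure must travel with the values, or the next job loses the graph.
    yield vertex, ("structure", neighbours)
    for n in neighbours:
        yield n, ("pr", pr / len(neighbours))


def reducer(vertex, values):
    """Sum the partial PR values and reattach the adjacency list."""
    neighbours, total = [], 0.0
    for kind, payload in values:
        if kind == "structure":
            neighbours = payload
        else:
            total += payload
    return vertex, total, neighbours


def run_iteration(records):
    """One simulated MapReduce job: map, shuffle (group by key), reduce."""
    shuffled = defaultdict(list)
    for vertex, pr, neighbours in records:
        for key, value in mapper(vertex, pr, neighbours):
            shuffled[key].append(value)
    return [reducer(v, vals) for v, vals in sorted(shuffled.items())]


def pagerank(records, iterations):
    """Run the job N times, feeding each output back in as the next input."""
    for _ in range(iterations):
        records = run_iteration(records)
    return records


# Hypothetical input: records of the form < v_i, pr, [ v_j, ..., v_k ] >.
graph = [("a", 1 / 3, ["b", "c"]), ("b", 1 / 3, ["c"]), ("c", 1 / 3, ["a"])]
ranks = {v: pr for v, pr, _ in pagerank(graph, 20)}
```

Because every vertex in this toy graph has at least one outgoing edge, the total rank is conserved from one iteration to the next.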

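The "Vertex-centric API", "BSP machine" and "Shortest Paths" slides carry no transcribed text, so here is a rough sketch of the idea they name: single-source shortest paths computed in supersteps, in the Pregel style. This is a single-process simulation in plain Python and not the Giraph API; `compute`, `shortest_paths`, the inbox dictionaries and the example graph are hypothetical names chosen for illustration. A vertex only sends messages when its distance improves, it stays inactive until a message arrives, and a min-combiner keeps one message per target.

```python
import math


def compute(vertex_id, value, edges, messages, superstep, source):
    """Per-vertex logic: keep the smallest known distance and
    propagate it along outgoing edges only when it improved."""
    candidate = 0 if (superstep == 0 and vertex_id == source) else math.inf
    candidate = min([candidate, *messages])
    outgoing = {}
    if candidate < value:
        value = candidate
        for target, weight in edges.items():
            outgoing[target] = value + weight
    # Returning corresponds to voting to halt until new messages arrive.
    return value, outgoing


def shortest_paths(graph, source):
    """Run supersteps until no messages are in flight."""
    values = {v: math.inf for v in graph}
    inbox = {v: [] for v in graph}
    superstep = 0
    while superstep == 0 or any(inbox.values()):
        next_inbox = {v: [] for v in graph}
        for v in graph:
            if superstep > 0 and not inbox[v]:
                continue  # halted vertex without messages stays inactive
            values[v], outgoing = compute(
                v, values[v], graph[v], inbox[v], superstep, source
            )
            for target, dist in outgoing.items():
                # Min-combiner: keep only the best message per target.
                next_inbox[target] = [min(next_inbox[target] + [dist])]
        # Global barrier: messages become visible only in the next superstep.
        inbox = next_inbox
        superstep += 1
    return values


# Hypothetical weighted graph: vertex -> {neighbour: edge weight}.
graph = {"a": {"b": 1, "c": 4}, "b": {"c": 2, "d": 6}, "c": {"d": 1}, "d": {}}
distances = shortest_paths(graph, "a")
```

Unlike the MapReduce version, the graph structure stays in memory across supersteps and only the messages move, which is the point made on the "Designed for iterations" slide.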