- 1. Introducing Apache Giraph for Large Scale Graph Processing. Sebastian Schelter, PhD student at the Database Systems and Information Management Group of TU Berlin; committer and PMC member at Apache Mahout and Apache Giraph. Mail: ssc@apache.org, blog: http://ssc.io
- 2. Graph recap • graph: abstract representation of a set of objects (vertices), where some pairs of these objects are connected by links (edges), which can be directed or undirected • Graphs can be used to model arbitrary things like road networks, social networks, flows of goods, etc. • The majority of graph algorithms are iterative and traverse the graph in some way [Figure: example graph with vertices A, B, C, D]
- 3. Real world graphs are really large! • the World Wide Web has several billion pages with several billion links • Facebook's social graph had more than 700 million users and more than 68 billion friendships in 2011 • Twitter's social graph has billions of follower relationships
- 4. Why not use MapReduce/Hadoop? • Example: PageRank, Google's famous algorithm for measuring the authority of a webpage based on the underlying network of hyperlinks • defined recursively: each vertex distributes its authority to its neighbors in equal proportions: $p_i = \sum_{j \in \{ j \mid (j, i) \}} \frac{p_j}{d_j}$, where $d_j$ is the number of outgoing edges of vertex $j$
- 5. Textbook approach to PageRank in MapReduce • PageRank p is the principal eigenvector of the Markov matrix M defined by the transition probabilities between web pages • it can be obtained by iteratively multiplying an initial PageRank vector by M (power iteration): $p = M^k p_0$ [Figure: one iteration $p_{i+1} = M p_i$, computed as the dot product of each row of M (row 1 to row n) with $p_i$]
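The power iteration described on this slide can be sketched in a few lines of plain Java. This is a minimal single-machine illustration, not MapReduce code: the class name, the method names and the 3-vertex graph are all assumptions made for the example, and it follows the slide in using no damping factor.

```java
import java.util.Arrays;

/** Sketch of PageRank by power iteration (no damping), on a single machine. */
final class PowerIterationSketch {

    /** Builds M with M[i][j] = 1/d_j for every edge j -> i (d_j = out-degree of j). */
    static double[][] transitionMatrix(int[][] outEdges) {
        int n = outEdges.length;
        double[][] m = new double[n][n];
        for (int j = 0; j < n; j++) {
            for (int i : outEdges[j]) {
                m[i][j] = 1.0 / outEdges[j].length;
            }
        }
        return m;
    }

    /** Computes M^k * p0 by k repeated matrix-vector products. */
    static double[] powerIteration(double[][] m, double[] p0, int k) {
        double[] p = p0.clone();
        for (int step = 0; step < k; step++) {
            double[] next = new double[p.length];
            for (int i = 0; i < m.length; i++) {
                for (int j = 0; j < p.length; j++) {
                    next[i] += m[i][j] * p[j]; // row i of M dotted with p
                }
            }
            p = next;
        }
        return p;
    }

    public static void main(String[] args) {
        // Hypothetical graph: 0 -> {1, 2}, 1 -> {0, 2}, 2 -> {1}
        int[][] outEdges = {{1, 2}, {0, 2}, {1}};
        double[] p0 = {1.0 / 3, 1.0 / 3, 1.0 / 3};
        double[] p = powerIteration(transitionMatrix(outEdges), p0, 2);
        System.out.println(Arrays.toString(p));
    }
}
```

In a MapReduce setting each pass of the outer loop would be a separate job, which is where the overhead discussed on the next slide comes from.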
- 6. Drawbacks• Not intuitive: only crazy scientists think in matrices and eigenvectors• Unnecessarily slow: Each iteration is scheduled as separate MapReduce job with lots of overhead – the graph structure is read from disk – the map output is spilled to disk – the intermediary result is written to HDFS• Hard to implement: a join has to be implemented by hand, lots of work, best strategy is data dependent
- 7. Google Pregel • distributed system developed specifically for large scale graph processing • intuitive API that lets you 'think like a vertex' • Bulk Synchronous Parallel (BSP) as execution model • fault tolerance by checkpointing
- 8. Bulk Synchronous Parallel (BSP) [Figure: processors alternating between local computation, communication and barrier synchronization; one such round is a superstep]
- 9. Vertex-centric BSP • each vertex has an id, a value, a list of its adjacent vertex ids and the corresponding edge values • each vertex is invoked in each superstep, can recompute its value and send messages to other vertices, which are delivered over superstep barriers • advanced features: termination votes, combiners, aggregators, topology mutations [Figure: vertex1, vertex2 and vertex3 exchanging messages across supersteps i, i + 1 and i + 2]
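To make the superstep, message and termination-vote mechanics concrete, here is a small single-process sketch. It is not Pregel or Giraph code: the class `BspSketch` and its methods are hypothetical, and the algorithm (every vertex adopts the smallest vertex id it has heard of) is simply an assumed example that fits the model. Each vertex votes to halt after every superstep and is only woken up again by an incoming message; the run ends when no vertex is active.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

/** Single-process sketch of vertex-centric BSP: propagate the minimum vertex id. */
final class BspSketch {

    static List<List<Integer>> emptyBoxes(int n) {
        List<List<Integer>> boxes = new ArrayList<>();
        for (int i = 0; i < n; i++) {
            boxes.add(new ArrayList<>());
        }
        return boxes;
    }

    /** neighbors[v] lists the vertices adjacent to v (undirected graph). */
    static int[] minLabels(int[][] neighbors) {
        int n = neighbors.length;
        int[] value = new int[n];
        boolean[] halted = new boolean[n];
        List<List<Integer>> inbox = emptyBoxes(n);

        for (int superstep = 0; ; superstep++) {
            List<List<Integer>> outbox = emptyBoxes(n);
            boolean anyActive = false;
            for (int v = 0; v < n; v++) {
                // A halted vertex is skipped unless a message reactivates it.
                if (halted[v] && inbox.get(v).isEmpty()) {
                    continue;
                }
                anyActive = true;
                boolean changed = false;
                if (superstep == 0) {
                    value[v] = v;
                    changed = true;
                }
                for (int msg : inbox.get(v)) {
                    if (msg < value[v]) {
                        value[v] = msg;
                        changed = true;
                    }
                }
                if (changed) {
                    for (int u : neighbors[v]) {
                        outbox.get(u).add(value[v]); // delivered in the next superstep
                    }
                }
                halted[v] = true; // vote to halt
            }
            if (!anyActive) {
                return value;
            }
            inbox = outbox; // superstep barrier
        }
    }

    public static void main(String[] args) {
        int[][] neighbors = {{1}, {0, 2}, {1}, {4}, {3}};
        System.out.println(Arrays.toString(minLabels(neighbors)));
    }
}
```

On an undirected graph the final value of each vertex is the smallest id in its connected component, so the same loop doubles as a simple connected-components computation.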
- 10. Master-slave architecture • vertices are partitioned and assigned to workers – default: hash-partitioning – custom partitioning possible • master assigns and coordinates, while workers execute vertices and communicate with each other [Figure: a Master connected to Worker 1, Worker 2 and Worker 3]
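Hash-partitioning can be illustrated with a short sketch. This shows only the general idea under the assumption that vertex ids are `long` values; `HashPartitionSketch`, `workerFor` and `partition` are hypothetical names and do not correspond to Giraph's actual partitioner classes.

```java
import java.util.ArrayList;
import java.util.List;

/** Sketch of hash-partitioning: a vertex goes to worker hash(id) mod numWorkers. */
final class HashPartitionSketch {

    /** floorMod keeps the result in [0, numWorkers) even for negative hash codes. */
    static int workerFor(long vertexId, int numWorkers) {
        return Math.floorMod(Long.hashCode(vertexId), numWorkers);
    }

    /** Groups the given vertex ids by the worker they are assigned to. */
    static List<List<Long>> partition(long[] vertexIds, int numWorkers) {
        List<List<Long>> partitions = new ArrayList<>();
        for (int w = 0; w < numWorkers; w++) {
            partitions.add(new ArrayList<>());
        }
        for (long id : vertexIds) {
            partitions.get(workerFor(id, numWorkers)).add(id);
        }
        return partitions;
    }

    public static void main(String[] args) {
        System.out.println(partition(new long[]{0, 1, 2, 3, 4, 5}, 3));
    }
}
```

A scheme like this spreads vertices evenly but ignores graph structure, which is one reason a custom partitioning can be worthwhile.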
- 11. PageRank in Pregel: class PageRankVertex { void compute(Iterator messages) { if (getSuperstep() > 0) { /* recompute own PageRank from the neighbors' messages */ pageRank = sum(messages); setVertexValue(pageRank); } if (getSuperstep() < k) { /* send updated PageRank to each neighbor */ sendMessageToAllNeighbors(pageRank / getNumOutEdges()); } else { voteToHalt(); /* terminate */ } } } The update implements $p_i = \sum_{j \in \{ j \mid (j, i) \}} \frac{p_j}{d_j}$.
- 12. PageRank toy example [Figure: input graph with vertices A, B and C, annotated with vertex values and messages per superstep; vertex values are .33/.33/.33 in superstep 0, .17/.50/.34 in superstep 1 and .25/.43/.34 in superstep 2]
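The toy example can be replayed with a single-process sketch of the vertex program from the previous slide. Two things here are assumptions: the edge list (A -> B, A -> C, B -> A, B -> C, C -> B, with A, B, C mapped to 0, 1, 2) is inferred so that it is consistent with the values on the slide, and summing messages directly into the inbox stands in for what a sum combiner would do. `PregelPageRankSketch` is a hypothetical name, not a Pregel or Giraph class.

```java
import java.util.Arrays;

/** Single-process replay of the vertex-centric PageRank program (no damping). */
final class PregelPageRankSketch {

    /** Runs supersteps 0..k; outEdges[v] lists the targets of v's outgoing edges. */
    static double[] run(int[][] outEdges, int k) {
        int n = outEdges.length;
        double[] pageRank = new double[n];
        Arrays.fill(pageRank, 1.0 / n);
        double[] inbox = new double[n]; // per-vertex sum of incoming messages

        for (int superstep = 0; superstep <= k; superstep++) {
            double[] outbox = new double[n];
            for (int v = 0; v < n; v++) {
                if (superstep > 0) {
                    pageRank[v] = inbox[v]; // recompute from the neighbors' messages
                }
                if (superstep < k) {
                    double share = pageRank[v] / outEdges[v].length;
                    for (int u : outEdges[v]) {
                        outbox[u] += share; // message to each neighbor
                    }
                }
                // otherwise the vertex would vote to halt
            }
            inbox = outbox; // superstep barrier
        }
        return pageRank;
    }

    public static void main(String[] args) {
        int[][] outEdges = {{1, 2}, {0, 2}, {1}}; // assumed edges for A, B, C
        System.out.println(Arrays.toString(run(outEdges, 2)));
    }
}
```

With k = 2 the exact results are 1/4, 5/12 and 1/3, which agree with the slide's superstep 2 values up to the rounding applied to the messages there.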
- 13. Cool, where can I download it?• Pregel is proprietary, but: – Apache Giraph is an open source implementation of Pregel – runs on standard Hadoop infrastructure – computation is executed in memory – can be a job in a pipeline (MapReduce, Hive) – uses Apache ZooKeeper for synchronization
- 14. Giraph's Hadoop usage [Figure: Giraph workers running inside Hadoop TaskTrackers, with ZooKeeper and the master next to a worker in one TaskTracker, alongside the JobTracker and NameNode]
- 15. Anatomy of an execution • Setup – load the graph from disk – assign vertices to workers – validate workers' health • Compute – assign messages to workers – iterate on active vertices – call vertices' compute() • Synchronize – send messages to workers – compute aggregators – checkpoint • Teardown – write back result – write back aggregators
- 16. Who is doing what? • ZooKeeper: responsible for computation state – partition/worker mapping – global state: #superstep – checkpoint paths, aggregator values, statistics • Master: responsible for coordination – assigns partitions to workers – coordinates synchronization – requests checkpoints – aggregates aggregator values – collects health statuses • Worker: responsible for vertices – invokes the compute() function of active vertices – sends, receives and assigns messages – computes local aggregation values
- 17. What do you have to implement?• your algorithm as a Vertex – Subclass one of the many existing implementations: BasicVertex, MutableVertex, EdgeListVertex, HashMapVertex, LongDoubleFloatDoubleVertex,...• a VertexInputFormat to read your graph – e.g. from a text file with adjacency lists like <vertex> <neighbor1> <neighbor2> ...• a VertexOutputFormat to write back the result – e.g. <vertex> <pageRank>
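The two text formats mentioned above are simple enough to sketch without any Giraph classes. The helper below is hypothetical and only shows the parsing and formatting logic an input or output format would wrap; it assumes whitespace-separated, numeric vertex ids, which the slide does not specify.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical helper, not part of Giraph: handles the text formats from the slide. */
final class AdjacencyLineSketch {

    /** One parsed input line: a vertex id and the ids of its neighbors. */
    static final class ParsedVertex {
        final long id;
        final List<Long> neighbors;

        ParsedVertex(long id, List<Long> neighbors) {
            this.id = id;
            this.neighbors = neighbors;
        }
    }

    /** Parses "<vertex> <neighbor1> <neighbor2> ..."; throws NumberFormatException on bad input. */
    static ParsedVertex parse(String line) {
        String[] tokens = line.trim().split("\\s+");
        long id = Long.parseLong(tokens[0]);
        List<Long> neighbors = new ArrayList<>();
        for (int i = 1; i < tokens.length; i++) {
            neighbors.add(Long.parseLong(tokens[i]));
        }
        return new ParsedVertex(id, neighbors);
    }

    /** Formats one result line as "<vertex> <pageRank>". */
    static String formatResult(long id, double pageRank) {
        return id + " " + pageRank;
    }

    public static void main(String[] args) {
        ParsedVertex v = parse("1 2 3");
        System.out.println(v.id + " -> " + v.neighbors);
        System.out.println(formatResult(v.id, 0.25));
    }
}
```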
- 18. Starting a Giraph job • no different from starting a Hadoop job (o.a.g abbreviates org.apache.giraph): $ hadoop jar giraph-0.1-jar-with-dependencies.jar o.a.g.GiraphRunner o.a.g.examples.ConnectedComponentsVertex --inputFormat o.a.g.examples.IntIntNullIntTextInputFormat --inputPath hdfs:///wikipedia/pagelinks.txt --outputFormat o.a.g.examples.ComponentOutputFormat --outputPath hdfs:///wikipedia/results/ --workers 89 --combiner o.a.g.examples.MinimumIntCombiner
- 19. What's to come? • Current and future work in Giraph – graduation from the incubator – out-of-core messaging – algorithms library • 2-day workshop after Berlin Buzzwords – topic: 'Parallel Processing beyond MapReduce' – meet the developers of Giraph and Stratosphere http://berlinbuzzwords.de/content/workshops-berlin-buzzwords
- 20. Everything is a network!
- 21. Further resources • Apache Giraph homepage: http://incubator.apache.org/giraph • Claudio Martella: "Apache Giraph: Distributed Graph Processing in the Cloud", http://prezi.com/9ake_klzwrga/apache-giraph-distributed-graph-processing-in-the-cloud/ • Malewicz et al.: "Pregel – a system for large scale graph processing", PODC 09, http://dl.acm.org/citation.cfm?id=1582723
- 22. Thank you. Questions?
