Pregel: A System for Large-Scale Graph Processing

These are the slides for a presentation I recently gave at a seminar on Tools for High-Performance Computing with Big Graphs. They cover Google's Pregel system, which runs graph algorithms in a scalable manner.

Presentation Transcript

  • 1. Pregel: A System for Large-Scale Graph Processing
    Written by G. Malewicz et al. at SIGMOD 2010
    Presented by Chris Bunch, Tuesday, October 12, 2010
  • 2. Graphs are hard
    • Poor locality of memory access
    • Very little work per vertex
    • Changing degree of parallelism
    • Running over many machines makes the problem worse
  • 3. State of the Art Today
    • Write your own infrastructure
      • Substantial engineering effort
    • Use MapReduce
      • Inefficient - must store graph state in each stage, too much communication between stages
  • 4. State of the Art Today
    • Use a single-computer graph library
      • Not scalable ☹
    • Use existing parallel graph systems
      • No fault tolerance ☹
  • 5. Bulk Synchronous Parallel
    • Series of iterations (supersteps)
    • Each vertex invokes a function in parallel
    • Can read messages sent in the previous superstep
    • Can send messages, to be read at the next superstep
    • Can modify state of outgoing edges
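
To make the superstep structure concrete, here is a minimal sketch of one superstep from a single worker's point of view. Every name in it (RunSuperstep, IsActive, HasPendingMessages, TakeIncomingMessages, GlobalBarrier) is an illustrative assumption, not Pregel's actual internals:

    // Illustrative only: one BSP superstep as a worker might execute it.
    void RunSuperstep(std::vector<Vertex*>& my_partition) {
      for (Vertex* v : my_partition) {
        // A vertex runs if it is still active or has pending messages.
        if (v->IsActive() || v->HasPendingMessages()) {
          v->Compute(v->TakeIncomingMessages());  // reads superstep S-1 messages;
                                                  // anything sent lands in S+1
        }
      }
      GlobalBarrier();  // no worker starts superstep S+1 until all finish S
    }
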
  • 6. Compute Model
    • You give Pregel a directed graph
    • It runs your computation at each vertex
    • Do this until every vertex votes to halt
    • Pregel gives you a directed graph back
  • 7. Primitives
    • Vertices - first class
    • Edges - not
    • Both can be dynamically created and destroyed
  • 8. Vertex State Machine
  • 9. C++ API
    • Your code subclasses Vertex and writes a Compute method
    • Can get/set vertex value
    • Can get/set outgoing edges' values
    • Can send/receive messages
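
The paper sketches this API as a C++ template class. The declaration below is reconstructed from the paper's Figure 3, so treat the exact signatures as approximate:

    // The Vertex base class, after the paper's Figure 3. Template parameters
    // give the types of the vertex value, edge value, and message value.
    template <typename VertexValue, typename EdgeValue, typename MessageValue>
    class Vertex {
     public:
      virtual void Compute(MessageIterator* msgs) = 0;  // user code, once per superstep
      const string& vertex_id() const;
      int64 superstep() const;                  // current superstep number
      const VertexValue& GetValue();            // read this vertex's value
      VertexValue* MutableValue();              // modify this vertex's value
      OutEdgeIterator GetOutEdgeIterator();     // walk/modify outgoing edges
      void SendMessageTo(const string& dest_vertex,
                         const MessageValue& message);
      void VoteToHalt();                        // deactivate until a message arrives
    };
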
  • 10. C++ API
    • Message passing:
      • No guaranteed message delivery order
      • Messages are delivered exactly once
      • Can send messages to any node
      • If the destination doesn't exist, a user-defined function is called
  • 11. C++ API
    • Combiners (off by default):
      • User specifies a way to reduce many messages into one value (a la Reduce in MR)
      • Must be commutative and associative
      • Exceedingly useful in certain contexts (e.g., 4x speedup on shortest-path computation)
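
For the shortest-path speedup mentioned above, the paper gives a minimum-combiner. The version below is reconstructed from its Figure 6, so details may differ slightly:

    // Collapses all pending distance messages bound for one vertex into a
    // single message carrying only the minimum, saving network and disk.
    class MinIntCombiner : public Combiner<int> {
      virtual void Combine(MessageIterator* msgs) {
        int mindist = INF;  // INF: larger than any real path length
        for (; !msgs->Done(); msgs->Next())
          mindist = min(mindist, msgs->Value());
        Output("combined_source", mindist);  // emit the one combined message
      }
    };
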
  • 12. C++ API
    • Aggregators:
      • User specifies a function
      • Each vertex sends it a value
      • Each vertex receives aggregate(vals)
      • Can be used for statistics or coordination
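
The slides (and the paper) describe aggregators only in prose, so the class shape below is an assumption for illustration; the method names are hypothetical:

    // Hypothetical shape of a sum aggregator; all names are illustrative.
    class SumAggregator : public Aggregator<int64> {
     public:
      void Init() { sum_ = 0; }                              // reset per superstep
      void Aggregate(const int64& value) { sum_ += value; }  // one value per vertex
      const int64& GetValue() const { return sum_; }         // global total, visible
     private:                                                // in the next superstep
      int64 sum_;
    };

For example, each vertex could contribute its out-degree during Compute, and every vertex would then read the graph's total edge count in the superstep that follows.
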
  • 13. C++ API
    • Topology mutations:
      • Vertices can create/destroy vertices at will
      • Resolving conflicting requests:
        • Partial ordering: Edge Remove, Vertex Remove, Vertex Add, Edge Add
        • User-defined handlers: you fix the conflicts on your own
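
Purely for illustration, a mutation request inside Compute might look like the following. The request-method names and the kMaxDegree constant are assumptions, not the confirmed Pregel API:

    // Hypothetical sketch: a vertex splits off a clone when its out-degree
    // grows too large. Mutations requested in superstep S take effect before
    // superstep S+1, with conflicts resolved by the partial ordering above.
    class SplittingVertex : public Vertex<int, int, int> {
     public:
      virtual void Compute(MessageIterator* msgs) {
        if (GetOutEdgeIterator().size() > kMaxDegree) {
          const string clone_id = vertex_id() + "#spill";
          AddVertexRequest(clone_id);                // new vertex next superstep
          AddEdgeRequest(vertex_id(), clone_id, 0);  // link original -> clone
        }
        VoteToHalt();
      }
    };
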
  • 14. C++ API
    • Input and output:
      • Text file
      • Vertices in a relational DB
      • Rows in BigTable
      • Custom - subclass the Reader/Writer classes
  • 15. Implementation
    • Executable is copied to many machines
    • One machine becomes the Master
      • Coordinates activities
    • Other machines become Workers
      • Perform the computation
  • 16. Implementation
    • Master partitions the graph
    • Master partitions the input
    • If a Worker receives input that is not for its vertices, it passes it along
    • Supersteps begin
    • Master can tell Workers to save graphs
  • 17. Fault Tolerance
    • At each superstep S:
      • Workers checkpoint V, E, and Messages
      • Master checkpoints Aggregators
    • If a node fails, everyone starts over at S
      • Confined recovery is under development
    • What happens if the Master fails?
  • 18. The Worker
    • Keeps graph in memory
    • Message queues for supersteps S and S+1
    • Remote messages are buffered
    • Combiner is used when messages are sent or received (saves network and disk)
  • 19. The Master
    • Keeps track of which Workers own each partition
      • Not who owns each Vertex
    • Coordinates all operations (via barriers)
    • Maintains statistics and runs an HTTP server where users can view job info
  • 20. Aggregators
    • Worker passes values to its aggregator
    • Aggregator uses a tree structure to reduce values with other aggregators
      • Better parallelism than chain pipelining
    • Final value is sent to the Master
  • 21. PageRank in Pregel (code sketched below)
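
The slide showed the paper's PageRank kernel. Reconstructed from the paper's Figure 4 (so minor details may differ), it looks roughly like this:

    // PageRank in Pregel, after the paper's Figure 4. The vertex value is
    // its current rank; messages carry rank contributions along out-edges.
    class PageRankVertex : public Vertex<double, void, double> {
     public:
      virtual void Compute(MessageIterator* msgs) {
        if (superstep() >= 1) {
          double sum = 0;
          for (; !msgs->Done(); msgs->Next())
            sum += msgs->Value();                    // tally neighbors' shares
          *MutableValue() = 0.15 / NumVertices() + 0.85 * sum;
        }
        if (superstep() < 30) {
          const int64 n = GetOutEdgeIterator().size();
          SendMessageToAllNeighbors(GetValue() / n); // split rank over out-edges
        } else {
          VoteToHalt();                              // stop after 30 supersteps
        }
      }
    };
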
  • 22. Shortest Path in Pregel (code sketched below)
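
Likewise, the slide showed the paper's single-source shortest-paths kernel, roughly the following (reconstructed from the paper's Figure 5; IsSource and INF are helpers the paper assumes):

    // Single-source shortest paths, after the paper's Figure 5. INF is a
    // distance larger than any real path; IsSource marks the start vertex.
    class ShortestPathVertex : public Vertex<int, int, int> {
      void Compute(MessageIterator* msgs) {
        int mindist = IsSource(vertex_id()) ? 0 : INF;
        for (; !msgs->Done(); msgs->Next())
          mindist = min(mindist, msgs->Value());    // best distance offered
        if (mindist < GetValue()) {                 // improved? propagate it
          *MutableValue() = mindist;
          OutEdgeIterator iter = GetOutEdgeIterator();
          for (; !iter.Done(); iter.Next())
            SendMessageTo(iter.Target(), mindist + iter.GetValue());
        }
        VoteToHalt();  // reawakened only if a shorter distance arrives
      }
    };
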
  • 23. Evaluation
    • 300 multicore commodity PCs used
    • Only running time is counted
      • Checkpointing disabled
    • Measures scalability of Worker tasks
    • Measures scalability w.r.t. # of Vertices
      • On binary trees and log-normal random graphs
  • 24.-26. [Evaluation result charts]
  • 27. Current / Future Work
    • Graph must fit in RAM - working on spilling to/from disk
    • Assigning vertices to machines to optimize traffic is an open problem
    • Want to investigate dynamic re-partitioning
  • 28. Conclusions
    • Pregel is production-ready and in use
    • Usable after a short learning curve
    • Vertex-centric thinking is not always easy
    • Pregel works best on sparse graphs with communication over edges
    • Can’t change the API - too many people using it!
  • 29. Related Work
    • Hama - from the Apache Hadoop team
      • BSP model, but not vertex-centric a la Pregel
      • Appears not to be ready for real use
  • 30. Related Work
    • Phoebus, released last week on GitHub
      • Runs on Mac OS X
    • Cons (as of this writing):
      • Doesn’t work on Linux
      • Must write code in Erlang (since Phoebus is written in it)
  • 31. Thanks!
    • To my advisor, Chandra Krintz
    • To Google for this paper
    • To all of you for coming!