Pregel: A System for Large-Scale Graph Processing

These are the slides for a presentation I recently gave at a seminar on Tools for High-Performance Computing with Big Graphs. It covers Google's Pregel system, which is used for processing graph algorithms in a scalable manner.

Presentation Transcript

  • Pregel: A System for Large-Scale Graph Processing
      • Written by G. Malewicz et al., published at SIGMOD 2010
      • Presented by Chris Bunch, October 12, 2010
  • Graphs are hard
      • Poor locality of memory access
      • Very little work per vertex
      • Changing degree of parallelism
      • Running over many machines makes the problem worse
  • State of the Art Today
      • Write your own infrastructure
          • Substantial engineering effort
      • Use MapReduce
          • Inefficient: must store graph state in each stage, too much communication between stages
  • State of the Art Today (continued)
      • Use a single-computer graph library
          • Not scalable ☹
      • Use existing parallel graph systems
          • No fault tolerance ☹
  • Bulk Synchronous Parallel
      • Series of iterations (supersteps)
      • Each vertex invokes a function in parallel
      • Can read messages sent in the previous superstep
      • Can send messages, to be read at the next superstep
      • Can modify the state of outgoing edges
      • (See the sketch of the superstep loop below.)
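To make the superstep model concrete, here is a minimal single-process sketch of the BSP loop, written for this summary rather than taken from Pregel: each vertex reads the messages sent during the previous superstep, updates its value, and emits messages for the next superstep. The three-vertex graph and the message payloads are toy assumptions.

    #include <cstdio>
    #include <vector>

    struct Vertex {
      double value = 1.0;
      std::vector<int> out;       // indices of out-neighbors
      std::vector<double> inbox;  // messages sent during the previous superstep
    };

    int main() {
      std::vector<Vertex> g(3);
      g[0].out = {1, 2};  g[1].out = {2};  g[2].out = {0};

      for (int superstep = 0; superstep < 5; ++superstep) {
        std::vector<std::vector<double>> outbox(g.size());
        for (auto& v : g) {                    // conceptually parallel
          double sum = 0;
          for (double m : v.inbox) sum += m;   // read superstep S-1 messages
          v.value += sum;
          for (int t : v.out)                  // send superstep S+1 messages
            outbox[t].push_back(v.value / v.out.size());
        }
        for (size_t i = 0; i < g.size(); ++i)  // the barrier: swap mailboxes
          g[i].inbox = std::move(outbox[i]);
      }
      for (const auto& v : g) std::printf("%.3f\n", v.value);
    }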
  • Compute Model
      • You give Pregel a directed graph
      • It runs your computation at each vertex
      • This repeats until every vertex votes to halt
      • Pregel gives you a directed graph back
  • Primitives
      • Vertices are first-class; edges are not
      • Both can be dynamically created and destroyed
  • Vertex State Machine
      • (Figure only: a vertex starts active, becomes inactive when it votes to halt, and is reactivated when it receives a message.)
  • C++ API
      • Your code subclasses Vertex and writes a Compute method
      • Can get/set the vertex value
      • Can get/set the values of outgoing edges
      • Can send/receive messages
      • (See the interface sketch below.)
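The slides do not reproduce the interface itself; the sketch below is reconstructed from the Vertex class shown in the SIGMOD paper. Types such as string, int64, MessageIterator, and OutEdgeIterator are framework typedefs, and the system is not public, so treat this as an outline rather than compilable code.

    template <typename VertexValue, typename EdgeValue, typename MessageValue>
    class Vertex {
     public:
      // Called once per active vertex in every superstep.
      virtual void Compute(MessageIterator* msgs) = 0;

      const string& vertex_id() const;
      int64 superstep() const;

      const VertexValue& GetValue();
      VertexValue* MutableValue();
      OutEdgeIterator GetOutEdgeIterator();

      void SendMessageTo(const string& dest_vertex,
                         const MessageValue& message);
      void VoteToHalt();
    };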
  • C++ API: message passing
      • No guaranteed message delivery order
      • Messages are delivered exactly once
      • Can send messages to any vertex
      • If the destination doesn't exist, a user-supplied function is called
  • C++ API: combiners (off by default)
      • The user specifies a way to reduce many messages into one value (à la Reduce in MapReduce)
      • Must be commutative and associative
      • Exceedingly useful in certain contexts (e.g., a 4x speedup on the shortest-path computation; see the sketch below)
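A shortest-path combiner, for example, can collapse all distance messages bound for one vertex into their minimum before delivery. The sketch below follows the MinIntCombiner example in the SIGMOD paper; Combiner, MessageIterator, Output, and INF are framework names from that example.

    class MinIntCombiner : public Combiner<int> {
      virtual void Combine(MessageIterator* msgs) {
        int mindist = INF;
        for (; !msgs->Done(); msgs->Next())
          mindist = min(mindist, msgs->Value());  // keep only the smallest distance
        Output("combined_source", mindist);       // deliver one message instead of many
      }
    };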
  • C++ API: aggregators
      • The user specifies a function
      • Each vertex sends it a value
      • Each vertex receives aggregate(vals)
      • Can be used for statistics or coordination
      • (See the illustrative sketch below.)
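As an illustration only: the paper describes the mechanism (values contributed in superstep S are reduced and visible to every vertex in superstep S+1) but shows no aggregator code, so the method names below (Aggregate, GetAggregatedValue) are hypothetical, not the real API.

    virtual void Compute(MessageIterator* msgs) {
      // Contribute this vertex's out-degree to a global "sum" aggregator.
      Aggregate("total_edges", (int64) GetOutEdgeIterator().size());

      if (superstep() >= 1) {
        // Every vertex sees the same reduced value from the previous
        // superstep; usable for statistics or to coordinate a phase change.
        int64 total = GetAggregatedValue<int64>("total_edges");
        if (total == 0) VoteToHalt();
      }
    }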
  • C++ API: topology mutations
      • Vertices can create/destroy vertices at will
      • Resolving conflicting requests:
          • Partial ordering: edge removal, vertex removal, vertex addition, edge addition
          • User-defined handlers: you fix the conflicts on your own
      • (See the illustrative sketch below.)
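A hypothetical sketch of issuing a mutation from Compute(): the paper states that requests made in superstep S take effect in superstep S+1, but the method names here (RemoveVertexRequest, AddEdgeRequest) and the "sink" vertex are invented for illustration, not the real API.

    virtual void Compute(MessageIterator* msgs) {
      if (GetValue() == 0) {
        RemoveVertexRequest(vertex_id());        // hypothetical: retire this vertex
      } else {
        AddEdgeRequest(vertex_id(), "sink", 1);  // hypothetical: attach to a sink
      }
      VoteToHalt();
    }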
  • C++ API: input and output
      • Text files
      • Vertices in a relational database
      • Rows in BigTable
      • Custom formats: subclass the Reader/Writer classes
  • Implementation
      • The executable is copied to many machines
      • One machine becomes the Master
          • Coordinates activities
      • The other machines become Workers
          • Perform the computation
  • Implementation
      • The Master partitions the graph
      • The Master partitions the input
      • If a Worker receives input that is not for its vertices, it passes it along
      • Supersteps begin
      • The Master can tell Workers to save the graph
  • Fault Tolerance
      • At each superstep S:
          • Workers checkpoint V, E, and messages
          • The Master checkpoints aggregators
      • If a node fails, everyone starts over at superstep S
      • Confined recovery is under development
      • Open question: what happens if the Master fails?
  • The Worker
      • Keeps the graph in memory
      • Maintains message queues for supersteps S and S+1
      • Remote messages are buffered
      • The combiner is applied when messages are sent or received (saves network and disk)
  • The Master
      • Keeps track of which Worker owns each partition
          • Not which Worker owns each vertex
      • Coordinates all operations (via barriers)
      • Maintains statistics and runs an HTTP server where users can view job information
  • Aggregators
      • Each Worker passes values to its aggregator
      • The aggregator reduces values with other aggregators in a tree structure
          • Better parallelism than chain pipelining
      • The final value is sent to the Master
  • PageRank in Pregel
      • (Figure only: the PageRank vertex code; a reconstruction from the paper follows below.)
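This slide showed PageRank written as a Pregel vertex. The sketch below follows the example in the SIGMOD paper: each vertex holds its tentative rank, folds in the rank mass mailed by its in-neighbors, and, in the paper's simplified version, runs for 30 supersteps rather than testing convergence. NumVertices and SendMessageToAllNeighbors are framework methods from that example.

    class PageRankVertex : public Vertex<double, void, double> {
     public:
      virtual void Compute(MessageIterator* msgs) {
        if (superstep() >= 1) {
          double sum = 0;
          for (; !msgs->Done(); msgs->Next())
            sum += msgs->Value();            // rank mass sent by in-neighbors
          *MutableValue() = 0.15 / NumVertices() + 0.85 * sum;
        }
        if (superstep() < 30) {
          const int64 n = GetOutEdgeIterator().size();
          SendMessageToAllNeighbors(GetValue() / n);  // split rank across out-edges
        } else {
          VoteToHalt();                      // fixed number of rounds reached
        }
      }
    };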
  • Shortest Path in Pregel
      • (Figure only: the single-source shortest path vertex code; a reconstruction from the paper follows below.)
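Likewise, this slide showed single-source shortest paths. Following the paper's example: each vertex keeps its best-known distance from the source, relaxes it against incoming messages, and propagates any improvement along its out-edges; halted vertices are reactivated only when a shorter distance arrives. IsSource and INF are framework helpers from that example.

    class ShortestPathVertex : public Vertex<int, int, int> {
      void Compute(MessageIterator* msgs) {
        int mindist = IsSource(vertex_id()) ? 0 : INF;
        for (; !msgs->Done(); msgs->Next())
          mindist = min(mindist, msgs->Value());
        if (mindist < GetValue()) {   // found a shorter path: record and notify
          *MutableValue() = mindist;
          OutEdgeIterator iter = GetOutEdgeIterator();
          for (; !iter.Done(); iter.Next())
            SendMessageTo(iter.Target(), mindist + iter.GetValue());
        }
        VoteToHalt();  // wake up again only if a shorter distance arrives
      }
    };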
  • Evaluation
      • 300 multicore commodity PCs used
      • Only running time is counted
      • Checkpointing disabled
      • Measures scalability of Worker tasks
      • Measures scalability with respect to the number of vertices
          • On binary trees and graphs with log-normal degree distributions
  • (Three evaluation charts omitted: runtime scaling results.)
  • Current / Future Work
      • The graph must fit in RAM; work is underway on spilling to/from disk
      • Assigning vertices to machines to optimize traffic is an open problem
      • Want to investigate dynamic re-partitioning
  • Conclusions
      • Pregel is production-ready and in use
      • Usable after a short learning curve
      • Vertex-centric thinking is not always easy
      • Pregel works best on sparse graphs with communication over edges
      • Can't change the API: too many people are using it!
  • Related Work
      • Hama, from the Apache Hadoop team
          • BSP model, but not vertex-centric à la Pregel
          • Appears not to be ready for real use
  • Related Work
      • Phoebus, released last week on GitHub
      • Runs on Mac OS X
      • Cons (as of this writing):
          • Doesn't work on Linux
          • Must write code in Erlang (since Phoebus is written in it)
  • Thanks!
      • To my advisor, Chandra Krintz
      • To Google for this paper
      • To all of you for coming!