Successfully reported this slideshow.
We use your LinkedIn profile and activity data to personalize ads and to show you more relevant ads. You can change your ad preferences anytime.

pptx - Distributed Parallel Inference on Large Factor Graphs


Published on

  • Be the first to comment

  • Be the first to like this

pptx - Distributed Parallel Inference on Large Factor Graphs

  1. 1. Distributed Parallel Inference on Large Factor Graphs<br />Joseph E. Gonzalez<br />Yucheng Low<br />Carlos Guestrin<br />David O’Hallaron<br />TexPoint fonts used in EMF. <br />Read the TexPoint manual before you delete this box.: AAAAAAAAAAA<br />
  2. 2. Exponential Parallelism<br />Exponentially Increasing <br />Parallel Performance<br />Exponentially Increasing<br />Sequential Performance<br />Constant Sequential<br />Performance<br />Processor Speed GHz<br />2<br />Release Date<br />
  3. 3. Distributed Parallel Setting<br />3<br />Fast Reliable Network<br />Node<br />Node<br />Memory<br />Memory<br />CPU<br />Bus<br />CPU<br />Bus<br />Cache<br />Cache<br />Opportunities:<br />Access to larger systems: 8 CPUs  1000 CPUs<br />Linear Increase:<br />RAM, Cache Capacity, and Memory Bandwidth<br />Challenges:<br />Distributed state, Communication and Load Balancing<br />
  4. 4. Graphical Models and Parallelism<br />Graphical models provide a common language for general purpose parallel algorithms in machine learning<br />A parallel inference algorithm would improve:<br />4<br />Protein Structure Prediction<br />Movie Recommendation<br />Computer Vision<br />Inference is a key step in Learning Graphical Models<br />
  5. 5. Belief Propagation (BP)<br />Message passing algorithm<br />Naturally Parallel Algorithm<br />5<br />
  6. 6. Parallel Synchronous BP<br />Given the old messages all new messages can be computed in parallel:<br />6<br />New<br />Messages<br />Old<br />Messages<br />CPU 1<br />CPU 2<br />CPU 3<br />CPU n<br />Map-Reduce Ready!<br />
  7. 7. Hidden Sequential Structure<br />7<br />
  8. 8. Hidden Sequential Structure<br />8<br />
  9. 9. Hidden Sequential Structure<br />9<br />Evidence<br />Evidence<br />Running Time:<br />Time for a single<br />parallel iteration<br />Number of Iterations<br />
  10. 10. Optimal Sequential Algorithm<br />Running<br />Time<br />Forward-Backward<br />Naturally Parallel<br />2n2/p<br />Gap<br />2n<br />p ≤ 2n<br />p = 1<br />Optimal Parallel<br />n<br />p = 2<br />10<br />
  11. 11. Parallelism by Approximation<br />11<br />True Messages<br />1<br />2<br />3<br />4<br />5<br />6<br />7<br />8<br />9<br />10<br />τε-Approximation<br />τε represents the minimal sequential structure<br />1<br />
  12. 12. Processor 1<br />Processor 2<br />Processor 3<br />In [AIStats 09] we demonstrated that this algorithm is optimal:<br />Synchronous Schedule<br />Optimal Schedule<br />Optimal Parallel Scheduling<br />Gap<br />Parallel<br />Component<br />Sequential<br />Component<br />12<br />
  13. 13. The Splash Operation<br />Generalize the optimal chain algorithm:to arbitrary cyclic graphs:<br />~<br />13<br />Grow a BFS Spanning tree with fixed size<br />Forward Pass computing all messages at each vertex<br />Backward Pass computing all messages at each vertex<br />
  14. 14. Running Parallel Splashes<br />Splash<br />Local State<br />Local State<br />Local State<br />Key Challenges:<br />How do we schedules Splashes?<br />How do we partition the Graph?<br />CPU 2<br />CPU 3<br />CPU 1<br />Splash<br />Splash<br />Partition the graph<br />Schedule Splashes locally<br />Transmit the messages along the boundary of the partition<br />14<br />
  15. 15. Where do we Splash?<br />Assign priorities and use a scheduling queue to select roots:<br />Splash<br />Local State<br />?<br />?<br />?<br />Splash<br />Scheduling Queue<br />How do we assign priorities?<br />CPU 1<br />
  16. 16. Message Scheduling<br />Residual Belief Propagation [Elidan et al., UAI 06]: <br />Assign priorities based on change in inbound messages<br />16<br />Small Change<br />Large Change<br />Large Change<br />Small Change<br />Message<br />Message<br />Small Change:<br />Expensive No-Op<br />Large Change:<br />Informative Update<br />1<br />2<br />Message<br />Message<br />Message<br />Message<br />
  17. 17. Problem with Message Scheduling<br />Small changes in messages do not imply small changes in belief:<br />Large change in<br />belief <br />Small change in<br />all message<br />Message<br />Message<br />Message<br />Belief<br />17<br />Message<br />
  18. 18. Problem with Message Scheduling<br />Large changes in a single message do not imply large changes in belief:<br />Small change<br />in belief<br />Large change in<br />a single message<br />Message<br />18<br />Message<br />Message<br />Belief<br />Message<br />
  19. 19. Belief Residual Scheduling<br />Assign priorities based on the cumulative change in belief:<br />+<br />+<br />rv =<br />1<br />1<br />1<br />A vertex whose belief has <br />changed substantially <br />since last being updated<br />will likely produce <br />informative new messages.<br />Message<br />Change<br />19<br />
  20. 20. Message vs. Belief Scheduling<br />Belief scheduling improves accuracy more quickly<br />Belief scheduling improves convergence<br />20<br />Better<br />
  21. 21. Splash Pruning<br />Belief residuals can be used to dynamically reshape and resize Splashes:<br />Low<br />Beliefs<br />Residual<br />
  22. 22. Splash Size<br />Using Splash Pruning our algorithm is able to dynamically select the optimal splash size<br />22<br />Better<br />
  23. 23. Example<br />Many<br />Updates<br />Synthetic Noisy Image<br />Few<br />Updates<br />Vertex Updates<br />Algorithm identifies and focuses <br />on hidden sequential structure<br />Factor Graph<br />23<br />
  24. 24. Distributed Belief Residual Splash Algorithm<br />Fast Reliable Network<br />Local State<br />Local State<br />Local State<br />CPU 1<br />CPU 2<br />CPU 3<br />Scheduling<br />Queue<br />Scheduling<br />Queue<br />Scheduling<br />Queue<br />Theorem:<br />Splash<br />Splash<br />Splash<br />Given a uniform partitioning of the chain graphical model, DBRSplash will run in time:<br />retaining optimality.<br />Partition factor graph over processors<br />Schedule Splashes locally using belief residuals<br />Transmit messages on boundary<br />24<br />
  25. 25. Partitioning Objective<br />The partitioning of the factor graph determines:<br />Storage, Computation, and Communication<br />Goal: <br />Balance Computation and Minimize Communication<br />CPU 1<br />CPU 2<br />Ensure<br />Balance<br />Comm.<br />cost<br />25<br />
  26. 26. The Partitioning Problem<br />Objective:<br />Depends on:<br />NP-Hard  METIS fast partitioning heuristic <br />Work:<br />Comm:<br />26<br />Minimize <br />Communication<br />Ensure Balance<br />Update counts are not known!<br />
  27. 27. Unknown Update Counts <br />Determined by belief scheduling<br />Depends on: graph structure, factors, …<br />Little correlation between past & future update counts<br />27<br />Uninformed Cut<br />Noisy Image<br />Update Counts<br />
  28. 28. Uniformed Cuts<br />Update Counts <br />Uninformed Cut<br />Optimal Cut<br />Greater imbalance & lower communication cost <br />28<br />Too Much Work<br />Too Little Work<br />Better<br />Better<br />
  29. 29. Over-Partitioning<br />Over-cut graph into k*p partitions and randomly assign CPUs<br />Increase balance<br />Increase communication cost (More Boundary)<br />CPU 1<br />CPU 1<br />CPU 2<br />CPU 2<br />CPU 1<br />CPU 1<br />CPU 2<br />CPU 2<br />CPU 2<br />CPU 1<br />CPU 2<br />CPU 1<br />CPU 2<br />CPU 1<br />k=6<br />Without Over-Partitioning<br />29<br />
  30. 30. Over-Partitioning Results<br />Provides a simple method to trade between work balance and communication cost<br />30<br />Better<br />Better<br />
  31. 31. CPU Utilization<br />Over-partitioning improves CPU utilization:<br />31<br />
  32. 32. DBRSplash Algorithm<br />Fast Reliable Network<br />Local State<br />Local State<br />Local State<br />CPU 1<br />CPU 2<br />CPU 3<br />Scheduling<br />Queue<br />Scheduling<br />Queue<br />Scheduling<br />Queue<br />Splash<br />Splash<br />Splash<br />Over-Partition factor graph <br />Randomly assign pieces to processors<br />Schedule Splashes locally using belief residuals<br />Transmit messages on boundary<br />32<br />
  33. 33. Experiments<br />Implemented in C++ using MPICH2 as a message passing API<br />Ran on Intel OpenCirrus cluster: 120 processors <br />15 Nodes with 2 x Quad Core Intel Xeon Processors<br />Gigabit Ethernet Switch<br />Tested on Markov Logic Networks obtained from Alchemy [Domingos et al. SSPR 08]<br />Present results on largest UW-Systems and smallest UW-Languages MLNs<br />33<br />
  34. 34. Parallel Performance (Large Graph)<br />34<br />UW-Systems<br />8K Variables<br />406K Factors<br />Single Processor Running Time:<br />1 Hour<br />Linear to Super-Linear up to 120 CPUs<br />Cache efficiency<br />Linear<br />Better<br />
  35. 35. Parallel Performance (Small Graph)<br />UW-Languages<br />1K Variables<br />27K Factors<br />Single Processor Running Time:<br />1.5 Minutes<br />Linear to Super-Linear up to 30 CPUs<br />Network costs quickly dominate short running-time<br />35<br />Linear<br />Better<br />
  36. 36. Summary<br />Splash Operation generalization of the optimal parallel schedule on chain graphs<br />Belief-based scheduling<br />Addresses message scheduling issues<br />Improves accuracy and convergence<br />DBRSplash an efficient distributed parallel inference algorithm<br />Over-partitioning to improve work balance<br />Experimental results on large factor graphs:<br />Linear to super-linear speed-up using up to 120 processors<br />36<br />
  37. 37. Thank You<br />Acknowledgements<br />Intel Research Pittsburgh: OpenCirrus Cluster<br />AT&T Labs<br />DARPA<br />
  38. 38. Exponential Parallelism<br />38<br />From SamanAmarasinghe:<br />512<br />Picochip<br />PC102<br />Ambric<br />AM2045<br />256<br />Cisco<br />CSR-1<br />128<br />Intel<br />Tflops<br />64<br />32<br /># of<br />cores<br />Raza<br />XLR<br />Cavium<br />Octeon<br />Raw<br />16<br />8<br />Cell<br />Niagara<br />Xeon MP<br />Opteron 4P<br />Broadcom 1480<br />4<br />Xbox360<br />PA-8800<br />Tanglewood<br />Opteron<br />2<br />Power4<br />PExtreme<br />Power6<br />Yonah<br />4004<br />8086<br />8080<br />286<br />386<br />486<br />Pentium<br />P2<br />P3<br />Itanium<br />1<br />P4<br />8008<br />Itanium 2<br />Athlon<br />1985<br />1990<br />1980<br />1970<br />1975<br />1995<br />2000<br />2005<br />20??<br />
  39. 39. AIStats09 Speedup<br />39<br />3D –Video Prediction<br />Protein Side-Chain Prediction<br />