Parallel Combinatorial Computing and Sparse Matrices

Re-defining scalability for irregular problems, starting with bipartite matching.

1. Parallel Combinatorial Computing and Sparse Matrices
   E. Jason Riedy and James Demmel
   Computer Science Division, EECS Department, University of California, Berkeley
   SIAM Conference on Computational Science and Engineering, 2005
2. Outline
   Fundamental question: What performance metrics are right?
   - Background
   - Algorithms: Sparse Transpose, Weighted Bipartite Matching
   - Setting Performance Goals
   - Ordering Ideas
3-4. App: Sparse LU Factorization
   Characteristics:
   - Large quantities of numerical work: eats memory and flops.
   - Benefits from parallel work.
   - And needs combinatorial support.
5. App: Sparse LU Factorization
   Combinatorial support:
   - Fill-reducing orderings, pivot avoidance, data structures.
   - The numerical work is distributed, so the supporting algorithms need to be distributed.
   - Memory may be cheap ($100/GB), but moving data is costly.
6. Sparse Transpose
   Data structure manipulation:
   - A dense transpose moves numbers; a sparse transpose moves numbers and re-indexes them.
   - Sequentially space-efficient "algorithms" exist, but they disagree with most processors:
     chains of data-dependent loads and unpredictable memory access patterns.
   - If the data is already distributed, an unoptimized parallel transpose is better than an
     optimized sequential one!
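   For concreteness, here is a minimal sequential sketch (not the talk's code) of transposing
   a matrix held in compressed sparse row (CSR) form; the counting pass and the scatter show
   where the data-dependent loads and stores come from. The function name and array layout
   are assumptions of this sketch.

       import numpy as np

       def csr_transpose(n_rows, n_cols, rowptr, colind, values):
           """Return (colptr, rowind, tvalues): the CSR form of A^T
           (equivalently, A in CSC form)."""
           nnz = rowptr[-1]
           # Count the entries in each column of A (= each row of A^T).
           counts = np.zeros(n_cols + 1, dtype=np.int64)
           for j in colind:
               counts[j + 1] += 1
           colptr = np.cumsum(counts)
           # Scatter each entry into its slot; the data-dependent indexing here
           # is exactly the unpredictable access pattern mentioned above.
           next_slot = colptr[:-1].copy()
           rowind = np.empty(nnz, dtype=np.int64)
           tvalues = np.empty(nnz, dtype=float)
           for i in range(n_rows):
               for k in range(rowptr[i], rowptr[i + 1]):
                   j = colind[k]
                   dest = next_slot[j]
                   rowind[dest] = i
                   tvalues[dest] = values[k]
                   next_slot[j] += 1
           return colptr, rowind, tvalues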
7-10. Parallel Sparse Transpose
   Many, many options (diagram: the matrix spread over processors P1-P4):
   - Send to root, transpose, distribute: communicates most of the matrix twice, and the root
     node stores the whole matrix. Note: we should compare against this implementation, not a
     purely sequential one.
   - Transpose locally, send pieces to their destinations: all-to-all communication, some
     parallel work.
   - Transpose locally, then rotate the data: serial communication, but may hide latencies.
   - Replicate the matrix, transpose everywhere: useful in some circumstances.
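   A minimal mpi4py sketch (mine, not the measured code) of the "transpose locally, send
   pieces to their destinations" option, assuming entries are kept as (row, column, value)
   triples and a simple block-row distribution of both A and A^T; the single all-to-all
   exchange is the communication pattern referred to in the "all-to-all" results below.

       from mpi4py import MPI

       def block_owner(row, n, nprocs):
           """Owner of a global row under a block distribution (an assumption of
           this sketch; any distribution with a computable owner works)."""
           block = (n + nprocs - 1) // nprocs
           return row // block

       def parallel_transpose(local_entries, n, comm=MPI.COMM_WORLD):
           """local_entries: list of (i, j, a_ij) triples owned by this rank.
           Returns the entries of A^T that this rank owns afterwards."""
           nprocs = comm.Get_size()
           # Transpose locally by swapping indices, bucketing each entry by the
           # rank that owns row j of A^T.
           outgoing = [[] for _ in range(nprocs)]
           for (i, j, a_ij) in local_entries:
               outgoing[block_owner(j, n, nprocs)].append((j, i, a_ij))
           # A single (pickle-based) all-to-all exchange moves every entry home.
           incoming = comm.alltoall(outgoing)
           return [entry for bucket in incoming for entry in bucket]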
11-12. What Data Is Interesting?
   - Time (to solution).
   - How much data is communicated.
   - Overhead and latency.
   - Quantity of data resident on a processor.
13. Parallel Transpose Performance
   [Plots: "Transpose" (speed-up vs. number of processors) and "Transpose v. root"
    (re-scaled speed-up vs. number of processors).]
   - All-to-all is slower than pure sequential code, but distributed.
   - Actual speed-up when the data is already distributed.
   - Hard to keep constant size per node when performance varies by problem.
   - Data from the CITRIS cluster: Itanium 2s with Myrinet.
14-15. Weighted Bipartite Matching
   - Not moving data around, but finding where it should go.
   - Find the "best" edges in a bipartite graph.
   - Corresponds to picking the "best" diagonal.
   - Used for static pivoting in factorization.
   - Also appears in travelling salesman problems, etc.
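   As a toy illustration of "picking the best diagonal": one common objective for this
   matching is to maximize the product of the magnitudes of the matched entries. The
   brute-force snippet and the tiny matrix below are purely illustrative assumptions, far
   from the scalable algorithms on the next slide.

       from itertools import permutations
       import math

       def best_diagonal(A):
           """Brute-force the row-to-column matching that maximizes the product
           of the magnitudes of the matched entries (via a sum of logs)."""
           n = len(A)
           best_weight, best_sigma = -math.inf, None
           for sigma in permutations(range(n)):
               if all(A[i][sigma[i]] != 0 for i in range(n)):
                   weight = sum(math.log(abs(A[i][sigma[i]])) for i in range(n))
                   if weight > best_weight:
                       best_weight, best_sigma = weight, sigma
           return best_sigma  # best_sigma[i] = column matched to row i

       # Permuting columns by this matching moves large entries onto the diagonal:
       A = [[0.1, 5.0, 0.0],
            [4.0, 0.2, 0.0],
            [0.0, 0.0, 3.0]]
       print(best_diagonal(A))  # (1, 0, 2)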
16. Algorithms
   Depth-first search
   - Reliable performance; code available (MC64).
   - Requires A and A^T.
   - Difficult to compute on distributed data.
   Interior point
   - Performance varies wildly; many tuning parameters.
   - Full generality: solve a larger sparse system.
   - Auction algorithms replace the solve with iterative bidding; easy to distribute.
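   For intuition about the bidding, here is a minimal sequential auction sketch
   (Bertsekas-style, dense benefit matrix, no epsilon scaling); it is an assumption-laden
   illustration of the technique, not the talk's distributed implementation.

       def auction(benefit, eps=1e-3):
           """Sequential auction for an n-by-n benefit matrix.
           Returns match[i] = column assigned to row i."""
           n = len(benefit)
           price = [0.0] * n          # one price per column
           owner = [None] * n         # owner[j] = row currently holding column j
           match = [None] * n
           unassigned = list(range(n))
           while unassigned:
               i = unassigned.pop()
               # Row i bids on its most profitable column.
               profits = [benefit[i][j] - price[j] for j in range(n)]
               best = max(range(n), key=profits.__getitem__)
               second = max((profits[j] for j in range(n) if j != best),
                            default=profits[best])
               # Raise the price by the profit gap plus eps, evicting any prior owner.
               price[best] += profits[best] - second + eps
               if owner[best] is not None:
                   match[owner[best]] = None
                   unassigned.append(owner[best])
               owner[best] = i
               match[i] = best
           return match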
17. Parallel Auction Performance
   [Plots: "Auction" (speedup vs. number of processors) and "Auction v. root"
    (re-scaled speedup vs. number of processors).]
   Compared with running an auction on the root, a parallel auction achieves a slight speed-up.
18. Proposed Performance Goals
   When is a distributed combinatorial algorithm (or code) successful?
   - It does not redistribute the data excessively.
   - It keeps the data distributed: no process sees more than the whole.
   - Its performance is competitive with the on-root option.
   Pure speed-up is a great goal, but not always reasonable.
19. Distributed Matrix Ordering
   - Finding a permutation of columns and rows to reduce fill.
   - NP-hard to solve, and difficult to approximate.
20. Sequential Orderings
   Bottom-up
   - Pick columns (and possibly rows) in sequence.
   - Heuristic choices: minimum degree, minimum deficiency, and their approximations.
   - Maintain a symbolic factorization.
   Top-down
   - Separate the matrix into multiple sections.
   - Graph partitioning on A + A^T, A^T A, or A itself.
   - Needs vertex separators: difficult.
   Top-down hybrid
   - Dissect until the pieces are small, then order them.
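   As a reference point for the bottom-up family, a minimal greedy minimum-degree sketch on a
   symmetric sparsity graph such as that of A + A^T; it forms explicit fill cliques, whereas
   production codes (e.g., AMD) use quotient graphs and approximate degrees. The function name
   and adjacency representation are assumptions of this sketch.

       def minimum_degree(adj):
           """Greedy minimum-degree elimination order.
           adj: dict mapping vertex -> set of neighbours (no self loops)."""
           adj = {v: set(nbrs) for v, nbrs in adj.items()}  # work on a copy
           order = []
           while adj:
               v = min(adj, key=lambda u: len(adj[u]))      # cheapest current pivot
               nbrs = adj.pop(v)
               order.append(v)
               # Eliminating v makes its neighbours pairwise adjacent: fill edges.
               for u in nbrs:
                   adj[u].discard(v)
                   adj[u] |= nbrs - {u}
           return order

       # Example on the 4-cycle a-b-c-d-a:
       print(minimum_degree({'a': {'b', 'd'}, 'b': {'a', 'c'},
                             'c': {'b', 'd'}, 'd': {'a', 'c'}}))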
21. Parallel Bottom-up Hybrid Order
   1. Separate the graph into chunks.
      Needs an edge separator and knowledge of the pivot.
   2. Order each chunk separately.
      Forms local partial orders.
   3. Merge the orders.
      What needs to be communicated?
22. Merging Partial Orders
   Respecting partial orders
   - Local, symbolic factorization is done once.
   - Only the quotient graph needs to be communicated.
   - Quotient graph: implicit edges for Schur complements.
   - No node will communicate more than the whole matrix.
23. Preliminary Quality Results
   Merging heuristic
   - Pairwise merge.
   - Pick the head pivot with the least worst-case fill (Markowitz cost).
   - Small (tiny) matrices: performance not reliable.

   Matrix     Method          NNZ increase
   west2021   AMD (A + A^T)   1.51×
   west2021   merging         1.68×
   orani678   AMD             2.37×
   orani678   merging         6.11×

   Increasing the numerical work drastically spends any savings from computing a distributed
   order. Need better heuristics?
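   For reference (a sketch of the standard formula, not the talk's code), the worst-case fill
   estimate named above is the classical Markowitz count; rowcnt and colcnt are assumed to
   hold the current nonzero counts of the partially eliminated matrix.

       def markowitz_cost(i, j, rowcnt, colcnt):
           """Worst-case fill from pivoting on entry (i, j): every other nonzero
           in row i can pair with every other nonzero in column j."""
           return (rowcnt[i] - 1) * (colcnt[j] - 1)

       # Picking the head pivot of least Markowitz cost among candidate (i, j) entries:
       def best_head_pivot(candidates, rowcnt, colcnt):
           return min(candidates, key=lambda ij: markowitz_cost(*ij, rowcnt, colcnt))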
24. Summary
   - Meeting classical expectations of scaling is difficult.
     Relatively small amounts of computation for much communication.
     Problem-dependent performance makes equi-size scaling hard.
   - But consolidation costs when the data is already distributed.
   - In a supporting role, don't sweat the speed-up. Keep the problem distributed.
   Open topics
   - Any new ideas for parallel ordering?
