Parallel Combinatorial Computing and Sparse Matrices

Re-defining scalability for irregular problems, starting with bipartite matching.

Transcript

  • 1. Parallel Combinatorial Computing and Sparse Matrices
    E. Jason Riedy, James Demmel
    Computer Science Division, EECS Department, University of California, Berkeley
    SIAM Conference on Computational Science and Engineering, 2005
  • 2. Outline
    Fundamental question: What performance metrics are right?
    Background
    Algorithms: Sparse Transpose, Weighted Bipartite Matching
    Setting Performance Goals
    Ordering Ideas
  • 3-4. App: Sparse LU Factorization
    Characteristics: Large quantities of numerical work. Eats memory and flops. Benefits from parallel work. And needs combinatorial support.
  • 5. App: Sparse LU Factorization
    Combinatorial support: Fill-reducing orderings, pivot avoidance, data structures.
    Numerical work is distributed, so the supporting algorithms need to be distributed.
    Memory may be cheap ($100/GB), but moving data is costly.
  • 6. Sparse Transpose
    Data structure manipulation: a dense transpose moves numbers; a sparse transpose moves numbers and re-indexes them.
    Sequentially space-efficient “algorithms” exist, but they disagree with most processors: chains of data-dependent loads, unpredictable memory patterns.
    If the data is already distributed, an unoptimized parallel transpose is better than an optimized sequential one!
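
A sparse transpose is mostly a counting and scattering pass over the index structure. A minimal sequential sketch, assuming the matrix is stored in compressed sparse row (CSR) form and the transpose is returned in the same form (equivalently, the original in CSC); the function and array names are illustrative, not taken from the talk:

```python
import numpy as np

def csr_transpose(n_rows, n_cols, row_ptr, col_ind, values):
    """Transpose a CSR matrix by counting entries per column, then scattering.

    Returns (t_row_ptr, t_col_ind, t_values): the transpose in CSR form.
    """
    nnz = len(col_ind)

    # Count the entries destined for each row of the transpose.
    counts = np.zeros(n_cols + 1, dtype=np.int64)
    for j in col_ind:
        counts[j + 1] += 1
    t_row_ptr = np.cumsum(counts)

    t_col_ind = np.empty(nnz, dtype=np.int64)
    t_values = np.empty(nnz, dtype=values.dtype)
    next_slot = t_row_ptr[:-1].copy()      # next free position in each output row

    # Scatter: every entry (i, j, v) becomes (j, i, v) in the transpose.
    for i in range(n_rows):
        for k in range(row_ptr[i], row_ptr[i + 1]):
            j = col_ind[k]
            dest = next_slot[j]
            t_col_ind[dest] = i
            t_values[dest] = values[k]
            next_slot[j] += 1
    return t_row_ptr, t_col_ind, t_values
```

The scatter writes land at data-dependent positions in the output arrays, which is exactly the unpredictable memory pattern the slide complains about.
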
  • 7-10. Parallel Sparse Transpose
    [Diagram: four processors P1-P4.]
    Many, many options:
    Send to root, transpose, distribute: communicates most of the matrix twice, and one node stores the whole matrix. Note: we should compare against this implementation, not the purely sequential one.
    Transpose, send pieces to destinations: all-to-all communication, some parallel work.
    Transpose, then rotate data: serial communication, but may hide latencies.
    Replicate the matrix, transpose everywhere: useful in some circumstances.
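
The second option above (transpose local pieces, send them to their destinations) maps naturally onto a single all-to-all exchange. A hedged mpi4py sketch, assuming the matrix and its transpose are both distributed by block rows and that entries travel as (row, column, value) triples; this illustrates the idea and is not the implementation behind the measurements that follow:

```python
from mpi4py import MPI  # illustrative sketch, not the measured implementation

def parallel_transpose(local_entries, n_cols):
    """local_entries: list of (i, j, value) triples owned by this rank.

    Rows of the transpose (columns of the original, n_cols of them) are
    assumed block-partitioned across ranks in order.
    """
    comm = MPI.COMM_WORLD
    p = comm.Get_size()
    rows_per_rank = (n_cols + p - 1) // p          # block size for the transpose

    # Locally "transpose" each triple and bucket it by destination rank.
    outgoing = [[] for _ in range(p)]
    for (i, j, v) in local_entries:
        dest = j // rows_per_rank                  # owner of row j of the transpose
        outgoing[dest].append((j, i, v))

    # One all-to-all exchange delivers every entry to its new owner.
    incoming = comm.alltoall(outgoing)
    return [entry for bucket in incoming for entry in bucket]
```

Each nonzero crosses the network at most once, so the data stays distributed even though every rank performs little work per entry.
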
  • 11-12. What Data Is Interesting?
    Time (to solution).
    How much data is communicated.
    Overhead and latency.
    Quantity of data resident on a processor.
  • 13. Parallel Transpose Performance
    [Plots: "Transpose" (speed-up) and "Transpose v. root" (re-scaled speed-up) versus number of processors.]
    All-to-all is slower than pure sequential code, but distributed.
    Actual speed-up when the data is already distributed.
    Hard to keep constant size per node when performance varies by problem.
    Data from the CITRIS cluster (Itanium 2s with Myrinet).
  • 14-15. Weighted Bipartite Matching
    Not moving data around, but finding where it should go.
    Find the “best” edges in a bipartite graph. Corresponds to picking the “best” diagonal.
    Used for static pivoting in factorization. Also in travelling salesman problems, etc.
  • 16. Algorithms
    Depth-first search: Reliable performance, code available (MC64). Requires A and Aᵀ; difficult to compute on distributed data.
    Interior point: Performance varies wildly; many tuning parameters. Full generality: solve a larger sparse system.
    Auction algorithms: Replace the solve with iterative bidding. Easy to distribute.
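
To make the auction idea concrete: bidders (rows) repeatedly bid for their most valuable columns at the current prices, and prices rise until every bidder holds a distinct column. A minimal sequential sketch for a dense, square, max-benefit assignment with a fixed epsilon; this is neither MC64 nor the distributed code measured on the next slide, and all names are illustrative:

```python
import numpy as np

def auction_assignment(benefit, eps=1e-3):
    """Forward auction for a dense, square, max-benefit assignment (sketch).

    benefit: (n, n) array. Returns owner, where owner[j] is the row assigned
    to column j. With a fixed eps the result is within n*eps of optimal.
    """
    n = benefit.shape[0]
    price = np.zeros(n)
    owner = np.full(n, -1)              # owner[j]: row currently holding column j
    unassigned = list(range(n))         # rows still without a column

    while unassigned:
        i = unassigned.pop()
        values = benefit[i] - price     # net value of each column for bidder i
        j = int(np.argmax(values))
        best = values[j]
        values[j] = -np.inf
        second = float(np.max(values)) if n > 1 else best
        # Raise the price by the bidder's margin plus eps so prices always move.
        price[j] += best - second + eps
        if owner[j] != -1:
            unassigned.append(owner[j])     # the previous owner is outbid
        owner[j] = i
    return owner
```

Because a bid reads only the price vector and raises a single price, processes can bid for their locally owned rows concurrently and reconcile prices afterwards, which is what makes the method easy to distribute.
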
  • 17. Parallel Auction Performance
    [Plots: "Auction" (speedup) and "Auction v. root" (re-scaled speedup) versus number of processors.]
    Compared with running an auction on the root, a parallel auction achieves slight speed-up.
  • 18. Proposed Performance Goals
    When is a distributed combinatorial algorithm (or code) successful?
    It does not redistribute the data excessively.
    It keeps the data distributed: no process sees more than the whole.
    Its performance is competitive with the on-root option.
    Pure speed-up is a great goal, but not always reasonable.
  • 19. Distributed Matrix Ordering
    Finding a permutation of columns and rows to reduce fill. NP-hard to solve, difficult to approximate.
  • 20. Sequential Orderings
    Bottom-up: Pick columns (and possibly rows) in sequence. Heuristic choices: minimum degree, deficiency, and approximations. Maintain a symbolic factorization.
    Top-down: Separate the matrix into multiple sections. Graph partitioning on A + Aᵀ, Aᵀ·A, or A. Needs vertex separators: difficult.
    Top-down hybrid: Dissect until small, then order.
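
As a rough illustration of the bottom-up family (not AMD or any code referenced in these slides), a greedy minimum-degree ordering on the symmetric graph of A + Aᵀ can be sketched as follows; eliminating a vertex joins its remaining neighbors, which models the fill created by that pivot:

```python
import heapq

def minimum_degree_order(adj):
    """Greedy minimum-degree ordering (illustrative sketch only).

    adj: dict mapping vertex -> set of neighbors (symmetric graph, e.g. the
    graph of A + A^T); vertices are assumed orderable, e.g. integers.
    Production codes (AMD, MMD) work on quotient graphs and approximate
    degrees instead of updating the graph explicitly like this.
    """
    adj = {v: set(nbrs) for v, nbrs in adj.items()}   # work on a copy
    heap = [(len(nbrs), v) for v, nbrs in adj.items()]
    heapq.heapify(heap)
    eliminated = set()
    order = []

    while heap:
        deg, v = heapq.heappop(heap)
        if v in eliminated or deg != len(adj[v]):
            continue                                  # stale heap entry
        order.append(v)
        eliminated.add(v)
        nbrs = adj.pop(v)
        # Eliminating v makes its remaining neighbors pairwise adjacent (fill).
        for u in nbrs:
            adj[u].discard(v)
            adj[u].update(nbrs - {u})
        for u in nbrs:
            heapq.heappush(heap, (len(adj[u]), u))    # lazy degree update
    return order
```
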
  • 21. Parallel Bottom-up Hybrid Order
    1. Separate the graph into chunks. Needs an edge separator and knowledge of the pivot.
    2. Order each chunk separately. Forms local partial orders.
    3. Merge the orders. What needs to be communicated?
  • 22. Merging Partial Orders
    Respecting partial orders: Local, symbolic factorization is done once. Only the quotient graph needs to be communicated.
    Quotient graph: Implicit edges for Schur complements.
    No node will communicate more than the whole matrix.
  • 23. Preliminary Quality Results
    Merging heuristic: Pairwise merge. Pick the head pivot with the least worst-case fill (Markowitz cost).
    Small (tiny) matrices: performance not reliable.

    Matrix     Method         NNZ increase
    west2021   AMD (A + Aᵀ)   1.51×
               merging        1.68×
    orani678   AMD            2.37×
               merging        6.11×

    Increasing the numerical work drastically spends any savings from computing a distributed order. Need better heuristics?
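
The merging comparison above can be written down directly using the standard Markowitz cost (r - 1)(c - 1), where r and c are the current row and column counts of a candidate pivot. A small sketch; the (row, column) pivot representation and the count arrays are assumptions made for illustration, not the talk's data structures:

```python
def markowitz_cost(row_count, col_count):
    """Worst-case fill created by a pivot with the given row/column counts."""
    return (row_count - 1) * (col_count - 1)

def pick_head_pivot(head_a, head_b, row_counts, col_counts):
    """One pairwise-merge step: between the head pivots of two partial orders,
    given as (row, column) pairs, keep the one with the least worst-case fill."""
    cost_a = markowitz_cost(row_counts[head_a[0]], col_counts[head_a[1]])
    cost_b = markowitz_cost(row_counts[head_b[0]], col_counts[head_b[1]])
    return head_a if cost_a <= cost_b else head_b
```
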
  • 24. Summary
    Meeting classical expectations of scaling is difficult: relatively small amounts of computation for much communication, and problem-dependent performance makes equi-size scaling hard.
    But consolidation costs when the data is already distributed. In a supporting role, don't sweat the speed-up; keep the problem distributed.
    Open topics: Any new ideas for parallel ordering?