# Making Static Pivoting Scalable and Dependable

Thesis presentation


### Making Static Pivoting Scalable and Dependable

1. Making Static Pivoting Scalable and Dependable. Ph.D. Dissertation Talk by E. Jason Riedy (jason@acm.org), EECS Department, University of California, Berkeley. Committee: Dr. James Demmel (chair), Dr. Katherine Yelick, Dr. Sanjay Govindjee. 17 December, 2010.
2. Outline: (1) Introduction; (2) Solving Ax = b dependably; (3) Extending dependability to static pivoting; (4) Distributed matching for static pivoting; (5) Summary.
3. Motivation: Ever larger Ax = b. Systems Ax = b are growing larger and more difficult. Omega3P: n = 7.5 million with τ = 300 million entries. Quantum mechanics: precondition with blocks of dimension 200-350 thousand. Large barrier-based optimization problems: many solves, similar structure, increasing condition number. Huge systems are generated, solved, and analyzed automatically. Large, highly unsymmetric systems need scalable parallel solvers. Low-level routines: no expert in the loop!
4. Motivation: Solving Ax = b better. Many people work to solve Ax = b faster; today we start with how to solve it better. Better enables faster. Use extra floating-point precision within iterative refinement to obtain a dependable solution, adding O(n²) work after an O(n³) factorization. Accelerate sparse factorization through static pivoting, decoupling the symbolic and numeric phases. Refine the perturbed solution without needing extra triangular solves for condition estimation.
5. Contributions. Iterative refinement: extend iterative refinement to provide small forward errors dependably (to be defined); set and use a methodology to demonstrate dependability; show that condition estimation (expensive for sparse systems) is not necessary for obtaining a dependable solution. Static pivoting: improve static pivoting heuristics; demonstrate that an approximate maximum weight bipartite matching is faster and just as accurate; develop a memory-scalable distributed memory auction algorithm for static pivoting.
6. Defining "dependable". A dependable solver for Ax = b returns a result x with small error often enough that you expect success with a small error, and clearly signals results that likely contain large errors.

   | True error | Difficulty | Alg. reports | Likeliness |
   |---|---|---|---|
   | O(mach. precision) | not bad | success | Very likely |
   | O(mach. precision) | not bad | failure | Somewhat rare |
   | larger | not bad | success | (not yet seen) |
   | larger | not bad | failure | Practically certain |
   | O(mach. precision) | difficult | success | Whenever feasible |
   | O(mach. precision) | difficult | failure | Practically certain |
   | larger | difficult | success | (not yet seen) |
   | larger | difficult | failure | Very likely |
7. Introducing the errors and targets. [Figure: diagram of (A, b), the perturbed (A, b), x = A⁻¹b, and the computed y₁; two panels plot error vs. difficulty.] LU alone: small backward error, but the error in y grows in proportion to difficulty.
8. Introducing the errors and targets (continued). Refined: accepted with small errors in y, or flagged with unknown error. [Figure: error vs. difficulty for successful vs. flagged systems.]
9. Iterative refinement: Newton's method applied to Ax = b. Repeat until done: (1) compute the residual r_i = b − A y_i using extra precision ε_r; (2) solve A dy_i = r_i for the correction using working precision ε_w; (3) increment y_{i+1} = y_i + dy_i, maintaining y to extra precision ε_x. Precisions: the working precision ε_w is the precision used for storing (and factoring) A: IEEE-754 single (ε_w = 2⁻²⁴), double (ε_w = 2⁻⁵³), etc. The residual precision ε_r is at least double the working precision, ε_r ≤ ε_w². The solution precision ε_x is at least double the working precision, ε_x ≤ ε_w². The latter two may be implemented in software.
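A minimal sketch of the three-step loop above, using float32 as the working precision with float64 standing in for the doubled precisions ε_r and ε_x. The function name is mine, and the repeated dense solves stand in for reusing one LU factorization; this is an illustration, not the dissertation's code:

```python
import numpy as np

def refine(A32, b, steps=10):
    """Iterative refinement: factor/solve in float32, residual and
    accumulated solution in float64."""
    A64 = A32.astype(np.float64)
    # Initial solve entirely in working precision.
    y = np.linalg.solve(A32, b.astype(np.float32)).astype(np.float64)
    for _ in range(steps):
        r = b - A64 @ y                                  # residual in extra precision
        dy = np.linalg.solve(A32, r.astype(np.float32))  # correction in working precision
        y = y + dy.astype(np.float64)                    # accumulate y in extra precision
    return y
```

For a well-conditioned system, the refined solution approaches the accuracy of the higher precision, well beyond what the float32 solve alone delivers.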
10. Definitions. Errors: backward (relative) error and forward (relative) error. Difficulty: condition numbers (sensitivity to perturbations) and element growth (error from factorization).
11. Error measures: Backward error. How close is the nearest system satisfying Ay₁ = b? Three ways, given r₁ = b − Ay₁. Normwise: ‖r₁‖∞ / (‖A‖∞ ‖y₁‖∞ + ‖b‖∞). Componentwise: ‖ |r₁| / (|A| |y₁| + |b|) ‖∞, with elementwise division and 0/0 = 0. Columnwise: ‖r₁‖∞ / ((max |A|) |y₁| + ‖b‖∞), where taking the max down each column of |A| produces a row vector.
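The three measures translate directly into NumPy; a sketch (the function name and the explicit 0/0 = 0 handling are mine):

```python
import numpy as np

def backward_errors(A, y, b):
    """Normwise, componentwise, and columnwise backward errors of y for Ax = b."""
    r = b - A @ y
    norm_inf = lambda v: np.max(np.abs(v))
    A_inf = np.max(np.abs(A).sum(axis=1))        # ||A||_inf: max absolute row sum
    nberr = norm_inf(r) / (A_inf * norm_inf(y) + norm_inf(b))
    denom = np.abs(A) @ np.abs(y) + np.abs(b)
    ratio = np.divide(np.abs(r), denom,
                      out=np.zeros_like(denom),  # 0/0 treated as 0
                      where=(denom != 0))
    cberr = np.max(ratio)
    colmax = np.abs(A).max(axis=0)               # max|A|: a row vector of column maxes
    colberr = norm_inf(r) / (colmax @ np.abs(y) + norm_inf(b))
    return nberr, cberr, colberr
```

An exact solution drives all three to zero; a perturbed one yields small positive values that the refinement loop watches.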
12. Error measures: Forward error. How close is y₁ to x? Two ways and two measuring sticks. Normwise: ‖y₁ − x‖∞ / ‖x‖∞ or ‖y₁ − x‖∞ / ‖y₁‖∞. Componentwise: ‖(y₁ − x)/x‖∞ or ‖(y₁ − x)/y₁‖∞.
13. Error sensitivity: Conditioning. How sensitive is y₁ to perturbations in A and b? forward error ≤ condition number × backward error. Each combination of error measures has a condition number; we choose two for use in our difficulty measure.
14. Difficulty: condition number × element growth. Condition numbers: for the backward error, κ(A⁻¹) = κ(A) = ‖A⁻¹‖∞ ‖A‖∞; for the normwise forward error, κ(A, x, b) = ‖A⁻¹‖∞ (‖A‖∞ ‖x‖∞ + ‖b‖∞); for the componentwise forward error, ccond(A, x, b) = ‖ |A⁻¹| (|A| |x| + |b|) ‖∞. Element growth, estimating δA_i in (A + δA_i)y = b: |δA_i| ≤ 3 n_d |L| |U| ≤ p(n_d) g max |A|. We use a column-scaling-independent expression allowing |L| > 1: g_c = max_j [ (max_{1≤k≤j} max_i |L(i,k)|) · (max_i |U(i,j)|) / max_i |A(i,j)| ].
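The column-scaling-independent growth g_c above is a few lines of NumPy on dense factors; a sketch (the function name is mine, and |L| and |U| are taken as given dense arrays):

```python
import numpy as np

def column_growth(L, U, A):
    """g_c = max_j [ (max_{k<=j} max_i |L(i,k)|) * max_i |U(i,j)| / max_i |A(i,j)| ]."""
    colmaxL = np.abs(L).max(axis=0)            # max_i |L(i,k)| for each column k
    prefixL = np.maximum.accumulate(colmaxL)   # running max over k <= j
    colmaxU = np.abs(U).max(axis=0)
    colmaxA = np.abs(A).max(axis=0)
    return np.max(prefixL * colmaxU / colmaxA)
```

With no growth (U's column maxes matching A's), g_c = 1; larger values flag the factorization error term in the difficulty measure.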
15. Dense test systems (three slides). 30 × 30 single, double, complex, and double complex: 250k systems each, 4 right-hand sides, 1M test systems in total. The size was chosen to sample the ill-conditioned region well. Generated as in Demmel et al., plus b → x. [Histograms: percent of population vs. difficulty, measured by κ∞(A) = ‖A⁻¹‖∞‖A‖∞, by κ(A, x, b) = ‖A⁻¹‖∞(‖A‖∞‖x‖∞ + ‖b‖∞), and by ccond(A, x, b) = ‖|A⁻¹|(|A||x| + |b|)‖∞, for each of the four precisions.]
18. Results: Dependable errors. [Figure: error vs. difficulty for each error measure (nberr, colberr, cberr, nferr, nferrx, cferr, cferrx), grouped by refinement outcome: Converged, No Progress, Unstable, Iteration Limit.]
19. How? [Figure: cberr and cferr vs. difficulty after refinement.] Carry the intermediate solution y_i to twice the working precision. Refine the backward error down to nearly ε_w². By "forward error ≤ conditioning × backward error", the forward error for well-enough conditioned problems is nearly ε_w.
21. Results: Comparison with xGESVXX.

   | Precision | Accepted (well) | Accepted (ill) | Rejected (well) | Rejected (ill) |
   |---|---|---|---|---|
   | Single | 79% | 15% | 1% | 5% |
   | Single complex | 76% | 19% | 1% | 4% |
   | Double | 87% | 9% | 1% | 5% |
   | Double complex | 85% | 11% | 1% | 3% |

   Accepted, ill-conditioned systems are those gained by our routine that xGESVXX rejects. Rejected, well-conditioned systems are those lost by our routine but accepted by xGESVXX.
22. Results: Iteration counts (four slides: single, single complex, double, and double complex precision). [Figures: number of iterations vs. difficulty for each measure (nberr, colberr, cberr, ndx, cdx), grouped by outcome: Converged, No Progress, Unstable, Iteration Limit.] Iteration limits were set at five (single), seven (single complex), ten (double), and 15 (double complex).
26. Static pivoting. If a pivot |A(j, j)| < T, perturb up to T by adding sign(A(j, j)) · (T − |A(j, j)|). This forcibly increases the backward error but decreases element growth. In sparse systems, few updates should occur to an entry, so large diagonal entries should remain large. Thresholding heuristics: SuperLU, T = γ · ‖A‖₁; column-relative, T = γ · max |A(:, j)|; diagonal-relative, T = γ · |A(j, j)|; with γ = 2⁻²⁶ ≈ √ε_w, 2⁻³⁸, or 2⁻⁴³ = 2¹⁰ ε_w.
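The perturbation rule and the three thresholds are tiny to state in code; a sketch (the function names are mine, and taking the diagonal-relative threshold against the entry's original value is my reading of the heuristic):

```python
import numpy as np

def threshold(A, j, gamma, heuristic="superlu", orig_diag=None):
    """Pivot threshold T for column j under the three slide heuristics."""
    if heuristic == "superlu":
        return gamma * np.max(np.abs(A).sum(axis=0))  # ||A||_1: max absolute column sum
    if heuristic == "column":
        return gamma * np.max(np.abs(A[:, j]))        # column-relative
    if heuristic == "diagonal":
        return gamma * abs(orig_diag)                 # relative to the original diagonal
    raise ValueError(heuristic)

def perturb_pivot(ajj, T):
    """If |ajj| < T, push the pivot out to magnitude T, keeping its sign."""
    if abs(ajj) >= T:
        return ajj
    s = 1.0 if ajj >= 0 else -1.0
    return ajj + s * (T - abs(ajj))
```

A tiny pivot thus becomes ±T while a healthy pivot passes through untouched, which is what decouples the numeric phase from pivot-order changes.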
27. Sparse test systems. Matrices are from the UF Collection, chosen from existing comparisons between SuperLU, MUMPS, and UMFPACK, covering a wide range of conditioning and numerical scaling. "True" solutions are computed using a doubled-double-extended factorization and quad-double-extended refinement with a modified TAUCS. Refinement uses LAPACK-style numerical scaling throughout, but the test systems are generated in the matrix's given scaling. Also tested on singular systems; no solutions accepted. At some point, the plan is to feed the "true" solutions back into the UF Collection.
28. Sparse conditioning (two slides). [Histograms: percent of population vs. difficulty; normwise conditioning spans roughly 2¹⁰ to 2⁵⁰, componentwise conditioning roughly 2²⁰ to 2⁶⁰.]
30. Results: Perturbation heuristics before refinement (three slides: SuperLU, column-relative, and diagonal-relative). [Figures: error vs. difficulty for each error measure (nberr, colberr, cberr, nferr, nferrx, cferr, cferrx), grouped by maximum perturbation amount: γ = 2¹⁰·ε, 2⁻¹²·√ε, and √ε.]
33. Results: Perturbation heuristics after refinement, with γ = 2⁻⁴³ = 2¹⁰ ε_w (three slides: SuperLU, column-relative, and diagonal-relative).
36. Results by level and heuristic:

   | Level | Heuristic | Trust both | Trust nwise | Reject |
   |---|---|---|---|---|
   | 2⁻⁴³ = 2¹⁰·ε_f | SuperLU | 42.9% | 8.0% | 49.0% |
   | | Column-relative | 55.7% | 5.7% | 38.6% |
   | | Diagonal-relative | 55.8% | 5.9% | 38.3% |
   | 2⁻³⁸ ≈ 2⁻¹²·√ε_f | SuperLU | 36.6% | 6.7% | 56.6% |
   | | Column-relative | 52.4% | 6.5% | 41.2% |
   | | Diagonal-relative | 53.7% | 7.2% | 39.1% |
   | 2⁻²⁶ ≈ √ε_f | SuperLU | 32.4% | 4.0% | 63.6% |
   | | Column-relative | 42.2% | 4.2% | 53.6% |
   | | Diagonal-relative | 47.4% | 4.7% | 47.9% |
37. Sparse matrix to bipartite graph to pivots. [Diagram: rows and columns as vertices, explicit entries as edges, and a matching selecting the pivots.] Bipartite model: each row and each column is a vertex; each explicit entry is an edge. We want to choose the "largest" entries for pivots. Maximum weight complete bipartite matching: the linear assignment problem.
38. Mathematical form. "Just" a linear optimization problem. B: an n × n matrix of benefits in ℝ ∪ {−∞}, often c + log₂|A|. X: an n × n permutation matrix, the matching. p_r, π_c: dual variables, which will be price and profit. 1_r, 1_c: unit-entry vectors corresponding to rows and columns. Linear assignment problem: maximize Tr(BᵀX) over n × n matrices X subject to X·1_c = 1_r, Xᵀ·1_r = 1_c, and X ≥ 0. Dual problem: minimize 1_rᵀ p_r + 1_cᵀ π_c over p_r, π_c subject to p_r 1_cᵀ + 1_r π_cᵀ ≥ B.
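For a small dense matrix, an off-the-shelf LAP solver shows the same formulation in action; a sketch using SciPy's assignment solver, with a large negative finite sentinel standing in for −∞ on structural zeros (the function name and sentinel value are mine):

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def matching_for_pivots(A, sentinel=-1e30):
    """Benefits B = log2|A| on explicit entries; forbidden (zero) entries
    get a large negative sentinel. Returns matched (row, col) index arrays."""
    B = np.full(A.shape, sentinel)
    nz = A != 0
    B[nz] = np.log2(np.abs(A[nz]))
    rows, cols = linear_sum_assignment(B, maximize=True)
    return rows, cols
```

Permuting the rows (or columns) by the resulting matching places the chosen large entries on the diagonal before factorization.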
39. Mathematical form (continued). Substituting the optimal profit π_c(j) = max_{i∈R} (B(i, j) − p_r(i)) into the dual gives the implicit form: minimize over p_r the quantity 1_rᵀ p_r + Σ_{j∈C} max_{i∈R} (B(i, j) − p_r(i)).
40. Do we need a special method? The LAP in standard form: minimize cᵀx subject to Āx = 1 and x ≥ 0, where Ā is the 2n × τ vertex-edge incidence matrix (one constraint per row and column vertex). Network optimization kills simplex methods ("smoothed analysis" does not apply). Interior-point algorithms need to round the solution (and need to solve Ax = b for a much larger A, although theoretically great in NC). Combinatorial methods should be faster (but unpredictable!).
41. Properties from optimization: Complementary slackness. X ⊙ (p_r 1_cᵀ + 1_r π_cᵀ − B) = 0: if (i, j) is in the matching (X(i, j) = 1), then p_r(i) + π_c(j) = B(i, j). Used to choose matching edges and to modify the dual variables in combinatorial algorithms.
42. Properties from optimization: Relaxed problem. Introduce a parameter µ, with two interpretations: from a barrier function related to X ≥ 0, or from the auction algorithm (later). Then Tr(BᵀX∗) ≤ 1_rᵀ p_r + 1_cᵀ π_c ≤ Tr(BᵀX∗) + (n − 1)µ, so the computed dual value (and hence the computed primal matching) is within (n − 1)µ of the optimal primal. Very useful for finding approximately optimal matchings. Feasibility bound, starting from zero prices: p_r(i) ≤ (n − 1)(µ + finite range of B).
43. Algorithms for solving the LAP. Goal: a parallel algorithm that justifies buying big machines. Acceptable: a distributed algorithm; the matrix lives on many nodes. Choices: Simplex or continuous/interior-point: plain simplex blows up, network simplex is difficult to parallelize, and rounding for interior point often falls back on matching (the optimal IP algorithm of Goldberg, Plotkin, Shmoys, and Tardos needs factorization). Augmenting-path based (MC64: Duff and Koster): based on depth- or breadth-first search; both are P-complete, inherently sequential (Greenlaw, Reif). Auctions (Bertsekas et al.): only length-1 or -2 alternating paths; a global synchronization for the duals.
44. Auction algorithms. Discussion will be column-major. General structure: (1) Each unmatched column finds the "best" row and places a bid. The dual variable p_r holds the prices; the profit π_c is implicit (no significant floating-point errors!). Each entry's value is benefit B(i, j) − price p(i). A bid maximally increases the price of the most valuable row. (2) Bids are reconciled: the highest proposed price wins and forms a match; the loser needs to re-bid. Some versions need tie-breaking; here, the least column. (3) Repeat. Eventually everyone is matched, or some price grows too high. A sequential implementation runs ∼40-50 lines and can compete with MC64, with some corner cases to handle.
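A dense sequential sketch of that loop, in roughly the 40-50 lines the slide mentions. This version is mine: it skips the sparse corner cases (infinite prices for single-entry rows, tie-breaking) and assumes a complete matching exists:

```python
import numpy as np

def auction(B, mu=1e-3):
    """Dense sequential auction for an n x n benefit matrix B (n >= 2).
    price is the dual p_r; match[j] is the row assigned to column j."""
    n = B.shape[0]
    price = np.zeros(n)
    owner = np.full(n, -1)          # owner[i]: column currently holding row i
    match = np.full(n, -1)          # match[j]: row currently held by column j
    unmatched = list(range(n))
    while unmatched:
        j = unmatched.pop()
        value = B[:, j] - price     # value of each row to column j
        best = int(np.argmax(value))
        v1 = value[best]
        value[best] = -np.inf
        v2 = np.max(value)          # second-best value
        # Bid: raise the price so the best row is barely preferred, plus mu.
        price[best] += (v1 - v2) + mu
        loser = owner[best]
        if loser >= 0:              # the previous owner loses and must re-bid
            match[loser] = -1
            unmatched.append(loser)
        owner[best] = j
        match[j] = best
    return match
```

The µ increment is what guarantees termination and yields the (n − 1)µ approximation bound from the relaxed problem; shrinking µ in stages is the µ-scaling discussed later.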
45. The bid-finding loop. For each unmatched column: compute value = entry − price for each adjacent row; save the largest and second-largest values; the bid price increment is the difference between them. Differences from sparse matrix-vector products: not all columns and rows are used every iteration (sparse matrix, sparse vector), hence the output price updates are scattered; there is more local work per entry.
46. The bid-finding loop: little points. Increase the bid price by µ to avoid loops; this needs care in floating point for small µ. A single adjacent row gets an ∞ price, which affects the feasibility test and computing the dual.
47. Termination. Once a row is matched, it stays matched; a new bid may only swap it to another column. The matching (primal) increases monotonically. Prices only increase. The dual does not change when a row is newly matched, but may decrease when a row is taken; the dual decreases monotonically. Subtle part: if the dual doesn't decrease, it's still ok. One can show the new edge begins either an augmenting path that increases the matching or an alternating path that decreases the dual.
48. Successive approximation (µ-scaling). Simple auctions aren't really competitive with MC64. Start with a rough approximation (large µ) and refine. Called ε-scaling in the literature, but µ-scaling is the better name here. Preserve the prices p_r at each step, but clear the matching. Note: do not clear matches associated with ∞ prices! Equivalent to finding a diagonal scaling D_r A D_c and matching again on the new B. Problem: performance depends strongly on the initial scaling, and also on hidden parameters.
49. Sequential performance: Auction vs. MC64.

   | Group | Name | Auction (s) | MC64 (s) | MC64/Auction |
   |---|---|---|---|---|
   | Bai | af23560 | 0.025 | 0.017 | 0.68 |
   | FEMLAB | poisson3Db | 0.014 | 0.040 | 2.74 |
   | FIDAP | ex11 | 0.060 | 0.015 | 0.26 |
   | GHS_indef | cont-300 | 0.007 | 0.019 | 2.89 |
   | GHS_indef | ncvxqp5 | 0.338 | 0.794 | 2.35 |
   | Hamm | scircuit | 0.048 | 0.024 | 0.50 |
   | Hollinger | g7jac200 | 0.355 | 0.817 | 2.30 |
   | Mallya | lhr14 | 0.044 | 0.026 | 0.60 |
   | Schenk_IBMSDS | 3D_51448_3D | 0.031 | 0.010 | 0.33 |
   | Schenk_IBMSDS | matrix_9 | 0.074 | 0.024 | 0.33 |
   | Schenk_ISEI | barrier2-4 | 0.291 | 0.044 | 0.15 |
   | Vavasis | av41092 | 5.462 | 3.595 | 0.66 |
   | Zhao | Zhao2 | 1.041 | 3.237 | 3.11 |
50. Sequential performance: highly variable (column-major vs. row-major iteration).

   | Group | Name | By col (s) | By row (s) | Row/Col |
   |---|---|---|---|---|
   | Bai | af23560 | 0.025 | 0.028 | 1.13 |
   | FEMLAB | poisson3Db | 0.014 | 0.016 | 1.11 |
   | FIDAP | ex11 | 0.060 | 0.060 | 1.00 |
   | GHS_indef | cont-300 | 0.007 | 0.006 | 0.84 |
   | GHS_indef | ncvxqp5 | 0.338 | 0.318 | 0.94 |
   | Hamm | scircuit | 0.048 | 0.047 | 0.99 |
   | Hollinger | g7jac200 | 0.355 | 0.339 | 0.95 |
   | Mallya | lhr14 | 0.044 | 0.065 | 1.47 |
   | Schenk_IBMSDS | 3D_51448_3D | 0.031 | 0.282 | 9.22 |
   | Schenk_IBMSDS | matrix_9 | 0.074 | 0.613 | 8.29 |
   | Schenk_ISEI | barrier2-4 | 0.291 | 0.193 | 0.66 |
   | Vavasis | av41092 | 5.462 | 4.083 | 0.75 |
   | Zhao | Zhao2 | 1.041 | 0.609 | 0.58 |
51. Sequential performance: highly variable (floating-point vs. integer benefits).

   | Group | Name | Float (s) | Int (s) | Int/Float |
   |---|---|---|---|---|
   | Bai | af23560 | 0.025 | 0.040 | 1.61 |
   | FEMLAB | poisson3Db | 0.015 | 0.016 | 1.08 |
   | FIDAP | ex11 | 0.060 | 0.029 | 0.49 |
   | GHS_indef | cont-300 | 0.007 | 0.006 | 0.91 |
   | GHS_indef | ncvxqp5 | 0.338 | 0.425 | 1.26 |
   | Hamm | scircuit | 0.048 | 0.016 | 0.34 |
   | Hollinger | g7jac200 | 0.355 | 1.004 | 2.83 |
   | Mallya | lhr14 | 0.044 | 0.050 | 1.12 |
   | Schenk_IBMSDS | 3D_51448_3D | 0.031 | 0.020 | 0.66 |
   | Schenk_IBMSDS | matrix_9 | 0.074 | 0.066 | 0.89 |
   | Schenk_ISEI | barrier2-4 | 0.291 | 0.261 | 0.91 |
   | Vavasis | av41092 | 5.462 | 5.401 | 0.99 |
   | Zhao | Zhao2 | 1.041 | 2.269 | 2.18 |
52. Approximately maximum matchings, by terminal µ value (time ratios are relative to µ = 0).

   | Name | Quantity | µ = 0 | 5.96e-08 | 2.44e-04 | 5.00e-01 |
   |---|---|---|---|---|---|
   | af23560 | Primal | 1342850 | 1342850 | 1342850 | 1342670 |
   | | Time (s) | 0.14 | 0.05 | 0.03 | 0 |
   | | Time ratio | | 0.37 | 0.21 | 0.02 |
   | poisson3Db | Primal | 2483070 | 2483070 | 2483070 | 2483070 |
   | | Time (s) | 0.02 | 0.02 | 0.02 | 0.02 |
   | | Time ratio | | 1.01 | 1.04 | 1.07 |
   | g7jac200 | Primal | 3533980 | 3533980 | 3533980 | 3533340 |
   | | Time (s) | 2.98 | 1.07 | 0.28 | 0.18 |
   | | Time ratio | | 0.36 | 0.09 | 0.06 |
   | av41092 | Primal | 3156210 | 3156210 | 3156210 | 3155920 |
   | | Time (s) | 24.51 | 8.09 | 2.48 | 0.11 |
   | | Time ratio | | 0.33 | 0.10 | 0.00 |
   | Zhao2 | Primal | 333891 | 333891 | 333891 | 333487 |
   | | Time (s) | 7.69 | 2.37 | 3.65 | 0.02 |
   | | Time ratio | | 0.31 | 0.47 | 0.00 |
53. Setting / lowering parallel expectations. Performance scalability? Auctions were originally proposed (early 1990s) when cpu speed ≈ memory speed ≈ network speed ≈ slow. Now cpu speed ≫ memory speed > network speed. The number of communication phases dominates matching algorithms (auction and others). Communication patterns are very irregular, and latency and software overhead are not improving. Scaled-back goal: it suffices to not slow down much on distributed data.
54. Basic idea: run local auctions, treat the results as bids. [Diagram: B sliced column-wise across processors P1, P2, P3.] Slice the matrix into pieces and run local auctions. The winning local bids become the slices' bids. Merge... ("And then a miracle occurs..."). Need to keep some data in sync for termination.
55. Basic idea (continued). Practically memory scalable: compact the local pieces. Have not experimented with a simple SMP version; sequential performance is limited by the memory system. Note: could be useful for multicore with local memory.
56. Speed-up? [Figure: speed-up vs. number of processors, mostly at or below 1.]
57. Speed-up: a bit better when measured appropriately. [Figure: speed-up relative to reducing the matrix to the root node.]
58. Comparing distributed with reduce-to-root. [Figure: distributed vs. to-root speed-up for 2-24 processors.]
59. Iteration order still matters. [Figure: time vs. number of processors for av41092 and shyy161, row-major vs. column-major direction.]
60. Many different speed-up profiles. [Figure: time vs. number of processors for af23560, bmwcra_1, garon2, and stomach.]
61. So what happens in some cases? Matrix av41092 has one large strongly connected component (the square blocks in a Dulmage-Mendelsohn decomposition), and the SCC spans all the processors. Every edge in an SCC is part of some complete matching. The horrible performance comes from starting along a non-maximum-weight matching, making it almost complete, and then searching edge-by-edge for nearby matchings, requiring a communication phase almost per edge. Conjecture: this type of performance land-mine will affect any 0-1 combinatorial algorithm.
62. Improvements? Approximate matchings speed up the sequential case, eliminating any "speed-up." Rearranging deck chairs: few-to-few communication. Build a directory of which nodes share rows (a collapsed BBᵀ) and send only to/from those neighbors: a minor improvement over MPI Allgatherv for a huge effort. Latency is not a major factor, so improving communication may not be worth it; the real problem is the number of communication phases. If the diagonal is the matching, everything is overhead. Or if there's a large SCC... Another alternative: multiple algorithms at once. Run Bora Uçar's algorithm on one set of nodes, the auction on another, the transposed auction on another, and so on. Requires some painful software engineering.
63. Latency is not a dominating factor. [Figure: speed-up relative to reduce-to-root for nodes × processes-per-node configurations 1x3, 3x1, 1x8, and 2x4.]
64. So, could this ever be parallel? For a given matrix-processor layout, constructing a matrix requiring O(n) communication is pretty easy for combinatorial algorithms: force almost every local action to be undone at every step. Non-fractional combinatorial algorithms are too restricted. Using less-restricted optimization methods is promising, but far slower sequentially; existing algorithms (Goldberg et al.) are PRAM with n³ processors. General-purpose methods: cutting planes, successive SDPs. Someone clever might find a parallel rounding algorithm. Solving the fractional LAP quickly would then become a matter of finding a magic preconditioner... maybe not a good thing for a direct method?
65. Review of contributions. Iterative refinement: successfully deliver dependable solutions with a little extra precision; removed the need for condition estimation; built a methodology for evaluating the accuracy and dependability of Ax = b solution methods. Static pivoting: tuned static pivoting heuristics to provide dependability; demonstrated that an approximate maximum weight bipartite matching is faster and just as dependable; developed a memory-scalable (although not performance-scalable) distributed memory auction algorithm for static pivoting.
66. Future directions. Iterative refinement: least-squares refinement has been demonstrated (Demmel, Hida, Li, & Riedy), but needs... refinement. Perhaps refinement could render an iterative method dependable. Could improve the accuracy of A dy_i = r_i with extra iterations as i increases. Could help build trust in new methods (e.g., CALU). Distributed matching: an interesting software problem: run multiple algorithms on portions of a parallel allotment; how do you signal the others to terminate? An interesting algorithm problem: is there an efficient rounding method for fractional / interior-point algorithms?
67. Thank you!
68. Bounds. Backward error: ‖D_i⁻¹ r_i‖∞ ≤ (c̄ − ρ)⁻¹ (3(n_d + 1)ε_r + ε_x). Here n_d is an expression of the size, c̄ is the upper bound on the per-iteration decrease, and ρ is a safety factor for the region around 1/ε_w. Forward error: ‖D_i⁻¹ e_i‖∞ ≲ 2(4 + ρ̄(n_d + 1))ε_w · (c̄ − ρ)⁻¹, assuming ε_r ≤ ε_w² and ε_x ≤ ε_w². Using only one precision, ε_r = ε_x = ε_w: (c̄ − ρ) ‖D_i⁻¹ e_i‖∞ ≲ 2(5 + 2(n_d + 1) ccond(A, y_i))ε_d.