# Parallel QR decomposition

## by Skills Matter on Jul 28, 2010


28/07/2010 : Jon Harrop : London F-Sharp User Group : QR Decomposition


• What are c_f etc.? Fortran 77 is unoptimized and not generic. F# code is generic. Faster than Intel’s commercial numerical code.

## Parallel QR decomposition: Presentation Transcript

• Parallel QR Decomposition
• Overview
Use QR to find linear least squares best fit:
Simple mathematics but 2,077 lines of gnarly Fortran!
Just 10 lines of F# code!
Optimize:
Algorithmic optimizations.
Low-level optimizations and parallelism.
Final F# is 150 LOC and up to 3× faster than Intel’s Math Kernel Library!
• QR Decomposition
Break the matrix A into a product of simpler matrices, A = QR:
Q is orthogonal, so its inverse is just its transpose.
R is upper triangular (zeroes below the diagonal).
Now much easier to solve our problem:
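
The formula that followed on this slide is not in the transcript. As a hedged reconstruction, here is the standard textbook derivation of least squares via QR. The symbols x (unknown coefficients), b (observations), R_1 (the top square block of R) and c_1, c_2 are assumptions introduced for illustration, and A is assumed to have full column rank.

```latex
\begin{align*}
A &= QR, \qquad Q^{\mathsf T} Q = I, \qquad
R = \begin{pmatrix} R_1 \\ 0 \end{pmatrix}, \qquad
Q^{\mathsf T} b = \begin{pmatrix} c_1 \\ c_2 \end{pmatrix} \\
\|Ax - b\|_2^2 &= \|Q^{\mathsf T}(Ax - b)\|_2^2
  = \|R_1 x - c_1\|_2^2 + \|c_2\|_2^2 \\
\Rightarrow\quad R_1 x &= c_1 \quad \text{(solved by back substitution)}
\end{align*}
```

Multiplying by the orthogonal matrix Q^T leaves the 2-norm unchanged, and the c_2 term does not depend on x, so the minimizer comes from a single triangular solve.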
• QR Decomposition
    > let rec qr_aux (m, n) k (A': matrix) (Q: matrix) (R: matrix) =
        if k = n then Q, R else
          let v = Vector.init m (fun i -> if i < k then 0.0 else A'.[i, k])
          v.[k] <- v.[k] - v.Norm
          let w = 1.0 / v.Norm * v
          let Qk = Matrix.identity m - Matrix.init m m (fun i j -> 2.0 * w.[i] * w.[j])
          qr_aux (m, n) (k + 1) (Qk * A') (Q * Qk.Transpose) (Qk * R);;
    val qr_aux : int * int -> int -> matrix -> matrix -> matrix -> matrix * matrix
    > let qr (A: matrix) =
        let m, n = A.Dimensions
        qr_aux (m, n) 0 A (Matrix.identity m) A;;
    val qr : matrix -> matrix * matrix
• Algorithmic optimization
Most effective optimizations ⇒ do them first!
Shrink matrices as we zero-out columns.
    let rec qr_aux (m, n) k (A': matrix) (Q: matrix) (R: matrix) =
      if k = n then Q, R else
        let w = Vector.init m (fun i -> if i < k then 0.0 else A'.[i, k])
        w.[k] <- w.[k] - w.Norm
        let w = 1.0 / w.Norm * w
        let A' = A' - outer (2.0 * w) (A'.Transpose * w)
        let Q = Q - outer (2.0 * Q * w) w
        let R = R - outer (2.0 * w) (R.Transpose * w)
        qr_aux (m, n) (k + 1) A' Q R
Asymptotically efficient (almost) purely functional solution!
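
The rewrite on this slide relies on the Householder matrix being a rank-one correction of the identity, so it never has to be formed explicitly. The identities below sketch why the three updates are equivalent to the matrix products on the previous slide. They assume that `outer u v` (not defined in the transcript) denotes the outer product u v^T.

```latex
\begin{align*}
Q_k &= I - 2 w w^{\mathsf T}, \qquad \|w\|_2 = 1, \qquad Q_k^{\mathsf T} = Q_k \\
Q_k A' &= A' - 2 w \,(w^{\mathsf T} A') = A' - (2w)\,(A'^{\mathsf T} w)^{\mathsf T} \\
Q\, Q_k^{\mathsf T} &= Q - 2\,(Q w)\, w^{\mathsf T} \\
Q_k R &= R - (2w)\,(R^{\mathsf T} w)^{\mathsf T}
\end{align*}
```

Each right-hand side is a matrix-vector product followed by a rank-one update. For an m×n matrix that costs O(mn) per step, against O(m²n) for the dense product with an explicit m×m Q_k.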
• Let’s parallelize!
Purely functional solutions scale effortlessly, right?
Example, serial version takes 0.12s:
    Array.init 10000 (fun i -> Array.init 1000 (fun j -> i + j))
Parallel version takes 0.15s:
    Array.Parallel.init 10000 (fun i -> Array.init 1000 (fun j -> i + j))
What went wrong?
• What went wrong?
Looked embarrassingly parallel but…
Contends for hidden shared resources:
Main memory bandwidth from L2 cache misses.
Garbage collector.
Destroys scalability!
• Solution…
“Don’t be puritanical about purity” – Yaron Minsky, Jane St. Capital.
Use in-place mutation to:
Eliminate allocations.
Improve locality.
Great scalability:
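
To make the in-place idea concrete, here is a minimal sketch that revisits the earlier array example with buffers allocated once and then overwritten in parallel. The helper name `fillInPlace` and the use of `System.Threading.Tasks.Parallel.For` are assumptions for illustration, not code from the talk, and no timings are claimed for it.

```fsharp
open System.Threading.Tasks

// Hypothetical helper (not from the talk): overwrites preallocated rows
// in place, so the parallel loop itself performs no allocation.
let fillInPlace (rows: int[][]) =
    Parallel.For(0, rows.Length, fun i ->
        let row = rows.[i]
        for j = 0 to row.Length - 1 do
            row.[j] <- i + j)
    |> ignore

// Allocate the buffers once, outside the parallel loop...
let rows = Array.init 10000 (fun _ -> Array.zeroCreate<int> 1000)

// ...then reuse them on every call.
fillInPlace rows
```

Each task writes only to its own row, so the tasks share no mutable state and need no locking.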
• The Code
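    open System.Threading.Tasks

    // Start from the m×m identity, stored as an array of rows.
    let q = Array.init m (fun i -> Array.init m (fun j -> if i = j then 1.0 else 0.0))
    for k = 0 to n - 1 do
      let w = ws.[k]  // k-th Householder vector (ws is not shown on the slide)
      // The line opening the parallel loop is missing from the transcript;
      // Parallel.For is assumed from the closing parenthesis and `|> ignore`.
      Parallel.For(0, m, fun i ->
        let qi = q.[i]
        let mutable Qwi = 0.0
        for j = k to m - 1 do
          Qwi <- Qwi + qi.[j] * w.[j]
        let Qwi = 2.0 * Qwi
        for j = k to m - 1 do
          qi.[j] <- qi.[j] - Qwi * w.[j])
      |> ignore
    Matrix.init m m (fun i j -> q.[i].[j])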
26 lines of F# already faster than 2,077 lines of Fortran!
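
The slide shows only the Q update and leaves `ws` undefined. As a self-contained sketch of how the pieces might fit together, the example below uses plain jagged arrays rather than the `matrix` type. The helper names `householderVectors` and `accumulateQ`, and the choice to reduce the input to R in place, are assumptions for illustration rather than the talk's actual 150-line implementation.

```fsharp
open System.Threading.Tasks

// Hypothetical helper: reduces `a` (m rows, n columns, m >= n) to upper
// triangular R in place and returns the unit Householder vectors.
let householderVectors (a: float[][]) =
    let m, n = a.Length, a.[0].Length
    let norm (v: float[]) = sqrt (Array.sumBy (fun x -> x * x) v)
    let ws = Array.zeroCreate<float[]> n
    for k = 0 to n - 1 do
        // Column k of the current matrix, zeroed above row k.
        let w = Array.init m (fun i -> if i < k then 0.0 else a.[i].[k])
        w.[k] <- w.[k] - norm w
        let len = norm w
        if len > 0.0 then
            for i = k to m - 1 do
                w.[i] <- w.[i] / len
            // a <- a - 2 w (w^T a), one column at a time.
            for j = k to n - 1 do
                let mutable dot = 0.0
                for i = k to m - 1 do
                    dot <- dot + w.[i] * a.[i].[j]
                for i = k to m - 1 do
                    a.[i].[j] <- a.[i].[j] - 2.0 * dot * w.[i]
        ws.[k] <- w
    ws

// Hypothetical helper: builds Q by updating each row in place,
// q_i <- q_i - 2 (q_i . w) w, with rows processed in parallel.
let accumulateQ (m: int) (ws: float[][]) =
    let q = Array.init m (fun i -> Array.init m (fun j -> if i = j then 1.0 else 0.0))
    for k = 0 to ws.Length - 1 do
        let w = ws.[k]
        Parallel.For(0, m, fun i ->
            let qi = q.[i]
            let mutable qw = 0.0
            for j = k to m - 1 do
                qw <- qw + qi.[j] * w.[j]
            let qw2 = 2.0 * qw
            for j = k to m - 1 do
                qi.[j] <- qi.[j] - qw2 * w.[j])
        |> ignore
    q

// Usage: factor a small 3x2 matrix; `r` is overwritten with R.
let a = [| [| 3.0; 1.0 |]; [| 4.0; 2.0 |]; [| 0.0; 5.0 |] |]
let r = Array.map Array.copy a
let ws = householderVectors r
let q = accumulateQ 3 ws
```

The rows of `q` are independent within each step k, which is what makes the inner loop safe to run in parallel without locks. The steps themselves stay sequential.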
• Result
Compared to Fortran:
2,077 lines -> 150 lines.
Competitive performance and no FFI overhead.
More generic (float32, float, complex, etc.).
Single scheduler.
One example of many: serial F# typically 2-3× slower than serial Fortran but…
Multicore eats into that performance gap!