02/25/2016 CS267 Lecture 12 1
CS 267
Dense Linear Algebra:
History and Structure,
Parallel Matrix Multiplication
James Demmel
www.cs.berkeley.edu/~demmel/cs267_Spr16
Quick review of earlier lecture
• What do you call
• A program written in PyGAS, a Global Address
Space language based on Python…
• That uses a Monte Carlo simulation algorithm to
approximate π …
• That has a race condition, so that it gives you a
different funny answer every time you run it?
Monte - π - thon
02/25/2016 CS267 Lecture 12 2
02/25/2016 CS267 Lecture 12 3
Outline
• History and motivation
• What is dense linear algebra?
• Why minimize communication?
• Lower bound on communication
• Parallel Matrix-matrix multiplication
• Attaining the lower bound
• Other Parallel Algorithms (next lecture)
02/25/2016 CS267 Lecture 12 4
Outline
• History and motivation
• What is dense linear algebra?
• Why minimize communication?
• Lower bound on communication
• Parallel Matrix-matrix multiplication
• Attaining the lower bound
• Other Parallel Algorithms (next lecture)
5
Motifs
The Motifs (formerly “Dwarfs”) from
“The Berkeley View” (Asanovic et al.)
Motifs form key computational patterns
What is dense linear algebra?
• Not just matmul!
• Linear Systems: Ax=b
• Least Squares: choose x to minimize ||Ax-b||₂
• Overdetermined or underdetermined; Unconstrained, constrained, or weighted
• Eigenvalues and vectors of Symmetric Matrices
• Standard (Ax = λx), Generalized (Ax=λBx)
• Eigenvalues and vectors of Unsymmetric matrices
• Eigenvalues, Schur form, eigenvectors, invariant subspaces
• Standard, Generalized
• Singular Values and vectors (SVD)
• Standard, Generalized
• Different matrix structures
• Real, complex; Symmetric, Hermitian, positive definite; dense, triangular, banded …
• 27 types in LAPACK (and growing…)
• Level of detail
• Simple Driver (“x = A\b”)
• Expert Drivers with error bounds, extra-precision, other options
• Lower level routines (“apply certain kind of orthogonal transformation”, matmul…)
02/25/2016 CS267 Lecture 12 6
Organizing Linear Algebra – in books
www.netlib.org/lapack www.netlib.org/scalapack
www.cs.utk.edu/~dongarra/etemplates
www.netlib.org/templates
gams.nist.gov
A brief history of (Dense) Linear Algebra software (1/7)
• In the beginning was the do-loop…
• Libraries like EISPACK (for eigenvalue problems)
• Then the BLAS (1) were invented (1973-1977)
• Standard library of 15 operations (mostly) on vectors
• “AXPY” ( y = α·x + y ), dot product, scale (x = α·x ), etc
• Up to 4 versions of each (S/D/C/Z), 46 routines, 3300 LOC
• Goals
• Common “pattern” to ease programming, readability
• Robustness, via careful coding (avoiding over/underflow)
• Portability + Efficiency via machine specific implementations
• Why BLAS 1? They do O(n¹) ops on O(n¹) data
• Used in libraries like LINPACK (for linear systems)
• Source of the name “LINPACK Benchmark” (not the code!)
02/25/2016 CS267 Lecture 12 8
02/25/2016 CS267 Lecture 12 9
Current Records for Solving Dense Systems (11/2013)
• Linpack Benchmark
• Fastest machine overall (www.top500.org)
• Tianhe-2 (Guangzhou, China)
• 33.9 Petaflops out of 54.9 Petaflops peak (n=10M)
• 3.1M cores, of which 2.7M are accelerator cores
• Intel Xeon E5-2692 (Ivy Bridge) and
Xeon Phi 31S1P
• 1 Pbyte memory
• 17.8 MWatts of power, 1.9 Gflops/Watt
• Historical data (www.netlib.org/performance)
• Palm Pilot III
• 1.69 Kiloflops
• n = 100
Current Records for Solving Dense Systems (11/2015)
A brief history of (Dense) Linear Algebra software (2/7)
• But the BLAS-1 weren’t enough
• Consider AXPY ( y = α·x + y ): 2n flops on 3n read/writes
• Computational intensity = (2n)/(3n) = 2/3
• Too low to run near peak speed (read/write dominates)
• Hard to vectorize (“SIMD’ize”) on supercomputers of
the day (1980s)
• So the BLAS-2 were invented (1984-1986)
• Standard library of 25 operations (mostly) on
matrix/vector pairs
• “GEMV”: y = α·A·x + β·y, “GER”: A = A + α·x·yᵀ, x = T⁻¹·x
• Up to 4 versions of each (S/D/C/Z), 66 routines, 18K LOC
• Why BLAS 2? They do O(n²) ops on O(n²) data
• So computational intensity still just ~(2n²)/(n²) = 2
• OK for vector machines, but not for machines with caches
02/25/2016 CS267 Lecture 12 10
A brief history of (Dense) Linear Algebra software (3/7)
• The next step: BLAS-3 (1987-1988)
• Standard library of 9 operations (mostly) on matrix/matrix pairs
• “GEMM”: C = α·A·B + β·C, C = α·A·Aᵀ + β·C, B = T⁻¹·B
• Up to 4 versions of each (S/D/C/Z), 30 routines, 10K LOC
• Why BLAS 3? They do O(n³) ops on O(n²) data
• So computational intensity (2n³)/(4n²) = n/2 – big at last!
• Good for machines with caches, other mem. hierarchy levels
• How much BLAS1/2/3 code so far (all at www.netlib.org/blas)
• Source: 142 routines, 31K LOC, Testing: 28K LOC
• Reference (unoptimized) implementation only
• Ex: 3 nested loops for GEMM
• Lots more optimized code (eg Homework 1)
• Motivates “automatic tuning” of the BLAS
• Part of standard math libraries (eg AMD ACML, Intel MKL)
02/25/2016 CS267 Lecture 12 11
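A small sketch (not from the original slides) that makes the BLAS-3 point concrete in Python: NumPy's matmul calls an optimized BLAS-3 GEMM (e.g. MKL or OpenBLAS) on most installations, so comparing it against a literal three-nested-loop GEMM shows the gap that blocked, tuned routines buy. The sizes and timing style below are arbitrary illustrative choices.

# Sketch: literal 3-nested-loop GEMM vs. NumPy's matmul, which dispatches to
# an optimized BLAS-3 GEMM on most installations.
import time
import numpy as np

def naive_gemm(A, B):
    # Reference triple loop: 2*n^3 flops with poor locality (plus Python overhead)
    n, k = A.shape
    _, m = B.shape
    C = np.zeros((n, m))
    for i in range(n):
        for j in range(m):
            s = 0.0
            for l in range(k):
                s += A[i, l] * B[l, j]
            C[i, j] = s
    return C

n = 200                          # kept small: the interpreted loops are slow
A, B = np.random.rand(n, n), np.random.rand(n, n)
t0 = time.perf_counter(); C1 = naive_gemm(A, B); t1 = time.perf_counter()
t2 = time.perf_counter(); C2 = A @ B;            t3 = time.perf_counter()
print(f"triple loop: {t1 - t0:.3f} s   BLAS-3 GEMM: {t3 - t2:.6f} s")
print("max |difference| =", np.abs(C1 - C2).max())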
02/25/2009 CS267 Lecture 8 12
BLAS Standards Committee to start meeting again May 2016:
Batched BLAS: many independent BLAS operations at once
Reproducible BLAS: getting bitwise identical answers from
run-to-run, despite nonassociative floating point, and dynamic
scheduling of resources (bebop.cs.berkeley.edu/reproblas)
Low-Precision BLAS: 16 bit floating point
See www.netlib.org/blas/blast-forum/ for previous extension attempt
New functions, Sparse BLAS, Extended Precision BLAS
A brief history of (Dense) Linear Algebra software (4/7)
• LAPACK – “Linear Algebra PACKage” - uses BLAS-3 (1989 – now)
• Ex: Obvious way to express Gaussian Elimination (GE) is adding
multiples of one row to other rows – BLAS-1
• How do we reorganize GE to use BLAS-3 ? (details later)
• Contents of LAPACK (summary)
• Algorithms that are (nearly) 100% BLAS 3
– Linear Systems: solve Ax=b for x
– Least Squares: choose x to minimize ||Ax-b||₂
• Algorithms that are only 50% BLAS 3
– Eigenproblems: Find λ and x where Ax = λx
– Singular Value Decomposition (SVD)
• Generalized problems (eg Ax = λBx)
• Error bounds for everything
• Lots of variants depending on A’s structure (banded, A=Aᵀ, etc)
• How much code? (Release 3.6.0, Nov 2015) (www.netlib.org/lapack)
• Source: 1750 routines, 721K LOC, Testing: 1094 routines, 472K LOC
• Ongoing development (at UCB and elsewhere) (class projects!)
• Next planned release June 2016 13
A brief history of (Dense) Linear Algebra software (5/7)
• Is LAPACK parallel?
• Only if the BLAS are parallel (possible in shared memory)
• ScaLAPACK – “Scalable LAPACK” (1995 – now)
• For distributed memory – uses MPI
• More complex data structures, algorithms than LAPACK
• Only subset of LAPACK’s functionality available
• Details later (class projects!)
• All at www.netlib.org/scalapack
02/25/2016 CS267 Lecture 12 14
02/25/2016 CS267 Lecture 12 15
Success Stories for Sca/LAPACK (6/7)
Cosmic Microwave Background
Analysis, BOOMERanG
collaboration, MADCAP code (Apr.
27, 2000).
• Widely used
• Adopted by Mathworks, Cray,
Fujitsu, HP, IBM, IMSL, Intel,
NAG, NEC, SGI, …
• 7.5M webhits/year @ Netlib
(incl. CLAPACK, LAPACK95)
• New Science discovered through the
solution of dense matrix systems
• Nature article on the flat
universe used ScaLAPACK
• Other articles in Physics
Review B that also use it
• 1998 Gordon Bell Prize
• www.nersc.gov/assets/NewsImages/2003/
newNERSCresults050703.pdf
A brief future look at (Dense) Linear Algebra software (7/7)
• PLASMA, DPLASMA and MAGMA (now)
• Ongoing extensions to Multicore/GPU/Heterogeneous
• Can one software infrastructure accommodate all algorithms
and platforms of current (future) interest?
• How much code generation and tuning can we automate?
• Details later (Class projects!) (icl.cs.utk.edu/{{d}plasma,magma})
• Other related projects
• Elemental (libelemental.org)
• Distributed memory dense linear algebra
• “Balance ease of use and high performance”
• FLAME (z.cs.utexas.edu/wiki/flame.wiki/FrontPage)
• Formal Linear Algebra Method Environment
• Attempt to automate code generation across multiple platforms
• So far, none of these libraries minimize communication in all
cases (not even matmul!)
17
Back to basics:
Why avoiding communication is important (1/3)
Algorithms have two costs:
1.Arithmetic (FLOPS)
2.Communication: moving data between
• levels of a memory hierarchy (sequential case)
• processors over a network (parallel case).
CPU
Cache
DRAM
CPU
DRAM
CPU
DRAM
CPU
DRAM
CPU
DRAM
02/25/2016 CS267 Lecture 12
Why avoiding communication is important (2/3)
• Running time of an algorithm is sum of 3 terms:
• # flops * time_per_flop
• # words moved / bandwidth
• # messages * latency
18
communication
• Time_per_flop << 1/ bandwidth << latency
• Gaps growing exponentially with time
Annual improvements:
  Time_per_flop: 59%
             Bandwidth   Latency
  Network:   26%         15%
  DRAM:      23%         5%
02/25/2016
• Minimize communication to save time
CS267 Lecture 12
Why Minimize Communication? (3/3)
Source: John Shalf, LBL
Why Minimize Communication? (3/3)
Source: John Shalf, LBL
Minimize communication to save energy
Goal:
Organize Linear Algebra to Avoid Communication
21
• Between all memory hierarchy levels
• L1 ↔ L2 ↔ DRAM ↔ network, etc
• Not just hiding communication (overlap with arithmetic)
• Speedup ≤ 2x
• Arbitrary speedups/energy savings possible
• Later: Same goal for other computational patterns
• Lots of open problems
02/25/2016
CS267 Lecture 12
Review: Blocked Matrix Multiply
• Blocked Matmul C = A·B breaks A, B and C into blocks
with dimensions that depend on cache size
22
… Break An×n, Bn×n, Cn×n into b×b blocks labeled A(i,j), etc
… b chosen so 3 b×b blocks fit in cache
for i = 1 to n/b, for j=1 to n/b, for k=1 to n/b
C(i,j) = C(i,j) + A(i,k)·B(k,j) … b × b matmul, 4b² reads/writes
• When b=1, get “naïve” algorithm, want b larger …
• (n/b)³ · 4b² = 4n³/b reads/writes altogether
• Minimized when 3b² = cache size = M, yielding O(n³/M^(1/2)) reads/writes
• What if we had more levels of memory? (L1, L2, cache etc)?
• Would need 3 more nested loops per level
• Recursive (cache-oblivious algorithm) also possible
02/25/2016 CS267 Lecture 12
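As a concrete companion to the blocked loop above, here is a minimal NumPy sketch of the same i, j, k blocking. It assumes n is a multiple of the block size b and delegates the b×b inner update to NumPy so the example stays short; in a real tuned kernel that inner update would itself be carefully optimized.

# Blocked matmul sketch in NumPy (assumes n is a multiple of b).
import numpy as np

def blocked_matmul(A, B, b):
    # Choose b so that roughly 3*b*b words fit in cache (3b^2 <= M).
    n = A.shape[0]
    C = np.zeros((n, n))
    for i in range(0, n, b):
        for j in range(0, n, b):
            for k in range(0, n, b):
                # b x b matmul: ~4b^2 block reads/writes, 2b^3 flops
                C[i:i+b, j:j+b] += A[i:i+b, k:k+b] @ B[k:k+b, j:j+b]
    return C

n, b = 512, 64
A, B = np.random.rand(n, n), np.random.rand(n, n)
assert np.allclose(blocked_matmul(A, B, b), A @ B)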
Communication Lower Bounds: Prior Work on Matmul
• Assume n³ algorithm (i.e. not Strassen-like)
• Sequential case, with fast memory of size M
• Lower bound on #words moved to/from slow memory
= Ω(n³ / M^(1/2)) [Hong, Kung, 81]
• Attained using blocked or cache-oblivious algorithms
23
• Parallel case on P processors:
• Let M be memory per processor; assume load balanced
• Lower bound on #words moved
= Ω((n³/p) / M^(1/2)) [Irony, Tiskin, Toledo, 04]
• If M = 3n²/p (one copy of each matrix), then
lower bound = Ω(n² / p^(1/2))
• Attained by SUMMA, Cannon’s algorithm
02/25/2016 CS267 Lecture 12
New lower bound for all “direct” linear algebra
• Holds for
• Matmul, BLAS, LU, QR, eig, SVD, tensor contractions, …
• Some whole programs (sequences of these operations,
no matter how they are interleaved, eg computing Aᵏ)
• Dense and sparse matrices (where #flops << n³)
• Sequential and parallel algorithms
• Some graph-theoretic algorithms (eg Floyd-Warshall)
• Generalizations later (Strassen-like algorithms, loops accessing arrays)
24
Let M = “fast” memory size per processor
= cache size (sequential case) or O(n²/p) (parallel case)
#flops = number of flops done per processor
#words_moved per processor = Ω(#flops / M^(1/2))
#messages_sent per processor = Ω(#flops / M^(3/2))
02/25/2016 CS267 Lecture 12
• Holds for
• Matmul
New lower bound for all “direct” linear algebra
• Sequential case, dense n x n matrices, so O(n³) flops
• #words_moved = Ω(n³ / M^(1/2))
• #messages_sent = Ω(n³ / M^(3/2))
• Parallel case, dense n x n matrices
• Load balanced, so O(n³/p) flops per processor
• One copy of data, load balanced, so M = O(n²/p) per processor
• #words_moved = Ω(n² / p^(1/2))
• #messages_sent = Ω(p^(1/2))
25
Let M = “fast” memory size per processor
= cache size (sequential case) or O(n²/p) (parallel case)
#flops = number of flops done per processor
#words_moved per processor = Ω(#flops / M^(1/2))
#messages_sent per processor = Ω(#flops / M^(3/2))
02/25/2016 CS267 Lecture 12
SIAM Linear Algebra Prize, 2012
Can we attain these lower bounds?
• Do conventional dense algorithms as implemented in LAPACK and
ScaLAPACK attain these bounds?
• Mostly not yet, work in progress
• If not, are there other algorithms that do?
• Yes
• Goals for algorithms:
• Minimize #words_moved
• Minimize #messages_sent
• Need new data structures
• Minimize for multiple memory hierarchy levels
• Cache-oblivious algorithms would be simplest
• Fewest flops when matrix fits in fastest memory
• Cache-oblivious algorithms don’t always attain this
• Attainable for nearly all dense linear algebra
• Just a few prototype implementations so far (class projects!)
• Only a few sparse algorithms so far (eg Cholesky)
26
02/25/2016 CS267 Lecture 12
02/25/2016 CS267 Lecture 12 27
Outline
• History and motivation
• What is dense linear algebra?
• Why minimize communication?
• Lower bound on communication
• Parallel Matrix-matrix multiplication
• Attaining the lower bound
• Other Parallel Algorithms (next lecture)
02/25/2016 CS267 Lecture 12 28
Different Parallel Data Layouts for Matrices (not all!)
0123012301230123
0 1 2 3 0 1 2 3
1) 1D Column Blocked Layout 2) 1D Column Cyclic Layout
3) 1D Column Block Cyclic Layout
4) Row versions of the previous layouts
Generalizes others
0 1 0 1 0 1 0 1
2 3 2 3 2 3 2 3
0 1 0 1 0 1 0 1
2 3 2 3 2 3 2 3
0 1 0 1 0 1 0 1
2 3 2 3 2 3 2 3
0 1 0 1 0 1 0 1
2 3 2 3 2 3 2 3
6) 2D Row and Column
Block Cyclic Layout
0 1 2 3
0 1
2 3
5) 2D Row and Column Blocked Layout
b
02/25/2016 CS267 Lecture 12 29
Parallel Matrix-Vector Product
• Compute y = y + A*x, where A is a dense matrix
• Layout:
• 1D row blocked
• A(i) refers to the n/p by n block row
that processor i owns,
• x(i) and y(i) similarly refer to
segments of x,y owned by i
• Algorithm:
• Foreach processor i
• Broadcast x(i)
• Compute y(i) = A(i)*x
• Algorithm uses the formula
y(i) = y(i) + A(i)*x = y(i) + Σj A(i,j)*x(j)
x
y
P0
P1
P2
P3
P0 P1 P2 P3
A(0)
A(1)
A(2)
A(3)
02/25/2016 CS267 Lecture 12 30
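A minimal mpi4py sketch (not from the slides) of this 1D row-blocked y = y + A·x, assuming n is divisible by p. The “broadcast x(i)” step is realized here with MPI_Allgather, so every processor assembles the full x before its local product; sizes, seeds, and the launch command are placeholders.

# Run with e.g. "mpiexec -n 4 python <this file>".
from mpi4py import MPI
import numpy as np

comm = MPI.COMM_WORLD
p, rank = comm.Get_size(), comm.Get_rank()

n = 8 * p                        # toy global size
nloc = n // p                    # rows owned per processor

np.random.seed(rank)
A_i = np.random.rand(nloc, n)    # A(i): the n/p x n block row owned by rank i
x_i = np.random.rand(nloc)       # x(i): this rank's segment of x
y_i = np.zeros(nloc)             # y(i): this rank's segment of y

# "Broadcast x(i)": every rank contributes its segment and assembles all of x.
x = np.empty(n)
comm.Allgather(x_i, x)

# Local work: y(i) = y(i) + A(i)*x  (= y(i) + sum_j A(i,j)*x(j))
y_i += A_i @ x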
Matrix-Vector Product y = y + A*x
• A column layout of the matrix eliminates the broadcast of x
• But adds a reduction to update the destination y
• A 2D blocked layout uses a broadcast and reduction, both
on a subset of processors
• sqrt(p) for square processor grid
P0 P1 P2 P3
P0 P1 P2 P3
P4 P5 P6 P7
P8 P9 P10 P11
P12 P13 P14 P15
02/25/2016 CS267 Lecture 12 31
Parallel Matrix Multiply
• Computing C=C+A*B
• Using basic algorithm: 2n³ flops
• Variables are:
• Data layout: 1D? 2D? Other?
• Topology of machine: Ring? Torus?
• Scheduling communication
• Use of performance models for algorithm design
• Message Time = “latency” + #words * time-per-word
= α + n·β
• Efficiency (in any model):
• serial time / (p * parallel time)
• perfect (linear) speedup ⟺ efficiency = 1
02/25/2016 CS267 Lecture 12 32
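The performance model above, written out as a couple of Python helpers so the symbols have a concrete home; the machine parameters (α = latency, β = time per word, γ = time per flop) are made-up placeholders, not measurements.

def message_time(nwords, alpha, beta):
    # time for one message of nwords words: latency + words * time-per-word
    return alpha + nwords * beta

def efficiency(serial_time, p, parallel_time):
    # 1.0 means perfect (linear) speedup
    return serial_time / (p * parallel_time)

alpha, beta, gamma = 1e-6, 1e-9, 1e-11   # made-up secs per message / word / flop
print(message_time(10_000, alpha, beta)) # one 10^4-word message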
Matrix Multiply with 1D Column Layout
• Assume matrices are n x n and n is divisible by p
• A(i) refers to the n by n/p block column that processor i
owns (similarly for B(i) and C(i))
• B(i,j) is the n/p by n/p subblock of B(i)
• in rows j*n/p through (j+1)*n/p - 1
• Algorithm uses the formula
C(i) = C(i) + A*B(i) = C(i) + Σj A(j)*B(j,i)
p0 p1 p2 p3 p4 p5 p6 p7
May be a reasonable
assumption for analysis,
not for code
02/25/2016 CS267 Lecture 12 33
Matrix Multiply: 1D Layout on Bus or Ring
• Algorithm uses the formula
C(i) = C(i) + A*B(i) = C(i) + Σj A(j)*B(j,i)
• First consider a bus-connected machine without
broadcast: only one pair of processors can
communicate at a time (ethernet)
• Second consider a machine with processors on a ring:
all processors may communicate with nearest neighbors
simultaneously
02/25/2016 CS267 Lecture 12 34
MatMul: 1D layout on Bus without Broadcast
Naïve algorithm:
C(myproc) = C(myproc) + A(myproc)*B(myproc,myproc)
for i = 0 to p-1
for j = 0 to p-1 except i
if (myproc == i) send A(i) to processor j
if (myproc == j)
receive A(i) from processor i
C(myproc) = C(myproc) + A(i)*B(i,myproc)
barrier
Cost of inner loop:
computation: 2*n*(n/p)² = 2n³/p²
communication: α + β*n²/p
02/25/2016 CS267 Lecture 12 35
Naïve MatMul (continued)
Cost of inner loop:
computation: 2*n*(n/p)² = 2n³/p²
communication: α + β*n²/p … approximately
Only 1 pair of processors (i and j) are active on any iteration,
and of those, only i is doing computation
=> the algorithm is almost entirely serial
Running time:
= (p*(p-1) + 1)*computation + p*(p-1)*communication
≈ 2n³ + p²*α + p*n²*β
This is worse than the serial time and grows with p.
02/25/2016 CS267 Lecture 12 36
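Plugging the estimate just derived (≈ 2n³·γ + p²·α + p·n²·β, with γ the time per flop) into made-up machine parameters shows the ratio to the serial time 2n³·γ exceeding 1 and growing with p, as claimed. A throwaway sketch, not a benchmark.

def naive_bus_time(n, p, alpha, beta, gamma):
    # running-time estimate from the slide, with gamma = time per flop
    return 2 * n**3 * gamma + p**2 * alpha + p * n**2 * beta

def serial_time(n, gamma):
    return 2 * n**3 * gamma

alpha, beta, gamma = 1e-6, 1e-9, 1e-11   # placeholder machine parameters
n = 4096
for p in (4, 16, 64):
    ratio = naive_bus_time(n, p, alpha, beta, gamma) / serial_time(n, gamma)
    print(f"p = {p:3d}: parallel/serial time = {ratio:.2f}")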
Matmul for 1D layout on a Processor Ring
• Pairs of adjacent processors can communicate simultaneously
Copy A(myproc) into Tmp
C(myproc) = C(myproc) + Tmp*B(myproc , myproc)
for j = 1 to p-1
Send Tmp to processor myproc+1 mod p
Receive Tmp from processor myproc-1 mod p
C(myproc) = C(myproc) + Tmp*B( myproc-j mod p , myproc)
• Same idea as for gravity in simple sharks and fish algorithm
• May want double buffering in practice for overlap
• Ignoring deadlock details in code
• Time of inner loop = 2*(α + β*n²/p) + 2*n*(n/p)²
02/25/2016 CS267 Lecture 12 37
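A minimal mpi4py rendering of the ring algorithm above, using the 1D block-column layout from the earlier slide. MPI_Sendrecv_replace performs the simultaneous shift and sidesteps the deadlock details the slide ignores; double buffering and communication/computation overlap are omitted, n is assumed divisible by p, and sizes and seeds are placeholders.

from mpi4py import MPI
import numpy as np

comm = MPI.COMM_WORLD
p, me = comm.Get_size(), comm.Get_rank()

n = 4 * p
nloc = n // p
np.random.seed(me)
A_my = np.random.rand(n, nloc)        # A(myproc): n x n/p block column
B_my = np.random.rand(n, nloc)        # B(myproc)
C_my = np.zeros((n, nloc))            # C(myproc)

def Bblock(j):
    # B(j, myproc): the n/p x n/p sub-block of B(myproc) in rows j*n/p..(j+1)*n/p-1
    return B_my[j * nloc:(j + 1) * nloc, :]

Tmp = A_my.copy()
C_my += Tmp @ Bblock(me)
for j in range(1, p):
    # shift Tmp one step around the ring: send right, receive from the left
    comm.Sendrecv_replace(Tmp, dest=(me + 1) % p, source=(me - 1) % p)
    C_my += Tmp @ Bblock((me - j) % p)

# check against the full product (only sensible at toy sizes)
A_full = np.hstack(comm.allgather(A_my))
B_full = np.hstack(comm.allgather(B_my))
assert np.allclose(C_my, (A_full @ B_full)[:, me * nloc:(me + 1) * nloc])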
Matmul for 1D layout on a Processor Ring
• Time of inner loop = 2*(α + β*n²/p) + 2*n*(n/p)²
• Total Time = 2*n*(n/p)² + (p-1) * Time of inner loop
• ≈ 2n³/p + 2*p*α + 2*β*n²
• (Nearly) Optimal for 1D layout on Ring or Bus, even with Broadcast:
• Perfect speedup for arithmetic
• A(myproc) must move to each other processor, costs at least
(p-1)*cost of sending n*(n/p) words
• Parallel Efficiency = 2n³ / (p * Total Time)
= 1/(1 + α * p²/(2n³) + β * p/(2n) )
= 1/ (1 + O(p/n))
• Grows to 1 as n/p increases (or α and β shrink)
• But far from communication lower bound
02/25/2016 CS267 Lecture 12 38
Need to try 2D Matrix layout
0123012301230123
0 1 2 3 0 1 2 3
1) 1D Column Blocked Layout 2) 1D Column Cyclic Layout
3) 1D Column Block Cyclic Layout
4) Row versions of the previous layouts
Generalizes others
0 1 0 1 0 1 0 1
2 3 2 3 2 3 2 3
0 1 0 1 0 1 0 1
2 3 2 3 2 3 2 3
0 1 0 1 0 1 0 1
2 3 2 3 2 3 2 3
0 1 0 1 0 1 0 1
2 3 2 3 2 3 2 3
6) 2D Row and Column
Block Cyclic Layout
0 1 2 3
0 1
2 3
5) 2D Row and Column Blocked Layout
b
Summary of Parallel Matrix Multiply
• SUMMA
• Scalable Universal Matrix Multiply Algorithm
• Attains communication lower bounds (within log p)
• Cannon
• Historically first, attains lower bounds
• More assumptions
• A and B square
• P a perfect square
• 2.5D SUMMA
• Uses more memory to communicate even less
• Parallel Strassen
• Attains different, even lower bounds
02/25/2016 CS267 Lecture 12 39
02/25/2016 CS267 Lecture 12 40
SUMMA Algorithm
• SUMMA = Scalable Universal Matrix Multiply
• Presentation from van de Geijn and Watts
• www.netlib.org/lapack/lawns/lawn96.ps
• Similar ideas appeared many times
• Used in practice in PBLAS = Parallel BLAS
• www.netlib.org/lapack/lawns/lawn100.ps
SUMMA uses Outer Product form of MatMul
• C = A*B means C(i,j) = Σk A(i,k)*B(k,j)
• Column-wise outer product:
C = A*B
= Σk A(:,k)*B(k,:)
= Σk (k-th col of A)*(k-th row of B)
• Block column-wise outer product
(block size = 4 for illustration)
C = A*B
= A(:,1:4)*B(1:4,:) + A(:,5:8)*B(5:8,:) + …
= Σk (k-th block of 4 cols of A)*
(k-th block of 4 rows of B)
02/25/2016 CS267 Lecture 12 41
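The block outer-product identity above, checked in a few lines of NumPy with block size 4 as in the illustration (sizes are arbitrary).

import numpy as np

n, b = 16, 4
A, B = np.random.rand(n, n), np.random.rand(n, n)

C = np.zeros((n, n))
for k in range(0, n, b):
    # (k-th block of b cols of A) * (k-th block of b rows of B)
    C += A[:, k:k+b] @ B[k:k+b, :]

assert np.allclose(C, A @ B)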
42
SUMMA – n x n matmul on P^(1/2) x P^(1/2) grid
• C[i,j] is n/P^(1/2) x n/P^(1/2) submatrix of C on processor Pij
• A[i,k] is n/P^(1/2) x b submatrix of A
• B[k,j] is b x n/P^(1/2) submatrix of B
• C[i,j] = C[i,j] + Σk A[i,k]*B[k,j]
• summation over submatrices
• Need not be square processor grid
* =
i
j
A[i,k]
k
k
B[k,j]
C[i,j]
02/25/2016 CS267 Lecture 12
43
SUMMA – n x n matmul on P^(1/2) x P^(1/2) grid
* =
i
j
A[i,k]
k
k
B[k,j]
C(i,j)
For k=0 to n/b-1
for all i = 1 to P^(1/2)
owner of A[i,k] broadcasts it to whole processor row (using binary tree)
for all j = 1 to P^(1/2)
owner of B[k,j] broadcasts it to whole processor column (using bin. tree)
Receive A[i,k] into Acol
Receive B[k,j] into Brow
C_myproc = C_myproc + Acol * Brow
Brow
Acol
02/25/2016 CS267 Lecture 12
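A compact mpi4py sketch (not from the slides) of this SUMMA loop on a √P × √P grid, assuming P is a perfect square and the broadcast block width b divides n/√P. Row and column sub-communicators built from a Cartesian communicator carry the two broadcasts; whatever tree MPI_Bcast uses internally plays the role of the binary tree in the pseudocode. Sizes and seeds are placeholders.

from mpi4py import MPI
import numpy as np

comm = MPI.COMM_WORLD
P = comm.Get_size()
q = int(round(P ** 0.5))                 # processor grid is q x q
assert q * q == P

cart = comm.Create_cart([q, q])
i, j = cart.Get_coords(cart.Get_rank())
row_comm = cart.Sub([False, True])       # procs sharing row index i
col_comm = cart.Sub([True, False])       # procs sharing column index j

n, b = 8 * q, 4                          # global size and broadcast block width
nloc = n // q
np.random.seed(100 * i + j)
A_loc = np.random.rand(nloc, nloc)       # local A[i,j] block
B_loc = np.random.rand(nloc, nloc)       # local B[i,j] block
C_loc = np.zeros((nloc, nloc))           # local C[i,j] block

for k in range(0, n, b):
    owner, off = k // nloc, k % nloc     # grid row/col owning this strip, local offset
    Acol = A_loc[:, off:off + b].copy() if j == owner else np.empty((nloc, b))
    Brow = B_loc[off:off + b, :].copy() if i == owner else np.empty((b, nloc))
    row_comm.Bcast(Acol, root=owner)     # broadcast A[i,k] along the processor row
    col_comm.Bcast(Brow, root=owner)     # broadcast B[k,j] along the processor column
    C_loc += Acol @ Brow                 # C[i,j] += A[i,k]*B[k,j]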
44
SUMMA Costs
For k=0 to n/b-1
for all i = 1 to P^(1/2)
owner of A[i,k] broadcasts it to whole processor row (using binary tree)
… #words = log P^(1/2) * b*n/P^(1/2) , #messages = log P^(1/2)
for all j = 1 to P^(1/2)
owner of B[k,j] broadcasts it to whole processor column (using bin. tree)
… same #words and #messages
Receive A[i,k] into Acol
Receive B[k,j] into Brow
C_myproc = C_myproc + Acol * Brow … #flops = 2n²*b/P
°Total #words = log P * n²/P^(1/2)
°Within factor of log P of lower bound
°(more complicated implementation removes log P factor)
°Total #messages = log P * n/b
°Choose b close to maximum, n/P^(1/2), to approach lower bound P^(1/2)
°Total #flops = 2n³/P
02/25/2016 CS267 Lecture 8 45
PDGEMM = PBLAS routine
for matrix multiply
Observations:
For fixed N, as P increases
Mflops increases, but
less than 100% efficiency
For fixed P, as N increases,
Mflops (efficiency) rises
DGEMM = BLAS routine
for matrix multiply
Maximum speed for PDGEMM
= # Procs * speed of DGEMM
Observations (same as above):
Efficiency always at least 48%
For fixed N, as P increases,
efficiency drops
For fixed P, as N increases,
efficiency increases
46
Can we do better?
• Lower bound assumed 1 copy of data: M = O(n²/P) per proc.
• What if matrix small enough to fit c>1 copies, so M = cn²/P ?
• #words_moved = Ω( #flops / M^(1/2) ) = Ω( n² / ( c^(1/2) P^(1/2) ))
• #messages = Ω( #flops / M^(3/2) ) = Ω( P^(1/2) / c^(3/2) )
• Can we attain new lower bound?
• Special case: “3D Matmul”: c = P^(1/3)
• Berntsen 89, Agarwal, Chandra, Snir 90, Aggarwal 95
• Processors arranged in P^(1/3) x P^(1/3) x P^(1/3) grid
• Processor (i,j,k) performs C(i,j) = C(i,j) + A(i,k)*B(k,j), where
each submatrix is n/P^(1/3) x n/P^(1/3)
• Not always that much memory available…
02/25/2016 CS267 Lecture 12
2.5D Matrix Multiplication
• Assume can fit cn²/P data per processor, c > 1
• Processors form (P/c)^(1/2) x (P/c)^(1/2) x c grid
c
(P/c)^(1/2)
Example: P = 32, c = 2
02/25/2016 CS267 Lecture 12
2.5D Matrix Multiplication
• Assume can fit cn²/P data per processor, c > 1
• Processors form (P/c)^(1/2) x (P/c)^(1/2) x c grid
k
j
Initially P(i,j,0) owns A(i,j) and B(i,j)
each of size n(c/P)^(1/2) x n(c/P)^(1/2)
(1) P(i,j,0) broadcasts A(i,j) and B(i,j) to P(i,j,k)
(2) Processors at level k perform 1/c-th of SUMMA, i.e. 1/c-th of Σm A(i,m)*B(m,j)
(3) Sum-reduce partial sums Σm A(i,m)*B(m,j) along k-axis so P(i,j,0) owns C(i,j)
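To see how the replication factor c pays off, the per-processor communication counts quoted on the next slide can be tabulated directly (leading-order terms only, constants dropped; c = 1 recovers the 2D counts). A tiny illustrative sketch:

def words_moved(n, p, c=1):
    # #words_moved per processor ~ n^2 / (c^(1/2) * p^(1/2))
    return n * n / (c ** 0.5 * p ** 0.5)

def messages(p, c=1):
    # #messages per processor ~ p^(1/2) / c^(3/2)
    return p ** 0.5 / c ** 1.5

n, p = 65536, 16384
for c in (1, 2, 4, 16):
    print(c, words_moved(n, p, c), messages(p, c))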
2.5D Matmul on IBM BG/P, n=64K
• As P increases, available memory grows ⇒ c increases proportionally to P
• #flops, #words_moved, #messages per proc all decrease proportionally to P
• #words_moved = Ω( #flops / M^(1/2) ) = Ω( n² / ( c^(1/2) P^(1/2) ))
• #messages = Ω( #flops / M^(3/2) ) = Ω( P^(1/2) / c^(3/2) )
• Perfect strong scaling! But only up to c = P^(1/3)
2.5D Matmul on IBM BG/P, 16K nodes / 64K cores
02/25/2016 CS267 Lecture 12
2.5D Matmul on IBM BG/P, 16K nodes / 64K cores
c = 16 copies
Distinguished Paper Award, EuroPar’11
SC’11 paper by Solomonik, Bhatele, D.
02/25/2016
Perfect Strong Scaling – in Time and Energy
• Every time you add a processor, you should use its memory M too
• Start with minimal number of procs: P·M = 3n²
• Increase P by a factor of c ⇒ total memory increases by a factor of c
• Notation for timing model:
• γ_T, β_T, α_T = secs per flop, per word_moved, per message of size m
• T(cP) = n³/(cP) · [ γ_T + β_T/M^(1/2) + α_T/(m·M^(1/2)) ]
= T(P)/c
• Notation for energy model:
• γ_E, β_E, α_E = joules for same operations
• δ_E = joules per word of memory used per sec
• ε_E = joules per sec for leakage, etc.
• E(cP) = cP { n³/(cP) · [ γ_E + β_E/M^(1/2) + α_E/(m·M^(1/2)) ] + δ_E·M·T(cP) + ε_E·T(cP) }
= E(P)
• c cannot increase forever: c ≤ P^(1/3) (3D algorithm)
• Corresponds to lower bound on #messages hitting 1
• Perfect scaling extends to Strassen’s matmul, direct N-body, …
• “Perfect Strong Scaling Using No Additional Energy”
• “Strong Scaling of Matmul and Memory-Indep. Comm. Lower Bounds”
• Both at bebop.cs.berkeley.edu
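For reference, the timing and energy model above written out cleanly (per-processor memory M is held fixed as P grows by the factor c, so flops, words, and messages per processor all shrink by c):

T(cP) = \frac{n^3}{cP}\left[\gamma_T + \frac{\beta_T}{M^{1/2}} + \frac{\alpha_T}{m\,M^{1/2}}\right] = \frac{T(P)}{c},
\qquad
E(cP) = cP\left\{\frac{n^3}{cP}\left[\gamma_E + \frac{\beta_E}{M^{1/2}} + \frac{\alpha_E}{m\,M^{1/2}}\right] + \delta_E\,M\,T(cP) + \varepsilon_E\,T(cP)\right\} = E(P).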
Classical Matmul vs Parallel Strassen
• Complexity of classical Matmul vs Strassen
• Flops: O(n³/p) vs O(n^ω/p) where ω = log₂7 ≈ 2.81
• Communication lower bound on #words:
Ω((n³/p)/M^(1/2)) = Ω(M·(n/M^(1/2))³/p) vs Ω(M·(n/M^(1/2))^ω/p)
• Communication lower bound on #messages:
Ω((n³/p)/M^(3/2)) = Ω((n/M^(1/2))³/p) vs Ω((n/M^(1/2))^ω/p)
• All attainable as M increases past O(n²/p), up to a limit:
can increase M by factor up to p^(1/3) vs p^(1-2/ω)
#words as low as Ω(n²/p^(2/3)) vs Ω(n²/p^(2/ω))
• Best Paper Prize, SPAA’11, Ballard, D., Holtz, Schwartz
• How well does parallel Strassen work in practice?
02/27/2014 CS267 Lecture 12 53
Strong scaling of Matmul on Hopper (n=94080)
02/25/2016 54
G. Ballard, D., O. Holtz, B. Lipshitz, O. Schwartz
“Communication-Avoiding Parallel Strassen”
bebop.cs.berkeley.edu, Supercomputing’12
02/25/2016 CS267 Lecture 12 55
ScaLAPACK Parallel Library
Extensions of Lower Bound and
Optimal Algorithms
• For each processor that does G flops with fast memory of size M
#words_moved = Ω(G/M^(1/2))
• Extension: for any program that “smells like”
• Nested loops …
• That access arrays …
• Where array subscripts are linear functions of loop indices
• Ex: A(i,j), B(3*i-4*k+5*j, i-j, 2*k, …), …
• There is a constant s such that
#words_moved = Ω(G/M^(s-1))
• s comes from recent generalization of Loomis-Whitney (s = 3/2)
• Ex: linear algebra, n-body, database join, …
• Lots of open questions: deriving s, optimal algorithms …
02/25/2016 CS267 Lecture 12 56
Proof of Communication Lower Bound on C = A·B (1/4)
• Proof from Irony/Toledo/Tiskin (2004)
• Think of instruction stream being executed
• Looks like “ … add, load, multiply, store, load, add, …”
• Each load/store moves a word between fast and slow memory
• We want to count the number of loads and stores, given that we are
multiplying n-by-n matrices C = A·B using the usual 2n³ flops, possibly
reordered assuming addition is commutative/associative
• Assuming that at most M words can be stored in fast memory
• Outline:
• Break instruction stream into segments, each with M loads and stores
• Somehow bound the maximum number of flops that can be done in
each segment, call it F
• So F · #segments ≥ T = total flops = 2·n³, so #segments ≥ T / F
• So #loads & stores = M · #segments ≥ M · T / F
02/25/2016 CS267 Lecture 12 57
[Figure: an instruction stream over time, runs of Loads, Stores, and FLOPs broken into Segment 1, Segment 2, Segment 3, … – Illustrating Segments, for M=3]
02/25/2016 58
Proof of Communication Lower Bound on C = A·B (2/4)
[Figure: the multiply-adds of C = A·B drawn as unit cubes indexed by (i,j,k); the cube representing C(1,1) += A(1,3)·B(3,1) projects onto A(1,3) on the “A face”, B(3,1) on the “B face”, and C(1,1) on the “C face”.]
• If we have at most 2M “A squares”, 2M “B squares”, and
2M “C squares” on faces, how many cubes can we have?
59
Proof of Communication Lower Bound on C = A·B (3/4)
[Figure: a black box with side lengths x, y, z inside the (i,j,k) cube, together with its A, B, and C shadows on the three coordinate planes.]
# cubes in black box with side lengths x, y and z
= Volume of black box
= x·y·z
= (xz · zy · yx)^(1/2)
= (#A□s · #B□s · #C□s)^(1/2)
(i,k) is in A shadow if (i,j,k) in 3D set
(j,k) is in B shadow if (i,j,k) in 3D set
(i,j) is in C shadow if (i,j,k) in 3D set
Thm (Loomis & Whitney, 1949)
# cubes in 3D set = Volume of 3D set
≤ (area(A shadow) · area(B shadow) · area(C shadow))^(1/2)
61
Proof of Communication Lower Bound on C = A·B (4/4)
• Consider one “segment” of instructions with M loads, stores
• Can be at most 2M entries of A, B, C available in one segment
• Volume of set of cubes representing possible multiply/adds in
one segment is ≤ (2M · 2M · 2M)^(1/2) = (2M)^(3/2) ≡ F
• # Segments ≥ 2n³ / F
• # Loads & Stores = M · #Segments ≥ M · 2n³ / F
≥ n³ / (2M)^(1/2) – M = Ω(n³ / M^(1/2))
• Parallel Case: apply reasoning to one processor out of P
• # Adds and Muls ≥ 2n³ / P (at least one proc does this)
• M = n² / P (each processor gets equal fraction of matrix)
• # “Loads & Stores” = # words moved from or to other procs
≥ M · (2n³/P) / F = M · (2n³/P) / (2M)^(3/2) = n² / (2P)^(1/2)
62
02/25/2016 CS267 Lecture 12 63
Extra Slides
2/27/08 CS267 Guest Lecture 1 91
Recursive Layouts
• For both cache hierarchies and parallelism, recursive
layouts may be useful
• Z-Morton, U-Morton, and X-Morton Layout
• Also Hilbert layout and others
• What about the user’s view?
• Fortunately, many problems can be solved on a
permutation
• Never need to actually change the user’s layout
02/09/2006 CS267 Lecture 8 92
Gaussian Elimination
[Figure: three ways to organize Gaussian Elimination.
Standard Way: subtract a multiple of a row, zeroing one entry at a time.
LINPACK: apply the sequence of eliminations to a column (a3 = a3 − a1·a2).
LAPACK: apply the sequence to a block of nb columns (a2 = L⁻¹·a2), then apply the nb delayed updates to the rest of the matrix.]
Slide source: Dongarra
02/09/2006 CS267 Lecture 8 93
LU Algorithm:
1: Split matrix into two rectangles (m x n/2)
if only 1 column, scale by reciprocal of pivot & return
2: Apply LU Algorithm to the left part
3: Apply transformations to right part
(triangular solve A12 = L⁻¹·A12 and
matrix multiplication A22 = A22 − A21·A12)
4: Apply LU Algorithm to right part
Gaussian Elimination via a Recursive Algorithm
L A12
A21 A22
F. Gustavson and S. Toledo
Most of the work in the matrix multiply
Matrices of size n/2, n/4, n/8, …
Slide source: Dongarra
02/09/2006 CS267 Lecture 8 94
Recursive Factorizations
• Just as accurate as conventional method
• Same number of operations
• Automatic variable blocking
• Level 1 and 3 BLAS only !
• Extreme clarity and simplicity of expression
• Highly efficient
• The recursive formulation is just a rearrangement of the point-wise
LINPACK algorithm
• The standard error analysis applies (assuming the matrix
operations are computed the “conventional” way).
Slide source: Dongarra
02/09/2006 CS267 Lecture 8 95
[Performance plots: DGEMM (ATLAS) and recursive DGETRF on an AMD Athlon 1GHz (~$1100 system), MFlop/s vs matrix order 500–3000; LU Factorization on a Pentium III 550 MHz dual processor, Recursive LU vs LAPACK, uniprocessor and dual-processor, Mflop/s vs order 500–5000.]
Slide source: Dongarra
02/09/2006 CS267 Lecture 8 96
Review: BLAS 3 (Blocked) GEPP
for ib = 1 to n-1 step b … Process matrix b columns at a time
end = ib + b-1 … Point to end of block of b columns
apply BLAS2 version of GEPP to get A(ib:n , ib:end) = P’ * L’ * U’
… let LL denote the strict lower triangular part of A(ib:end , ib:end) + I
A(ib:end , end+1:n) = LL⁻¹ * A(ib:end , end+1:n) … update next b rows of U
A(end+1:n , end+1:n ) = A(end+1:n , end+1:n )
- A(end+1:n , ib:end) * A(ib:end , end+1:n)
… apply delayed updates with single matrix-multiply
… with inner dimension b
BLAS 3
02/09/2006 CS267 Lecture 8 97
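A NumPy/SciPy sketch of the blocked right-looking update pattern in the GEPP pseudocode above, with pivoting omitted for brevity (so it assumes a matrix that needs no pivoting, e.g. one that is diagonally dominant; real GEPP would factor each panel with partial pivoting and apply the row swaps to the rest of the matrix). The panel is factored with BLAS-2-style operations, then one triangular solve and one matrix multiply apply the delayed updates.

import numpy as np
from scipy.linalg import solve_triangular

def blocked_lu(A, b=64):
    # Return a copy of A overwritten with its LU factors
    # (unit lower L strictly below the diagonal, U on and above it).
    A = A.copy()
    n = A.shape[0]
    for ib in range(0, n, b):
        end = min(ib + b, n)
        # Panel factorization of A[ib:n, ib:end] (BLAS-2 style)
        for k in range(ib, end):
            A[k+1:, k] /= A[k, k]
            A[k+1:, k+1:end] -= np.outer(A[k+1:, k], A[k, k+1:end])
        # LL = strict lower triangle of the diagonal block + I
        LL = np.tril(A[ib:end, ib:end], -1) + np.eye(end - ib)
        # Update next b rows of U: A(ib:end, end:n) = LL^-1 * A(ib:end, end:n)
        A[ib:end, end:] = solve_triangular(LL, A[ib:end, end:],
                                           lower=True, unit_diagonal=True)
        # Apply delayed updates with a single matrix multiply (BLAS-3)
        A[end:, end:] -= A[end:, ib:end] @ A[ib:end, end:]
    return A

# quick check on a diagonally dominant matrix
n = 300
M = np.random.rand(n, n) + n * np.eye(n)
LU = blocked_lu(M, b=32)
L = np.tril(LU, -1) + np.eye(n)
U = np.triu(LU)
assert np.allclose(L @ U, M)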
Review: Row and Column Block Cyclic Layout
processors and matrix blocks
are distributed in a 2d array
pcol-fold parallelism
in any column, and calls to the
BLAS2 and BLAS3 on matrices of
size brow-by-bcol
serial bottleneck is eased
need not be symmetric in rows and
columns
02/09/2006 CS267 Lecture 8 98
Distributed GE with a 2D Block Cyclic Layout
block size b in the algorithm and the block sizes brow
and bcol in the layout satisfy b=brow=bcol.
shaded regions indicate busy processors or
communication performed.
unnecessary to have a barrier between each
step of the algorithm, e.g. steps 9, 10, and 11 can be
pipelined
02/09/2006 CS267 Lecture 8 99
Distributed GE with a 2D Block Cyclic Layout
02/09/2006 CS267 Lecture 8 100
[Figure: the trailing-matrix update in distributed GE – matrix multiply of green = green − blue · pink]
02/09/2006 CS267 Lecture 8 101
PDGESV = ScaLAPACK
parallel LU routine
Since it can run no faster than its
inner loop (PDGEMM), we measure:
Efficiency =
Speed(PDGESV)/Speed(PDGEMM)
Observations:
Efficiency well above 50% for large
enough problems
For fixed N, as P increases,
efficiency decreases
(just as for PDGEMM)
For fixed P, as N increases
efficiency increases
(just as for PDGEMM)
From bottom table, cost of solving
Ax=b about half of matrix multiply
for large enough matrices.
From the flop counts we would
expect it to be (2n³)/((2/3)n³) = 3
times faster, but communication
makes it a little slower.
02/09/2006 CS267 Lecture 8 102
02/09/2006 CS267 Lecture 8 103
Scales well,
nearly full machine speed
02/09/2006 CS267 Lecture 8 104
Old version,
pre 1998 Gordon Bell Prize
Still have ideas to accelerate
Project Available!
Old Algorithm,
plan to abandon
02/09/2006 CS267 Lecture 8 105
Have good ideas to speedup
Project available!
Hardest of all to parallelize
Have alternative, and
would like to compare
Project available!
02/09/2006 CS267 Lecture 8 106
Out-of-core means
matrix lives on disk;
too big for main mem
Much harder to hide
latency of disk
QR much easier than LU
because no pivoting
needed for QR
Moral: use QR to solve Ax=b
Projects available
(perhaps very hard…)
02/09/2006 CS267 Lecture 8 107
A small software project ...
02/09/2006 CS267 Lecture 8 108
Work-Depth Model of Parallelism
• The work depth model:
• The simplest model is used
• For algorithm design, independent of a machine
• The work, W, is the total number of operations
• The depth, D, is the longest chain of dependencies
• The parallelism, P, is defined as W/D
• Specific examples include:
• circuit model, each input defines a graph with ops at
nodes
• vector model, each step is an operation on a vector of
elements
• language model, where set of operations defined by
language
02/09/2006 CS267 Lecture 8 109
Latency Bandwidth Model
• Network of fixed number P of processors
• fully connected
• each with local memory
• Latency (α)
• accounts for varying performance with number of
messages
• gap (g) in LogP model may be more accurate cost if
messages are pipelined
• Inverse bandwidth (β)
• accounts for performance varying with volume of data
• Efficiency (in any model):
• serial time / (p * parallel time)
• perfect (linear) speedup ⟺ efficiency = 1
02/09/2006 CS267 Lecture 8 110
Initial Step to Skew Matrices in Cannon
• Initial blocked input
• After skewing before initial block multiplies
[Figure: the 3×3 block layouts of A(i,j) and B(i,j) before skewing, and the same blocks after the initial skew (block row i of A shifted left by i, block column j of B shifted up by j), ready for the first block multiplies.]
02/09/2006 CS267 Lecture 8 111
Skewing Steps in Cannon
• First step
• Second
• Third
[Figure: positions of the A(i,j) and B(i,j) blocks after the first, second, and third shift steps – A blocks rotate left along their rows and B blocks rotate up along their columns.]
2/25/2009 CS267 Lecture 8 112
Motivation (1)
3 Basic Linear Algebra Problems
1. Linear Equations: Solve Ax=b for x
2. Least Squares: Find x that minimizes ||r||₂ ≡ √(Σ ri²)
where r = Ax − b
• Statistics: Fitting data with simple functions
3a. Eigenvalues: Find λ and x where Ax = λx
• Vibration analysis, e.g., earthquakes, circuits
3b. Singular Value Decomposition: AᵀAx = σ²x
• Data fitting, Information retrieval
Lots of variations depending on structure of A
• A symmetric, positive definite, banded, …
2/25/2009 CS267 Lecture 8 113
Motivation (2)
•Why dense A, as opposed to sparse A?
• Many large matrices are sparse, but …
• Dense algorithms easier to understand
• Some applications yield large dense
matrices
• LINPACK Benchmark (www.top500.org)
• “How fast is your computer?” =
“How fast can you solve dense Ax=b?”
• Large sparse matrix algorithms often yield
smaller (but still large) dense problems
• Do ParLab Apps most use small dense matrices?
02/25/2009 CS267 Lecture 8
Algorithms for 2D (3D) Poisson Equation (N = n2 (n3) vars)
Algorithm        Serial             PRAM                       Memory             #Procs
Dense LU         N³                 N                          N²                 N²
Band LU          N² (N^(7/3))       N                          N^(3/2) (N^(5/3))  N (N^(4/3))
Jacobi           N² (N^(5/3))       N (N^(2/3))                N                  N
Explicit Inv.    N²                 log N                      N²                 N²
Conj. Gradients  N^(3/2) (N^(4/3))  N^(1/2) (N^(1/3)) · log N  N                  N
Red/Black SOR    N^(3/2) (N^(4/3))  N^(1/2) (N^(1/3))          N                  N
Sparse LU        N^(3/2) (N²)       N^(1/2)                    N·log N (N^(4/3))  N
FFT              N·log N            log N                      N                  N
Multigrid        N                  log² N                     N                  N
Lower bound      N                  log N                      N
PRAM is an idealized parallel model with zero cost communication
Reference: James Demmel, Applied Numerical Linear Algebra, SIAM, 1997
(Note: corrected complexities for 3D case from last lecture!).
Lessons and Questions (1)
• Structure of the problem matters
• Cost of solution can vary dramatically (n3 to n)
• Many other examples
• Some structure can be figured out automatically
• “A\b” can figure out symmetry, some sparsity
• Some structures known only to (smart) user
• If performance not critical, user may be happy to settle for A\b
• How much of this goes into the motifs?
• How much should we try to help user choose?
• Tuning, but at algorithmic choice level (SALSA)
• Motifs overlap
• Dense, sparse, (un)structured grids, spectral
Organizing Linear Algebra (1)
• By Operations
• Low level (eg mat-mul: BLAS)
• Standard level (eg solve Ax=b, Ax=λx: Sca/LAPACK)
• Applications level (eg systems & control: SLICOT)
• By Performance/accuracy tradeoffs
• “Direct methods” with guarantees vs “iterative methods” that
may work faster and accurately enough
• By Structure
• Storage
• Dense
– columnwise, rowwise, 2D block cyclic, recursive space-filling curves
• Banded, sparse (many flavors), black-box, …
• Mathematical
• Symmetries, positive definiteness, conditioning, …
• As diverse as the world being modeled
Organizing Linear Algebra (2)
• By Data Type
• Real vs Complex
• Floating point (fixed or varying length), other
• By Target Platform
• Serial, manycore, GPU, distributed memory, out-of-
DRAM, Grid, …
• By programming interface
• Language bindings
• “A\b” versus access to details
For all linear algebra problems:
Ex: LAPACK Table of Contents
• Linear Systems
• Least Squares
• Overdetermined, underdetermined
• Unconstrained, constrained, weighted
• Eigenvalues and vectors of Symmetric Matrices
• Standard (Ax = λx), Generalized (Ax = λBx)
• Eigenvalues and vectors of Unsymmetric matrices
• Eigenvalues, Schur form, eigenvectors, invariant subspaces
• Standard, Generalized
• Singular Values and vectors (SVD)
• Standard, Generalized
• Level of detail
• Simple Driver
• Expert Drivers with error bounds, extra-precision, other options
• Lower level routines (“apply certain kind of orthogonal transformation”)
For all matrix/problem structures:
Ex: LAPACK Table of Contents
• BD – bidiagonal
• GB – general banded
• GE – general
• GG – general , pair
• GT – tridiagonal
• HB – Hermitian banded
• HE – Hermitian
• HG – upper Hessenberg, pair
• HP – Hermitian, packed
• HS – upper Hessenberg
• OR – (real) orthogonal
• OP – (real) orthogonal, packed
• PB – positive definite, banded
• PO – positive definite
• PP – positive definite, packed
• PT – positive definite, tridiagonal
• SB – symmetric, banded
• SP – symmetric, packed
• ST – symmetric, tridiagonal
• SY – symmetric
• TB – triangular, banded
• TG – triangular, pair
• TB – triangular, banded
• TP – triangular, packed
• TR – triangular
• TZ – trapezoidal
• UN – unitary
• UP – unitary packed
For all data types:
Ex: LAPACK Table of Contents
• Real and complex
• Single and double precision
• Arbitrary precision in progress
Organizing Linear Algebra (3)
www.netlib.org/lapack www.netlib.org/scalapack
www.cs.utk.edu/~dongarra/etemplates
www.netlib.org/templates
gams.nist.gov
2/27/08 CS267 Guest Lecture 1 128
Review of the BLAS
BLAS level   Ex.                   # mem refs   # flops   q
1            “Axpy”, dot product   3n           2n        2/3
2            Matrix-vector mult    n²           2n²       2
• Building blocks for all linear algebra
• Parallel versions call serial versions on each processor
• So they must be fast!
• Define q = # flops / # mem refs = “computational intensity”
• The larger is q, the faster the algorithm can go in the
presence of memory hierarchy
• “axpy”: y = a*x + y, where a scalar, x and y vectors
02/22/2011 CS267 Lecture 11
130
Summary of Parallel Matrix Multiplication so far
• 1D Layout
• Bus without broadcast - slower than serial
• Nearest neighbor communication on a ring (or bus with
broadcast): Efficiency = 1/(1 + O(p/n))
• 2D Layout – one copy of all matrices (O(n²/p) per processor)
• Cannon
• Efficiency = 1/(1 + O(α * (sqrt(p)/n)³ + β * sqrt(p)/n)) – optimal!
• Hard to generalize for general p, n, block cyclic, alignment
• SUMMA
• Efficiency = 1/(1 + O(α * log p * p/(b*n²) + β * log p * sqrt(p)/n)), where b is the block size
• Very General
• b small => less memory, lower efficiency
• b large => more memory, higher efficiency
• Used in practice (PBLAS)
Why?
Skimbleshanks-The-Railway-Cat by T S EliotSkimbleshanks-The-Railway-Cat by T S Eliot
Skimbleshanks-The-Railway-Cat by T S Eliot
 
Stack Memory Organization of 8086 Microprocessor
Stack Memory Organization of 8086 MicroprocessorStack Memory Organization of 8086 Microprocessor
Stack Memory Organization of 8086 Microprocessor
 
Philippine Edukasyong Pantahanan at Pangkabuhayan (EPP) Curriculum
Philippine Edukasyong Pantahanan at Pangkabuhayan (EPP) CurriculumPhilippine Edukasyong Pantahanan at Pangkabuhayan (EPP) Curriculum
Philippine Edukasyong Pantahanan at Pangkabuhayan (EPP) Curriculum
 
Traditional Musical Instruments of Arunachal Pradesh and Uttar Pradesh - RAYH...
Traditional Musical Instruments of Arunachal Pradesh and Uttar Pradesh - RAYH...Traditional Musical Instruments of Arunachal Pradesh and Uttar Pradesh - RAYH...
Traditional Musical Instruments of Arunachal Pradesh and Uttar Pradesh - RAYH...
 
SWOT analysis in the project Keeping the Memory @live.pptx
SWOT analysis in the project Keeping the Memory @live.pptxSWOT analysis in the project Keeping the Memory @live.pptx
SWOT analysis in the project Keeping the Memory @live.pptx
 
Data Structure using C by Dr. K Adisesha .ppsx
Data Structure using C by Dr. K Adisesha .ppsxData Structure using C by Dr. K Adisesha .ppsx
Data Structure using C by Dr. K Adisesha .ppsx
 
Andreas Schleicher presents PISA 2022 Volume III - Creative Thinking - 18 Jun...
Andreas Schleicher presents PISA 2022 Volume III - Creative Thinking - 18 Jun...Andreas Schleicher presents PISA 2022 Volume III - Creative Thinking - 18 Jun...
Andreas Schleicher presents PISA 2022 Volume III - Creative Thinking - 18 Jun...
 
Pharmaceutics Pharmaceuticals best of brub
Pharmaceutics Pharmaceuticals best of brubPharmaceutics Pharmaceuticals best of brub
Pharmaceutics Pharmaceuticals best of brub
 
How to Predict Vendor Bill Product in Odoo 17
How to Predict Vendor Bill Product in Odoo 17How to Predict Vendor Bill Product in Odoo 17
How to Predict Vendor Bill Product in Odoo 17
 
Mule event processing models | MuleSoft Mysore Meetup #47
Mule event processing models | MuleSoft Mysore Meetup #47Mule event processing models | MuleSoft Mysore Meetup #47
Mule event processing models | MuleSoft Mysore Meetup #47
 
Benner "Expanding Pathways to Publishing Careers"
Benner "Expanding Pathways to Publishing Careers"Benner "Expanding Pathways to Publishing Careers"
Benner "Expanding Pathways to Publishing Careers"
 
Standardized tool for Intelligence test.
Standardized tool for Intelligence test.Standardized tool for Intelligence test.
Standardized tool for Intelligence test.
 
RHEOLOGY Physical pharmaceutics-II notes for B.pharm 4th sem students
RHEOLOGY Physical pharmaceutics-II notes for B.pharm 4th sem studentsRHEOLOGY Physical pharmaceutics-II notes for B.pharm 4th sem students
RHEOLOGY Physical pharmaceutics-II notes for B.pharm 4th sem students
 
Gender and Mental Health - Counselling and Family Therapy Applications and In...
Gender and Mental Health - Counselling and Family Therapy Applications and In...Gender and Mental Health - Counselling and Family Therapy Applications and In...
Gender and Mental Health - Counselling and Family Therapy Applications and In...
 

lecture12_densela_1_jwd16.ppt

  • 10. A brief history of (Dense) Linear Algebra software (2/7) • But the BLAS-1 weren’t enough • Consider AXPY ( y = α·x + y ): 2n flops on 3n read/writes • Computational intensity = (2n)/(3n) = 2/3 • Too low to run near peak speed (read/write dominates) • Hard to vectorize (“SIMD’ize”) on supercomputers of the day (1980s) • So the BLAS-2 were invented (1984-1986) • Standard library of 25 operations (mostly) on matrix/vector pairs • “GEMV”: y = α·A·x + β·y, “GER”: A = A + α·x·y^T, “TRSV”: x = T^(-1)·x • Up to 4 versions of each (S/D/C/Z), 66 routines, 18K LOC • Why BLAS 2 ? They do O(n^2) ops on O(n^2) data • So computational intensity still just ~(2n^2)/(n^2) = 2 • OK for vector machines, but not for machines with caches 02/25/2016 CS267 Lecture 12 10
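The computational-intensity figures quoted above and on the next slide are easy to reproduce. A minimal Python sketch (function names and sample sizes are ours, not part of the lecture) that evaluates q = flops / words moved for the three BLAS levels:

    # Rough computational intensity q = flops / memory references, assuming each
    # vector/matrix element is read or written once from slow memory.
    def q_axpy(n):    # BLAS-1: y = a*x + y -> 2n flops, 3n reads/writes
        return (2 * n) / (3 * n)
    def q_gemv(n):    # BLAS-2: y = a*A*x + b*y -> ~2n^2 flops, ~n^2 reads/writes
        return (2 * n**2) / (n**2 + 3 * n)
    def q_gemm(n):    # BLAS-3: C = a*A*B + b*C -> 2n^3 flops, ~4n^2 reads/writes
        return (2 * n**3) / (4 * n**2)
    for n in (10, 100, 1000):
        print(n, round(q_axpy(n), 2), round(q_gemv(n), 2), round(q_gemm(n), 1))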
  • 11. A brief history of (Dense) Linear Algebra software (3/7) • The next step: BLAS-3 (1987-1988) • Standard library of 9 operations (mostly) on matrix/matrix pairs • “GEMM”: C = α·A·B + β·C, C = α·A·AT + β·C, B = T-1·B • Up to 4 versions of each (S/D/C/Z), 30 routines, 10K LOC • Why BLAS 3 ? They do O(n3) ops on O(n2) data • So computational intensity (2n3)/(4n2) = n/2 – big at last! • Good for machines with caches, other mem. hierarchy levels • How much BLAS1/2/3 code so far (all at www.netlib.org/blas) • Source: 142 routines, 31K LOC, Testing: 28K LOC • Reference (unoptimized) implementation only • Ex: 3 nested loops for GEMM • Lots more optimized code (eg Homework 1) • Motivates “automatic tuning” of the BLAS • Part of standard math libraries (eg AMD ACML, Intel MKL) 02/25/2016 CS267 Lecture 12 11
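Since the slide notes that the reference implementation of GEMM is just three nested loops, here is a hedged NumPy sketch of such an unoptimized GEMM (our own illustration, not the netlib reference code), checked against NumPy's built-in product:

    import numpy as np

    def gemm_ref(A, B, C, alpha=1.0, beta=1.0):
        # Unoptimized reference-style GEMM: C = alpha*A*B + beta*C via 3 nested loops.
        n, k = A.shape
        _, m = B.shape
        C *= beta
        for i in range(n):
            for j in range(m):
                for p in range(k):
                    C[i, j] += alpha * A[i, p] * B[p, j]
        return C

    A, B = np.random.rand(30, 30), np.random.rand(30, 30)
    assert np.allclose(gemm_ref(A, B, np.zeros((30, 30))), A @ B)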
  • 12. 02/25/2009 CS267 Lecture 8 12 BLAS Standards Committee to start meeting again May 2016: Batched BLAS: many independent BLAS operations at once Reproducible BLAS: getting bitwise identical answers from run-to-run, despite non-associative floating point, and dynamic scheduling of resources (bebop.cs.berkeley.edu/reproblas) Low-Precision BLAS: 16-bit floating point See www.netlib.org/blas/blast-forum/ for previous extension attempt New functions, Sparse BLAS, Extended Precision BLAS
  • 13. A brief history of (Dense) Linear Algebra software (4/7) • LAPACK – “Linear Algebra PACKage” - uses BLAS-3 (1989 – now) • Ex: Obvious way to express Gaussian Elimination (GE) is adding multiples of one row to other rows – BLAS-1 • How do we reorganize GE to use BLAS-3 ? (details later) • Contents of LAPACK (summary) • Algorithms that are (nearly) 100% BLAS 3 – Linear Systems: solve Ax=b for x – Least Squares: choose x to minimize ||Ax-b||2 • Algorithms that are only 50% BLAS 3 – Eigenproblems: Find λ and x where Ax = λx – Singular Value Decomposition (SVD) • Generalized problems (eg Ax = λBx) • Error bounds for everything • Lots of variants depending on A’s structure (banded, A=A^T, etc) • How much code? (Release 3.6.0, Nov 2015) (www.netlib.org/lapack) • Source: 1750 routines, 721K LOC, Testing: 1094 routines, 472K LOC • Ongoing development (at UCB and elsewhere) (class projects!) • Next planned release June 2016 13
  • 14. A brief history of (Dense) Linear Algebra software (5/7) • Is LAPACK parallel? • Only if the BLAS are parallel (possible in shared memory) • ScaLAPACK – “Scalable LAPACK” (1995 – now) • For distributed memory – uses MPI • More complex data structures, algorithms than LAPACK • Only subset of LAPACK’s functionality available • Details later (class projects!) • All at www.netlib.org/scalapack 02/25/2016 CS267 Lecture 12 14
  • 15. 02/25/2016 CS267 Lecture 12 15 Success Stories for Sca/LAPACK (6/7) Cosmic Microwave Background Analysis, BOOMERanG collaboration, MADCAP code (Apr. 27, 2000). • Widely used • Adopted by Mathworks, Cray, Fujitsu, HP, IBM, IMSL, Intel, NAG, NEC, SGI, … • 7.5M webhits/year @ Netlib (incl. CLAPACK, LAPACK95) • New Science discovered through the solution of dense matrix systems • Nature article on the flat universe used ScaLAPACK • Other articles in Physics Review B that also use it • 1998 Gordon Bell Prize • www.nersc.gov/assets/NewsImages/2003/ newNERSCresults050703.pdf
  • 16. A brief future look at (Dense) Linear Algebra software (7/7) • PLASMA, DPLASMA and MAGMA (now) • Ongoing extensions to Multicore/GPU/Heterogeneous • Can one software infrastructure accommodate all algorithms and platforms of current (future) interest? • How much code generation and tuning can we automate? • Details later (Class projects!) (icl.cs.utk.edu/{{d}plasma,magma}) • Other related projects • Elemental (libelemental.org) • Distributed memory dense linear algebra • “Balance ease of use and high performance” • FLAME (z.cs.utexas.edu/wiki/flame.wiki/FrontPage) • Formal Linear Algebra Method Environment • Attempt to automate code generation across multiple platforms • So far, none of these libraries minimize communication in all cases (not even matmul!)
  • 17. 17 Back to basics: Why avoiding communication is important (1/3) Algorithms have two costs: 1.Arithmetic (FLOPS) 2.Communication: moving data between • levels of a memory hierarchy (sequential case) • processors over a network (parallel case). CPU Cache DRAM CPU DRAM CPU DRAM CPU DRAM CPU DRAM 02/25/2016 CS267 Lecture 12
  • 18. Why avoiding communication is important (2/3) • Running time of an algorithm is sum of 3 terms: • # flops * time_per_flop • # words moved / bandwidth • # messages * latency (the last two terms are communication) • Time_per_flop << 1/bandwidth << latency • Gaps growing exponentially with time • Annual improvements: time_per_flop 59%; bandwidth: Network 26%, DRAM 23%; latency: Network 15%, DRAM 5% • Minimize communication to save time 02/25/2016 CS267 Lecture 12
  • 19. Why Minimize Communication? (3/3) Source: John Shalf, LBL
  • 20. Why Minimize Communication? (3/3) Source: John Shalf, LBL Minimize communication to save energy
  • 21. Goal: Organize Linear Algebra to Avoid Communication 21 • Between all memory hierarchy levels • L1 L2 DRAM network, etc • Not just hiding communication (overlap with arithmetic) • Speedup  2x • Arbitrary speedups/energy savings possible • Later: Same goal for other computational patterns • Lots of open problems 02/25/2016 CS267 Lecture 12
  • 22. Review: Blocked Matrix Multiply • Blocked Matmul C = A·B breaks A, B and C into blocks with dimensions that depend on cache size 22 … Break Anxn, Bnxn, Cnxn into bxb blocks labeled A(i,j), etc … b chosen so 3 bxb blocks fit in cache for i = 1 to n/b, for j=1 to n/b, for k=1 to n/b C(i,j) = C(i,j) + A(i,k)·B(k,j) … b x b matmul, 4b2 reads/writes • When b=1, get “naïve” algorithm, want b larger … • (n/b)3 · 4b2 = 4n3/b reads/writes altogether • Minimized when 3b2 = cache size = M, yielding O(n3/M1/2) reads/writes • What if we had more levels of memory? (L1, L2, cache etc)? • Would need 3 more nested loops per level • Recursive (cache-oblivious algorithm) also possible 02/25/2016 CS267 Lecture 12
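A runnable NumPy version of the blocked pseudocode above (our sketch; it assumes n is a multiple of the block size b):

    import numpy as np

    def blocked_matmul(A, B, b):
        # C = A*B using b-by-b blocks, mirroring the three blocked loops above.
        n = A.shape[0]
        C = np.zeros((n, n))
        for i in range(0, n, b):
            for j in range(0, n, b):
                for k in range(0, n, b):
                    # b x b block multiply-accumulate (~4b^2 block reads/writes)
                    C[i:i+b, j:j+b] += A[i:i+b, k:k+b] @ B[k:k+b, j:j+b]
        return C

    n, b = 64, 16
    A, B = np.random.rand(n, n), np.random.rand(n, n)
    assert np.allclose(blocked_matmul(A, B, b), A @ B)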
  • 23. Communication Lower Bounds: Prior Work on Matmul • Assume n^3 algorithm (i.e. not Strassen-like) • Sequential case, with fast memory of size M • Lower bound on #words moved to/from slow memory = Ω(n^3 / M^(1/2)) [Hong, Kung, 81] • Attained using blocked or cache-oblivious algorithms 23 • Parallel case on P processors: • Let M be memory per processor; assume load balanced • Lower bound on #words moved = Ω((n^3/p) / M^(1/2)) [Irony, Tiskin, Toledo, 04] • If M = 3n^2/p (one copy of each matrix), then lower bound = Ω(n^2 / p^(1/2)) • Attained by SUMMA, Cannon’s algorithm 02/25/2016 CS267 Lecture 12
  • 24. New lower bound for all “direct” linear algebra • Holds for • Matmul, BLAS, LU, QR, eig, SVD, tensor contractions, … • Some whole programs (sequences of these operations, no matter how they are interleaved, eg computing A^k) • Dense and sparse matrices (where #flops << n^3) • Sequential and parallel algorithms • Some graph-theoretic algorithms (eg Floyd-Warshall) • Generalizations later (Strassen-like algorithms, loops accessing arrays) 24 Let M = “fast” memory size per processor = cache size (sequential case) or O(n^2/p) (parallel case) #flops = number of flops done per processor #words_moved per processor = Ω(#flops / M^(1/2)) #messages_sent per processor = Ω(#flops / M^(3/2)) 02/25/2016 CS267 Lecture 12
  • 25. New lower bound for all “direct” linear algebra • Sequential case, dense n x n matrices, so O(n^3) flops • #words_moved = Ω(n^3 / M^(1/2)) • #messages_sent = Ω(n^3 / M^(3/2)) • Parallel case, dense n x n matrices • Load balanced, so O(n^3/p) flops per processor • One copy of data, load balanced, so M = O(n^2/p) per processor • #words_moved = Ω(n^2 / p^(1/2)) • #messages_sent = Ω(p^(1/2)) 25 Let M = “fast” memory size per processor = cache size (sequential case) or O(n^2/p) (parallel case) #flops = number of flops done per processor #words_moved per processor = Ω(#flops / M^(1/2)) #messages_sent per processor = Ω(#flops / M^(3/2)) 02/25/2016 CS267 Lecture 12 SIAM Linear Algebra Prize, 2012
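To make the parallel bounds concrete, here is a tiny Python sketch (our own; constants are dropped, so the numbers are order-of-magnitude only) that plugs a sample problem size into the formulas above, assuming one copy of the data so M = n^2/p:

    from math import sqrt

    def dense_parallel_lower_bounds(n, p):
        # Order-of-magnitude lower bounds per processor, with M = n**2 / p.
        M = n**2 / p
        flops = 2 * n**3 / p
        words = flops / sqrt(M)       # ~ n**2 / sqrt(p)
        messages = flops / M**1.5     # ~ sqrt(p)
        return words, messages

    for p in (16, 256, 4096):
        w, m = dense_parallel_lower_bounds(10_000, p)
        print(f"p={p:5d}  words/proc ~ {w:.3g}  messages/proc ~ {m:.3g}")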
  • 26. Can we attain these lower bounds? • Do conventional dense algorithms as implemented in LAPACK and ScaLAPACK attain these bounds? • Mostly not yet, work in progress • If not, are there other algorithms that do? • Yes • Goals for algorithms: • Minimize #words_moved • Minimize #messages_sent • Need new data structures • Minimize for multiple memory hierarchy levels • Cache-oblivious algorithms would be simplest • Fewest flops when matrix fits in fastest memory • Cache-oblivious algorithms don’t always attain this • Attainable for nearly all dense linear algebra • Just a few prototype implementations so far (class projects!) • Only a few sparse algorithms so far (eg Cholesky) 26 02/25/2016 CS267 Lecture 12
  • 27. 02/25/2016 CS267 Lecture 12 27 Outline • History and motivation • What is dense linear algebra? • Why minimize communication? • Lower bound on communication • Parallel Matrix-matrix multiplication • Attaining the lower bound • Other Parallel Algorithms (next lecture)
  • 28. 02/25/2016 CS267 Lecture 12 28 Different Parallel Data Layouts for Matrices (not all!) 1) 1D Column Blocked Layout 2) 1D Column Cyclic Layout 3) 1D Column Block Cyclic Layout (block size b) 4) Row versions of the previous layouts 5) 2D Row and Column Blocked Layout 6) 2D Row and Column Block Cyclic Layout (generalizes the others) [Figure: processor-number patterns for each layout on a 4-processor example]
  • 29. 02/25/2016 CS267 Lecture 12 29 Parallel Matrix-Vector Product • Compute y = y + A*x, where A is a dense matrix • Layout: 1D row blocked • A(i) refers to the n/p by n block row that processor i owns • x(i) and y(i) similarly refer to the segments of x, y owned by processor i • Algorithm: for each processor i: broadcast x(i), then compute y(i) = A(i)*x • Algorithm uses the formula y(i) = y(i) + A(i)*x = y(i) + Σj A(i,j)*x(j) [Figure: vectors x, y and block rows A(0)–A(3) distributed over P0–P3]
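A serial NumPy simulation of this row-blocked algorithm (our sketch; the broadcast of x is implicit because everything lives in one address space, and we assume p divides n):

    import numpy as np

    def matvec_1d_rowblocked(A, x, y, p):
        # Simulate y = y + A*x with a 1D row-blocked layout over p "processors":
        # processor i owns block row A(i) and segment y(i); x is broadcast to all.
        n = A.shape[0]
        rows = n // p
        for i in range(p):
            lo, hi = i * rows, (i + 1) * rows
            y[lo:hi] += A[lo:hi, :] @ x      # local work: y(i) = y(i) + A(i)*x
        return y

    n, p = 12, 4
    A, x, y = np.random.rand(n, n), np.random.rand(n), np.random.rand(n)
    assert np.allclose(matvec_1d_rowblocked(A, x, y.copy(), p), y + A @ x)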
  • 30. 02/25/2016 CS267 Lecture 12 30 Matrix-Vector Product y = y + A*x • A column layout of the matrix eliminates the broadcast of x • But adds a reduction to update the destination y • A 2D blocked layout uses a broadcast and reduction, both on a subset of processors • sqrt(p) for square processor grid [Figure: 1D column layout over P0–P3 and a 4 x 4 2D blocked layout over P0–P15]
  • 31. 02/25/2016 CS267 Lecture 12 31 Parallel Matrix Multiply • Computing C=C+A*B • Using basic algorithm: 2*n^3 Flops • Variables are: • Data layout: 1D? 2D? Other? • Topology of machine: Ring? Torus? • Scheduling communication • Use of performance models for algorithm design • Message Time = “latency” + #words * time-per-word = a + n*b • Efficiency (in any model): • serial time / (p * parallel time) • perfect (linear) speedup ⇔ efficiency = 1
  • 32. 02/25/2016 CS267 Lecture 12 32 Matrix Multiply with 1D Column Layout • Assume matrices are n x n and n is divisible by p (may be a reasonable assumption for analysis, not for code) • A(i) refers to the n by n/p block column that processor i owns (similarly for B(i) and C(i)) • B(i,j) is the n/p by n/p sub-block of B(i) • in rows j*n/p through (j+1)*n/p - 1 • Algorithm uses the formula C(i) = C(i) + A*B(i) = C(i) + Σj A(j)*B(j,i) [Figure: block columns owned by processors p0–p7]
  • 33. 02/25/2016 CS267 Lecture 12 33 Matrix Multiply: 1D Layout on Bus or Ring • Algorithm uses the formula C(i) = C(i) + A*B(i) = C(i) + Sj A(j)*B(j,i) • First consider a bus-connected machine without broadcast: only one pair of processors can communicate at a time (ethernet) • Second consider a machine with processors on a ring: all processors may communicate with nearest neighbors simultaneously
  • 34. 02/25/2016 CS267 Lecture 12 34 MatMul: 1D layout on Bus without Broadcast Naïve algorithm: C(myproc) = C(myproc) + A(myproc)*B(myproc,myproc) for i = 0 to p-1 for j = 0 to p-1 except i if (myproc == i) send A(i) to processor j if (myproc == j) receive A(i) from processor i C(myproc) = C(myproc) + A(i)*B(i,myproc) barrier Cost of inner loop: computation: 2*n*(n/p)2 = 2*n3/p2 communication: a + b*n2 /p
  • 35. 02/25/2016 CS267 Lecture 12 35 Naïve MatMul (continued) Cost of inner loop: computation: 2*n*(n/p)^2 = 2*n^3/p^2 communication: a + b*n^2/p … approximately Only 1 pair of processors (i and j) are active on any iteration, and of those, only i is doing computation => the algorithm is almost entirely serial Running time: = (p*(p-1) + 1)*computation + p*(p-1)*communication ≈ 2*n^3 + p^2*a + p*n^2*b This is worse than the serial time and grows with p.
  • 36. 02/25/2016 CS267 Lecture 12 36 Matmul for 1D layout on a Processor Ring • Pairs of adjacent processors can communicate simultaneously Copy A(myproc) into Tmp C(myproc) = C(myproc) + Tmp*B(myproc , myproc) for j = 1 to p-1 Send Tmp to processor myproc+1 mod p Receive Tmp from processor myproc-1 mod p C(myproc) = C(myproc) + Tmp*B( myproc-j mod p , myproc) • Same idea as for gravity in simple sharks and fish algorithm • May want double buffering in practice for overlap • Ignoring deadlock details in code • Time of inner loop = 2*(a + b*n2/p) + 2*n*(n/p)2
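A serial NumPy simulation of the ring algorithm above (our sketch; the Send/Receive pair becomes a circular shift of the Tmp block columns, and we assume p divides n):

    import numpy as np

    def matmul_1d_ring(A, B, p):
        # Processor `me` owns block columns A(me), B(me), C(me) of width n/p.
        # Each round it multiplies its current Tmp (some A(src)) by B(src, me),
        # then passes Tmp to its right neighbour on the ring.
        n = A.shape[0]
        w = n // p
        C = np.zeros((n, n))
        Tmp = [A[:, i*w:(i+1)*w].copy() for i in range(p)]
        for j in range(p):
            for me in range(p):
                src = (me - j) % p           # Tmp currently held by `me` is A(src)
                C[:, me*w:(me+1)*w] += Tmp[me] @ B[src*w:(src+1)*w, me*w:(me+1)*w]
            Tmp = [Tmp[(me - 1) % p] for me in range(p)]   # shift around the ring
        return C

    n, p = 12, 4
    A, B = np.random.rand(n, n), np.random.rand(n, n)
    assert np.allclose(matmul_1d_ring(A, B, p), A @ B)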
  • 37. 02/25/2016 CS267 Lecture 12 37 Matmul for 1D layout on a Processor Ring • Time of inner loop = 2*(a + b*n^2/p) + 2*n*(n/p)^2 • Total Time = 2*n*(n/p)^2 + (p-1) * Time of inner loop • ≈ 2*n^3/p + 2*p*a + 2*b*n^2 • (Nearly) Optimal for 1D layout on Ring or Bus, even with Broadcast: • Perfect speedup for arithmetic • A(myproc) must move to each other processor, costs at least (p-1)*cost of sending n*(n/p) words • Parallel Efficiency = 2*n^3 / (p * Total Time) = 1/(1 + a * p^2/(2*n^3) + b * p/(2*n)) = 1/(1 + O(p/n)) • Grows to 1 as n/p increases (or a and b shrink) • But far from communication lower bound
  • 38. 02/25/2016 CS267 Lecture 12 38 Need to try 2D Matrix layout 1) 1D Column Blocked Layout 2) 1D Column Cyclic Layout 3) 1D Column Block Cyclic Layout (block size b) 4) Row versions of the previous layouts 5) 2D Row and Column Blocked Layout 6) 2D Row and Column Block Cyclic Layout (generalizes the others) [Figure: processor-number patterns for each layout on a 4-processor example]
  • 39. Summary of Parallel Matrix Multiply • SUMMA • Scalable Universal Matrix Multiply Algorithm • Attains communication lower bounds (within log p) • Cannon • Historically first, attains lower bounds • More assumptions • A and B square • P a perfect square • 2.5D SUMMA • Uses more memory to communicate even less • Parallel Strassen • Attains different, even lower bounds 02/25/2016 CS267 Lecture 12 39
  • 40. 02/25/2016 CS267 Lecture 12 40 SUMMA Algorithm • SUMMA = Scalable Universal Matrix Multiply • Presentation from van de Geijn and Watts • www.netlib.org/lapack/lawns/lawn96.ps • Similar ideas appeared many times • Used in practice in PBLAS = Parallel BLAS • www.netlib.org/lapack/lawns/lawn100.ps
  • 41. SUMMA uses Outer Product form of MatMul • C = A*B means C(i,j) = Sk A(i,k)*B(k,j) • Column-wise outer product: C = A*B = Sk A(:,k)*B(k,:) = Sk (k-th col of A)*(k-th row of B) • Block column-wise outer product (block size = 4 for illustration) C = A*B = A(:,1:4)*B(1:4,:) + A(:,5:8)*B(5:8,:) + … = Sk (k-th block of 4 cols of A)* (k-th block of 4 rows of B) 02/25/2016 CS267 Lecture 12 41
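A quick NumPy check of the block outer-product identity that SUMMA is built on (our illustration; the block size b = 4 is arbitrary):

    import numpy as np

    n, b = 12, 4
    A, B = np.random.rand(n, n), np.random.rand(n, n)
    # C = sum over k of (k-th block of b columns of A) * (k-th block of b rows of B)
    C = np.zeros((n, n))
    for k in range(0, n, b):
        C += A[:, k:k+b] @ B[k:k+b, :]
    assert np.allclose(C, A @ B)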
  • 42. 42 SUMMA – n x n matmul on P1/2 x P1/2 grid • C[i, j] is n/P1/2 x n/P1/2 submatrix of C on processor Pij • A[i,k] is n/P1/2 x b submatrix of A • B[k,j] is b x n/P1/2 submatrix of B • C[i,j] = C[i,j] + Sk A[i,k]*B[k,j] • summation over submatrices • Need not be square processor grid * = i j A[i,k] k k B[k,j] C[i,j] 02/25/2016 CS267 Lecture 12
  • 43. 43 SUMMA– n x n matmul on P1/2 x P1/2 grid * = i j A[i,k] k k B[k,j] C(i,j) For k=0 to n/b-1 for all i = 1 to P1/2 owner of A[i,k] broadcasts it to whole processor row (using binary tree) for all j = 1 to P1/2 owner of B[k,j] broadcasts it to whole processor column (using bin. tree) Receive A[i,k] into Acol Receive B[k,j] into Brow C_myproc = C_myproc + Acol * Brow Brow Acol 02/25/2016 CS267 Lecture 12
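A serial NumPy simulation of the SUMMA loop above on a virtual q x q processor grid (our sketch; the row and column broadcasts are implicit because all blocks live in one address space, and we assume q and the panel width b both divide n):

    import numpy as np

    def summa_sim(A, B, q, b):
        n = A.shape[0]
        s = n // q                        # each "processor" owns an s x s block of C
        C = np.zeros((n, n))
        for k in range(0, n, b):          # outer loop over panels of width b
            Acol = A[:, k:k+b]            # panel of A; grid row i uses its slice of it
            Brow = B[k:k+b, :]            # panel of B; grid column j uses its slice
            for i in range(q):
                for j in range(q):        # local rank-b update on processor (i,j)
                    C[i*s:(i+1)*s, j*s:(j+1)*s] += Acol[i*s:(i+1)*s, :] @ Brow[:, j*s:(j+1)*s]
        return C

    n, q, b = 24, 4, 3
    A, B = np.random.rand(n, n), np.random.rand(n, n)
    assert np.allclose(summa_sim(A, B, q, b), A @ B)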
  • 44. 44 SUMMA Costs For k=0 to n/b-1 for all i = 1 to P^(1/2) owner of A[i,k] broadcasts it to whole processor row (using binary tree) … #words = log(P^(1/2)) * b*n/P^(1/2), #messages = log(P^(1/2)) for all j = 1 to P^(1/2) owner of B[k,j] broadcasts it to whole processor column (using bin. tree) … same #words and #messages Receive A[i,k] into Acol Receive B[k,j] into Brow C_myproc = C_myproc + Acol * Brow … #flops = 2n^2*b/P • Total #words = log P * n^2/P^(1/2) • Within factor of log P of lower bound • (more complicated implementation removes log P factor) • Total #messages = log P * n/b • Choose b close to maximum, n/P^(1/2), to approach lower bound P^(1/2) • Total #flops = 2n^3/P
  • 45. 02/25/2016 CS267 Lecture 8 45 PDGEMM = PBLAS routine for matrix multiply Observations: For fixed N, as P increases Mflops increases, but less than 100% efficiency For fixed P, as N increases, Mflops (efficiency) rises DGEMM = BLAS routine for matrix multiply Maximum speed for PDGEMM = # Procs * speed of DGEMM Observations (same as above): Efficiency always at least 48% For fixed N, as P increases, efficiency drops For fixed P, as N increases, efficiency increases
  • 46. 46 Can we do better? • Lower bound assumed 1 copy of data: M = O(n2/P) per proc. • What if matrix small enough to fit c>1 copies, so M = cn2/P ? • #words_moved = Ω( #flops / M1/2 ) = Ω( n2 / ( c1/2 P1/2 )) • #messages = Ω( #flops / M3/2 ) = Ω( P1/2 /c3/2) • Can we attain new lower bound? • Special case: “3D Matmul”: c = P1/3 • Bernsten 89, Agarwal, Chandra, Snir 90, Aggarwal 95 • Processors arranged in P1/3 x P1/3 x P1/3 grid • Processor (i,j,k) performs C(i,j) = C(i,j) + A(i,k)*B(k,j), where each submatrix is n/P1/3 x n/P1/3 • Not always that much memory available… 02/25/2016 CS267 Lecture 12
  • 47. 2.5D Matrix Multiplication • Assume can fit cn2/P data per processor, c > 1 • Processors form (P/c)1/2 x (P/c)1/2 x c grid c (P/c)1/2 Example: P = 32, c = 2 02/25/2016 CS267 Lecture 12
  • 48. 2.5D Matrix Multiplication • Assume can fit cn2/P data per processor, c > 1 • Processors form (P/c)1/2 x (P/c)1/2 x c grid k j Initially P(i,j,0) owns A(i,j) and B(i,j) each of size n(c/P)1/2 x n(c/P)1/2 (1) P(i,j,0) broadcasts A(i,j) and B(i,j) to P(i,j,k) (2) Processors at level k perform 1/c-th of SUMMA, i.e. 1/c-th of Σm A(i,m)*B(m,j) (3) Sum-reduce partial sums Σm A(i,m)*B(m,j) along k-axis so P(i,j,0) owns C(i,j)
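A deliberately simplified serial sketch of the 2.5D idea (our own): it captures only the arithmetic decomposition — replicate A and B, let each of the c layers accumulate its own 1/c of the outer-product sum, then reduce over layers — and ignores the (P/c)^(1/2) x (P/c)^(1/2) grid and the actual communication pattern. Assumes c divides n.

    import numpy as np

    def matmul_25d_sketch(A, B, c):
        n = A.shape[0]
        chunk = n // c                      # the k-range assigned to each layer
        partial = []
        for layer in range(c):
            lo, hi = layer * chunk, (layer + 1) * chunk
            # each layer performs 1/c-th of the sum over k (its partial SUMMA)
            partial.append(A[:, lo:hi] @ B[lo:hi, :])
        return sum(partial)                 # sum-reduction along the k-axis

    n, c = 24, 3
    A, B = np.random.rand(n, n), np.random.rand(n, n)
    assert np.allclose(matmul_25d_sketch(A, B, c), A @ B)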
  • 49. 2.5D Matmul on IBM BG/P, n=64K • As P increases, available memory grows ⇒ c increases proportionally to P • #flops, #words_moved, #messages per proc all decrease proportionally to P • #words_moved = Ω(#flops / M^(1/2)) = Ω(n^2 / (c^(1/2) P^(1/2))) • #messages = Ω(#flops / M^(3/2)) = Ω(P^(1/2) / c^(3/2)) • Perfect strong scaling! But only up to c = P^(1/3)
  • 50. 2.5D Matmul on IBM BG/P, 16K nodes / 64K cores 02/25/2016 CS267 Lecture 12
  • 51. 2.5D Matmul on IBM BG/P, 16K nodes / 64K cores c = 16 copies Distinguished Paper Award, EuroPar’11 SC’11 paper by Solomonik, Bhatele, D. 02/25/2016
  • 52. Perfect Strong Scaling – in Time and Energy • Every time you add a processor, you should use its memory M too • Start with minimal number of procs: P·M = 3n^2 • Increase P by a factor of c ⇒ total memory increases by a factor of c • Notation for timing model: • γ_T, β_T, α_T = secs per flop, per word_moved, per message of size m • T(cP) = n^3/(cP) [ γ_T + β_T/M^(1/2) + α_T/(m·M^(1/2)) ] = T(P)/c • Notation for energy model: • γ_E, β_E, α_E = joules for same operations • δ_E = joules per word of memory used per sec • ε_E = joules per sec for leakage, etc. • E(cP) = cP { n^3/(cP) [ γ_E + β_E/M^(1/2) + α_E/(m·M^(1/2)) ] + δ_E·M·T(cP) + ε_E·T(cP) } = E(P) • c cannot increase forever: c <= P^(1/3) (3D algorithm) • Corresponds to lower bound on #messages hitting 1 • Perfect scaling extends to Strassen’s matmul, direct N-body, … • “Perfect Strong Scaling Using No Additional Energy” • “Strong Scaling of Matmul and Memory-Indep. Comm. Lower Bounds” • Both at bebop.cs.berkeley.edu
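The timing model is simple enough to encode directly. A small Python sketch (our own; the parameter values below are made-up placeholders, not measured machine constants) showing that, with per-processor memory M held fixed, the model gives T(cP) = T(P)/c:

    from math import sqrt

    def T(P, n, M, gamma_T, beta_T, alpha_T, m, c=1):
        # T(cP) = n^3/(c*P) * (gamma_T + beta_T/sqrt(M) + alpha_T/(m*sqrt(M)))
        return n**3 / (c * P) * (gamma_T + beta_T / sqrt(M) + alpha_T / (m * sqrt(M)))

    params = dict(n=10_000, M=1e8, gamma_T=1e-11, beta_T=1e-9, alpha_T=1e-6, m=1e4)
    t1 = T(P=1024, **params)
    t4 = T(P=1024, c=4, **params)
    assert abs(t4 - t1 / 4) <= 1e-12 * t1     # perfect strong scaling in the model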
  • 53. Classical Matmul vs Parallel Strassen • Complexity of classical Matmul vs Strassen • Flops: O(n3/p) vs O(nw/p) where w = log2 7 ~ 2.81 • Communication lower bound on #words: Ω((n3/p)/M1/2) = Ω(M(n/M1/2)3/p) vs Ω(M(n/M1/2)w/p) • Communication lower bound on #messages: Ω((n3/p)/M3/2) = Ω((n/M1/2)3/p) vs Ω((n/M1/2)w/p) • All attainable as M increases past O(n2/p), up to a limit: can increase M by factor up to p1/3 vs p1-2/w #words as low as Ω(n/p2/3) vs Ω(n/p2/w) • Best Paper Prize, SPAA’11, Ballard, D., Holtz, Schwartz • How well does parallel Strassen work in practice? 02/27/2014 CS267 Lecture 12 53
  • 54. Strong scaling of Matmul on Hopper (n=94080) 02/25/2016 54 G. Ballard, D., O. Holtz, B. Lipshitz, O. Schwartz “Communication-Avoiding Parallel Strassen” bebop.cs.berkeley.edu, Supercomputing’12
  • 55. 02/25/2016 CS267 Lecture 12 55 ScaLAPACK Parallel Library
  • 56. Extensions of Lower Bound and Optimal Algorithms • For each processor that does G flops with fast memory of size M #words_moved = Ω(G/M^(1/2)) • Extension: for any program that “smells like” • Nested loops … • That access arrays … • Where array subscripts are linear functions of loop indices • Ex: A(i,j), B(3*i-4*k+5*j, i-j, 2*k, …), … • There is a constant s such that #words_moved = Ω(G/M^(s-1)) • s comes from recent generalization of Loomis-Whitney (s=3/2) • Ex: linear algebra, n-body, database join, … • Lots of open questions: deriving s, optimal algorithms … 02/25/2016 CS267 Lecture 12 56
  • 57. Proof of Communication Lower Bound on C = A·B (1/4) • Proof from Irony/Toledo/Tiskin (2004) • Think of instruction stream being executed • Looks like “ … add, load, multiply, store, load, add, …” • Each load/store moves a word between fast and slow memory • We want to count the number of loads and stores, given that we are multiplying n-by-n matrices C = A·B using the usual 2n^3 flops, possibly reordered assuming addition is commutative/associative • Assuming that at most M words can be stored in fast memory • Outline: • Break instruction stream into segments, each with M loads and stores • Somehow bound the maximum number of flops that can be done in each segment, call it F • So F · #segments ≥ T = total flops = 2·n^3, so #segments ≥ T / F • So #loads & stores = M · #segments ≥ M · T / F CS267 Lecture 12 02/25/2016 57
  • 59. Proof of Communication Lower Bound on C = A·B (2/4) • Represent each multiply-add C(i,j) += A(i,k)·B(k,j) as a unit cube at position (i,j,k); the entries of A, B and C it touches appear as squares on the “A face”, “B face” and “C face” of the (i,j,k) volume • If we have at most 2M “A squares”, 2M “B squares”, and 2M “C squares” on faces, how many cubes can we have? [Figure: cube representing C(1,1) += A(1,3)·B(3,1), with labeled A, B and C faces] 59
  • 60. Proof of Communication Lower Bound on C = A·B (3/4) • # cubes in a box with side lengths x, y and z = volume of the box = x·y·z = (xz · zy · yx)^(1/2) = (#A squares · #B squares · #C squares)^(1/2) • (i,k) is in the A shadow if (i,j,k) is in the 3D set; (j,k) is in the B shadow if (i,j,k) is in the 3D set; (i,j) is in the C shadow if (i,j,k) is in the 3D set • Thm (Loomis & Whitney, 1949): # cubes in 3D set = Volume of 3D set ≤ (area(A shadow) · area(B shadow) · area(C shadow))^(1/2) [Figure: a 3D set of cubes and its A, B and C shadows on the coordinate planes] 61
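A tiny numerical sanity check of the Loomis–Whitney inequality quoted above (our own; it builds a random 3D set of lattice points and compares its size against the geometric-mean bound on its three shadows):

    import random
    from math import sqrt

    random.seed(0)
    S = {(random.randrange(8), random.randrange(8), random.randrange(8)) for _ in range(200)}
    A_shadow = {(i, k) for (i, j, k) in S}   # projection that forgets j
    B_shadow = {(j, k) for (i, j, k) in S}   # projection that forgets i
    C_shadow = {(i, j) for (i, j, k) in S}   # projection that forgets k
    # |S| <= sqrt(|A shadow| * |B shadow| * |C shadow|)
    assert len(S) <= sqrt(len(A_shadow) * len(B_shadow) * len(C_shadow))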
  • 61. Proof of Communication Lower Bound on C = A·B (4/4) • Consider one “segment” of instructions with M loads, stores • Can be at most 2M entries of A, B, C available in one segment • Volume of set of cubes representing possible multiply/adds in one segment is ≤ (2M · 2M · 2M)^(1/2) = (2M)^(3/2) ≡ F • # Segments ≥ 2n^3 / F • # Loads & Stores = M · #Segments ≥ M · 2n^3 / F ≥ n^3 / (2M)^(1/2) – M = Ω(n^3 / M^(1/2)) • Parallel Case: apply reasoning to one processor out of P • # Adds and Muls ≥ 2n^3 / P (at least one proc does this) • M = n^2 / P (each processor gets equal fraction of matrix) • # “Load & Stores” = # words moved from or to other procs ≥ M · (2n^3/P) / F = M · (2n^3/P) / (2M)^(3/2) = n^2 / (2P)^(1/2) 62
  • 62. 02/25/2016 CS267 Lecture 12 63 Extra Slides
  • 63. 2/27/08 CS267 Guest Lecture 1 91 Recursive Layouts • For both cache hierarchies and parallelism, recursive layouts may be useful • Z-Morton, U-Morton, and X-Morton Layout • Also Hilbert layout and others • What about the user’s view? • Fortunately, many problems can be solved on a permutation • Never need to actually change the user’s layout
  • 64. 02/09/2006 CS267 Lecture 8 92 Gaussian Elimination • Standard way: subtract a multiple of a row from the rows below it • LINPACK: apply the sequence of updates to one column at a time (a3 = a3 - a1*a2, a2 = L^(-1)·a2) • LAPACK: apply the sequence of updates to a block of nb columns, then apply the nb accumulated updates to the rest of the matrix • [Figure: sketches of the three update patterns on a partially eliminated matrix] Slide source: Dongarra
  • 65. 02/09/2006 CS267 Lecture 8 93 LU Algorithm: 1: Split matrix into two rectangles (m x n/2) if only 1 column, scale by reciprocal of pivot & return 2: Apply LU Algorithm to the left part 3: Apply transformations to right part (triangular solve A12 = L-1A12 and matrix multiplication A22=A22 -A21*A12 ) 4: Apply LU Algorithm to right part Gaussian Elimination via a Recursive Algorithm L A12 A21 A22 F. Gustavson and S. Toledo Most of the work in the matrix multiply Matrices of size n/2, n/4, n/8, … Slide source: Dongarra
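A minimal NumPy sketch of this recursive LU, assuming no pivoting is needed (e.g. a diagonally dominant matrix); it is our own illustration of the four steps, not the Gustavson/Toledo code:

    import numpy as np

    def recursive_lu(A):
        # Overwrites A with L (unit lower triangular) and U; no pivoting.
        m, n = A.shape
        if n == 1:
            A[1:, 0] /= A[0, 0]                 # scale by reciprocal of the pivot
            return A
        k = n // 2
        recursive_lu(A[:, :k])                  # steps 1-2: LU of the left m x k part
        L11 = np.tril(A[:k, :k], -1) + np.eye(k)
        A[:k, k:] = np.linalg.solve(L11, A[:k, k:])   # step 3: A12 = L11^{-1} A12
        A[k:, k:] -= A[k:, :k] @ A[:k, k:]            #         A22 = A22 - A21*A12
        recursive_lu(A[k:, k:])                 # step 4: LU of the right part
        return A

    n = 8
    A0 = np.random.rand(n, n) + n * np.eye(n)   # diagonally dominant: safe without pivoting
    LU = recursive_lu(A0.copy())
    L, U = np.tril(LU, -1) + np.eye(n), np.triu(LU)
    assert np.allclose(L @ U, A0)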
  • 66. 02/09/2006 CS267 Lecture 8 94 Recursive Factorizations • Just as accurate as conventional method • Same number of operations • Automatic variable blocking • Level 1 and 3 BLAS only ! • Extreme clarity and simplicity of expression • Highly efficient • The recursive formulation is just a rearrangement of the point-wise LINPACK algorithm • The standard error analysis applies (assuming the matrix operations are computed the “conventional” way). Slide source: Dongarra
  • 67. 02/09/2006 CS267 Lecture 8 95 [Figure: performance plots (MFlop/s vs matrix order) of DGEMM (ATLAS) and recursive DGETRF on a 1 GHz AMD Athlon (~$1100 system), and of LU factorization on a 550 MHz dual-processor Pentium III comparing LAPACK with Recursive LU on one and two processors] Slide source: Dongarra
  • 68. 02/09/2006 CS267 Lecture 8 96 Review: BLAS 3 (Blocked) GEPP for ib = 1 to n-1 step b … Process matrix b columns at a time end = ib + b-1 … Point to end of block of b columns apply BLAS2 version of GEPP to get A(ib:n , ib:end) = P’ * L’ * U’ … let LL denote the strict lower triangular part of A(ib:end , ib:end) + I A(ib:end , end+1:n) = LL-1 * A(ib:end , end+1:n) … update next b rows of U A(end+1:n , end+1:n ) = A(end+1:n , end+1:n ) - A(end+1:n , ib:end) * A(ib:end , end+1:n) … apply delayed updates with single matrix-multiply … with inner dimension b BLAS 3
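A NumPy translation of the blocked GEPP pseudocode above (our sketch, not LAPACK code; it assumes the block size b divides n and swaps full rows when pivoting):

    import numpy as np

    def blocked_gepp(A, b):
        # Overwrites A with L (unit lower) and U; returns the row permutation.
        n = A.shape[0]
        perm = np.arange(n)
        for ib in range(0, n, b):
            end = ib + b
            # BLAS-2 style GEPP on the panel A(ib:n, ib:end)
            for j in range(ib, end):
                p = j + int(np.argmax(np.abs(A[j:, j])))
                if p != j:
                    A[[j, p], :] = A[[p, j], :]
                    perm[[j, p]] = perm[[p, j]]
                A[j+1:, j] /= A[j, j]
                A[j+1:, j+1:end] -= np.outer(A[j+1:, j], A[j, j+1:end])
            if end < n:
                # LL = strict lower triangle of A(ib:end, ib:end) + I
                LL = np.tril(A[ib:end, ib:end], -1) + np.eye(b)
                A[ib:end, end:] = np.linalg.solve(LL, A[ib:end, end:])   # next b rows of U
                # delayed update with a single matrix multiply (the BLAS-3 call)
                A[end:, end:] -= A[end:, ib:end] @ A[ib:end, end:]
        return perm

    n, b = 16, 4
    A0 = np.random.rand(n, n)
    A = A0.copy()
    perm = blocked_gepp(A, b)
    L, U = np.tril(A, -1) + np.eye(n), np.triu(A)
    assert np.allclose(L @ U, A0[perm])      # P*A0 = L*U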
  • 69. 02/09/2006 CS267 Lecture 8 97 Review: Row and Column Block Cyclic Layout processors and matrix blocks are distributed in a 2d array pcol-fold parallelism in any column, and calls to the BLAS2 and BLAS3 on matrices of size brow-by-bcol serial bottleneck is eased need not be symmetric in rows and columns
  • 70. 02/09/2006 CS267 Lecture 8 98 Distributed GE with a 2D Block Cyclic Layout • The block size b in the algorithm and the block sizes brow and bcol in the layout satisfy b = brow = bcol • Shaded regions indicate busy processors or communication performed • It is unnecessary to have a barrier between each step of the algorithm; e.g., steps 9, 10, and 11 can be pipelined
  • 71. 02/09/2006 CS267 Lecture 8 99 Distributed GE with a 2D Block Cyclic Layout
  • 72. 02/09/2006 CS267 Lecture 8 100 Matrix multiply of green = green - blue * pink
  • 73. 02/09/2006 CS267 Lecture 8 101 PDGESV = ScaLAPACK parallel LU routine Since it can run no faster than its inner loop (PDGEMM), we measure: Efficiency = Speed(PDGESV)/Speed(PDGEMM) Observations: Efficiency well above 50% for large enough problems For fixed N, as P increases, efficiency decreases (just as for PDGEMM) For fixed P, as N increases efficiency increases (just as for PDGEMM) From bottom table, cost of solving Ax=b about half of matrix multiply for large enough matrices. From the flop counts we would expect it to be (2*n3)/(2/3*n3) = 3 times faster, but communication makes it a little slower.
  • 75. 02/09/2006 CS267 Lecture 8 103 Scales well, nearly full machine speed
  • 76. 02/09/2006 CS267 Lecture 8 104 Old version, pre 1998 Gordon Bell Prize Still have ideas to accelerate Project Available! Old Algorithm, plan to abandon
  • 77. 02/09/2006 CS267 Lecture 8 105 Have good ideas to speedup Project available! Hardest of all to parallelize Have alternative, and would like to compare Project available!
  • 78. 02/09/2006 CS267 Lecture 8 106 Out-of-core means matrix lives on disk; too big for main mem Much harder to hide latency of disk QR much easier than LU because no pivoting needed for QR Moral: use QR to solve Ax=b Projects available (perhaps very hard…)
  • 79. 02/09/2006 CS267 Lecture 8 107 A small software project ...
  • 80. 02/09/2006 CS267 Lecture 8 108 Work-Depth Model of Parallelism • The work depth model: • The simplest model is used • For algorithm design, independent of a machine • The work, W, is the total number of operations • The depth, D, is the longest chain of dependencies • The parallelism, P, is defined as W/D • Specific examples include: • circuit model, each input defines a graph with ops at nodes • vector model, each step is an operation on a vector of elements • language model, where set of operations defined by language
  • 81. 02/09/2006 CS267 Lecture 8 109 Latency Bandwidth Model • Network of fixed number P of processors • fully connected • each with local memory • Latency (a) • accounts for varying performance with number of messages • gap (g) in logP model may be more accurate cost if messages are pipelined • Inverse bandwidth (b) • accounts for performance varying with volume of data • Efficiency (in any model): • serial time / (p * parallel time) • perfect (linear) speedup  efficiency = 1
  • 82. 02/09/2006 CS267 Lecture 8 110 Initial Step to Skew Matrices in Cannon • Initial blocked input • After skewing, before the initial block multiplies [Figure: 3 x 3 grids of blocks A(i,j) and B(i,j), shown before and after the initial skew]
  • 83. 02/09/2006 CS267 Lecture 8 111 Skewing Steps in Cannon • First step • Second • Third [Figure: the 3 x 3 grids of A(i,j) and B(i,j) blocks after each step, with A blocks shifting left along rows and B blocks shifting up along columns]
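A serial NumPy simulation of Cannon's algorithm on a q x q block grid (our sketch; the circular shifts between neighbours are done by re-indexing a Python list of blocks, and we assume q divides n):

    import numpy as np

    def cannon_sim(A, B, q):
        n = A.shape[0]
        s = n // q
        blk = lambda M, i, j: M[i*s:(i+1)*s, j*s:(j+1)*s]
        # initial skew: processor (i,j) holds A(i, i+j mod q) and B(i+j mod q, j)
        Ab = [[blk(A, i, (i + j) % q).copy() for j in range(q)] for i in range(q)]
        Bb = [[blk(B, (i + j) % q, j).copy() for j in range(q)] for i in range(q)]
        C = np.zeros((n, n))
        for _ in range(q):
            for i in range(q):
                for j in range(q):
                    C[i*s:(i+1)*s, j*s:(j+1)*s] += Ab[i][j] @ Bb[i][j]   # local multiply
            Ab = [[Ab[i][(j + 1) % q] for j in range(q)] for i in range(q)]  # shift A left
            Bb = [[Bb[(i + 1) % q][j] for j in range(q)] for i in range(q)]  # shift B up
        return C

    n, q = 12, 3
    A, B = np.random.rand(n, n), np.random.rand(n, n)
    assert np.allclose(cannon_sim(A, B, q), A @ B)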
  • 84. 2/25/2009 CS267 Lecture 8 112 Motivation (1) 3 Basic Linear Algebra Problems 1. Linear Equations: Solve Ax=b for x 2. Least Squares: Find x that minimizes ||r||2 (equivalently Σ ri^2) where r = Ax-b • Statistics: Fitting data with simple functions 3a. Eigenvalues: Find λ and x where Ax = λx • Vibration analysis, e.g., earthquakes, circuits 3b. Singular Value Decomposition: A^T·A·x = σ^2·x • Data fitting, Information retrieval Lots of variations depending on structure of A • A symmetric, positive definite, banded, …
  • 85. 2/25/2009 CS267 Lecture 8 113 Motivation (2) • Why dense A, as opposed to sparse A? • Many large matrices are sparse, but … • Dense algorithms easier to understand • Some applications yield large dense matrices • LINPACK Benchmark (www.top500.org) • “How fast is your computer?” = “How fast can you solve dense Ax=b?” • Large sparse matrix algorithms often yield smaller (but still large) dense problems • Do ParLab Apps most use small dense matrices?
  • 86. 02/25/2009 CS267 Lecture 8 Algorithms for 2D (3D) Poisson Equation (N = n^2 (n^3) vars)
    Algorithm: Serial / PRAM / Memory / #Procs
    • Dense LU: N^3 / N / N^2 / N^2
    • Band LU: N^2 (N^(7/3)) / N / N^(3/2) (N^(5/3)) / N (N^(4/3))
    • Jacobi: N^2 (N^(5/3)) / N (N^(2/3)) / N / N
    • Explicit Inv.: N^2 / log N / N^2 / N^2
    • Conj. Gradients: N^(3/2) (N^(4/3)) / N^(1/2) (N^(1/3)) * log N / N / N
    • Red/Black SOR: N^(3/2) (N^(4/3)) / N^(1/2) (N^(1/3)) / N / N
    • Sparse LU: N^(3/2) (N^2) / N^(1/2) / N*log N (N^(4/3)) / N
    • FFT: N*log N / log N / N / N
    • Multigrid: N / log^2 N / N / N
    • Lower bound: N / log N / N
    PRAM is an idealized parallel model with zero cost communication. Reference: James Demmel, Applied Numerical Linear Algebra, SIAM, 1997 (Note: corrected complexities for 3D case from last lecture!).
  • 87. Lessons and Questions (1) • Structure of the problem matters • Cost of solution can vary dramatically (n^3 to n) • Many other examples • Some structure can be figured out automatically • “A\b” can figure out symmetry, some sparsity • Some structures known only to (smart) user • If performance not critical, user may be happy to settle for A\b • How much of this goes into the motifs? • How much should we try to help user choose? • Tuning, but at algorithmic choice level (SALSA) • Motifs overlap • Dense, sparse, (un)structured grids, spectral
  • 88. Organizing Linear Algebra (1) • By Operations • Low level (eg mat-mul: BLAS) • Standard level (eg solve Ax=b, Ax=λx: Sca/LAPACK) • Applications level (eg systems & control: SLICOT) • By Performance/accuracy tradeoffs • “Direct methods” with guarantees vs “iterative methods” that may work faster and accurately enough • By Structure • Storage • Dense – columnwise, rowwise, 2D block cyclic, recursive space-filling curves • Banded, sparse (many flavors), black-box, … • Mathematical • Symmetries, positive definiteness, conditioning, … • As diverse as the world being modeled
  • 89. Organizing Linear Algebra (2) • By Data Type • Real vs Complex • Floating point (fixed or varying length), other • By Target Platform • Serial, manycore, GPU, distributed memory, out-of- DRAM, Grid, … • By programming interface • Language bindings • “Ab” versus access to details
  • 90. For all linear algebra problems: Ex: LAPACK Table of Contents • Linear Systems • Least Squares • Overdetermined, underdetermined • Unconstrained, constrained, weighted • Eigenvalues and vectors of Symmetric Matrices • Standard (Ax = λx), Generalized (Ax=λBx) • Eigenvalues and vectors of Unsymmetric matrices • Eigenvalues, Schur form, eigenvectors, invariant subspaces • Standard, Generalized • Singular Values and vectors (SVD) • Standard, Generalized • Level of detail • Simple Driver • Expert Drivers with error bounds, extra-precision, other options • Lower level routines (“apply certain kind of orthogonal transformation”)
  • 91. For all matrix/problem structures: Ex: LAPACK Table of Contents • BD – bidiagonal • GB – general banded • GE – general • GG – general, pair • GT – tridiagonal • HB – Hermitian banded • HE – Hermitian • HG – upper Hessenberg, pair • HP – Hermitian, packed • HS – upper Hessenberg • OR – (real) orthogonal • OP – (real) orthogonal, packed • PB – positive definite, banded • PO – positive definite • PP – positive definite, packed • PT – positive definite, tridiagonal • SB – symmetric, banded • SP – symmetric, packed • ST – symmetric, tridiagonal • SY – symmetric • TB – triangular, banded • TG – triangular, pair • TP – triangular, packed • TR – triangular • TZ – trapezoidal • UN – unitary • UP – unitary packed
  • 98. For all data types: Ex: LAPACK Table of Contents • Real and complex • Single and double precision • Arbitrary precision in progress
  • 99. Organizing Linear Algebra (3) www.netlib.org/lapack www.netlib.org/scalapack www.cs.utk.edu/~dongarra/etemplates www.netlib.org/templates gams.nist.gov
  • 100. 2/27/08 CS267 Guest Lecture 1 128 Review of the BLAS
    BLAS level / Example / # mem refs / # flops / q
    • 1: “Axpy”, Dot product / 3n / 2n^1 / 2/3
    • 2: Matrix-vector mult / n^2 / 2n^2 / 2
    • 3: Matrix-matrix mult / 4n^2 / 2n^3 / n/2
    • Building blocks for all linear algebra • Parallel versions call serial versions on each processor • So they must be fast! • Define q = # flops / # mem refs = “computational intensity” • The larger q is, the faster the algorithm can go in the presence of a memory hierarchy • “axpy”: y = a*x + y, where a is a scalar, x and y are vectors
  • 101. 02/22/2011 CS267 Lecture 11 130 Summary of Parallel Matrix Multiplication so far • 1D Layout • Bus without broadcast - slower than serial • Nearest neighbor communication on a ring (or bus with broadcast): Efficiency = 1/(1 + O(p/n)) • 2D Layout – one copy of all matrices (O(n2/p) per processor) • Cannon • Efficiency = 1/(1+O(a * ( sqrt(p) /n)3 +b* sqrt(p) /n)) – optimal! • Hard to generalize for general p, n, block cyclic, alignment • SUMMA • Efficiency = 1/(1 + O(a * log p * p / (b*n2) + b*log p * sqrt(p) /n)) • Very General • b small => less memory, lower efficiency • b large => more memory, high efficiency • Used in practice (PBLAS) Why?