This talk is an update based on some of our recent results on tall-and-skinny QR factorizations in MapReduce. In particular, the "fast" iterative refinement approximation based on a sample is new.
1. Tall-and-skinny QRs and SVDs in MapReduce
David F. Gleich, Computer Science, Purdue University

[Figure: a tall matrix A partitioned into row blocks A1-A4]

Yangyang Hou, Purdue CS
Paul G. Constantine, Austin Benson, Joe Nichols, Stanford University
James Demmel, UC Berkeley
Joe Ruthruff, Jeremy Templeton, Sandia CA

Mainly funded by the Sandia National Labs CSAR project, recently by NSF CAREER, and the Purdue Research Foundation.
2. Tall-and-skinny matrices (m ≫ n)
Many rows (like a billion); a few columns (under 10,000).

[Figure: a tall matrix A built from the tinyimages collection]

Used in: regression and general linear models with many samples, block iterative methods, panel factorizations, approximate kernel k-means, and big-data SVD/PCA.
3. A matrix A : m × n, m ≥ n, is tall and skinny when O(n²) work and storage is "cheap" compared to m.
-- Austin Benson
4. Quick review of QR
QR is block normalization: the way you "normalize" a vector usually generalizes to computing R in a QR factorization.

Let A : m × n, m ≥ n, real. Then
A = QR, where
Q is m × n, orthogonal (QᵀQ = I), and
R is n × n, upper triangular.
5. Tall-and-skinny SVD and RSVD
Let A : m × n, m ≥ n, real. Then
A = UΣVᵀ, where
U is m × n, orthogonal,
Σ is n × n, nonnegative and diagonal, and
V is n × n, orthogonal.

[Diagram: A → TSQR → Q and R; an SVD of the small R gives Σ and V]
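To make the diagram concrete, here is a minimal serial numpy sketch of the TSQR-then-small-SVD idea (plain numpy stands in for the MapReduce TSQR; the function name is mine):

import numpy as np

def tall_skinny_svd(A):
    # Thin SVD via QR: the n x n factor R carries everything
    # needed for the singular values and V.
    Q, R = np.linalg.qr(A)         # A = Q R, R is n x n
    Ur, s, Vt = np.linalg.svd(R)   # R = Ur diag(s) Vt, all small
    return Q @ Ur, s, Vt           # A = (Q Ur) diag(s) Vt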
6. There are good MPI implementations. What's left to do?
7. Moving data to an MPI cluster may be infeasible or costly.
8. Data computers I've used
Nebula Cluster @ Sandia CA: 2TB/core storage, 64 nodes, 256 cores, GB ethernet. Cost: $150k.
Student Cluster @ Stanford: 3TB/core storage, 11 nodes, 44 cores, GB ethernet. Cost: $30k.
Magellan Cluster @ NERSC: 128GB/core storage, 80 nodes, 640 cores, InfiniBand.
These systems are good for working with enormous matrix data!
9. By 2013(?) all Fortune 500 companies will have a data computer.
10. How do you program them?
11. A graphical view of the MapReduce programming model

[Diagram: data flows into parallel Map tasks, each emitting (key, value) pairs; a Shuffle groups pairs by key; Reduce tasks consume each key with its list of values and emit data.]

Map tasks read batches of data in parallel and do some initial filtering.
The shuffle is a global communication, like a group-by or MPI_Alltoall.
Reduce is often where the computation happens.
12. MapReduce looks like the first step of any MPI program that loads data.
Read data -> assign to process.
Receive data -> do compute.
13. Still, isn't this easy to do?
Current MapReduce algorithms use the normal equations:
A = QR;  AᵀA = RᵀR via Cholesky;  Q = AR⁻¹.

[Diagram: A partitioned into row blocks A1-A4]

Map: Aᵢ → AᵢᵀAᵢ
Reduce: RᵀR = Sum(AᵢᵀAᵢ)
Map 2: Aᵢ → AᵢR⁻¹

Two problems (see the sketch below):
R is inaccurate if A is ill-conditioned.
Q is not numerically orthogonal (Householder QR assures this).
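A minimal serial numpy sketch of this normal-equations approach (my own illustration, not the MapReduce code):

import numpy as np

def cholesky_qr(A):
    # R from a Cholesky factorization of A^T A; in MapReduce the
    # Gram matrix is the reduce-side sum of the blocks A_i^T A_i.
    G = A.T @ A
    R = np.linalg.cholesky(G).T      # G = R^T R, R upper triangular
    Q = np.linalg.solve(R.T, A.T).T  # second map: Q = A R^{-1}
    return Q, R

The catch: cond(AᵀA) = cond(A)², so R, and hence Q, degrades quickly once A is ill-conditioned.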
14. Numerical stability was a problem for prior approaches

[Plot: norm(QᵀQ − I) against condition numbers up to 10²⁰; the AR⁻¹ curve (prior work) loses orthogonality as the condition number grows.]

Previous methods couldn't ensure that the matrix Q was orthogonal.
15. Four things that are better
1. A simple algorithm to compute R accurately (but it doesn't help get Q orthogonal).
2. A "fast" algorithm to get Q numerically orthogonal in most cases.
3. A multi-pass algorithm to get Q numerically orthogonal in virtually all cases.
4. A direct algorithm for a numerically orthogonal Q in all cases.
Constantine & Gleich, MapReduce 2011; Benson, Gleich & Demmel, arXiv 2013.
16. Numerical stability was a problem for prior approaches

[Plot: norm(QᵀQ − I) against condition numbers up to 10²⁰. AR⁻¹ (prior work: Constantine & Gleich, MapReduce 2011) loses orthogonality; AR⁻¹ + iterative refinement and Direct TSQR (Benson, Gleich & Demmel, submitted) stay numerically orthogonal.]

Previous methods couldn't ensure that the matrix Q was orthogonal.
17. How to store tall-and-skinny matrices in Hadoop
A : m × n, m ≫ n.
The key is an arbitrary row-id; the value is the 1 × n array for a row (or a b × n block).
Each submatrix Aᵢ is the input to a map task.

[Diagram: A partitioned into row blocks A1-A4]
18. MapReduce is great for TSQR!
You don't need AᵀA.

Data: a tall-and-skinny (TS) matrix, stored by rows
Input: 500,000,000-by-50 matrix
Each record: 1-by-50 row
HDFS size: 183.6 GB

Time to read A: 253 sec.; to write A: 848 sec.
Time to compute R in qr(A): 526 sec.; with Q = AR⁻¹: 1618 sec.
Time to compute Q in qr(A), numerically stable: 3090 sec.
19. Communication-avoiding QR (Demmel et al. 2008)
First, do QR factorizations of each local matrix. Second, compute a QR factorization of the new "R".
Demmel et al. 2008. Communication-avoiding parallel and sequential QR.
20. Serial QR factorizations (Demmel et al. 2008)
Fully serial TSQR: compute the QR of the first block, read the next block, update the QR factorization, and so on.
Demmel et al. 2008. Communication-avoiding parallel and sequential QR.
21. Communication-avoiding QR (Demmel et al. 2008) on MapReduce (Constantine and Gleich, 2011)

[Diagram: Mapper 1 runs serial TSQR on blocks A1-A4 and emits R4; Mapper 2 runs serial TSQR on blocks A5-A8 and emits R8; Reducer 1 runs serial TSQR on R4 and R8 and emits the final R.]

Algorithm (see the sketch below)
Data: rows of a matrix
Map: QR factorization of rows
Reduce: QR factorization of rows
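A minimal serial numpy sketch of this map/reduce structure (the explicit block split stands in for what HDFS does):

import numpy as np

def tsqr_r(blocks):
    # Map: each task reduces its blocks to one small R factor.
    local_Rs = [np.linalg.qr(A, mode='r') for A in blocks]
    # Reduce: stack the local R factors and take one more QR.
    return np.linalg.qr(np.vstack(local_Rs), mode='r')

A = np.random.randn(8000, 50)
R = tsqr_r(np.vsplit(A, 8))
# Agrees with a direct QR up to row signs:
print(np.allclose(np.abs(R), np.abs(np.linalg.qr(A, mode='r'))))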
22. Too many maps cause too much data to go to one reducer!
From the tinyimages collection: 80,000,000 images, 1000 pixels each.
Each image is 5k, so each HDFS block holds 12,800 images; that's 6,250 total blocks.
Each map outputs a 1000-by-1000 matrix, so one reducer gets a 6.25M-by-1000 matrix (50GB).
23. [Diagram: TSQR with two iterations. Iteration 1: Mappers 1-1 through 1-4 each run serial TSQR on their blocks of A and emit local factors R1-R4; a shuffle sends these to Reducers 1-1 through 1-3, which run serial TSQR and emit R2,1-R2,3. Iteration 2: an identity map and a second shuffle send those factors to Reducer 2-1, which runs serial TSQR and emits the final R.]
25. Full TSQR code in hadoopy

import random, numpy, hadoopy

class SerialTSQR:
    def __init__(self, blocksize, isreducer):
        self.bsize = blocksize
        self.data = []  # buffered rows (of A, or of a prior R)
        if isreducer: self.__call__ = self.reducer
        else: self.__call__ = self.mapper

    def compress(self):
        # Compute a QR of the buffered rows and keep only R.
        R = numpy.linalg.qr(numpy.array(self.data), 'r')
        # reset data and re-initialize to R
        self.data = []
        for row in R:
            self.data.append([float(v) for v in row])

    def collect(self, key, value):
        self.data.append(value)
        # Compress whenever the buffer exceeds blocksize * n rows.
        if len(self.data) > self.bsize * len(self.data[0]):
            self.compress()

    def close(self):
        # Emit the final local R, one row at a time, under random keys.
        self.compress()
        for row in self.data:
            key = random.randint(0, 2000000000)
            yield key, row

    def mapper(self, key, value):
        self.collect(key, value)

    def reducer(self, key, values):
        for value in values: self.mapper(key, value)

if __name__ == '__main__':
    mapper = SerialTSQR(blocksize=3, isreducer=False)
    reducer = SerialTSQR(blocksize=3, isreducer=True)
    hadoopy.run(mapper, reducer)
26. Numerical stability was a problem for prior approaches

[Plot, repeated from slide 16: norm(QᵀQ − I) against condition number. AR⁻¹ (prior work: Constantine & Gleich, MapReduce 2011) loses orthogonality; AR⁻¹ + iterative refinement and Direct TSQR (Benson, Gleich & Demmel, submitted) stay numerically orthogonal.]

Previous methods couldn't ensure that the matrix Q was orthogonal.
27. Iterative refinement helps

[Diagram: TSQR computes R, which is distributed to the mappers; Mapper 1 computes Qᵢ = AᵢR⁻¹ for each block. A second TSQR of Q produces T, which is distributed; Mapper 2 computes QᵢT⁻¹.]

Iterative refinement is like using Newton's method to solve Ax = b. There's a famous quote, attributed to Parlett, that "two iterations of iterative refinement are enough."
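A serial numpy sketch of one refinement pass (my own illustration of the scheme in the diagram):

import numpy as np

def refined_qr(A, passes=1):
    R = np.linalg.qr(A, mode='r')        # TSQR gives an accurate R
    Q = np.linalg.solve(R.T, A.T).T      # mapper 1: Q = A R^{-1}
    for _ in range(passes):
        T = np.linalg.qr(Q, mode='r')    # second TSQR, of Q
        Q = np.linalg.solve(T.T, Q.T).T  # mapper 2: Q = Q T^{-1}
        R = T @ R                        # keep A = Q R consistent
    return Q, R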
28. Numerical stability was a problem for prior approaches

[Plot, repeated: norm(QᵀQ − I) against condition number, now highlighting that the AR⁻¹ + iterative refinement curve stays numerically orthogonal alongside Direct TSQR.]

Previous methods couldn't ensure that the matrix Q was orthogonal.
29. What if iterative refinement is too slow?

[Diagram: Mapper 1 draws a sample S of rows from the blocks, computes the QR of the sample, and distributes its R; each block then forms Qᵢ = AᵢR⁻¹. Mapper 2 runs one refinement pass: a TSQR of the result gives T, T is distributed, and each block multiplies by T⁻¹. The sample S is used to estimate the "norm".]

Based on recent work by Ipsen et al. on approximating QR with a random subset of rows. This also assumes that you can get a subset of rows "cheaply" – possible, but nontrivial in Hadoop.
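A rough serial numpy sketch of the idea (the uniform sampling scheme and rate here are my assumptions for illustration, not the exact method of Ipsen et al.):

import numpy as np

def sampled_qr(A, sample_rate=0.1, seed=0):
    rng = np.random.default_rng(seed)
    # The sample must still have at least n rows.
    S = A[rng.random(A.shape[0]) < sample_rate]
    R = np.linalg.qr(S, mode='r')      # QR of the sampled rows only
    Q = np.linalg.solve(R.T, A.T).T    # approximate Q = A R^{-1}
    T = np.linalg.qr(Q, mode='r')      # one refinement pass
    return np.linalg.solve(T.T, Q.T).T, T @ R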
30. Orthogonality of the sampled solution

[Plot: norm(QᵀQ − I) against condition number, comparing the sampled solution with AR⁻¹ (prior work: Constantine & Gleich, MapReduce 2011) and Direct TSQR (Benson, Gleich & Demmel, submitted).]
31. Four things that are better
1. A simple algorithm to compute R accurately (but it doesn't help get Q orthogonal).
2. A fast algorithm to get Q numerically orthogonal in most cases.
3. A two-pass algorithm to get Q numerically orthogonal in virtually all cases.
4. A direct algorithm for a numerically orthogonal Q in all cases.
Constantine & Gleich, MapReduce 2011; Benson, Gleich & Demmel, arXiv 2013.
32. Recreate Q by storing the history of the factorization

[Diagram: Mapper 1 factors each block Aᵢ into QᵢRᵢ; Task 2 collects the local factors R1-R4, factors the stack into pieces Q11, Q21, Q31, Q41 and the final R; Mapper 3 combines each Qᵢ with its piece to form the output Q.]

1. Output the local Q and R in separate files.
2. Collect R on one node; compute the Qs for each piece.
3. Distribute the pieces of Q*1 and form the true Q (see the sketch below).
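A serial numpy sketch of the three steps (plain arrays stand in for the separate files):

import numpy as np

def direct_tsqr(blocks):
    # Step 1: local QR of each block; keep both factors.
    Qs, Rs = zip(*[np.linalg.qr(A) for A in blocks])
    # Step 2: collect the R factors and factor the stack.
    n = Rs[0].shape[1]
    Q2, R = np.linalg.qr(np.vstack(Rs))
    # Step 3: multiply each local Q by its n x n piece of Q2.
    pieces = [Q2[i*n:(i+1)*n] for i in range(len(blocks))]
    return np.vstack([Qi @ P for Qi, P in zip(Qs, pieces)]), R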
33. Theoretical lower bound on runtime for a few cases on our small cluster

Model (all values in seconds)
Rows   Cols   Old    R-only, no IR   R-only, PIR   R-only, IR   Direct TSQR
4.0B   4      1803   1821            1821          2343         2525
2.5B   10     1645   1655            1655          2062         2464
0.6B   25     804    812             812           1000         1237
0.5B   50     1240   1250            1250          1517         2103

Actual (all values in seconds)
Rows   Cols   Old    R-only, no IR   R-only, PIR   R-only, IR   Direct TSQR
4.0B   4      2931   3460            3620          4741         6128
2.5B   10     2508   2509            3354          4034         4035
0.6B   25     1098   1104            1476          2006         1910
0.5B   50     921    1618            1960          2655         3090

Only two parameters are needed – the read and write bandwidth for the cluster – to derive a performance model of the algorithm. This simple model is almost within a factor of two of the true runtime. (10-node cluster, 60 disks)
35. Nonlinear heat transfer model in random media
Each run takes 5 hours on 8 processors; we did 8192 runs.
4.2M nodes, 9 time-steps, 128 realizations, 64 size parameters.
SVD of a 5B × 64 data matrix.
50x reduction in wall-clock time including all pre- and post-processing; "∞x" speedup for a particular study without pre-processing (< 60 sec on a laptop).
Interpolating adaptive low-rank approximations.
Constantine, Gleich, Hou & Templeton, arXiv 2013.
[Fig. 4.5 from the paper: error in the reduced-order model compared to the prediction standard deviation, for one realization of the bubble locations at the final time, for bubble radii s = 0.39 cm and s = 1.95 cm. The accompanying excerpt notes that working with the simulation data involved interpreting 4TB of Exodus II files from Aria, computing the TSSVD, and computing predictions and errors, taking approximately 8-15 hours on a multi-tenant cluster, with jobs ranging between 100GB and 2TB.]
37. Is double-precision sufficient?
Double-precision floating point was designed for the era where "big" was 1,000-10,000.

s = 0; for i = 1 to n: s = s + x[i]

This simple summation formula has error that is not always small if n is a billion. Watch out for this issue in important computations.

fl(x + y) = (x + y)(1 + ε)
|fl(Σᵢ xᵢ) − Σᵢ xᵢ| ≤ n µ Σᵢ |xᵢ|,   where µ ≈ 10⁻¹⁶
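A quick numpy demonstration of the effect (single precision exaggerates it enough to see at n = 10⁶):

import math
import numpy as np

# Sum one million copies of 0.1 in single precision. The naive
# running sum accumulates rounding error on every add; compensated
# summation (math.fsum) does not.
x = np.full(1_000_000, 0.1, dtype=np.float32)

naive = np.float32(0.0)
for v in x:
    naive = np.float32(naive + v)

print(naive)         # noticeably off from 100000 (roughly 100958)
print(math.fsum(x))  # ~100000.0, up to float32 rounding of 0.1 itself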