This talk is an update based on some of our recent results on tall-and-skinny QR factorizations in MapReduce. In particular, the "fast" iterative refinement approximation based on a sample is new.
1. Tall-and-skinny QRs and SVDs in MapReduce
David F. Gleich, Computer Science, Purdue University

[Figure: a tall matrix A partitioned into row blocks A1-A4]

Yangyang Hou, Purdue CS
Paul G. Constantine, Austin Benson, Joe Nichols, Stanford University
James Demmel, UC Berkeley
Joe Ruthruff, Jeremy Templeton, Sandia CA

Mainly funded by the Sandia National Labs CSAR project, recently by NSF CAREER, and the Purdue Research Foundation.
2. Tall-and-skinny matrices (m ≫ n)
Many rows (like a billion); a few columns (under 10,000).

[Figure: a tall matrix A built from the tinyimages collection]

Used in: regression and general linear models with many samples, block iterative methods, panel factorizations, approximate kernel k-means, and big-data SVD/PCA.
3. A matrix A : m × n, m ≥ n, is tall and skinny when O(n²) work and storage is "cheap" compared to m.
-- Austin Benson
4. Quick review of QR
QR is block normalization: the way you "normalize" a vector usually generalizes to computing R in a QR factorization.

Let A : m × n, m ≥ n, real. Then
A = QR, where
Q is m × n, orthogonal (QᵀQ = I), and
R is n × n, upper triangular.
5. Tall-and-skinny SVD and RSVD
Let A : m × n, m ≥ n, real. Then
A = UΣVᵀ, where
U is m × n, orthogonal,
Σ is n × n, nonnegative and diagonal, and
V is n × n, orthogonal.

[Diagram: A → TSQR → Q and R; an SVD of the small R gives Σ and V]
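To make the diagram concrete, here is a minimal serial numpy sketch of the TSQR-then-small-SVD idea (plain numpy stands in for the MapReduce TSQR; the function name is mine):

import numpy as np

def tall_skinny_svd(A):
    # Thin SVD via QR: the n x n factor R carries everything
    # needed for the singular values and V.
    Q, R = np.linalg.qr(A)         # A = Q R, R is n x n
    Ur, s, Vt = np.linalg.svd(R)   # R = Ur diag(s) Vt, all small
    return Q @ Ur, s, Vt           # A = (Q Ur) diag(s) Vt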
6. There are good MPI implementations. What's left to do?
7. Moving data to an MPI cluster may be infeasible or costly.
8. Data computers I've used
Nebula Cluster @ Sandia CA: 2TB/core storage, 64 nodes, 256 cores, GB ethernet. Cost: $150k.
Student Cluster @ Stanford: 3TB/core storage, 11 nodes, 44 cores, GB ethernet. Cost: $30k.
Magellan Cluster @ NERSC: 128GB/core storage, 80 nodes, 640 cores, InfiniBand.
These systems are good for working with enormous matrix data!
9. By 2013(?) all Fortune 500 companies will have a data computer.
10. How do you program them?
11. A graphical view of the MapReduce programming model

[Diagram: data flows into parallel Map tasks, each emitting (key, value) pairs; a Shuffle groups pairs by key; Reduce tasks consume each key with its list of values and emit data.]

Map tasks read batches of data in parallel and do some initial filtering.
The shuffle is a global communication, like a group-by or MPI_Alltoall.
Reduce is often where the computation happens.
12. MapReduce looks like the first step of any MPI program that loads data.
Read data -> assign to process.
Receive data -> do compute.
13. Still, isn't this easy to do?
Current MapReduce algorithms use the normal equations:
A = QR;  AᵀA = RᵀR via Cholesky;  Q = AR⁻¹.

[Diagram: A partitioned into row blocks A1-A4]

Map: Aᵢ → AᵢᵀAᵢ
Reduce: RᵀR = Sum(AᵢᵀAᵢ)
Map 2: Aᵢ → AᵢR⁻¹

Two problems (see the sketch below):
R is inaccurate if A is ill-conditioned.
Q is not numerically orthogonal (Householder QR assures this).
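A minimal serial numpy sketch of this normal-equations approach (my own illustration, not the MapReduce code):

import numpy as np

def cholesky_qr(A):
    # R from a Cholesky factorization of A^T A; in MapReduce the
    # Gram matrix is the reduce-side sum of the blocks A_i^T A_i.
    G = A.T @ A
    R = np.linalg.cholesky(G).T      # G = R^T R, R upper triangular
    Q = np.linalg.solve(R.T, A.T).T  # second map: Q = A R^{-1}
    return Q, R

The catch: cond(AᵀA) = cond(A)², so R, and hence Q, degrades quickly once A is ill-conditioned.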
14. Numerical stability was a problem for prior approaches

[Plot: norm(QᵀQ − I) against condition numbers up to 10²⁰; the AR⁻¹ curve (prior work) loses orthogonality as the condition number grows.]

Previous methods couldn't ensure that the matrix Q was orthogonal.
15. Four things that are better
1. A simple algorithm to compute R accurately (but it doesn't help get Q orthogonal).
2. A "fast" algorithm to get Q numerically orthogonal in most cases.
3. A multi-pass algorithm to get Q numerically orthogonal in virtually all cases.
4. A direct algorithm for a numerically orthogonal Q in all cases.
Constantine & Gleich, MapReduce 2011; Benson, Gleich & Demmel, arXiv 2013.
16. Numerical stability was a problem for prior approaches

[Plot: norm(QᵀQ − I) against condition numbers up to 10²⁰. AR⁻¹ (prior work: Constantine & Gleich, MapReduce 2011) loses orthogonality; AR⁻¹ + iterative refinement and Direct TSQR (Benson, Gleich & Demmel, submitted) stay numerically orthogonal.]

Previous methods couldn't ensure that the matrix Q was orthogonal.
17. How to store tall-and-skinny matrices in Hadoop
A : m × n, m ≫ n.
The key is an arbitrary row-id; the value is the 1 × n array for a row (or a b × n block).
Each submatrix Aᵢ is the input to a map task.

[Diagram: A partitioned into row blocks A1-A4]
18. MapReduce is great for TSQR!
You don't need AᵀA.

Data: a tall-and-skinny (TS) matrix, stored by rows
Input: 500,000,000-by-50 matrix
Each record: 1-by-50 row
HDFS size: 183.6 GB

Time to read A: 253 sec.; to write A: 848 sec.
Time to compute R in qr(A): 526 sec.; with Q = AR⁻¹: 1618 sec.
Time to compute Q in qr(A), numerically stable: 3090 sec.
19. Communication-avoiding QR (Demmel et al. 2008)
First, do QR factorizations of each local matrix. Second, compute a QR factorization of the new "R".
Demmel et al. 2008. Communication-avoiding parallel and sequential QR.
20. Serial QR factorizations (Demmel et al. 2008)
Fully serial TSQR: compute the QR of the first block, read the next block, update the QR factorization, and so on.
Demmel et al. 2008. Communication-avoiding parallel and sequential QR.
21. Communication-avoiding QR (Demmel et al. 2008) on MapReduce (Constantine and Gleich, 2011)

[Diagram: Mapper 1 runs serial TSQR on blocks A1-A4 and emits R4; Mapper 2 runs serial TSQR on blocks A5-A8 and emits R8; Reducer 1 runs serial TSQR on R4 and R8 and emits the final R.]

Algorithm (see the sketch below)
Data: rows of a matrix
Map: QR factorization of rows
Reduce: QR factorization of rows
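A minimal serial numpy sketch of this map/reduce structure (the explicit block split stands in for what HDFS does):

import numpy as np

def tsqr_r(blocks):
    # Map: each task reduces its blocks to one small R factor.
    local_Rs = [np.linalg.qr(A, mode='r') for A in blocks]
    # Reduce: stack the local R factors and take one more QR.
    return np.linalg.qr(np.vstack(local_Rs), mode='r')

A = np.random.randn(8000, 50)
R = tsqr_r(np.vsplit(A, 8))
# Agrees with a direct QR up to row signs:
print(np.allclose(np.abs(R), np.abs(np.linalg.qr(A, mode='r'))))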
22. Too many maps cause too much data to go to one reducer!
From the tinyimages collection: 80,000,000 images, 1000 pixels each.
Each image is 5k, so each HDFS block holds 12,800 images; that's 6,250 total blocks.
Each map outputs a 1000-by-1000 matrix, so one reducer gets a 6.25M-by-1000 matrix (50GB).
23. [Diagram: TSQR with two iterations. Iteration 1: Mappers 1-1 through 1-4 each run serial TSQR on their blocks of A and emit local factors R1-R4; a shuffle sends these to Reducers 1-1 through 1-3, which run serial TSQR and emit R2,1-R2,3. Iteration 2: an identity map and a second shuffle send those factors to Reducer 2-1, which runs serial TSQR and emits the final R.]
25. Full TSQR code in hadoopy

import random, numpy, hadoopy

class SerialTSQR:
    def __init__(self, blocksize, isreducer):
        self.bsize = blocksize
        self.data = []  # buffered rows (of A, or of a prior R)
        if isreducer: self.__call__ = self.reducer
        else: self.__call__ = self.mapper

    def compress(self):
        # Compute a QR of the buffered rows and keep only R.
        R = numpy.linalg.qr(numpy.array(self.data), 'r')
        # reset data and re-initialize to R
        self.data = []
        for row in R:
            self.data.append([float(v) for v in row])

    def collect(self, key, value):
        self.data.append(value)
        # Compress whenever the buffer exceeds blocksize * n rows.
        if len(self.data) > self.bsize * len(self.data[0]):
            self.compress()

    def close(self):
        # Emit the final local R, one row at a time, under random keys.
        self.compress()
        for row in self.data:
            key = random.randint(0, 2000000000)
            yield key, row

    def mapper(self, key, value):
        self.collect(key, value)

    def reducer(self, key, values):
        for value in values: self.mapper(key, value)

if __name__ == '__main__':
    mapper = SerialTSQR(blocksize=3, isreducer=False)
    reducer = SerialTSQR(blocksize=3, isreducer=True)
    hadoopy.run(mapper, reducer)
26. Numerical stability was a problem for prior approaches

[Plot, repeated from slide 16: norm(QᵀQ − I) against condition number. AR⁻¹ (prior work: Constantine & Gleich, MapReduce 2011) loses orthogonality; AR⁻¹ + iterative refinement and Direct TSQR (Benson, Gleich & Demmel, submitted) stay numerically orthogonal.]

Previous methods couldn't ensure that the matrix Q was orthogonal.
27. Iterative refinement helps

[Diagram: TSQR computes R, which is distributed to the mappers; Mapper 1 computes Qᵢ = AᵢR⁻¹ for each block. A second TSQR of Q produces T, which is distributed; Mapper 2 computes QᵢT⁻¹.]

Iterative refinement is like using Newton's method to solve Ax = b. There's a famous quote, attributed to Parlett, that "two iterations of iterative refinement are enough."
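A serial numpy sketch of one refinement pass (my own illustration of the scheme in the diagram):

import numpy as np

def refined_qr(A, passes=1):
    R = np.linalg.qr(A, mode='r')        # TSQR gives an accurate R
    Q = np.linalg.solve(R.T, A.T).T      # mapper 1: Q = A R^{-1}
    for _ in range(passes):
        T = np.linalg.qr(Q, mode='r')    # second TSQR, of Q
        Q = np.linalg.solve(T.T, Q.T).T  # mapper 2: Q = Q T^{-1}
        R = T @ R                        # keep A = Q R consistent
    return Q, R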
28. Numerical stability was a problem for prior approaches

[Plot, repeated: norm(QᵀQ − I) against condition number, now highlighting that the AR⁻¹ + iterative refinement curve stays numerically orthogonal alongside Direct TSQR.]

Previous methods couldn't ensure that the matrix Q was orthogonal.
29. What if iterative refinement is too slow?

[Diagram: Mapper 1 draws a sample S of rows from the blocks, computes the QR of the sample, and distributes its R; each block then forms Qᵢ = AᵢR⁻¹. Mapper 2 runs one refinement pass: a TSQR of the result gives T, T is distributed, and each block multiplies by T⁻¹. The sample S is used to estimate the "norm".]

Based on recent work by Ipsen et al. on approximating QR with a random subset of rows. This also assumes that you can get a subset of rows "cheaply" – possible, but nontrivial in Hadoop.
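A rough serial numpy sketch of the idea (the uniform sampling scheme and rate here are my assumptions for illustration, not the exact method of Ipsen et al.):

import numpy as np

def sampled_qr(A, sample_rate=0.1, seed=0):
    rng = np.random.default_rng(seed)
    # The sample must still have at least n rows.
    S = A[rng.random(A.shape[0]) < sample_rate]
    R = np.linalg.qr(S, mode='r')      # QR of the sampled rows only
    Q = np.linalg.solve(R.T, A.T).T    # approximate Q = A R^{-1}
    T = np.linalg.qr(Q, mode='r')      # one refinement pass
    return np.linalg.solve(T.T, Q.T).T, T @ R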
30. Orthogonality of the sampled solution

[Plot: norm(QᵀQ − I) against condition number, comparing the sampled solution with AR⁻¹ (prior work: Constantine & Gleich, MapReduce 2011) and Direct TSQR (Benson, Gleich & Demmel, submitted).]
31. Four things that are better
1. A simple algorithm to compute R accurately (but it doesn't help get Q orthogonal).
2. A fast algorithm to get Q numerically orthogonal in most cases.
3. A two-pass algorithm to get Q numerically orthogonal in virtually all cases.
4. A direct algorithm for a numerically orthogonal Q in all cases.
Constantine & Gleich, MapReduce 2011; Benson, Gleich & Demmel, arXiv 2013.
32. Recreate Q by storing the history of the factorization

[Diagram: Mapper 1 factors each block Aᵢ into QᵢRᵢ; Task 2 collects the local factors R1-R4, factors the stack into pieces Q11, Q21, Q31, Q41 and the final R; Mapper 3 combines each Qᵢ with its piece to form the output Q.]

1. Output the local Q and R in separate files.
2. Collect R on one node; compute the Qs for each piece.
3. Distribute the pieces of Q*1 and form the true Q (see the sketch below).
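A serial numpy sketch of the three steps (plain arrays stand in for the separate files):

import numpy as np

def direct_tsqr(blocks):
    # Step 1: local QR of each block; keep both factors.
    Qs, Rs = zip(*[np.linalg.qr(A) for A in blocks])
    # Step 2: collect the R factors and factor the stack.
    n = Rs[0].shape[1]
    Q2, R = np.linalg.qr(np.vstack(Rs))
    # Step 3: multiply each local Q by its n x n piece of Q2.
    pieces = [Q2[i*n:(i+1)*n] for i in range(len(blocks))]
    return np.vstack([Qi @ P for Qi, P in zip(Qs, pieces)]), R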
33. Theoretical lower bound on runtime for a few cases on our small cluster

Model (all values in seconds)
Rows   Cols   Old    R-only, no IR   R-only, PIR   R-only, IR   Direct TSQR
4.0B   4      1803   1821            1821          2343         2525
2.5B   10     1645   1655            1655          2062         2464
0.6B   25     804    812             812           1000         1237
0.5B   50     1240   1250            1250          1517         2103

Actual (all values in seconds)
Rows   Cols   Old    R-only, no IR   R-only, PIR   R-only, IR   Direct TSQR
4.0B   4      2931   3460            3620          4741         6128
2.5B   10     2508   2509            3354          4034         4035
0.6B   25     1098   1104            1476          2006         1910
0.5B   50     921    1618            1960          2655         3090

Only two parameters are needed – the read and write bandwidth for the cluster – to derive a performance model of the algorithm. This simple model is almost within a factor of two of the true runtime. (10-node cluster, 60 disks)
35. Nonlinear heat transfer model in random media
Each run takes 5 hours on 8 processors; we did 8192 runs.
4.2M nodes, 9 time-steps, 128 realizations, 64 size parameters.
SVD of a 5B × 64 data matrix.
50x reduction in wall-clock time including all pre- and post-processing; "∞x" speedup for a particular study without pre-processing (< 60 sec on a laptop).
Interpolating adaptive low-rank approximations.
Constantine, Gleich, Hou & Templeton, arXiv 2013.
[Fig. 4.5 from the paper: error in the reduced-order model compared to the prediction standard deviation, for one realization of the bubble locations at the final time, for bubble radii s = 0.39 cm and s = 1.95 cm. The accompanying excerpt notes that working with the simulation data involved interpreting 4TB of Exodus II files from Aria, computing the TSSVD, and computing predictions and errors, taking approximately 8-15 hours on a multi-tenant cluster, with jobs ranging between 100GB and 2TB.]
37. Is double-precision sufficient?
Double-precision floating point was designed for the era where "big" was 1,000-10,000.

s = 0; for i = 1 to n: s = s + x[i]

This simple summation formula has error that is not always small if n is a billion. Watch out for this issue in important computations.

fl(x + y) = (x + y)(1 + ε)
|fl(Σᵢ xᵢ) − Σᵢ xᵢ| ≤ n µ Σᵢ |xᵢ|,   where µ ≈ 10⁻¹⁶
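A quick numpy demonstration of the effect (single precision exaggerates it enough to see at n = 10⁶):

import math
import numpy as np

# Sum one million copies of 0.1 in single precision. The naive
# running sum accumulates rounding error on every add; compensated
# summation (math.fsum) does not.
x = np.full(1_000_000, 0.1, dtype=np.float32)

naive = np.float32(0.0)
for v in x:
    naive = np.float32(naive + v)

print(naive)         # noticeably off from 100000 (roughly 100958)
print(math.fsum(x))  # ~100000.0, up to float32 rounding of 0.1 itself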