Direct tall-and-skinny QR factorizations in MapReduce architectures

1. Tall-and-Skinny QR Factorizations in MapReduce
Paul G. Constantine, Austin Benson, Joe Nichols (Stanford University)
David F. Gleich (Purdue University, Computer Science Department)
James Demmel (UC Berkeley)
Joe Ruthruff, Jeremy Templeton (Sandia)
David Gleich · Purdue
Cornell CS
2. Questions?
Most recent code at http://github.com/arbenson/mrtsqr
3. Quick review of QR
Let A be m × n, real. The QR factorization is
    A = Q [ R ]
          [ 0 ]
where Q is orthogonal (QᵀQ = I) and R is upper triangular.
QR is block normalization: "normalize" a vector usually generalizes to computing Q in the QR factorization.
Using QR for regression: the solution of min ‖Ax − b‖ is given by Rx = Qᵀb.
David Gleich (Sandia), MapReduce 2011
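As a concrete sketch of the regression use above (numpy, with a small random A and b as assumptions, not data from the talk), the least-squares solution obtained by back-substituting Rx = Qᵀb matches numpy's own least-squares solver:

```python
import numpy as np

# Solve min_x ||Ax - b|| via a QR factorization of a tall-and-skinny A.
rng = np.random.default_rng(0)
A = rng.standard_normal((100, 5))    # m >> n
b = rng.standard_normal(100)

Q, R = np.linalg.qr(A)               # A = QR, Q has orthonormal columns
x_qr = np.linalg.solve(R, Q.T @ b)   # back-substitute R x = Q^T b

# Compare against numpy's direct least-squares solver.
x_ls = np.linalg.lstsq(A, b, rcond=None)[0]
print(np.allclose(x_qr, x_ls))
```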
4. Tall-and-Skinny matrices (m ≫ n)
5. Tall-and-skinny matrices (m ≫ n) arise in
- regression with many samples
- block iterative methods
- panel factorizations
- model reduction problems
- general linear models with many samples
- tall-and-skinny SVD/PCA
(Image from the tinyimages collection.)
All of these applications need a QR factorization of a tall-and-skinny matrix; some only need R.
6. The Database
Input: the parameters of a simulation, s. Output: the time history of the simulation, f(s), roughly 100GB per run: s1 → f1, s2 → f2, …, sk → fk.
A single simulation as a vector, stacking the state q at every mesh point for each time step:
    f(s) = [ q(x1,t1,s), …, q(xn,t1,s), q(x1,t2,s), …, q(xn,t2,s), …, q(xn,tk,s) ]ᵀ
The database as a very tall-and-skinny matrix:
    X = [ f(s1)  f(s2)  …  f(sp) ]
10. The MapReduce Framework
Originated at Google for indexing web pages and computing PageRank.
Express algorithms in data-local operations (maps). Implement one type of communication: the shuffle, which moves all data with the same key to the same reducer.
[Diagram: mappers M1–M5 feeding reducers through a shuffle; data scalable.]
Fault-tolerance by design:
- input stored in triplicate
- reduce input/output on disk
- map output persisted to disk before the shuffle
11. Computing variance in MapReduce
[Diagram: Runs 1–3, each with time steps T=1, T=2, T=3.]
12. Mesh point variance in MapReduce
[Diagram: Runs 1–3, time steps T=1–3, feeding mappers M and reducers R.]
1. Each mapper outputs the mesh points with the same key.
2. The shuffle moves all values from the same mesh point to the same reducer.
3. Reducers just compute a numerical variance.
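A minimal in-memory sketch of this pattern (plain Python; the shuffle is simulated with a dict, and the mesh-point ids and values are made-up toy data, not the talk's):

```python
from collections import defaultdict
import statistics

# Simulated records: each run emits (mesh_point, value) pairs at every time step.
records = [
    ("x1", 1.0), ("x2", 4.0),   # run 1
    ("x1", 2.0), ("x2", 5.0),   # run 2
    ("x1", 3.0), ("x2", 6.0),   # run 3
]

# Map: emit each value keyed by its mesh-point id (an identity map here).
# Shuffle: group all values with the same key together.
groups = defaultdict(list)
for key, value in records:
    groups[key].append(value)

# Reduce: compute a numerical variance per mesh point.
variance = {key: statistics.pvariance(vals) for key, vals in groups.items()}
print(variance)   # both variances are 2/3
```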
13. MapReduce vs. Hadoop
MapReduce: a computation model with
- Map: a local data transform
- Shuffle: a grouping function
- Reduce: an aggregation
Hadoop: an implementation of MapReduce using the HDFS parallel file-system. Others: Phoenix++, Twisted, Google MapReduce, Spark, …
14. Current state of the art for MapReduce QR
MapReduce is often used to compute the principal components of large datasets. These approaches all form the normal equations AᵀA and work with it.
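For contrast, here is a sketch of the normal-equations route in numpy (a small random A is an assumption): the Cholesky factor of AᵀA is the same R as QR gives, but the Gram matrix squares the condition number of A, which is the numerical weakness TSQR avoids.

```python
import numpy as np

rng = np.random.default_rng(1)
A = rng.standard_normal((1000, 10))

# Normal-equations route: R from the Cholesky factor of the Gram matrix.
G = A.T @ A                         # n-by-n; cond(G) = cond(A)^2
R_chol = np.linalg.cholesky(G).T    # upper triangular, G = R^T R

# Same R, up to row signs, as a QR factorization applied to A directly.
R_qr = np.linalg.qr(A, mode='r')
print(np.allclose(np.abs(R_chol), np.abs(R_qr)))
```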
15. MapReduce is great for TSQR! You don't need AᵀA
Data: a tall-and-skinny (TS) matrix, stored by rows
Input: 500,000,000-by-50 matrix
Each record: one 1-by-50 row
HDFS size: 183.6 GB
Time to read A: 253 sec.; to write A: 848 sec.
Time to compute R in qr(A): 526 sec.; with Q = AR⁻¹: 1618 sec.
Time to compute Q in qr(A) (numerically stable): 3090 sec.
17. Communication-avoiding TSQR (Demmel et al. 2008)
First, do QR factorizations of each local matrix. Second, compute a QR factorization of the new "R".
Demmel et al. 2008. Communication-avoiding parallel and sequential QR.
18. Fully serial TSQR (Demmel et al. 2008)
Compute the QR of the first block, read the next block, update the QR, and repeat.
Demmel et al. 2008. Communication-avoiding parallel and sequential QR.
19. Tall-and-skinny matrix storage in MapReduce
The key is an arbitrary row-id; the value is the array for a row. Each submatrix (A1, A2, A3, A4) is an input split.
You can also store multiple rows together; it goes a little faster.
20. Algorithm
Data: rows of a matrix. Map: QR factorization of rows. Reduce: QR factorization of rows.
Mapper 1 (serial TSQR): collect A1 and A2, qr → Q2 R2; add A3, qr → Q3 R3; add A4, qr → Q4 R4; emit R4.
Mapper 2 (serial TSQR): collect A5 and A6, qr → Q6 R6; add A7, qr → Q7 R7; add A8, qr → Q8 R8; emit R8.
Reducer 1 (serial TSQR): collect R4 and R8, qr → Q R; emit R.
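The correctness of this map/reduce recurrence can be sketched in numpy (the two-way block split and sizes are illustrative, not the talk's code): the QR of the stacked per-mapper R factors reproduces, up to row signs, the R of the full matrix.

```python
import numpy as np

rng = np.random.default_rng(2)
A = rng.standard_normal((800, 8))
A1, A2 = A[:400], A[400:]          # rows split across two "mappers"

# Map: each mapper runs a serial QR on its rows and emits only R.
R1 = np.linalg.qr(A1, mode='r')
R2 = np.linalg.qr(A2, mode='r')

# Reduce: stack the emitted R factors and factor again.
R = np.linalg.qr(np.vstack([R1, R2]), mode='r')

# Same R, up to row signs, as a direct QR of all of A.
R_direct = np.linalg.qr(A, mode='r')
print(np.allclose(np.abs(R), np.abs(R_direct)))
```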
21. Key Limitation
Computes only R and not Q.
Can get Q via Q = AR⁻¹ with another MapReduce iteration.
Numerical stability: dubious. ‖QᵀQ − I‖ is large, although iterative refinement helps.
22. Achieving numerical stability
[Plot: ‖QᵀQ − I‖ versus condition number (10⁵ to 10²⁰) for AR⁻¹, AR⁻¹ + iterative refinement, and Direct TSQR.]
24. Full code in hadoopy

import random, numpy, hadoopy

class SerialTSQR:
    def __init__(self, blocksize, isreducer):
        self.bsize = blocksize
        self.data = []
        if isreducer: self.__call__ = self.reducer
        else: self.__call__ = self.mapper

    def compress(self):
        R = numpy.linalg.qr(numpy.array(self.data), 'r')
        # reset data and re-initialize to R
        self.data = []
        for row in R:
            self.data.append([float(v) for v in row])

    def collect(self, key, value):
        self.data.append(value)
        if len(self.data) > self.bsize * len(self.data[0]):
            self.compress()

    def close(self):
        self.compress()
        for row in self.data:
            key = random.randint(0, 2000000000)
            yield key, row

    def mapper(self, key, value):
        self.collect(key, value)

    def reducer(self, key, values):
        for value in values: self.mapper(key, value)

if __name__ == '__main__':
    mapper = SerialTSQR(blocksize=3, isreducer=False)
    reducer = SerialTSQR(blocksize=3, isreducer=True)
    hadoopy.run(mapper, reducer)
25. Fault injection
[Plot: time to completion (sec) versus 1/Prob(failure), the mean number of successes per failure (10 to 1000), with and without faults, for 200M-by-200 and 800M-by-10 matrices.]
With 1/5 of tasks failing, the job only takes twice as long.
26. How to get Q?
27. Idea 1 (unstable)
TSQR gives R; distribute R⁻¹ to the mappers; each mapper computes Qi = Ai R⁻¹ (for A1, …, A4).
28. Idea 2 (better)
There is a famous quote, attributed to Parlett, that "two iterations of iterative refinement are enough."
Mapper 1: TSQR gives R; distribute R⁻¹; each mapper computes Qi = Ai R⁻¹.
Mapper 2: TSQR of the computed Q gives T; distribute T⁻¹; each mapper refines Qi ← Qi T⁻¹.
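A numpy sketch of the two passes (the ill-conditioned test matrix and its shapes are assumptions for illustration): the indirect Q = AR⁻¹ loses orthogonality when A is badly conditioned, and one refinement pass recovers it.

```python
import numpy as np

rng = np.random.default_rng(3)
# An ill-conditioned tall-and-skinny A (columns scaled over 5 orders of magnitude).
A = rng.standard_normal((500, 6)) @ np.diag(10.0 ** -np.arange(6))

# Pass 1: TSQR gives R; each mapper forms Q = A R^{-1}.
R = np.linalg.qr(A, mode='r')
Q = np.linalg.solve(R.T, A.T).T        # A R^{-1} via a triangular solve

# Pass 2 (refinement): QR of the computed Q gives T; update Q <- Q T^{-1}.
T = np.linalg.qr(Q, mode='r')
Q2 = np.linalg.solve(T.T, Q.T).T

err1 = np.linalg.norm(Q.T @ Q - np.eye(6))
err2 = np.linalg.norm(Q2.T @ Q2 - np.eye(6))
print(err1, err2)                      # refinement improves orthogonality
```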
29. Communication-avoiding TSQR (Demmel et al. 2008)
First, do QR factorizations of each local matrix. Second, compute a QR factorization of the new "R".
Demmel et al. 2008. Communication-avoiding parallel and sequential QR.
30. Idea 3 (best!)
1. Output the local Q and R in separate files: each mapper computes Ai = Qi Ri, writing Q1, …, Q4 to the Q output and R1, …, R4 to the R output.
2. Collect R on one node and compute the Qs for each piece: Task 2 stacks R1, …, R4 and computes one QR, giving the final R and the pieces Q11, Q21, Q31, Q41.
3. Distribute the pieces of Q*1 and form the true Q: Mapper 3 computes Qi · Qi1.
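The three steps above can be sketched in numpy (two blocks instead of four, with toy sizes; the variable names follow the slide): the reassembled Q is an exact orthogonal factor of A.

```python
import numpy as np

rng = np.random.default_rng(4)
A = rng.standard_normal((600, 5))
A1, A2 = A[:300], A[300:]

# 1. Each mapper outputs its local Q and R separately.
Q1, R1 = np.linalg.qr(A1)
Q2, R2 = np.linalg.qr(A2)

# 2. Collect the R factors on one node; one small QR gives the final R
#    and a Q piece (Q11, Q21) for each mapper.
Qs, R = np.linalg.qr(np.vstack([R1, R2]))
Q11, Q21 = Qs[:5], Qs[5:]

# 3. Distribute the pieces; each mapper forms its slice of the true Q
#    by a local multiply.
Q = np.vstack([Q1 @ Q11, Q2 @ Q21])

print(np.allclose(Q @ R, A))                 # reproduces A
print(np.linalg.norm(Q.T @ Q - np.eye(5)))   # orthogonality of the assembled Q
```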
31. The price is right!
[Plot: running time in seconds, 500 to 2500.]
Full TSQR is faster than refinement for few columns, and not any slower for many columns.
32. What can we do now?
33. PCA of 80,000,000 images
A: 80,000,000 images as rows, 1000 pixels per image as columns, with the rows shifted to zero mean.
MapReduce: TSQR produces R. Post-processing: the SVD of R yields V (the principal components) and the top 100 singular values; the first 16 columns of V are shown as images.
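The post-processing step works because, with A = QR, the SVD of the small n-by-n R already carries A's singular values and right singular vectors. A numpy sketch (a tiny random matrix standing in for the 80M-image A):

```python
import numpy as np

rng = np.random.default_rng(5)
A = rng.standard_normal((10000, 20))
A = A - A.mean(axis=0)              # center: subtract the mean "image"

# TSQR produces R; the SVD of the tiny R gives A's singular values and V.
R = np.linalg.qr(A, mode='r')
_, S, Vt = np.linalg.svd(R)         # rows of Vt are the principal components

# Check against a full SVD of A (feasible only at this toy size).
S_full = np.linalg.svd(A, compute_uv=False)
print(np.allclose(S, S_full))
```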
34. A Large Scale Example
- Nonlinear heat transfer model
- 80k nodes, 300 time-steps
- 104 basis runs
- SVD of a 24M × 104 data matrix
- 500× reduction in wall clock time (100× including the SVD)
David Gleich · Purdue
ICASSP
35. What's next?
Investigate randomized algorithms for computing SVDs of fatter matrices (Halko, Martinsson, Tropp, SIREV 2011).
Algorithm: Randomized PCA. Given an m × n matrix A, the number k of principal components, and an exponent q, this procedure computes an approximate rank-2k factorization UΣV*. The columns of U estimate the first 2k principal components of A.
Stage A:
1. Generate an n × 2k Gaussian test matrix Ω.
2. Form Y = (AA*)^q AΩ by multiplying alternately with A and A*.
3. Construct a matrix Q whose columns form an orthonormal basis for the range of Y.
Stage B:
1. Form B = Q*A.
2. Compute an SVD of the small matrix: B = ŨΣV*.
3. Set U = QŨ.
(The singular spectrum of the data matrix often decays quite slowly.)
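A numpy sketch of the Halko–Martinsson–Tropp procedure as stated above (toy sizes and a synthetic test matrix with a known decaying spectrum are assumptions; a practical code would also re-orthogonalize between the power multiplications to limit round-off):

```python
import numpy as np

def randomized_pca(A, k, q, rng):
    """Approximate rank-2k factorization U S Vt of A (HMT, SIREV 2011)."""
    m, n = A.shape
    # Stage A: randomized range finder with q power iterations.
    Omega = rng.standard_normal((n, 2 * k))     # n x 2k Gaussian test matrix
    Y = A @ Omega
    for _ in range(q):                          # Y = (A A*)^q A Omega
        Y = A @ (A.T @ Y)
    Q, _ = np.linalg.qr(Y)                      # orthonormal basis for range(Y)
    # Stage B: SVD of the small matrix B = Q* A.
    B = Q.T @ A
    U_hat, S, Vt = np.linalg.svd(B, full_matrices=False)
    return Q @ U_hat, S, Vt

# Synthetic A with a known, geometrically decaying singular spectrum.
rng = np.random.default_rng(7)
s = 0.8 ** np.arange(100)
U0, _ = np.linalg.qr(rng.standard_normal((500, 100)))
V0, _ = np.linalg.qr(rng.standard_normal((100, 100)))
A = U0 @ np.diag(s) @ V0.T

U, S, Vt = randomized_pca(A, k=10, q=2, rng=rng)
print(np.max(np.abs(S[:10] - s[:10]) / s[:10]))   # relative error in top 10
```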
36. Questions?
Most recent code at http://github.com/arbenson/mrtsqr