Direct tall-and-skinny QR factorizations in MapReduce architectures

1. Tall-and-Skinny QR Factorizations in MapReduce
Paul G. Constantine, Austin Benson, Joe Nichols (Stanford University)
David F. Gleich (Purdue University, Computer Science Department)
James Demmel (UC Berkeley)
Joe Ruthruff, Jeremy Templeton (Sandia)
David Gleich · Purdue
Cornell CS
2. Questions?
Most recent code at http://github.com/arbenson/mrtsqr
3. Quick review of QR
Let A be m × n, real. The QR factorization is
    A = Q [ R ]
          [ 0 ]
where Q is orthogonal (QᵀQ = I) and R is upper triangular.
QR is block normalization: "normalize" a vector usually generalizes to computing Q in the QR factorization.
Using QR for regression: the solution of min ‖Ax − b‖ is given by Rx = Qᵀb.
David Gleich (Sandia), MapReduce 2011
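As a concrete sketch of the regression use above (numpy, with a small random A and b as assumptions, not data from the talk), the least-squares solution obtained by back-substituting Rx = Qᵀb matches numpy's own least-squares solver:

```python
import numpy as np

# Solve min_x ||Ax - b|| via a QR factorization of a tall-and-skinny A.
rng = np.random.default_rng(0)
A = rng.standard_normal((100, 5))    # m >> n
b = rng.standard_normal(100)

Q, R = np.linalg.qr(A)               # A = QR, Q has orthonormal columns
x_qr = np.linalg.solve(R, Q.T @ b)   # back-substitute R x = Q^T b

# Compare against numpy's direct least-squares solver.
x_ls = np.linalg.lstsq(A, b, rcond=None)[0]
print(np.allclose(x_qr, x_ls))
```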
4. Tall-and-Skinny matrices (m ≫ n)
5. Tall-and-skinny matrices (m ≫ n) arise in
- regression with many samples
- block iterative methods
- panel factorizations
- model reduction problems
- general linear models with many samples
- tall-and-skinny SVD/PCA
(Image from the tinyimages collection.)
All of these applications need a QR factorization of a tall-and-skinny matrix; some only need R.
6. The Database
Input: the parameters of a simulation, s. Output: the time history of the simulation, f(s), roughly 100GB per run: s1 → f1, s2 → f2, …, sk → fk.
A single simulation as a vector, stacking the state q at every mesh point for each time step:
    f(s) = [ q(x1,t1,s), …, q(xn,t1,s), q(x1,t2,s), …, q(xn,t2,s), …, q(xn,tk,s) ]ᵀ
The database as a very tall-and-skinny matrix:
    X = [ f(s1)  f(s2)  …  f(sp) ]
10. The MapReduce Framework
Originated at Google for indexing web pages and computing PageRank.
Express algorithms in data-local operations (maps). Implement one type of communication: the shuffle, which moves all data with the same key to the same reducer.
[Diagram: mappers M1–M5 feeding reducers through a shuffle; data scalable.]
Fault-tolerance by design:
- input stored in triplicate
- reduce input/output on disk
- map output persisted to disk before the shuffle
11. Computing variance in MapReduce
[Diagram: Runs 1–3, each with time steps T=1, T=2, T=3.]
12. Mesh point variance in MapReduce
[Diagram: Runs 1–3, time steps T=1–3, feeding mappers M and reducers R.]
1. Each mapper outputs the mesh points with the same key.
2. The shuffle moves all values from the same mesh point to the same reducer.
3. Reducers just compute a numerical variance.
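A minimal in-memory sketch of this pattern (plain Python; the shuffle is simulated with a dict, and the mesh-point ids and values are made-up toy data, not the talk's):

```python
from collections import defaultdict
import statistics

# Simulated records: each run emits (mesh_point, value) pairs at every time step.
records = [
    ("x1", 1.0), ("x2", 4.0),   # run 1
    ("x1", 2.0), ("x2", 5.0),   # run 2
    ("x1", 3.0), ("x2", 6.0),   # run 3
]

# Map: emit each value keyed by its mesh-point id (an identity map here).
# Shuffle: group all values with the same key together.
groups = defaultdict(list)
for key, value in records:
    groups[key].append(value)

# Reduce: compute a numerical variance per mesh point.
variance = {key: statistics.pvariance(vals) for key, vals in groups.items()}
print(variance)   # both variances are 2/3
```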
13. MapReduce vs. Hadoop
MapReduce: a computation model with
- Map: a local data transform
- Shuffle: a grouping function
- Reduce: an aggregation
Hadoop: an implementation of MapReduce using the HDFS parallel file-system. Others: Phoenix++, Twisted, Google MapReduce, Spark, …
14. Current state of the art for MapReduce QR
MapReduce is often used to compute the principal components of large datasets. These approaches all form the normal equations AᵀA and work with it.
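For contrast, here is a sketch of the normal-equations route in numpy (a small random A is an assumption): the Cholesky factor of AᵀA is the same R as QR gives, but the Gram matrix squares the condition number of A, which is the numerical weakness TSQR avoids.

```python
import numpy as np

rng = np.random.default_rng(1)
A = rng.standard_normal((1000, 10))

# Normal-equations route: R from the Cholesky factor of the Gram matrix.
G = A.T @ A                         # n-by-n; cond(G) = cond(A)^2
R_chol = np.linalg.cholesky(G).T    # upper triangular, G = R^T R

# Same R, up to row signs, as a QR factorization applied to A directly.
R_qr = np.linalg.qr(A, mode='r')
print(np.allclose(np.abs(R_chol), np.abs(R_qr)))
```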
15. MapReduce is great for TSQR! You don't need AᵀA
Data: a tall-and-skinny (TS) matrix, stored by rows
Input: 500,000,000-by-50 matrix
Each record: one 1-by-50 row
HDFS size: 183.6 GB
Time to read A: 253 sec.; to write A: 848 sec.
Time to compute R in qr(A): 526 sec.; with Q = AR⁻¹: 1618 sec.
Time to compute Q in qr(A) (numerically stable): 3090 sec.
17. Communication-avoiding TSQR (Demmel et al. 2008)
First, do QR factorizations of each local matrix. Second, compute a QR factorization of the new "R".
Demmel et al. 2008. Communication-avoiding parallel and sequential QR.
18. Fully serial TSQR (Demmel et al. 2008)
Compute the QR of the first block, read the next block, update the QR, and repeat.
Demmel et al. 2008. Communication-avoiding parallel and sequential QR.
19. Tall-and-skinny matrix storage in MapReduce
The key is an arbitrary row-id; the value is the array for a row. Each submatrix (A1, A2, A3, A4) is an input split.
You can also store multiple rows together; it goes a little faster.
20. Algorithm
Data: rows of a matrix. Map: QR factorization of rows. Reduce: QR factorization of rows.
Mapper 1 (serial TSQR): collect A1 and A2, qr → Q2 R2; add A3, qr → Q3 R3; add A4, qr → Q4 R4; emit R4.
Mapper 2 (serial TSQR): collect A5 and A6, qr → Q6 R6; add A7, qr → Q7 R7; add A8, qr → Q8 R8; emit R8.
Reducer 1 (serial TSQR): collect R4 and R8, qr → Q R; emit R.
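The correctness of this map/reduce recurrence can be sketched in numpy (the two-way block split and sizes are illustrative, not the talk's code): the QR of the stacked per-mapper R factors reproduces, up to row signs, the R of the full matrix.

```python
import numpy as np

rng = np.random.default_rng(2)
A = rng.standard_normal((800, 8))
A1, A2 = A[:400], A[400:]          # rows split across two "mappers"

# Map: each mapper runs a serial QR on its rows and emits only R.
R1 = np.linalg.qr(A1, mode='r')
R2 = np.linalg.qr(A2, mode='r')

# Reduce: stack the emitted R factors and factor again.
R = np.linalg.qr(np.vstack([R1, R2]), mode='r')

# Same R, up to row signs, as a direct QR of all of A.
R_direct = np.linalg.qr(A, mode='r')
print(np.allclose(np.abs(R), np.abs(R_direct)))
```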
21. Key Limitation
Computes only R and not Q.
Can get Q via Q = AR⁻¹ with another MapReduce iteration.
Numerical stability: dubious. ‖QᵀQ − I‖ is large, although iterative refinement helps.
22. Achieving numerical stability
[Plot: ‖QᵀQ − I‖ versus condition number (10⁵ to 10²⁰) for AR⁻¹, AR⁻¹ + iterative refinement, and Direct TSQR.]
24. Full code in hadoopy

import random, numpy, hadoopy

class SerialTSQR:
    def __init__(self, blocksize, isreducer):
        self.bsize = blocksize
        self.data = []
        if isreducer: self.__call__ = self.reducer
        else: self.__call__ = self.mapper

    def compress(self):
        R = numpy.linalg.qr(numpy.array(self.data), 'r')
        # reset data and re-initialize to R
        self.data = []
        for row in R:
            self.data.append([float(v) for v in row])

    def collect(self, key, value):
        self.data.append(value)
        if len(self.data) > self.bsize * len(self.data[0]):
            self.compress()

    def close(self):
        self.compress()
        for row in self.data:
            key = random.randint(0, 2000000000)
            yield key, row

    def mapper(self, key, value):
        self.collect(key, value)

    def reducer(self, key, values):
        for value in values: self.mapper(key, value)

if __name__ == '__main__':
    mapper = SerialTSQR(blocksize=3, isreducer=False)
    reducer = SerialTSQR(blocksize=3, isreducer=True)
    hadoopy.run(mapper, reducer)
25. Fault injection
[Plot: time to completion (sec) versus 1/Prob(failure), the mean number of successes per failure (10 to 1000), with and without faults, for 200M-by-200 and 800M-by-10 matrices.]
With 1/5 of tasks failing, the job only takes twice as long.
26. How to get Q?
27. Idea 1 (unstable)
TSQR gives R; distribute R⁻¹ to the mappers; each mapper computes Qi = Ai R⁻¹ (for A1, …, A4).
28. Idea 2 (better)
There is a famous quote, attributed to Parlett, that "two iterations of iterative refinement are enough."
Mapper 1: TSQR gives R; distribute R⁻¹; each mapper computes Qi = Ai R⁻¹.
Mapper 2: TSQR of the computed Q gives T; distribute T⁻¹; each mapper refines Qi ← Qi T⁻¹.
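A numpy sketch of the two passes (the ill-conditioned test matrix and its shapes are assumptions for illustration): the indirect Q = AR⁻¹ loses orthogonality when A is badly conditioned, and one refinement pass recovers it.

```python
import numpy as np

rng = np.random.default_rng(3)
# An ill-conditioned tall-and-skinny A (columns scaled over 5 orders of magnitude).
A = rng.standard_normal((500, 6)) @ np.diag(10.0 ** -np.arange(6))

# Pass 1: TSQR gives R; each mapper forms Q = A R^{-1}.
R = np.linalg.qr(A, mode='r')
Q = np.linalg.solve(R.T, A.T).T        # A R^{-1} via a triangular solve

# Pass 2 (refinement): QR of the computed Q gives T; update Q <- Q T^{-1}.
T = np.linalg.qr(Q, mode='r')
Q2 = np.linalg.solve(T.T, Q.T).T

err1 = np.linalg.norm(Q.T @ Q - np.eye(6))
err2 = np.linalg.norm(Q2.T @ Q2 - np.eye(6))
print(err1, err2)                      # refinement improves orthogonality
```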
29. Communication-avoiding TSQR (Demmel et al. 2008)
First, do QR factorizations of each local matrix. Second, compute a QR factorization of the new "R".
Demmel et al. 2008. Communication-avoiding parallel and sequential QR.
30. Idea 3 (best!)
1. Output the local Q and R in separate files: each mapper computes Ai = Qi Ri, writing Q1, …, Q4 to the Q output and R1, …, R4 to the R output.
2. Collect R on one node and compute the Qs for each piece: Task 2 stacks R1, …, R4 and computes one QR, giving the final R and the pieces Q11, Q21, Q31, Q41.
3. Distribute the pieces of Q*1 and form the true Q: Mapper 3 computes Qi · Qi1.
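The three steps above can be sketched in numpy (two blocks instead of four, with toy sizes; the variable names follow the slide): the reassembled Q is an exact orthogonal factor of A.

```python
import numpy as np

rng = np.random.default_rng(4)
A = rng.standard_normal((600, 5))
A1, A2 = A[:300], A[300:]

# 1. Each mapper outputs its local Q and R separately.
Q1, R1 = np.linalg.qr(A1)
Q2, R2 = np.linalg.qr(A2)

# 2. Collect the R factors on one node; one small QR gives the final R
#    and a Q piece (Q11, Q21) for each mapper.
Qs, R = np.linalg.qr(np.vstack([R1, R2]))
Q11, Q21 = Qs[:5], Qs[5:]

# 3. Distribute the pieces; each mapper forms its slice of the true Q
#    by a local multiply.
Q = np.vstack([Q1 @ Q11, Q2 @ Q21])

print(np.allclose(Q @ R, A))                 # reproduces A
print(np.linalg.norm(Q.T @ Q - np.eye(5)))   # orthogonality of the assembled Q
```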
31. The price is right!
[Plot: running time in seconds, 500 to 2500.]
Full TSQR is faster than refinement for few columns, and not any slower for many columns.
32. What can we do now?
33. PCA of 80,000,000 images
A: 80,000,000 images as rows, 1000 pixels per image as columns, with the rows shifted to zero mean.
MapReduce: TSQR produces R. Post-processing: the SVD of R yields V (the principal components) and the top 100 singular values; the first 16 columns of V are shown as images.
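The post-processing step works because, with A = QR, the SVD of the small n-by-n R already carries A's singular values and right singular vectors. A numpy sketch (a tiny random matrix standing in for the 80M-image A):

```python
import numpy as np

rng = np.random.default_rng(5)
A = rng.standard_normal((10000, 20))
A = A - A.mean(axis=0)              # center: subtract the mean "image"

# TSQR produces R; the SVD of the tiny R gives A's singular values and V.
R = np.linalg.qr(A, mode='r')
_, S, Vt = np.linalg.svd(R)         # rows of Vt are the principal components

# Check against a full SVD of A (feasible only at this toy size).
S_full = np.linalg.svd(A, compute_uv=False)
print(np.allclose(S, S_full))
```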
34. A Large Scale Example
- Nonlinear heat transfer model
- 80k nodes, 300 time-steps
- 104 basis runs
- SVD of a 24M × 104 data matrix
- 500× reduction in wall clock time (100× including the SVD)
David Gleich · Purdue
ICASSP
35. What's next?
Investigate randomized algorithms for computing SVDs of fatter matrices (Halko, Martinsson, Tropp, SIREV 2011).
Algorithm: Randomized PCA. Given an m × n matrix A, the number k of principal components, and an exponent q, this procedure computes an approximate rank-2k factorization UΣV*. The columns of U estimate the first 2k principal components of A.
Stage A:
1. Generate an n × 2k Gaussian test matrix Ω.
2. Form Y = (AA*)^q AΩ by multiplying alternately with A and A*.
3. Construct a matrix Q whose columns form an orthonormal basis for the range of Y.
Stage B:
1. Form B = Q*A.
2. Compute an SVD of the small matrix: B = ŨΣV*.
3. Set U = QŨ.
(The singular spectrum of the data matrix often decays quite slowly.)
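A numpy sketch of the Halko–Martinsson–Tropp procedure as stated above (toy sizes and a synthetic test matrix with a known decaying spectrum are assumptions; a practical code would also re-orthogonalize between the power multiplications to limit round-off):

```python
import numpy as np

def randomized_pca(A, k, q, rng):
    """Approximate rank-2k factorization U S Vt of A (HMT, SIREV 2011)."""
    m, n = A.shape
    # Stage A: randomized range finder with q power iterations.
    Omega = rng.standard_normal((n, 2 * k))     # n x 2k Gaussian test matrix
    Y = A @ Omega
    for _ in range(q):                          # Y = (A A*)^q A Omega
        Y = A @ (A.T @ Y)
    Q, _ = np.linalg.qr(Y)                      # orthonormal basis for range(Y)
    # Stage B: SVD of the small matrix B = Q* A.
    B = Q.T @ A
    U_hat, S, Vt = np.linalg.svd(B, full_matrices=False)
    return Q @ U_hat, S, Vt

# Synthetic A with a known, geometrically decaying singular spectrum.
rng = np.random.default_rng(7)
s = 0.8 ** np.arange(100)
U0, _ = np.linalg.qr(rng.standard_normal((500, 100)))
V0, _ = np.linalg.qr(rng.standard_normal((100, 100)))
A = U0 @ np.diag(s) @ V0.T

U, S, Vt = randomized_pca(A, k=10, q=2, rng=rng)
print(np.max(np.abs(S[:10] - s[:10]) / s[:10]))   # relative error in top 10
```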
36. Questions?
Most recent code at http://github.com/arbenson/mrtsqr