What you can do with a Tall-and-Skinny
QR Factorization on Hadoop:
Large regressions, Principal Components

Slides  bit.ly/16LS8Vk
Code    github.com/dgleich/mrtsqr

@dgleich · dgleich@purdue.edu

DAVID F. GLEICH
ASSISTANT PROFESSOR
COMPUTER SCIENCE, PURDUE UNIVERSITY





Why you should stay …
you like advanced machine learning techniques

you want to understand how to compute the
singular values and vectors of a huge matrix
(that’s tall and skinny)

you want to learn about large-scale regression,
and principal components from a matrix
perspective




What I’m going to assume
you know 


MapReduce

Python

Some simple matrix manipulation




Tall-and-Skinny matrices
(m ≫ n)

Many rows (like a billion)
A few columns (under 10,000)

Used in
  regression and general linear models with many samples
  block iterative methods
  panel factorizations
  approximate kernel k-means
  big-data SVD/PCA

[Figure: the matrix A; image from the tinyimages collection]



If you have tons of small
records, then there is probably
a tall-and-skinny matrix
somewhere




Tall-and-skinny matrices are
common in BigData

A : m × n, m ≫ n

Key is an arbitrary row-id.
Value is the 1 × n array for a row.

Each submatrix Ai is the input to a map task.

[Figure: A split into row blocks A1, A2, A3, A4]




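Concretely, one such record might look like the following sketch (a hypothetical example with made-up numbers, just to pin down the layout):

# One record per row of A (n = 4 here): the key is an arbitrary
# row-id, the value is the 1-by-n row itself.
key = 90210
value = [0.80, 0.13, 0.41, 0.23]
# a map task receives many such records, i.e. a submatrix Ai of A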
PCA of 80,000,000 images

[Figure: A is 80,000,000 images by 1000 pixels. Left: a plot of the
fraction of variance captured against the number of principal
components (20 to 100). Right: the first 16 columns of V shown as
images; the paper’s caption reads “Figure 5: The 16 most important
principal component basis functions (by row…)”.]
  Constantine & Gleich, MapReduce 2011.
Regression with 80,000,000 images

The goal was to approximate how much red there was in a picture from the
grayscale pixels only: predict the sum of red-pixel values in each image
as a linear combination of the gray values in each image. Formally, if
$r_i$ is the sum of the red components in all pixels of image $i$, and
$G_{i,j}$ is the gray value of the $j$th pixel in image $i$, then we
wanted to find $\min_s \sum_i (r_i - \sum_j G_{i,j} s_j)^2$. There is no
particular importance to this regression problem; we use it merely as a
demonstration.

The coefficients $s_j$, displayed as an image, give a measure of how much
“redness” each pixel contributes to the whole. They reveal regions of the
image that are not as important in determining the overall red component
of an image. The color scale varies from light blue (strongly negative)
to red (strongly positive). The computation took 30 minutes using the
Dumbo framework and a two-iteration job with 250 intermediate reducers.

We also solved a principal component problem to find a principal
component basis for each image. Let $G$ be the matrix of $G_{i,j}$’s from
the regression and let $u_i$ be the mean of the $i$th …

[Figure: A is 80,000,000 images by 1000 pixels; the coefficients $s_j$
are displayed as an image at the right.]
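To make the setup concrete, here is a toy-scale version of that regression in numpy (the data is a random stand-in, not the tinyimages data):

import numpy as np

# Hypothetical stand-in data: G holds gray values (images x pixels),
# r holds the per-image sums of red components.
G = np.random.rand(10000, 100)
r = G @ np.random.randn(100) + 0.01 * np.random.randn(10000)

# min over s of sum_i (r_i - sum_j G_ij s_j)^2
s = np.linalg.lstsq(G, r, rcond=None)[0]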
Let’s talk about QR!




QR Factorization and the
Gram-Schmidt process

Consider a set of vectors v1 to vn. Set u1 to be v1.

Create a new vector u2 by removing any “component” of u1 from v2.

Create a new vector u3 by removing any “component” of u1 and u2 from v3.

…
“Gram-Schmidt process” figure from Wikipedia
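As a concrete illustration, here is a minimal numpy sketch of the process just described (classical Gram-Schmidt with normalization; fine for intuition, though not the numerically robust way to compute a QR):

import numpy as np

def gram_schmidt(V):
    """Orthogonalize the columns of V, left to right."""
    U = np.zeros_like(V, dtype=float)
    for j in range(V.shape[1]):
        u = V[:, j].astype(float)
        for k in range(j):
            # remove the "component" of u_k from v_j
            u -= (U[:, k] @ V[:, j]) * U[:, k]
        U[:, j] = u / np.linalg.norm(u)
    return U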
QR Factorization and the
Gram-Schmidt process

v1 = a1 u1
v2 = b1 u1 + b2 u2
v3 = c1 u1 + c2 u2 + c3 u3

$$
\begin{bmatrix} v_1 & v_2 & v_3 & \cdots \end{bmatrix}
=
\begin{bmatrix} u_1 & u_2 & u_3 & \cdots \end{bmatrix}
\begin{bmatrix}
a_1 & b_1 & c_1 & \cdots \\
0   & b_2 & c_2 & \cdots \\
0   & 0   & c_3 & \cdots \\
\vdots & \vdots & \vdots & \ddots
\end{bmatrix}
$$




QR Factorization and the
Gram-Schmidt process

v1 = a1 u1
v2 = b1 u1 + b2 u2
v3 = c1 u1 + c2 u2 + c3 u3

For this problem:                         V = UR
What it’s usually written as by others:   A = QR

All vectors in U are at right angles, i.e. they are decoupled.




QR Factorization and the
Gram-Schmidt process

v1 = a1 u1
v2 = b1 u1 + b2 u2
v3 = c1 u1 + c2 u2 + c3 u3

[Figure: A = QR drawn block-wise; Q is tall-and-skinny like A, and R
is small and upper triangular.]

All vectors in U are at right angles, i.e. they are decoupled.




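A quick numpy check of these properties, on a small random stand-in matrix:

import numpy as np

A = np.random.randn(6, 3)        # a tiny tall-and-skinny matrix
Q, R = np.linalg.qr(A)           # Q: 6x3, R: 3x3 upper triangular

print(np.allclose(A, Q @ R))             # True: A = QR
print(np.allclose(Q.T @ Q, np.eye(3)))   # True: columns of Q are decoupled
print(np.allclose(R, np.triu(R)))        # True: R is upper triangular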
PCA of 80,000,000 images

[Pipeline: A (80,000,000 images by 1000 pixels) → zero-mean the rows to
get X → TSQR of X produces R (the MapReduce stage) → the SVD of R gives
the top 100 singular values and V, the principal components (the
post-processing stage); the first 16 columns of V are shown as images.]




  Constantine & Gleich, MapReduce 2010.
Input                          500,000,000-by-100 matrix
Each record                    1-by-100 row
HDFS size                      423.3 GB
Time to compute colsum(A)      161 sec.
Time to compute R in qr(A)     387 sec.





The rest of the talk:
Full TSQR code in hadoopy

import random, numpy, hadoopy

class SerialTSQR:
    def __init__(self, blocksize, isreducer):
        self.bsize = blocksize
        self.data = []
        if isreducer: self.__call__ = self.reducer
        else: self.__call__ = self.mapper

    def compress(self):
        R = numpy.linalg.qr(numpy.array(self.data), 'r')
        # reset data and re-initialize to R
        self.data = []
        for row in R:
            self.data.append([float(v) for v in row])

    def collect(self, key, value):
        self.data.append(value)
        if len(self.data) > self.bsize * len(self.data[0]):
            self.compress()

    def close(self):
        self.compress()
        for row in self.data:
            key = random.randint(0, 2000000000)
            yield key, row

    def mapper(self, key, value):
        self.collect(key, value)

    def reducer(self, key, values):
        for value in values: self.mapper(key, value)

if __name__ == '__main__':
    mapper = SerialTSQR(blocksize=3, isreducer=False)
    reducer = SerialTSQR(blocksize=3, isreducer=True)
    hadoopy.run(mapper, reducer)




Communication avoiding QR (Demmel et al. 2008)
on MapReduce (Constantine and Gleich, 2010)

Algorithm
  Data     Rows of a matrix
  Map      QR factorization of rows
  Reduce   QR factorization of rows

[Figure: Mapper 1 (Serial TSQR) streams in A1 through A4: qr of
[A1; A2] gives Q2, R2; qr of [R2; A3] gives Q3, R3; qr of [R3; A4]
gives Q4, R4; emit R4. Mapper 2 (Serial TSQR) does the same for A5
through A8 and emits R8. Reducer 1 (Serial TSQR) computes qr of
[R4; R8] and emits the final Q, R.]




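The identity behind the diagram is that QR-ing the stacked R factors reproduces the R of the full matrix. A small serial numpy sanity check of that idea, with random stand-in data:

import numpy as np

A = np.random.randn(1000, 8)
blocks = np.array_split(A, 4)                     # row blocks, as in the map tasks

Rs = [np.linalg.qr(B, mode='r') for B in blocks]  # "map": QR of each block
R = np.linalg.qr(np.vstack(Rs), mode='r')         # "reduce": QR of the stacked R's

R_direct = np.linalg.qr(A, mode='r')
print(np.allclose(np.abs(R), np.abs(R_direct)))   # True (R is unique up to row signs)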
The rest of the talk:
Full TSQR code in hadoopy

import random, numpy, hadoopy

class SerialTSQR:
    def __init__(self, blocksize, isreducer):
        self.bsize = blocksize
        self.data = []
        if isreducer: self.__call__ = self.reducer
        else: self.__call__ = self.mapper

    def compress(self):
        R = numpy.linalg.qr(numpy.array(self.data), 'r')
        # reset data and re-initialize to R
        self.data = []
        for row in R:
            self.data.append([float(v) for v in row])

    def collect(self, key, value):
        self.data.append(value)
        if len(self.data) > self.bsize * len(self.data[0]):
            self.compress()

    def close(self):
        self.compress()
        for row in self.data:
            key = random.randint(0, 2000000000)
            yield key, row

    def mapper(self, key, value):
        self.collect(key, value)

    def reducer(self, key, values):
        for value in values: self.mapper(key, value)

if __name__ == '__main__':
    mapper = SerialTSQR(blocksize=3, isreducer=False)
    reducer = SerialTSQR(blocksize=3, isreducer=True)
    hadoopy.run(mapper, reducer)




Too many maps cause too
much data to one reducer!

Each image is 5k.
Each HDFS block has 12,800 images.
6,250 total blocks.
Each map outputs a 1000-by-1000 matrix.
One reducer gets a 6.25M-by-1000 matrix (50GB).




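The arithmetic behind those numbers (assuming 64 MB HDFS blocks and 8-byte doubles):

images = 80_000_000
image_bytes = 5 * 1000                           # each image is ~5k
block_bytes = 64 * 1000 * 1000                   # one 64 MB HDFS block

images_per_block = block_bytes // image_bytes    # 12,800
blocks = images // images_per_block              # 6,250 map tasks

# each map emits a 1000-by-1000 R; a single reducer sees all of them
rows = blocks * 1000                             # 6,250,000
gigabytes = rows * 1000 * 8 / 1e9                # 50.0
print(images_per_block, blocks, rows, gigabytes)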
[Figure: the two-iteration TSQR. Iteration 1: Mappers 1-1 through 1-4
(each a Serial TSQR) read A1 through A4 and emit R1 through R4; after a
shuffle, Reducers 1-1 through 1-3 (each a Serial TSQR) emit R2,1, R2,2,
R2,3. Iteration 2: an identity map and a shuffle send everything to
Reducer 2-1 (Serial TSQR), which emits the final R.]
Input                          500,000,000-by-100 matrix
Each record                    1-by-100 row
HDFS size                      423.3 GB
Time to compute colsum(A)      161 sec.
Time to compute R in qr(A)     387 sec.





Hadoop streaming isn’t
always slow!
Synthetic data test on 100,000,000-by-500 matrix (~500GB)
Codes implemented in MapReduce streaming
Matrix stored as TypedBytes lists of doubles
Python frameworks use Numpy+ATLAS matrix.
Custom C++ TypedBytes reader/writer with ATLAS matrix.

           Iter 1          Iter 2          Overall
           Total (secs.)   Total (secs.)   Total (secs.)
Dumbo       960             217             1177
Hadoopy     612             118              730
C++         350              37              387
Java        436              66              502




Use multiple iterations for
problems with many columns

 Cols.   Iters.   Split (MB)   Maps   Secs.
   50      1          64       8000    388
    –      –         256       2000    184
    –      –         512       1000    149
    –      2          64       8000    425
    –      –         256       2000    220
    –      –         512       1000    191
 1000      1         512       1000    666
    –      2          64       6000    590
    –      –         256       2000    432
    –      –         512       1000    337

Increasing split size improves performance (it accounts for Hadoop data
movement).

Increasing iterations helps for problems with many columns. (1000
columns with a 64-MB split size overloaded the single reducer.)




More about how to
compute a regression

$$\min \|Ax - b\|^2 = \min \sum_i \Big( \sum_j A_{ij} x_j - b_i \Big)^2$$

[Figure: the Serial TSQR mapper, now carrying the right-hand side too:
qr of [A1; A2] gives Q2 and R2, and the mapper keeps b2 = Q2^T b1
alongside R2 as it streams in A3, A4, … from A and b.]




TSQR code in hadoopy for
regressions

import random, numpy, hadoopy

class SerialTSQR:
    def __init__(self, blocksize, isreducer):
        […]

    def compress(self):
        Q, R = numpy.linalg.qr(numpy.array(self.data), 'full')
        # reset data and re-initialize to R
        self.data = []
        for row in R:
            self.data.append([float(v) for v in row])
        self.rhs = list(numpy.dot(Q.T, numpy.array(self.rhs)))

    def collect(self, key, valuerhs):
        self.data.append(valuerhs[0])
        self.rhs.append(valuerhs[1])
        if len(self.data) > self.bsize * len(self.data[0]):
            self.compress()

    def close(self):
        self.compress()
        for i, row in enumerate(self.data):
            key = random.randint(0, 2000000000)
            yield key, (row, self.rhs[i])

    def mapper(self, key, value):
        self.collect(key, unpack(value))

    def reducer(self, key, values):
        for value in values: self.mapper(key, unpack(value))

if __name__ == '__main__':
    mapper = SerialTSQR(blocksize=3, isreducer=False)
    reducer = SerialTSQR(blocksize=3, isreducer=True)
    hadoopy.run(mapper, reducer)




More about how to
compute a regression

$$\min \|Ax - b\|^2 = \min \|QRx - b\|^2$$

Orthogonal or “right angle” matrices don’t change vector magnitude, so

$$\min \|QRx - b\|^2 = \min \|Q^T QRx - Q^T b\|^2 = \min \|Rx - Q^T b\|^2$$

This is a tiny linear system!

def compute_x(output):
    R, y = load_from_hdfs(output)
    x = numpy.linalg.solve(R, y)
    write_output(x, output + '-x')

[Figure: the QR-for-regression pipeline maps A to R and b to Q^T b.]




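A tiny numpy check that solving with R and Q^T b reproduces the least-squares solution (random stand-in data; load_from_hdfs and write_output above are the talk's helpers and are not shown here):

import numpy as np

A = np.random.randn(500, 5)
b = np.random.randn(500)

Q, R = np.linalg.qr(A)
x = np.linalg.solve(R, Q.T @ b)          # the tiny n-by-n solve

x_ref = np.linalg.lstsq(A, b, rcond=None)[0]
print(np.allclose(x, x_ref))             # True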
We do a similar step for the
PCA and compute the 1000-
by-1000 SVD on one machine




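In the same spirit, a serial sketch of that step: once TSQR has reduced A to the small factor R, the SVD of R gives A's singular values and right singular vectors. The sizes here are shrunk stand-ins so it runs anywhere:

import numpy as np

m, n = 100000, 50                       # stand-in for 80,000,000 x 1000
A = np.random.randn(m, n)
A = A - A.mean(axis=0)                  # center the data, as in the zero-mean step

R = np.linalg.qr(A, mode='r')           # in the talk, R comes from MapReduce TSQR
_, svals, Vt = np.linalg.svd(R)         # an n-by-n SVD: cheap on one machine

# the singular values (and V) of R match those of A
print(np.allclose(svals, np.linalg.svd(A, compute_uv=False)))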
Getting the matrix Q is tricky!




What about the matrix Q?

We want Q to be numerically orthogonal.

A condition number measures problem sensitivity.

Prior methods all failed without any warning.

[Figure: norm(Q^T Q – I) versus condition number, from 10^5 to 10^20.
Prior work, AR^-1 (Constantine & Gleich, MapReduce 2011), loses
orthogonality as the condition number grows; AR^-1 + iterative
refinement and Direct TSQR (Benson, Gleich, Demmel, Submitted) stay
numerically orthogonal.]




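A small numpy illustration of why this matters (my own example, not from the talk): forming Q as AR^-1 loses orthogonality in proportion to the condition number, while Householder QR does not.

import numpy as np

n = 20
U, _ = np.linalg.qr(np.random.randn(500, n))
V, _ = np.linalg.qr(np.random.randn(n, n))
A = (U * np.logspace(0, -8, n)) @ V.T        # condition number ~1e8

R = np.linalg.qr(A, mode='r')                # R as TSQR would produce it
Q = A @ np.linalg.inv(R)                     # the AR^-1 construction
print(np.linalg.norm(Q.T @ Q - np.eye(n)))   # grows with cond(A): ~1e-8 here

Qh, _ = np.linalg.qr(A)                      # Householder QR, for comparison
print(np.linalg.norm(Qh.T @ Qh - np.eye(n))) # near machine precision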
Taking care of business by
keeping track of Q

1. Output the local Q and R in separate files
   (Mapper 1: each Ai gives a local Qi and Ri; R output and Q output).

2. Collect R on one node and compute the Qs for each piece
   (Task 2: qr of [R1; R2; R3; R4] gives the final R and the pieces
   Q11, Q21, Q31, Q41).

3. Distribute the pieces of Q*1 and form the true Q
   (Mapper 3: block i of the true Q is Qi times Qi1).




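A serial numpy sketch of those three steps (my own condensation for illustration; the distributed version is the Direct TSQR in the repository below):

import numpy as np

A = np.random.randn(1000, 10)
blocks = np.array_split(A, 4)

# 1. local QR of each block; keep both factors
local = [np.linalg.qr(B) for B in blocks]

# 2. collect the R's, QR the stack, and slice the small Q into
#    one piece per block (Q11, Q21, ...)
Q_small, R = np.linalg.qr(np.vstack([Ri for _, Ri in local]))
pieces = np.split(Q_small, 4)

# 3. distribute: block i of the true Q is (local Qi) @ (piece i)
Q = np.vstack([Qi @ Pi for (Qi, _), Pi in zip(local, pieces)])

print(np.allclose(Q @ R, A))                          # A = QR
print(np.linalg.norm(Q.T @ Q - np.eye(10)) < 1e-12)   # numerically orthogonal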
Code available from
github.com/arbenson/mrtsqr
…
it isn’t too bad.




Future work … more columns!

With ~3000 columns, one 64MB chunk is a local
QR computation. 

Could “iterate in blocks of 3000” columns to
continue … maybe “efficient” for 10,000 columns

Need different ideas for 100,000 columns
(randomized methods?)




Questions?

www.cs.purdue.edu/~dgleich
@dgleich
dgleich@purdue.edu







Editor's Notes

  • #8 I think this took 30 minutes using our slowest codes. Our fastest codes should take it down to about 3-4 minutes. You’ll probably wait longer to get your job scheduled.
  • #14 I think this took 30 minutes using our slowest codes. Our fastest codes should take it down to about 3-4 minutes. You’ll probably wait longer to get your job scheduled.