What you can do with a tall-and-skinny QR factorization in Hadoop: Principal components and large regressions

Some techniques that work with the tall-and-skinny QR factorization of a matrix.

  • Speaker note: I think this took 30 minutes using our slowest codes. Our fastest codes should take it down to about 3-4 minutes. You’ll probably wait longer to get your job scheduled.

    1. What you can do with a Tall-and-Skinny QR Factorization on Hadoop: Large regressions, Principal Components
       Slides: bit.ly/16LS8Vk · @dgleich
       Code: github.com/dgleich/mrtsqr · dgleich@purdue.edu
       David F. Gleich, Assistant Professor, Computer Science, Purdue University
    2. Why you should stay…
       - you like advanced machine learning techniques
       - you want to understand how to compute the singular values and vectors of a huge matrix (that’s tall and skinny)
       - you want to learn about large-scale regression and principal components from a matrix perspective
    3. What I’m going to assume
       - you know MapReduce
       - Python
       - some simple matrix manipulation
    4. Tall-and-Skinny matrices (m ≫ n)
       Many rows (like a billion), a few columns (under 10,000). Example data from the tinyimages collection.
       Used in: regression and general linear models with many samples, block iterative methods, panel factorizations, approximate kernel k-means, big-data SVD/PCA.
    5. If you have tons of small records, then there is probably a tall-and-skinny matrix somewhere.
    6. Tall-and-skinny matrices are common in BigData
       A: m × n, m ≫ n, stored as submatrices A1, A2, A3, A4, …
       The key is an arbitrary row-id; the value is the 1 × n array for a row. Each submatrix Ai is the input to a map task.
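       A minimal sketch of this layout (mine, not from the talk): each record is a (row-id, row) pair, and the emit callback stands in for whatever record writer (TypedBytes, SequenceFile, …) the job actually uses.

           import random
           import numpy

           def generate_records(m, n, emit):
               """Emit m records, each an (arbitrary row-id, 1-by-n list of floats) pair."""
               for i in range(m):
                   key = random.randint(0, 2000000000)   # arbitrary row-id, as on the slide
                   row = numpy.random.randn(n).tolist()  # the 1 x n array for a row
                   emit(key, row)

           # Tiny usage example: collect a few records in memory.
           records = []
           generate_records(m=10, n=4, emit=lambda k, v: records.append((k, v)))
           print(records[0])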
    7. PCA of 80,000,000 images
       A: 80,000,000 images by 1000 pixels.
       [Figure: fraction of variance captured versus number of principal components (0–100), and the first 16 columns of V rendered as images. Caption: “Figure 5: The 16 most important principal component basis functions (by row…)”]
       Constantine & Gleich, MapReduce 2011.
    8. Regression with 80,000,000 images
       Model the sum of red-pixel values in each image as a linear combination of the gray values in each image. Formally, if $r_i$ is the sum of the red components in all pixels of image i, and $G_{i,j}$ is the gray value of the jth pixel in image i, then we wanted to find $\min_s \sum_i \big(r_i - \sum_j G_{i,j} s_j\big)^2$. There is no particular importance to this regression problem; we use it merely as a demonstration.
       [The coefficients $s_j$ are displayed as an image at the right.] They reveal regions of the image that are not as important in determining the overall red component of an image; the regression uses the grayscale pixels only. A color scale varies from light blue (strongly negative) to red (strongly positive). We get a measure of how much “redness” each pixel contributes to the whole.
       The computation took 30 minutes using the Dumbo framework and a two-iteration job with 250 intermediate reducers.
       We also solved a principal component problem to find a principal component basis for each image. Let G be the matrix of $G_{i,j}$’s from the regression and let $u_i$ be the mean of the ith …
    9. Let’s talk about QR!
    10. QR Factorization and the Gram-Schmidt process
        Consider a set of vectors v1 to vn. Set u1 to be v1. Create a new vector u2 by removing any “component” of u1 from v2. Create a new vector u3 by removing any “component” of u1 and u2 from v3. … the “Gram-Schmidt process” (from Wikipedia).
    11. QR Factorization and the Gram-Schmidt process
        $v_1 = a_1 u_1$
        $v_2 = b_1 u_1 + b_2 u_2$
        $v_3 = c_1 u_1 + c_2 u_2 + c_3 u_3$
        $$\begin{bmatrix} v_1 & v_2 & v_3 & \cdots \end{bmatrix} = \begin{bmatrix} u_1 & u_2 & u_3 & \cdots \end{bmatrix} \begin{bmatrix} a_1 & b_1 & c_1 & \cdots \\ 0 & b_2 & c_2 & \cdots \\ 0 & 0 & c_3 & \cdots \\ & & & \ddots \end{bmatrix}$$
    12. QR Factorization and the Gram-Schmidt process
        $v_1 = a_1 u_1$, $v_2 = b_1 u_1 + b_2 u_2$, $v_3 = c_1 u_1 + c_2 u_2 + c_3 u_3$
        For this problem, V = UR. All vectors in U are at right angles, i.e., they are decoupled.
        What it’s usually written as by others: A = QR.
    13. QR Factorization and the Gram-Schmidt process
        [Figure: the tall-and-skinny A drawn as the product of a tall-and-skinny Q and a small upper-triangular R: A = QR.]
        All vectors in U are at right angles, i.e., they are decoupled.
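        A small numpy check of slides 10–13 (my sketch, not the talk’s code). numpy’s qr uses Householder reflections rather than Gram-Schmidt, but it produces the same kind of factorization A = QR, with “right angle” (orthonormal) columns in Q:

            import numpy

            numpy.random.seed(0)
            A = numpy.random.randn(1000, 5)   # a tall-and-skinny V = [v1 v2 ...]

            Q, R = numpy.linalg.qr(A)

            print(numpy.allclose(A, Q @ R))               # A = QR
            print(numpy.allclose(Q.T @ Q, numpy.eye(5)))  # columns of Q are decoupled
            print(numpy.allclose(R, numpy.triu(R)))       # R is upper triangular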
    14. PCA of 80,000,000 images
        [Pipeline diagram: A (80,000,000 images × 1000 pixels) → zero-mean the rows → TSQR → R, all in MapReduce; then in post-processing, an SVD of R gives V (the principal components), the top 100 singular values, and the first 16 columns of V as images.]
        Constantine & Gleich, MapReduce 2010.
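        The post-processing step, sketched in numpy under the assumption that the R factor has already come out of the TSQR job: if A = QR and R = U_R Σ Vᵀ, then A = (Q U_R) Σ Vᵀ, so the SVD of the small R yields A’s singular values and principal components.

            import numpy

            # Stand-in for the MapReduce part: zero-mean rows, then the R from TSQR.
            numpy.random.seed(0)
            A = numpy.random.randn(100000, 50)
            A = A - A.mean(axis=0)             # the "zero mean rows" step: subtract the mean row
            R = numpy.linalg.qr(A, mode='r')   # in the real pipeline, TSQR computes this

            # One-machine post-processing: SVD of the small n-by-n R.
            _, sigma, Vt = numpy.linalg.svd(R)
            print(sigma[:5])                   # top singular values of A
            components = Vt.T[:, :16]          # analogue of "first 16 columns of V"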
    15. Input: 500,000,000-by-100 matrix
        Each record: 1-by-100 row
        HDFS size: 423.3 GB
        Time to compute colsum(A): 161 sec.
        Time to compute R in qr(A): 387 sec.
    16. The rest of the talk! Full TSQR code in hadoopy

        import random, numpy, hadoopy

        class SerialTSQR:
            def __init__(self, blocksize, isreducer):
                self.bsize = blocksize
                self.data = []
                if isreducer:
                    self.__call__ = self.reducer
                else:
                    self.__call__ = self.mapper

            def compress(self):
                # QR of the locally buffered rows; keep only the R factor.
                R = numpy.linalg.qr(numpy.array(self.data), 'r')
                # reset data and re-initialize to R
                self.data = []
                for row in R:
                    self.data.append([float(v) for v in row])

            def collect(self, key, value):
                self.data.append(value)
                if len(self.data) > self.bsize * len(self.data[0]):
                    self.compress()

            def close(self):
                self.compress()
                for row in self.data:
                    key = random.randint(0, 2000000000)
                    yield key, row

            def mapper(self, key, value):
                self.collect(key, value)

            def reducer(self, key, values):
                for value in values:
                    self.mapper(key, value)

        if __name__ == '__main__':
            mapper = SerialTSQR(blocksize=3, isreducer=False)
            reducer = SerialTSQR(blocksize=3, isreducer=True)
            hadoopy.run(mapper, reducer)
    17. Communication avoiding QR (Demmel et al. 2008) on MapReduce (Constantine and Gleich, 2010)
        Algorithm: the data are rows of a matrix. Map: QR factorization of the rows. Reduce: QR factorization of the rows.
        [Diagram: Mapper 1 (serial TSQR) folds A1…A4 through repeated qr steps (Q2 R2, Q3 R3, Q4 R4) and emits R4; Mapper 2 does the same for A5…A8 and emits R8; Reducer 1 (serial TSQR) factors the stacked R4 and R8 and emits the final Q, R.]
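        The same tree, as an in-memory numpy sketch (mine, not the talk’s code): local QRs on the row blocks, then one QR of the stacked R factors. The result matches the R of the full matrix up to the signs of its rows.

            import numpy

            def tsqr_r(A, nblocks):
                """R factor of A via the TSQR tree: map = local QR, reduce = QR of stacked Rs."""
                blocks = numpy.array_split(A, nblocks)                  # the Ai given to map tasks
                Rs = [numpy.linalg.qr(Ai, mode='r') for Ai in blocks]   # map: qr of rows
                return numpy.linalg.qr(numpy.vstack(Rs), mode='r')      # reduce: qr of the Rs

            numpy.random.seed(1)
            A = numpy.random.randn(8000, 10)
            R_tree = tsqr_r(A, nblocks=8)
            R_full = numpy.linalg.qr(A, mode='r')
            print(numpy.allclose(numpy.abs(R_tree), numpy.abs(R_full)))  # equal up to row signs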
    18. The rest of the talk! Full TSQR code in hadoopy (the same code as on slide 16, shown again).
    19. Too many maps cause too much data to one reducer!
        Each image is 5k. Each HDFS block has 12,800 images; 6,250 total blocks. Each map outputs a 1000-by-1000 matrix, so one reducer gets a 6.25M-by-1000 matrix (6,250 maps × 1,000 rows each; at 8 bytes per double, about 50 GB).
    20. [Diagram: an iterated TSQR. Iteration 1: Mappers 1-1 through 1-4 run serial TSQR on A1…A4 and emit R1…R4; a shuffle groups them onto Reducers 1-1 through 1-3 (serial TSQR), which emit R2,1, R2,2, R2,3. Iteration 2: an identity map and a second shuffle send those factors to Reducer 2-1 (serial TSQR), which emits the final R.]
    21. Input: 500,000,000-by-100 matrix
        Each record: 1-by-100 row
        HDFS size: 423.3 GB
        Time to compute colsum(A): 161 sec.
        Time to compute R in qr(A): 387 sec.
    22. Hadoop streaming isn’t always slow!
        Synthetic data test on a 100,000,000-by-500 matrix (~500 GB). Codes implemented in MapReduce streaming; the matrix is stored as TypedBytes lists of doubles. The Python frameworks use NumPy + ATLAS; the C++ code uses a custom TypedBytes reader/writer with ATLAS.

        Framework   Iter 1 total (secs.)   Iter 2 total (secs.)   Overall total (secs.)
        Dumbo       960                    217                    1177
        Hadoopy     612                    118                    730
        C++         350                    37                     387
        Java        436                    66                     502
    23. Use multiple iterations for problems with many columns
        Increasing split size improves performance (it accounts for Hadoop data movement). Increasing iterations helps for problems with many columns. (1000 columns with a 64 MB split size overloaded the single reducer.)

        Cols.   Iters.   Split (MB)   Maps   Secs.
        50      1        64           8000   388
        –       –        256          2000   184
        –       –        512          1000   149
        –       2        64           8000   425
        –       –        256          2000   220
        –       –        512          1000   191
        1000    1        512          1000   666
        –       2        64           6000   590
        –       –        256          2000   432
        –       –        512          1000   337
    24. More about how to compute a regression
        $\min_x \|Ax - b\|^2 = \min_x \sum_i \Big(\sum_j A_{ij} x_j - b_i\Big)^2$
        [Diagram: Mapper 1 (serial TSQR) factors the local rows of A while carrying b along: a qr of the stacked rows gives Q2 and R2, and the right-hand side is updated as b2 = Q2ᵀ b1; the process continues with A3, A4, ….]
    25. TSQR code in hadoopy for regressions

        import random, numpy, hadoopy

        class SerialTSQR:
            def __init__(self, blocksize, isreducer):
                […]  # elided on the slide; as on slide 16, plus a right-hand-side buffer self.rhs

            def compress(self):
                # Keep Q this time so the right-hand side can be transformed too.
                Q, R = numpy.linalg.qr(numpy.array(self.data), 'full')
                # reset data and re-initialize to R
                self.data = []
                for row in R:
                    self.data.append([float(v) for v in row])
                self.rhs = list(numpy.dot(Q.T, numpy.array(self.rhs)))

            def collect(self, key, valuerhs):
                # valuerhs is a (row, rhs-entry) pair
                self.data.append(valuerhs[0])
                self.rhs.append(valuerhs[1])
                if len(self.data) > self.bsize * len(self.data[0]):
                    self.compress()

            def close(self):
                self.compress()
                for i, row in enumerate(self.data):
                    key = random.randint(0, 2000000000)
                    yield key, (row, self.rhs[i])

            def mapper(self, key, value):
                self.collect(key, unpack(value))  # unpack: deserialize value into (row, rhs-entry)

            def reducer(self, key, values):
                for value in values:
                    self.collect(key, unpack(value))

        if __name__ == '__main__':
            mapper = SerialTSQR(blocksize=3, isreducer=False)
            reducer = SerialTSQR(blocksize=3, isreducer=True)
            hadoopy.run(mapper, reducer)
    26. More about how to compute a regression
        $\min_x \|Ax - b\|^2 = \min_x \|QRx - b\|^2 = \min_x \|Q^T Q R x - Q^T b\|^2 = \min_x \|Rx - Q^T b\|^2$
        Orthogonal or “right angle” matrices don’t change vector magnitude. This is a tiny linear system!

        def compute_x(output):
            R, y = load_from_hdfs(output)
            x = numpy.linalg.solve(R, y)
            write_output(x, output + '-x')
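        An end-to-end check of this derivation (my sketch, not the talk’s code): accumulate y = Qᵀb alongside the factorization, solve the tiny triangular system, and compare against numpy’s least-squares solver.

            import numpy

            numpy.random.seed(2)
            A = numpy.random.randn(100000, 20)
            b = numpy.random.randn(100000)

            Q, R = numpy.linalg.qr(A)      # the MapReduce job accumulates Q^T b block by block
            y = Q.T @ b                    # the "rhs" carried through the regression TSQR
            x = numpy.linalg.solve(R, y)   # the tiny linear system

            x_ref = numpy.linalg.lstsq(A, b, rcond=None)[0]
            print(numpy.allclose(x, x_ref))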
    27. We do a similar step for the PCA and compute the 1000-by-1000 SVD on one machine.
    28. Getting the matrix Q is tricky!
    29. What about the matrix Q?
        We want Q to be numerically orthogonal: the measure is norm(QᵀQ − I). The condition number of A measures problem sensitivity.
        [Plot: loss of orthogonality versus condition number, from 10⁵ to 10²⁰, comparing prior work (AR⁻¹; Constantine & Gleich, MapReduce 2011) with AR⁻¹ + iterative refinement and Direct TSQR (Benson, Gleich, Demmel, submitted).]
        Prior methods all failed without any warning.
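        A sketch of the failure mode (assumptions mine: a synthetic matrix with a prescribed condition number). Forming Q as AR⁻¹ loses orthogonality as the condition number grows, while Householder QR stays near machine precision; norm(QᵀQ − I) is the measure from the slide.

            import numpy

            def ill_conditioned(m, n, cond):
                """Tall matrix with a prescribed condition number, built from its SVD."""
                U = numpy.linalg.qr(numpy.random.randn(m, n))[0]
                V = numpy.linalg.qr(numpy.random.randn(n, n))[0]
                s = numpy.logspace(0, -numpy.log10(cond), n)
                return U @ numpy.diag(s) @ V.T

            numpy.random.seed(3)
            A = ill_conditioned(10000, 20, cond=1e12)
            I = numpy.eye(20)

            R = numpy.linalg.qr(A, mode='r')
            Q_ar = A @ numpy.linalg.inv(R)   # the "A R^{-1}" construction
            Q_hh = numpy.linalg.qr(A)[0]     # Householder QR, for comparison

            print(numpy.linalg.norm(Q_ar.T @ Q_ar - I))   # noticeably far from zero
            print(numpy.linalg.norm(Q_hh.T @ Q_hh - I))   # ~machine precision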
    30. Taking care of business by keeping track of Q
        1. Output the local Q and R in separate files.
        2. Collect R on one node and compute the Qs for each piece.
        3. Distribute the pieces of Q and form the true Q.
        [Diagram: Mapper 1 factors A1…A4 into Q1…Q4 and R1…R4; Task 2 stacks R1…R4, factors the stack into R, and splits its small Q into Q11, Q21, Q31, Q41; Mapper 3 multiplies each local Qi by its piece Qi1 to form the final Q.]
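        The three steps as an in-memory numpy sketch (the real version writes the per-block Q factors to files rather than holding them in memory):

            import numpy

            def direct_tsqr(blocks):
                """Direct TSQR: recover both Q and R from row blocks of A."""
                # Step 1: local QR on each block; keep Qi locally, send Ri onward.
                local = [numpy.linalg.qr(Ai) for Ai in blocks]
                # Step 2: collect the Ri on one node, factor the stack, and slice
                # the small Q into one n-by-n piece per block.
                Rs = [Ri for _, Ri in local]
                Q_small, R = numpy.linalg.qr(numpy.vstack(Rs))
                n = R.shape[0]
                pieces = [Q_small[i * n:(i + 1) * n, :] for i in range(len(blocks))]
                # Step 3: distribute the pieces and form the true Q block by block.
                Q = numpy.vstack([Qi @ piece for (Qi, _), piece in zip(local, pieces)])
                return Q, R

            numpy.random.seed(4)
            A = numpy.random.randn(4000, 8)
            Q, R = direct_tsqr(numpy.array_split(A, 4))
            print(numpy.allclose(Q @ R, A))                    # A = QR
            print(numpy.linalg.norm(Q.T @ Q - numpy.eye(8)))   # numerically orthogonal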
    31. Code available from github.com/arbenson/mrtsqr … it isn’t too bad.
    32. Future work … more columns!
        With ~3000 columns, one 64 MB chunk is a local QR computation. Could “iterate in blocks of 3000” columns to continue … maybe “efficient” for 10,000 columns. Need different ideas for 100,000 columns (randomized methods?).
    33. Questions?
        www.cs.purdue.edu/~dgleich · @dgleich · dgleich@purdue.edu
