# What you can do with a tall-and-skinny QR factorization in Hadoop: Principal components and large regressions

Some techniques that work with the tall-and-skinny QR factorization of a matrix.


### What you can do with a tall-and-skinny QR factorization in Hadoop: Principal components and large regressions

1. What you can do with a Tall-and-Skinny QR Factorization on Hadoop: Large regressions, Principal Components. Slides: bit.ly/16LS8Vk · @dgleich · Code: github.com/dgleich/mrtsqr · dgleich@purdue.edu. David F. Gleich, Assistant Professor, Computer Science, Purdue University. (David Gleich · Purdue · bit.ly/16LS8Vk)
2. Why you should stay: you like advanced machine learning techniques; you want to understand how to compute the singular values and vectors of a huge matrix (that's tall and skinny); you want to learn about large-scale regression and principal components from a matrix perspective.
3. What I'm going to assume: you know MapReduce, Python, and some simple matrix manipulation.
4. Tall-and-Skinny matrices (m ≫ n): many rows (like a billion), a few columns (under 10,000). Used for regression and general linear models with many samples (from the tinyimages collection), block iterative methods, panel factorizations, approximate kernel k-means, and big-data SVD/PCA.
5. If you have tons of small records, then there is probably a tall-and-skinny matrix somewhere.
6. Tall-and-skinny matrices are common in BigData. A : m x n, m ≫ n. The key is an arbitrary row-id; the value is the 1 x n array for a row. Each submatrix Ai is the input to a map task.
7. PCA of 80,000,000 images (1000 pixels each). [Figure: fraction of variance captured by the first 100 principal components, and the first 16 columns of V shown as images; caption: "The 16 most important component basis functions (by row)."] Constantine & Gleich, MapReduce 2011.
8. Regression with 80,000,000 images (1000 pixels each). The goal was to express the sum of red-pixel values in each image as a linear combination of the gray values in each image. Formally, if r_i is the sum of the red components in all pixels of image i, and G_ij is the gray value of the jth pixel in image i, then we wanted to find min over s of sum_i (r_i - sum_j G_ij s_j)^2. There is no particular importance to this regression problem; we use it merely as a demonstration. The coefficients s_j, displayed as an image, reveal how much "redness" each pixel contributes to the whole: regions of the image that are not as important in determining the overall red component from the grayscale pixels only. The color scale varies from light blue (strongly negative) through 0 to red (strongly positive). The computation took 30 minutes using the Dumbo framework and a two-iteration job with 250 intermediate reducers. We also solved a principal component problem to find a principal component basis for each image.
9. Let's talk about QR!
10. QR Factorization and the Gram-Schmidt process. Consider a set of vectors v1 to vn. Set u1 to be v1. Create a new vector u2 by removing any "component" of u1 from v2. Create a new vector u3 by removing any "component" of u1 and u2 from v3. … the "Gram-Schmidt process" (from Wikipedia).
11. QR Factorization and the Gram-Schmidt process. v1 = a1 u1; v2 = b1 u1 + b2 u2; v3 = c1 u1 + c2 u2 + c3 u3. In matrix form, [v1 v2 v3 …] = [u1 u2 u3 …] R, where R is the upper-triangular matrix whose columns are (a1, 0, 0, …), (b1, b2, 0, …), (c1, c2, c3, …), and so on.
12. QR Factorization and the Gram-Schmidt process. v1 = a1 u1; v2 = b1 u1 + b2 u2; v3 = c1 u1 + c2 u2 + c3 u3. For this problem, V = UR. All vectors in U are at right angles, i.e. they are decoupled. What it's usually written as by others: A = QR.
13. QR Factorization and the Gram-Schmidt process. [Diagram: A = QR drawn with block shapes, with the same expansion v1 = a1 u1; v2 = b1 u1 + b2 u2; v3 = c1 u1 + c2 u2 + c3 u3.] All vectors in U are at right angles, i.e. they are decoupled.
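The A = QR picture above can be checked concretely with a minimal NumPy sketch (the 1000-by-5 matrix here is made up, not data from the talk):

```python
import numpy as np

np.random.seed(0)
A = np.random.rand(1000, 5)   # a small tall-and-skinny matrix

# Reduced QR: Q is 1000-by-5 with orthonormal columns, R is 5-by-5 upper triangular.
Q, R = np.linalg.qr(A)

assert np.allclose(Q.T @ Q, np.eye(5))   # columns of Q are "at right angles"
assert np.allclose(Q @ R, A)             # the factorization reproduces A
assert np.allclose(R, np.triu(R))        # R is upper triangular
```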
14. PCA of 80,000,000 images. Pipeline: A (80,000,000 images by 1000 pixels, zero-mean rows) → TSQR in MapReduce → R → SVD of R in post-processing → principal components: the top 100 singular values and the first 16 columns of V as images. Constantine & Gleich, MapReduce 2010.
15. Input: 500,000,000-by-100 matrix. Each record: 1-by-100 row. HDFS size: 423.3 GB. Time to compute colsum(A): 161 sec. Time to compute R in qr(A): 387 sec.
16. The rest of the talk: full TSQR code in hadoopy.

```python
import random, numpy, hadoopy

class SerialTSQR:
    def __init__(self, blocksize, isreducer):
        self.bsize = blocksize
        self.data = []
        if isreducer: self.__call__ = self.reducer
        else: self.__call__ = self.mapper

    def compress(self):
        R = numpy.linalg.qr(numpy.array(self.data), 'r')
        # reset data and re-initialize to R
        self.data = []
        for row in R:
            self.data.append([float(v) for v in row])

    def collect(self, key, value):
        self.data.append(value)
        if len(self.data) > self.bsize * len(self.data[0]):
            self.compress()

    def close(self):
        self.compress()
        for row in self.data:
            key = random.randint(0, 2000000000)
            yield key, row

    def mapper(self, key, value):
        self.collect(key, value)

    def reducer(self, key, values):
        for value in values:
            self.mapper(key, value)

if __name__ == '__main__':
    mapper = SerialTSQR(blocksize=3, isreducer=False)
    reducer = SerialTSQR(blocksize=3, isreducer=True)
    hadoopy.run(mapper, reducer)
```
17. Communication avoiding QR (Demmel et al. 2008) on MapReduce (Constantine and Gleich, 2010). Algorithm: the data are rows of a matrix; Map = QR factorization of rows; Reduce = QR factorization of rows. [Diagram: Mapper 1 runs serial TSQR on blocks A1–A4 (qr of A1, A2 gives Q2, R2; folding in A3 gives Q3, R3; folding in A4 gives Q4, R4) and emits R4; Mapper 2 does the same on A5–A8 and emits R8; Reducer 1 runs serial TSQR on R4 and R8 and emits the final Q, R.]
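The map/reduce tree above can be imitated in a few lines of NumPy: factor each row block locally, stack the small R factors, and factor the stack. This is only an in-memory sketch of the idea (block size and matrix are made up), not the Hadoop implementation:

```python
import numpy as np

def tsqr_r(A, block=250):
    """Serial TSQR sketch: QR each row block (map), then QR the stacked Rs (reduce)."""
    Rs = [np.linalg.qr(A[i:i + block], mode='r')   # map: local R factors
          for i in range(0, A.shape[0], block)]
    return np.linalg.qr(np.vstack(Rs), mode='r')    # reduce: combine them

np.random.seed(1)
A = np.random.rand(1000, 10)
R_tsqr = tsqr_r(A)
R_direct = np.linalg.qr(A, mode='r')

# R is unique up to the signs of its rows, so compare magnitudes.
assert np.allclose(np.abs(R_tsqr), np.abs(R_direct))
```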
18. The rest of the talk: full TSQR code in hadoopy. (This slide repeats the code shown on slide 16.)
19. Too many maps cause too much data to go to one reducer! Each image is 5 KB, so each HDFS block holds 12,800 images, for 6,250 total blocks. Each map outputs a 1000-by-1000 matrix, so one reducer gets a 6.25M-by-1000 matrix (50 GB).
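Slide 19's 50 GB figure checks out with simple arithmetic (doubles at 8 bytes each):

```python
# 6,250 map tasks each emit a 1000-by-1000 R factor to a single reducer.
blocks, rows_per_R, cols = 6250, 1000, 1000

reducer_rows = blocks * rows_per_R        # 6.25M rows arrive at the reducer
size_gb = reducer_rows * cols * 8 / 1e9   # 8 bytes per double

print(reducer_rows)  # 6250000
print(size_gb)       # 50.0
```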
20. [Diagram: two-iteration TSQR. Iteration 1: mappers 1-1 through 1-4 each run serial TSQR on their row blocks and emit R1–R4; after a shuffle, intermediate reducers 1-1 through 1-3 run serial TSQR on the collected R factors and emit S(1). Iteration 2: an identity map shuffles S(1) to a single reducer 2-1, which runs serial TSQR and emits the final R.]
21. Input: 500,000,000-by-100 matrix. Each record: 1-by-100 row. HDFS size: 423.3 GB. Time to compute colsum(A): 161 sec. Time to compute R in qr(A): 387 sec. (Repeated from slide 15.)
22. Hadoop streaming isn't always slow! Synthetic data test on a 100,000,000-by-500 matrix (~500 GB). Codes implemented in MapReduce streaming; matrix stored as TypedBytes lists of doubles; Python frameworks use NumPy + ATLAS; custom C++ TypedBytes reader/writer with ATLAS.

| Framework | Iter 1 total (secs.) | Iter 2 total (secs.) | Overall total (secs.) |
|-----------|---------------------:|---------------------:|----------------------:|
| Dumbo     | 960 | 217 | 1177 |
| Hadoopy   | 612 | 118 | 730 |
| C++       | 350 | 37  | 387 |
| Java      | 436 | 66  | 502 |
23. Use multiple iterations for problems with many columns. Increasing split size improves performance (it accounts for Hadoop data movement). Increasing iterations helps for problems with many columns. (1000 columns with a 64-MB split size overloaded the single reducer.)

| Cols. | Iters. | Split (MB) | Maps | Secs. |
|------:|-------:|-----------:|-----:|------:|
| 50    | 1 | 64  | 8000 | 388 |
| –     | – | 256 | 2000 | 184 |
| –     | – | 512 | 1000 | 149 |
| –     | 2 | 64  | 8000 | 425 |
| –     | – | 256 | 2000 | 220 |
| –     | – | 512 | 1000 | 191 |
| 1000  | 1 | 512 | 1000 | 666 |
| –     | 2 | 64  | 6000 | 590 |
| –     | – | 256 | 2000 | 432 |
| –     | – | 512 | 1000 | 337 |
24. More about how to compute a regression: min ‖Ax − b‖² = min sum_i (sum_j A_ij x_j − b_i)². [Diagram: the serial TSQR mapper factors its row blocks (qr of A1, A2 gives R2) and simultaneously transforms the right-hand side, b2 = Q2ᵀ b1, so b is carried along with the factorization.]
25. TSQR code in hadoopy for regressions.

```python
import random, numpy, hadoopy

class SerialTSQR:
    def __init__(self, blocksize, isreducer):
        […]

    def compress(self):
        Q, R = numpy.linalg.qr(numpy.array(self.data), 'full')
        # reset data and re-initialize to R
        self.data = []
        for row in R:
            self.data.append([float(v) for v in row])
        self.rhs = list(numpy.dot(Q.T, numpy.array(self.rhs)))

    def collect(self, key, valuerhs):
        self.data.append(valuerhs[0])
        self.rhs.append(valuerhs[1])
        if len(self.data) > self.bsize * len(self.data[0]):
            self.compress()

    def close(self):
        self.compress()
        for i, row in enumerate(self.data):
            key = random.randint(0, 2000000000)
            yield key, (row, self.rhs[i])

    def mapper(self, key, value):
        self.collect(key, unpack(value))

    def reducer(self, key, values):
        for value in values:
            self.mapper(key, value)

if __name__ == '__main__':
    mapper = SerialTSQR(blocksize=3, isreducer=False)
    reducer = SerialTSQR(blocksize=3, isreducer=True)
    hadoopy.run(mapper, reducer)
```
26. More about how to compute a regression. min ‖Ax − b‖² = min ‖QRx − b‖². Orthogonal or "right angle" matrices don't change vector magnitude, so this equals min ‖QᵀQRx − Qᵀb‖² = min ‖Rx − Qᵀb‖². This is a tiny linear system! QR for regression:

```python
def compute_x(output):
    R, y = load_from_hdfs(output)
    x = numpy.linalg.solve(R, y)
    write_output(x, output + '-x')
```
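The final solve really is tiny. Here is a NumPy sketch of the whole chain on made-up data (since `load_from_hdfs` and `write_output` in the slide are placeholders, this just mimics `compute_x` in memory):

```python
import numpy as np

np.random.seed(2)
A = np.random.rand(500, 8)
x_true = np.arange(1.0, 9.0)
b = A @ x_true                # a consistent system, so the residual is zero

Q, R = np.linalg.qr(A)
y = Q.T @ b                   # accumulated alongside R during the MapReduce job
x = np.linalg.solve(R, y)     # the tiny 8-by-8 system, solved on one machine

assert np.allclose(x, x_true)
```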
27. We do a similar step for the PCA and compute the 1000-by-1000 SVD on one machine.
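That PCA step can also be sketched in NumPy (sizes made up; in practice the MapReduce TSQR would produce R): the SVD of the small R gives the singular values and principal axes of the full A.

```python
import numpy as np

np.random.seed(3)
A = np.random.rand(2000, 50)
A = A - A.mean(axis=0)              # center the data, as in the pipeline

R = np.linalg.qr(A, mode='r')       # the TSQR output: only 50-by-50
_, svals, Vt = np.linalg.svd(R)     # small SVD on one machine

# Singular values (and right singular vectors) of R match those of A.
assert np.allclose(svals, np.linalg.svd(A, compute_uv=False))
```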
28. Getting the matrix Q is tricky!
29. What about the matrix Q? We want Q to be numerically orthogonal: norm(QᵀQ − I) measures this, and the condition number of A measures problem sensitivity. [Figure: orthogonality error vs. condition number from 10⁵ to 10²⁰ for prior work (AR⁻¹), AR⁻¹ with iterative refinement, and Direct TSQR (Benson, Gleich, Demmel, submitted; Constantine & Gleich, MapReduce 2011).] Prior methods all failed without any warning.
30. Taking care of business by keeping track of Q. [Diagram: 1. Each mapper outputs its local Qi and Ri in separate files. 2. Collect R1–R4 on one node (Task 2), compute the final R and the small Q pieces Q11–Q41 for each. 3. Distribute the pieces of Q2 and form the true Q: each mapper multiplies its local Qi by its piece Qi1.]
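The three steps of slide 30 can be checked on small in-memory blocks with NumPy (block count and sizes are made up):

```python
import numpy as np

np.random.seed(4)
n = 5
blocks = [np.random.rand(100, n) for _ in range(4)]

# 1. Each mapper outputs its local Q_i and R_i.
local_QR = [np.linalg.qr(B) for B in blocks]

# 2. Collect the small R_i's on one node and factor the stack.
Q2, R = np.linalg.qr(np.vstack([Ri for _, Ri in local_QR]))

# 3. Distribute the n-row pieces of Q2: each mapper forms Q_i times its piece.
Q = np.vstack([Qi @ Q2[i * n:(i + 1) * n] for i, (Qi, _) in enumerate(local_QR)])

A = np.vstack(blocks)
assert np.allclose(Q @ R, A)             # a valid factorization of the full A
assert np.allclose(Q.T @ Q, np.eye(n))   # and Q is numerically orthogonal
```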
31. Code available from github.com/arbenson/mrtsqr … it isn't too bad.
32. Future work … more columns! With ~3000 columns, one 64-MB chunk is a local QR computation. Could "iterate in blocks of 3000" columns to continue … maybe "efficient" for 10,000 columns. Need different ideas for 100,000 columns (randomized methods?).
33. Questions? www.cs.purdue.edu/~dgleich · @dgleich · dgleich@purdue.edu