Matrix methods for Hadoop

A quick tutorial on how to tackle problems from a matrix-vector perspective in Hadoop

Published in: Technology

Matrix methods for Hadoop

  1. 1. Matrix Methods with Hadoop. Slides: bit.ly/10SIe1A. Code: github.com/dgleich/matrix-hadoop-tutorial. David F. Gleich, Assistant Professor, Computer Science, Purdue University. David Gleich · Purdue bit.ly/10SIe1A
  3. 3. A bit of philosophy … Image from rockysprings, deviantart, CC share-alike
  5. 5. Matrix computations. $A = \begin{bmatrix} A_{1,1} & A_{1,2} & \cdots & A_{1,n} \\ A_{2,1} & A_{2,2} & \cdots & \vdots \\ \vdots & \ddots & \ddots & A_{m-1,n} \\ A_{m,1} & \cdots & A_{m,n-1} & A_{m,n} \end{bmatrix}$. Operations: $Ax$. Linear systems: $Ax = b$. Least squares: $\min \|Ax - b\|$. Eigenvalues: $Ax = \lambda x$.
  6. 6. Outcomes. Recognize relationships between matrix methods and things you’ve already been doing (example: SQL queries as matrix computations). Understand how to use Hadoop to compute these matrix methods at scale for BigData (example: recommenders with social network info). Understand some of the issues that could arise.
  7. 7. Ideal outcomes. How to use techniques from matrix computations in order to solve your problems quickly! 1986
  8. 8. Taking the red pill … Image from rockysprings, deviantart, CC share-alike
  9. 9. Matrix computations: Physics, Databases, Statistics, Machine learning, Engineering, Information retrieval, Graphics, Computer vision, Bioinformatics, Social networks.
  10. 10. matrix computations ≠ linear algebra
  11. 11. A SQL statement as a matrix computation. How do I find the average rating for each product? http://stackoverflow.com/questions/4217449/returning-average-rating-from-a-database-sql
  12. 12. A SQL statement as a matrix computation. How do I find the average rating for each product? http://stackoverflow.com/questions/4217449/returning-average-rating-from-a-database-sql
        SELECT
          p.product_id,
          p.name,
          AVG(pr.rating) AS rating_average
        FROM products p
        INNER JOIN product_ratings pr
          ON pr.product_id = p.product_id
        GROUP BY p.product_id
        ORDER BY rating_average DESC
  13. 13. This SQL statement is a matrix computation! Image from rockysprings, deviantart, CC share-alike
  14. 14. SELECT ... AVG(pr.rating) ... GROUP BY p.product_id. The product_ratings table is a matrix! Its rows (pid, uid, rating) are: (pid8, uid2, 4), (pid9, uid9, 1), (pid2, uid9, 5), (pid9, uid5, 5), (pid6, uid8, 4), (pid1, uid2, 4), (pid3, uid4, 4), (pid5, uid9, 2), (pid9, uid8, 4), (pid9, uid9, 1). The matrix has one row per product, pid1 through pid9.
  15. 15. But it’s a weird matrix. The product_ratings table (slide 14) is a matrix with rows pid1 through pid9, but it has missing entries!
  16. 16. But it’s a weird matrix. Average of ratings: SELECT AVG(r) ... GROUP BY pid takes the product_ratings matrix (slide 14) to a vector with one average rating per product, pid1 through pid9.
  17. 17. But it’s a weird matrix and not a linear operator. The product_ratings table (slide 14) is a matrix $A = \begin{bmatrix} A_{1,1} & A_{1,2} & \cdots & A_{1,n} \\ A_{2,1} & A_{2,2} & \cdots & \vdots \\ \vdots & \ddots & \ddots & A_{m-1,n} \\ A_{m,1} & \cdots & A_{m,n-1} & A_{m,n} \end{bmatrix}$, and $\mathrm{avg}(A) = \begin{bmatrix} \sum_j A_{1,j} / \sum_j \text{“}A_{1,j} \neq 0\text{”} \\ \sum_j A_{2,j} / \sum_j \text{“}A_{2,j} \neq 0\text{”} \\ \vdots \\ \sum_j A_{m,j} / \sum_j \text{“}A_{m,j} \neq 0\text{”} \end{bmatrix}$.
  18. 18. matrix computations ≠ linear algebra
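  19. 19. … but there is a linear operator hiding … $P = \begin{bmatrix} A_{1,1} / \sum_j \text{“}A_{1,j} \neq 0\text{”} & A_{1,2} / \sum_j \text{“}A_{1,j} \neq 0\text{”} & \cdots \\ A_{2,1} / \sum_j \text{“}A_{2,j} \neq 0\text{”} & A_{2,2} / \sum_j \text{“}A_{2,j} \neq 0\text{”} & \cdots \\ \vdots & \vdots & \ddots \end{bmatrix}$, and $\mathrm{avg}(A) = Pe$, where $e$ is the vector of all ones.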
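The identity avg(A) = Pe on slide 19 can be checked on a toy example. The sketch below is illustrative only: the 3-by-4 matrix is hypothetical (it is not the product_ratings data), zeros stand for missing ratings, and the helper names are made up for this example.

```python
def scale_rows_by_nonzero_count(A):
    """Build P: every entry of A divided by the number of nonzeros in its row."""
    P = []
    for row in A:
        nnz = sum(1 for a in row if a != 0)
        P.append([a / nnz if nnz else 0.0 for a in row])
    return P


def matvec(M, x):
    """Dense matrix-vector product M x."""
    return [sum(m * xi for m, xi in zip(row, x)) for row in M]


# Hypothetical ratings matrix: rows are products, columns are users,
# and 0 marks a missing rating.
A = [
    [4, 0, 0, 2],
    [0, 5, 5, 0],
    [1, 0, 4, 1],
]
e = [1.0] * 4                      # the vector of all ones
P = scale_rows_by_nonzero_count(A)
averages = matvec(P, e)            # avg(A) = P e
print(averages)
```

P depends on the sparsity pattern of A, which is why the average is not a linear function of A even though it is a linear operator applied to e.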
  20. 20. Hadoop, MapReduce, and Matrix Methods
  21. 21. MapReduce
  22. 22. The MapReduce Framework. Originated at Google for indexing web pages and computing PageRank. Express algorithms in “data-local operations”. Implement one type of communication: shuffle. Shuffle moves all data with the same key to the same reducer. Data scalable (diagram: Maps M, then Shuffle, then Reduce R). Fault-tolerance by design: input stored in triplicate; map output persisted to disk before shuffle; reduce input/output on disk.
  23. 23. wordcount is a matrix computation too
        map(document):
            for word in document:
                emit (word, 1)
        reduce(word, counts):
            emit (word, sum(counts))
        (Diagram: documents D1 to D5 each emit pairs such as matrix,1 bigdata,1 hadoop,1.)
  24. 24. wordcount is a matrix computation too. Let the rows of $A = \begin{bmatrix} A_{1,1} & A_{1,2} & \cdots & A_{1,n} \\ A_{2,1} & A_{2,2} & \cdots & \vdots \\ \vdots & \ddots & \ddots & A_{m-1,n} \\ A_{m,1} & \cdots & A_{m,n-1} & A_{m,n} \end{bmatrix}$ be the documents doc1, doc2, …, docm. Then word count = colsum(A) = $A^T e$, where $e$ is the vector of all ones.
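To make the claim word count = colsum(A) = A^T e concrete, here is a small sketch under stated assumptions: the three documents are invented for illustration, A is the dense document-by-term count matrix, and the result is compared against a direct count of every word.

```python
from collections import Counter

# Hypothetical documents, one per row of A.
docs = [
    "matrix bigdata hadoop",
    "bigdata hadoop",
    "hadoop hadoop matrix",
]

# Columns of A are the distinct terms, in sorted order.
vocab = sorted({word for doc in docs for word in doc.split()})
column = {word: j for j, word in enumerate(vocab)}

# A[i][j] = number of times term j occurs in document i.
A = [[0] * len(vocab) for _ in docs]
for i, doc in enumerate(docs):
    for word in doc.split():
        A[i][column[word]] += 1

# word count = colsum(A) = A^T e, with e the vector of all ones.
e = [1] * len(docs)
word_count = [sum(A[i][j] * e[i] for i in range(len(docs)))
              for j in range(len(vocab))]

# The same numbers from counting every emitted (word, 1) pair.
direct_count = Counter(word for doc in docs for word in doc.split())
print(dict(zip(vocab, word_count)))
```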
  25. 25. inverted index is a matrix computation too. The rows of $A = \begin{bmatrix} A_{1,1} & A_{1,2} & \cdots & A_{1,n} \\ A_{2,1} & A_{2,2} & \cdots & \vdots \\ \vdots & \ddots & \ddots & A_{m-1,n} \\ A_{m,1} & \cdots & A_{m,n-1} & A_{m,n} \end{bmatrix}$ are the documents doc1, doc2, …, docm.
  26. 26. inverted index is a matrix computation too. The rows of $A^T = \begin{bmatrix} A_{1,1} & A_{2,1} & \cdots & A_{m,1} \\ A_{1,2} & A_{2,2} & \cdots & \vdots \\ \vdots & \ddots & \ddots & A_{m,n-1} \\ A_{1,n} & \cdots & A_{m-1,n} & A_{m,n} \end{bmatrix}$ are the terms term1, term2, …, termn.
  27. 27. A recommender system with social info. product_ratings rows (pid, uid, rating): (pid8, uid2, 4), (pid9, uid9, 1), (pid2, uid9, 5), (pid9, uid5, 5), (pid6, uid8, 4), (pid1, uid2, 4), (pid3, uid4, 4), (pid5, uid9, 2), (pid9, uid8, 4), (pid9, uid9, 1). friends_links rows (uid, uid): (uid6, uid1), (uid8, uid9), (uid7, uid7), (uid7, uid4), (uid6, uid2), (uid7, uid1), (uid3, uid1), (uid1, uid8), (uid7, uid3), (uid9, uid1).
  28. 28. A recommender system with social info. The product_ratings table (slide 27) becomes a matrix with rows pid1, pid2, …, and the friends_links table becomes a matrix with rows uid1, uid2, …; each is drawn as $\begin{bmatrix} A_{1,1} & A_{2,1} & \cdots \\ A_{1,2} & A_{2,2} & \cdots \\ \vdots & \vdots & \ddots \end{bmatrix}$.
  29. 29. A recommender system with social info. The product_ratings table (slide 27) is the matrix R and the friends_links table is the matrix S.
  30. 30. A recommender system with social info. Recommend each item based on the average rating of all trusted users, with something that is almost a matrix-matrix product: “$X = S R^T$”, where R has rows pid1, pid2, … and S has rows uid1, uid2, …. $X_{\mathrm{uid},\mathrm{pid}} = \left( \sum_{\mathrm{uid2}} S_{\mathrm{uid},\mathrm{uid2}} R_{\mathrm{uid2},\mathrm{pid}} \right) \cdot \left( \sum_{\mathrm{uid2}} \text{“}S_{\mathrm{uid},\mathrm{uid2}} \text{ and } R_{\mathrm{uid2},\mathrm{pid}} \neq 0\text{”} \right)^{-1}$
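The formula on slide 30 can be sketched in a few lines of plain Python. This is only an illustration under assumptions: the user and product ids are made up, S is stored as a dictionary keyed by (uid, uid2), R is stored as a dictionary keyed by (pid, uid2) to match the (pid, uid, rating) table layout, and `social_average` is a hypothetical helper name.

```python
from collections import defaultdict


def social_average(S, R):
    """X[uid, pid] = average rating of pid over the users that uid trusts."""
    # Group the ratings by the user who made them (the join key).
    ratings_by_user = defaultdict(list)
    for (pid, uid2), rating in R.items():
        if rating != 0:
            ratings_by_user[uid2].append((pid, rating))

    sums = defaultdict(float)      # numerator: sum of S * R
    counts = defaultdict(int)      # denominator: number of nonzero terms
    for (uid, uid2), trust in S.items():
        if trust == 0:
            continue
        for pid, rating in ratings_by_user.get(uid2, []):
            sums[(uid, pid)] += trust * rating
            counts[(uid, pid)] += 1
    return {key: sums[key] / counts[key] for key in sums}


# Hypothetical trust links (uid trusts uid2) and ratings (pid rated by uid2).
S = {("u1", "u2"): 1, ("u1", "u3"): 1, ("u2", "u3"): 1}
R = {("p1", "u2"): 4, ("p1", "u3"): 2, ("p2", "u3"): 5}
X = social_average(S, R)
print(X)
```

Grouping both inputs by the shared user id is the same alignment that the MapReduce join performs in the later slides.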
  31. 31. Tools I like: hadoop streaming, dumbo, mrjob, hadoopy, C++.
  32. 32. Tools I don’t use but other people seem to like …: pig, java, hbase, mahout, Eclipse, Cassandra. Mahout is the closest thing to a library for matrix computations in Hadoop. If you like Java, you should probably start there. I’m a low-level guy.
  33. 33. hadoop streaming. The map function is a program: (key,value) pairs are sent via stdin, and output (key,value) pairs go to stdout. The reduce function is a program: (key,value) pairs are sent via stdin, keys are grouped, and output (key,value) pairs go to stdout.
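The streaming contract on slide 33 can be imitated locally. The sketch below is a simulation, not a Hadoop job: it assumes tab-separated key and value on each line, uses `sorted` as a stand-in for the shuffle, and the function names (`mapper`, `reducer`, `run_local`) are hypothetical. In a real streaming job the mapper and reducer would be separate programs reading stdin and writing stdout.

```python
from itertools import groupby


def mapper(lines):
    """Emit one 'word<TAB>1' line per word, as a streaming mapper would print."""
    for line in lines:
        for word in line.split():
            yield "%s\t%d" % (word.lower(), 1)


def reducer(sorted_lines):
    """Consume key-sorted 'key<TAB>value' lines; emit one total per key."""
    pairs = (line.split("\t", 1) for line in sorted_lines)
    for key, group in groupby(pairs, key=lambda kv: kv[0]):
        yield "%s\t%d" % (key, sum(int(value) for _, value in group))


def run_local(lines):
    """Map, sort (stand-in for the shuffle), then reduce."""
    return list(reducer(sorted(mapper(lines))))


output = run_local(["Matrix bigdata hadoop", "bigdata Hadoop"])
print("\n".join(output))
```

Because the reducer sees all lines for one key consecutively, it only needs to keep a running total, which is what "keys are grouped" buys you.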
  34. 34. mrjob (from Yelp): a wrapper around Hadoop streaming for map and reduce functions in Python
        from mrjob.job import MRJob
        class MRWordFreqCount(MRJob):
            def mapper(self, _, line):
                # emit (word, 1) for every word on the input line
                for word in line.split():
                    yield (word.lower(), 1)
            def reducer(self, word, counts):
                # add up the ones emitted for this word
                yield (word, sum(counts))
        if __name__ == '__main__':
            MRWordFreqCount.run()
  35. 35. How can Hadoop streaming possibly be fast? 500 GB matrix. Computing the R in a QR factorization. Synthetic data test: 100,000,000-by-500 matrix (~500GB). Codes implemented in MapReduce streaming. Matrix stored as TypedBytes lists of doubles. Python frameworks use Numpy+Atlas. Custom C++ TypedBytes reader/writer with Atlas. New non-streaming Java implementation too. See my next talk!
                   Iter 1       Iter 1          Iter 2          Overall
                   QR (secs.)   Total (secs.)   Total (secs.)   Total (secs.)
        Dumbo      67725        960             217             1177
        Hadoopy    70909        612             118             730
        C++        15809        350             37              387
        Java                    436             66              502
        C++ in streaming beats a native Java implementation. All timing results are from the Hadoop job tracker. mrjob could be faster if it used typedbytes for intermediate storage; see https://github.com/Yelp/mrjob/pull/447 for verification. Example available from github.com/dgleich/mrtsqr. (Embedded slide: David Gleich (Sandia), MapReduce 2011, 16/22.)
  36. 36. Matrix-vector product: $Ax = y$, $y_i = \sum_k A_{ik} x_k$. Follow along! matrix-hadoop/codes/smatvec.py
  37. 37. Where do matrix-vector products arise? Google’s PageRank. Computing cosine-similarity between one document and all other documents. Predictions from kernel methods. Computing averages (the example above).
  38. 38. Matrix-vector product: $Ax = y$, $y_i = \sum_k A_{ik} x_k$. Follow along! matrix-hadoop/codes/smatvec.py
        A is stored by row
        $ head samples/smat_5_5.txt
        0 0 0.125 3 1.024 4 0.121
        1 0 0.597
        2 2 1.247
        3 4 -1.45
        4 2 0.061
        x is stored entry-wise
        $ head samples/vec_5.txt
        0 0.241
        1 -0.98
        2 0.237
        3 -0.32
        4 0.080
  39. 39. Matrix-vector product (in pictures): $Ax = y$, $y_i = \sum_k A_{ik} x_k$. Input (A and x), then Map 1: align on columns, then Reduce 1: output $A_{ik} x_k$ keyed on row i, then Reduce 2: output sum($A_{ik} x_k$), which is y.
  40. 40. Matrix-vector product (in pictures): $Ax = y$, $y_i = \sum_k A_{ik} x_k$. Input, then Map 1: align on columns.
        def joinmap(self, key, line):
            vals = line.split()
            if len(vals) == 2:
                # the vector
                yield (vals[0],              # row
                       (float(vals[1]),))    # xi
            else:
                # the matrix
                row = vals[0]
                for i in range(1, len(vals), 2):
                    yield (vals[i],          # column
                           (row,             # i, Aij
                            float(vals[i+1])))
  41. 41. Matrix-vector product (in pictures): $Ax = y$, $y_i = \sum_k A_{ik} x_k$. Input, then Map 1: align on columns, then Reduce 1: output $A_{ik} x_k$ keyed on row i. Note that you should use a secondary sort to avoid reading both in memory.
        def joinred(self, key, vals):
            vecval = 0.
            matvals = []
            for val in vals:
                if len(val) == 1:
                    vecval += val[0]
                else:
                    matvals.append(val)
            for val in matvals:
                yield (val[0], val[1]*vecval)
  42. 42. Matrix-vector product (in pictures): $Ax = y$, $y_i = \sum_k A_{ik} x_k$. Input, then Map 1: align on columns, then Reduce 1: output $A_{ik} x_k$ keyed on row i, then Reduce 2: output sum($A_{ik} x_k$).
        def sumred(self, key, vals):
            yield (key, sum(vals))
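The three steps on slides 40 to 42 can be strung together and run locally on the sample data from slide 38. This is a rough single-process simulation, not the contents of smatvec.py: the map and reduce functions are rewritten as plain functions without `self`, and `shuffle` and `matvec` are hypothetical helpers that stand in for Hadoop's grouping of values by key.

```python
from collections import defaultdict


def shuffle(pairs):
    """Stand-in for the shuffle: group values by key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def joinmap(line):
    vals = line.split()
    if len(vals) == 2:
        # vector entry "k x_k": key on the index k
        yield (vals[0], (float(vals[1]),))
    else:
        # matrix row "i k A_ik k A_ik ...": key each entry on its column k
        row = vals[0]
        for i in range(1, len(vals), 2):
            yield (vals[i], (row, float(vals[i + 1])))


def joinred(key, vals):
    # one vector entry x_k and all matrix entries of column k arrive together
    vecval = 0.0
    matvals = []
    for val in vals:
        if len(val) == 1:
            vecval += val[0]
        else:
            matvals.append(val)
    for row, a in matvals:
        yield (row, a * vecval)    # A_ik * x_k, keyed on row i


def sumred(key, vals):
    yield (key, sum(vals))


def matvec(matrix_lines, vector_lines):
    lines = list(matrix_lines) + list(vector_lines)
    stage1 = shuffle(kv for line in lines for kv in joinmap(line))
    products = [kv for k, vs in stage1.items() for kv in joinred(k, vs)]
    stage2 = shuffle(products)
    return dict(kv for k, vs in stage2.items() for kv in sumred(k, vs))


A_lines = ["0 0 0.125 3 1.024 4 0.121", "1 0 0.597", "2 2 1.247",
           "3 4 -1.45", "4 2 0.061"]
x_lines = ["0 0.241", "1 -0.98", "2 0.237", "3 -0.32", "4 0.080"]
y = matvec(A_lines, x_lines)
print(sorted(y.items()))
```

Note that `joinred` here buffers a whole column in memory, which is exactly what the secondary-sort remark on slide 41 is meant to avoid at scale.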
  43. 43. Matrix-matrix product: $AB = C$, $C_{ij} = \sum_k A_{ik} B_{kj}$. Follow along! matrix-hadoop/codes/matmat.py
  44. 44. Matrix-matrix product: $AB = C$, $C_{ij} = \sum_k A_{ik} B_{kj}$. Follow along! matrix-hadoop/codes/matmat.py
        A is stored by row
        $ head samples/smat_10_5_A.txt
        0 0 0.599 4 -1.53
        1
        2 3
        4 2 0.260
        0 0.267 1 0.839
        B is stored by row
        $ head samples/smat_5_5.txt
        0 0 0.125 3 1.024 4 0.121
        1 0 0.597
        2 2 1.247
  45. 45. Matrix-matrix product (in pictures): $AB = C$, $C_{ij} = \sum_k A_{ik} B_{kj}$. Map 1: align on columns, then Reduce 1: output $A_{ik} B_{kj}$ keyed on (i,j), then Reduce 2: output sum($A_{ik} B_{kj}$), which is C.
  46. 46. Matrix-matrix product (in code): $AB = C$, $C_{ij} = \sum_k A_{ik} B_{kj}$. Map 1: align on columns.
        def joinmap(self, key, line):
            mtype = self.parsemat()
            vals = line.split()
            row = vals[0]
            rowvals = [(vals[i], float(vals[i+1]))
                       for i in range(1, len(vals), 2)]
            if mtype == 1:
                # matrix A, output by col
                for val in rowvals:
                    yield (val[0], (row, val[1]))
            else:
                yield (row, (rowvals,))
  47. 47. Matrix-matrix product (in pictures): $AB = C$, $C_{ij} = \sum_k A_{ik} B_{kj}$. Map 1: align on columns, then Reduce 1: output $A_{ik} B_{kj}$ keyed on (i,j).
        def joinred(self, key, vals):
            # load the data into memory
            brow = []
            acol = []
            for val in vals:
                if len(val) == 1:
                    brow.extend(val[0])
                else:
                    acol.append(val)
            for (bcol, bval) in brow:
                for (arow, aval) in acol:
                    yield ((arow, bcol), aval*bval)
  48. 48. Matrix-matrix product (in pictures): $AB = C$, $C_{ij} = \sum_k A_{ik} B_{kj}$. Map 1: align on columns, then Reduce 1: output $A_{ik} B_{kj}$ keyed on (i,j), then Reduce 2: output sum($A_{ik} B_{kj}$).
        def sumred(self, key, vals):
            yield (key, sum(vals))
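As with the matrix-vector case, the steps on slides 46 to 48 can be simulated in one process. The sketch below is not matmat.py: the two small matrices are made up, the functions are plain functions without `self`, and the matrix type is passed in explicitly as `mtype` because the `parsemat` helper used on slide 46 is not shown in the slides.

```python
from collections import defaultdict


def shuffle(pairs):
    """Stand-in for the shuffle: group values by key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def joinmap(line, mtype):
    vals = line.split()
    row = vals[0]
    rowvals = [(vals[i], float(vals[i + 1])) for i in range(1, len(vals), 2)]
    if mtype == 1:
        # matrix A: send entry A_ik to key k (its column)
        for col, a in rowvals:
            yield (col, (row, a))
    else:
        # matrix B: send the whole row k to key k
        yield (row, (rowvals,))


def joinred(key, vals):
    brow = []   # row k of B as (j, B_kj) pairs
    acol = []   # column k of A as (i, A_ik) pairs
    for val in vals:
        if len(val) == 1:
            brow.extend(val[0])
        else:
            acol.append(val)
    for bcol, bval in brow:
        for arow, aval in acol:
            yield ((arow, bcol), aval * bval)


def sumred(key, vals):
    yield (key, sum(vals))


def matmat(a_lines, b_lines):
    mapped = [kv for line in a_lines for kv in joinmap(line, 1)]
    mapped += [kv for line in b_lines for kv in joinmap(line, 2)]
    products = [kv for k, vs in shuffle(mapped).items() for kv in joinred(k, vs)]
    return dict(kv for k, vs in shuffle(products).items() for kv in sumred(k, vs))


# Hypothetical inputs: A = [[1, 2], [0, 3]] and B = [[4, 0], [5, 6]], by row.
C = matmat(["0 0 1.0 1 2.0", "1 1 3.0"], ["0 0 4.0", "1 0 5.0 1 6.0"])
print(sorted(C.items()))
```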
  49. 49. Our social recommender: $S R^T$. Follow along! matrix-hadoop/recsys/recsys.py
        R is stored entry-wise (Object ID, User ID, Rating)
        $ gunzip -c data/rating.txt.gz
        139431556 591156 5
        139431556 1312460676 5
        139431556 204358 4
        368725 139431556 5
        S is stored entry-wise (My ID, Other ID, Trust)
        $ gunzip -c data/rating.txt.gz
        3287060356 232085 -1
        3288305540 709420 1
        3290337156 204418 -1
        3294138244 269243 -1
  50. 50. Social recommender (in code). Conceptually, the first step is the same as the matrix-matrix product. We reorganize the data by user-id to be able to map the trust relationships. Map 1: align on columns.
        def joinmap(self, key, line):
            parts = line.split('\t')
            if len(parts) == 8:    # ratings
                objid = parts[0].strip()
                uid = parts[1].strip()
                rat = int(parts[2])
                yield (uid, (objid, rat))
            elif len(parts) == 4:  # trust
                myid = parts[0].strip()
                otherid = parts[1].strip()
                value = int(parts[2])
                if value > 0:
                    yield (otherid, (myid,))
  51. 51. Matrix-matrix product (in pictures). Conceptually, the second step is the same as the matrix-matrix product too: we “map” the ratings from each trusted user back to the source. Map 1: align on columns, then Reduce 1: output $A_{ik} B_{kj}$ keyed on (i,j).
        def joinred(self, key, vals):
            tusers = []   # uids that trust key
            ratobjs = []  # objs rated by uid=key
            for val in vals:
                if len(val) == 1:
                    tusers.append(val[0])
                else:
                    ratobjs.append(val)
            for (objid, rat) in ratobjs:
                for uid in tusers:
                    yield ((uid, objid), rat)
  52. 52. Matrix-matrix product (in pictures): $AB = C$, $C_{ij} = \sum_k A_{ik} B_{kj}$. Map 1: align on columns, then Reduce 1: output $A_{ik} B_{kj}$ keyed on (i,j), then Reduce 2: output sum($A_{ik} B_{kj}$).
        def avgred(self, key, vals):
            s = 0.
            n = 0
            for val in vals:
                s += val
                n += 1
            # the smoothed average of ratings
            yield key, (s+self.options.avg)/float(n+1)
  53. 53. Better ways to store matrices in Hadoop. Block matrices minimize the number of intermediate keys and values used. I’d form them based on the first reduce. No need for “integer” keys that fall between 1 and n!
  54. 54. Tall-and-skinny matrices are common in BigData. A : m x n, m ≫ n. Key is an arbitrary row-id. Value is the 1 x n array for a row. Each submatrix $A_i$ (A1, A2, A3, A4 in the picture) is the input to a map task.
  55. 55. Double-precision floating point was designed for the era where “big” was 1000-10000
  56. 56. Error analysis of summation
        s = 0
        for i = 1 to n:
            s = s + x[i]
        $\mathrm{fl}(x + y) = (x + y)(1 + \delta)$ with $|\delta| \le \mu$, so $\left| \mathrm{fl}\left(\sum_i x_i\right) - \sum_i x_i \right| \le n \mu \sum_i |x_i|$, $\mu \approx 10^{-16}$. A simple summation formula has error that is not always small if n is a billion.
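A small experiment makes the bound on slide 56 tangible. The sketch below is an illustration under assumptions: it uses a million copies of 0.1 rather than a billion terms so that it runs quickly, takes `math.fsum` as the reference value, and sets μ to half of `sys.float_info.epsilon`; the names `naive_sum`, `error`, and `bound` are introduced only for this example.

```python
import math
import sys


def naive_sum(xs):
    """The simple loop from the slide: s = s + x[i]."""
    s = 0.0
    for x in xs:
        s = s + x
    return s


n = 10**6
xs = [0.1] * n
mu = sys.float_info.epsilon / 2        # unit roundoff, about 1.1e-16
reference = math.fsum(xs)              # correctly rounded sum of the stored values
error = abs(naive_sum(xs) - reference)
bound = n * mu * math.fsum(abs(x) for x in xs)
print("observed error:", error)
print("n * mu * sum|x|:", bound)
```

The observed error stays under the bound, but both grow with n, which is the point of the slide.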
  57. 57. If your application matters then watch out for this issue. Use quad-precision arithmetic or compensated summation instead.
  58. 58. Compensated Summation (“Kahan summation algorithm” on Wikipedia)
        s = 0.; c = 0.
        for i = 1 to n:
            y = x[i] - c
            t = s + y
            c = (t - s) - y
            s = t
        Mathematically, c is always zero. On a computer, c can be non-zero. The parentheses matter! $\left| \mathrm{fl}(\mathrm{csum}(x)) - \sum_i x_i \right| \le (\mu + n \mu^2) \sum_i |x_i|$, $\mu \approx 10^{-16}$.
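The pseudocode on slide 58 translates almost line for line into Python. The sketch below is a minimal version for illustration; the test data (a million copies of 0.1) and the use of `math.fsum` as the reference are assumptions made for this example.

```python
import math


def kahan_sum(xs):
    """Compensated (Kahan) summation, following the slide's pseudocode."""
    s = 0.0
    c = 0.0                 # running compensation for lost low-order bits
    for x in xs:
        y = x - c
        t = s + y
        c = (t - s) - y     # the parentheses matter: this recovers the rounding error
        s = t
    return s


def naive_sum(xs):
    s = 0.0
    for x in xs:
        s = s + x
    return s


xs = [0.1] * 10**6
reference = math.fsum(xs)
naive_error = abs(naive_sum(xs) - reference)
kahan_error = abs(kahan_sum(xs) - reference)
print("naive error:", naive_error)
print("kahan error:", kahan_error)
```

In a MapReduce reducer the same idea applies to the running total, at the cost of three extra floating-point operations per value.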
  59. 59. Collaborators, Friends, and People who have taught me. MRTSQR: Paul Constantine (Stanford), Austin Benson (Stanford), James Demmel (Berkeley). Sandia MapReduce: Todd Plantenga, Tammy Kolda, Justin Basilico (now Netflix). Simform: Jeremy Templeton (Sandia), Joe Ruthruff (Sandia), Yangyang Hou (Purdue), Joe Nichols (Stanford). Others: Margot Gerritsen (Stanford). Grants: Sandia CSAR.
  60. 60. Questions? Image from rockysprings, deviantart, CC share-alike
