MapReduce for scientific simulation analysis
Published on

A tutorial I gave at Stanford on how to use MapReduce style computations for simulation data.

Published in: Education, Technology
Transcript

  • 1. A hands-on introduction to scientific data analysis with Hadoop: a matrix computations perspective. DAVID F. GLEICH, PURDUE UNIVERSITY. ICME MAPREDUCE WORKSHOP @ STANFORD.
  • 2. Who is this for? Workshop project groups; those curious about “MapReduce” and “Hadoop”; those who think about problems as matrices.
  • 3. What should you get out of it? 1. Understand some problems that MapReduce solves effectively. 2. Learn techniques to solve them using Hadoop and dumbo. 3. Learn some Hadoop words.
  • 4. What you won’t learn: the latest and greatest in MapReduce algorithms; how to improve the performance of your Hadoop job; how to write wordcount in Hadoop.
  • 5. Slides will be online soon. Code samples and short tutorials at github.com/dgleich/mrmatrix.
  • 6. 1. HPC vs. Data (redux); 2. MapReduce vs. Hadoop; 3. Dive into Hadoop with Hadoop streaming; 4. Sparse matrix methods with Hadoop.
  • 7. High performance computing vs. data intensive computing.
  • 8. [Table: a supercomputer vs. a data cluster.]
    Supercomputer: 224k cores, 1.7 Pflops, 7 MW, 10 PB drive, 45 GB/core, custom interconnect, $104 M.
    Data cluster: 80k cores, ? Pflops, ? MW, 50 PB drive, 625 GB/core, GB ethernet, $?? M.
  • 9. icme-hadoop1: 12 nodes; 4-core i7 processors, 24 GB/node, 1 GB ethernet; 12 TB/node, 3000 GB/core, 50 TB usable space (3x redundancy).
  • 10. MapReduce is designed to solve a different set of problems.
  • 11. [Figure: supercomputer → data computing cluster → engineer.] Each multi-day HPC simulation generates gigabytes of data. A data cluster can hold hundreds or thousands of old simulations … enabling engineers to query and analyze months of simulation data for all sorts of neat purposes.
  • 12. MapReduce and Hadoop overview.
  • 13. The MapReduce programming model. Input: a list of (key, value) pairs. Map: apply a function f to all pairs. Reduce: apply a function g to all values with key k (for all k). Output: a list of (key, value) pairs.
  • 14. The map function f must be side-effect free. The reduce function g must be side-effect free.
  • 15. All map functions can be done in parallel. All reduce functions (for key k) can be done in parallel.
  • 16. Shuffle: group all pairs with key k together (sorting suffices).
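    To make the model concrete, here is a minimal single-machine sketch of map, shuffle, and reduce in plain Python (an illustration, not anything from the slides):

        from itertools import groupby
        from operator import itemgetter

        def map_reduce(pairs, mapper, reducer):
            # Map: apply f to every (key, value) pair
            mapped = [out for k, v in pairs for out in mapper(k, v)]
            # Shuffle: group all pairs with the same key (sorting suffices)
            mapped.sort(key=itemgetter(0))
            # Reduce: apply g to all values with key k, for every key k
            result = []
            for key, group in groupby(mapped, key=itemgetter(0)):
                result.extend(reducer(key, [v for _, v in group]))
            return result

        # e.g. summing values by key:
        # map_reduce([(1, ('a', 1)), (2, ('b', 2)), (3, ('a', 3))],
        #            lambda k, v: [v],
        #            lambda k, vs: [(k, sum(vs))])
        # -> [('a', 4), ('b', 2)]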
  • 17. Mesh point variance in MapReduce. [Figure: three simulation runs, each with mesh snapshots at T=1, T=2, T=3.]
  • 18. 1. Each mapper outputs the mesh points with the same key. 2. Shuffle moves all values from the same mesh point to the same reducer. 3. Reducers just compute a numerical variance.
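    A dumbo-style sketch of this job, assuming a hypothetical record format in which each input line is "<mesh-point-id> <value>", one line per run and timestep:

        #!/usr/bin/env dumbo
        def mapper(key, value):
            point, val = value.split()
            yield point, float(val)    # same mesh point -> same key

        def reducer(point, values):
            vals = list(values)
            n = len(vals)
            mean = sum(vals) / n
            # population variance over all runs/timesteps at this point
            yield point, sum((v - mean) ** 2 for v in vals) / n

        if __name__ == '__main__':
            import dumbo
            dumbo.run(mapper, reducer)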
  • 19. MapReduce vs. Hadoop. MapReduce: a computation model with Map (a local data transform), Shuffle (a grouping function), and Reduce (an aggregation). Hadoop: an implementation of MapReduce using the HDFS parallel file-system. Others: Phoenix++, Twisted, Google MapReduce, Spark, …
  • 20. Why so many limitations?
  • 21. Data scalability. [Figure: map tasks running where the data lives, feeding reducers through the shuffle.] The idea: bring the computations to the data. MR can schedule map functions without moving data.
  • 22. Mesh point variance in MapReduce, again: each mapper outputs the mesh points with the same key; shuffle moves all values from the same mesh point to the same reducer; reducers just compute a numerical variance. Bring the computations to the data!
  • 23. Heartbreak on node rs252: after waiting in the queue for a month and after 24 hours of finding eigenvalues, one node randomly hiccups.
  • 24. Fault tolerant. [Figure: input stored in triplicate; map output persisted to disk before the shuffle; reduce input/output on disk.] Redundant input helps make maps data-local. Just one type of communication: the shuffle.
  • 25. Fault injection. [Plot: time to completion (sec) vs. 1/Prob(failure), the mean number of successes per failure, for 200M-by-200 and 800M-by-10 matrices, with and without faults.] With 1/5 of tasks failing, the job only takes twice as long.
  • 26. Diving into Hadoop (with Python).
  • 27. Tools I like: hadoop streaming, dumbo, mrjob, hadoopy, C++.
  • 28. Tools I don’t use but other people seem to like: Pig, Java, HBase, Eclipse, Cassandra.
  • 29. hadoop streaming: the map function is a program; (key, value) pairs are sent via stdin; output (key, value) pairs go to stdout. The reduce function is a program; (key, value) pairs are sent via stdin; keys are grouped; output (key, value) pairs go to stdout.
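    For example, a minimal streaming mapper/reducer pair, as a sketch: it sums the columns of a matrix stored one row per line, and the reducer relies only on the fact that its stdin arrives sorted by key:

        #!/usr/bin/env python
        # mapper.py -- streaming sends each input line on stdin;
        # emit "<column-index><TAB><value>" for every entry of the row
        import sys
        for line in sys.stdin:
            for col, v in enumerate(line.split()):
                print('%d\t%s' % (col, v))

        #!/usr/bin/env python
        # reducer.py -- the shuffle sorts by key, so identical keys arrive
        # as consecutive lines; sum each run of lines with the same key
        import sys
        cur, total = None, 0.0
        for line in sys.stdin:
            key, val = line.rstrip('\n').split('\t', 1)
            if key != cur:
                if cur is not None:
                    print('%s\t%f' % (cur, total))
                cur, total = key, 0.0
            total += float(val)
        if cur is not None:
            print('%s\t%f' % (cur, total))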
  • 30. dumbo: a wrapper around hadoop streaming for map and reduce functions in Python.

        #!/usr/bin/env dumbo
        def mapper(key, value):
            """Each record is a line of text.
            key = <byte offset where the line starts in the file>
            value = <line of text>
            """
            valarray = [float(v) for v in value.split()]
            yield key, sum(valarray)

        if __name__ == '__main__':
            import dumbo
            import dumbo.lib
            dumbo.run(mapper, dumbo.lib.identityreducer)
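    A sketch of how one might launch the job and inspect its output; the paths and names are placeholders, and -hadoop should point at your Hadoop install:

        dumbo start rowsums.py -hadoop $HADOOP_HOME \
            -input matrix.txt -output rowsums
        dumbo cat rowsums -hadoop $HADOOP_HOME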
  • 31. Synthetic data test: a 100,000,000-by-500 matrix (~500 GB); computing the R in a QR factorization. How can Hadoop streaming possibly be fast? Codes implemented in MapReduce streaming; the matrix is stored as TypedBytes lists of doubles; the Python frameworks use NumPy+ATLAS; a custom C++ TypedBytes reader/writer with ATLAS; a new Java implementation too.

                 Iter 1 QR (secs.)  Iter 1 Total (secs.)  Iter 2 Total (secs.)  Overall Total (secs.)
        Dumbo    67725              960                   217                   1177
        Hadoopy  70909              612                   118                   730
        C++      15809              350                   37                    387
        Java                        436                   66                    502

    C++ in streaming beats a native Java implementation. (All timing results from the Hadoop job tracker.)
  • 32. Demo 1: 1. generate data; 2. get data to Hadoop; 3. run row sums; 4. see row sums!
  • 33. How does Hadoop know that key = byte offset in file and value = line of text? InputFormat: maps a file on HDFS to (key, value) pairs. TextInputFormat: maps a text file to (<byte offset>, <line>) pairs.
  • 34. The Hadoop Distributed File System (HDFS) and a big text file. HDFS stores files in 64 MB chunks. Each chunk is a FileSplit. FileSplits are stored in parallel. An InputFormat converts FileSplits into a sequence of key-value records. FileSplits can cross record borders (a small bit of communication).
  • 35. Tall-and-skinny matrix storage in MapReduce. A: m x n, m ≫ n. The key is an arbitrary row-id; the value is the 1 x n array for a row. Each submatrix Ai is an InputSplit (the input to a map task).
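    As a sketch (not from the slides), a small dumbo job that converts a text matrix into such (row-id, row) records; a random row-id works because the key is arbitrary, the same trick the TSQR code uses later:

        #!/usr/bin/env dumbo
        import random

        def mapper(key, value):
            # one text line -> one (arbitrary row-id, 1 x n array) record
            row = [float(v) for v in value.split()]
            yield random.randint(0, 2000000000), row

        if __name__ == '__main__':
            import dumbo
            import dumbo.lib
            dumbo.run(mapper, dumbo.lib.identityreducer)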
  • 36. hadoop: output the row-sum for all local rows. MPI: parallel load; for my-batch-of-rows, compute the row-sum; parallel save.
  • 37. Isn’t reading and writing text files rather inefficient?
  • 38. Sequence Files and OutputFormat. SequenceFile: an internal Hadoop file format that stores (key, value) pairs efficiently; used between map and reduce steps. OutputFormat: maps (key, value) pairs to output on disk. TextOutputFormat: maps (key, value) pairs to key<TAB>value strings.
  • 39. typedbytes: a simple binary serialization scheme, [<1-byte-type-flag> <binary-value>]*, roughly equivalent to JSON. (Optionally) used to communicate to and from Hadoop streaming.
  • 40. typedbytes example:

        def _read(self):
            # read the 1-byte type flag, then dispatch on it
            t = unpack_type(self.file.read(1))[0]
            self.t = t
            return self.handler_table[t](self)

        def read_vector(self):
            r = self._read   # bound method; each element carries its own flag
            count = unpack_int(self.file.read(4))[0]
            return tuple(r() for i in xrange(count))
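    Writing is symmetric. A sketch of a complementary writer for a vector of doubles, using the type flags from the typedbytes format (8 = vector, 6 = double; everything big-endian):

        import struct

        def write_double_vector(f, vec):
            # vector flag (8) followed by a 4-byte element count
            f.write(struct.pack('>bi', 8, len(vec)))
            for x in vec:
                # each element carries its own flag: double (6) + 8 bytes
                f.write(struct.pack('>bd', 6, x))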
  • 41. Demo 2: column sums.
  • 42. Column sums in dumbo:

        #!/usr/bin/env dumbo
        def mapper(key, value):
            """Each record is a line of text."""
            valarray = [float(v) for v in value.split()]
            for col, val in enumerate(valarray):
                yield col, val

        def reducer(col, values):
            yield col, sum(values)

        if __name__ == '__main__':
            import dumbo
            import dumbo.lib
            dumbo.run(mapper, reducer)
  • 43. Isn’t this just moving the data to the computation? Yes. It seems much worse than MPI. MPI: parallel load; for my-batch-of-rows, update the sum of each column; parallel-reduce the partial column sums; parallel save.
  • 44. The MapReduce programming model, extended. Input: a list of (key, value) pairs. Map: apply a function f to all pairs. Combine: apply g to local values with key k. Shuffle: group all pairs with key k together. Reduce: apply a function g to all values with key k. Output: a list of (key, value) pairs.
  • 45. Column sums in dumbo, with a combiner:

        #!/usr/bin/env dumbo
        def mapper(key, value):
            """Each record is a line of text."""
            valarray = [float(v) for v in value.split()]
            for col, val in enumerate(valarray):
                yield col, val

        def reducer(col, values):
            yield col, sum(values)

        if __name__ == '__main__':
            import dumbo
            import dumbo.lib
            dumbo.run(mapper, reducer, combiner=reducer)
  • 46. How many mappers and reducers? The number of maps is the number of InputSplits. You choose how many reducers. Each reducer outputs to a separate file.
  • 47. Demo 3: column sums with multiple reducers.
  • 48. Which reducer does my key go to? Partitioner: maps a given key to a reducer. HashPartitioner: distributes keys pseudo-randomly, by hashing.
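    The idea behind HashPartitioner, as a one-function Python sketch (not Hadoop's exact hash function):

        def partition(key, num_reducers):
            # the same key always lands on the same reducer;
            # distinct keys spread roughly evenly across reducers
            return hash(key) % num_reducers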
  • 49. Sparse matrix methods.
  • 50. Storing a matrix by rows. [Figure: a weighted directed graph, its sparse adjacency matrix, and its compressed sparse row (rp, ci, ai) and compressed sparse column (cp, ri, ai) arrays.] Row-wise storage keeps each row's list of (column, value) pairs: Row 1: (2,16.) (3,13.); Row 2: (3,10.) (4,12.); Row 3: (2,4.) (5,14.); Row 4: (3,9.) (6,20.); Row 5: (4,7.) (6,4.); Row 6: empty.
  • 51. Storing a matrix by rows in a text file: each row's (column, value) list becomes one record. [Same figure.]
  • 52. Sparse matrix-vector product: [Ax]_i = Σ_j A_ij x_j. [Figure: the matrix, stored by rows, next to the vector x.] To make this work, we need to get the value of the vector to the same function as the column of the matrix.
  • 53. [Same figure.] We need to “join” the matrix representation and the vector based on the column.
  • 54. Sparse matrix-vector product takes two MR tasks. Task 1 sees two types of records (“one of these values is not like the others”). Map: if vector, emit (row, vecval); if matrix, for each non-zero (row, col, val) emit (col, (row, val)). Reduce: find vecval among the inputs; for each (col, (row, val)) emit (row, val*vecval), forming A_ij x_j for each nonzero. Task 2 regroups the data by rows and computes the sums: reduce over (row, [A_ij x_j, …]) and emit (row, sum(A_ij x_j)).
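    A dumbo-style sketch of task 1, assuming a hypothetical tagged text format ("x <row> <value>" for vector entries, "A <row> <col> <value>" for matrix nonzeros). Note that the reducer buffers a column's nonzeros until the vector value appears, which is exactly the problem the next slide raises:

        #!/usr/bin/env dumbo
        def mapper(key, value):
            parts = value.split()
            if parts[0] == 'x':          # vector entry x_j, keyed by j
                yield int(parts[1]), ('x', float(parts[2]))
            else:                        # nonzero A_ij, keyed by column j
                yield int(parts[2]), ('A', int(parts[1]), float(parts[3]))

        def reducer(col, values):
            xval, nonzeros = 0.0, []
            for v in values:             # buffers until x_col shows up
                if v[0] == 'x':
                    xval = v[1]
                else:
                    nonzeros.append((v[1], v[2]))
            for row, a in nonzeros:
                yield row, a * xval      # one A_ij * x_j term

        if __name__ == '__main__':
            import dumbo
            dumbo.run(mapper, reducer)

    Task 2 is then just the column-sums pattern again: regroup these (row, A_ij * x_j) pairs by row and sum them.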
  • 55. What about a “dense” row? Map: if vector, emit (row, vecval); if matrix, for each non-zero (row, col, val) emit (col, (row, val)). Reduce: find vecval among the input values; for each (col, (row, val)) emit (row, val*vecval), forming A_ij x_j for each nonzero. The problem: how do we find vecval without looking through (and buffering) all the input?
  • 56. Sparse matrix-vector product, take two. Map: if vector, emit ((row, -1), vecval); if matrix, for each non-zero (row, col, val) emit ((col, 0), (row, val)). Use a custom partitioner to make sure that (row, *) keys all get mapped to the same reducer, and that we always see (row, -1) before (row, 0). Reduce: find vecval in the input keys; for each (col, (row, val)) emit (row, val*vecval), forming A_ij x_j for each nonzero. Then regroup the data by rows and compute the sums.
  • 57. Demo 4: sparse matrix-vector products.
  • 58. [Image-only slide.]
  • 59. Matrix factorizations.
  • 60. TSQR in MapReduce. Algorithm: data = rows of a matrix; map = QR factorization of rows; reduce = QR factorization of rows. [Figure: Mapper 1 runs serial TSQR on blocks A1–A4, alternately factoring (qr) and stacking the R factors, then emits R4; Mapper 2 does the same for A5–A8 and emits R8; Reducer 1 runs serial TSQR on R4 and R8 and emits the final R.]
  • 61. In hadoopy; the full code:

        import random, numpy, hadoopy

        class SerialTSQR:
            def __init__(self, blocksize, isreducer):
                self.bsize = blocksize
                self.data = []
                if isreducer:
                    self.__call__ = self.reducer
                else:
                    self.__call__ = self.mapper

            def compress(self):
                # local QR factorization of the buffered rows; keep only R
                R = numpy.linalg.qr(numpy.array(self.data), 'r')
                # reset data and re-initialize to R
                self.data = []
                for row in R:
                    self.data.append([float(v) for v in row])

            def collect(self, key, value):
                self.data.append(value)
                if len(self.data) > self.bsize * len(self.data[0]):
                    self.compress()

            def close(self):
                self.compress()
                for row in self.data:
                    key = random.randint(0, 2000000000)
                    yield key, row

            def mapper(self, key, value):
                self.collect(key, value)

            def reducer(self, key, values):
                for value in values:
                    self.mapper(key, value)

        if __name__ == '__main__':
            mapper = SerialTSQR(blocksize=3, isreducer=False)
            reducer = SerialTSQR(blocksize=3, isreducer=True)
            hadoopy.run(mapper, reducer)
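    A sketch of launching the script with hadoopy (assuming the input matrix already sits on HDFS as (key, row) records; the paths are placeholders):

        import hadoopy
        # run tsqr.py over the matrix and write the rows of the R factor
        hadoopy.launch('mat-500g', 'mat-500g-qrr', 'tsqr.py')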
  • 62. Related resources. Apache Mahout: machine learning for Hadoop … lots of matrices there … Another fantastic tutorial: http://www.eurecom.fr/~michiard/teaching/webtech/tutorial.pdf
  • 63. Way too much stuff! I hope to keep expanding this tutorial over the week … keep checking the git repo.
