An Introduction to Applying MapReduce and Hadoop for Scientific Data Mining

1. An Introduction to the World of Hadoop: Applications to Scientific Data Mining
MapReduce / Hadoop for Scientific Data Mining | Hadoop = Open Source MapReduce | Wider World of Hadoop
Gordon Rios, g.rios@4c.ucc.ie
Cork Constraint Computation Centre (4C), University College Cork
October 29, 2010
Gordon Rios, Introduction to Hadoop

2. Outline
1 MapReduce / Hadoop for Scientific Data Mining
  - Objectives for the Talk
  - MapReduce as Simplified Parallel Computing
  - Thinking in Terms of MapReduce
2 Hadoop is Open Source MapReduce
  - Basics of Hadoop
  - Hadoop Examples
  - Developing Production Systems with Hadoop
3 Wider World of Hadoop
  - Ad Hoc Analysis with Hadoop
  - Further Reading

4. Objectives
At the end of this talk I want you to have ideas for how to apply MapReduce to your domains and confidence that Hadoop is a good way to do it.
- Introduce thinking in terms of MapReduce and why it's a good idea
- Introduce Hadoop as an open source implementation of MapReduce
- Present a detailed example of using the Hadoop streaming API for a scientific data mining task
- Discuss higher-level notions for performing ad hoc analysis and building systems with Hadoop

9. MapReduce
MapReduce is distributed computing where we take advantage of data locality to push the computation to the data.
- Distributed computing: clusters of computers, each with local memory and disk (network intensive for big data)
- Parallel computing: multiple CPUs processing over shared memory and a shared filesystem
If we can decompose the problem into independent map and reduce tasks, we can achieve "easy" parallelism with MapReduce:
1 Map works independently to convert input data to key-value pairs.
2 Reduce works independently on all values for a given key and transforms them to a single output set (possibly even just the empty set ∅) per key.
Now, let's expand that a bit.

15. Basic Elements of MapReduce
MapReduce is a distributed sort with specific places to insert application logic:
- an input reader: reads work data W from the file system [1] and produces a set of splits S: W → S
- a Map function: S → (K, V)
- a combiner function: a mapper optimization
- a partition function: partitions [2] keys k ∈ K to reducers: K → R
- a compare function cmp(k_i, k_j): sorts the keys presented to each reducer
- a Reduce function: reduces the output from all mappers for a particular key k to another set of values w_k for that key: (k, V) → (k, w_k)
- an output writer: writes output to the file system
[1] A distributed file system (DFS) for stability and scale
[2] The default hashes keys modulo the number of reducers
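The elements listed on this slide can be sketched as a small in-memory simulation. This is only an illustration of the data flow, not how Hadoop is implemented: the names `partition` and `run_mapreduce` are hypothetical, CRC32 stands in for the default key hash, and there is no input reader, combiner, or output writer.

```python
from collections import defaultdict
import zlib


def partition(key, num_reducers):
    """Assign a key to a reducer: a stable hash of the key modulo the reducer count."""
    return zlib.crc32(key.encode("utf-8")) % num_reducers


def run_mapreduce(records, map_fn, reduce_fn, num_reducers=2):
    """Run map, partition, sort, and reduce over an in-memory list of records."""
    buckets = [defaultdict(list) for _ in range(num_reducers)]
    for record in records:
        # Map: each record yields zero or more (key, value) pairs.
        for key, value in map_fn(record):
            # Partition: every pair with the same key goes to the same reducer.
            buckets[partition(key, num_reducers)][key].append(value)

    output = []
    for bucket in buckets:
        # Sort: each reducer sees its keys in sorted order.
        for key in sorted(bucket):
            # Reduce: all values for one key become one output for that key.
            output.append((key, reduce_fn(key, bucket[key])))
    return output


def max_temp_map(record):
    """Map a 'year,temperature' record to a (year, temperature) pair."""
    year, temp = record.split(",")
    yield year, int(temp)


def max_temp_reduce(key, values):
    """Reduce all temperatures for one year to their maximum."""
    return max(values)


# Made-up readings for illustration only.
readings = ["1949,111", "1950,0", "1949,78", "1950,22", "1950,-11"]
result = dict(run_mapreduce(readings, max_temp_map, max_temp_reduce))
print(result)
```

Because the application logic lives only in `max_temp_map` and `max_temp_reduce`, changing the number of reducers changes where keys are processed but not the result.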

17. Examples of Map and Reduce
Let's start with a few examples of Map:
- Word Count: read in a stream of text (e.g. a document or a set of documents) and emit each word as a key with a value of 1
- Inverted Index: read in a stream of documents and emit each word as a key and the document ID as the value
- Max Temperature: read in formatted data and emit the year as a key with the temperature as the value
- Mean Rain Precipitation: read in daily data and emit (year-month, lat, long) as a key with the precipitation as the value
Reduce in these cases simply applies a count, list, max, or average, respectively, to the set of values for each key. [Dean, Ghemawat, 2008, Wikipedia, 2010, White, 2011]
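As a minimal sketch of the Word Count case, the map step below emits (word, 1) pairs and the reduce step sums the values per key. The function names are illustrative, and a plain `sorted` call stands in for the shuffle and sort that the framework would perform between the two steps.

```python
from itertools import groupby


def word_count_map(lines):
    """Map: emit (word, 1) for every word in the input text."""
    for line in lines:
        for word in line.lower().split():
            yield word, 1


def word_count_reduce(sorted_pairs):
    """Reduce: sum the values for each key; the input must be sorted by key."""
    for word, group in groupby(sorted_pairs, key=lambda pair: pair[0]):
        yield word, sum(count for _, count in group)


text = ["the quick brown fox", "the lazy dog", "the fox"]
pairs = sorted(word_count_map(text))  # stands in for the shuffle/sort phase
counts = dict(word_count_reduce(pairs))
print(counts)
```

Replacing the sum with a list of document IDs, a max, or an average gives the other three examples without changing the map/sort/reduce shape.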

18. Visualizing Word Count
[Figure] Source: Chris Wensel, http://www.cascading.org

20. Engineering Intermezzo
This is how easy it is to get Hadoop installed, given that you have Java 6 installed already.
Get Hadoop: http://hadoop.apache.org/

    % tar xzf hadoop-x.y.z.tar.gz
    % export HADOOP_INSTALL=BUILD_DIR/hadoop-x.y.z
    % export PATH=$PATH:$HADOOP_INSTALL/bin

21. MapReduce with Hadoop and the streaming library
Now, let's take a closer look at how Hadoop implements MapReduce, from [White, 2011].

22. Hadoop Streaming Library
We'll focus on the streaming library as it's the most natural for scientific or technical computing. Let's look at the Definitive Guide's weather example.

24. Hadoop Book Examples
More examples from Hadoop: The Definitive Guide, 2nd Edition (Hadoop 20.1), http://www.hadoopbook.com/. Here's how to install and try them for yourself:
- Install Git: http://git-scm.com/
- Visit GitHub for the book code: http://github.com/tomwhite/hadoop-book/
- Check out the code examples from The Definitive Guide:

    % cd BUILD_DIR
    % git clone http://github.com/tomwhite/hadoop-book.git hadoop-book

25. Example: ECA Mean Precipitation
Let's compute mean precipitation at over 2,000 weather stations and make some graphics. There are 2,186 files with a median of 21,875 lines each, a minimum of 1,025 and a maximum of 78,090.
ECA Daily Data: "The ECA dataset contains series of daily observations at meteorological stations throughout Europe and the Mediterranean. Part of the dataset is freely available for non-commercial research. Note that a gridded version with daily temperature and precipitation fields is also available."
Source: http://eca.knmi.nl/dailydata/index.php
File Format

    FILE FORMAT (MISSING VALUE CODE = -9999):
    01-06 STAID : Station identifier
    08-13 SOUID : Source identifier
    15-22 DATE  : Date YYYYMMDD
    24-28 RR    : Precipitation amount in 0.1 mm
    30-34 Q_RR  : quality code for RR (0='valid'; 1='suspect'; 9='missing')
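To make the record layout concrete, here is a hedged sketch of parsing one data line. It assumes the five columns are separated by commas (as the mapper in this talk does when it splits on ","); the function name `parse_rr_line`, the returned dictionary, and the sample lines are illustrative, not part of the ECA distribution.

```python
MISSING = -9999  # missing value code from the file format header


def parse_rr_line(line):
    """Parse one 'STAID,SOUID,DATE,RR,Q_RR' record; return None if unusable."""
    fields = [f.strip() for f in line.split(",")]
    if len(fields) != 5:
        return None  # not a data record
    staid, souid, date, rr, q_rr = fields
    if not (date.isdigit() and len(date) == 8):
        return None  # e.g. the column header line
    try:
        rr_tenths = int(rr)
    except ValueError:
        return None
    if q_rr != "0" or rr_tenths == MISSING:
        return None  # keep only observations flagged valid
    # RR is stored in units of 0.1 mm, so divide by 10 to get millimetres.
    return {"staid": staid, "yyyymm": date[:6], "precip_mm": rr_tenths / 10.0}


# Made-up lines laid out in the documented column positions.
valid_line = "    11,   100,20090101,   12,    0"
missing_line = "    11,   100,20090102,-9999,    9"
print(parse_rr_line(valid_line))
print(parse_rr_line(missing_line))
```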

26. Example: ECA Mean Precipitation
Scientific Data Mining: use the Hadoop streaming library and manually pipeline MapReduce jobs together as needed.
- Write Hadoop scripts in Python in two steps
- Test with: cat data | map.py | sort | reduce.py > output (not shown)
- Process the data into individual files for each time period (year/month) of interest using the Hadoop streaming library (local mode)
- Call R in batch mode to produce image files
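The local test pipeline `cat data | map.py | sort | reduce.py` can also be emulated inside Python, which is convenient for unit-testing the logic before running anything under Hadoop. The sketch below is an illustration under assumptions: `run_locally`, `toy_mapper`, and `toy_reducer` are hypothetical names, and the toy job simply sums values per key.

```python
def run_locally(lines, mapper, reducer):
    """Emulate `cat data | map.py | sort | reduce.py` on in-memory lines."""
    mapped = list(mapper(lines))   # map.py: input lines in, "key\tvalue" lines out
    mapped.sort()                  # sort: brings lines with equal keys together
    return list(reducer(mapped))   # reduce.py: sorted lines in, results out


def toy_mapper(lines):
    """Turn 'station,amount' lines into tab-separated key/value lines."""
    for line in lines:
        station, amount = line.strip().split(",")
        yield "%s\t%s" % (station, amount)


def toy_reducer(sorted_lines):
    """Sum the values of consecutive lines that share a key."""
    last_key, total = None, 0
    for line in sorted_lines:
        key, val = line.split("\t")
        if last_key is not None and key != last_key:
            yield "%s\t%d" % (last_key, total)  # key changed: emit the finished key
            total = 0
        last_key, total = key, total + int(val)
    if last_key is not None:
        yield "%s\t%d" % (last_key, total)      # emit the final key


data = ["b,2", "a,1", "b,3", "a,4"]
output = run_locally(data, toy_mapper, toy_reducer)
print(output)
```

The reducer relies only on equal keys arriving consecutively, which is exactly the guarantee the sort step provides.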

27. ECA Mean Precipitation: Step One
map_one.py

    import sys

    # BEGIN_DATE, END_DATE (YYYYMMDD strings) and latlons (station id -> (lat, lon))
    # are defined elsewhere in the original script and are not shown on the slide.

    def lat_lon_to_coord(dms):
        # Convert a "degrees:minutes:seconds" string to signed decimal degrees.
        d, m, s = map(float, dms.split(":"))
        # Take the sign from the text: "-00:30:00" parses to d == -0.0, and -0.0 < 0 is False.
        sign = -1 if dms.strip().startswith("-") else 1
        return float(sign * (abs(d) + m / 60.0 + s / 3600.0))

    for line in sys.stdin:
        # flds = (staid, souid, date, rr, q_rr)
        flds = line.strip().split(",")
        if len(flds) != 5:
            continue
        staid = flds[0].strip()  # station id
        date = flds[2].strip()   # YYYYMMDD
        if date < BEGIN_DATE or date > END_DATE:
            continue
        rr = flds[3].strip()     # precipitation in 0.1 mm
        q_rr = flds[4].strip()   # quality code "0" = valid
        lat, lon = latlons.get(staid, (None, None))
        if q_rr == '0' and (lat is not None) and (lon is not None):
            # key = "yyyymm,lat,lon", value = precipitation
            print("%s,%.4f,%.4f\t%s" % (date[0:6], lat, lon, rr))
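The mapper looks up each station in a `latlons` table whose construction is not shown on the slide. The sketch below is one way such a table might be built, and it is an assumption throughout: the six-column station layout (`STAID,NAME,CN,LAT,LON,HGHT` with signed degrees:minutes:seconds), the function names, and the station rows are all hypothetical. It also shows why the sign is best read from the text: a test on the numeric degrees misses values such as -000:30:00, because -0.0 < 0 is false.

```python
def dms_to_degrees(dms):
    """Convert a signed 'DD:MM:SS' string to decimal degrees."""
    d, m, s = (float(part) for part in dms.strip().split(":"))
    # Take the sign from the text: float("-000") is -0.0, and -0.0 < 0 is False.
    sign = -1.0 if dms.strip().startswith("-") else 1.0
    return sign * (abs(d) + m / 60.0 + s / 3600.0)


def load_latlons(lines):
    """Build {staid: (lat, lon)} from hypothetical 'STAID,NAME,CN,LAT,LON,HGHT' lines."""
    latlons = {}
    for line in lines:
        fields = [f.strip() for f in line.split(",")]
        if len(fields) != 6 or not fields[0].isdigit():
            continue  # skip headers and malformed lines
        latlons[fields[0]] = (dms_to_degrees(fields[3]), dms_to_degrees(fields[4]))
    return latlons


# Made-up station rows for illustration only.
stations = [
    "STAID,STANAME,CN,LAT,LON,HGHT",
    "    1,STATION A,XX,+41:30:00,-000:30:00,  10",
    "    2,STATION B,XX,+48:45:00,+002:15:00,  50",
]
latlons = load_latlons(stations)
print(latlons)
```

Here 41:30:00 becomes 41 + 30/60 = 41.5 degrees, and -000:30:00 becomes -0.5 degrees rather than +0.5.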

28. ECA Mean Precipitation: Step One (cont)
reduce_one.py

    import sys

    # Running sum (x) and count (n) for the current key.
    (last_key, x, n) = (None, 0.0, 0)
    for line in sys.stdin:
        (key, val) = line.strip().split("\t")
        if last_key and last_key != key:
            # time to emit reduced value
            if n > 0:
                print("%s\t%.2f" % (last_key, x / n))
            x = 0.0
            n = 0
        # we just want data for the year 2009
        (last_key, x, n) = (key, x + float(val), n + 1)
    # emit the mean for the final key
    if last_key:
        if n > 0:
            print("%s\t%.2f" % (last_key, x / n))
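The job in this talk runs without a combiner. If one were added as a mapper-side optimization, a mean could not simply be averaged again, since a mean of partial means is wrong when the groups differ in size. A common workaround, sketched here with hypothetical names and made-up values, is to carry (sum, count) pairs and divide only at the end.

```python
def combine(pairs):
    """Combiner: collapse (key, value) pairs from one mapper into (key, (sum, count))."""
    partial = {}
    for key, value in pairs:
        total, count = partial.get(key, (0.0, 0))
        partial[key] = (total + value, count + 1)
    return sorted(partial.items())


def reduce_mean(partials):
    """Reducer: merge (sum, count) partials per key and emit the overall mean."""
    merged = {}
    for key, (total, count) in partials:
        t, c = merged.get(key, (0.0, 0))
        merged[key] = (t + total, c + count)
    return {key: t / c for key, (t, c) in merged.items() if c > 0}


# Two mappers saw different numbers of observations for the same key.
mapper_a = [("200901,41.5,-0.5", 10.0), ("200901,41.5,-0.5", 20.0)]
mapper_b = [("200901,41.5,-0.5", 60.0)]
means = reduce_mean(combine(mapper_a) + combine(mapper_b))
print(means)
```

The result is (10 + 20 + 60) / 3 = 30.0, whereas averaging the two per-mapper means (15.0 and 60.0) would give 37.5.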

29. ECA Mean Precipitation: Step Two
Map: ((yyyymm,lat,lon), mean) -> (yyyymm, (lat,lon,mean))
map_two.py

    import sys

    for line in sys.stdin:
        yyyymm_lat_lon, mean_precip = line.strip().split("\t")
        yyyymm, lat, lon = yyyymm_lat_lon.strip().split(",")
        # new key = yyyymm, new value = "lat lon mean_precip"
        print("%s\t%s %s %s" % (yyyymm, lat, lon, mean_precip))

The reduce is otherwise empty: it just writes each key's values to a local file (a hack, since we're running locally).
reduce_two.py

    import sys

    # write_file(key, values) is a helper from the original script (not shown on the slide).
    last_key = None
    values = []
    for line in sys.stdin:
        (key, val) = line.strip().split("\t")
        if last_key and last_key != key:
            # time to emit reduced value
            write_file(last_key, values)
            values = []
        last_key = key
        values.append(val)  # val is a string with three values
    if last_key:
        write_file(last_key, values)
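The `write_file` helper called by reduce_two.py is not shown on the slide. A minimal sketch of what it might look like follows, assuming one output file per key named `<yyyymm>.dat` (the names the R batch script later reads) with one "lat lon mean" line per value. The `out_dir` parameter and the sample values are hypothetical additions for illustration.

```python
import os
import tempfile


def write_file(key, values, out_dir="."):
    """Write one line per value to <out_dir>/<key>.dat and return the path."""
    path = os.path.join(out_dir, "%s.dat" % key)
    with open(path, "w") as handle:
        for value in values:
            handle.write(value + "\n")  # value is already "lat lon mean_precip"
    return path


# Write made-up values for one month into a scratch directory.
out_dir = tempfile.mkdtemp()
path = write_file("200901", ["41.1187 1.2453 35.20", "48.8900 2.3508 41.75"], out_dir)
with open(path) as handle:
    contents = handle.read()
print(contents)
```

Writing from inside a reducer like this only makes sense in local mode; on a cluster the files would land on whichever node ran the reducer.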

31. Example: ECA Mean Precipitation
Step One: input -> (yyyymm,lat,lon), mean precip

    % hadoop jar /Users/gordon/build/hadoop/contrib/streaming/hadoop-0.20.2-streaming.jar \
        -input /Downloads/tarragona/ECA_blend_all_data/RR_STAID00* \
        -output output \
        -mapper /Desktop/tmp/tarragona/python/map_one.py \
        -reducer /Desktop/tmp/tarragona/python/reduce_one.py
    % 10/10/25 19:34:31 INFO jvm.JvmMetrics: Initializing JVM Metrics with processName=JobTracker, sessionId=
    ...
    % 10/10/25 20:07:36 INFO streaming.StreamJob: Job complete: job_local_0001
    % 10/10/25 20:07:36 INFO streaming.StreamJob: Output: output

Step Two: (yyyymm,lat,lon), mean precip -> files (one per yyyymm)

    % hadoop jar /Users/gordon/build/hadoop/contrib/streaming/hadoop-0.20.2-streaming.jar \
        -input output/part-00000 \
        -output output_two \
        -mapper /Desktop/tmp/tarragona/python/map_two.py \
        -reducer /Desktop/tmp/tarragona/python/reduce_two.py
    % 10/10/25 20:41:43 INFO jvm.JvmMetrics: Initializing JVM Metrics with processName=JobTracker, sessionId=
    ...
    % 10/10/25 20:41:47 INFO streaming.StreamJob: Job complete: job_local_0001
    % 10/10/25 20:41:47 INFO streaming.StreamJob: Output: output_two

33. Batch Processing in R
And, after a little batch processing with R:
batch-graphics.R

    library(fields)
    files <- c("200901.dat", "200902.dat", "200903.dat", "200904.dat",
               "200905.dat", "200906.dat", "200907.dat", "200908.dat",
               "200909.dat", "200910.dat", "200911.dat", "200912.dat")
    i <- 1
    for (f in files) {
      mat <- read.table(f)
      names(mat) <- c("lat", "long", "precip")
      png(filename=paste("precip-", i, ".png", sep=""), height=480, width=480)
      quilt.plot(mat$long, mat$lat, mat$precip, ncol=100, nrow=100,
                 ylim=c(22.0, 79.0), xlim=c(-52.0, 72.0),
                 col=two.colors(256, start="wheat", end="darkblue", middle="blue"),
                 zlim=c(0, 410), add.legend=T, cex.lab=0.6)
      points(1.2453, 41.1187, pch=1)
      text(1.2453, 41.1187, "tarragona", cex=0.8, pos=1)
      points(2.35083, 48.89, pch=1)
      text(2.3508, 48.89, "paris", cex=0.8, pos=4)
      points(12.4823, 41.8955, pch=1)
      text(12.4823, 41.8955, "rome", cex=0.8, pos=4)
      dev.off()
      i <- i + 1
    }

34. ECA Precipitation 2009, Month: 1
[Figure: precipitation map for month 1 of 2009]

46. Summary of What We Did
We worked through a complete example, but that's not all: with very little additional work we can
- Test the scripts in pseudo-distributed mode locally on our own machine
- Run the job on a compute cluster remotely
- Run the job in the cloud with EC2, treating the system as just another remote cluster
- Run the job with Amazon's Elastic MapReduce (http://aws.amazon.com/elasticmapreduce/), which allows you to pay for exactly as much computing as you use
See [White, 2011] for complete details on how to run in these different modes.

48. Systems Development APIs
And you can build production systems with Hadoop in either Java or C++:
- A full-featured Java API for Hadoop
- Pipes, the C++ API for Hadoop MapReduce
- Cascading, an API for developing general data processing systems that incorporate MapReduce (http://www.cascading.org)

51. Cascading
Cascading allows developers to model very complex data flows at a higher level than Map and Reduce, and then automatically generate and visualize the dozens or perhaps even hundreds of necessary Hadoop jobs as a graph.

53. Ad Hoc Analysis
What's missing? Sometimes you need to do fast ad hoc queries. Can we do that in a scalable way?
- Pig: "Pig is a scripting language for exploring large datasets" [White, 2011] (Yahoo!)
- Hive: provides an SQL interface for running ad hoc queries and other data processing tasks for SQL analysts (Facebook)
- HBase: column-oriented database along the lines of Google's Bigtable database (Powerset)
- Hypertable: GPL clone of Google's Bigtable database written in C++ (Zvents)
Google's Bigtable database is described in [Chang, Dean, Ghemawat, et al., 2008].

58. Interesting Application Frameworks with Hadoop
Here are a few examples of frameworks in development or already available that use Hadoop as a platform:
- Apache Mahout: ambitious project to implement popular machine learning algorithms and recommenders with Hadoop [3]
- Graph: Jake Hofman from Yahoo Research has released some of his work on large-scale network analysis with Hadoop, with prototype code [4]. Also see [Vassilvitskii, 2010] for related graph analysis research.
- Application to GIS: Nathan Kerr's M.S. thesis, with lots of details on how to do GIS with Hadoop [5]
[3] http://mahout.apache.org/
[4] http://github.com/jhofman/icwsm2010_tutorial
[5] http://www.nathankerr.com/projects/parallel-gis-processing/alternative_approaches_to_parallel_gis_processing.html

60. Further Reading
- White, T. Hadoop: The Definitive Guide, 2nd Edition. O'Reilly Media, Inc., Sebastopol, CA, 2011.
- Sanderson, D. Programming Google App Engine. O'Reilly Media, Inc., Sebastopol, CA, 2009.
- Murty, J. Programming Amazon Web Services. O'Reilly Media, Inc., Sebastopol, CA, 2008.
- Dean, J. and Ghemawat, S. MapReduce: simplified data processing on large clusters. Communications of the ACM, 51(1):107–113, 2008.
- Chang, F., Dean, J., Ghemawat, S., Hsieh, W. C., Wallach, D. A., Burrows, M., Chandra, T., Fikes, A., and Gruber, R. E. Bigtable: a distributed storage system for structured data. OSDI '06: Proceedings of the 7th USENIX Symposium on Operating Systems Design and Implementation, USENIX Assoc., Berkeley, CA, 2006.
- MapReduce on Wikipedia. http://en.wikipedia.org/wiki/MapReduce
- Vassilvitskii, S. XXL Graph Algorithms, Hadoop Summit 2010. http://developer.yahoo.com/events/hadoopsummit2010/
