An Introduction to the World of Hadoop


A presentation on Hadoop for scientific researchers given at Universitat Rovira i Virgili in Catalonia, Spain in October 2010. http://etseq.urv.cat/seminaris/seminars/3/



Transcript

  • 1. An Introduction to the World of Hadoop: Applications to Scientific Data Mining
       Gordon Rios (g.rios@4c.ucc.ie)
       Cork Constraint Computation Centre (4C), University College Cork
       October 29, 2010
  • 2. Outline
       1. MapReduce / Hadoop for Scientific Data Mining
          - Objectives for the Talk
          - MapReduce as Simplified Parallel Computing
          - Thinking in Terms of MapReduce
       2. Hadoop is Open Source MapReduce
          - Basics of Hadoop
          - Hadoop Examples
          - Developing Production Systems with Hadoop
       3. Wider World of Hadoop
          - Ad Hoc Analysis with Hadoop
          - Further Reading
  • 4. Objectives
       At the end of this talk I want you to have ideas for how to apply MapReduce to your domains and confidence that Hadoop is a good way to do it.
       - Introduce thinking in terms of MapReduce and why it’s a good idea
       - Introduce Hadoop as an open source implementation of MapReduce
       - Present a detailed example of using the Hadoop streaming API for a scientific data mining task
       - Discuss higher level notions for performing ad hoc analysis and building systems with Hadoop
  • 9. MapReduce
       MapReduce is distributed computing where we take advantage of data locality to push the computation to the data.
       - Distributed computing: clusters of computers, each with local memory and disk (network intensive for big data)
       - Parallel computing: multiple CPUs processing over shared memory and a shared filesystem
       If we can decompose the problem into independent map and reduce tasks, we can achieve “easy” parallelism with MapReduce:
       1. Map works independently to convert input data to key-value pairs.
       2. Reduce works independently on all values for a given key and transforms them to a single output set per key (possibly even just the empty set ∅).
       Now, let’s expand that a bit.
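The two steps above can be sketched in a few lines of Python. This is a minimal single-process illustration, not how Hadoop itself is implemented; the names `run_mapreduce`, `word_count_mapper` and `word_count_reducer` are hypothetical and chosen only for this example.

```python
from collections import defaultdict


def run_mapreduce(records, mapper, reducer):
    """Run map -> group-by-key -> reduce in memory, in a single process."""
    groups = defaultdict(list)
    # Map phase: every record is handled independently of the others.
    for record in records:
        for key, value in mapper(record):
            groups[key].append(value)
    # Reduce phase: every key is handled independently of the others.
    output = {}
    for key in sorted(groups):
        output[key] = reducer(key, groups[key])
    return output


def word_count_mapper(line):
    # Emit (word, 1) for each whitespace-separated token.
    for word in line.lower().split():
        yield word, 1


def word_count_reducer(key, values):
    # Collapse all values for one key into a single count.
    return sum(values)


if __name__ == "__main__":
    lines = ["the quick brown fox", "the lazy dog", "the fox"]
    print(run_mapreduce(lines, word_count_mapper, word_count_reducer))
```

Because neither phase shares state between records or between keys, each loop could in principle be split across machines, which is the property the slide calls “easy” parallelism.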
  • 15. Basic Elements of MapReduce
       MapReduce is a distributed sort with specific places to insert application logic:
       - an input reader: reads work data $W$ from the file system [1] and produces a set of splits $S$: $W \to S$
       - a Map function: $S \to (K, V)$
       - a combiner function: a mapper optimization
       - a partition function: partitions [2] keys $k \in K$ to reducers: $K \to R$
       - a compare function $\mathrm{cmp}(k_i, k_j)$: sorts the keys presented to each reducer
       - a Reduce function: reduces the output from all mappers for a particular key $k$ to another set of values $w_k$ for that key: $(k, V) \to (k, w_k)$
       - an output writer: writes output to the file system
       [1] A distributed file system (DFS) for stability and scale
       [2] The default partitioner hashes keys modulo the number of reducers
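The partition and compare steps can be illustrated with a short Python sketch. It assumes string keys and uses `zlib.crc32` as a stand-in hash because it gives the same result on every run; that choice is an assumption of this example and is not the hash Hadoop uses. The function names are hypothetical.

```python
import zlib


def partition(key, num_reducers):
    """Assign a key to a reducer: hash of the key modulo the reducer count."""
    # crc32 is a stand-in hash (assumption of this sketch), picked because
    # it is stable across runs; Hadoop uses its own hashing of the key.
    return zlib.crc32(key.encode("utf-8")) % num_reducers


def shuffle(pairs, num_reducers):
    """Route (key, value) pairs to reducers, then sort each reducer's input."""
    buckets = [[] for _ in range(num_reducers)]
    for key, value in pairs:
        buckets[partition(key, num_reducers)].append((key, value))
    for bucket in buckets:
        # Sorting by key plays the role of the compare function cmp(k_i, k_j).
        bucket.sort(key=lambda kv: kv[0])
    return buckets


if __name__ == "__main__":
    pairs = [("b", 1), ("a", 2), ("b", 3), ("c", 4)]
    for reducer_id, bucket in enumerate(shuffle(pairs, 3)):
        print(reducer_id, bucket)
```

The important guarantee is that every pair with the same key lands in the same bucket, so a single reducer sees all values for that key, in sorted key order.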
  • 17. Examples of Map and Reduce
       Let’s start with a few examples of Map:
       - Word Count: read in a stream of text (e.g. a document or a set of documents) and emit each word as a key with a value of 1
       - Inverted Index: read in a stream of documents and emit each word as a key and the document ID as the value
       - Max Temperature: read in formatted data and emit year as a key with temperature as the value
       - Mean Rain Precipitation: read in daily data and emit (year-month, lat, long) as a key with precipitation as the value
       Reduce in these cases simply applies a count, list, max, or average, respectively, to the set of values for each key.
       [Dean, Ghemawat, 2008; Wikipedia, 2010; White, 2011]
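Two of the examples above, Inverted Index and Max Temperature, can be sketched as map and reduce functions in Python. The record layout `"YYYY,temperature"` and all function names are assumptions made for this illustration only; the grouping step is done in memory rather than by Hadoop.

```python
from collections import defaultdict


def group_by_key(pairs):
    """Collect all values emitted for each key (the shuffle step, in memory)."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def inverted_index_map(doc_id, text):
    # Emit (word, document id) once per distinct word in the document.
    for word in set(text.lower().split()):
        yield word, doc_id


def max_temperature_map(record):
    # Assumed record layout for this sketch: "YYYY,temperature".
    year, temperature = record.split(",")
    yield year, float(temperature)


def build_inverted_index(docs):
    pairs = [kv for doc_id, text in docs.items()
             for kv in inverted_index_map(doc_id, text)]
    # Reduce: turn the values for each word into a sorted list of documents.
    return {word: sorted(ids) for word, ids in group_by_key(pairs).items()}


def max_temperature_by_year(records):
    pairs = [kv for record in records for kv in max_temperature_map(record)]
    # Reduce: keep only the maximum value seen for each year.
    return {year: max(temps) for year, temps in group_by_key(pairs).items()}


if __name__ == "__main__":
    print(build_inverted_index({"d1": "hadoop runs mapreduce",
                                "d2": "mapreduce sorts keys"}))
    print(max_temperature_by_year(["1949,11.1", "1950,2.2", "1949,7.8"]))
```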
  • 18. Visualizing Word Count
       (Figure. Source: Chris Wensel, from http://www.cascading.org)
  • 20. Engineering Intermezzo
       This is how easy it is to get Hadoop installed, given that you have Java 6 installed already.
       Get Hadoop: http://hadoop.apache.org/

           % tar xzf hadoop-x.y.z.tar.gz
           % export HADOOP_INSTALL=BUILD_DIR/hadoop-x.y.z
           % export PATH=$PATH:$HADOOP_INSTALL/bin
  • 21. MapReduce with Hadoop and the streaming library
       Now, let’s take a closer look at how Hadoop implements MapReduce, from [White, 2011].
  • 22. Hadoop Streaming Library
       We’ll focus on the streaming library as it’s the most natural for scientific or technical computing. Let’s look at the Definitive Guide’s weather example.
  • 24. Hadoop Book Examples
       More examples from Hadoop: The Definitive Guide, 2nd Edition (Hadoop 20.1), http://www.hadoopbook.com/. Here’s how to install and try them for yourself:
       - Install Git: http://git-scm.com/
       - Visit GitHub for the book code: http://github.com/tomwhite/hadoop-book/
       - Check out the code examples from The Definitive Guide:

           % cd BUILD_DIR
           % git clone http://github.com/tomwhite/hadoop-book.git hadoop-book
  • 25. Example: ECA Mean Precipitation
       Let’s compute mean precipitation at over 2,000 weather stations and make some graphics. There are 2,186 files with a median of 21,875 lines each, a minimum of 1,025 and a maximum of 78,090.

       ECA Daily Data
       “The ECA dataset contains series of daily observations at meteorological stations throughout Europe and the Mediterranean. Part of the dataset is freely available for non-commercial research. To download this public data select one of the options below. Note that a gridded version with daily temperature and precipitation fields is also available.”
       Source: http://eca.knmi.nl/dailydata/index.php

       File format (missing value code = -9999):

           Columns  Field  Description
           01-06    STAID  Station identifier
           08-13    SOUID  Source identifier
           15-22    DATE   Date YYYYMMDD
           24-28    RR     Precipitation amount in 0.1 mm
           30-34    Q_RR   Quality code for RR (0 = 'valid'; 1 = 'suspect'; 9 = 'missing')
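As a rough illustration of the file format above, the following Python sketch parses one data line into named fields. It assumes the five fields are separated by commas at the listed column positions (the mapper later in the talk splits on commas); the names `Observation` and `parse_eca_line` are hypothetical.

```python
from collections import namedtuple

MISSING = -9999  # missing value code given in the file format description

Observation = namedtuple("Observation", "staid souid date rr_mm quality")


def parse_eca_line(line):
    """Parse one data line; return None for header or malformed lines."""
    fields = [f.strip() for f in line.split(",")]
    if len(fields) != 5:
        return None
    staid, souid, date, rr, q_rr = fields
    # DATE must be eight digits (YYYYMMDD); this also rejects a header row.
    if not (date.isdigit() and len(date) == 8):
        return None
    try:
        rr_tenths = int(rr)
        quality = int(q_rr)
    except ValueError:
        return None
    # RR is stored in units of 0.1 mm; convert to mm unless it is missing.
    rr_mm = None if rr_tenths == MISSING else rr_tenths / 10.0
    return Observation(staid, souid, date, rr_mm, quality)


if __name__ == "__main__":
    print(parse_eca_line("     1, 35381,20090115,   12,    0"))
```

Returning `None` for anything unparseable lets a mapper skip descriptive header text with a single check.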
  • 26. Example: ECA Mean Precipitation
       Scientific data mining: use the Hadoop streaming library and manually pipeline MapReduce jobs together as needed.
       - Write the Hadoop scripts in Python, in two steps
       - Test with: cat data | map.py | sort | reduce.py > output (not shown)
       - Process the data into individual files for each time period (year/month) of interest using the Hadoop streaming library (local mode)
       - Call R in batch mode to produce image files
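The test pipeline `cat data | map.py | sort | reduce.py` can be imitated inside one Python process, which is handy for unit tests. This is a minimal sketch under the assumption that the mapper and reducer are written as functions over lines of text; `run_streaming_locally`, `example_mapper` and `example_reducer` are hypothetical names, and the toy input layout is invented for the example.

```python
def run_streaming_locally(input_lines, mapper, reducer):
    """Imitate `cat data | map.py | sort | reduce.py` with Python callables.

    mapper and reducer each take an iterable of lines and yield lines.
    """
    mapped = list(mapper(input_lines))
    mapped.sort()  # stands in for the Unix `sort` between the two scripts
    return list(reducer(mapped))


def example_mapper(lines):
    # Toy input layout (assumption): "key,value" per line.
    for line in lines:
        key, value = line.split(",")
        yield "%s\t%s" % (key.strip(), value.strip())


def example_reducer(sorted_lines):
    # Sum the values for each run of identical keys in sorted input.
    last_key, total = None, 0.0
    for line in sorted_lines:
        key, value = line.split("\t")
        if last_key is not None and key != last_key:
            yield "%s\t%.1f" % (last_key, total)
            total = 0.0
        last_key = key
        total += float(value)
    if last_key is not None:
        yield "%s\t%.1f" % (last_key, total)


if __name__ == "__main__":
    for out in run_streaming_locally(["b,1", "a,2", "b,3"],
                                     example_mapper, example_reducer):
        print(out)
```

Note that Python sorts strings by code point, while the ordering of Unix `sort` depends on the locale, so results can differ for keys with mixed case or non-ASCII characters.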
  • 27. ECA Mean Precipitation: Step One

       map_one.py

           import sys

           # latlons (station id -> (lat, lon)), BEGIN_DATE and END_DATE are
           # defined elsewhere in the script; they are not shown on the slide.

           def lat_lon_to_coord(s):
               # Convert "deg:min:sec" to signed decimal degrees. The sign is read
               # from the string so that values such as "-00:30:00" stay negative.
               sign = -1 if s.strip().startswith("-") else 1
               d, m, sec = map(lambda x: float(x), s.split(":"))
               x = abs(d) + m / 60.0 + sec / 3600.0
               return float(sign * x)

           for line in sys.stdin:
               # flds = (staid, souid, date, rr, q_rr)
               flds = line.strip().split(",")
               if len(flds) != 5:
                   continue
               staid = flds[0].strip()  # station id
               date = flds[2].strip()   # YYYYMMDD
               if date < BEGIN_DATE or date > END_DATE:
                   continue
               rr = flds[3].strip()     # precipitation in 0.1 mm
               q_rr = flds[4].strip()   # quality code "0" = valid
               lat, lon = latlons.get(staid, (None, None))
               if q_rr == '0' and (lat is not None) and (lon is not None):
                   print("%s,%.4f,%.4f\t%s" % (date[0:6], lat, lon, rr))
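The mapper looks up each station in a `latlons` table that the slide does not show. One way such a table might be built is sketched below. The station-list layout (`STAID, STANAME, CN, LAT, LON, HGHT`, with coordinates written as `deg:min:sec`) is an assumption of this example, as is the helper name `load_latlons`.

```python
def lat_lon_to_coord(dms):
    """Convert a 'deg:min:sec' string such as '+41:07:07' to decimal degrees."""
    dms = dms.strip()
    sign = -1 if dms.startswith("-") else 1
    d, m, s = (float(part) for part in dms.split(":"))
    return sign * (abs(d) + m / 60.0 + s / 3600.0)


def load_latlons(lines):
    """Build {station id: (lat, lon)} from station-list lines.

    Assumed layout (not shown in the talk): STAID, STANAME, CN, LAT, LON, HGHT.
    """
    latlons = {}
    for line in lines:
        fields = [f.strip() for f in line.split(",")]
        if len(fields) != 6 or not fields[0].isdigit():
            continue  # skip headers, blank lines and anything malformed
        try:
            lat = lat_lon_to_coord(fields[3])
            lon = lat_lon_to_coord(fields[4])
        except ValueError:
            continue  # coordinate was not in deg:min:sec form
        latlons[fields[0]] = (lat, lon)
    return latlons


if __name__ == "__main__":
    sample = ["STAID,STANAME,CN,LAT,LON,HGHT",
              "    1,EXAMPLE A,XX,+56:52:00,+014:48:00, 166"]
    print(load_latlons(sample))
```

A station name that itself contains a comma would be skipped by the six-field check, so a real loader may need fixed-width slicing instead.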
  • 28. ECA Mean Precipitation: Step One (cont)

       reduce_one.py

           import sys

           # Input arrives sorted by key, so all lines for one key are adjacent.
           (last_key, x, n) = (None, 0.0, 0)
           for line in sys.stdin:
               (key, val) = line.strip().split("\t")
               if last_key and last_key != key:
                   # time to emit reduced value
                   if n > 0:
                       print("%s\t%.2f" % (last_key, x / n))
                   x = 0.0
                   n = 0
               # accumulate the running sum and count for the current key
               (last_key, x, n) = (key, x + float(val), n + 1)
           # emit the mean for the final key
           if last_key:
               if n > 0:
                   print("%s\t%.2f" % (last_key, x / n))
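The same reducer logic can be written more compactly with `itertools.groupby`, which groups adjacent lines that share a key. This is an alternative sketch rather than the script used in the talk; `parse` and `mean_by_key` are hypothetical names.

```python
import sys
from itertools import groupby


def parse(lines):
    # Split each 'key<TAB>value' line into (key, float value).
    for line in lines:
        key, value = line.rstrip("\n").split("\t")
        yield key, float(value)


def mean_by_key(lines):
    """Yield 'key<TAB>mean' for each run of identical keys in sorted input."""
    for key, group in groupby(parse(lines), key=lambda kv: kv[0]):
        total, count = 0.0, 0
        for _, value in group:
            # Running sum keeps memory use constant per key.
            total += value
            count += 1
        yield "%s\t%.2f" % (key, total / count)


if __name__ == "__main__":
    for out in mean_by_key(sys.stdin):
        print(out)
```

Like the original, this relies on the input being sorted by key; `groupby` only merges neighbouring lines.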
  • 29. ECA Mean Precipitation: Step Two
       Map: ((yyyymm,lat,lon), mean) -> (yyyymm, (lat,lon,mean))

       map_two.py

           import sys

           for line in sys.stdin:
               yyyymm_lat_lon, mean_precip = line.strip().split("\t")
               yyyymm, lat, lon = yyyymm_lat_lon.strip().split(",")
               print("%s\t%s %s %s" % (yyyymm, lat, lon, mean_precip))

       The reduce is effectively empty: it just writes each key’s values to a local file (a hack, since we’re running locally).

       reduce_two.py

           import sys

           # write_file(key, values) is defined elsewhere in the script;
           # it is not shown on the slide.
           last_key = None
           values = []
           for line in sys.stdin:
               (key, val) = line.strip().split("\t")
               if last_key and last_key != key:
                   # time to emit reduced value
                   write_file(last_key, values)
                   values = []
               last_key = key
               values.append(val)  # val is a string with three values
           if last_key:
               write_file(last_key, values)
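The reducer calls a `write_file` helper that the slide does not show. A minimal sketch of what it might look like follows, assuming one output file per month named `<yyyymm>.dat` (the naming the R script in the talk reads) with one "lat lon mean" line per value. The `out_dir` parameter is a hypothetical addition for testing.

```python
import os


def write_file(key, values, out_dir="."):
    """Write one 'lat lon mean' line per value to <out_dir>/<key>.dat.

    Hypothetical helper: the original write_file is not shown in the talk.
    Returns the path of the file that was written.
    """
    path = os.path.join(out_dir, "%s.dat" % key)
    with open(path, "w") as handle:
        for value in values:
            handle.write(value + "\n")
    return path


if __name__ == "__main__":
    print(write_file("200901", ["41.1187 1.2453 12.50"]))
```

Writing side files from a reducer only works here because the job runs locally; on a cluster each reducer would write to its own machine.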
  • 31. Example: ECA Mean Precipitation
       Step One: input -> (yyyymm,lat,lon), mean precip

           % hadoop jar /Users/gordon/build/hadoop/contrib/streaming/hadoop-0.20.2-streaming.jar \
               -input /Downloads/tarragona/ECA_blend_all_data/RR_STAID00* \
               -output output \
               -mapper /Desktop/tmp/tarragona/python/map_one.py \
               -reducer /Desktop/tmp/tarragona/python/reduce_one.py
           10/10/25 19:34:31 INFO jvm.JvmMetrics: Initializing JVM Metrics with processName=JobTracker, sessionId=
           ...
           10/10/25 20:07:36 INFO streaming.StreamJob: Job complete: job_local_0001
           10/10/25 20:07:36 INFO streaming.StreamJob: Output: output

       Step Two: (yyyymm,lat,lon), mean precip -> files(yyyymm)

           % hadoop jar /Users/gordon/build/hadoop/contrib/streaming/hadoop-0.20.2-streaming.jar \
               -input output/part-00000 \
               -output output_two \
               -mapper /Desktop/tmp/tarragona/python/map_two.py \
               -reducer /Desktop/tmp/tarragona/python/reduce_two.py
           10/10/25 20:41:43 INFO jvm.JvmMetrics: Initializing JVM Metrics with processName=JobTracker, sessionId=
           ...
           10/10/25 20:41:47 INFO streaming.StreamJob: Job complete: job_local_0001
           10/10/25 20:41:47 INFO streaming.StreamJob: Output: output_two
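The two commands above are chained by hand. A small Python driver could run them in sequence; the sketch below is one possible shape, using only the `-input`, `-output`, `-mapper` and `-reducer` options that appear in the commands. The function names and the `script_dir` parameter are hypothetical, and `run_pipeline` assumes `hadoop` is on the `PATH`.

```python
import subprocess


def streaming_command(jar, input_path, output_path, mapper, reducer):
    """Build the argument list for one Hadoop streaming job."""
    return ["hadoop", "jar", jar,
            "-input", input_path,
            "-output", output_path,
            "-mapper", mapper,
            "-reducer", reducer]


def run_pipeline(jar, raw_input, script_dir):
    """Run step one, then feed its output file into step two."""
    steps = [
        (raw_input, "output", "map_one.py", "reduce_one.py"),
        ("output/part-00000", "output_two", "map_two.py", "reduce_two.py"),
    ]
    for input_path, output_path, mapper, reducer in steps:
        cmd = streaming_command(jar, input_path, output_path,
                                script_dir + "/" + mapper,
                                script_dir + "/" + reducer)
        # check_call raises CalledProcessError if the job exits non-zero,
        # so step two never runs on a failed step one.
        subprocess.check_call(cmd)


if __name__ == "__main__":
    # Print the command only; the jar and paths here are placeholders.
    print(" ".join(streaming_command("streaming.jar", "in", "out",
                                     "map_one.py", "reduce_one.py")))
```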
  • 33. Batch Processing in R
       And, after a little batch processing with R:

       batch-graphics.R

           library(fields)
           files <- c("200901.dat", "200902.dat", "200903.dat", "200904.dat",
                      "200905.dat", "200906.dat", "200907.dat", "200908.dat",
                      "200909.dat", "200910.dat", "200911.dat", "200912.dat")
           i <- 1
           for (f in files) {
             mat <- read.table(f)
             names(mat) <- c("lat", "long", "precip")
             png(filename=paste("precip-", i, ".png", sep=""), height=480, width=480)
             quilt.plot(mat$long, mat$lat, mat$precip, ncol=100, nrow=100,
                        ylim=c(22.0, 79.0), xlim=c(-52.0, 72.0),
                        col=two.colors(256, start="wheat", end="darkblue", middle="blue"),
                        zlim=c(0, 410), add.legend=T, cex.lab=0.6)
             points(1.2453, 41.1187, pch=1)
             text(1.2453, 41.1187, "tarragona", cex=0.8, pos=1)
             points(2.35083, 48.89, pch=1)
             text(2.3508, 48.89, "paris", cex=0.8, pos=4)
             points(12.4823, 41.8955, pch=1)
             text(12.4823, 41.8955, "rome", cex=0.8, pos=4)
             dev.off()
             i <- i + 1
           }
  • 34 to 45. ECA Precipitation 2009, Months 1 to 12
       (Figures: one precipitation map per month, January through December 2009.)
  • 46. Summary of What We Did
       We worked through a complete example, but that’s not all: with very little additional work we can
       - Test the scripts in pseudo-distributed mode locally on our own machine
       - Run the job on a compute cluster remotely
       - Run the job in the cloud with EC2, treating their system as just another remote cluster
       - Run the job with Amazon’s Elastic MapReduce (http://aws.amazon.com/elasticmapreduce/), which allows you to pay for exactly as much computing as you use
       See [White, 2011] for complete details on how to run in these different modes.
  • 48. Systems Development APIs
       And you can build production systems with Hadoop in either Java or C++:
       - Full-featured Java API for Hadoop
       - Pipes is the C++ API for Hadoop MapReduce
       - Cascading is an API for developing general data processing systems that incorporate MapReduce (http://www.cascading.org)
  • 51. Cascading
      Cascading lets developers model very complex data flows at a higher level than Map and Reduce. It then automatically generates the necessary Hadoop jobs, which can number in the dozens or even hundreds, and visualizes them as a graph.
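To make the idea of turning a high-level flow into a graph of jobs concrete, here is a toy planner in plain Python. It is an illustrative simplification and not Cascading's actual planner or API: it assumes a linear flow made of per-record steps and grouping steps, and it assumes that each grouping step needs its own MapReduce job. The step kinds `"record"` and `"group"`, the functions `plan_jobs` and `to_dot`, and the example step names are all hypothetical.

```python
def plan_jobs(flow):
    """Split a linear flow of (kind, name) steps into MapReduce-style jobs.

    Toy rule (an assumption): a job holds at most one "group" step.
    "record" steps before the grouping run on the map side, and "record"
    steps after it run on the reduce side. A second "group" starts a new job.
    """
    jobs = []
    current = None
    for kind, name in flow:
        if kind == "group":
            if current is None or current["group"] is not None:
                current = {"map": [], "group": None, "reduce": []}
                jobs.append(current)
            current["group"] = name
        elif kind == "record":
            if current is None:
                current = {"map": [], "group": None, "reduce": []}
                jobs.append(current)
            side = "map" if current["group"] is None else "reduce"
            current[side].append(name)
        else:
            raise ValueError(f"unknown step kind: {kind}")

    # In a linear flow, each job simply feeds the next one.
    edges = [(i, i + 1) for i in range(len(jobs) - 1)]
    return jobs, edges


def to_dot(jobs, edges):
    """Render the job graph as Graphviz DOT text for visualization."""
    lines = ["digraph flow {"]
    for i, job in enumerate(jobs):
        lines.append(f'  job{i} [label="job {i}: {job["group"]}"];')
    for src, dst in edges:
        lines.append(f"  job{src} -> job{dst};")
    lines.append("}")
    return "\n".join(lines)


flow = [
    ("record", "parse"),
    ("group", "count_by_word"),
    ("record", "filter_rare"),
    ("group", "sort_by_count"),
]
jobs, edges = plan_jobs(flow)
dot = to_dot(jobs, edges)
print(dot)
```

The point of the sketch is the division of labor: the developer describes steps, and a planner decides where the job boundaries fall and how the jobs depend on each other.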
  • 53. Ad Hoc Analysis
      What’s missing? Sometimes you need to run fast ad hoc queries. Can we do that in a scalable way?
      - Pig: “Pig is a scripting language for exploring large datasets” [White, 2011] (Yahoo!)
      - Hive: provides an SQL interface for running ad hoc queries and other data processing tasks, aimed at SQL analysts (Facebook)
      - HBase: column-oriented database along the lines of Google’s Bigtable (Powerset)
      - Hypertable: GPL-licensed clone of Google’s Bigtable, written in C++ (Zvents)
      Google’s Bigtable database is described in [Chang, Dean, Ghemawat, et al., 2008].
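The Bigtable paper describes its data model as a sparse, sorted map indexed by row key, column key, and timestamp. The sketch below models that idea in memory to show what "column oriented database along the lines of Bigtable" means in practice. It is not the HBase or Hypertable API: the class `SparseTable`, its methods, and the sample rows and columns are hypothetical, and everything here lives in a single Python dictionary rather than a distributed store.

```python
class SparseTable:
    """Toy Bigtable-style map: (row, column, timestamp) -> value."""

    def __init__(self):
        # (row, column) -> list of (timestamp, value), newest first.
        # Cells that were never written simply do not exist (sparse).
        self._cells = {}

    def put(self, row, column, value, timestamp):
        """Store a new version of a cell without discarding older versions."""
        versions = self._cells.setdefault((row, column), [])
        versions.append((timestamp, value))
        versions.sort(key=lambda tv: tv[0], reverse=True)

    def get(self, row, column, timestamp=None):
        """Return the newest value, or the newest one at or before timestamp."""
        for ts, value in self._cells.get((row, column), []):
            if timestamp is None or ts <= timestamp:
                return value
        return None

    def scan(self, start_row, stop_row):
        """Yield (row, column, latest value) for start_row <= row < stop_row,
        in sorted key order."""
        for row, column in sorted(self._cells):
            if start_row <= row < stop_row:
                yield row, column, self._cells[(row, column)][0][1]


table = SparseTable()
table.put("sample-001", "reading:temp", 20.5, timestamp=1)
table.put("sample-001", "reading:temp", 21.0, timestamp=2)
table.put("sample-002", "meta:site", "Cork", timestamp=1)

latest = table.get("sample-001", "reading:temp")
earlier = table.get("sample-001", "reading:temp", timestamp=1)
missing = table.get("sample-002", "reading:temp")
rows = list(table.scan("sample-001", "sample-002"))
print(latest, earlier, missing, rows)
```

Keeping rows in sorted key order is what makes range scans over adjacent keys cheap, and storing versions per cell is what lets a reader ask for the value as of an earlier time.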
  • 58. Interesting Application Frameworks with Hadoop
      Here are a few examples of frameworks, in development or already available, that use Hadoop as a platform:
      - Apache Mahout: an ambitious project to implement popular machine learning algorithms and recommenders with Hadoop (http://mahout.apache.org/)
      - Graph: Jake Hofman from Yahoo Research has released some of his work on large-scale network analysis with Hadoop, with prototype code (http://github.com/jhofman/icwsm2010_tutorial). Also see [Vassilvitskii, 2010] for related graph analysis research.
      - Application to GIS: Nathan Kerr’s M.S. thesis, with lots of detail on how to do GIS with Hadoop (http://www.nathankerr.com/projects/parallel-gis-processing/alternative_approaches_to_parallel_gis_processing.html)
  • 60. Further Reading
      - White, T. Hadoop: The Definitive Guide, 2nd Edition. O’Reilly Media, Inc., Sebastopol, CA, 2011.
      - Sanderson, D. Programming Google App Engine. O’Reilly Media, Inc., Sebastopol, CA, 2009.
      - Murty, J. Programming Amazon Web Services. O’Reilly Media, Inc., Sebastopol, CA, 2008.
      - Dean, J. and Ghemawat, S. MapReduce: simplified data processing on large clusters. Communications of the ACM, 51(1):107–113, 2008.
      - Chang, F., Dean, J., Ghemawat, S., Hsieh, W. C., Wallach, D. A., Burrows, M., Chandra, T., Fikes, A., and Gruber, R. E. Bigtable: a distributed storage system for structured data. OSDI ’06: Proceedings of the 7th USENIX Symposium on Operating Systems Design and Implementation, USENIX Assoc., Berkeley, CA, 2006.
      - MapReduce on Wikipedia. http://en.wikipedia.org/wiki/MapReduce
      - Vassilvitskii, S. XXL Graph Algorithms. Hadoop Summit 2010. http://developer.yahoo.com/events/hadoopsummit2010/
