The document is an introduction to Hadoop and MapReduce for scientific data mining. It aims to introduce MapReduce thinking and how it enables parallel computing, introduce Hadoop as an open source implementation of MapReduce, and present an example of using Hadoop's streaming API for a scientific data mining task. It also discusses higher-level concepts for performing ad hoc analysis and building systems with Hadoop.
An Introduction to Applying MapReduce and Hadoop for Scientific Data Mining
1. MapReduce / Hadoop for Scientific Data Mining
Hadoop = Open Source MapReduce
Wider World of Hadoop
An Introduction to the World of Hadoop
Applications to Scientific Data Mining
Gordon Rios
g.rios@4c.ucc.ie
Cork Constraint Computation Centre (4C)
University College Cork
October 29, 2010
Gordon Rios Introduction to Hadoop
Outline
1 MapReduce / Hadoop for Scientific Data Mining
Objectives for the Talk
MapReduce as Simplified Parallel Computing
Thinking in Terms of MapReduce
2 Hadoop is Open Source MapReduce
Basics of Hadoop
Hadoop Examples
Developing Production Systems with Hadoop
3 Wider World of Hadoop
Ad Hoc Analysis with Hadoop
Further Reading
Objectives
At the end of this talk I want you to have ideas for how to apply MapReduce to your domains and confidence that Hadoop is a good way to do it...
Introduce thinking in terms of MapReduce and why it’s a good idea
Introduce Hadoop as an open source implementation of MapReduce
Present a detailed example of using the Hadoop streaming API for a scientific data mining task
Discuss higher-level notions for performing ad hoc analysis and building systems with Hadoop
MapReduce
MapReduce is distributed computing where we take advantage of data locality to push the computation to the data...
Distributed computing: clusters of computers with local memory and disk (network intensive for big data)
Parallel computing: multiple CPUs processing over shared memory and filesystem
If we can decompose the problem into independent map and reduce tasks, we can achieve “easy” parallelism with MapReduce...
1 Map works independently to convert input data to key-value pairs...
2 Reduce works independently on all values for a given key and transforms them to a single output set per key (possibly even just the empty set ∅)...
Now, let’s expand that a bit...
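One way to picture the two steps is a minimal single-process sketch in Python. The helper name map_reduce and the (year, temperature) readings are invented for illustration; on a real cluster the map calls and the per-key reduce calls would run on different machines, which is where the parallelism comes from.

```python
from collections import defaultdict

def map_reduce(records, map_fn, reduce_fn):
    """Run map, group-by-key, and reduce in one process (illustration only)."""
    groups = defaultdict(list)
    for record in records:
        # Each record is mapped independently of every other record.
        for key, value in map_fn(record):
            groups[key].append(value)
    # Each key is reduced independently, given all of its values.
    return {key: reduce_fn(key, groups[key]) for key in sorted(groups)}

def max_temp_map(record):
    year, temp = record.split(",")
    yield year, int(temp)

def max_temp_reduce(key, values):
    return max(values)

# Invented sample data: "year,temperature"
readings = ["1949,111", "1950,0", "1949,78", "1950,22", "1950,-11"]
result = map_reduce(readings, max_temp_map, max_temp_reduce)
print(result)  # {'1949': 111, '1950': 22}
```

Because neither function looks at anything outside its own arguments, the framework is free to split the work however it likes.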
Basic Elements of MapReduce
MapReduce is a distributed sort with specific places to insert application logic...
an input reader: reads work data $W$ from the file system [1] and produces a set of splits $S$: $W \to S$
a Map function: $S \to (K, V)$
a combiner function: a mapper optimization...
a partition function: partitions [2] keys $k \in K$ to reducers: $K \to R$
a compare function $\mathrm{cmp}(k_i, k_j)$: sorts the keys presented to each reducer
a Reduce function: reduces the output from all mappers for a particular key $k$ to another set of values $w_k$ for that key: $(k, V) \to (k, w_k)$
an output writer: writes output to the file system.
[1] A distributed file system (DFS) for stability and scale
[2] The default hashes keys modulo the number of reducers
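The partition and compare steps can be sketched in a few lines of Python. This is an illustration rather than Hadoop's code: the function names are hypothetical, and zlib.crc32 stands in for the key hash only because it gives the same answer on every run (Hadoop's default partitioner uses the key's Java hashCode).

```python
import zlib

def partition(key, num_reducers):
    """Pick a reducer for a key: hash of the key modulo the reducer count."""
    # Assumption: crc32 as a stand-in hash, not what Hadoop itself uses.
    return zlib.crc32(key.encode("utf-8")) % num_reducers

def shuffle(pairs, num_reducers):
    """Route (key, value) pairs to reducers, then sort each reducer's input."""
    buckets = [[] for _ in range(num_reducers)]
    for key, value in pairs:
        buckets[partition(key, num_reducers)].append((key, value))
    # Sorting plays the role of the compare function: each reducer sees
    # its keys in order, so equal keys arrive next to each other.
    return [sorted(bucket) for bucket in buckets]

# Invented mapper output
demo = shuffle([("b", 1), ("a", 1), ("b", 2), ("c", 1)], 2)
for reducer_id, bucket in enumerate(demo):
    print(reducer_id, bucket)
```

The property that matters is that every occurrence of a key lands on the same reducer, so each reducer can finish its keys without talking to any other.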
Examples of Map and Reduce
Let’s start with a few examples of Map...
Word Count: read in a stream of text (e.g. a document or a set of documents) and emit each word as a key with a value of 1
Inverted Index: read in a stream of documents and emit each word as a key and the document ID as the value
Max Temperature: read in formatted data and emit year as a key with temperature as the value
Mean Rain Precipitation: read in daily data and emit (year-month, lat, long) as a key with precipitation as the value
Reduce in these cases simply applies a count, list, max, or average, respectively, to the set of values for each key. [Dean, Ghemawat, 2008; Wikipedia, 2010; White, 2011]
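As a sketch of the first example, word count can be written as two small Python generators that exchange tab-separated text lines. The function names and the sample sentences are invented, lowercasing is an added assumption, and sorted() stands in for the framework's sort between map and reduce.

```python
from itertools import groupby

def wc_map(lines):
    """Emit 'word<TAB>1' for every word in the input."""
    for line in lines:
        for word in line.split():
            yield "%s\t1" % word.lower()

def wc_reduce(sorted_lines):
    """Sum the counts of each run of identical words (input must be sorted)."""
    pairs = (line.split("\t") for line in sorted_lines)
    for word, group in groupby(pairs, key=lambda kv: kv[0]):
        yield "%s\t%d" % (word, sum(int(count) for _, count in group))

# Invented sample documents
docs = ["the quick brown fox", "the lazy dog", "The end"]
counts = list(wc_reduce(sorted(wc_map(docs))))
for line in counts:
    print(line)
```

The other examples differ only in what the mapper emits as the value and in which aggregate the reducer applies to each group.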
Visualizing Word Count
[Figure: word count data flow, not reproduced] source: Chris Wensel, from http://www.cascading.org
Engineering Intermezzo
This is how easy it is to get Hadoop installed... given that you have Java 6 installed already...
Get Hadoop: http://hadoop.apache.org/
% tar xzf hadoop-x.y.z.tar.gz
% export HADOOP_INSTALL=BUILD_DIR/hadoop-x.y.z
% export PATH=$PATH:$HADOOP_INSTALL/bin
MapReduce with Hadoop and the streaming library
Now, let’s take a closer look at how Hadoop implements MapReduce, from [White, 2011]...
Hadoop Streaming Library
We’ll focus on the streaming library as it’s the most natural for scientific or technical computing... let’s look at the Definitive Guide’s weather example...
Hadoop Book Examples
More examples from Hadoop: The Definitive Guide, 2nd Edition (Hadoop 20.1), http://www.hadoopbook.com/. Here’s how to install and try them for yourself...
Install Git: http://git-scm.com/
Visit GitHub for the book code: http://github.com/tomwhite/hadoop-book/
Check out the code examples from The Definitive Guide:
% cd BUILD_DIR
% git clone http://github.com/tomwhite/hadoop-book.git hadoop-book
Example: ECA Mean Precipitation
Let’s compute mean precipitation at over 2,000 weather stations and
make some graphics. There are 2,186 files with median of 21,875
lines each, a minimum of 1,025 and a maximum of 78,090.
ECA Daily Data
The ECA dataset contains series of daily observations at meteorological stations throughout Europe and the
Mediterranean. Part of the dataset is freely available for non-commercial research. To download this public data
select one of the options below. Note that a gridded version with daily temperature and precipitation fields is also
available. source: http://eca.knmi.nl/dailydata/index.php
File Format
FILE FORMAT (MISSING VALUE CODE = -9999):
01-06 STAID : Station identifier
08-13 SOUID : Source identifier
15-22 DATE  : Date YYYYMMDD
24-28 RR    : Precipitation amount in 0.1 mm
30-34 Q_RR  : quality code for RR (0='valid'; 1='suspect'; 9='missing')
Example: ECA Mean Precipitation
Scientific Data Mining: use the Hadoop streaming library and manually pipeline MapReduce jobs together as needed...
Write Hadoop scripts in Python in two steps
Test with: cat data | map.py | sort | reduce.py > output (not shown)
Process data into individual files for each time period (Year/Month) of interest using the Hadoop streaming library (local mode)
Call R in batch mode to produce image files
ECA Mean Precipitation: Step One
map_one.py
import sys

# BEGIN_DATE, END_DATE (YYYYMMDD strings) and latlons (station id -> (lat, lon))
# are defined elsewhere in the script; their setup is not shown on the slide.

def lat_lon_to_coord(s):
    # convert "degrees:minutes:seconds" to signed decimal degrees
    d, m, sec = map(lambda x: float(x), s.split(":"))
    sign = -1 if d < 0 else 1
    x = abs(d) + m / 60.0 + sec / 3600.0
    return float(sign * x)

for line in sys.stdin:
    # flds = (staid, souid, date, rr, q_rr)
    flds = line.strip().split(",")
    if len(flds) != 5:
        continue
    staid = flds[0].strip()  # station id
    date = flds[2].strip()   # YYYYMMDD
    if date < BEGIN_DATE or date > END_DATE:
        continue
    rr = flds[3].strip()     # precipitation in 0.1 mm
    q_rr = flds[4].strip()   # quality code "0" = valid
    lat, lon = latlons.get(staid, (None, None))
    if q_rr == '0' and (lat is not None) and (lon is not None):
        # key = "yyyymm,lat,lon" and value = precipitation, separated by a tab
        print("%s,%.4f,%.4f\t%s" % (date[0:6], lat, lon, rr))
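As a quick check of the coordinate conversion, the sketch below restates lat_lon_to_coord and applies it to two invented coordinate strings. The degrees:minutes:seconds input form is an assumption inferred from the split on ":" in the mapper; the real station file is not shown in the talk.

```python
def lat_lon_to_coord(s):
    """Convert 'degrees:minutes:seconds' to signed decimal degrees."""
    d, m, sec = (float(x) for x in s.split(":"))
    sign = -1 if d < 0 else 1
    return sign * (abs(d) + m / 60.0 + sec / 3600.0)

# Invented inputs: 41 deg 07' 07" and minus 8 deg 24' 00"
north = lat_lon_to_coord("+41:07:07")
west = lat_lon_to_coord("-08:24:00")
print("%.4f" % north)  # 41.1186
print("%.4f" % west)   # -8.4000
```

One caveat with this approach: for a value such as "-00:30:00" the degrees field parses to -0.0, which is not less than zero, so the sign is dropped. Stations within one degree of the equator or the prime meridian would need the sign read from the string itself.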
ECA Mean Precipitation: Step One (cont)
reduce_one.py
import sys

(last_key, x, n) = (None, 0.0, 0)
for line in sys.stdin:
    (key, val) = line.strip().split("\t")
    if last_key and last_key != key:  # time to emit reduced value
        if n > 0:
            print("%s\t%.2f" % (last_key, x / n))
        x = 0.0
        n = 0
    # accumulate the running sum and count for the current key
    (last_key, x, n) = (key, x + float(val), n + 1)
if last_key:
    if n > 0:
        print("%s\t%.2f" % (last_key, x / n))
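The same mean-per-key reduction can also be written with itertools.groupby, which takes over the last_key bookkeeping. This is an alternative sketch, not the script used in the talk: the function name and the sample lines are invented, and like the original it relies on the input arriving sorted by key.

```python
from itertools import groupby

def mean_by_key(sorted_lines):
    """Yield 'key<TAB>mean' for each run of lines sharing a key."""
    rows = (line.rstrip("\n").split("\t") for line in sorted_lines)
    for key, group in groupby(rows, key=lambda kv: kv[0]):
        vals = [float(v) for _, v in group]
        yield "%s\t%.2f" % (key, sum(vals) / len(vals))

# Invented mapper output: "yyyymm,lat,lon<TAB>precipitation in 0.1 mm"
sample = [
    "200901,41.1186,1.2453\t10",
    "200901,41.1186,1.2453\t30",
    "200902,41.1186,1.2453\t5",
]
out = list(mean_by_key(sample))
for line in out:
    print(line)
```

In a streaming job the function would be fed sys.stdin instead of the sample list. The means stay in the input's units of 0.1 mm.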
ECA Mean Precipitation: Step Two
Map ((yyyymm,lat,lon),mean) -> (yyyymm, (lat,lon,mean))
map_two.py
import sys

for line in sys.stdin:
    yyyymm_lat_lon, mean_precip = line.strip().split("\t")
    yyyymm, lat, lon = yyyymm_lat_lon.strip().split(",")
    print("%s\t%s %s %s" % (yyyymm, lat, lon, mean_precip))
The reduce is trivial: it just writes each key’s values to a local file (a hack, since we’re running locally)
reduce_two.py
import sys

# write_file(key, values) is a helper defined elsewhere in the script (not shown on the slide)
last_key = None
values = []
for line in sys.stdin:
    (key, val) = line.strip().split("\t")
    if last_key and last_key != key:  # time to emit reduced value
        write_file(last_key, values)
        values = []
    last_key = key
    values.append(val)  # val is a string with three values: "lat lon mean"
if last_key:
    write_file(last_key, values)
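The write_file helper is not shown in the talk. A hypothetical version might look like the sketch below, assuming one file per yyyymm key named like 200901.dat (the naming the R script later in the talk reads) with one "lat lon mean" string per line. The out_dir parameter and the demo values are additions for illustration.

```python
import os
import tempfile

def write_file(key, values, out_dir="."):
    """Hypothetical helper: write one value per line to <out_dir>/<key>.dat."""
    path = os.path.join(out_dir, "%s.dat" % key)
    with open(path, "w") as f:
        for value in values:
            f.write(value + "\n")
    return path

# Demo with invented values, written to a throwaway directory.
demo_dir = tempfile.mkdtemp()
demo_path = write_file("200901",
                       ["41.1186 1.2453 20.00", "48.8900 2.3508 31.50"],
                       demo_dir)
with open(demo_path) as f:
    demo_text = f.read()
print(demo_text)
```

Space-separated columns suit R's read.table, which splits on whitespace by default. Writing to the local disk from a reducer only makes sense in local mode; on a cluster the reducer would print its output and let Hadoop write it.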
Example: ECA Mean Precipitation
Step One: input -> (yyyymm,lat,lon), mean precip
% hadoop jar /Users/gordon/build/hadoop/contrib/streaming/hadoop-0.20.2-streaming.jar \
    -input /Downloads/tarragona/ECA_blend_all_data/RR_STAID00* \
    -output output \
    -mapper /Desktop/tmp/tarragona/python/map_one.py \
    -reducer /Desktop/tmp/tarragona/python/reduce_one.py
% 10/10/25 19:34:31 INFO jvm.JvmMetrics: Initializing JVM Metrics with processName=JobTracker, sessionId=
...
% 10/10/25 20:07:36 INFO streaming.StreamJob: Job complete: job_local_0001
% 10/10/25 20:07:36 INFO streaming.StreamJob: Output: output
Step Two: (yyyymm,lat,lon), mean precip -> files(yyyymm)
% hadoop jar /Users/gordon/build/hadoop/contrib/streaming/hadoop-0.20.2-streaming.jar \
    -input output/part-00000 \
    -output output_two \
    -mapper /Desktop/tmp/tarragona/python/map_two.py \
    -reducer /Desktop/tmp/tarragona/python/reduce_two.py
% 10/10/25 20:41:43 INFO jvm.JvmMetrics: Initializing JVM Metrics with processName=JobTracker, sessionId=
...
% 10/10/25 20:41:47 INFO streaming.StreamJob: Job complete: job_local_0001
% 10/10/25 20:41:47 INFO streaming.StreamJob: Output: output_two
Batch Processing in R
And, after a little batch processing with R...
batch-graphics.R
library(fields)
files <- c("200901.dat", "200902.dat", "200903.dat",
           "200904.dat", "200905.dat", "200906.dat",
           "200907.dat", "200908.dat", "200909.dat",
           "200910.dat", "200911.dat", "200912.dat")
i <- 1
for (f in files) {
  mat <- read.table(f)
  names(mat) <- c("lat", "long", "precip")
  png(filename=paste("precip-", i, ".png", sep=""), height=480, width=480)
  quilt.plot(mat$long, mat$lat, mat$precip, ncol=100, nrow=100,
             ylim=c(22.0, 79.0), xlim=c(-52.0, 72.0),
             col=two.colors(256, start="wheat", end="darkblue", middle="blue"),
             zlim=c(0, 410), add.legend=T, cex.lab=0.6)
  points(1.2453, 41.1187, pch=1)
  text(1.2453, 41.1187, "tarragona", cex=0.8, pos=1)
  points(2.35083, 48.89, pch=1)
  text(2.3508, 48.89, "paris", cex=0.8, pos=4)
  points(12.4823, 41.8955, pch=1)
  text(12.4823, 41.8955, "rome", cex=0.8, pos=4)
  dev.off()
  i <- i + 1
}
ECA Precipitation 2009, Months 1 to 12 (twelve monthly precipitation maps, one per slide; figures not reproduced)
Summary of What We Did
We worked through a complete example, but that’s not all, since with very little additional work we can...
Test the scripts in pseudo-distributed mode locally on our own machine
Run the job on a compute cluster remotely
Run the job in the cloud with EC2, treating the system as just another remote cluster
Run the job with Amazon’s Elastic MapReduce (http://aws.amazon.com/elasticmapreduce/), which allows you to pay for exactly as much computing as you use.
See [White, 2011] for complete details on how to run in these different modes...
Systems Development APIs
And, you can build production systems with Hadoop in either Java or C++...
Full-featured Java API for Hadoop
Pipes is the C++ API for Hadoop MapReduce
Cascading is an API for developing general data processing systems that incorporate MapReduce (http://www.cascading.org)
Cascading
Cascading allows developers to model very complex data flows at a higher level than Map & Reduce and then
automatically generate and visualize the dozens or perhaps even hundreds of necessary Hadoop jobs as a graph
Ad Hoc Analysis
What’s missing? Sometimes you need to run fast ad hoc queries. Can we do that in a scalable way?
Pig: “Pig is a scripting language for exploring large datasets” [White, 2011] (Yahoo!)
Hive: provides an SQL interface for running ad hoc queries and other data processing tasks for SQL analysts (Facebook)
HBase: column-oriented database along the lines of Google’s Bigtable database (Powerset)
Hypertable: GPL clone of Google’s Bigtable database written in C++ (Zvents)
Google’s Bigtable database is described in [Chang, Dean, Ghemawat, et al., 2006]
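The Bigtable paper describes its data model as a sparse, sorted map from (row key, column key, timestamp) to an uninterpreted value. As a rough illustration of what "along the lines of Bigtable" means, here is a toy in-memory version in Python. TinyTable and its methods are hypothetical names; this is not the HBase or Hypertable client API, and it ignores distribution and persistence entirely.

```python
class TinyTable:
    """Toy in-memory table keyed by (row, column, timestamp)."""

    def __init__(self):
        # row key -> column key -> list of (timestamp, value), newest first
        self._rows = {}

    def put(self, row, column, timestamp, value):
        """Store one versioned cell."""
        versions = self._rows.setdefault(row, {}).setdefault(column, [])
        versions.append((timestamp, value))
        versions.sort(key=lambda cell: cell[0], reverse=True)

    def get(self, row, column):
        """Return the newest value for a cell, or None if it is absent."""
        versions = self._rows.get(row, {}).get(column)
        return versions[0][1] if versions else None

    def scan(self, start_row, stop_row):
        """Yield (row, newest cells) for rows in [start_row, stop_row), in key order."""
        for row in sorted(self._rows):
            if start_row <= row < stop_row:
                cells = self._rows[row]
                yield row, {col: vers[0][1] for col, vers in cells.items()}


if __name__ == "__main__":
    table = TinyTable()
    table.put("com.example.www", "contents:html", 1, "<html>v1</html>")
    table.put("com.example.www", "contents:html", 2, "<html>v2</html>")
    table.put("org.example.data", "anchor:home", 1, "Data portal")
    print(table.get("com.example.www", "contents:html"))
    for row, cells in table.scan("com", "con"):
        print(row, cells)
```

Rows only store the columns they actually have, so sparse data costs nothing extra, and lookups by row key or by row-key range are the natural access pattern.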
Interesting Application Frameworks with Hadoop
Here are a few examples of frameworks, in development or already available, that use Hadoop as a platform:
Apache Mahout: Ambitious project to implement popular machine learning algorithms and recommenders with Hadoop (http://mahout.apache.org/)
Graph: Jake Hofman from Yahoo Research has released some of his work on large-scale network analysis with Hadoop, with prototype code (http://github.com/jhofman/icwsm2010_tutorial). Also see [Vassilvitskii, 2010] for related graph analysis research.
Application to GIS: Nathan Kerr’s M.S. thesis, with lots of details on how to do GIS with Hadoop (http://www.nathankerr.com/projects/parallel-gis-processing/alternative_approaches_to_parallel_gis_processing.html)
Further Reading
White, T.
Hadoop: The Definitive Guide, 2nd Edition
O’Reilly Media, Inc., Sebastopol, CA, 2011
Sanderson, D.
Programming Google App Engine
O’Reilly Media, Inc., Sebastopol, CA, 2009
Murty, J.
Programming Amazon Web Services
O’Reilly Media, Inc., Sebastopol, CA, 2008
Dean, J. and Ghemawat, S.
MapReduce: simplified data processing on large clusters
Communications of the ACM, 51(1):107–113, 2008
Chang, F., Dean, J., Ghemawat, S., Hsieh, W. C., Wallach, D. A., Burrows, M., Chandra, T., Fikes, A., and Gruber, R. E.
Bigtable: a distributed storage system for structured data
OSDI ’06: Proceedings of the 7th USENIX Symposium on Operating Systems Design and Implementation, USENIX Assoc., Berkeley, CA, 2006
MapReduce on Wikipedia
http://en.wikipedia.org/wiki/MapReduce
Vassilvitskii, S.
XXL Graph Algorithms, Hadoop Summit 2010
http://developer.yahoo.com/events/hadoopsummit2010/