MapReduce / Hadoop for Scientific Data Mining
        Hadoop = Open Source MapReduce
                     Wider World of Hadoop




   An Introduction to the World of Hadoop
               Applications to Scientific Data Mining


                                    Gordon Rios

                              g.rios@4c.ucc.ie
                     Cork Constraint Computation Centre (4C)
                             University College Cork


                                October 29, 2010




                                Gordon Rios    Introduction to Hadoop


Outline

  1   MapReduce / Hadoop for Scientific Data Mining
        Objectives for the Talk
        MapReduce as Simplified Parallel Computing
        Thinking in Terms of MapReduce
  2   Hadoop is Open Source MapReduce
        Basics of Hadoop
        Hadoop Examples
        Developing Production Systems with Hadoop
  3   Wider World of Hadoop
        Ad Hoc Analysis with Hadoop
        Further Reading




Objectives

  At the end of this talk I want you to have ideas for how to apply
  MapReduce to your domains and confidence that Hadoop is a
  good way to do it...
       Introduce thinking in terms of MapReduce and why it’s a
       good idea
       Introduce Hadoop as an open source implementation of
       MapReduce
       Present a detailed example of using the Hadoop
       streaming API for a scientific data mining task
       Discuss higher-level notions for performing ad hoc analysis
       and building systems with Hadoop




MapReduce
 MapReduce is distributed computing in which we take advantage of
 data locality to push the computation to the data...
     Distributed computing: clusters of computers, each with local
     memory and disk (network intensive for big data)
     Parallel computing: multiple CPUs processing over shared
     memory and a shared filesystem
     If we can decompose the problem into independent map and
     reduce tasks, we can achieve “easy” parallelism with
     MapReduce...
        1 Map works independently to convert input data to key-value
          pairs...
        2 Reduce works independently on all values for a given key
          and transforms them to a single output set per key
          (possibly even just the empty set, ∅)...
     Now, let’s expand that a bit...



Basic Elements of MapReduce
  MapReduce is a distributed sort with specific places to insert
  application logic...
          an input reader: reads work data W from the file system¹ and
          produces a set of splits S: W → S
          a Map function: S → (K, V)
          a combiner function: a mapper-side optimization...
          a partition function: partitions² keys k ∈ K to reducers: K → R
          a compare function cmp(k_i, k_j): sorts the keys presented to
          each reducer
          a Reduce function: reduces the output from all mappers for a
          particular key k to another set of values w_k for that key:
          (k, V) → (k, w_k)
          an output writer: writes output to the file system.
    ¹ A distributed file system (DFS) for stability and scale
    ² The default is to hash the key modulo the number of reducers
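
  The stages listed above can be sketched in memory on a single machine.
  This is only an illustration, not how Hadoop is implemented: the
  function names (run_mapreduce, word_map, word_sum) are made up for this
  sketch, and crc32 stands in for whatever hash a real partitioner uses.

```python
from collections import defaultdict
from zlib import crc32


def run_mapreduce(splits, map_fn, reduce_fn, num_reducers=2, combine_fn=None):
    """Simulate the MapReduce stages in memory (illustrative only)."""
    # One bucket of key -> list of values per reducer.
    partitions = [defaultdict(list) for _ in range(num_reducers)]
    for split in splits:
        # Map: each split is converted to (key, value) pairs independently.
        local = defaultdict(list)
        for key, value in map_fn(split):
            local[key].append(value)
        for key, values in local.items():
            # Combine: optional pre-aggregation of one mapper's output.
            if combine_fn is not None:
                values = [combine_fn(key, values)]
            # Partition: hash of the key modulo the number of reducers.
            r = crc32(key.encode("utf-8")) % num_reducers
            partitions[r][key].extend(values)
    output = []
    for bucket in partitions:
        # Sort: each reducer is presented with its keys in sorted order.
        for key in sorted(bucket):
            # Reduce: all values for one key become one output value.
            output.append((key, reduce_fn(key, bucket[key])))
    return output


def word_map(text):
    """Map function for word count: emit (word, 1)."""
    for word in text.split():
        yield word.lower(), 1


def word_sum(key, values):
    """Usable as both combiner and reducer because addition is associative."""
    return sum(values)


splits = ["the quick brown fox", "the lazy dog", "the end"]
counts = dict(run_mapreduce(splits, word_map, word_sum, combine_fn=word_sum))
print(counts["the"])  # prints 3
```

  Passing the same function as combiner and reducer works here only
  because summing partial sums gives the same total.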


Examples of Map and Reduce
  Let’s start with a few examples of Map...
       Word Count: read in a stream of text (e.g. a document or a set of
       documents) and emit each word as a key with a value of 1
       Inverted Index: read in a stream of documents and emit each
       word as a key and the document ID as the value
       Max Temperature: read in formatted data and emit the year as a
       key with the temperature as the value
       Mean Precipitation: read in daily data and emit
       (year-month, lat, long) as a key with precipitation as
       the value
  Reduce in these cases simply applies a count, list, max, or
  average, respectively, to the set of values for each key.
  [Dean and Ghemawat, 2008; Wikipedia, 2010; White, 2011]
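
  A rough sketch of these map and reduce pairs on toy data. The
  documents, the (year, value) pairs, and all function names here are
  invented for illustration; the grouping step stands in for the sort
  and shuffle that a MapReduce framework performs between map and reduce.

```python
from collections import defaultdict


def map_word_count(doc_id, text):
    """Word Count map: emit (word, 1) for every word."""
    for word in text.split():
        yield word, 1


def map_inverted_index(doc_id, text):
    """Inverted Index map: emit (word, document ID)."""
    for word in text.split():
        yield word, doc_id


def group_by_key(pairs):
    """Collect all values for each key (the framework's shuffle step)."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def apply_reduce(pairs, reduce_fn):
    """Apply reduce_fn to the list of values for each key."""
    return {key: reduce_fn(values) for key, values in group_by_key(pairs).items()}


# Made-up inputs.
docs = {"d1": "rain in spain", "d2": "rain in cork"}
yearly_pairs = [("2008", 21.5), ("2009", 19.0), ("2009", 24.5)]

word_pairs = [p for d, t in docs.items() for p in map_word_count(d, t)]
index_pairs = [p for d, t in docs.items() for p in map_inverted_index(d, t)]

word_counts = apply_reduce(word_pairs, sum)                         # count
inverted_index = apply_reduce(index_pairs, lambda v: sorted(set(v)))  # list
yearly_max = apply_reduce(yearly_pairs, max)                        # max
yearly_mean = apply_reduce(yearly_pairs, lambda v: sum(v) / len(v))   # average
```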



Visualizing Word Count

  [Figure: word count diagram]
  source: Chris Wensel from http://www.cascading.org






Engineering Intermezzo



  This is how easy it is to get Hadoop installed, given that you
  have Java 6 installed already...
  Get Hadoop: http://hadoop.apache.org/

       % tar xzf hadoop-x.y.z.tar.gz
       % export HADOOP_INSTALL=BUILD_DIR/hadoop-x.y.z
       % export PATH=$PATH:$HADOOP_INSTALL/bin






MapReduce with Hadoop and the streaming library


  Now, let’s take a closer look at how Hadoop implements
  MapReduce, from [White, 2011]...






Hadoop Streaming Library



  We’ll focus on the streaming library, as it’s the most natural fit
  for scientific or technical computing. Let’s look at the weather
  example from the Definitive Guide...






Hadoop Book Examples

  More examples from Hadoop: The Definitive Guide, 2nd Edition
  (Hadoop 20.1), http://www.hadoopbook.com/. Here’s how to
  install and try them for yourself:

       Install Git: http://git-scm.com/
       Visit GitHub for the book code:
       http://github.com/tomwhite/hadoop-book/

  Check out the code examples from The Definitive Guide:
  % cd BUILD_DIR
  % git clone http://github.com/tomwhite/hadoop-book.git hadoop-book






Example: ECA Mean Precipitation

  Let’s compute mean precipitation at over 2,000 weather stations and
  make some graphics. There are 2,186 files with a median of 21,875
  lines each, a minimum of 1,025, and a maximum of 78,090.
  ECA Daily Data

  The ECA dataset contains series of daily observations at meteorological stations throughout Europe and the
  Mediterranean. Part of the dataset is freely available for non-commercial research. To download this public data
  select one of the options below. Note that a gridded version with daily temperature and precipitation fields is also
  available. source: http://eca.knmi.nl/dailydata/index.php


  File Format


  FILE FORMAT (MISSING VALUE CODE = -9999):

  01-06   STAID :   Station identifier
  08-13   SOUID :   Source identifier
  15-22   DATE  :   Date YYYYMMDD
  24-28   RR    :   Precipitation amount in 0.1 mm
  30-34   Q_RR  :   quality code for RR (0='valid'; 1='suspect'; 9='missing')
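
  A small sketch of reading one record in this layout. It assumes the
  five fields are separated by commas (the byte positions above leave
  one separator column between fields), keeps only valid observations,
  and converts RR from units of 0.1 mm to mm. The function name and the
  sample lines are made up for illustration.

```python
MISSING = -9999  # missing value code from the file format description


def parse_rr_line(line):
    """Parse 'STAID,SOUID,DATE,RR,Q_RR'; return None for unusable lines."""
    fields = [f.strip() for f in line.split(",")]
    if len(fields) != 5:
        return None  # free-text header lines do not have five fields
    staid, souid, date, rr, q_rr = fields
    try:
        rr_tenths = int(rr)
    except ValueError:
        return None  # e.g. a column-name line where RR is not a number
    if rr_tenths == MISSING or q_rr != "0":
        return None  # keep only observations flagged valid
    return {"staid": staid, "date": date, "precip_mm": rr_tenths / 10.0}


# Made-up sample lines in the documented layout.
valid = parse_rr_line("    11, 35381,20090115,   23,    0")
missing = parse_rr_line("    11, 35381,20090116,-9999,    9")
column_names = parse_rr_line(" STAID, SOUID,    DATE,   RR, Q_RR")
```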






Example: ECA Mean Precipitation


  Scientific Data Mining: use the Hadoop streaming library and
  manually pipeline MapReduce jobs together as needed...
      Write the Hadoop scripts in Python, in two steps
      Test locally: cat data | map.py | sort | reduce.py > output
      (not shown)
      Process the data into individual files for each time period
      (year/month) of interest using the Hadoop streaming library
      (local mode)
      Call R in batch mode to produce image files
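
  The local test pipeline (cat, map, sort, reduce) can also be mimicked
  inside Python, which makes the mapper and reducer easy to unit test.
  The sketch below is an assumption-laden toy: the "key,value" input
  format and the function names are invented, and sorted() plays the
  role of the Unix sort between the two scripts.

```python
from itertools import groupby


def mapper(lines):
    """Emit 'key<TAB>value' lines, as a streaming mapper writes to stdout."""
    for line in lines:
        key, value = line.strip().split(",")
        yield "%s\t%s" % (key, value)


def reducer(sorted_lines):
    """Average the values per key; relies on the input being sorted by key."""
    pairs = (line.split("\t") for line in sorted_lines)
    for key, group in groupby(pairs, key=lambda kv: kv[0]):
        values = [float(v) for _, v in group]
        yield "%s\t%.2f" % (key, sum(values) / len(values))


def run_local(lines):
    """In-process equivalent of: cat data | map.py | sort | reduce.py"""
    return list(reducer(sorted(mapper(lines))))


data = ["200902,4", "200901,10", "200901,20"]  # made-up input
result = run_local(data)
```

  Without the sort, groupby would see the two 200901 records as
  separate groups whenever another key falls between them, which is the
  same reason the shell pipeline needs sort between map.py and reduce.py.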






ECA Mean Precipitation: Step One

  map_one.py
  import sys

  # BEGIN_DATE and END_DATE (YYYYMMDD strings) and latlons (a dict mapping
  # station id -> (lat, lon)) must be defined before this point; their
  # setup is not shown on the slide.

  def lat_lon_to_coord(dms):
      # convert a "deg:min:sec" string to signed decimal degrees
      d, m, s = [float(x) for x in dms.split(":")]
      # take the sign from the text: "-0:30:00" parses to -0.0, which is not < 0
      sign = -1 if dms.strip().startswith("-") else 1
      x = abs(d) + m / 60.0 + s / 3600.0
      return float(sign * x)

  for line in sys.stdin:
      # flds = (staid, souid, date, rr, q_rr)
      flds = line.strip().split(",")
      if len(flds) != 5:
          continue
      staid = flds[0].strip()    # station id
      date = flds[2].strip()     # YYYYMMDD
      if date < BEGIN_DATE or date > END_DATE:
          continue
      rr = flds[3].strip()       # precipitation in 0.1 mm
      q_rr = flds[4].strip()     # quality code "0" = valid
      lat, lon = latlons.get(staid, (None, None))
      if q_rr == '0' and (lat is not None) and (lon is not None):
          # key = "yyyymm,lat,lon", value = precipitation
          print("%s,%.4f,%.4f\t%s" % (date[0:6], lat, lon, rr))
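
  Degree:minute:second strings need some care near zero degrees: a value
  such as "-0:30:00" has a degrees field that parses to -0.0, which does
  not compare as less than zero. A minimal sketch that reads the sign
  from the text instead; the function name and the sample strings are
  illustrative, and the "+DD:MM:SS" input style is an assumption about
  the station metadata.

```python
def dms_to_decimal(dms):
    """Convert a 'deg:min:sec' string to signed decimal degrees."""
    text = dms.strip()
    # Read the sign from the text so that '-0:30:00' stays negative.
    sign = -1.0 if text.startswith("-") else 1.0
    degrees, minutes, seconds = (abs(float(part)) for part in text.split(":"))
    return sign * (degrees + minutes / 60.0 + seconds / 3600.0)


north = dms_to_decimal("+41:07:07")        # 41 + 7/60 + 7/3600
just_west = dms_to_decimal("-0:30:00")     # half a degree west
further_west = dms_to_decimal("-8:28:12")  # 8 + 28/60 + 12/3600, negated
```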






ECA Mean Precipitation: Step One (cont)


  reduce_one.py
  import sys

  # input arrives sorted by key, so all values for one key are contiguous
  (last_key, x, n) = (None, 0.0, 0)    # current key, running sum, count
  for line in sys.stdin:
      (key, val) = line.strip().split("\t")
      if last_key and last_key != key:    # time to emit reduced value
          if n > 0:
              print("%s\t%.2f" % (last_key, x / n))
          x = 0.0
          n = 0
      # we just want data for the year 2009 (the mapper's date filter handles that)
      (last_key, x, n) = (key, x + float(val), n + 1)
  if last_key:    # emit the final key
      if n > 0:
          print("%s\t%.2f" % (last_key, x / n))
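
  A design note on this reducer: a mean cannot be reused as a combiner
  the way a sum or a max can, because an average of per-mapper averages
  weights each mapper equally rather than each observation. If a
  combiner were added to this step, one option is to have it emit
  (sum, count) pairs and let the reducer divide at the end. The sketch
  below uses made-up values and hypothetical function names.

```python
def combine(values):
    """Mapper-side partial aggregate: (sum, count) instead of a mean."""
    return (sum(values), len(values))


def reduce_mean(partials):
    """Merge (sum, count) partials from all mappers into one mean."""
    total = sum(s for s, _ in partials)
    count = sum(n for _, n in partials)
    return total / count if count else None


# Made-up values for a single key, split across two mappers.
mapper_a = [10.0, 20.0, 30.0]
mapper_b = [50.0]

correct = reduce_mean([combine(mapper_a), combine(mapper_b)])  # 110 / 4
naive = (sum(mapper_a) / len(mapper_a) + sum(mapper_b) / len(mapper_b)) / 2
```

  Here the correct mean is 27.5, while averaging the two per-mapper
  means gives 35.0.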






ECA Mean Precipitation: Step Two
  Map: ((yyyymm,lat,lon), mean) -> (yyyymm, (lat,lon,mean))

  map_two.py

  import sys

  for line in sys.stdin:
      yyyymm_lat_lon, mean_precip = line.strip().split("\t")
      yyyymm, lat, lon = yyyymm_lat_lon.strip().split(",")
      # new key = yyyymm, new value = "lat lon mean"
      print("%s\t%s %s %s" % (yyyymm, lat, lon, mean_precip))

  The reducer does no aggregation: it just writes each key’s values to a
  local file (a hack that works because we’re running locally).

  reduce_two.py

  import sys

  # write_file(key, values) is a helper whose definition is not shown on the slide
  last_key = None
  values = []
  for line in sys.stdin:
      (key, val) = line.strip().split("\t")
      if last_key and last_key != key:    # key changed: write out the finished group
          write_file(last_key, values)
          values = []
      last_key = key
      values.append(val)    # val is a string holding three values: lat, lon, mean

  if last_key:    # write out the final group
      write_file(last_key, values)
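
  The slides do not show write_file. A hypothetical version is sketched
  below, assuming one output file per key named <yyyymm>.dat with one
  "lat lon mean" line per value; the out_dir parameter and the return
  value are additions for illustration only.

```python
import os


def write_file(key, values, out_dir="."):
    """Write each value on its own line to <out_dir>/<key>.dat (hypothetical)."""
    path = os.path.join(out_dir, "%s.dat" % key)
    with open(path, "w") as handle:
        for value in values:
            handle.write(value + "\n")
    return path
```

  Each line then holds three whitespace-separated columns, which a
  table reader can load without further parsing.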





Example: ECA Mean Precipitation


  Step One: input -> (yyyymm,lat,lon), mean precip

  % hadoop jar /Users/gordon/build/hadoop/contrib/streaming/hadoop-0.20.2-streaming.jar -input
  /Downloads/tarragona/ECA_blend_all_data/RR_STAID00* -output output -mapper
  /Desktop/tmp/tarragona/python/map_one.py -reducer /Desktop/tmp/tarragona/python/reduce_one.py
  % 10/10/25 19:34:31 INFO jvm.JvmMetrics: Initializing JVM Metrics with processName=JobTracker, sessionId=
  ...
  % 10/10/25 20:07:36 INFO streaming.StreamJob: Job complete: job_local_0001
  % 10/10/25 20:07:36 INFO streaming.StreamJob: Output: output



  Step Two: (yyyymm,lat,lon), mean precip -> files(yyyymm)

  % hadoop jar /Users/gordon/build/hadoop/contrib/streaming/hadoop-0.20.2-streaming.jar -input output/part-00000
  -output output_two -mapper /Desktop/tmp/tarragona/python/map_two.py -reducer
  /Desktop/tmp/tarragona/python/reduce_two.py
  % 10/10/25 20:41:43 INFO jvm.JvmMetrics: Initializing JVM Metrics with processName=JobTracker, sessionId=
  ...
  % 10/10/25 20:41:47 INFO streaming.StreamJob: Job complete: job_local_0001
  % 10/10/25 20:41:47 INFO streaming.StreamJob: Output: output_two






Batch Processing in R
  And, after a little batch processing with R...
  batch-graphics.R


  library(fields)
  files <- c("200901.dat", "200902.dat", "200903.dat",
             "200904.dat", "200905.dat", "200906.dat",
             "200907.dat", "200908.dat", "200909.dat",
             "200910.dat", "200911.dat", "200912.dat")
  i <- 1
  for (f in files) {
      mat <- read.table(f)
      names(mat) <- c("lat", "long", "precip")
      png(filename=paste("precip-", i, ".png", sep=""), height=480, width=480)
      quilt.plot(mat$long, mat$lat, mat$precip, ncol=100, nrow=100,
          ylim=c(22.0, 79.0), xlim=c(-52.0, 72.0),
          col=two.colors(256, start="wheat", end="darkblue", middle="blue"),
          zlim=c(0, 410), add.legend=T, cex.lab=0.6)
      points(1.2453, 41.1187, pch=1)
      text(1.2453, 41.1187, "tarragona", cex=0.8, pos=1)
      points(2.35083, 48.89, pch=1)
      text(2.3508, 48.89, "paris", cex=0.8, pos=4)
      points(12.4823, 41.8955, pch=1)
      text(12.4823, 41.8955, "rome", cex=0.8, pos=4)
      dev.off()
      i <- i + 1
  }


  [Figures: twelve maps, one per slide, titled "ECA Precipitation 2009 Month: 1" through "ECA Precipitation 2009 Month: 12"]


Summary of What We Did

  We worked through a complete example, but that’s not all: with very
  little additional work we can...
       Test the scripts in pseudo-distributed mode locally on our own
       machine
       Run the job on a compute cluster remotely
       Run the job in the cloud on EC2, treating their system as just
       another remote cluster
       Run the job with Amazon’s Elastic MapReduce
       (http://aws.amazon.com/elasticmapreduce/), which
       allows you to pay for exactly as much computing as you use
  See [White, 2011] for complete details on how to run in these different
  modes.
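
  Before a job goes anywhere near a cluster, the map, sort and reduce steps can be rehearsed in a single local process. What follows is a minimal Python sketch of that idea, not the scripts used in this talk. The record layout (month, latitude, longitude, precipitation, separated by whitespace), the function names and the choice of total precipitation per month as the statistic are all assumptions made for illustration.

```python
from itertools import groupby
from operator import itemgetter


def mapper(lines):
    """Emit (month, precip) pairs from whitespace-separated records.

    Assumed (hypothetical) record layout: month lat long precip
    """
    for line in lines:
        fields = line.split()
        if len(fields) != 4:
            continue  # skip malformed records
        month, _lat, _long, precip = fields
        yield month, float(precip)


def reducer(pairs):
    """Sum precipitation per month; pairs must arrive sorted by key."""
    for month, group in groupby(pairs, key=itemgetter(0)):
        yield month, sum(value for _, value in group)


def run_local(lines):
    """Imitate the map -> sort -> reduce cycle in one process."""
    mapped = sorted(mapper(lines), key=itemgetter(0))
    return dict(reducer(mapped))


# Made-up sample records, including one malformed line
sample = [
    "200901 41.1 1.2 10.0",
    "200902 48.9 2.4 5.5",
    "200901 41.9 12.5 2.5",
    "bad record",
]
totals = run_local(sample)
print(totals)
```

  The sort between the two steps stands in for the framework's shuffle: the reducer only works because all values for one key arrive together.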


Systems Development APIs



  And you can build production systems with Hadoop in either
  Java or C++...
      Full-featured Java API for Hadoop
      Pipes is the C++ API for Hadoop MapReduce
      Cascading is an API for developing general data
      processing systems that incorporate MapReduce
      (http://www.cascading.org)






Cascading

  Cascading allows developers to model very complex data flows at a higher level than Map and Reduce, and then to
  automatically generate and visualize the dozens or perhaps even hundreds of necessary Hadoop jobs as a graph.
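
  The idea of planning a data flow as a graph of jobs can be illustrated without Cascading itself. The sketch below is plain Python, not the Cascading API. The job names and their dependencies are hypothetical, and the standard library's graphlib module (Python 3.9 or later) stands in for a planner by ordering the jobs so that each one runs only after the jobs it depends on.

```python
from graphlib import TopologicalSorter

# Hypothetical data flow: each job maps to the set of jobs it depends on.
flow = {
    "parse_records": set(),
    "filter_region": {"parse_records"},
    "monthly_totals": {"filter_region"},
    "station_counts": {"filter_region"},
    "join_report": {"monthly_totals", "station_counts"},
}


def plan(flow):
    """Return the jobs in an order that respects every dependency."""
    return list(TopologicalSorter(flow).static_order())


order = plan(flow)
for step, job in enumerate(order, start=1):
    print(step, job)
```

  With only five jobs the ordering is easy to see by eye. The point of a planner is that the same bookkeeping still works when the flow expands into dozens or hundreds of jobs.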





Ad Hoc Analysis

  What’s missing? Sometimes you need to do fast ad hoc
  queries... can we do that in a scalable way?
       Pig: “Pig is a scripting language for exploring large datasets”
       [White, 2011] (Yahoo!)
       Hive: provides an SQL interface for running ad hoc queries and
       other data processing tasks for SQL analysts (Facebook)
       HBase: column-oriented database along the lines of Google’s
       Bigtable database (Powerset)
       Hypertable: GPL clone of Google’s Bigtable database written in
       C++ (Zvents)
  Google’s Bigtable database is described
  in [Chang, Dean, Ghemawat, et al., 2006]
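
  The Bigtable paper describes its data model as a sparse sorted map indexed by row key, column key and timestamp. A toy, in-memory Python sketch of that model may make the bullets above more concrete. It is not the HBase or Hypertable client API: the class name, its methods and the sample keys are all hypothetical, and nothing here is distributed or persistent.

```python
class TinyTable:
    """Toy in-memory map keyed by (row, column, timestamp)."""

    def __init__(self):
        # (row, column) -> {timestamp: value}; absent cells take no space
        self._cells = {}

    def put(self, row, column, timestamp, value):
        """Store one version of a cell."""
        self._cells.setdefault((row, column), {})[timestamp] = value

    def get(self, row, column):
        """Return the most recent value for a cell, or None if absent."""
        versions = self._cells.get((row, column))
        if not versions:
            return None
        return versions[max(versions)]

    def scan(self, start_row, stop_row):
        """Yield (row, column, latest value) for start_row <= row < stop_row,
        in sorted key order."""
        for row, column in sorted(self._cells):
            if start_row <= row < stop_row:
                yield row, column, self.get(row, column)


# Made-up sample data
table = TinyTable()
table.put("station:0001", "obs:precip", 20090101, "3.2")
table.put("station:0001", "obs:precip", 20090102, "0.0")
table.put("station:0002", "obs:precip", 20090101, "7.5")
table.put("station:0002", "meta:name", 20090101, "paris")

latest = table.get("station:0001", "obs:precip")
rows = list(table.scan("station:0002", "station:0003"))
print(latest, rows)
```

  Keeping the keys sorted is what makes a range scan over neighbouring rows cheap, which is why row key design matters so much in systems of this kind.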




Interesting Application Frameworks with Hadoop

  Here are a few examples of frameworks in development or already
  available that use Hadoop as a platform...
          Apache Mahout: Ambitious project to implement popular
          machine learning algorithms and recommenders with Hadoop
          (http://mahout.apache.org/)
          Graph: Jake Hofman from Yahoo! Research has released some
          of his work on large scale network analysis with Hadoop with
          prototype code (http://github.com/jhofman/icwsm2010_tutorial).
          Also see [Vassilvitskii, 2010] for related graph
          analysis research.
          Application to GIS: Nathan Kerr’s M.S. thesis with lots of details
          on how to do GIS with Hadoop
          (http://www.nathankerr.com/projects/parallel-gis-processing/alternative_approaches_to_parallel_gis_processing.html)


Further Reading
     White, T.
     Hadoop: The Definitive Guide, 2nd Edition
     O’Reilly Media, Inc., Sebastopol, CA, 2011

     Sanderson, D.
     Programming Google App Engine
     O’Reilly Media, Inc., Sebastopol, CA, 2009

     Murty, J.
     Programming Amazon Web Services
     O’Reilly Media, Inc., Sebastopol, CA, 2008

     Dean, J. and Ghemawat, S.
     MapReduce: simplified data processing on large clusters
     Communications of the ACM, 51(1):107–113, 2008

     Chang, F., Dean, J., Ghemawat, S., Hsieh, W. C., Wallach, D. A.,
     Burrows, M., Chandra, T., Fikes, A. and Gruber, R. E.
     Bigtable: a distributed storage system for structured data
     OSDI ’06: Proceedings of the 7th USENIX Symposium on Operating Systems Design and Implementation,
     USENIX Assoc., Berkeley, CA, 2006

     MapReduce on Wikipedia
     http://en.wikipedia.org/wiki/MapReduce

     Vassilvitskii, S.
     XXL Graph Algorithms, Hadoop Summit 2010
     http://developer.yahoo.com/events/hadoopsummit2010/

 
Mapreduce Hadop.pptx
Mapreduce Hadop.pptxMapreduce Hadop.pptx
Mapreduce Hadop.pptx
 

Recently uploaded

CNv6 Instructor Chapter 6 Quality of Service
CNv6 Instructor Chapter 6 Quality of ServiceCNv6 Instructor Chapter 6 Quality of Service
CNv6 Instructor Chapter 6 Quality of Servicegiselly40
 
Transcript: #StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024
Transcript: #StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024Transcript: #StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024
Transcript: #StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024BookNet Canada
 
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...Miguel Araújo
 
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...Drew Madelung
 
Transforming Data Streams with Kafka Connect: An Introduction to Single Messa...
Transforming Data Streams with Kafka Connect: An Introduction to Single Messa...Transforming Data Streams with Kafka Connect: An Introduction to Single Messa...
Transforming Data Streams with Kafka Connect: An Introduction to Single Messa...HostedbyConfluent
 
SQL Database Design For Developers at php[tek] 2024
SQL Database Design For Developers at php[tek] 2024SQL Database Design For Developers at php[tek] 2024
SQL Database Design For Developers at php[tek] 2024Scott Keck-Warren
 
GenCyber Cyber Security Day Presentation
GenCyber Cyber Security Day PresentationGenCyber Cyber Security Day Presentation
GenCyber Cyber Security Day PresentationMichael W. Hawkins
 
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
08448380779 Call Girls In Diplomatic Enclave Women Seeking MenDelhi Call girls
 
FULL ENJOY 🔝 8264348440 🔝 Call Girls in Diplomatic Enclave | Delhi
FULL ENJOY 🔝 8264348440 🔝 Call Girls in Diplomatic Enclave | DelhiFULL ENJOY 🔝 8264348440 🔝 Call Girls in Diplomatic Enclave | Delhi
FULL ENJOY 🔝 8264348440 🔝 Call Girls in Diplomatic Enclave | Delhisoniya singh
 
How to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected WorkerHow to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected WorkerThousandEyes
 
04-2024-HHUG-Sales-and-Marketing-Alignment.pptx
04-2024-HHUG-Sales-and-Marketing-Alignment.pptx04-2024-HHUG-Sales-and-Marketing-Alignment.pptx
04-2024-HHUG-Sales-and-Marketing-Alignment.pptxHampshireHUG
 
A Domino Admins Adventures (Engage 2024)
A Domino Admins Adventures (Engage 2024)A Domino Admins Adventures (Engage 2024)
A Domino Admins Adventures (Engage 2024)Gabriella Davis
 
#StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024
#StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024#StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024
#StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024BookNet Canada
 
The Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdf
The Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdfThe Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdf
The Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdfEnterprise Knowledge
 
Enhancing Worker Digital Experience: A Hands-on Workshop for Partners
Enhancing Worker Digital Experience: A Hands-on Workshop for PartnersEnhancing Worker Digital Experience: A Hands-on Workshop for Partners
Enhancing Worker Digital Experience: A Hands-on Workshop for PartnersThousandEyes
 
My Hashitalk Indonesia April 2024 Presentation
My Hashitalk Indonesia April 2024 PresentationMy Hashitalk Indonesia April 2024 Presentation
My Hashitalk Indonesia April 2024 PresentationRidwan Fadjar
 
Finology Group – Insurtech Innovation Award 2024
Finology Group – Insurtech Innovation Award 2024Finology Group – Insurtech Innovation Award 2024
Finology Group – Insurtech Innovation Award 2024The Digital Insurer
 
The Codex of Business Writing Software for Real-World Solutions 2.pptx
The Codex of Business Writing Software for Real-World Solutions 2.pptxThe Codex of Business Writing Software for Real-World Solutions 2.pptx
The Codex of Business Writing Software for Real-World Solutions 2.pptxMalak Abu Hammad
 
[2024]Digital Global Overview Report 2024 Meltwater.pdf
[2024]Digital Global Overview Report 2024 Meltwater.pdf[2024]Digital Global Overview Report 2024 Meltwater.pdf
[2024]Digital Global Overview Report 2024 Meltwater.pdfhans926745
 
From Event to Action: Accelerate Your Decision Making with Real-Time Automation
From Event to Action: Accelerate Your Decision Making with Real-Time AutomationFrom Event to Action: Accelerate Your Decision Making with Real-Time Automation
From Event to Action: Accelerate Your Decision Making with Real-Time AutomationSafe Software
 

Recently uploaded (20)

CNv6 Instructor Chapter 6 Quality of Service
CNv6 Instructor Chapter 6 Quality of ServiceCNv6 Instructor Chapter 6 Quality of Service
CNv6 Instructor Chapter 6 Quality of Service
 
Transcript: #StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024
Transcript: #StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024Transcript: #StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024
Transcript: #StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024
 
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
 
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
 
Transforming Data Streams with Kafka Connect: An Introduction to Single Messa...
Transforming Data Streams with Kafka Connect: An Introduction to Single Messa...Transforming Data Streams with Kafka Connect: An Introduction to Single Messa...
Transforming Data Streams with Kafka Connect: An Introduction to Single Messa...
 
SQL Database Design For Developers at php[tek] 2024
SQL Database Design For Developers at php[tek] 2024SQL Database Design For Developers at php[tek] 2024
SQL Database Design For Developers at php[tek] 2024
 
GenCyber Cyber Security Day Presentation
GenCyber Cyber Security Day PresentationGenCyber Cyber Security Day Presentation
GenCyber Cyber Security Day Presentation
 
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
 
FULL ENJOY 🔝 8264348440 🔝 Call Girls in Diplomatic Enclave | Delhi
FULL ENJOY 🔝 8264348440 🔝 Call Girls in Diplomatic Enclave | DelhiFULL ENJOY 🔝 8264348440 🔝 Call Girls in Diplomatic Enclave | Delhi
FULL ENJOY 🔝 8264348440 🔝 Call Girls in Diplomatic Enclave | Delhi
 
How to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected WorkerHow to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected Worker
 
04-2024-HHUG-Sales-and-Marketing-Alignment.pptx
04-2024-HHUG-Sales-and-Marketing-Alignment.pptx04-2024-HHUG-Sales-and-Marketing-Alignment.pptx
04-2024-HHUG-Sales-and-Marketing-Alignment.pptx
 
A Domino Admins Adventures (Engage 2024)
A Domino Admins Adventures (Engage 2024)A Domino Admins Adventures (Engage 2024)
A Domino Admins Adventures (Engage 2024)
 
#StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024
#StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024#StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024
#StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024
 
The Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdf
The Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdfThe Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdf
The Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdf
 
Enhancing Worker Digital Experience: A Hands-on Workshop for Partners
Enhancing Worker Digital Experience: A Hands-on Workshop for PartnersEnhancing Worker Digital Experience: A Hands-on Workshop for Partners
Enhancing Worker Digital Experience: A Hands-on Workshop for Partners
 
My Hashitalk Indonesia April 2024 Presentation
My Hashitalk Indonesia April 2024 PresentationMy Hashitalk Indonesia April 2024 Presentation
My Hashitalk Indonesia April 2024 Presentation
 
Finology Group – Insurtech Innovation Award 2024
Finology Group – Insurtech Innovation Award 2024Finology Group – Insurtech Innovation Award 2024
Finology Group – Insurtech Innovation Award 2024
 
The Codex of Business Writing Software for Real-World Solutions 2.pptx
The Codex of Business Writing Software for Real-World Solutions 2.pptxThe Codex of Business Writing Software for Real-World Solutions 2.pptx
The Codex of Business Writing Software for Real-World Solutions 2.pptx
 
[2024]Digital Global Overview Report 2024 Meltwater.pdf
[2024]Digital Global Overview Report 2024 Meltwater.pdf[2024]Digital Global Overview Report 2024 Meltwater.pdf
[2024]Digital Global Overview Report 2024 Meltwater.pdf
 
From Event to Action: Accelerate Your Decision Making with Real-Time Automation
From Event to Action: Accelerate Your Decision Making with Real-Time AutomationFrom Event to Action: Accelerate Your Decision Making with Real-Time Automation
From Event to Action: Accelerate Your Decision Making with Real-Time Automation
 

An Introduction to Applying MapReduce and Hadoop for Scientific Data Mining

  • 1. MapReduce / Hadoop for Scientific Data Mining Hadoop = Open Source MapReduce Wider World of Hadoop An Introduction to the World of Hadoop Applications to Scientific Data Mining Gordon Rios g.rios@4c.ucc.ie Cork Constraint Computation Centre (4C) University College Cork October 29, 2010 Gordon Rios Introduction to Hadoop
  • 2. Outline
      1 MapReduce / Hadoop for Scientific Data Mining: Objectives for the Talk; MapReduce as Simplified Parallel Computing; Thinking in Terms of MapReduce
      2 Hadoop is Open Source MapReduce: Basics of Hadoop; Hadoop Examples; Developing Production Systems with Hadoop
      3 Wider World of Hadoop: Ad Hoc Analysis with Hadoop; Further Reading
  • 4. Objectives. At the end of this talk I want you to have ideas for how to apply MapReduce to your domains and confidence that Hadoop is a good way to do it.
      - Introduce thinking in terms of MapReduce and why it's a good idea
      - Introduce Hadoop as an open source implementation of MapReduce
      - Present a detailed example of using the Hadoop streaming API for a scientific data mining task
      - Discuss higher level notions for performing ad hoc analysis and building systems with Hadoop
  • 9. MapReduce. MapReduce is distributed computing where we take advantage of data locality to push the computation to the data.
      - Distributed computing: clusters of computers with local memory and disk (network intensive for big data)
      - Parallel computing: multiple CPUs processing over shared memory and filesystem
      If we can decompose the problem into independent map and reduce tasks we can achieve "easy" parallelism with MapReduce:
      1 Map works independently to convert input data to key-value pairs.
      2 Reduce works independently on all values for a given key and transforms them to a single output set (possibly even just the empty set ∅) per key.
      Now, let's expand that a bit.
  • 15. Basic Elements of MapReduce. MapReduce is a distributed sort with specific places to insert application logic:
      - an input reader: reads work data W from the file system [1] and produces a set of splits S: W → S
      - a Map function: S → (K, V)
      - a combiner function: a mapper-side optimization
      - a partition function: partitions [2] keys k ∈ K to reducers: K → R
      - a compare function cmp(k_i, k_j): sorts the keys presented to each reducer
      - a Reduce function: reduces the output from all mappers for a particular key k to another set of values w_k for that key: (k, V) → (k, w_k)
      - an output writer: writes output to the file system
      [1] A distributed file system (DFS) for stability and scale
      [2] The default hashes keys modulo the number of reducers
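The elements listed on this slide can be sketched as a toy, single-process pipeline. This is an illustration under stated assumptions, not how Hadoop is implemented: `run_mapreduce`, `partition`, and the sample records are hypothetical names and data, a CRC32 hash stands in for the default partitioner, and the combiner, input reader, and output writer are omitted.

```python
import zlib
from collections import defaultdict


def partition(key, num_reducers):
    """Assign a key to a reducer: a stable hash of the key modulo the reducer count."""
    return zlib.crc32(key.encode("utf-8")) % num_reducers


def run_mapreduce(splits, map_fn, reduce_fn, num_reducers=2):
    """Toy pipeline: map each split, partition by key, sort, then reduce per key."""
    # One bucket of (key, value) pairs per (hypothetical) reducer.
    buckets = [[] for _ in range(num_reducers)]
    for split in splits:
        for key, value in map_fn(split):
            buckets[partition(key, num_reducers)].append((key, value))

    output = {}
    for bucket in buckets:
        # Sorting stands in for the compare function: equal keys end up together.
        grouped = defaultdict(list)
        for key, value in sorted(bucket):
            grouped[key].append(value)
        for key in sorted(grouped):
            output[key] = reduce_fn(key, grouped[key])
    return output


def max_temp_map(split):
    """Emit (year, temperature) for each 'year,temperature' line in a split."""
    for line in split:
        year, temp = line.split(",")
        yield year, int(temp)


def max_temp_reduce(year, temps):
    """Reduce all temperatures seen for one year to their maximum."""
    return max(temps)


# Invented sample data: two splits of 'year,temperature' records.
splits = [["1949,111", "1950,0"], ["1949,78", "1950,22", "1950,-11"]]
result = run_mapreduce(splits, max_temp_map, max_temp_reduce)
```

Because every value for a given key lands in the same bucket, each reducer can work without coordinating with the others, which is the independence the talk relies on.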
  • 17. Examples of Map and Reduce. Let's start with a few examples of Map:
      - Word Count: read in a stream of text (e.g. a document or a set of documents) and emit each word as a key with a value of 1
      - Inverted Index: read in a stream of documents and emit each word as a key and the document ID as the value
      - Max Temperature: read in formatted data and emit the year as a key with the temperature as the value
      - Mean Rain Precipitation: read in daily data and emit (year-month, lat, long) as a key with the precipitation as the value
      Reduce in these cases simply applies a count, list, max, or average, respectively, to the set of values for each key. [Dean and Ghemawat, 2008; Wikipedia, 2010; White, 2011]
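As a rough sketch of the first two examples, the map functions below emit (word, 1) and (word, document ID) pairs, and a small `shuffle` helper groups values by key the way the sort between map and reduce would. The function names and the two sample documents are assumptions made for illustration only.

```python
from itertools import groupby
from operator import itemgetter


def word_count_map(doc_id, text):
    """Word Count map: emit each word as a key with a value of 1."""
    for word in text.lower().split():
        yield word, 1


def inverted_index_map(doc_id, text):
    """Inverted Index map: emit each word as a key with the document ID as the value."""
    for word in text.lower().split():
        yield word, doc_id


def shuffle(pairs):
    """Group values by key, as the sort between map and reduce would."""
    ordered = sorted(pairs, key=itemgetter(0))
    for key, group in groupby(ordered, key=itemgetter(0)):
        yield key, [value for _, value in group]


# Invented sample documents.
docs = {"d1": "the quick fox", "d2": "the lazy dog"}

wc_pairs = [p for doc_id, text in docs.items() for p in word_count_map(doc_id, text)]
ix_pairs = [p for doc_id, text in docs.items() for p in inverted_index_map(doc_id, text)]

# Reduce for Word Count: sum the 1s, giving a count per word.
counts = {word: sum(ones) for word, ones in shuffle(wc_pairs)}
# Reduce for Inverted Index: the list of documents containing each word.
index = {word: sorted(set(ids)) for word, ids in shuffle(ix_pairs)}
```

The two jobs share the same shuffle step and differ only in what the map emits and how the reduce folds the values, which is the point of thinking in terms of MapReduce.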
  • 18. Visualizing Word Count [figure; source: Chris Wensel, http://www.cascading.org]
  • 20. Engineering Intermezzo. This is how easy it is to get Hadoop installed, given that you have Java 6 installed already. Get Hadoop: http://hadoop.apache.org/
      % tar xzf hadoop-x.y.z.tar.gz
      % export HADOOP_INSTALL=BUILD_DIR/hadoop-x.y.z
      % export PATH=$PATH:$HADOOP_INSTALL/bin
  • 21. MapReduce with Hadoop and the streaming library. Now, let's take a closer look at how Hadoop implements MapReduce, from [White, 2011].
  • 22. Hadoop Streaming Library. We'll focus on the streaming library as it's the most natural for scientific or technical computing. Let's look at the Definitive Guide's weather example.
  • 24. Hadoop Book Examples. More examples from Hadoop: The Definitive Guide, 2nd Edition (Hadoop 20.1), http://www.hadoopbook.com/. Here's how to install and try them for yourself:
      - Install Git: http://git-scm.com/
      - Visit GitHub for the book code: http://github.com/tomwhite/hadoop-book/
      - Check out the code examples from The Definitive Guide:
      % cd BUILD_DIR
      % git clone http://github.com/tomwhite/hadoop-book.git hadoop-book
  • 25. Example: ECA Mean Precipitation. Let's compute mean precipitation at over 2,000 weather stations and make some graphics. There are 2,186 files with a median of 21,875 lines each, a minimum of 1,025 and a maximum of 78,090.
      ECA Daily Data: "The ECA dataset contains series of daily observations at meteorological stations throughout Europe and the Mediterranean. Part of the dataset is freely available for non-commercial research. To download this public data select one of the options below. Note that a gridded version with daily temperature and precipitation fields is also available." (source: http://eca.knmi.nl/dailydata/index.php)
      File Format
      FILE FORMAT (MISSING VALUE CODE = -9999):
      01-06 STAID : Station identifier
      08-13 SOUID : Source identifier
      15-22 DATE  : Date YYYYMMDD
      24-28 RR    : Precipitation amount in 0.1 mm
      30-34 Q_RR  : quality code for RR (0='valid'; 1='suspect'; 9='missing')
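To make the file format concrete, here is a minimal sketch of parsing one record. It assumes the fields are comma-separated (as the mapper later in the talk also assumes); `parse_rr_record` is a hypothetical helper, and the sample lines are invented to match the column layout above rather than taken from the dataset.

```python
MISSING = -9999  # missing value code from the file format header


def parse_rr_record(line):
    """Parse one comma-separated ECA precipitation record.

    Returns (staid, date, precip_mm), or None if the line is not a data
    record, the value is missing, or the quality code is not 0 ('valid').
    """
    fields = [field.strip() for field in line.split(",")]
    if len(fields) != 5:
        return None
    staid, souid, date, rr, q_rr = fields
    if not (date.isdigit() and len(date) == 8):
        return None  # header or malformed line
    try:
        rr_tenths = int(rr)
    except ValueError:
        return None
    if q_rr != "0" or rr_tenths == MISSING:
        return None
    return staid, date, rr_tenths / 10.0  # RR is stored in units of 0.1 mm


# Invented sample lines following the column layout above.
valid = parse_rr_record("     1, 35381,20090115,   12,    0")
missing = parse_rr_record("     1, 35381,20090116,-9999,    9")
```

Converting from tenths of a millimetre at parse time keeps the unit handling in one place; the talk's own scripts pass the raw RR value through instead.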
  • 26. Example: ECA Mean Precipitation. Scientific Data Mining: use the Hadoop streaming library and manually pipeline MapReduce jobs together as needed.
      - Write Hadoop scripts in Python in two steps
      - Test with: cat data | map.py | sort | reduce.py > output (not shown)
      - Process the data into individual files for each time period (year/month) of interest using the Hadoop streaming library (local mode)
      - Call R in batch mode to produce image files
  • 27. ECA Mean Precipitation: Step One
      map_one.py
      import sys

      # BEGIN_DATE, END_DATE and latlons (station id -> (lat, lon)) are not shown on the slide

      def lat_lon_to_coord(s):
          # convert "DD:MM:SS" to signed decimal degrees
          d, m, s = map(float, s.split(":"))
          sign = -1 if d < 0 else 1
          x = abs(d) + m / 60.0 + s / 3600.0
          return float(sign * x)

      for line in sys.stdin:
          # flds = (staid, souid, date, rr, q_rr)
          flds = line.strip().split(",")
          if len(flds) != 5:
              continue
          staid = flds[0].strip()  # station id
          date = flds[2].strip()   # YYYYMMDD
          if date < BEGIN_DATE or date > END_DATE:
              continue
          rr = flds[3].strip()     # precipitation in 0.1 mm
          q_rr = flds[4].strip()   # quality code "0" = valid
          lat, lon = latlons.get(staid, (None, None))
          if q_rr == '0' and (lat is not None) and (lon is not None):
              # key = "yyyymm,lat,lon", value = precipitation
              print("%s,%.4f,%.4f\t%s" % (date[0:6], lat, lon, rr))
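One detail of the coordinate conversion is worth a small worked example. Taking the sign from the parsed degrees loses it when the degrees field is zero, because `float("-00")` is `-0.0`, which does not compare less than zero. The sketch below is a hypothetical variant (`dms_to_decimal` is not from the talk) that reads the sign from the text instead, assuming coordinates arrive as `[+-]DD:MM:SS` strings.

```python
def dms_to_decimal(dms):
    """Convert a '[+-]DD:MM:SS' string to signed decimal degrees."""
    text = dms.strip()
    # Take the sign from the text so that '-00:30:00' stays negative.
    sign = -1.0 if text.startswith("-") else 1.0
    degrees, minutes, seconds = (abs(float(part)) for part in text.split(":"))
    return sign * (degrees + minutes / 60.0 + seconds / 3600.0)


north = dms_to_decimal("+56:52:00")        # 56 degrees 52 minutes north
just_west = dms_to_decimal("-000:30:00")   # half a degree west of Greenwich
```

This only matters for stations within one degree of the equator or the prime meridian, but in that band the difference decides which side of the line a station is plotted on.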
  • 28. ECA Mean Precipitation: Step One (cont)
      reduce_one.py
      import sys

      (last_key, x, n) = (None, 0.0, 0)
      for line in sys.stdin:
          (key, val) = line.strip().split("\t")
          if last_key and last_key != key:
              # time to emit reduced value
              if n > 0:
                  print("%s\t%.2f" % (last_key, x / n))
              x = 0.0
              n = 0
          # we just want data for the year 2009 (date filtering happens in map_one.py)
          (last_key, x, n) = (key, x + float(val), n + 1)
      # emit the mean for the final key
      if last_key:
          if n > 0:
              print("%s\t%.2f" % (last_key, x / n))
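The same per-key mean can also be written with `itertools.groupby`, which relies on the same guarantee the reducer above uses: lines arrive sorted, so equal keys are adjacent. This is an alternative sketch, not the talk's script; `mean_by_key` and the sample lines are made up for illustration.

```python
from itertools import groupby


def mean_by_key(lines):
    """Yield 'key<TAB>mean' for each run of equal keys in sorted 'key<TAB>value' lines."""
    pairs = (line.rstrip("\n").split("\t", 1) for line in lines if line.strip())
    for key, group in groupby(pairs, key=lambda kv: kv[0]):
        values = [float(value) for _, value in group]
        yield "%s\t%.2f" % (key, sum(values) / len(values))


# Invented sample of sorted mapper output.
sample = [
    "200901,52.1000,4.3000\t10\n",
    "200901,52.1000,4.3000\t30\n",
    "200902,52.1000,4.3000\t5\n",
]
out = list(mean_by_key(sample))
```

In a streaming script the generator would be fed `sys.stdin` and each yielded row printed. It holds one key's values in memory at a time, whereas the running sum and count in the original need only constant memory.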
  • 29. ECA Mean Precipitation: Step Two
      Map: ((yyyymm,lat,lon), mean) -> (yyyymm, (lat,lon,mean))
      map_two.py
      import sys

      for line in sys.stdin:
          yyyymm_lat_lon, mean_precip = line.strip().split("\t")
          yyyymm, lat, lon = yyyymm_lat_lon.strip().split(",")
          print("%s\t%s %s %s" % (yyyymm, lat, lon, mean_precip))

      The reduce is essentially empty: it just writes each key's values to a local file (a hack, since we're running locally).
      reduce_two.py
      import sys

      # write_file(key, values) is not shown on the slide
      last_key = None
      values = []
      for line in sys.stdin:
          (key, val) = line.strip().split("\t")
          if last_key and last_key != key:
              # time to emit reduced value
              write_file(last_key, values)
              values = []
          last_key = key
          values.append(val)  # a string with three values: "lat lon mean"
      if last_key:
          write_file(last_key, values)
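The slide does not show `write_file`, so the following is only a guess at its shape: a hypothetical helper that writes one "lat lon mean" record per line to a file named after the year-month key. The file naming scheme and the `out_dir` parameter are assumptions, not taken from the talk.

```python
import os


def write_file(key, values, out_dir="."):
    """Write one 'lat lon mean' record per line to <out_dir>/precip_<key>.txt."""
    path = os.path.join(out_dir, "precip_%s.txt" % key)
    with open(path, "w") as handle:
        for value in values:
            handle.write(value + "\n")
    return path
```

One whitespace-separated file per month is convenient for the next stage, since a batch R script can read it as a table. On a real cluster the reducer would normally emit to its standard output and let Hadoop write the result, which is why the slide calls this a hack.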
  • 31. Example: ECA Mean Precipitation
    Step One: input -> (yyyymm,lat,lon), mean precip

        % hadoop jar /Users/gordon/build/hadoop/contrib/streaming/hadoop-0.20.2-streaming.jar \
            -input /Downloads/tarragona/ECA_blend_all_data/RR_STAID00* \
            -output output \
            -mapper /Desktop/tmp/tarragona/python/map_one.py \
            -reducer /Desktop/tmp/tarragona/python/reduce_one.py
        % 10/10/25 19:34:31 INFO jvm.JvmMetrics: Initializing JVM Metrics with processName=JobTracker, sessionId=
        ...
        % 10/10/25 20:07:36 INFO streaming.StreamJob: Job complete: job_local_0001
        % 10/10/25 20:07:36 INFO streaming.StreamJob: Output: output

    Step Two: (yyyymm,lat,lon), mean precip -> files (yyyymm)

        % hadoop jar /Users/gordon/build/hadoop/contrib/streaming/hadoop-0.20.2-streaming.jar \
            -input output/part-00000 \
            -output output_two \
            -mapper /Desktop/tmp/tarragona/python/map_two.py \
            -reducer /Desktop/tmp/tarragona/python/reduce_two.py
        % 10/10/25 20:41:43 INFO jvm.JvmMetrics: Initializing JVM Metrics with processName=JobTracker, sessionId=
        ...
        % 10/10/25 20:41:47 INFO streaming.StreamJob: Job complete: job_local_0001
        % 10/10/25 20:41:47 INFO streaming.StreamJob: Output: output_two
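Each streaming job above follows the same flow: the mapper sees every input line, Hadoop sorts the mapper output by key, and the reducer reads the sorted lines. A rough way to check script logic on a small sample before submitting a job is to imitate that flow in-process. The sketch below does so; run_local, month_mapper and mean_reducer are hypothetical names, and the mapper is a simplified stand-in for map_one.py that takes "date,rr" lines rather than the full ECA record.

```python
def run_local(lines, mapper, reducer):
    """Imitate a single streaming job in-process: map, sort by key, reduce.

    mapper:  function(line) -> iterable of "key<TAB>value" strings
    reducer: function(sorted_lines) -> iterable of output strings
    """
    mapped = [out for line in lines for out in mapper(line)]
    mapped.sort(key=lambda kv: kv.split("\t", 1)[0])
    return list(reducer(mapped))


def month_mapper(line):
    # Simplified stand-in for map_one.py: "date,rr" -> "yyyymm<TAB>rr"
    date, rr = line.strip().split(",")
    yield "%s\t%s" % (date[0:6], rr)


def mean_reducer(sorted_lines):
    # Same last_key pattern as reduce_one.py, written as a generator
    last_key, total, n = None, 0.0, 0
    for line in sorted_lines:
        key, val = line.split("\t")
        if last_key is not None and key != last_key:
            yield "%s\t%.2f" % (last_key, total / n)
            total, n = 0.0, 0
        last_key, total, n = key, total + float(val), n + 1
    if last_key is not None:
        yield "%s\t%.2f" % (last_key, total / n)
```

For example, run_local(["20090201,30", "20090101,10", "20090102,20"], month_mapper, mean_reducer) groups the two January readings together even though they arrive after the February one, because of the sort step between map and reduce.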
  • 33. Batch Processing in R
    And, after a little batch processing with R...
    batch-graphics.R

        library(fields)
        files <- c("200901.dat", "200902.dat", "200903.dat", "200904.dat",
                   "200905.dat", "200906.dat", "200907.dat", "200908.dat",
                   "200909.dat", "200910.dat", "200911.dat", "200912.dat")
        i <- 1
        for (f in files) {
          mat <- read.table(f)
          names(mat) <- c("lat", "long", "precip")
          png(filename=paste("precip-", i, ".png", sep=""), height=480, width=480)
          quilt.plot(mat$long, mat$lat, mat$precip, ncol=100, nrow=100,
                     ylim=c(22.0, 79.0), xlim=c(-52.0, 72.0),
                     col=two.colors(256, start="wheat", end="darkblue", middle="blue"),
                     zlim=c(0, 410), add.legend=T, cex.lab=0.6)
          points(1.2453, 41.1187, pch=1)
          text(1.2453, 41.1187, "tarragona", cex=0.8, pos=1)
          points(2.35083, 48.89, pch=1)
          text(2.3508, 48.89, "paris", cex=0.8, pos=4)
          points(12.4823, 41.8955, pch=1)
          text(12.4823, 41.8955, "rome", cex=0.8, pos=4)
          dev.off()
          i <- i + 1
        }
  • 34 to 45. ECA Precipitation 2009, Month: 1 through Month: 12 (one precipitation map per slide; figures not reproduced in this transcript)
  • 46. Summary of What We Did
    We worked through a complete example, but that’s not all: with very little additional work we can...
      - Test the scripts in pseudo-distributed mode locally on our own machine
      - Run the job remotely on a compute cluster
      - Run the job in the cloud on EC2, treating their system as just another remote cluster
      - Run the job with Amazon’s Elastic MapReduce (http://aws.amazon.com/elasticmapreduce/), which lets you pay for exactly as much computing as you use
    See [White, 2011] for complete details on how to run in these different modes.
  • 48. Systems Development APIs
    And you can build production systems with Hadoop in either Java or C++...
      - Full-featured Java API for Hadoop
      - Pipes is the C++ API for Hadoop MapReduce
      - Cascading is an API for developing general data processing systems that incorporate MapReduce (http://www.cascading.org)
  • 51. Cascading
    Cascading allows developers to model very complex data flows at a higher level than Map & Reduce, and then automatically generate and visualize the dozens or perhaps even hundreds of necessary Hadoop jobs as a graph.
  • 53. Ad Hoc Analysis
    What’s missing? Sometimes you need to do fast ad hoc queries... can we do that in a scalable way?
      - Pig: “Pig is a scripting language for exploring large datasets” [White, 2011] (Yahoo!)
      - Hive: provides an SQL interface for running ad hoc queries and other data processing tasks, aimed at SQL analysts (Facebook)
      - HBase: column-oriented database along the lines of Google’s Bigtable database (Powerset)
      - Hypertable: GPL clone of Google’s Bigtable database, written in C++ (Zvents)
    Google’s Bigtable database is described in [Chang, Dean, Ghemawat, et al., 2006]
  • 58. Interesting Application Frameworks with Hadoop
    Here are a few examples of frameworks, in development or already available, that use Hadoop as a platform...
      - Apache Mahout: ambitious project to implement popular machine learning algorithms and recommenders with Hadoop [3]
      - Graph: Jake Hofman from Yahoo Research has released some of his work on large-scale network analysis with Hadoop, with prototype code [4]. Also see [Vassilvitskii, 2010] for related graph analysis research.
      - Application to GIS: Nathan Kerr’s M.S. thesis, with lots of details on how to do GIS with Hadoop [5]
    [3] http://mahout.apache.org/
    [4] http://github.com/jhofman/icwsm2010_tutorial
    [5] http://www.nathankerr.com/projects/parallel-gis-processing/alternative_approaches_to_parallel_gis_processing.html
  • 60. Further Reading
      - White, T. Hadoop: The Definitive Guide, 2nd Edition. O’Reilly Media, Inc., Sebastopol, CA, 2011.
      - Sanderson, D. Programming Google App Engine. O’Reilly Media, Inc., Sebastopol, CA, 2009.
      - Murty, J. Programming Amazon Web Services. O’Reilly Media, Inc., Sebastopol, CA, 2008.
      - Dean, J. and Ghemawat, S. MapReduce: simplified data processing on large clusters. Communications of the ACM, 51(1):107–113, 2008.
      - Chang, F., Dean, J., Ghemawat, S., Hsieh, W. C., Wallach, D. A., Burrows, M., Chandra, T., Fikes, A., and Gruber, R. E. Bigtable: a distributed storage system for structured data. OSDI ’06: Proceedings of the 7th USENIX Symposium on Operating Systems Design and Implementation, USENIX Assoc., Berkeley, CA, 2006.
      - MapReduce on Wikipedia. http://en.wikipedia.org/wiki/MapReduce
      - Vassilvitskii, S. XXL Graph Algorithms. Hadoop Summit 2010. http://developer.yahoo.com/events/hadoopsummit2010/