MapReduce Intro




                  The MapReduce Programming Model

                         Introduction and Examples

                       Dr. Jose María Alvarez-Rodríguez

            “Quality Management in Service-based Systems and Cloud
                                Applications”

                               FP7 RELATE-ITN

                         South East European Research Center


                        Thessaloniki, 10th of April, 2013





      1   MapReduce in a nutshell

      2   Thinking in MapReduce

      3   Applying MapReduce

      4   Success Stories with MapReduce

      5   Summary and Conclusions




  MapReduce in a nutshell



 Features




      A programming model...
         1   Large-scale distributed data processing
         2   Simple but restricted
         3   Parallel programming
         4   Extensible







 Antecedents

      Functional programming
         1   Inspired
         2   ...but not equivalent

      Example in Python
      “Given a list of numbers between 1 and 50 print only even
      numbers”
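                  # Python 3: filter returns an iterator, so build a list before printing.
                  # range(1, 51) covers 1..50 inclusive.
                  print(list(filter(lambda x: x % 2 == 0, range(1, 51))))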

             A list of numbers (data)
             A condition (even numbers)
             A function filter that is applied to the list (map)





 ...Other examples...

      Example in Python
      “Return the sum of the squares of a list of numbers between 1 and
      50”
                  from functools import reduce  # reduce is not a builtin in Python 3
                  import operator
                  # range(1, 51) covers 1..50 inclusive.
                  reduce(operator.add, map(lambda x: x ** 2, range(1, 51)), 0)



             “reduce” is equivalent to “foldl” in other functional languages
             such as Haskell
             The mathematical properties of the operator (e.g. associativity)
             should also be taken into account...
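
      The remark about the kind of operator can be made concrete. The sketch below is only an illustration: chunked_reduce is a hypothetical helper (not part of any framework) that reduces each chunk on its own and then reduces the partial results, which is roughly what happens when a reduction is split across workers. With an associative operator such as addition the result matches the sequential fold; with subtraction it does not.

```python
from functools import reduce
import operator


def chunked_reduce(op, values, chunk_size):
    """Reduce each chunk independently, then reduce the partial results."""
    chunks = [values[i:i + chunk_size] for i in range(0, len(values), chunk_size)]
    partials = [reduce(op, chunk) for chunk in chunks]
    return reduce(op, partials)


squares = [x ** 2 for x in range(1, 51)]

# Sequential left fold, as in the slide example.
sequential_sum = reduce(operator.add, squares, 0)

# Addition is associative: splitting the work does not change the result.
chunked_sum = chunked_reduce(operator.add, squares, 7)

# Subtraction is not associative: the grouping changes the result.
small = [1, 2, 3, 4, 5, 6]
sequential_sub = reduce(operator.sub, small)
chunked_sub = chunked_reduce(operator.sub, small, 3)

print(sequential_sum, chunked_sum)
print(sequential_sub, chunked_sub)
```

      This is one reason why reduce functions in MapReduce are usually built from associative (and often commutative) operations.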





 Some interesting points...



      The MapReduce framework...
         1   Inspired by functional programming concepts (but not
             equivalent)
         2   Problems that can be parallelized
         3   Sometimes recursive solutions
         4   ...







 Basic Model




      “MapReduce: The Programming Model and Practice”, SIGMETRICS Tutorials 2009, Google.







 Map Function




      Figure: Mapping creates a new output list by applying a function to
      individual elements of an input list.


      “Module 4: MapReduce”, Hadoop Tutorial, Yahoo!.







 Reduce Function




      Figure: Reducing a list iterates over the input values to produce an
      aggregate value as output.


      “Module 4: MapReduce”, Hadoop Tutorial, Yahoo!.






 MapReduce Flow




                              Figure: High-level MapReduce pipeline.


      “Module 4: MapReduce”, Hadoop Tutorial, Yahoo!.
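
      The pipeline can be imitated in a few lines of single-process Python. This is a minimal sketch under stated assumptions, not how Hadoop is implemented: run_mapreduce, letter_mapper and sum_reducer are hypothetical names, everything runs sequentially in memory, and the example counts letters.

```python
from collections import defaultdict


def run_mapreduce(records, mapper, reducer):
    """Run the map, shuffle and reduce stages sequentially over a list."""
    # Map: each input record yields zero or more (key, value) pairs.
    intermediate = []
    for record in records:
        intermediate.extend(mapper(record))

    # Shuffle: group all values that share the same key.
    groups = defaultdict(list)
    for key, value in intermediate:
        groups[key].append(value)

    # Reduce: one call per key, visiting keys in sorted order.
    return {key: reducer(key, groups[key]) for key in sorted(groups)}


def letter_mapper(line):
    """Emit (letter, 1) for every alphabetic character in the line."""
    return [(ch, 1) for ch in line.lower() if ch.isalpha()]


def sum_reducer(key, values):
    """Add up all the counts received for one key."""
    return sum(values)


letter_counts = run_mapreduce(["ab", "Ba c"], letter_mapper, sum_reducer)
print(letter_counts)
```

      In a real framework the map and reduce calls run on different machines and the shuffle moves data over the network; only the two user functions stay the same.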




 MapReduce Flow




                       Figure: Detailed Hadoop MapReduce data flow.



 Tip




      What is MapReduce?
      It is a framework inspired by functional programming to tackle
      problems whose steps can be parallelized by applying a
      divide-and-conquer approach.




  Thinking in MapReduce



 When should I use MapReduce?
      Query
              Index and Search: inverted index
              Filtering
              Classification
              Recommendations: clustering or collaborative filtering


      Analytics
              Summarization and statistics
              Sorting and merging
              Frequency distribution
              SQL-based queries: group-by, having, etc.
              Generation of graphics: histograms, scatter plots.


      Others
      Message passing, such as breadth-first search or PageRank algorithms.
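
      Message passing can be sketched as repeated map/reduce rounds. The code below is an assumption-laden illustration of breadth-first search, not a Hadoop job: bfs_map, bfs_reduce, bfs_iteration and bfs are hypothetical names, the graph is an in-memory dictionary, and each loop iteration stands for one MapReduce job. In every round a node sends "distance + 1" to its neighbours, and the reducer keeps the smallest distance seen.

```python
from collections import defaultdict

INF = float("inf")


def bfs_map(node, state):
    """Re-emit the node's structure and send a tentative distance to neighbours."""
    distance, neighbours = state
    yield node, ("graph", neighbours)
    yield node, ("dist", distance)
    if distance != INF:
        for neighbour in neighbours:
            yield neighbour, ("dist", distance + 1)


def bfs_reduce(node, messages):
    """Keep the adjacency list and the smallest distance received."""
    neighbours = []
    best = INF
    for kind, payload in messages:
        if kind == "graph":
            neighbours = payload
        else:
            best = min(best, payload)
    return best, neighbours


def bfs_iteration(graph):
    """One map/shuffle/reduce round over the whole graph."""
    groups = defaultdict(list)
    for node, state in graph.items():
        for key, message in bfs_map(node, state):
            groups[key].append(message)
    return {node: bfs_reduce(node, msgs) for node, msgs in groups.items()}


def bfs(adjacency, source):
    """Iterate until no distance changes; return the distance of every node."""
    graph = {n: (0 if n == source else INF, nbrs) for n, nbrs in adjacency.items()}
    while True:
        updated = bfs_iteration(graph)
        if updated == graph:
            return {node: dist for node, (dist, _) in updated.items()}
        graph = updated


example = {"a": ["b", "c"], "b": ["d"], "c": ["d"], "d": [], "e": ["a"]}
distances = bfs(example, "a")
print(distances)
```

      Unreachable nodes (here "e") keep an infinite distance. The number of rounds grows with the diameter of the graph, which is why iterative algorithms are a comparatively costly fit for MapReduce.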




 How Google uses MapReduce (80% of data processing)



             Large-scale web search indexing
             Clustering problems for Google News
             Produce reports for popular queries, e.g. Google Trends
             Processing of satellite imagery data
             Language model processing for statistical machine translation
             Large-scale machine learning problems
             ...







 Comparison of MapReduce and other approaches




      “MapReduce: The Programming Model and Practice”, SIGMETRICS Tutorials 2009, Google.





 Evaluation of MapReduce and other approaches




      “MapReduce: The Programming Model and Practice”, SIGMETRICS Tutorials 2009, Google.







 Apache Hadoop



   MapReduce definition
   The Apache Hadoop software library is a framework that allows for
   the distributed processing of large data sets across clusters of
   computers using simple programming models.

   Figure: Apache Hadoop Logo.







 Tip



      What can I do in MapReduce?
      Three main functions:
         1   Querying
         2   Summarizing
         3   Analyzing
      ...large datasets in off-line mode to boost other on-line
      processes.




  Applying MapReduce



 MapReduce in Action

      MapReduce Patterns
         1   Summarization
         2   Filtering
         3   Data Organization (sort, merging, etc.)
         4   Relational-based (join, selection, projection, etc.)
         5   Iterative Message Passing (graph processing)
         6   Others (depending on the implementation):
                   Simulation of distributed systems
                   Cross-correlation
                   Metapatterns
                   Input-output
                   ...




 Overview (stages)-Counting Letters







 Summarization




      Types
         1   Numerical summarizations
         2   Inverted index
         3   Counting and counters







 Numerical Summarization-I



      Description
      A general pattern for calculating aggregate statistical values over
      your data.

      Intent
      Group records together by a key field and calculate a numerical
      aggregate per group to get a top-level view of the larger data set.







 Numerical Summarization-II


      Applicability
          To deal with numerical data or counting.
          To group data by specific fields.

      Examples

          1   Word count
          2   Record count
          3   Min/Max/Count
          4   Average/Median/Standard deviation
          5   ...







 Numerical Summarization-Pseudocode


      class Mapper
         method Map(recordid id, record r)
            for all term t in record r do
               Emit(term t, count 1)

      class Reducer
         method Reduce(term t, counts [c1, c2,...])
            sum = 0
            for all count c in [c1, c2,...] do
               sum = sum + c
            Emit(term t, count sum)
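
      The pseudocode above translates almost line by line into Python. This is a sketch for experimentation only: word_map, word_reduce and word_count are hypothetical names, terms are assumed to be whitespace-separated, and the grouping step that the framework normally performs is done with a dictionary.

```python
from collections import defaultdict


def word_map(record_id, record):
    """Map: emit (term, 1) for every term in the record."""
    for term in record.split():
        yield term, 1


def word_reduce(term, counts):
    """Reduce: add up the counts emitted for one term."""
    total = 0
    for count in counts:
        total += count
    return term, total


def word_count(records):
    """Group the mapper output by term, then reduce each group."""
    groups = defaultdict(list)
    for record_id, record in enumerate(records):
        for term, count in word_map(record_id, record):
            groups[term].append(count)
    return dict(word_reduce(term, counts) for term, counts in groups.items())


counts = word_count(["to be or", "not to be"])
print(counts)
```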





 Overview-Word Counter







 Numerical Summarization-Word Counter

                  // Mapper: emit (word, 1) for every token in the input line.
                  // "word" (Text) and "one" (IntWritable with value 1) are fields of the Mapper class.
                  public void map(LongWritable key, Text value, Context context)
                          throws IOException, InterruptedException {
                      String line = value.toString();
                      StringTokenizer tokenizer = new StringTokenizer(line);
                      while (tokenizer.hasMoreTokens()) {
                          word.set(tokenizer.nextToken());
                          context.write(word, one);
                      }
                  }

                  // Reducer: sum the counts emitted for each word.
                  public void reduce(Text key, Iterable<IntWritable> values, Context context)
                          throws IOException, InterruptedException {
                      int sum = 0;
                      for (IntWritable val : values) {
                          sum += val.get();
                      }
                      context.write(key, new IntWritable(sum));
                  }






 Example-II




      Min/Max
      Given a list of tweets (username, date, text) determine the first and
      last time a user commented and the total number of comments.

      Implementation

      See https://github.com/chemaar/seqos/tree/master/prototypes/mapreduce-intro







 Overview - Min/Max




      ∗ Min and max creation date are the same in the map phase.



 Example II-Min/Max, function Map


                  // Mapper: emit (userId, tuple) with min = max = creation date and count = 1.
                  public void map(Object key, Text value, Context context)
                          throws IOException, InterruptedException {
                      Map<String, String> parsed = MRDPUtils.parse(value.toString());
                      String strDate = parsed.get(MRDPUtils.CREATION_DATE);
                      String userId = parsed.get(MRDPUtils.USER_ID);
                      if (strDate == null || userId == null) {
                          return; // skip records without a date or a user
                      }
                      Date creationDate;
                      try {
                          creationDate = MRDPUtils.frmt.parse(strDate);
                      } catch (ParseException e) {
                          return; // skip records with a malformed date
                      }
                      outTuple.setMin(creationDate);
                      outTuple.setMax(creationDate);
                      outTuple.setCount(1);
                      outUserId.set(userId);
                      context.write(outUserId, outTuple);
                  }







 Example II-Min/Max, function Reduce

                  // Reducer: merge the tuples of one user into the overall min, max and count.
                  public void reduce(Text key, Iterable<MinMaxCountTuple> values, Context context)
                          throws IOException, InterruptedException {
                      result.setMin(null);
                      result.setMax(null);
                      int sum = 0;
                      for (MinMaxCountTuple val : values) {
                          // Keep the earliest date seen so far.
                          if (result.getMin() == null
                                  || val.getMin().compareTo(result.getMin()) < 0) {
                              result.setMin(val.getMin());
                          }
                          // Keep the latest date seen so far.
                          if (result.getMax() == null
                                  || val.getMax().compareTo(result.getMax()) > 0) {
                              result.setMax(val.getMax());
                          }
                          sum += val.getCount();
                      }
                      result.setCount(sum);
                      context.write(key, result);
                  }






 Example-III




      Average
      Given a list of tweets (username, date, text) determine the average
      comment length per hour of day.

      Implementation

      See https://github.com/chemaar/seqos/tree/master/prototypes/mapreduce-intro
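
      An average cannot be merged on its own: the average of partial averages is wrong when the groups differ in size. The sketch below illustrates why the count travels with the average; merge_count_average is a hypothetical helper and the comment lengths are made-up sample values.

```python
def merge_count_average(partials):
    """Merge (count, average) pairs into a single (count, average) pair."""
    total = 0.0
    count = 0
    for partial_count, partial_average in partials:
        # Weight each average by the number of items it summarizes.
        total += partial_count * partial_average
        count += partial_count
    return count, total / count


# Made-up comment lengths for one hour of the day, split across two workers.
worker_a = merge_count_average([(1, 10), (1, 20), (1, 30)])
worker_b = merge_count_average([(1, 100)])

# Correct: merging (count, average) pairs weights each side properly.
merged = merge_count_average([worker_a, worker_b])

# Incorrect: averaging the averages ignores that worker_a saw three items.
naive = (worker_a[1] + worker_b[1]) / 2

print(merged, naive)
```

      Because the merge step takes and returns the same kind of pair, the same function could in principle serve as both combiner and reducer.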







 Overview - Average







 Example III-Average, function Map


                  // Mapper: emit (hour of day, tuple) with count = 1 and average = comment length.
                  public void map(Object key, Text value, Context context)
                          throws IOException, InterruptedException {
                      Map<String, String> parsed = MRDPUtils.parse(value.toString());
                      String strDate = parsed.get(MRDPUtils.CREATION_DATE);
                      String text = parsed.get(MRDPUtils.TEXT);
                      if (strDate == null || text == null) {
                          return; // skip records without a date or a text
                      }
                      Date creationDate;
                      try {
                          creationDate = MRDPUtils.frmt.parse(strDate);
                      } catch (ParseException e) {
                          return; // skip records with a malformed date
                      }
                      // Date.getHours() is deprecated but still available.
                      outHour.set(creationDate.getHours());
                      outCountAverage.setCount(1);
                      outCountAverage.setAverage(text.length());
                      context.write(outHour, outCountAverage);
                  }







 Example III-Average, function Reduce


                  // Reducer: weight each partial average by its count, so the result stays
                  // correct even when some tuples have already been merged.
                  public void reduce(IntWritable key, Iterable<CountAverageTuple> values,
                          Context context) throws IOException, InterruptedException {
                      float sum = 0;
                      float count = 0;
                      for (CountAverageTuple val : values) {
                          sum += val.getCount() * val.getAverage();
                          count += val.getCount();
                      }
                      result.setCount(count);
                      result.setAverage(sum / count);
                      context.write(key, result);
                  }







 Numerical Summarization-Other approaches

      Relation to SQL
                  SELECT MIN(numcol1), MAX(numcol1), COUNT(*)
                  FROM table GROUP BY groupcol2;

      Implementation in Pig

                  b = GROUP a BY groupcol2;
                  c = FOREACH b GENERATE group, MIN(a.numcol1),
                      MAX(a.numcol1), COUNT_STAR(a);







 Filtering




      Types
         1   Filtering
         2   Top N records
         3   Bloom filtering
         4   Distinct
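
      Bloom filtering is listed above but not developed later, so here is a small sketch of the idea. It is an illustration under stated assumptions, not the Hadoop BloomFilter class: BloomFilter and bloom_filter_map are hypothetical names, positions are derived from SHA-256 digests, and one byte per bit is used for readability. A Bloom filter can report false positives but never false negatives.

```python
import hashlib


class BloomFilter:
    """Compact set membership test: may say yes wrongly, never says no wrongly."""

    def __init__(self, num_bits=1024, num_hashes=3):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray(num_bits)  # one byte per bit, for readability

    def _positions(self, item):
        """Derive num_hashes bit positions from seeded SHA-256 digests."""
        for seed in range(self.num_hashes):
            digest = hashlib.sha256(f"{seed}:{item}".encode("utf-8")).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, item):
        for position in self._positions(item):
            self.bits[position] = 1

    def might_contain(self, item):
        return all(self.bits[position] for position in self._positions(item))


def bloom_filter_map(records, bloom):
    """Map-only filter: keep (user, text) records whose user may be in the set."""
    return [record for record in records if bloom.might_contain(record[0])]


hot_users = {"alice", "bob"}
bloom = BloomFilter()
for user in hot_users:
    bloom.add(user)

records = [("alice", "hello"), ("carol", "hi"), ("bob", "hey"), ("alice", "bye")]
kept = bloom_filter_map(records, bloom)
print(kept)
```

      The filter is built once from the set of interesting keys and shipped to every mapper, so each record is checked without a join or a lookup in an external store.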







 Filtering-I



      Description
      It evaluates each record separately and decides, based on some
      condition, whether it should stay or go.

      Intent
      Filter out records that are not of interest and keep ones that are.







 Filtering-II


      Applicability
      To collate data

      Examples

          1   Closer view of dataset
          2   Data cleansing
          3   Tracking a thread of events
          4   Simple random sampling
          5   Distributed Grep
          6   Removing low-scoring data
          7   Log Analysis
          8   Data Querying
          9   Data Validation
         10   ...







 Filtering-Pseudocode


      class Mapper
         method Map(recordid id, record r)
            field f = extract(r)
            if predicate (f)
               Emit(recordid id, value(r))

      class Reducer
         method Reduce(recordid id, values [r1, r2,...])
            // Optional: any aggregation over the filtered values
            Emit(recordid id, aggregate (values))
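
      Since the reducer is optional, filtering is often run as a map-only job. The following sketch mirrors the Mapper pseudocode in Python; filter_map, run_map_only and contains_word are hypothetical names, and the sample tweets are invented for the illustration.

```python
import re


def filter_map(record_id, record, predicate):
    """Map: emit the record only if the predicate accepts it."""
    if predicate(record):
        yield record_id, record


def run_map_only(records, predicate):
    """Apply the mapper to every record; there is no reduce stage."""
    output = []
    for record_id, record in enumerate(records):
        output.extend(filter_map(record_id, record, predicate))
    return output


def contains_word(word):
    """Build a predicate that matches the word as a whole word, ignoring case."""
    pattern = re.compile(r"\b" + re.escape(word) + r"\b", re.IGNORECASE)
    return lambda record: pattern.search(record) is not None


tweets = ["Hadoop is not a database", "I like hadooping", "Learning Hadoop today"]
matches = run_map_only(tweets, contains_word("hadoop"))
print(matches)
```

      Each record is judged independently of all others, which is what makes the pattern trivially parallel.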






 Example-IV




      Distributed Grep
      Given a list of tweets (username, date, text) find the tweets
      that contain a given word.

      Implementation

      See https://github.com/chemaar/seqos/tree/master/prototypes/mapreduce-intro







 Overview - Distributed Grep







 Example IV-Distributed Grep, function Map


                  // Map-only job: emit the records whose text contains the configured word.
                  public void map(Object key, Text value, Context context)
                          throws IOException, InterruptedException {
                      Map<String, String> parsed = MRDPUtils.parse(value.toString());
                      String txt = parsed.get(MRDPUtils.TEXT);
                      // Word-boundary pattern around the value stored under the "mapregex" key.
                      String mapRegex = ".*\\b" + context.getConfiguration().get("mapregex")
                              + "(.)*\\b.*";
                      if (txt != null && txt.matches(mapRegex)) {
                          context.write(NullWritable.get(), value);
                      }
                  }


      ...and the Reduce function?

      In this case it is not necessary: the map output values are written directly to the output.







 Example-V




      Top 5
      Given a list of tweets (username, date, text) determine the 5 users
      who wrote the longest tweets.

      Implementation

      See https://github.com/chemaar/seqos/tree/master/prototypes/mapreduce-intro
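
      The Top N pattern works in two steps: every mapper keeps only its local top N, and a single reducer merges those short lists. The sketch below shows the idea with a heap instead of a sorted map; top_n_map and top_n_reduce are hypothetical names and the (user, text) records are invented. Unlike a map keyed by tweet length, this version does not overwrite entries with equal lengths.

```python
import heapq


def top_n_map(records, n):
    """Each mapper keeps only its local top n (text length, user) pairs."""
    return heapq.nlargest(n, ((len(text), user) for user, text in records))


def top_n_reduce(local_tops, n):
    """A single reducer merges the local lists into the global top n."""
    merged = [pair for local_top in local_tops for pair in local_top]
    return heapq.nlargest(n, merged)


# Two input splits, each processed by its own mapper.
split_a = [("ann", "aaaa"), ("bob", "bb"), ("cat", "c")]
split_b = [("dan", "ddddd"), ("eve", "eee")]

local_a = top_n_map(split_a, 2)
local_b = top_n_map(split_b, 2)
global_top = top_n_reduce([local_a, local_b], 2)
print(global_top)
```

      Only N records per mapper cross the network, so the single reducer stays cheap as long as N is small.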







 Overview - Top 5







 Example V-Top 5, function Map

                  // Sorted map from tweet length to record; holds the local top MAX_TOP entries.
                  private TreeMap<Integer, Text> repToRecordMap = new TreeMap<Integer, Text>();

                  public void map(Object key, Text value, Context context)
                          throws IOException, InterruptedException {
                      Map<String, String> parsed = MRDPUtils.parse(value.toString());
                      if (parsed == null) { return; }
                      String userId = parsed.get(MRDPUtils.USER_ID);
                      String text = parsed.get(MRDPUtils.TEXT);
                      if (userId == null || text == null) { return; }
                      // Longer tweets rank higher: the tweet length acts as the score.
                      repToRecordMap.put(text.length(), new Text(value));
                      if (repToRecordMap.size() > MAX_TOP) {
                          // Drop the entry with the smallest key (the shortest tweet).
                          repToRecordMap.remove(repToRecordMap.firstKey());
                      }
                      // Nothing is written here; the local top entries are presumably
                      // emitted once per mapper elsewhere (not shown on this slide).
                  }






 Example V-Top 5, function Reduce


                  // Reducer: merge the local top lists and output the global top MAX_TOP,
                  // longest tweet first.
                  public void reduce(NullWritable key, Iterable<Text> values, Context context)
                          throws IOException, InterruptedException {
                      for (Text value : values) {
                          Map<String, String> parsed = MRDPUtils.parse(value.toString());
                          repToRecordMap.put(parsed.get(MRDPUtils.TEXT).length(),
                                  new Text(value));
                          if (repToRecordMap.size() > MAX_TOP) {
                              repToRecordMap.remove(repToRecordMap.firstKey());
                          }
                      }
                      for (Text t : repToRecordMap.descendingMap().values()) {
                          context.write(NullWritable.get(), t);
                      }
                  }







 Filtering-Other approaches


      Relation to SQL
                  SELECT * FROM table WHERE colvalue < VALUE;

      Implementation in Pig

                  b = FILTER a BY colvalue < VALUE;







 Tip




      How can I use and run a MapReduce framework?
      You should identify what kind of problem you are addressing and
      apply a design pattern to be implemented in a framework such
      as Apache Hadoop.




  Success Stories with MapReduce



 Tip



      Who is using MapReduce?
      Companies that deal with Big Data problems for analytics,
      such as:
             Cloudera
             Datasalt
             Elasticsearch
             ...







 Apache Hadoop-Related Projects







 More tips


      FAQ
             MapReduce is a framework based on a simple programming
             model
             ...to deal with large datasets in a distributed fashion
             ...scalability, replication, fault tolerance, etc.
             Apache Hadoop is not a database
             New frameworks on top of Hadoop for specific tasks:
             querying, analysis, etc.
             Other similar frameworks: Storm, Signal/Collect, etc.
             ...


  Summary and Conclusions



 Summary







 Conclusions


      What is MapReduce?

      It is a framework inspired by functional programming to tackle problems whose steps can be parallelized
      by applying a divide-and-conquer approach.


      What can I do in MapReduce?

      Three main functions:
          1   Querying
          2   Summarizing
          3   Analyzing
      ...large datasets in off-line mode to boost other on-line processes.


      How can I use and run a MapReduce framework?

      You should identify what kind of problem you are addressing and apply a design pattern to be implemented in a
      framework such as Apache Hadoop.







 What’s next?


      ...
             Chaining MapReduce jobs
             Optimization using combiners and parameter tuning
             (partition size, etc.)
             Pipelining with other languages such as Python
             Hadoop in Action: more examples, etc.
             New trending problems (image/video processing)
             Real-time processing
             ...
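
      As a hedged illustration of the combiner idea mentioned above, the sketch below pre-aggregates
      the output of each map task locally before it would be sent over the network. It is plain
      Python, not Hadoop code; the names map_words, combine, reduce_counts and SPLITS are invented
      for this example. A combiner like this is only safe when the reduce operation is associative
      and commutative, as addition is.

```python
from collections import Counter


def map_words(split):
    """Map: emit one (word, 1) pair per word in an input split."""
    return [(word, 1) for line in split for word in line.split()]


def combine(pairs):
    """Combiner: pre-aggregate the pairs of ONE map task before the shuffle."""
    local = Counter()
    for word, count in pairs:
        local[word] += count
    return list(local.items())


def reduce_counts(all_pairs):
    """Reduce: sum the (possibly pre-aggregated) counts per word."""
    totals = Counter()
    for word, count in all_pairs:
        totals[word] += count
    return dict(totals)


# Two hypothetical input splits, each handled by its own map task.
SPLITS = [["a b a", "a c"], ["b b a"]]

without_combiner = [p for s in SPLITS for p in map_words(s)]
with_combiner = [p for s in SPLITS for p in combine(map_words(s))]

if __name__ == "__main__":
    print(len(without_combiner), "pairs shuffled without a combiner")
    print(len(with_combiner), "pairs shuffled with a combiner")
    print(reduce_counts(with_combiner))
```

      Both variants give the same totals; the combiner only reduces how many intermediate pairs
      have to be shuffled.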



  References



               J. Dean and S. Ghemawat.
               MapReduce: simplified data processing on large clusters.
               Commun. ACM, 51(1):107–113, Jan. 2008.

               J. R. Owens, J. Lentz, and B. Femiano.
               Hadoop Real-World Solutions Cookbook.
               Packt Publishing Ltd, 2013.

               C. Lam.
               Hadoop in Action.
               Manning Publications Co., Greenwich, CT, USA, 1st edition,
               2010.

               J. Lin and C. Dyer.
               Data-intensive text processing with MapReduce.
               In Proceedings of Human Language Technologies: The 2009
               Annual Conference of the North American Chapter of the
               Association for Computational Linguistics, Companion
               Volume: Tutorial Abstracts, NAACL-Tutorials ’09, pages 1–2,
               Stroudsburg, PA, USA, 2009. Association for Computational
               Linguistics.

               D. Miner and A. Shook.
               MapReduce Design Patterns.
               O’Reilly Media, Inc., 2012.

               S. Perera and T. Gunarathne.
               Hadoop MapReduce Cookbook.
               Packt Publishing Ltd, 2013.

               T. White.
               Hadoop: The Definitive Guide.
               O’Reilly Media, Inc., 1st edition, 2009.

               I. H. Witten and E. Frank.
               Data Mining: Practical Machine Learning Tools and Techniques.
               Morgan Kaufmann Publishers Inc., San Francisco, CA, USA,
               2005.
