MapReduce Intro




                  The MapReduce Programming Model

                         Introduction and Examples

                       Dr. Jose María Alvarez-Rodríguez

            “Quality Management in Service-based Systems and Cloud
                                Applications”

                               FP7 RELATE-ITN

                         South East European Research Center


                        Thessaloniki, 10th of April, 2013





      1   MapReduce in a nutshell

      2   Thinking in MapReduce

      3   Applying MapReduce

      4   Success Stories with MapReduce

      5   Summary and Conclusions




  MapReduce in a nutshell



 Features




      A programming model...
         1   Large-scale distributed data processing
         2   Simple but restricted
         3   Parallel programming
         4   Extensible







 Antecedents

      Functional programming
         1   Inspired
         2   ...but not equivalent

      Example in Python
      “Given a list of numbers between 1 and 50 print only even
      numbers”
          # Python 3: filter returns an iterator, so wrap it in list() to print.
          print(list(filter(lambda x: x % 2 == 0, range(1, 51))))

             A list of numbers (data)
             A condition (even numbers)
             A function filter that is applied to the list (map)





 ...Other examples...

      Example in Python
      “Return the sum of the squares of a list of numbers between 1 and
      50”
          from functools import reduce
          import operator
          # Sum of the squares of 1..50 (range excludes its upper bound).
          reduce(operator.add, map(lambda x: x ** 2, range(1, 51)), 0)



             “reduce” is equivalent to “foldl” in other functional
             languages such as Haskell.
             Other mathematical considerations should be taken into
             account, such as the kind of operator (e.g. whether it is
             associative)...
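
      The remark about the kind of operator can be illustrated with a small sketch. It assumes a hypothetical helper `chunked` that splits the input into pieces, standing in for the splits of a distributed job. With an associative operator such as addition, reducing each chunk and then combining the partial results gives the same answer as one sequential fold; with subtraction it does not.

```python
from functools import reduce
import operator


def chunked(seq, size):
    """Split seq into consecutive chunks of at most `size` items (hypothetical helper)."""
    return [seq[i:i + size] for i in range(0, len(seq), size)]


squares = [x ** 2 for x in range(1, 51)]

# Sequential fold over the whole list, as in the slide example.
sequential = reduce(operator.add, squares, 0)

# Chunked fold: reduce every chunk on its own, then combine the partial sums.
partials = [reduce(operator.add, chunk, 0) for chunk in chunked(squares, 10)]
combined = reduce(operator.add, partials, 0)

# Subtraction is not associative, so the chunked version disagrees.
seq_sub = reduce(operator.sub, squares, 0)
par_sub = reduce(
    operator.sub,
    [reduce(operator.sub, chunk, 0) for chunk in chunked(squares, 10)],
    0,
)

print(sequential, combined, seq_sub, par_sub)
```

      This is only an in-memory illustration; a real framework decides how the data is split.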





 Some interesting points...



      The MapReduce framework...
         1   Inspired by functional programming concepts (but not
             equivalent)
         2   Problems that can be parallelized
         3   Sometimes recursive solutions
         4   ...







 Basic Model




      “MapReduce: The Programming Model and Practice”, SIGMETRICS, Tutorials 2009, Google.







 Map Function




      Figure: Mapping creates a new output list by applying a function to
      individual elements of an input list.


      “Module 4: MapReduce”, Hadoop Tutorial, Yahoo!.







 Reduce Function




      Figure: Reducing a list iterates over the input values to produce an
      aggregate value as output.


      “Module 4: MapReduce”, Hadoop Tutorial, Yahoo!.






 MapReduce Flow




                              Figure: High-level MapReduce pipeline.


      “Module 4: MapReduce”, Hadoop Tutorial, Yahoo!.
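
      The high-level pipeline (map, group by key, reduce) can be sketched in a few lines of single-process Python. This is a teaching aid under simplifying assumptions, not how Hadoop is implemented: `run_mapreduce`, `letter_mapper` and `letter_reducer` are hypothetical names, and everything runs in memory on one machine.

```python
from collections import defaultdict


def run_mapreduce(records, mapper, reducer):
    """Run a toy map -> shuffle -> reduce pipeline in memory."""
    # Map phase: every input record may emit any number of (key, value) pairs.
    intermediate = []
    for key, value in records:
        intermediate.extend(mapper(key, value))

    # Shuffle phase: group all emitted values by their key.
    groups = defaultdict(list)
    for out_key, out_value in intermediate:
        groups[out_key].append(out_value)

    # Reduce phase: each key is reduced independently of the others.
    output = []
    for out_key in sorted(groups):
        output.extend(reducer(out_key, groups[out_key]))
    return output


def letter_mapper(_, line):
    # Emit (letter, 1) for every alphabetic character.
    for ch in line:
        if ch.isalpha():
            yield ch.lower(), 1


def letter_reducer(letter, counts):
    yield letter, sum(counts)


result = run_mapreduce([(0, "abba"), (1, "cab")], letter_mapper, letter_reducer)
print(result)
```

      Because each key is reduced on its own, the reduce calls could in principle run on different machines.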




 MapReduce Flow




                       Figure: Detailed Hadoop MapReduce data flow.



 Tip




      What is MapReduce?
      It is a framework inspired by functional programming to tackle
      problems in which steps can be parallelized by applying a divide and
      conquer approach.




  Thinking in MapReduce



 When should I use MapReduce?
      Query
              Index and Search: inverted index
              Filtering
              Classification
              Recommendations: clustering or collaborative filtering


      Analytics
              Summarization and statistics
              Sorting and merging
              Frequency distribution
              SQL-based queries: group-by, having, etc.
              Generation of graphics: histograms, scatter plots.


      Others
      Message passing such as Breadth-First Search or PageRank algorithms.
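
      As an illustration of the first query use case, the following sketch builds an inverted index in the MapReduce style. The document collection and the function names `map_index` and `reduce_index` are made up for the example, and the grouping step is simulated with a dictionary.

```python
from collections import defaultdict


def map_index(doc_id, text):
    # Emit (word, doc_id) once per distinct word in the document.
    for word in set(text.lower().split()):
        yield word, doc_id


def reduce_index(word, doc_ids):
    # Collapse all document ids for a word into one sorted posting list.
    return word, sorted(set(doc_ids))


docs = {1: "map and reduce", 2: "reduce the data", 3: "map the data"}

# Simulated shuffle: group emitted document ids by word.
grouped = defaultdict(list)
for doc_id, text in docs.items():
    for word, emitted_id in map_index(doc_id, text):
        grouped[word].append(emitted_id)

index = dict(reduce_index(word, ids) for word, ids in grouped.items())
print(index["reduce"], index["map"])
```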




 How Google uses MapReduce (80% of data processing)



             Large-scale web search indexing
             Clustering problems for Google News
             Produce reports for popular queries, e.g. Google Trends
             Processing of satellite imagery data
             Language model processing for statistical machine translation
             Large-scale machine learning problems
             ...







 Comparison of MapReduce and other approaches




      “MapReduce: The Programming Model and Practice”, SIGMETRICS, Tutorials 2009, Google.





 Evaluation of MapReduce and other approaches




      “MapReduce: The Programming Model and Practice”, SIGMETRICS, Tutorials 2009, Google.







 Apache Hadoop



   MapReduce definition
   The Apache Hadoop software library is a framework that allows for
   the distributed processing of large data sets across clusters of
   computers using simple programming models.

   Figure: Apache Hadoop Logo.







 Tip



      What can I do in MapReduce?
      Three main functions:
         1   Querying
         2   Summarizing
         3   Analyzing
      . . . large datasets in off-line mode for boosting other on-line
      processes.




  Applying MapReduce



 MapReduce in Action

      MapReduce Patterns
         1   Summarization
         2   Filtering
         3   Data Organization (sort, merging, etc.)
         4   Relational-based (join, selection, projection, etc.)
         5   Iterative Message Passing (graph processing)
         6   Others (depending on the implementation):
                   Simulation of distributed systems
                   Cross-correlation
                   Metapatterns
                   Input-output
                   ...




 Overview (stages)-Counting Letters







 Summarization




      Types
         1   Numerical summarizations
         2   Inverted index
         3   Counting and counters







 Numerical Summarization-I



      Description
      A general pattern for calculating aggregate statistical values over
      your data.

      Intent
      Group records together by a key field and calculate a numerical
      aggregate per group to get a top-level view of the larger data set.







 Numerical Summarization-II


      Applicability
          To deal with numerical data or counting.
          To group data by specific fields.

      Examples

          1   Word count
          2   Record count
          3   Min/Max/Count
          4   Average/Median/Standard deviation
          5   ...







 Numerical Summarization-Pseudocode


      class Mapper
         method Map(recordid id, record r)
            for all term t in record r do
               Emit(term t, count 1)

      class Reducer
         method Reduce(term t, counts [c1, c2,...])
            sum = 0
            for all count c in [c1, c2,...] do
               sum = sum + c
            Emit(term t, count sum)
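
      The pseudocode above can be tried out directly in Python. The sketch below is a single-process approximation: `map_words` and `reduce_counts` mirror the Map and Reduce methods, the sample records are invented, and a dictionary plays the role of the framework's grouping step.

```python
from collections import defaultdict


def map_words(record_id, record):
    # Emit (term, 1) for every term in the record.
    for term in record.split():
        yield term, 1


def reduce_counts(term, counts):
    # Add up all the partial counts received for one term.
    total = 0
    for count in counts:
        total += count
    return term, total


records = {1: "to be or not to be", 2: "to do"}

# Simulated shuffle: collect the emitted counts per term.
grouped = defaultdict(list)
for record_id, record in records.items():
    for term, one in map_words(record_id, record):
        grouped[term].append(one)

word_counts = dict(reduce_counts(term, counts) for term, counts in grouped.items())
print(word_counts)
```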





 Overview-Word Counter







 Numerical Summarization-Word Counter

      // `word` (Text) and `one` (IntWritable holding 1) are assumed to be
      // fields of the enclosing Mapper class; they are not shown on the slide.
      public void map(LongWritable key, Text value, Context context)
              throws IOException, InterruptedException {
          String line = value.toString();
          StringTokenizer tokenizer = new StringTokenizer(line);
          // Emit (word, 1) for every whitespace-separated token.
          while (tokenizer.hasMoreTokens()) {
              word.set(tokenizer.nextToken());
              context.write(word, one);
          }
      }

      public void reduce(Text key, Iterable<IntWritable> values, Context context)
              throws IOException, InterruptedException {
          // Sum all partial counts received for this word.
          int sum = 0;
          for (IntWritable val : values) {
              sum += val.get();
          }
          context.write(key, new IntWritable(sum));
      }






 Example-II




      Min/Max
      Given a list of tweets (username, date, text), determine the first and
      last time a user commented and the number of comments.

      Implementation

      See https://github.com/chemaar/seqos/tree/master/prototypes/mapreduce-intro







 Overview - Min/Max




      ∗ Min and max creation date are the same in the map phase.



 Example II-Min/Max, function Map


      public void map(Object key, Text value, Context context)
              throws IOException, InterruptedException {
          Map<String, String> parsed = MRDPUtils.parse(value.toString());
          String strDate = parsed.get(MRDPUtils.CREATION_DATE);
          String userId = parsed.get(MRDPUtils.USER_ID);
          // Skip records that lack a date or a user.
          if (strDate == null || userId == null) {
              return;
          }
          // ParseException is handled here because Mapper.map only declares
          // IOException and InterruptedException.
          Date creationDate;
          try {
              creationDate = MRDPUtils.frmt.parse(strDate);
          } catch (ParseException e) {
              return; // skip records with a malformed date
          }
          // In the map phase min and max are both the tweet's own date.
          outTuple.setMin(creationDate);
          outTuple.setMax(creationDate);
          outTuple.setCount(1);
          outUserId.set(userId);
          context.write(outUserId, outTuple);
      }







 Example II-Min/Max, function Reduce

      public void reduce(Text key, Iterable<MinMaxCountTuple> values, Context context)
              throws IOException, InterruptedException {
          result.setMin(null);
          result.setMax(null);
          int sum = 0;
          for (MinMaxCountTuple val : values) {
              // Keep the earliest date seen so far.
              if (result.getMin() == null
                      || val.getMin().compareTo(result.getMin()) < 0) {
                  result.setMin(val.getMin());
              }
              // Keep the latest date seen so far.
              if (result.getMax() == null
                      || val.getMax().compareTo(result.getMax()) > 0) {
                  result.setMax(val.getMax());
              }
              sum += val.getCount();
          }
          result.setCount(sum);
          context.write(key, result);
      }
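
      The same min/max/count logic can be sketched compactly in Python to show why the mapper emits a full (min, max, count) tuple: tuples of that shape can be merged pairwise in any order. The sample tweets, the date format and the names `to_tuple` and `merge` are assumptions made for this illustration only.

```python
from datetime import datetime


def to_tuple(date_str):
    # Map side: a single tweet is its own min and max, with count 1.
    date = datetime.strptime(date_str, "%Y-%m-%d")
    return (date, date, 1)


def merge(a, b):
    # Reduce side: combine two (min, max, count) tuples into one.
    return (min(a[0], b[0]), max(a[1], b[1]), a[2] + b[2])


tweets = [
    ("ana", "2013-04-10"),
    ("bob", "2013-04-08"),
    ("ana", "2013-04-01"),
    ("ana", "2013-04-05"),
]

per_user = {}
for user, date_str in tweets:
    current = to_tuple(date_str)
    per_user[user] = merge(per_user[user], current) if user in per_user else current

for user, (first, last, count) in sorted(per_user.items()):
    print(user, first.date(), last.date(), count)
```

      Since `merge` takes and returns the same kind of tuple, it could also be applied to partial results before the final reduce step.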






 Example-III




      Average
      Given a list of tweets (username, date, text) determine the average
      comment length per hour of day.

      Implementation

      See https://github.com/chemaar/seqos/tree/master/prototypes/mapreduce-intro







 Overview - Average







 Example III-Average, function Map


      public void map(Object key, Text value, Context context)
              throws IOException, InterruptedException {
          Map<String, String> parsed = MRDPUtils.parse(value.toString());
          String strDate = parsed.get(MRDPUtils.CREATION_DATE);
          String text = parsed.get(MRDPUtils.TEXT);
          // Skip records that lack a date or a text.
          if (strDate == null || text == null) {
              return;
          }
          // ParseException is handled here because Mapper.map only declares
          // IOException and InterruptedException.
          Date creationDate;
          try {
              creationDate = MRDPUtils.frmt.parse(strDate);
          } catch (ParseException e) {
              return; // skip records with a malformed date
          }
          // Key: hour of day. Value: (count = 1, average = length of this tweet).
          outHour.set(creationDate.getHours());
          outCountAverage.setCount(1);
          outCountAverage.setAverage(text.length());
          context.write(outHour, outCountAverage);
      }







 Example III-Average, function Reduce


      public void reduce(IntWritable key, Iterable<CountAverageTuple> values,
              Context context) throws IOException, InterruptedException {
          float sum = 0;
          float count = 0;
          for (CountAverageTuple val : values) {
              // Recover each partial total as count * average before adding.
              sum += val.getCount() * val.getAverage();
              count += val.getCount();
          }
          result.setCount(count);
          result.setAverage(sum / count);
          context.write(key, result);
      }
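
      The reducer multiplies each average by its count because averages cannot simply be averaged again. The short sketch below, with made-up tweet lengths and a hypothetical `merge_avg` helper, compares the weighted merge with a naive average of averages.

```python
def merge_avg(a, b):
    """Combine two (count, average) pairs into one weighted pair."""
    count = a[0] + b[0]
    return (count, (a[0] * a[1] + b[0] * b[1]) / count)


lengths_a = [10, 20, 30]  # tweet lengths seen in one split
lengths_b = [100]         # tweet lengths seen in another split

partial_a = (len(lengths_a), sum(lengths_a) / len(lengths_a))
partial_b = (len(lengths_b), sum(lengths_b) / len(lengths_b))

weighted = merge_avg(partial_a, partial_b)
naive = (partial_a[1] + partial_b[1]) / 2
true_average = sum(lengths_a + lengths_b) / len(lengths_a + lengths_b)

print(weighted, naive, true_average)
```

      Only the weighted merge agrees with the average computed over all the lengths at once.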







 Numerical Summarization-Other approaches

      Relation to SQL
          SELECT MIN(numcol1), MAX(numcol1), COUNT(*)
          FROM table GROUP BY groupcol2;

      Implementation in Pig

          b = GROUP a BY groupcol2;
          c = FOREACH b GENERATE group, MIN(a.numcol1),
              MAX(a.numcol1), COUNT_STAR(a);







 Filtering




      Types
         1   Filtering
         2   Top N records
         3   Bloom filtering
         4   Distinct







 Filtering-I



      Description
      It evaluates each record separately and decides, based on some
      condition, whether it should stay or go.

      Intent
      Filter out records that are not of interest and keep ones that are.







 Filtering-II


      Applicability
      To collate data

      Examples

          1   Closer view of dataset
          2   Data cleansing
          3   Tracking a thread of events
          4   Simple random sampling
          5   Distributed Grep
          6   Removing low-scoring data
          7   Log Analysis
          8   Data Querying
          9   Data Validation
         10 . . .







 Filtering-Pseudocode


      class Mapper
         method Map(recordid id, record r)
            field f = extract(r)
            if predicate (f)
               Emit(recordid id, value(r))

      class Reducer
         method Reduce(recordid id, values [r1, r2,...])
            //Whatever
            Emit(recordid id, aggregate (values))






 Example-IV




      Distributed Grep
      Given a list of tweets (username, date, text) determine the tweets
      that contain a word.

      Implementation

      See https://github.com/chemaar/seqos/tree/master/prototypes/mapreduce-intro







 Overview - Distributed Grep







 Example IV-Distributed Grep, function Map


      public void map(Object key, Text value, Context context)
              throws IOException, InterruptedException {
          Map<String, String> parsed = MRDPUtils.parse(value.toString());
          String txt = parsed.get(MRDPUtils.TEXT);
          // Build a regex around the word configured under "mapregex".
          String mapRegex = ".*\\b" + context.getConfiguration().get("mapregex")
                  + "(.)*\\b.*";
          // Write matching records as they are; records without text are skipped.
          if (txt != null && txt.matches(mapRegex)) {
              context.write(NullWritable.get(), value);
          }
      }


      ...and the Reduce function?

      In this case it is not necessary: output values are written directly to the output.
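
      A map-only filter of this kind is easy to sketch in Python. The version below is a simplification and not a translation of the Java regex: it matches the search word only as a whole word, and the function name `grep_map` and the sample tweets are invented for the example.

```python
import re


def grep_map(record, word):
    # Emit the record unchanged when it contains `word` as a whole word.
    pattern = re.compile(r"\b" + re.escape(word) + r"\b", re.IGNORECASE)
    if pattern.search(record):
        yield record


tweets = [
    "Hadoop is fun",
    "I like hadooping",
    "Try MapReduce on hadoop today",
]

# No reduce step: the mapper output is already the final result.
matches = [out for tweet in tweets for out in grep_map(tweet, "hadoop")]
print(matches)
```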







 Example-V




      Top 5
      Given a list of tweets (username, date, text), determine the 5 users
      who wrote the longest tweets.

      Implementation

      See https://github.com/chemaar/seqos/tree/master/prototypes/mapreduce-intro







 Overview - Top 5







 Example V-Top 5, function Map

      // Holds the MAX_TOP longest tweets seen by this mapper, keyed by text length.
      private TreeMap<Integer, Text> repToRecordMap = new TreeMap<Integer, Text>();

      public void map(Object key, Text value, Context context)
              throws IOException, InterruptedException {
          Map<String, String> parsed = MRDPUtils.parse(value.toString());
          if (parsed == null) { return; }
          String userId = parsed.get(MRDPUtils.USER_ID);
          String text = parsed.get(MRDPUtils.TEXT);
          if (userId == null || text == null) { return; }
          // "Reputation" here is the tweet length: longer tweets rank higher.
          int reputation = text.length();
          repToRecordMap.put(reputation, new Text(value));
          // Drop the shortest entry once more than MAX_TOP are held.
          if (repToRecordMap.size() > MAX_TOP) {
              repToRecordMap.remove(repToRecordMap.firstKey());
          }
          // Nothing is written here; the retained records are emitted
          // elsewhere (not shown on this slide).
      }






 Example V-Top 5, function Reduce


      public void reduce(NullWritable key, Iterable<Text> values, Context context)
              throws IOException, InterruptedException {
          for (Text value : values) {
              Map<String, String> parsed = MRDPUtils.parse(value.toString());
              repToRecordMap.put(parsed.get(MRDPUtils.TEXT).length(),
                      new Text(value));
              // Keep only the MAX_TOP longest tweets.
              if (repToRecordMap.size() > MAX_TOP) {
                  repToRecordMap.remove(repToRecordMap.firstKey());
              }
          }
          // Write the survivors from longest to shortest.
          for (Text t : repToRecordMap.descendingMap().values()) {
              context.write(NullWritable.get(), t);
          }
      }
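
      The two-stage idea behind the Top 5 example (each mapper keeps only its local top records, then a single reducer picks the global top) can be sketched in Python with `heapq.nlargest`. The splits and user names are invented, and `local_top` is a hypothetical helper. Unlike a map keyed by length, this sketch does not overwrite records whose texts have the same length.

```python
import heapq

MAX_TOP = 5


def local_top(records, n=MAX_TOP):
    # Keep the n records with the longest text; a record is (user, text).
    return heapq.nlargest(n, records, key=lambda record: len(record[1]))


split_1 = [("u1", "a" * 3), ("u2", "a" * 9), ("u3", "a" * 1),
           ("u4", "a" * 7), ("u5", "a" * 5), ("u6", "a" * 2)]
split_2 = [("u7", "a" * 8), ("u8", "a" * 4), ("u9", "a" * 10),
           ("u10", "a" * 6)]

# Map side: each split forwards at most MAX_TOP candidates.
candidates = local_top(split_1) + local_top(split_2)

# Reduce side: one final pass over the few remaining candidates.
top5 = local_top(candidates)
print([user for user, _ in top5])
```

      The final answer stays correct because every record in the global top 5 must also be in the top 5 of its own split.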







 Filtering-Other approaches


      Relation to SQL
          SELECT * FROM table WHERE colvalue < VALUE;

      Implementation in Pig

          b = FILTER a BY colvalue < VALUE;







 Tip




      How can I use and run a MapReduce framework?
      You should identify what kind of problem you are addressing and
      apply a design pattern to be implemented in a framework such
      as Apache Hadoop.




  Success Stories with MapReduce



 Tip



      Who is using MapReduce?
      All companies that are dealing with Big Data problems for
      analytics such as:
             Cloudera
             Datasalt
             Elasticsearch
             ...







 Apache Hadoop-Related Projects







 More tips


      FAQ
             MapReduce is a framework based on a simple programming
             model
             ...to deal with large datasets in a distributed fashion
             ...scalability, replication, fault tolerance, etc.
             Apache Hadoop is not a database
             New frameworks on top of Hadoop for specific tasks:
             querying, analysis, etc.
             Other similar frameworks: Storm, Signal/Collect, etc.
             ...


  Summary and Conclusions



 Summary







 Conclusions


      What is MapReduce?

      It is a framework inspired by functional programming to tackle problems in which steps can be parallelized
      by applying a divide and conquer approach.


      What can I do in MapReduce?

      Three main functions:
          1   Querying
          2   Summarizing
          3   Analyzing
      . . . large datasets in off-line mode for boosting other on-line processes.


      How can I use and run a MapReduce framework?

      You should identify what kind of problem you are addressing and apply a design pattern to be implemented in a
      framework such as Apache Hadoop.
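
      A minimal sketch of the divide-and-conquer idea above, written in plain Python rather than
      Apache Hadoop. The function names (map_phase, shuffle_phase, reduce_phase, word_count) are
      illustrative assumptions, not part of any framework API, and everything runs in a single
      process: it only shows how a problem is split into map, shuffle and reduce steps.

```python
from collections import defaultdict
from functools import reduce


def map_phase(document):
    """Map: emit a (word, 1) pair for every word in one input split."""
    return [(word.lower(), 1) for word in document.split()]


def shuffle_phase(pairs):
    """Shuffle: group all emitted values by key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Reduce: fold the values of one key into a single result."""
    return key, reduce(lambda a, b: a + b, values)


def word_count(documents):
    # Each document is mapped independently; in a real framework this is
    # the step that would be distributed across machines.
    pairs = [pair for doc in documents for pair in map_phase(doc)]
    grouped = shuffle_phase(pairs)
    return dict(reduce_phase(key, values) for key, values in grouped.items())


if __name__ == "__main__":
    splits = ["the quick brown fox", "the lazy dog", "the fox"]
    print(word_count(splits))
```

      Because map_phase only looks at one split and reduce_phase only looks at one key, both
      steps can in principle run in parallel; the shuffle is the only step that needs all the data.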







 What’s next?


      ...
             Concatenate MapReduce jobs
             Optimization using combiners and tuning the parameters
             (partition size, etc.)
             Pipelining with other languages such as Python
             Hadoop in Action: more examples, etc.
             New trending problems (image/video processing)
             Real-time processing
             ...
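
      Two of the items above, combiners and concatenated jobs, can be sketched in plain Python.
      This is a single-process illustration under stated assumptions: the names mapper, combiner,
      reducer, run_job and top_n_job are hypothetical, and the in-memory sort only stands in for
      the shuffle-and-sort step that a framework such as Apache Hadoop would perform.

```python
from collections import Counter
from itertools import groupby
from operator import itemgetter


def mapper(lines):
    """Emit (word, 1) for every word of every line in one split."""
    for line in lines:
        for word in line.split():
            yield word.lower(), 1


def combiner(pairs):
    """Pre-aggregate on the map side so fewer pairs reach the reducers."""
    local = Counter()
    for key, value in pairs:
        local[key] += value
    return sorted(local.items())


def reducer(sorted_pairs):
    """Sum the values of each key; the input must be sorted by key."""
    for key, group in groupby(sorted_pairs, key=itemgetter(0)):
        yield key, sum(value for _, value in group)


def run_job(splits):
    """First job: word count with a combiner applied per split."""
    combined = []
    for split in splits:
        combined.extend(combiner(mapper(split)))
    combined.sort(key=itemgetter(0))  # stands in for shuffle and sort
    return list(reducer(combined))


def top_n_job(counts, n):
    """Second job, chained after the first: keep the n most frequent words."""
    return sorted(counts, key=lambda kv: (-kv[1], kv[0]))[:n]


if __name__ == "__main__":
    splits = [["a b a", "b a"], ["c a"]]
    counts = run_job(splits)
    print(counts)
    print(top_n_job(counts, 2))
```

      The combiner is safe here only because addition is associative and commutative; for other
      reduce functions the same shortcut may not apply.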



  References



               J. Dean and S. Ghemawat.
               MapReduce: simplified data processing on large clusters.
               Commun. ACM, 51(1):107–113, Jan. 2008.
               J. R. Owens, J. Lentz, and B. Femiano.
               Hadoop Real-World Solutions Cookbook.
               Packt Publishing Ltd, 2013.
               C. Lam.
               Hadoop in Action.
               Manning Publications Co., Greenwich, CT, USA, 1st edition,
               2010.
               J. Lin and C. Dyer.
               Data-intensive text processing with MapReduce.
               In Proceedings of Human Language Technologies: The 2009
               Annual Conference of the North American Chapter of the
               Association for Computational Linguistics, Companion
               Volume: Tutorial Abstracts, NAACL-Tutorials ’09, pages 1–2,
               Stroudsburg, PA, USA, 2009. Association for Computational
               Linguistics.
               D. Miner and A. Shook.
               MapReduce Design Patterns.
               O’Reilly Media, Inc., 2012.
               S. Perera and T. Gunarathne.
               Hadoop MapReduce Cookbook.
               Packt Publishing Ltd, 2013.
               T. White.
               Hadoop: The Definitive Guide.
               O’Reilly Media, Inc., 1st edition, 2009.
               I. H. Witten and E. Frank.
               Data Mining: Practical Machine Learning Tools and Techniques.
               Morgan Kaufmann Publishers Inc., San Francisco, CA, USA,
               2005.

Install Stable Diffusion in windows machine
 
Beyond Boundaries: Leveraging No-Code Solutions for Industry Innovation
Beyond Boundaries: Leveraging No-Code Solutions for Industry InnovationBeyond Boundaries: Leveraging No-Code Solutions for Industry Innovation
Beyond Boundaries: Leveraging No-Code Solutions for Industry Innovation
 
08448380779 Call Girls In Greater Kailash - I Women Seeking Men
08448380779 Call Girls In Greater Kailash - I Women Seeking Men08448380779 Call Girls In Greater Kailash - I Women Seeking Men
08448380779 Call Girls In Greater Kailash - I Women Seeking Men
 
SQL Database Design For Developers at php[tek] 2024
SQL Database Design For Developers at php[tek] 2024SQL Database Design For Developers at php[tek] 2024
SQL Database Design For Developers at php[tek] 2024
 
Slack Application Development 101 Slides
Slack Application Development 101 SlidesSlack Application Development 101 Slides
Slack Application Development 101 Slides
 

Map/Reduce intro

  • 1. MapReduce Intro. The MapReduce Programming Model: Introduction and Examples. Dr. Jose María Alvarez-Rodríguez. “Quality Management in Service-based Systems and Cloud Applications”, FP7 RELATE-ITN, South East European Research Center. Thessaloniki, 10th of April, 2013.
  • 2. (1) MapReduce in a nutshell; (2) Thinking in MapReduce; (3) Applying MapReduce; (4) Success Stories with MapReduce; (5) Summary and Conclusions.
  • 3. MapReduce in a nutshell. Features. A programming model: (1) large-scale distributed data processing; (2) simple but restricted; (3) parallel programming; (4) extensible.
  • 4. Antecedents. Functional programming: (1) inspired by it, (2) ...but not equivalent. Example in Python: “Given a list of numbers between 1 and 50, print only the even numbers.”
        # Python 3: filter() returns a lazy iterator, so wrap it in list() before printing.
        # range(1, 51) includes 50; range(1, 50) would stop at 49.
        print(list(filter(lambda x: x % 2 == 0, range(1, 51))))
    The ingredients are a list of numbers (the data), a condition (even numbers), and a function, filter, that is applied to the list (map).
  • 6. ...Other examples... Example in Python: “Return the sum of the squares of a list of numbers between 1 and 50.”
        # Python 3: reduce() lives in functools.
        from functools import reduce
        import operator
        reduce(operator.add, map(lambda x: x ** 2, range(1, 51)), 0)
    “reduce” is equivalent to “foldl” in other functional languages such as Haskell. Other mathematical considerations should be taken into account (the kind of operator)...
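The remark about the kind of operator can be made concrete. The sketch below assumes that a distributed reduce folds each partition separately and then folds the partial results; `chunked_reduce` and `chunk_size` are hypothetical names introduced only for this illustration. Under that assumption, an associative operator with an identity element (addition with 0) gives the same answer as a sequential fold, while subtraction does not.

```python
from functools import reduce
import operator


def chunked_reduce(op, values, initial, chunk_size):
    """Reduce each chunk on its own, then reduce the partial results."""
    chunks = [values[i:i + chunk_size] for i in range(0, len(values), chunk_size)]
    partials = [reduce(op, chunk, initial) for chunk in chunks]
    return reduce(op, partials, initial)


squares = [x ** 2 for x in range(1, 51)]

# Addition: the chunked result matches the sequential fold.
sequential_sum = reduce(operator.add, squares, 0)
parallel_sum = chunked_reduce(operator.add, squares, 0, chunk_size=10)

# Subtraction: the chunked result differs from the sequential fold.
sequential_diff = reduce(operator.sub, squares, 0)
parallel_diff = chunked_reduce(operator.sub, squares, 0, chunk_size=10)

print(sequential_sum == parallel_sum)
print(sequential_diff == parallel_diff)
```

This is why a reduce step that is meant to run on partial results should use an operator that tolerates regrouping.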
  • 7. Some interesting points... The MapReduce framework: (1) is inspired by functional programming concepts (but is not equivalent); (2) targets problems that can be parallelized; (3) sometimes involves recursive solutions; (4) ...
  • 8. Basic Model. Source: “MapReduce: The Programming Model and Practice”, SIGMETRICS Tutorials 2009, Google.
  • 9. Map Function. Figure: Mapping creates a new output list by applying a function to individual elements of an input list. Source: “Module 4: MapReduce”, Hadoop Tutorial, Yahoo!.
  • 10. Reduce Function. Figure: Reducing a list iterates over the input values to produce an aggregate value as output. Source: “Module 4: MapReduce”, Hadoop Tutorial, Yahoo!.
  • 11. MapReduce Flow. Figure: High-level MapReduce pipeline. Source: “Module 4: MapReduce”, Hadoop Tutorial, Yahoo!.
  • 12. MapReduce Flow. Figure: Detailed Hadoop MapReduce data flow.
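The map, shuffle, and reduce stages described in the figures above can be imitated in a few lines of single-process Python. This is only a sketch of the data flow, not of Hadoop itself: `run_mapreduce`, `letter_mapper`, and `count_reducer` are hypothetical names, and everything runs in memory on one machine.

```python
from collections import defaultdict


def run_mapreduce(records, mapper, reducer):
    """Run the three stages in memory: map, shuffle (group by key), reduce."""
    # Map: every record yields zero or more (key, value) pairs.
    intermediate = []
    for record in records:
        intermediate.extend(mapper(record))

    # Shuffle: collect all values that share a key.
    groups = defaultdict(list)
    for key, value in intermediate:
        groups[key].append(value)

    # Reduce: collapse each key's values into one aggregate.
    return {key: reducer(key, values) for key, values in sorted(groups.items())}


def letter_mapper(word):
    # Emit (letter, 1) for each letter of the word.
    return [(letter, 1) for letter in word]


def count_reducer(key, values):
    # Sum the ones emitted for this letter.
    return sum(values)


letter_counts = run_mapreduce(["map", "reduce", "hadoop"], letter_mapper, count_reducer)
print(letter_counts)
```

In a real cluster the map and reduce calls run on different machines and the shuffle moves data over the network; the sketch keeps only the logical contract between the stages.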
  • 13. Tip. What is MapReduce? It is a framework inspired by functional programming for tackling problems whose steps can be parallelized by applying a divide-and-conquer approach.
  • 14. Thinking in MapReduce. When should I use MapReduce? Query: index and search (inverted index); filtering; classification; recommendations (clustering or collaborative filtering). Analytics: summarization and statistics; sorting and merging; frequency distribution; SQL-based queries (group-by, having, etc.); generation of graphics (histograms, scatter plots). Others: message passing, such as Breadth-First Search or PageRank algorithms.
  • 17. How Google uses MapReduce (80% of data processing): large-scale web search indexing; clustering problems for Google News; reports on popular queries, e.g. Google Trends; processing of satellite imagery data; language model processing for statistical machine translation; large-scale machine learning problems; ...
  • 18. Comparison of MapReduce and other approaches. Source: “MapReduce: The Programming Model and Practice”, SIGMETRICS Tutorials 2009, Google.
  • 19. Evaluation of MapReduce and other approaches. Source: “MapReduce: The Programming Model and Practice”, SIGMETRICS Tutorials 2009, Google.
  • 20. Apache Hadoop. MapReduce definition: The Apache Hadoop software library is a framework that allows for the distributed processing of large data sets across clusters of computers using simple programming models. Figure: Apache Hadoop logo.
  • 21. Tip. What can I do in MapReduce? Three main functions: (1) querying, (2) summarizing, and (3) analyzing large datasets in off-line mode to boost other on-line processes.
  • 22. Applying MapReduce. MapReduce in Action. MapReduce patterns: (1) summarization; (2) filtering; (3) data organization (sorting, merging, etc.); (4) relational-based (join, selection, projection, etc.); (5) iterative message passing (graph processing); (6) others, depending on the implementation: simulation of distributed systems, cross-correlation, metapatterns, input-output, ...
  • 23. Overview (stages): Counting Letters.
  • 24. Summarization. Types: (1) numerical summarizations; (2) inverted index; (3) counting and counters.
  • 25. Numerical Summarization-I. Description: a general pattern for calculating aggregate statistical values over your data. Intent: group records together by a key field and calculate a numerical aggregate per group to get a top-level view of the larger data set.
  • 26. Numerical Summarization-II. Applicability: to deal with numerical data or counting; to group data by specific fields. Examples: (1) word count; (2) record count; (3) min/max/count; (4) average/median/standard deviation; (5) ...
  • 27. Numerical Summarization-Pseudocode.
        class Mapper
            method Map(recordid id, record r)
                for all term t in record r do
                    Emit(term t, count 1)
        class Reducer
            method Reduce(term t, counts [c1, c2, ...])
                sum = 0
                for all count c in [c1, c2, ...] do
                    sum = sum + c
                Emit(term t, count sum)
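The pseudocode above translates almost line by line into Python. The sketch below keeps the `Mapper` and `Reducer` roles from the slide; the `word_count` driver, which stands in for the framework's grouping step, is a hypothetical helper that runs in memory.

```python
from collections import defaultdict


class Mapper:
    def map(self, record_id, record):
        # Emit (term, 1) for every term in the record.
        for term in record.split():
            yield term, 1


class Reducer:
    def reduce(self, term, counts):
        # Sum all the counts emitted for this term.
        total = 0
        for c in counts:
            total += c
        yield term, total


def word_count(records):
    """In-memory driver: group the mapper output by term, then reduce."""
    mapper, reducer = Mapper(), Reducer()
    grouped = defaultdict(list)
    for record_id, record in enumerate(records):
        for term, count in mapper.map(record_id, record):
            grouped[term].append(count)
    result = {}
    for term, term_counts in grouped.items():
        for key, total in reducer.reduce(term, term_counts):
            result[key] = total
    return result


counts = word_count(["to be or not to be", "to map and to reduce"])
print(counts)
```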
  • 28. Overview: Word Counter.
  • 29. Numerical Summarization: Word Counter.
        // Mapper: emit (word, 1) for every token in the input line.
        // word and one are fields of the enclosing Mapper class, not shown on the slide.
        public void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            String line = value.toString();
            StringTokenizer tokenizer = new StringTokenizer(line);
            while (tokenizer.hasMoreTokens()) {
                word.set(tokenizer.nextToken());
                context.write(word, one);
            }
        }

        // Reducer: sum the counts received for each word.
        public void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable val : values) {
                sum += val.get();
            }
            context.write(key, new IntWritable(sum));
        }
  • 30. Example-II: Min/Max. Given a list of tweets (username, date, text), determine the first and last time a user commented and the number of comments. Implementation: see https://github.com/chemaar/seqos/tree/master/prototypes/mapreduce-intro
  • 31. Overview: Min/Max. ∗ The min and max creation dates are the same in the map phase.
  • 32. Example II: Min/Max, function Map.
        public void map(Object key, Text value, Context context)
                throws IOException, InterruptedException, ParseException {
            Map<String, String> parsed = MRDPUtils.parse(value.toString());
            String strDate = parsed.get(MRDPUtils.CREATION_DATE);
            String userId = parsed.get(MRDPUtils.USER_ID);
            // Skip records that lack a date or a user.
            if (strDate == null || userId == null) {
                return;
            }
            Date creationDate = MRDPUtils.frmt.parse(strDate);
            // A single record is both the earliest and the latest seen so far.
            outTuple.setMin(creationDate);
            outTuple.setMax(creationDate);
            outTuple.setCount(1);
            outUserId.set(userId);
            context.write(outUserId, outTuple);
        }
  • 33. Example II: Min/Max, function Reduce.
        public void reduce(Text key, Iterable<MinMaxCountTuple> values, Context context)
                throws IOException, InterruptedException {
            result.setMin(null);
            result.setMax(null);
            int sum = 0;
            for (MinMaxCountTuple val : values) {
                // Keep the earliest date as the minimum.
                if (result.getMin() == null
                        || val.getMin().compareTo(result.getMin()) < 0) {
                    result.setMin(val.getMin());
                }
                // Keep the latest date as the maximum.
                if (result.getMax() == null
                        || val.getMax().compareTo(result.getMax()) > 0) {
                    result.setMax(val.getMax());
                }
                sum += val.getCount();
            }
            result.setCount(sum);
            context.write(key, result);
        }
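The same min/max/count logic can be tried out without Hadoop. In the Python sketch below the tweet tuples, the date format, and the names `map_tweet` and `reduce_user` are assumptions made for illustration; only the shape of the computation follows the Map and Reduce functions above.

```python
from datetime import datetime


def map_tweet(tweet):
    """Emit (user, (min_date, max_date, count)) for one tweet."""
    user, created, _text = tweet
    when = datetime.strptime(created, "%Y-%m-%d %H:%M")
    # For a single record the min and max dates are the same.
    return user, (when, when, 1)


def reduce_user(values):
    """Merge (min, max, count) tuples for one user."""
    first, last, count = None, None, 0
    for low, high, c in values:
        if first is None or low < first:
            first = low
        if last is None or high > last:
            last = high
        count += c
    return first, last, count


tweets = [
    ("ana", "2013-04-10 18:30", "third"),
    ("ana", "2013-04-01 09:00", "first"),
    ("bob", "2013-04-05 12:00", "only"),
    ("ana", "2013-04-03 10:15", "second"),
]

# Group the mapper output by user (the shuffle step), then reduce.
grouped = {}
for tweet in tweets:
    user, value = map_tweet(tweet)
    grouped.setdefault(user, []).append(value)

summary = {user: reduce_user(values) for user, values in grouped.items()}
print(summary)
```

Because the reducer consumes and produces the same tuple shape, it could also be applied to partial results before the final merge.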
  • 34. Example-III: Average. Given a list of tweets (username, date, text), determine the average comment length per hour of day. Implementation: see https://github.com/chemaar/seqos/tree/master/prototypes/mapreduce-intro
  • 35. Overview: Average.
  • 36. Example III: Average, function Map.
        public void map(Object key, Text value, Context context)
                throws IOException, InterruptedException, ParseException {
            Map<String, String> parsed = MRDPUtils.parse(value.toString());
            String strDate = parsed.get(MRDPUtils.CREATION_DATE);
            String text = parsed.get(MRDPUtils.TEXT);
            // Skip records that lack a date or a text.
            if (strDate == null || text == null) {
                return;
            }
            Date creationDate = MRDPUtils.frmt.parse(strDate);
            outHour.set(creationDate.getHours());
            // One record: count 1, average equal to its own length.
            outCountAverage.setCount(1);
            outCountAverage.setAverage(text.length());
            context.write(outHour, outCountAverage);
        }
  • 37. Example III: Average, function Reduce.
        public void reduce(IntWritable key, Iterable<CountAverageTuple> values, Context context)
                throws IOException, InterruptedException {
            float sum = 0;
            float count = 0;
            for (CountAverageTuple val : values) {
                // Weight each average by the number of records behind it.
                sum += val.getCount() * val.getAverage();
                count += val.getCount();
            }
            result.setCount(count);
            result.setAverage(sum / count);
            context.write(key, result);
        }
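The Reduce function above multiplies each average by its count before summing. The Python sketch below, with a hypothetical `merge_averages` helper and made-up comment lengths, suggests why the count has to travel with the average: averaging partial averages directly gives a different, wrong answer when the groups have different sizes.

```python
def merge_averages(pairs):
    """Merge (count, average) pairs into a single (count, average) pair."""
    total = 0.0
    count = 0
    for c, avg in pairs:
        total += c * avg  # recover the sum behind each partial average
        count += c
    return count, total / count


# Comment lengths seen by two different mappers for the same hour.
lengths_a = [10, 20, 30]
lengths_b = [100]

partial_a = merge_averages((1, length) for length in lengths_a)
partial_b = merge_averages((1, length) for length in lengths_b)

# Weighted merge of the partial results: matches the true mean of all four values.
correct = merge_averages([partial_a, partial_b])

# Naive mean of the two partial averages: ignores that group A has three records.
naive = (partial_a[1] + partial_b[1]) / 2

print(correct, naive)
```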
  • 38. Numerical Summarization: Other approaches. Relation to SQL:
        SELECT MIN(numcol1), MAX(numcol1), COUNT(*) FROM table GROUP BY groupcol2;
    Implementation in Pig:
        b = GROUP a BY groupcol2;
        c = FOREACH b GENERATE group, MIN(a.numcol1), MAX(a.numcol1), COUNT_STAR(a);
  • 40. Filtering. Types: (1) filtering; (2) top N records; (3) Bloom filtering; (4) distinct.
  • 41. Filtering-I. Description: it evaluates each record separately and decides, based on some condition, whether it should stay or go. Intent: filter out records that are not of interest and keep the ones that are.
  • 42. Filtering-II. Applicability: to collate data. Examples: (1) closer view of a dataset; (2) data cleansing; (3) tracking a thread of events; (4) simple random sampling; (5) distributed grep; (6) removing low-scoring data; (7) log analysis; (8) data querying; (9) data validation; (10) ...
  • 43. Filtering-Pseudocode.
        class Mapper
            method Map(recordid id, record r)
                field f = extract(r)
                if predicate(f)
                    Emit(recordid id, value(r))
        class Reducer
            method Reduce(recordid id, values [r1, r2, ...])
                // Whatever
                Emit(recordid id, aggregate(values))
  • 44. Example-IV: Distributed Grep. Given a list of tweets (username, date, text), determine the tweets that contain a word. Implementation: see https://github.com/chemaar/seqos/tree/master/prototypes/mapreduce-intro
  • 45. Overview: Distributed Grep.
  • 46. Example IV: Distributed Grep, function Map.
        public void map(Object key, Text value, Context context)
                throws IOException, InterruptedException {
            Map<String, String> parsed = MRDPUtils.parse(value.toString());
            String txt = parsed.get(MRDPUtils.TEXT);
            // Build the pattern from the "mapregex" job configuration entry.
            String mapRegex = ".*\\b" + context.getConfiguration().get("mapregex") + "(.)*\\b.*";
            if (txt.matches(mapRegex)) {
                context.write(NullWritable.get(), value);
            }
        }
    ...and the Reduce function? In this case it is not necessary: the output values of the map phase are written directly to the output.
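A map-only filter of this kind is easy to prototype. The Python sketch below is an illustration under stated assumptions: `grep_mapper` is a hypothetical name, the tweets are plain strings, and the pattern matches the word as a whole token, which is stricter than the slide's pattern (that one also accepts longer words that start with the configured text).

```python
import re


def grep_mapper(tweets, word):
    """Yield only the tweets whose text contains the word as a whole token."""
    pattern = re.compile(r"\b" + re.escape(word) + r"\b")
    for tweet in tweets:
        if pattern.search(tweet):
            # No reducer is needed: matching records go straight to the output.
            yield tweet


tweets = [
    "hadoop rocks",
    "mapreduce intro",
    "running on hadoop",
    "hadoopish things",
]

matches = list(grep_mapper(tweets, "hadoop"))
print(matches)
```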
  • 47. Example-V: Top 5. Given a list of tweets (username, date, text), determine the 5 users who wrote the longest tweets. Implementation: see https://github.com/chemaar/seqos/tree/master/prototypes/mapreduce-intro
  • 48. Overview: Top 5.
  • 49. Example V: Top 5, function Map.
        private TreeMap<Integer, Text> repToRecordMap = new TreeMap<Integer, Text>();

        public void map(Object key, Text value, Context context)
                throws IOException, InterruptedException {
            Map<String, String> parsed = MRDPUtils.parse(value.toString());
            if (parsed == null) { return; }
            String userId = parsed.get(MRDPUtils.USER_ID);
            // The "reputation" is the tweet length: longer tweets rank higher.
            String reputation = String.valueOf(parsed.get(MRDPUtils.TEXT).length());
            if (userId == null || reputation == null) { return; }
            repToRecordMap.put(Integer.parseInt(reputation), new Text(value));
            // Keep only the MAX_TOP largest keys by evicting the smallest one.
            if (repToRecordMap.size() > MAX_TOP) {
                repToRecordMap.remove(repToRecordMap.firstKey());
            }
        }
  • 50. Example V: Top 5, function Reduce.
        public void reduce(NullWritable key, Iterable<Text> values, Context context)
                throws IOException, InterruptedException {
            for (Text value : values) {
                Map<String, String> parsed = MRDPUtils.parse(value.toString());
                repToRecordMap.put(parsed.get(MRDPUtils.TEXT).length(), new Text(value));
                // Keep only the MAX_TOP largest keys by evicting the smallest one.
                if (repToRecordMap.size() > MAX_TOP) {
                    repToRecordMap.remove(repToRecordMap.firstKey());
                }
            }
            // Write the records from longest to shortest.
            for (Text t : repToRecordMap.descendingMap().values()) {
                context.write(NullWritable.get(), t);
            }
        }
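The top-N pattern above works in two steps: each mapper keeps its own local top N, and a single reducer merges those short lists. The Python sketch below swaps the TreeMap for `heapq.nlargest`; `local_top`, `global_top`, and the sample data are hypothetical. One difference worth noting: a TreeMap keyed by length keeps only one record per distinct length, whereas this version keeps ties.

```python
import heapq

MAX_TOP = 5


def local_top(tweets, n=MAX_TOP):
    """Mapper side: keep the n longest tweets of one input split."""
    return heapq.nlargest(n, tweets, key=lambda t: len(t[1]))


def global_top(partials, n=MAX_TOP):
    """Reducer side: merge the local winners and keep the n longest overall."""
    merged = [tweet for part in partials for tweet in part]
    return heapq.nlargest(n, merged, key=lambda t: len(t[1]))


# Two input splits of (user, text) pairs; the texts are filler of known length.
split_a = [("u1", "a" * 10), ("u2", "a" * 50), ("u3", "a" * 30)]
split_b = [("u4", "a" * 40), ("u5", "a" * 5), ("u6", "a" * 60), ("u7", "a" * 20)]

top_two = global_top([local_top(split_a, n=2), local_top(split_b, n=2)], n=2)
print([user for user, _ in top_two])
```

Only the local winners cross the network, so the reducer handles at most N records per mapper regardless of the input size.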
  • 51. Filtering: Other approaches. Relation to SQL:
        SELECT * FROM table WHERE colvalue < VALUE;
    Implementation in Pig:
        b = FILTER a BY colvalue < VALUE;
  • 53. Tip. How can I use and run a MapReduce framework? Identify the kind of problem you are addressing and apply a design pattern that can be implemented in a framework such as Apache Hadoop.
  • 54. Success Stories with MapReduce. Tip. Who is using MapReduce? Companies that deal with Big Data problems for analytics, such as Cloudera, Datasalt, Elasticsearch, ...
  • 55. Apache Hadoop-Related Projects.
  • 56. More tips (FAQ). MapReduce is a framework based on a simple programming model ...to deal with large datasets in a distributed fashion ...with scalability, replication, fault tolerance, etc. Apache Hadoop is not a database. New frameworks on top of Hadoop address specific tasks: querying, analysis, etc. Other similar frameworks: Storm, Signal/Collect, etc. ...
  • 57. Summary and Conclusions. Summary.
  • 58. Conclusions. What is MapReduce? It is a framework inspired by functional programming for tackling problems whose steps can be parallelized by applying a divide-and-conquer approach. What can I do in MapReduce? Three main functions: (1) querying, (2) summarizing, and (3) analyzing large datasets in off-line mode to boost other on-line processes. How can I use and run a MapReduce framework? Identify the kind of problem you are addressing and apply a design pattern that can be implemented in a framework such as Apache Hadoop.
  • 61. What’s next? ... Concatenating MapReduce jobs; optimization using combiners and parameter settings (size of partition, etc.); pipelining with other languages such as Python; Hadoop in Action: more examples, etc.; new trending problems (image/video processing); real-time processing; ...
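As a pointer for the combiner item above, the Python sketch below shows the idea on word counts: a combiner pre-aggregates each mapper's output locally so that fewer pairs reach the shuffle. All names (`map_words`, `combine`, `reduce_counts`) and the two input splits are hypothetical, and the sketch runs in memory rather than on Hadoop.

```python
from collections import Counter


def map_words(line):
    # Mapper: emit (word, 1) for every word of one input split.
    return [(word, 1) for word in line.split()]


def combine(pairs):
    # Combiner: sum the counts per word locally, before the shuffle.
    local = Counter()
    for word, count in pairs:
        local[word] += count
    return list(local.items())


def reduce_counts(pairs):
    # Reducer: sum the counts per word across all mappers.
    totals = Counter()
    for word, count in pairs:
        totals[word] += count
    return dict(totals)


splits = ["a b a a", "b b a"]
raw = [map_words(split) for split in splits]
combined = [combine(pairs) for pairs in raw]

# Number of (key, value) pairs that would cross the network in each case.
shuffled_without_combiner = sum(len(pairs) for pairs in raw)
shuffled_with_combiner = sum(len(pairs) for pairs in combined)

result_without = reduce_counts(pair for pairs in raw for pair in pairs)
result_with = reduce_counts(pair for pairs in combined for pair in pairs)

print(shuffled_without_combiner, shuffled_with_combiner, result_with)
```

The final counts are identical in both cases because addition tolerates this regrouping; an operation without that property would need a different intermediate representation.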