Introduction to MapReduce

   Zuhair Khayyat
     3/11/2012
What is MapReduce?

   ●   A programming model introduced by Google at OSDI '04 for
       processing large datasets efficiently.
   ●   Features:
               –   Automatic parallelization; no parallel-programming experience required.
               –   Data and process redundancy for failure recovery.
               –   Automatic scheduling and load balancing.
               –   Easy to program, based on two simple functions:
                        ●   Map.
                        ●   Reduce.

CS245 - 2012                      Introduction to MapReduce                 2
Why MapReduce?

   ●   For a cluster of:
               –   2000 machines.
               –   Total 16 TB RAM (≈ 8 GB each).
               –   Total 2 PB Disk space (≈ 1 TB each).
   ●   Use the maximum capacity of the cluster to:
               –   Implement a parallel word count for input size 100 TB.
               –   Implement a parallel sort for the same input file.
   ●    Can you use the same code for both applications?

How Fast is MapReduce (Hadoop)?

   ●   Sort Benchmark competition (http://sortbenchmark.org/):
               –   2009: 100 TB in 173 minutes using 3452 nodes:
                        ●   2 x Quad-core Xeons @ 2.5 GHz.
                        ●   8 GB RAM.
               –   2008: 1 TB in 3.48 minutes using 910 nodes:
                        ●   4 x Dual-core Xeons @ 2.0 GHz.
                        ●   8 GB RAM.



Who uses MapReduce?




Map & Reduce functions

   ●   The Mapper (pick a key):
               –   Input: reads its input partition from disk.
               –   Output: creates <key, value> pairs, known as
                    intermediate pairs.
               –   More input partitions == more parallel Mappers.
   ●   The Reducer (process values):
               –   Input: a list of <key, value> pairs that share one key.
               –   Output: one or more <key, value> pairs.
               –   More unique keys == more parallel Reducers.

How MapReduce Works

   1) Partition the input file into M partitions.
   2) Create M Map tasks that read the M partitions in parallel and emit
      intermediate <key, value> pairs, stored on local storage.
   3) Wait for all Map workers to finish, then sort and partition the
      intermediate <key, value> pairs into R regions.
   4) Start R Reduce workers; each reads the list of intermediate pairs
      that share a key from remote disks.
   5) Write the output of the Reduce workers to file(s).
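
The five steps above can be sketched in a single Python process. This is only an illustration under stated assumptions, not how Hadoop or Google's system is implemented: the names `run_mapreduce`, `region_of`, `word_count_map` and `word_count_reduce` are hypothetical, the CRC32-based partitioner is an arbitrary deterministic choice, and the tasks run one after another instead of in parallel on a cluster.

```python
import zlib
from collections import defaultdict


def region_of(key, num_regions):
    """Deterministic partitioner (an assumption): map a key to one of R regions."""
    return zlib.crc32(key.encode("utf-8")) % num_regions


def word_count_map(partition):
    """Map: emit <word, 1> for every word in one input partition."""
    for word in partition.split():
        yield word, 1


def word_count_reduce(key, values):
    """Reduce: sum all values that share one key."""
    return key, sum(values)


def run_mapreduce(partitions, mapper, reducer, num_regions):
    # Step 1: the input is assumed to be already split into M partitions.
    # Step 2: one Map task per partition (run one after another here).
    map_outputs = [list(mapper(p)) for p in partitions]

    # Step 3: once every Map task is done, partition the intermediate
    # pairs into R regions and sort each region by key.
    regions = [[] for _ in range(num_regions)]
    for output in map_outputs:
        for key, value in output:
            regions[region_of(key, num_regions)].append((key, value))
    for region in regions:
        region.sort()

    # Step 4: one Reduce task per region; group the values of each key
    # and call the reducer once per key.
    results = []
    for region in regions:
        grouped = defaultdict(list)
        for key, value in region:
            grouped[key].append(value)
        results.extend(reducer(key, values) for key, values in grouped.items())

    # Step 5: "write" the output; here it is simply returned sorted by key.
    return sorted(results)


lines = ["cat flower picture", "snow cat cat", "prince flower sun", "king queen AC"]
counts = run_mapreduce(lines, word_count_map, word_count_reduce, num_regions=3)
print(counts)
```

Only `word_count_map` and `word_count_reduce` are specific to word count; the driver itself does not depend on the application.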


Example – Word count

   ●   Assume an input as following:

       cat flower picture
       snow cat cat
       prince flower sun
       king queen AC




Example – Word count
   ●   Step 1: Partition input file into M partitions (here M = 4, one line each).

       Partition 1:   cat flower picture
       Partition 2:   snow cat cat
       Partition 3:   prince flower sun
       Partition 4:   king queen AC




Example – Word count
   ●   Step 2: Create M Map tasks that read the M partitions in parallel and
       emit intermediate <key, value> pairs, stored on local storage.

       cat flower picture     Mapper 1     <cat,1> <flower,1> <picture,1>
       snow cat cat           Mapper 2     <snow,1> <cat,1> <cat,1>
       prince flower sun      Mapper 3     <prince,1> <flower,1> <sun,1>
       king queen AC          Mapper 4     <king,1> <queen,1> <AC,1>

Example – Word count
   ●   Step 3: Wait for all Map workers to finish, then sort and partition the
       intermediate <key, value> pairs into R regions.

       Mapper output                        Sorted per mapper
       <cat,1> <flower,1> <picture,1>       <cat,1> <flower,1> <picture,1>
       <snow,1> <cat,1> <cat,1>             <cat,1> <cat,1> <snow,1>
       <prince,1> <flower,1> <sun,1>        <flower,1> <prince,1> <sun,1>
       <king,1> <queen,1> <AC,1>            <AC,1> <king,1> <queen,1>

       Sorted and partitioned by key:
       <AC,1> <cat,1> <cat,1> <cat,1> <flower,1> <flower,1> <king,1>
       <picture,1> <prince,1> <queen,1> <snow,1> <sun,1>

Example – Word count
   ●   Step 4: Start R Reduce workers; each reads the list of intermediate
       pairs that share a key from remote disks.

       <AC,1>                       Reducer 1     <AC,1>
       <cat,1> <cat,1> <cat,1>      Reducer 2     <cat,3>
       <flower,1> <flower,1>        Reducer 3     <flower,2>
       <king,1>
       <picture,1>
       <prince,1>
       <queen,1>
       <snow,1>
       <sun,1>                      Reducer 9     <sun,1>

Example – Word count
   ●   Step 5: Write the output of reduce workers to file(s).

       <AC,1>
       <cat,3>
       <flower,2>
       <king,1>
       <picture,1>
       <prince,1>
       <queen,1>
       <snow,1>
       <sun,1>

MapReduce framework




MapReduce Failure Recovery

   ●   The framework follows the master-worker paradigm.
   ●   The master keeps records of the work done on each worker.
   ●   If a worker fails, the master assigns the same work to another
       worker.
   ●   If a worker is slow, another copy of the same work is assigned
       to another worker.
   ●   If the master fails, a backup copy of the master can take over
       and continue execution from the last checkpoint.


Advantages of MapReduce

   ●   Parallel I/O: hides disk latency.
   ●   Parallel processing:
               –   Map functions work independently in parallel, each
                    processing one unique partition.
               –   Reduce functions work independently in parallel, each
                    on a unique intermediate key.
   ●   Using large clusters of commodity machines gives results
       comparable to small expensive clusters.



Hadoop vs. others
   ●   Algorithm: Sorting 100 TB of data.

                       Hadoop                         DEMSort                        TritonSort
       Node count      3452                           195                            47
       Processor       2x Quad-core Xeon @ 2.5 GHz    2x Quad-core Xeon @ 2.6 GHz    2x Quad-core Xeon @ 2.27 GHz
       Memory          8 GB                           16 GB                          24 GB
       Network         1 Gigabit Ethernet             InfiniBand                     10 Gigabit Fiber
       Throughput      0.578 TB/min                   0.564 TB/min                   0.582 TB/min
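
The total throughputs in the table are nearly equal, but the node counts are not. The short calculation below divides each throughput by its node count to show the per-node rate. It uses only the figures in the table; the conversion 1 TB = 1000 GB is an assumption made here for readability.

```python
# Figures copied from the comparison table: (node count, throughput in TB/min).
systems = {
    "Hadoop": (3452, 0.578),
    "DEMSort": (195, 0.564),
    "TritonSort": (47, 0.582),
}

GB_PER_TB = 1000  # assumption: decimal units

# Per-node sorting rate in GB/min.
per_node = {
    name: throughput * GB_PER_TB / nodes
    for name, (nodes, throughput) in systems.items()
}

for name, value in per_node.items():
    print(f"{name:<12}{value:8.3f} GB/min per node")
```

Under these numbers the per-node rates differ by more than an order of magnitude, even though the cluster-level throughput is comparable.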



MapReduce weak points

   ●   Overhead of MapReduce is huge.
   ●   Data-dependent applications may need multiple iterations of
       MapReduce, for example:
               –   K-means.
               –   PageRank.
   ●   Complex algorithms can be very hard to implement, for example:
               –   Range queries.
   ●   Sensitive to skewed distributions of <key, value> pairs.
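
The sensitivity to skew can be illustrated with a small, made-up key distribution. In the sketch below, the key list, the number of regions and the CRC32-based partitioner `region_of` are all assumptions chosen for illustration. Because every pair with the same key must go to the same region, one frequent key forces most of the work onto a single Reduce worker.

```python
import zlib
from collections import Counter


def region_of(key, num_regions):
    """Hypothetical partitioner: all pairs with the same key land in one region."""
    return zlib.crc32(key.encode("utf-8")) % num_regions


# Hypothetical skewed intermediate keys: one hot key and ten rare ones.
keys = ["the"] * 90 + [f"word{i}" for i in range(10)]
num_regions = 4

# Count how many intermediate pairs each region (Reduce worker) receives.
load = Counter(region_of(key, num_regions) for key in keys)
loads = [load[r] for r in range(num_regions)]
print(loads)
```

Whichever region receives the hot key holds at least 90 of the 100 pairs, so the job finishes only when that one worker does.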

Implementations of MapReduce

   ●   Hadoop in Java.
   ●   Mars in C++ & CUDA.
   ●   Skynet in Ruby.
   ●   Phoenix in C++.
   ●   Microsoft Dryad:
               –   Schedules multiple levels of MapReduce-like
                     operations.



MapReduce in Database



MapReduce in Database - Ex1

   ●   Select Name from Students where age = 23;


                     Students:
                      Name          ID               Age
                      Ahmed        1177              23
                       Bob         1131              20
                       Sara        1197              22
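
One possible way to express this selection is as a map-only job. The sketch below is an assumption about how the exercise could be solved, not the lecture's official answer; `select_map` is a hypothetical name and the rows are copied from the table above.

```python
# Rows of the Students table: (Name, ID, Age).
students = [
    ("Ahmed", 1177, 23),
    ("Bob", 1131, 20),
    ("Sara", 1197, 22),
]


def select_map(row):
    """Map: emit <Name, None> for every row with Age = 23."""
    name, _student_id, age = row
    if age == 23:
        yield name, None


# Map-only job: each row is filtered independently, so no Reduce phase is needed.
result = [name for row in students for name, _ in select_map(row)]
print(result)  # ['Ahmed']
```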




MapReduce in Database - Ex2

   ●   Select COUNT(Name) from Students where age > 20 group
       by Name;

                    Students:
                      Name         ID               Age
                     Ahmed        1177              23
                      Bob         1131              20
                      Sara        1197              22
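
A grouped count maps naturally onto the word-count pattern: the Mapper filters and picks the grouping column as the key, and the Reducer sums. The following is a sketch under that assumption; `count_map` and `count_reduce` are hypothetical names, and the grouping step stands in for the framework's shuffle.

```python
from collections import defaultdict

# Rows of the Students table: (Name, ID, Age).
students = [
    ("Ahmed", 1177, 23),
    ("Bob", 1131, 20),
    ("Sara", 1197, 22),
]


def count_map(row):
    """Map: apply the filter Age > 20 and emit <Name, 1>."""
    name, _student_id, age = row
    if age > 20:
        yield name, 1


def count_reduce(name, values):
    """Reduce: count the rows in each Name group."""
    return name, sum(values)


# Shuffle: group intermediate values by key (done by the framework in practice).
grouped = defaultdict(list)
for row in students:
    for key, value in count_map(row):
        grouped[key].append(value)

result = sorted(count_reduce(name, values) for name, values in grouped.items())
print(result)  # [('Ahmed', 1), ('Sara', 1)]
```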




MapReduce in Database - Ex3

   ●    Select Name, Term from Students, Enrolment where ID = SID
        and age != 20;

   Students:                         Enrolment:
       Name     ID     Age                CID            SID    Term
       Ahmed   1177     23               CS290           1177   042
        Bob    1131     20               CS260           1177   052
       Sara    1197     22              ME222            1131   051
                                      AMCS220            1197   051
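
An equality join such as ID = SID can be sketched as a reduce-side join: both tables are mapped to the join key, each value is tagged with its source table, and the Reducer pairs them up. This is one common technique, shown here as an assumption about how the exercise could be approached; the function names and the "S"/"E" tags are hypothetical.

```python
from collections import defaultdict

# Students: (Name, ID, Age); Enrolment: (CID, SID, Term).
students = [("Ahmed", 1177, 23), ("Bob", 1131, 20), ("Sara", 1197, 22)]
enrolment = [
    ("CS290", 1177, "042"),
    ("CS260", 1177, "052"),
    ("ME222", 1131, "051"),
    ("AMCS220", 1197, "051"),
]


def students_map(row):
    """Map over Students: apply Age != 20 and emit <ID, ("S", Name)>."""
    name, student_id, age = row
    if age != 20:
        yield student_id, ("S", name)


def enrolment_map(row):
    """Map over Enrolment: emit <SID, ("E", Term)>."""
    _cid, sid, term = row
    yield sid, ("E", term)


def join_reduce(_key, tagged_values):
    """Reduce: pair every student name with every term for the same key."""
    names = [value for tag, value in tagged_values if tag == "S"]
    terms = [value for tag, value in tagged_values if tag == "E"]
    for name in names:
        for term in terms:
            yield name, term


# Shuffle: group tagged values from both tables by join key.
grouped = defaultdict(list)
for row in students:
    for key, value in students_map(row):
        grouped[key].append(value)
for row in enrolment:
    for key, value in enrolment_map(row):
        grouped[key].append(value)

result = sorted(pair for key in grouped for pair in join_reduce(key, grouped[key]))
print(result)  # [('Ahmed', '042'), ('Ahmed', '052'), ('Sara', '051')]
```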




MapReduce in Database - Ex4

   ●    Select Name, Term from Students, Enrolment where ID !=
        SID;
   Students:                          Enrolment:
       Name      ID      Age                CID           SID    Term
       Ahmed    1177     23               CS290           1177   042
        Bob     1131     20               CS260           1177   052
       Sara     1197     22               ME222           1131   051
                                        AMCS220           1197   051

   ●    What if the condition is ID > SID?
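
With ID != SID or ID > SID, matching rows no longer share a key, so grouping by the join column does not bring them together. One possible workaround, sketched below as an assumption and not as the lecture's intended answer, is to send a full copy of the smaller table (Students) to every Map task and let each task compare it against its own partition of Enrolment. The function `theta_join_map` and the two-way split of Enrolment are hypothetical.

```python
import operator

# Students: (Name, ID, Age); Enrolment: (CID, SID, Term).
students = [("Ahmed", 1177, 23), ("Bob", 1131, 20), ("Sara", 1197, 22)]
enrolment = [
    ("CS290", 1177, "042"),
    ("CS260", 1177, "052"),
    ("ME222", 1131, "051"),
    ("AMCS220", 1197, "051"),
]


def theta_join_map(enrolment_partition, all_students, predicate):
    """Map: compare each Enrolment row in this partition with every Students row."""
    for _cid, sid, term in enrolment_partition:
        for name, student_id, _age in all_students:
            if predicate(student_id, sid):
                yield name, term


# Two hypothetical Enrolment partitions, one per Map task.
partitions = [enrolment[:2], enrolment[2:]]


def run(predicate):
    """Map-only job: every task receives a full copy of Students."""
    return sorted(
        pair
        for part in partitions
        for pair in theta_join_map(part, students, predicate)
    )


not_equal = run(operator.ne)  # where ID != SID
greater = run(operator.gt)    # where ID > SID
print(len(not_equal), greater)
```

The same code handles both conditions because only the predicate changes. The cost is that every Map task must hold the whole Students table, which is practical only when that table is small.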

MapReduce in Database - Ex5

    ●   Select Name, Term from Students, Enrolment where ID = SID
        and Admission != Term;

   Students:                                      Enrolment:
       Name     ID     Age    Admission               CID        SID    Term
       Ahmed   1177     23       042                  CS290      1177   042
        Bob    1131     20       051                  CS260      1177   052
       Sara    1197     22       042                  ME222      1131   051
                                                      AMCS220    1197   051




MapReduce in Database - Ex6

   ●    Select y from R, S, T where R.x = S.x and T.a = S.a;


   R:                                                  S:
         x         y             z                          a   b   x


                       T:
                            m                n              a




MapReduce in Academic Papers
   ●   NIPS '07: Map-Reduce for Machine Learning on Multicore.
   ●   Escience '08: CloudBLAST: Combining MapReduce and Virtualization on
       Distributed Resources for Bioinformatics Applications.
   ●   KDD '09: Large-scale behavioral targeting.
   ●   GCC '09: Spatial Queries Evaluation with MapReduce.
   ●   SIGIR '09: On single-pass indexing with MapReduce.
   ●   MDAC '10: A novel approach to multiple sequence alignment using
       hadoop data grids.
   ●   VLDB Endowment '11: Social Content Matching in MapReduce.
   ●   VLDB '12: Building Wavelet Histograms on Large Data in MapReduce.

Links
●   http://code.google.com/edu/parallel/mapreduce-tutorial.html
●   http://hadoop.apache.org/mapreduce/
●   http://www.cse.ust.hk/gpuqp/Mars.html
●   http://skynet.rubyforge.org/
●   http://mapreduce.stanford.edu/
●   http://wiki.apache.org/hadoop/PoweredBy
●   http://atbrox.com/2011/05/16/mapreduce-hadoop-algorithms-in-academic-papers-4th-update-may-2011/


CS245 - 2012              Introduction to MapReduce               30

More Related Content

Similar to MapReduce

Intro to Map Reduce
Intro to Map ReduceIntro to Map Reduce
Intro to Map Reduce
Doron Vainrub
 
Graph analysis platform comparison, pregel/goldenorb/giraph
Graph analysis platform comparison, pregel/goldenorb/giraphGraph analysis platform comparison, pregel/goldenorb/giraph
Graph analysis platform comparison, pregel/goldenorb/giraph
Andrew Yongjoon Kong
 
My mapreduce1 presentation
My mapreduce1 presentationMy mapreduce1 presentation
My mapreduce1 presentationNoha Elprince
 
Cloud computing_processing frameworks
Cloud computing_processing frameworksCloud computing_processing frameworks
Cloud computing_processing frameworks
Reem Abdel-Rahman
 
MapReduce: teoria e prática
MapReduce: teoria e práticaMapReduce: teoria e prática
MapReduce: teoria e prática
PET Computação
 
Brief introduction on Hadoop,Dremel, Pig, FlumeJava and Cassandra
Brief introduction on Hadoop,Dremel, Pig, FlumeJava and CassandraBrief introduction on Hadoop,Dremel, Pig, FlumeJava and Cassandra
Brief introduction on Hadoop,Dremel, Pig, FlumeJava and CassandraSomnath Mazumdar
 
Geoff Rothman Presentation on Parallel Processing
Geoff Rothman Presentation on Parallel ProcessingGeoff Rothman Presentation on Parallel Processing
Geoff Rothman Presentation on Parallel Processing
Geoff Rothman
 
Map reduce
Map reduceMap reduce
Map reduce
Somesh Maliye
 
Mapreduce introduction
Mapreduce introductionMapreduce introduction
Mapreduce introduction
Yogender Singh
 
Sparse matrix computations in MapReduce
Sparse matrix computations in MapReduceSparse matrix computations in MapReduce
Sparse matrix computations in MapReduce
David Gleich
 
Hanborq Optimizations on Hadoop MapReduce
Hanborq Optimizations on Hadoop MapReduceHanborq Optimizations on Hadoop MapReduce
Hanborq Optimizations on Hadoop MapReduce
Hanborq Inc.
 
Apache Spark - Basics of RDD | Big Data Hadoop Spark Tutorial | CloudxLab
Apache Spark - Basics of RDD | Big Data Hadoop Spark Tutorial | CloudxLabApache Spark - Basics of RDD | Big Data Hadoop Spark Tutorial | CloudxLab
Apache Spark - Basics of RDD | Big Data Hadoop Spark Tutorial | CloudxLab
CloudxLab
 
Real-time and Long-time Together
Real-time and Long-time TogetherReal-time and Long-time Together
Real-time and Long-time Together
MapR Technologies
 
Hanborq optimizations on hadoop map reduce 20120221a
Hanborq optimizations on hadoop map reduce 20120221aHanborq optimizations on hadoop map reduce 20120221a
Hanborq optimizations on hadoop map reduce 20120221a
Schubert Zhang
 
MapReduce
MapReduceMapReduce
MapReduce - Basics | Big Data Hadoop Spark Tutorial | CloudxLab
MapReduce - Basics | Big Data Hadoop Spark Tutorial | CloudxLabMapReduce - Basics | Big Data Hadoop Spark Tutorial | CloudxLab
MapReduce - Basics | Big Data Hadoop Spark Tutorial | CloudxLab
CloudxLab
 
How to build TiDB
How to build TiDBHow to build TiDB
How to build TiDB
PingCAP
 
Zh Tw Introduction To Map Reduce
Zh Tw Introduction To Map ReduceZh Tw Introduction To Map Reduce
Zh Tw Introduction To Map Reduce
kevin liao
 

Similar to MapReduce (20)

Intro to Map Reduce
Intro to Map ReduceIntro to Map Reduce
Intro to Map Reduce
 
Graph analysis platform comparison, pregel/goldenorb/giraph
Graph analysis platform comparison, pregel/goldenorb/giraphGraph analysis platform comparison, pregel/goldenorb/giraph
Graph analysis platform comparison, pregel/goldenorb/giraph
 
02 Map Reduce
02 Map Reduce02 Map Reduce
02 Map Reduce
 
My mapreduce1 presentation
My mapreduce1 presentationMy mapreduce1 presentation
My mapreduce1 presentation
 
Cloud computing_processing frameworks
Cloud computing_processing frameworksCloud computing_processing frameworks
Cloud computing_processing frameworks
 
MapReduce: teoria e prática
MapReduce: teoria e práticaMapReduce: teoria e prática
MapReduce: teoria e prática
 
Brief introduction on Hadoop,Dremel, Pig, FlumeJava and Cassandra
Brief introduction on Hadoop,Dremel, Pig, FlumeJava and CassandraBrief introduction on Hadoop,Dremel, Pig, FlumeJava and Cassandra
Brief introduction on Hadoop,Dremel, Pig, FlumeJava and Cassandra
 
Geoff Rothman Presentation on Parallel Processing
Geoff Rothman Presentation on Parallel ProcessingGeoff Rothman Presentation on Parallel Processing
Geoff Rothman Presentation on Parallel Processing
 
Map reduce
Map reduceMap reduce
Map reduce
 
Mapreduce introduction
Mapreduce introductionMapreduce introduction
Mapreduce introduction
 
Sparse matrix computations in MapReduce
Sparse matrix computations in MapReduceSparse matrix computations in MapReduce
Sparse matrix computations in MapReduce
 
48a tuning
48a tuning48a tuning
48a tuning
 
Hanborq Optimizations on Hadoop MapReduce
Hanborq Optimizations on Hadoop MapReduceHanborq Optimizations on Hadoop MapReduce
Hanborq Optimizations on Hadoop MapReduce
 
Apache Spark - Basics of RDD | Big Data Hadoop Spark Tutorial | CloudxLab
Apache Spark - Basics of RDD | Big Data Hadoop Spark Tutorial | CloudxLabApache Spark - Basics of RDD | Big Data Hadoop Spark Tutorial | CloudxLab
Apache Spark - Basics of RDD | Big Data Hadoop Spark Tutorial | CloudxLab
 
Real-time and Long-time Together
Real-time and Long-time TogetherReal-time and Long-time Together
Real-time and Long-time Together
 
Hanborq optimizations on hadoop map reduce 20120221a
Hanborq optimizations on hadoop map reduce 20120221aHanborq optimizations on hadoop map reduce 20120221a
Hanborq optimizations on hadoop map reduce 20120221a
 
MapReduce
MapReduceMapReduce
MapReduce
 
MapReduce - Basics | Big Data Hadoop Spark Tutorial | CloudxLab
MapReduce - Basics | Big Data Hadoop Spark Tutorial | CloudxLabMapReduce - Basics | Big Data Hadoop Spark Tutorial | CloudxLab
MapReduce - Basics | Big Data Hadoop Spark Tutorial | CloudxLab
 
How to build TiDB
How to build TiDBHow to build TiDB
How to build TiDB
 
Zh Tw Introduction To Map Reduce
Zh Tw Introduction To Map ReduceZh Tw Introduction To Map Reduce
Zh Tw Introduction To Map Reduce
 

More from Zuhair khayyat

Scaling Big Data Cleansing
Scaling Big Data CleansingScaling Big Data Cleansing
Scaling Big Data Cleansing
Zuhair khayyat
 
BigDansing presentation slides for KAUST
BigDansing presentation slides for KAUSTBigDansing presentation slides for KAUST
BigDansing presentation slides for KAUST
Zuhair khayyat
 
IEJoin and Big Data Cleansing
IEJoin and Big Data CleansingIEJoin and Big Data Cleansing
IEJoin and Big Data Cleansing
Zuhair khayyat
 
BigDansing presentation slides for SIGMOD 2015
BigDansing presentation slides for SIGMOD 2015BigDansing presentation slides for SIGMOD 2015
BigDansing presentation slides for SIGMOD 2015
Zuhair khayyat
 
Presentation on "Mizan: A System for Dynamic Load Balancing in Large-scale Gr...
Presentation on "Mizan: A System for Dynamic Load Balancing in Large-scale Gr...Presentation on "Mizan: A System for Dynamic Load Balancing in Large-scale Gr...
Presentation on "Mizan: A System for Dynamic Load Balancing in Large-scale Gr...
Zuhair khayyat
 
Large Graph Processing
Large Graph ProcessingLarge Graph Processing
Large Graph Processing
Zuhair khayyat
 
Mizan: A System for Dynamic Load Balancing in Large-scale Graph Processing
Mizan: A System for Dynamic Load Balancing in Large-scale Graph ProcessingMizan: A System for Dynamic Load Balancing in Large-scale Graph Processing
Mizan: A System for Dynamic Load Balancing in Large-scale Graph Processing
Zuhair khayyat
 
Graphlab under the hood
Graphlab under the hoodGraphlab under the hood
Graphlab under the hoodZuhair khayyat
 

More from Zuhair khayyat (11)

Scaling Big Data Cleansing
Scaling Big Data CleansingScaling Big Data Cleansing
Scaling Big Data Cleansing
 
BigDansing presentation slides for KAUST
BigDansing presentation slides for KAUSTBigDansing presentation slides for KAUST
BigDansing presentation slides for KAUST
 
IEJoin and Big Data Cleansing
IEJoin and Big Data CleansingIEJoin and Big Data Cleansing
IEJoin and Big Data Cleansing
 
BigDansing presentation slides for SIGMOD 2015
BigDansing presentation slides for SIGMOD 2015BigDansing presentation slides for SIGMOD 2015
BigDansing presentation slides for SIGMOD 2015
 
Presentation on "Mizan: A System for Dynamic Load Balancing in Large-scale Gr...
Presentation on "Mizan: A System for Dynamic Load Balancing in Large-scale Gr...Presentation on "Mizan: A System for Dynamic Load Balancing in Large-scale Gr...
Presentation on "Mizan: A System for Dynamic Load Balancing in Large-scale Gr...
 
Large Graph Processing
Large Graph ProcessingLarge Graph Processing
Large Graph Processing
 
Mizan: A System for Dynamic Load Balancing in Large-scale Graph Processing
Mizan: A System for Dynamic Load Balancing in Large-scale Graph ProcessingMizan: A System for Dynamic Load Balancing in Large-scale Graph Processing
Mizan: A System for Dynamic Load Balancing in Large-scale Graph Processing
 
Google appengine
Google appengineGoogle appengine
Google appengine
 
Kineograph
KineographKineograph
Kineograph
 
Graphlab under the hood
Graphlab under the hoodGraphlab under the hood
Graphlab under the hood
 
Dynamo db
Dynamo dbDynamo db
Dynamo db
 

Recently uploaded

Digital Artifact 1 - 10VCD Environments Unit
Digital Artifact 1 - 10VCD Environments UnitDigital Artifact 1 - 10VCD Environments Unit
Digital Artifact 1 - 10VCD Environments Unit
chanes7
 
MATATAG CURRICULUM: ASSESSING THE READINESS OF ELEM. PUBLIC SCHOOL TEACHERS I...
MATATAG CURRICULUM: ASSESSING THE READINESS OF ELEM. PUBLIC SCHOOL TEACHERS I...MATATAG CURRICULUM: ASSESSING THE READINESS OF ELEM. PUBLIC SCHOOL TEACHERS I...
MATATAG CURRICULUM: ASSESSING THE READINESS OF ELEM. PUBLIC SCHOOL TEACHERS I...
NelTorrente
 
MERN Stack Developer Roadmap By ScholarHat PDF
MERN Stack Developer Roadmap By ScholarHat PDFMERN Stack Developer Roadmap By ScholarHat PDF
MERN Stack Developer Roadmap By ScholarHat PDF
scholarhattraining
 
PCOS corelations and management through Ayurveda.
PCOS corelations and management through Ayurveda.PCOS corelations and management through Ayurveda.
PCOS corelations and management through Ayurveda.
Dr. Shivangi Singh Parihar
 
Reflective and Evaluative Practice...pdf
Reflective and Evaluative Practice...pdfReflective and Evaluative Practice...pdf
Reflective and Evaluative Practice...pdf
amberjdewit93
 
Assignment_4_ArianaBusciglio Marvel(1).docx
Assignment_4_ArianaBusciglio Marvel(1).docxAssignment_4_ArianaBusciglio Marvel(1).docx
Assignment_4_ArianaBusciglio Marvel(1).docx
ArianaBusciglio
 
kitab khulasah nurul yaqin jilid 1 - 2.pptx
kitab khulasah nurul yaqin jilid 1 - 2.pptxkitab khulasah nurul yaqin jilid 1 - 2.pptx
kitab khulasah nurul yaqin jilid 1 - 2.pptx
datarid22
 
Lapbook sobre os Regimes Totalitários.pdf
Lapbook sobre os Regimes Totalitários.pdfLapbook sobre os Regimes Totalitários.pdf
Lapbook sobre os Regimes Totalitários.pdf
Jean Carlos Nunes Paixão
 
World environment day ppt For 5 June 2024
World environment day ppt For 5 June 2024World environment day ppt For 5 June 2024
World environment day ppt For 5 June 2024
ak6969907
 
clinical examination of hip joint (1).pdf
clinical examination of hip joint (1).pdfclinical examination of hip joint (1).pdf
clinical examination of hip joint (1).pdf
Priyankaranawat4
 
Pride Month Slides 2024 David Douglas School District
Pride Month Slides 2024 David Douglas School DistrictPride Month Slides 2024 David Douglas School District
Pride Month Slides 2024 David Douglas School District
David Douglas School District
 
How to Add Chatter in the odoo 17 ERP Module
How to Add Chatter in the odoo 17 ERP ModuleHow to Add Chatter in the odoo 17 ERP Module
How to Add Chatter in the odoo 17 ERP Module
Celine George
 
DRUGS AND ITS classification slide share
DRUGS AND ITS classification slide shareDRUGS AND ITS classification slide share
DRUGS AND ITS classification slide share
taiba qazi
 
Advantages and Disadvantages of CMS from an SEO Perspective
Advantages and Disadvantages of CMS from an SEO PerspectiveAdvantages and Disadvantages of CMS from an SEO Perspective
Advantages and Disadvantages of CMS from an SEO Perspective
Krisztián Száraz
 
The simplified electron and muon model, Oscillating Spacetime: The Foundation...
The simplified electron and muon model, Oscillating Spacetime: The Foundation...The simplified electron and muon model, Oscillating Spacetime: The Foundation...
The simplified electron and muon model, Oscillating Spacetime: The Foundation...
RitikBhardwaj56
 
What is the purpose of studying mathematics.pptx
What is the purpose of studying mathematics.pptxWhat is the purpose of studying mathematics.pptx
What is the purpose of studying mathematics.pptx
christianmathematics
 
RPMS TEMPLATE FOR SCHOOL YEAR 2023-2024 FOR TEACHER 1 TO TEACHER 3
RPMS TEMPLATE FOR SCHOOL YEAR 2023-2024 FOR TEACHER 1 TO TEACHER 3RPMS TEMPLATE FOR SCHOOL YEAR 2023-2024 FOR TEACHER 1 TO TEACHER 3
RPMS TEMPLATE FOR SCHOOL YEAR 2023-2024 FOR TEACHER 1 TO TEACHER 3
IreneSebastianRueco1
 
Top five deadliest dog breeds in America
Top five deadliest dog breeds in AmericaTop five deadliest dog breeds in America
Top five deadliest dog breeds in America
Bisnar Chase Personal Injury Attorneys
 
Azure Interview Questions and Answers PDF By ScholarHat
Azure Interview Questions and Answers PDF By ScholarHatAzure Interview Questions and Answers PDF By ScholarHat
Azure Interview Questions and Answers PDF By ScholarHat
Scholarhat
 
Biological Screening of Herbal Drugs in detailed.
Biological Screening of Herbal Drugs in detailed.Biological Screening of Herbal Drugs in detailed.
Biological Screening of Herbal Drugs in detailed.
Ashokrao Mane college of Pharmacy Peth-Vadgaon
 

Recently uploaded (20)

Digital Artifact 1 - 10VCD Environments Unit
Digital Artifact 1 - 10VCD Environments UnitDigital Artifact 1 - 10VCD Environments Unit
Digital Artifact 1 - 10VCD Environments Unit
 
MATATAG CURRICULUM: ASSESSING THE READINESS OF ELEM. PUBLIC SCHOOL TEACHERS I...
MATATAG CURRICULUM: ASSESSING THE READINESS OF ELEM. PUBLIC SCHOOL TEACHERS I...MATATAG CURRICULUM: ASSESSING THE READINESS OF ELEM. PUBLIC SCHOOL TEACHERS I...
MATATAG CURRICULUM: ASSESSING THE READINESS OF ELEM. PUBLIC SCHOOL TEACHERS I...
 
MERN Stack Developer Roadmap By ScholarHat PDF
MERN Stack Developer Roadmap By ScholarHat PDFMERN Stack Developer Roadmap By ScholarHat PDF
MERN Stack Developer Roadmap By ScholarHat PDF
 
PCOS corelations and management through Ayurveda.
PCOS corelations and management through Ayurveda.PCOS corelations and management through Ayurveda.
PCOS corelations and management through Ayurveda.
 
Reflective and Evaluative Practice...pdf
Reflective and Evaluative Practice...pdfReflective and Evaluative Practice...pdf
Reflective and Evaluative Practice...pdf
 
Assignment_4_ArianaBusciglio Marvel(1).docx
Assignment_4_ArianaBusciglio Marvel(1).docxAssignment_4_ArianaBusciglio Marvel(1).docx
Assignment_4_ArianaBusciglio Marvel(1).docx
 
kitab khulasah nurul yaqin jilid 1 - 2.pptx
kitab khulasah nurul yaqin jilid 1 - 2.pptxkitab khulasah nurul yaqin jilid 1 - 2.pptx
kitab khulasah nurul yaqin jilid 1 - 2.pptx
 
Lapbook sobre os Regimes Totalitários.pdf
Lapbook sobre os Regimes Totalitários.pdfLapbook sobre os Regimes Totalitários.pdf
Lapbook sobre os Regimes Totalitários.pdf
 
World environment day ppt For 5 June 2024
World environment day ppt For 5 June 2024World environment day ppt For 5 June 2024
World environment day ppt For 5 June 2024
 
clinical examination of hip joint (1).pdf
clinical examination of hip joint (1).pdfclinical examination of hip joint (1).pdf
clinical examination of hip joint (1).pdf
 
MapReduce

  • 1. Introduction to MapReduce Zuhair Khayyat 3/11/2012
  • 2. What is MapReduce ● A programming model introduced by Google at OSDI '04 for processing large datasets efficiently. ● Features: – Automatic parallelization, no parallel programming experience required. – Data and process redundancy for failure recovery. – Automatic scheduling and load balancing. – Easy to program, based on two simple functions: ● Map ● Reduce. CS245 - 2012 Introduction to MapReduce 2
  • 3. Why MapReduce? ● For a cluster of: – 2000 machines. – Total 16 TB RAM (≈ 8 GB each). – Total 2 PB disk space (≈ 1 TB each). ● Use the maximum capacity of the cluster to: – Implement a parallel word count for input size 100 TB.
  • 4. Why MapReduce? ● For a cluster of: – 2000 machines. – Total 16 TB RAM (≈ 8 GB each). – Total 2 PB disk space (≈ 1 TB each). ● Use the maximum capacity of the cluster to: – Implement a parallel word count for input size 100 TB. – Implement a parallel sort for the same input file. ● Can you use the same code for both applications?
  • 5. How Fast is MapReduce (Hadoop) ● Sort Benchmark competition (http://sortbenchmark.org/): – 2009: 100 TB in 173 minutes using 3452 nodes: ● 2 x Quad core Xeons @ 2.5 GHz. ● 8 GB RAM. – 2008: 1 TB in 3.48 minutes using 910 nodes: ● 4 x Dual core Xeons @ 2.0 GHz. ● 8 GB RAM.
  • 6. Who uses MapReduce?
  • 7. Map & Reduce functions ● The Mapper (pick a key): – Input: Read input from disk. – Output: Create <key, value> pairs, known as intermediate pairs. – More input partitions == more parallel Mappers. ● The Reducer (process values): – Input: The list of <key, value> pairs that share one unique key. – Output: One or more <key, value> pairs. – More unique keys == more parallel Reducers.
  • 8. How MapReduce Works 1) Partition the input file into M partitions. 2) Create M Map tasks that read the M partitions in parallel and emit intermediate <key, value> pairs. Store the pairs in local storage. 3) Wait for all Map workers to finish, then sort and partition the intermediate <key, value> pairs into R regions. 4) Start R Reduce workers; each reads the list of intermediate pairs for one unique key from remote disks. 5) Write the output of the Reduce workers to file(s).
  • 9. Example – Word count ● Assume the following input: cat flower picture snow cat cat prince flower sun king queen AC
  • 10. Example – Word count ● Step 1: Partition the input file into M partitions. Partition 1: cat flower picture. Partition 2: snow cat cat. Partition 3: prince flower sun. Partition 4: king queen AC.
  • 11. Example – Word count ● Step 2: Create M Map tasks that read the M partitions in parallel and emit intermediate <key, value> pairs. Store the pairs in local storage. Mapper 1: cat flower picture → <cat,1> <flower,1> <picture,1>. Mapper 2: snow cat cat → <snow,1> <cat,1> <cat,1>. Mapper 3: prince flower sun → <prince,1> <flower,1> <sun,1>. Mapper 4: king queen AC → <king,1> <queen,1> <AC,1>.
  • 12. Example – Word count ● Step 3: Wait for all Map workers to finish, then sort and partition the intermediate <key, value> pairs into R regions. Sorted pairs: <AC,1> <cat,1> <cat,1> <cat,1> <flower,1> <flower,1> <king,1> <picture,1> <prince,1> <queen,1> <snow,1> <sun,1>. Regions (one per key): [<AC,1>] [<cat,1> <cat,1> <cat,1>] [<flower,1> <flower,1>] [<king,1>] [<picture,1>] [<prince,1>] [<queen,1>] [<snow,1>] [<sun,1>].
  • 13. Example – Word count ● Step 4: Start R Reduce workers; each reads the list of intermediate pairs for one unique key from remote disks. Reducer 1: <AC,1> → <AC,1>. Reducer 2: <cat,1> <cat,1> <cat,1> → <cat,3>. Reducer 3: <flower,1> <flower,1> → <flower,2>. The remaining keys (<king,1> <picture,1> <prince,1> <queen,1> <snow,1>) are handled the same way, up to Reducer 9: <sun,1> → <sun,1>.
  • 14. Example – Word count ● Step 5: Write the output of the Reduce workers to file(s). <AC,1> <cat,3> <flower,2> <king,1> <picture,1> <prince,1> <queen,1> <snow,1> <sun,1>
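The five steps of the word count example can be simulated in a single process. The following is only a sketch: the function names (`map_word_count`, `reduce_word_count`, `run_mapreduce`) are made up for illustration, the map and reduce tasks run one after another instead of in parallel, and nothing is written to disk.

```python
from collections import defaultdict

def map_word_count(partition):
    """Mapper: emit an intermediate <word, 1> pair for every word."""
    return [(word, 1) for word in partition.split()]

def reduce_word_count(key, values):
    """Reducer: sum all counts that share one key."""
    return (key, sum(values))

def run_mapreduce(partitions, mapper, reducer):
    # Step 2: one map task per partition (run sequentially in this sketch).
    intermediate = []
    for partition in partitions:
        intermediate.extend(mapper(partition))
    # Step 3: sort the intermediate pairs and group them by key.
    groups = defaultdict(list)
    for key, value in sorted(intermediate):
        groups[key].append(value)
    # Steps 4 and 5: one reduce call per unique key, outputs collected in a list.
    return [reducer(key, values) for key, values in groups.items()]

# Step 1: the example input, already split into M = 4 partitions.
partitions = ["cat flower picture", "snow cat cat",
              "prince flower sun", "king queen AC"]
result = run_mapreduce(partitions, map_word_count, reduce_word_count)
for word, count in result:
    print(f"<{word},{count}>")
```

Swapping in a different mapper and reducer reuses `run_mapreduce` unchanged, which is the point of the programming model.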
  • 15. MapReduce framework
  • 16. MapReduce Failure Recovery ● The framework follows the master-worker paradigm. ● The master keeps a record of the work done on each worker. ● If a worker fails, the master assigns the same work to another worker. ● If a worker is slow, a second copy of the same work is assigned to another worker. ● If the master fails, a backup copy of the master can take over and continue execution from the last checkpoint.
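The worker-failure rule can be illustrated with a toy scheduler. This is a sketch under strong assumptions, not how any real framework is implemented: `run_with_recovery` and the `failed_workers` set are hypothetical, a failed worker is assumed to fail on every task, and slow workers and master failure are not modelled.

```python
def run_with_recovery(tasks, workers, failed_workers):
    """Assign each task to a worker and re-assign it if that worker has failed.

    failed_workers is a hypothetical set naming the workers that crash.
    """
    completed = {}  # the master's record: task -> worker that finished it
    for i, task in enumerate(tasks):
        # Try the workers round-robin, starting from a different one per task.
        for attempt in range(len(workers)):
            worker = workers[(i + attempt) % len(workers)]
            if worker in failed_workers:
                continue  # worker failed: the master gives the same task to the next one
            completed[task] = worker
            break
        else:
            raise RuntimeError(f"no live worker available for {task}")
    return completed

assignments = run_with_recovery(
    tasks=["map-0", "map-1", "map-2", "map-3"],
    workers=["w0", "w1", "w2"],
    failed_workers={"w1"},
)
for task, worker in assignments.items():
    print(task, "->", worker)
```

Re-running a task elsewhere is safe here because a map or reduce task only depends on its own input partition.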
  • 17. Advantages of MapReduce ● Parallel IO: hides disk latency. ● Parallel processing: – Map functions work independently in parallel, each processing one unique partition. – Reduce functions work independently in parallel, each on a unique intermediate key. ● Using large clusters of commodity machines gives better results than small expensive clusters.
  • 18. Advantages of MapReduce ● Parallel IO: hides disk latency. ● Parallel processing: – Map functions work independently in parallel, each processing one unique partition. – Reduce functions work independently in parallel, each on a unique intermediate key. ● Using large clusters of commodity machines gives results comparable to those of small expensive clusters.
  • 19. Hadoop vs. others ● Task: sorting 100 TB of data.
                     Hadoop                        DEMSort                       TritonSort
        Node count   3452                          195                           47
        Processor    2x quad-core Xeon @ 2.5 GHz   2x quad-core Xeon @ 2.6 GHz   2x quad-core Xeon @ 2.27 GHz
        Memory       8 GB                          16 GB                         24 GB
        Network      1 Gigabit Ethernet            InfiniBand                    10 Gigabit Fiber
        Throughput   0.578 TB/min                  0.564 TB/min                  0.582 TB/min
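The three systems have almost the same total throughput but very different node counts, so throughput per node is a useful extra figure. The short calculation below uses only the numbers from the comparison and assumes 1 TB = 1000 GB; the helper name `per_node_gb_per_min` is made up for this sketch.

```python
# Figures copied from the comparison: (node count, throughput in TB/min).
systems = {
    "Hadoop": (3452, 0.578),
    "DEMSort": (195, 0.564),
    "TritonSort": (47, 0.582),
}

def per_node_gb_per_min(nodes, tb_per_min):
    """Throughput per node in GB/min, assuming 1 TB = 1000 GB."""
    return tb_per_min * 1000 / nodes

per_node = {name: per_node_gb_per_min(n, t) for name, (n, t) in systems.items()}
for name, value in per_node.items():
    print(f"{name}: {value:.2f} GB/min per node")

# Cross-check: the 2009 Hadoop run sorted 100 TB in 173 minutes.
hadoop_tb_per_min = 100 / 173
print(f"Hadoop total: {hadoop_tb_per_min:.3f} TB/min")
```

Per node, this works out to roughly 0.17 GB/min for Hadoop, 2.9 GB/min for DEMSort and 12.4 GB/min for TritonSort. The comparison ignores the differences in memory and network hardware listed in the table.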
  • 20. MapReduce weak points ● The overhead of MapReduce is huge. ● Data-dependent applications may need multiple iterations of MapReduce, for example: – K-means. – PageRank. ● Complex algorithms can be very hard to implement, for example: – Range queries. ● Sensitive to skewed distributions of <key, value> pairs.
  • 21. Implementations of MapReduce ● Hadoop in Java. ● Mars in C++ & CUDA. ● Skynet in Ruby. ● Phoenix in C++. ● Microsoft Dryad: – Schedules multiple levels of “MapReduce”-like operations.
  • 22. MapReduce in Database
  • 23. MapReduce in Database - Ex1 ● Select Name from Students where age = 23;
        Students:
        Name    ID     Age
        Ahmed   1177   23
        Bob     1131   20
        Sara    1197   22
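One possible way to express Ex1 is as a map-only job: the mapper applies the WHERE clause and projects the Name column, and no reducer is needed. This is a sketch of one solution, not necessarily the one intended by the slides; `map_select` is a made-up name and the table is held in memory.

```python
# The Students table from Ex1.
students = [
    {"Name": "Ahmed", "ID": 1177, "Age": 23},
    {"Name": "Bob", "ID": 1131, "Age": 20},
    {"Name": "Sara", "ID": 1197, "Age": 22},
]

def map_select(row):
    """Mapper: keep rows with Age = 23 and emit <Name, None>."""
    if row["Age"] == 23:
        yield (row["Name"], None)

# No reduce phase: the mapper output is already the query result.
names = [key for row in students for key, _ in map_select(row)]
print(names)
```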
  • 24. MapReduce in Database - Ex2 ● Select COUNT(Name) from Students where age > 20 group by Name;
        Students:
        Name    ID     Age
        Ahmed   1177   23
        Bob     1131   20
        Sara    1197   22
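Ex2 maps naturally onto the word count pattern: the mapper filters on age and emits the GROUP BY column as the key, and the reducer counts the values per key. The sketch below is one possible solution with made-up function names (`map_count`, `reduce_count`), run in a single process.

```python
from collections import defaultdict

# The Students table from Ex2.
students = [
    {"Name": "Ahmed", "ID": 1177, "Age": 23},
    {"Name": "Bob", "ID": 1131, "Age": 20},
    {"Name": "Sara", "ID": 1197, "Age": 22},
]

def map_count(row):
    """Mapper: apply WHERE age > 20 and emit <Name, 1>."""
    if row["Age"] > 20:
        yield (row["Name"], 1)

def reduce_count(name, ones):
    """Reducer: COUNT(Name) for one group."""
    return (name, sum(ones))

# Shuffle: group the intermediate pairs by key.
groups = defaultdict(list)
for row in students:
    for key, value in map_count(row):
        groups[key].append(value)

counts = [reduce_count(name, ones) for name, ones in sorted(groups.items())]
print(counts)
```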
  • 25. MapReduce in Database - Ex3 ● Select Name, Term from Students, Enrolment where ID = SID and age != 20;
        Students:
        Name    ID     Age
        Ahmed   1177   23
        Bob     1131   20
        Sara    1197   22
        Enrolment:
        CID       SID    Term
        CS290     1177   042
        CS260     1177   052
        ME222     1131   051
        AMCS220   1197   051
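An equality join such as Ex3 is commonly written as a reduce-side join: both tables are mapped to the join key, each value is tagged with its source table, and the reducer pairs up the values that share a key. The following sketch shows that approach under the assumption that both tables fit in memory; the tags "S" and "E" and all function names are illustrative choices.

```python
from collections import defaultdict

students = [("Ahmed", 1177, 23), ("Bob", 1131, 20), ("Sara", 1197, 22)]  # Name, ID, Age
enrolment = [("CS290", 1177, "042"), ("CS260", 1177, "052"),
             ("ME222", 1131, "051"), ("AMCS220", 1197, "051")]          # CID, SID, Term

def map_student(row):
    """Mapper for Students: apply age != 20, key by ID, tag the value with 'S'."""
    name, student_id, age = row
    if age != 20:
        yield (student_id, ("S", name))

def map_enrolment(row):
    """Mapper for Enrolment: key by SID, tag the value with 'E'."""
    cid, sid, term = row
    yield (sid, ("E", term))

def reduce_join(key, tagged_values):
    """Reducer: pair every student name with every term that shares the key."""
    names = [v for tag, v in tagged_values if tag == "S"]
    terms = [v for tag, v in tagged_values if tag == "E"]
    return [(name, term) for name in names for term in terms]

# Shuffle: group the tagged values from both tables by join key.
groups = defaultdict(list)
for row in students:
    for key, value in map_student(row):
        groups[key].append(value)
for row in enrolment:
    for key, value in map_enrolment(row):
        groups[key].append(value)

joined = []
for key in sorted(groups):
    joined.extend(reduce_join(key, groups[key]))
print(joined)
```

Bob's enrolment still reaches a reducer, but it finds no matching student value there because the mapper filtered Bob out.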
  • 26. MapReduce in Database - Ex4 ● Select Name, Term from Students, Enrolment where ID != SID;
        Students:
        Name    ID     Age
        Ahmed   1177   23
        Bob     1131   20
        Sara    1197   22
        Enrolment:
        CID       SID    Term
        CS290     1177   042
        CS260     1177   052
        ME222     1131   051
        AMCS220   1197   051
        ● What if the condition is ID > SID?
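With a condition like ID != SID, grouping by the join key no longer brings matching rows together. One common workaround, shown here as a sketch and not necessarily the answer the slides have in mind, is to send a full copy of the smaller table to every mapper and compare it against the local partition of the larger table. The names `map_theta_join` and `run_theta_join` are made up, and the approach assumes Students is small enough to replicate.

```python
import operator

students = [("Ahmed", 1177, 23), ("Bob", 1131, 20), ("Sara", 1197, 22)]  # Name, ID, Age
enrolment = [("CS290", 1177, "042"), ("CS260", 1177, "052"),
             ("ME222", 1131, "051"), ("AMCS220", 1197, "051")]          # CID, SID, Term

def map_theta_join(enrolment_partition, all_students, predicate):
    """Mapper: compare each local Enrolment row with every Students row."""
    for cid, sid, term in enrolment_partition:
        for name, student_id, age in all_students:
            if predicate(student_id, sid):
                yield (name, term)

def run_theta_join(partitions, all_students, predicate):
    """Run one mapper per Enrolment partition; no reduce phase is needed."""
    return [pair
            for part in partitions
            for pair in map_theta_join(part, all_students, predicate)]

# Two Enrolment partitions; every mapper also receives all of Students.
partitions = [enrolment[:2], enrolment[2:]]
not_equal = run_theta_join(partitions, students, operator.ne)  # ID != SID
greater = run_theta_join(partitions, students, operator.gt)    # ID > SID
print(len(not_equal), greater)
```

Only the predicate changes between the two conditions, but the work is proportional to the product of the table sizes, which is why such joins are expensive in MapReduce.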
  • 27. MapReduce in Database - Ex5 ● Select Name, Term from Students, Enrolment where ID = SID and Admission != Term;
        Students:
        Name    ID     Age   Admission
        Ahmed   1177   23    042
        Bob     1131   20    051
        Sara    1197   22    042
        Enrolment:
        CID       SID    Term
        CS290     1177   042
        CS260     1177   052
        ME222     1131   051
        AMCS220   1197   051
  • 28. MapReduce in Database - Ex6 ● Select y from R, S, T where R.x = S.x and T.a = S.a; Relations: R(x, y, z); S(a, b, x); T(m, n, a).
  • 29. MapReduce in Academic Papers ● NIPS '07: Map-Reduce for Machine Learning on Multicore. ● Escience '08: CloudBLAST: Combining MapReduce and Virtualization on Distributed Resources for Bioinformatics Applications. ● KDD '09: Large-scale behavioral targeting. ● GCC '09: Spatial Queries Evaluation with MapReduce. ● SIGIR '09: On single-pass indexing with MapReduce. ● MDAC '10: A novel approach to multiple sequence alignment using hadoop data grids. ● VLDB Endowment '11: Social Content Matching in MapReduce. ● VLDB '12: Building Wavelet Histograms on Large Data in MapReduce.
  • 30. Links ● http://code.google.com/edu/parallel/mapreduce-tutorial.html ● http://hadoop.apache.org/mapreduce/ ● http://www.cse.ust.hk/gpuqp/Mars.html ● http://skynet.rubyforge.org/ ● http://mapreduce.stanford.edu/ ● http://wiki.apache.org/hadoop/PoweredBy ● http://atbrox.com/2011/05/16/mapreduce-hadoop-algorithms-in-academic-papers-4th-update-may-2011/