An Introduction to MapReduce
             Presented by Frane Bandov
    at the Operating Complex IT-Systems seminar
                  Berlin, 1/26/2010
Outline
•  Introduction
•  Google MapReduce
    –  Idea
    –  Overview
    –  Fault Tolerance
    –  GFS: Google File System
    –  Job Example
•  Alternative Implementations
•  Reception and Criticism
•  Trends and Future Development
•  Conclusion
2/16/10             An Introduction to MapReduce   2
Introduction – Problem
Sometimes we have to deal with huge amounts of data
[Figure: bar chart comparing data volumes in TBytes (axis 0 to 250) for You, Facebook, Yahoo! Groups, and the German Climate Computing Centre]
Introduction – Problem
The data needs to be processed, but how?

Can't process all of this data on one machine
→ Distribute the processing to many machines
Introduction – Approach
Distributed computing is the solution
“Let’s write our own distributed computing software as a solution to our problem”

Checklist
•  design protocols
•  design data structures
•  write the code
•  ensure fault tolerance

→ Development takes a long time
→ Expensive: cost-benefit ratio?

Build complex software for simple computations?
Google MapReduce – Idea
A framework for distributed computing

You don't have to care about protocols, fault tolerance, etc.

Just write your simple computation
Google MapReduce – Idea
MapReduce Paradigm

Map: apply a function to all elements of a list

square x = x * x;
map square [1, 2, 3, 4, 5];
→ [1, 4, 9, 16, 25]

Reduce: combine all elements of a list

reduce (+) [1, 2, 3, 4, 5];
→ 15
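The two primitives can be tried out directly in Python. This is only an illustration of the paradigm with Python's built-in `map` and `functools.reduce`, not part of any MapReduce framework; the names `squares` and `total` are chosen here for the example.

```python
from functools import reduce
import operator

def square(x):
    """Return x multiplied by itself."""
    return x * x

# Map: apply a function to every element of a list.
squares = list(map(square, [1, 2, 3, 4, 5]))

# Reduce: combine all elements of a list with a binary operator.
total = reduce(operator.add, [1, 2, 3, 4, 5])

print(squares)  # [1, 4, 9, 16, 25]
print(total)    # 15
```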
Google MapReduce – Idea
Basic functioning

Input → Map → Reduce → Output
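The Input → Map → Reduce → Output flow can be sketched in a single process. Word counting is used here as an assumed example job; the helper names (`map_phase`, `group_by_key`, `reduce_phase`) are hypothetical and only mimic what a framework would do between the user's two functions.

```python
from collections import defaultdict

def map_phase(records, map_fn):
    """Apply map_fn to every input record and collect the emitted (key, value) pairs."""
    pairs = []
    for record in records:
        pairs.extend(map_fn(record))
    return pairs

def group_by_key(pairs):
    """Group all values by key (the step between Map and Reduce)."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(groups, reduce_fn):
    """Apply reduce_fn to each key and its list of values."""
    return {key: reduce_fn(key, values) for key, values in groups.items()}

# The user's "simple computation": count words.
def word_count_map(line):
    return [(word, 1) for word in line.split()]

def word_count_reduce(word, counts):
    return sum(counts)

lines = ["to be or not to be", "to map or to reduce"]
result = reduce_phase(group_by_key(map_phase(lines, word_count_map)), word_count_reduce)
print(result)
```

Only `word_count_map` and `word_count_reduce` are specific to the job; everything else stays the same for any other computation.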
Google MapReduce – Overview
[Figure: execution overview. A MapReduce-based user program, a master, and workers. Map phase: the input file in GFS is divided into splits 1 to 5, and three map workers write intermediate files 1 to 3. Reduce phase: two reduce workers write output files 1 and 2 to GFS.]
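The roles in the overview (input splits, map workers writing intermediate data, reduce workers writing output files) can be imitated in one process. This is a minimal sketch under stated assumptions: the round-robin splitting, the CRC32-based `partition` function, and all function names are hypothetical choices for illustration, not Google's implementation.

```python
import zlib
from collections import defaultdict

def split_input(records, num_splits):
    """Divide the input into splits, one per map task (round-robin for simplicity)."""
    return [records[i::num_splits] for i in range(num_splits)]

def partition(key, num_reducers):
    """Pick the reduce task for a key (a stable hash, so runs are repeatable)."""
    return zlib.crc32(key.encode("utf-8")) % num_reducers

def run_map_task(split, map_fn, num_reducers):
    """Run one map task; return its intermediate data, bucketed per reduce task."""
    buckets = [[] for _ in range(num_reducers)]
    for record in split:
        for key, value in map_fn(record):
            buckets[partition(key, num_reducers)].append((key, value))
    return buckets

def run_reduce_task(reducer_id, all_buckets, reduce_fn):
    """Run one reduce task: fetch its bucket from every map task, group, reduce."""
    groups = defaultdict(list)
    for buckets in all_buckets:
        for key, value in buckets[reducer_id]:
            groups[key].append(value)
    return {key: reduce_fn(key, values) for key, values in sorted(groups.items())}

def run_job(records, map_fn, reduce_fn, num_splits=5, num_reducers=2):
    """Play the master's role: hand out map tasks, then reduce tasks."""
    splits = split_input(records, num_splits)
    intermediate = [run_map_task(s, map_fn, num_reducers) for s in splits]
    return [run_reduce_task(r, intermediate, reduce_fn) for r in range(num_reducers)]

lines = ["a b a", "b c", "a", "c c", "b", "a d"]
outputs = run_job(lines,
                  lambda line: [(w, 1) for w in line.split()],
                  lambda key, values: sum(values))
merged = {k: v for out in outputs for k, v in out.items()}
print(merged)
```

Because every occurrence of a key is routed to the same reduce task, each output "file" holds a disjoint set of keys and the files can simply be concatenated.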
MapReduce – Fault Tolerance
•  Workers are periodically pinged by the master
•  No answer within a certain time → worker is considered failed

Mapper fails:
     –  Reset its map job to idle
     –  This applies even if the job was already completed, because its intermediate files are now inaccessible
     –  Notify reducers where to get the new intermediate file
Reducer fails:
     –  Reset its job to idle
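The failure handling can be sketched as a small bookkeeping class. Everything here is a hypothetical model: the `Master` and `Task` classes, the explicit `now` timestamps (used instead of real pings so the example is repeatable), and the assumption that a finished reduce task does not need to be redone because its output already sits in the distributed file system.

```python
from dataclasses import dataclass
from typing import Optional

IDLE, IN_PROGRESS, COMPLETED = "idle", "in_progress", "completed"

@dataclass
class Task:
    task_id: int
    kind: str                      # "map" or "reduce"
    state: str = IDLE
    worker: Optional[str] = None

class Master:
    def __init__(self, tasks, timeout):
        self.tasks = tasks
        self.timeout = timeout     # seconds without an answer before a worker counts as failed
        self.last_answer = {}      # worker name -> time of its last ping reply

    def record_answer(self, worker, now):
        self.last_answer[worker] = now

    def failed_workers(self, now):
        return [w for w, t in self.last_answer.items() if now - t > self.timeout]

    def handle_failures(self, now):
        """Reset tasks of failed workers to idle; return the map tasks that must be redone."""
        redo_maps = []
        for worker in self.failed_workers(now):
            for task in self.tasks:
                if task.worker != worker:
                    continue
                # Map tasks are reset even if completed: their intermediate
                # files lived on the failed machine. Reduce tasks are reset
                # only while in progress (assumption, see lead-in).
                if task.kind == "map" or task.state == IN_PROGRESS:
                    task.state, task.worker = IDLE, None
                    if task.kind == "map":
                        redo_maps.append(task.task_id)
            del self.last_answer[worker]
        return redo_maps

tasks = [Task(0, "map", COMPLETED, "w1"), Task(1, "map", IN_PROGRESS, "w2"),
         Task(2, "reduce", IN_PROGRESS, "w1")]
master = Master(tasks, timeout=10)
master.record_answer("w1", now=0)
master.record_answer("w2", now=0)
master.record_answer("w2", now=12)   # w1 stays silent
redo = master.handle_failures(now=15)
print(redo, [t.state for t in tasks])
```

The returned list stands in for the notification step: reducers would be told to fetch the intermediate files of these map tasks from whichever worker re-executes them.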
MapReduce – Fault Tolerance
Master fails:
     –  The master periodically writes checkpoints
     –  In case of failure the MapReduce operation is aborted
     –  The operation can be restarted from the last checkpoint
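Checkpoint and restart can be sketched as saving and reloading the master's task table. The JSON file format, the file name `master.ckpt`, and both function names are assumptions made for this example; the slides do not say how the checkpoints are stored.

```python
import json
import os
import tempfile

def write_checkpoint(path, task_states):
    """Write the task table atomically: temporary file first, then rename."""
    directory = os.path.dirname(path) or "."
    fd, tmp_path = tempfile.mkstemp(dir=directory)
    with os.fdopen(fd, "w") as f:
        json.dump(task_states, f)
    os.replace(tmp_path, path)

def restart_from_checkpoint(path):
    """Load the last checkpoint; tasks that were in progress are scheduled again."""
    with open(path) as f:
        task_states = json.load(f)
    return {task: ("idle" if state == "in_progress" else state)
            for task, state in task_states.items()}

checkpoint_dir = tempfile.mkdtemp()
checkpoint_path = os.path.join(checkpoint_dir, "master.ckpt")
write_checkpoint(checkpoint_path,
                 {"map-0": "completed", "map-1": "in_progress", "reduce-0": "idle"})
restored = restart_from_checkpoint(checkpoint_path)
print(restored)
```

Writing to a temporary file and renaming it means a crash during the write cannot leave a half-written checkpoint behind.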
Google MapReduce – GFS
Google File System
•  In-house distributed file system at Google
•  Stores all input and output files
•  Stores files…
     – divided into 64 MB blocks
     – on at least 3 different machines
•  Machines running GFS also run MapReduce
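The two storage figures (64 MB blocks, at least 3 copies) translate into simple arithmetic. The block size and replica count come from the slide; the round-robin `place_replicas` policy and the machine names are purely hypothetical, since the slides do not describe how machines are actually chosen.

```python
BLOCK_SIZE = 64 * 1024 * 1024   # 64 MB blocks
REPLICAS = 3                    # each block on at least 3 different machines

def num_blocks(file_size_bytes):
    """Number of blocks needed for a file (the last block may be partly filled)."""
    return (file_size_bytes + BLOCK_SIZE - 1) // BLOCK_SIZE

def place_replicas(block_index, machines, replicas=REPLICAS):
    """Hypothetical placement: `replicas` consecutive machines, round-robin."""
    if len(machines) < replicas:
        raise ValueError("need at least as many machines as replicas")
    return [machines[(block_index + i) % len(machines)] for i in range(replicas)]

file_size = 1024 ** 3           # a 1 GiB file
machines = ["m0", "m1", "m2", "m3", "m4"]
layout = {b: place_replicas(b, machines) for b in range(num_blocks(file_size))}
print(num_blocks(file_size), layout[0], layout[4])
```

Since the same machines also run MapReduce, a map task for a given split can usually be scheduled on a machine that already holds one of that block's copies.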
Google MapReduce – Job Example
[Figures: four slides showing a job example; the images are not preserved]
Alternative Implementations
Apache Hadoop

•    Open-source implementation in Java
•    Jobs can be written in C++, Java, Python, etc.
•    Used by Yahoo!, Facebook, Amazon and others
•    Most commonly used implementation
•    HDFS as open-source implementation of GFS
•    Can also use Amazon S3, HTTP(S) or FTP
•    Extensions: Hive, Pig, HBase
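A Python job might look like the following sketch. It assumes the line-oriented convention used by Hadoop Streaming (text lines in, tab-separated key and value out, reducer input sorted by key); the word-count task and the local "map, sort, reduce" driver at the end are stand-ins for illustration, and in a real job the mapper and reducer would be separate scripts reading standard input.

```python
from itertools import groupby

def mapper(lines):
    """Emit one "word<TAB>1" line per word of the input."""
    for line in lines:
        for word in line.split():
            yield f"{word}\t1"

def reducer(sorted_lines):
    """Sum the counts per word; the input must already be sorted by key."""
    pairs = (line.rstrip("\n").split("\t", 1) for line in sorted_lines)
    for word, group in groupby(pairs, key=lambda pair: pair[0]):
        yield f"{word}\t{sum(int(count) for _, count in group)}"

# Local stand-in for the framework: map, sort by key, reduce.
mapped = list(mapper(["hadoop pig hive", "hive hadoop hive"]))
reduced = list(reducer(sorted(mapped)))
print(reduced)
```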
Alternative Implementations
Mars
MapReduce implementation for NVIDIA GPUs using the CUDA framework

MapReduce-Cell
Implementation for the Cell multi-core processor

Qizmt
MySpace’s implementation of MapReduce in C#
Alternative Implementations
There are many other open- and closed-source implementations of MapReduce!
Reception and Criticism
•  Yahoo!: Hadoop on a 10,000-server cluster
•  Facebook analyses its daily log (25 TB) on a 1,000-server cluster
•  Amazon Elastic MapReduce: Hadoop clusters for rent on EC2 and S3
•  IBM and Google: support university courses in distributed programming
•  UC Berkeley announced that it will teach freshmen MapReduce programming
Reception and Criticism
•  Criticism mainly by RDBMS experts
   DeWitt and Stonebraker
•  MapReduce
     – is a step backwards in database access
     – is a poor implementation
     – is not novel
     – is missing features that are routinely provided
       by modern DBMSs
     – is incompatible with the DBMS tools
Reception and Criticism
Response to criticism

MapReduce is not an RDBMS

It is well suited for processing and structuring huge amounts of unstructured data

MapReduce's big innovation is that it enables distributing data processing across a network of cheap and possibly unreliable computers
Trends and Future Development
Trend of utilizing MapReduce/Hadoop as a parallel database

•  Hive: Query language for Hadoop
•  HBase: Column-oriented distributed database (modeled after Google’s BigTable)
•  Map-Reduce-Merge: Adding a merge phase to the paradigm allows implementing features of relational algebra
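The idea behind the merge phase can be illustrated with a join, the relational operation that plain map and reduce handle awkwardly. This is a minimal sketch, not the interface of the Map-Reduce-Merge proposal: the `merge` function, the department data, and the two "reduced outputs" are all made up for the example.

```python
def merge(left, right, merge_fn):
    """Merge two reduced outputs on their shared keys (an equi-join)."""
    return {key: merge_fn(left[key], right[key]) for key in left.keys() & right.keys()}

# Reduced outputs of two separate (hypothetical) MapReduce jobs, keyed by department.
employees_per_dept = {"sales": 3, "dev": 5, "hr": 1}
salary_sum_per_dept = {"sales": 9000, "dev": 20000, "ops": 4000}

average_salary = merge(employees_per_dept, salary_sum_per_dept,
                       lambda employees, salary_sum: salary_sum / employees)
print(average_salary)
```

Keys that appear in only one of the two inputs ("hr" and "ops" here) are dropped, which matches the behaviour of an inner join.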
Trends and Future Development
Trend to use the MapReduce paradigm to better utilize multi-core CPUs

•  Qt Concurrent
     –  Simplified C++ version of MapReduce for distributing tasks between multiple processor cores
•  Mars
•  MapReduce-Cell
Conclusion
MapReduce

•  provides an easy solution for processing large amounts of data
•  brings a paradigm shift in programming
•  changed the world, i.e. it made data processing more efficient and cheaper, and it is the foundation of many other approaches and solutions
Questions?

Thank You!

More Related Content

What's hot

MapReduce Example | MapReduce Programming | Hadoop MapReduce Tutorial | Edureka
MapReduce Example | MapReduce Programming | Hadoop MapReduce Tutorial | Edureka MapReduce Example | MapReduce Programming | Hadoop MapReduce Tutorial | Edureka
MapReduce Example | MapReduce Programming | Hadoop MapReduce Tutorial | Edureka Edureka!
 
Big data unit iv and v lecture notes qb model exam
Big data unit iv and v lecture notes   qb model examBig data unit iv and v lecture notes   qb model exam
Big data unit iv and v lecture notes qb model examIndhujeni
 
Hadoop Overview & Architecture
Hadoop Overview & Architecture  Hadoop Overview & Architecture
Hadoop Overview & Architecture EMC
 
Optimizing MapReduce Job performance
Optimizing MapReduce Job performanceOptimizing MapReduce Job performance
Optimizing MapReduce Job performanceDataWorks Summit
 
Introduction to Hadoop and Hadoop component
Introduction to Hadoop and Hadoop component Introduction to Hadoop and Hadoop component
Introduction to Hadoop and Hadoop component rebeccatho
 
Apache Hadoop YARN
Apache Hadoop YARNApache Hadoop YARN
Apache Hadoop YARNAdam Kawa
 
Introduction to Hadoop
Introduction to HadoopIntroduction to Hadoop
Introduction to Hadoopjoelcrabb
 
Лекция 2. Основы Hadoop
Лекция 2. Основы HadoopЛекция 2. Основы Hadoop
Лекция 2. Основы HadoopTechnopark
 
Hadoop And Their Ecosystem
 Hadoop And Their Ecosystem Hadoop And Their Ecosystem
Hadoop And Their Ecosystemsunera pathan
 
Testing Hadoop jobs with MRUnit
Testing Hadoop jobs with MRUnitTesting Hadoop jobs with MRUnit
Testing Hadoop jobs with MRUnitEric Wendelin
 
Hadoop Architecture and HDFS
Hadoop Architecture and HDFSHadoop Architecture and HDFS
Hadoop Architecture and HDFSEdureka!
 
Integrating Big Data Technologies
Integrating Big Data TechnologiesIntegrating Big Data Technologies
Integrating Big Data TechnologiesDATAVERSITY
 
Learn to setup a Hadoop Multi Node Cluster
Learn to setup a Hadoop Multi Node ClusterLearn to setup a Hadoop Multi Node Cluster
Learn to setup a Hadoop Multi Node ClusterEdureka!
 
Schemaless Databases
Schemaless DatabasesSchemaless Databases
Schemaless DatabasesDan Gunter
 

What's hot (20)

MapReduce Example | MapReduce Programming | Hadoop MapReduce Tutorial | Edureka
MapReduce Example | MapReduce Programming | Hadoop MapReduce Tutorial | Edureka MapReduce Example | MapReduce Programming | Hadoop MapReduce Tutorial | Edureka
MapReduce Example | MapReduce Programming | Hadoop MapReduce Tutorial | Edureka
 
Apache HBase™
Apache HBase™Apache HBase™
Apache HBase™
 
Big data unit iv and v lecture notes qb model exam
Big data unit iv and v lecture notes   qb model examBig data unit iv and v lecture notes   qb model exam
Big data unit iv and v lecture notes qb model exam
 
Hadoop Overview & Architecture
Hadoop Overview & Architecture  Hadoop Overview & Architecture
Hadoop Overview & Architecture
 
Hadoop technology
Hadoop technologyHadoop technology
Hadoop technology
 
Optimizing MapReduce Job performance
Optimizing MapReduce Job performanceOptimizing MapReduce Job performance
Optimizing MapReduce Job performance
 
Introduction to Hadoop and Hadoop component
Introduction to Hadoop and Hadoop component Introduction to Hadoop and Hadoop component
Introduction to Hadoop and Hadoop component
 
Apache Hadoop YARN
Apache Hadoop YARNApache Hadoop YARN
Apache Hadoop YARN
 
Introduction to Hadoop
Introduction to HadoopIntroduction to Hadoop
Introduction to Hadoop
 
Лекция 2. Основы Hadoop
Лекция 2. Основы HadoopЛекция 2. Основы Hadoop
Лекция 2. Основы Hadoop
 
Hadoop
Hadoop Hadoop
Hadoop
 
Hadoop And Their Ecosystem
 Hadoop And Their Ecosystem Hadoop And Their Ecosystem
Hadoop And Their Ecosystem
 
Introduction to HDFS
Introduction to HDFSIntroduction to HDFS
Introduction to HDFS
 
Hadoop Seminar Report
Hadoop Seminar ReportHadoop Seminar Report
Hadoop Seminar Report
 
Testing Hadoop jobs with MRUnit
Testing Hadoop jobs with MRUnitTesting Hadoop jobs with MRUnit
Testing Hadoop jobs with MRUnit
 
Hadoop Architecture and HDFS
Hadoop Architecture and HDFSHadoop Architecture and HDFS
Hadoop Architecture and HDFS
 
Map Reduce
Map ReduceMap Reduce
Map Reduce
 
Integrating Big Data Technologies
Integrating Big Data TechnologiesIntegrating Big Data Technologies
Integrating Big Data Technologies
 
Learn to setup a Hadoop Multi Node Cluster
Learn to setup a Hadoop Multi Node ClusterLearn to setup a Hadoop Multi Node Cluster
Learn to setup a Hadoop Multi Node Cluster
 
Schemaless Databases
Schemaless DatabasesSchemaless Databases
Schemaless Databases
 

Similar to An Introduction to MapReduce

Report Hadoop Map Reduce
Report Hadoop Map ReduceReport Hadoop Map Reduce
Report Hadoop Map ReduceUrvashi Kataria
 
Map Reduce Workloads: A Dynamic Job Ordering and Slot Configurations
Map Reduce Workloads: A Dynamic Job Ordering and Slot ConfigurationsMap Reduce Workloads: A Dynamic Job Ordering and Slot Configurations
Map Reduce Workloads: A Dynamic Job Ordering and Slot Configurationsdbpublications
 
Mod05lec23(map reduce tutorial)
Mod05lec23(map reduce tutorial)Mod05lec23(map reduce tutorial)
Mod05lec23(map reduce tutorial)Ankit Gupta
 
MapReduce Programming Model
MapReduce Programming ModelMapReduce Programming Model
MapReduce Programming ModelAdarshaDhakal
 
Introduccion a Hadoop / Introduction to Hadoop
Introduccion a Hadoop / Introduction to HadoopIntroduccion a Hadoop / Introduction to Hadoop
Introduccion a Hadoop / Introduction to HadoopGERARDO BARBERENA
 
Hadoop fault tolerance
Hadoop  fault toleranceHadoop  fault tolerance
Hadoop fault tolerancePallav Jha
 
An Enhanced MapReduce Model (on BSP)
An Enhanced MapReduce Model (on BSP)An Enhanced MapReduce Model (on BSP)
An Enhanced MapReduce Model (on BSP)Yu Liu
 
An Introduction to Apache Hadoop, Mahout and HBase
An Introduction to Apache Hadoop, Mahout and HBaseAn Introduction to Apache Hadoop, Mahout and HBase
An Introduction to Apache Hadoop, Mahout and HBaseLukas Vlcek
 
(Berkeley CS186 guest lecture) Big Data Analytics Systems: What Goes Around C...
(Berkeley CS186 guest lecture) Big Data Analytics Systems: What Goes Around C...(Berkeley CS186 guest lecture) Big Data Analytics Systems: What Goes Around C...
(Berkeley CS186 guest lecture) Big Data Analytics Systems: What Goes Around C...Reynold Xin
 
A data aware caching 2415
A data aware caching 2415A data aware caching 2415
A data aware caching 2415SANTOSH WAYAL
 
Apache Hadoop - Big Data Engineering
Apache Hadoop - Big Data EngineeringApache Hadoop - Big Data Engineering
Apache Hadoop - Big Data EngineeringBADR
 
A Survey on Big Data Analysis Techniques
A Survey on Big Data Analysis TechniquesA Survey on Big Data Analysis Techniques
A Survey on Big Data Analysis Techniquesijsrd.com
 
writing Hadoop Map Reduce programs
writing Hadoop Map Reduce programswriting Hadoop Map Reduce programs
writing Hadoop Map Reduce programsjani shaik
 
Hybrid Map Task Scheduling for GPU-based Heterogeneous Clusters
Hybrid Map Task Scheduling for GPU-based Heterogeneous ClustersHybrid Map Task Scheduling for GPU-based Heterogeneous Clusters
Hybrid Map Task Scheduling for GPU-based Heterogeneous ClustersKoichi Shirahata
 
Hadoop, MapReduce and R = RHadoop
Hadoop, MapReduce and R = RHadoopHadoop, MapReduce and R = RHadoop
Hadoop, MapReduce and R = RHadoopVictoria López
 

Similar to An Introduction to MapReduce (20)

Map reducecloudtech
Map reducecloudtechMap reducecloudtech
Map reducecloudtech
 
Report Hadoop Map Reduce
Report Hadoop Map ReduceReport Hadoop Map Reduce
Report Hadoop Map Reduce
 
Mapreduce Hadop.pptx
Mapreduce Hadop.pptxMapreduce Hadop.pptx
Mapreduce Hadop.pptx
 
Map Reduce Workloads: A Dynamic Job Ordering and Slot Configurations
Map Reduce Workloads: A Dynamic Job Ordering and Slot ConfigurationsMap Reduce Workloads: A Dynamic Job Ordering and Slot Configurations
Map Reduce Workloads: A Dynamic Job Ordering and Slot Configurations
 
Mod05lec23(map reduce tutorial)
Mod05lec23(map reduce tutorial)Mod05lec23(map reduce tutorial)
Mod05lec23(map reduce tutorial)
 
MapReduce Programming Model
MapReduce Programming ModelMapReduce Programming Model
MapReduce Programming Model
 
Introduccion a Hadoop / Introduction to Hadoop
Introduccion a Hadoop / Introduction to HadoopIntroduccion a Hadoop / Introduction to Hadoop
Introduccion a Hadoop / Introduction to Hadoop
 
Hadoop
HadoopHadoop
Hadoop
 
Hadoop fault tolerance
Hadoop  fault toleranceHadoop  fault tolerance
Hadoop fault tolerance
 
E031201032036
E031201032036E031201032036
E031201032036
 
An Enhanced MapReduce Model (on BSP)
An Enhanced MapReduce Model (on BSP)An Enhanced MapReduce Model (on BSP)
An Enhanced MapReduce Model (on BSP)
 
An Introduction to Apache Hadoop, Mahout and HBase
An Introduction to Apache Hadoop, Mahout and HBaseAn Introduction to Apache Hadoop, Mahout and HBase
An Introduction to Apache Hadoop, Mahout and HBase
 
(Berkeley CS186 guest lecture) Big Data Analytics Systems: What Goes Around C...
(Berkeley CS186 guest lecture) Big Data Analytics Systems: What Goes Around C...(Berkeley CS186 guest lecture) Big Data Analytics Systems: What Goes Around C...
(Berkeley CS186 guest lecture) Big Data Analytics Systems: What Goes Around C...
 
A data aware caching 2415
A data aware caching 2415A data aware caching 2415
A data aware caching 2415
 
Apache Hadoop - Big Data Engineering
Apache Hadoop - Big Data EngineeringApache Hadoop - Big Data Engineering
Apache Hadoop - Big Data Engineering
 
A Survey on Big Data Analysis Techniques
A Survey on Big Data Analysis TechniquesA Survey on Big Data Analysis Techniques
A Survey on Big Data Analysis Techniques
 
Big Data Technology
Big Data TechnologyBig Data Technology
Big Data Technology
 
writing Hadoop Map Reduce programs
writing Hadoop Map Reduce programswriting Hadoop Map Reduce programs
writing Hadoop Map Reduce programs
 
Hybrid Map Task Scheduling for GPU-based Heterogeneous Clusters
Hybrid Map Task Scheduling for GPU-based Heterogeneous ClustersHybrid Map Task Scheduling for GPU-based Heterogeneous Clusters
Hybrid Map Task Scheduling for GPU-based Heterogeneous Clusters
 
Hadoop, MapReduce and R = RHadoop
Hadoop, MapReduce and R = RHadoopHadoop, MapReduce and R = RHadoop
Hadoop, MapReduce and R = RHadoop
 

Recently uploaded

Artificial Intelligence: Facts and Myths
Artificial Intelligence: Facts and MythsArtificial Intelligence: Facts and Myths
Artificial Intelligence: Facts and MythsJoaquim Jorge
 
Automating Google Workspace (GWS) & more with Apps Script
Automating Google Workspace (GWS) & more with Apps ScriptAutomating Google Workspace (GWS) & more with Apps Script
Automating Google Workspace (GWS) & more with Apps Scriptwesley chun
 
Powerful Google developer tools for immediate impact! (2023-24 C)
Powerful Google developer tools for immediate impact! (2023-24 C)Powerful Google developer tools for immediate impact! (2023-24 C)
Powerful Google developer tools for immediate impact! (2023-24 C)wesley chun
 
Top 10 Most Downloaded Games on Play Store in 2024
Top 10 Most Downloaded Games on Play Store in 2024Top 10 Most Downloaded Games on Play Store in 2024
Top 10 Most Downloaded Games on Play Store in 2024SynarionITSolutions
 
Apidays New York 2024 - The Good, the Bad and the Governed by David O'Neill, ...
Apidays New York 2024 - The Good, the Bad and the Governed by David O'Neill, ...Apidays New York 2024 - The Good, the Bad and the Governed by David O'Neill, ...
Apidays New York 2024 - The Good, the Bad and the Governed by David O'Neill, ...apidays
 
Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024
Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024
Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024The Digital Insurer
 
Deploy with confidence: VMware Cloud Foundation 5.1 on next gen Dell PowerEdg...
Deploy with confidence: VMware Cloud Foundation 5.1 on next gen Dell PowerEdg...Deploy with confidence: VMware Cloud Foundation 5.1 on next gen Dell PowerEdg...
Deploy with confidence: VMware Cloud Foundation 5.1 on next gen Dell PowerEdg...Principled Technologies
 
Top 5 Benefits OF Using Muvi Live Paywall For Live Streams
Top 5 Benefits OF Using Muvi Live Paywall For Live StreamsTop 5 Benefits OF Using Muvi Live Paywall For Live Streams
Top 5 Benefits OF Using Muvi Live Paywall For Live StreamsRoshan Dwivedi
 
Axa Assurance Maroc - Insurer Innovation Award 2024
Axa Assurance Maroc - Insurer Innovation Award 2024Axa Assurance Maroc - Insurer Innovation Award 2024
Axa Assurance Maroc - Insurer Innovation Award 2024The Digital Insurer
 
Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024
Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024
Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024The Digital Insurer
 
Strategies for Landing an Oracle DBA Job as a Fresher
Strategies for Landing an Oracle DBA Job as a FresherStrategies for Landing an Oracle DBA Job as a Fresher
Strategies for Landing an Oracle DBA Job as a FresherRemote DBA Services
 
Boost Fertility New Invention Ups Success Rates.pdf
Boost Fertility New Invention Ups Success Rates.pdfBoost Fertility New Invention Ups Success Rates.pdf
Boost Fertility New Invention Ups Success Rates.pdfsudhanshuwaghmare1
 
Repurposing LNG terminals for Hydrogen Ammonia: Feasibility and Cost Saving
Repurposing LNG terminals for Hydrogen Ammonia: Feasibility and Cost SavingRepurposing LNG terminals for Hydrogen Ammonia: Feasibility and Cost Saving
Repurposing LNG terminals for Hydrogen Ammonia: Feasibility and Cost SavingEdi Saputra
 
Why Teams call analytics are critical to your entire business
Why Teams call analytics are critical to your entire businessWhy Teams call analytics are critical to your entire business
Why Teams call analytics are critical to your entire businesspanagenda
 
Tata AIG General Insurance Company - Insurer Innovation Award 2024
Tata AIG General Insurance Company - Insurer Innovation Award 2024Tata AIG General Insurance Company - Insurer Innovation Award 2024
Tata AIG General Insurance Company - Insurer Innovation Award 2024The Digital Insurer
 
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...apidays
 
Connector Corner: Accelerate revenue generation using UiPath API-centric busi...
Connector Corner: Accelerate revenue generation using UiPath API-centric busi...Connector Corner: Accelerate revenue generation using UiPath API-centric busi...
Connector Corner: Accelerate revenue generation using UiPath API-centric busi...DianaGray10
 
2024: Domino Containers - The Next Step. News from the Domino Container commu...
2024: Domino Containers - The Next Step. News from the Domino Container commu...2024: Domino Containers - The Next Step. News from the Domino Container commu...
2024: Domino Containers - The Next Step. News from the Domino Container commu...Martijn de Jong
 
The 7 Things I Know About Cyber Security After 25 Years | April 2024
The 7 Things I Know About Cyber Security After 25 Years | April 2024The 7 Things I Know About Cyber Security After 25 Years | April 2024
The 7 Things I Know About Cyber Security After 25 Years | April 2024Rafal Los
 

Recently uploaded (20)

Artificial Intelligence: Facts and Myths
Artificial Intelligence: Facts and MythsArtificial Intelligence: Facts and Myths
Artificial Intelligence: Facts and Myths
 
Automating Google Workspace (GWS) & more with Apps Script
Automating Google Workspace (GWS) & more with Apps ScriptAutomating Google Workspace (GWS) & more with Apps Script
Automating Google Workspace (GWS) & more with Apps Script
 
Powerful Google developer tools for immediate impact! (2023-24 C)
Powerful Google developer tools for immediate impact! (2023-24 C)Powerful Google developer tools for immediate impact! (2023-24 C)
Powerful Google developer tools for immediate impact! (2023-24 C)
 
Top 10 Most Downloaded Games on Play Store in 2024
Top 10 Most Downloaded Games on Play Store in 2024Top 10 Most Downloaded Games on Play Store in 2024
Top 10 Most Downloaded Games on Play Store in 2024
 
Apidays New York 2024 - The Good, the Bad and the Governed by David O'Neill, ...
Apidays New York 2024 - The Good, the Bad and the Governed by David O'Neill, ...Apidays New York 2024 - The Good, the Bad and the Governed by David O'Neill, ...
Apidays New York 2024 - The Good, the Bad and the Governed by David O'Neill, ...
 
Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024
Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024
Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024
 
Deploy with confidence: VMware Cloud Foundation 5.1 on next gen Dell PowerEdg...
Deploy with confidence: VMware Cloud Foundation 5.1 on next gen Dell PowerEdg...Deploy with confidence: VMware Cloud Foundation 5.1 on next gen Dell PowerEdg...
Deploy with confidence: VMware Cloud Foundation 5.1 on next gen Dell PowerEdg...
 
Top 5 Benefits OF Using Muvi Live Paywall For Live Streams
Top 5 Benefits OF Using Muvi Live Paywall For Live StreamsTop 5 Benefits OF Using Muvi Live Paywall For Live Streams
Top 5 Benefits OF Using Muvi Live Paywall For Live Streams
 
+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...
+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...
+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...
 
Axa Assurance Maroc - Insurer Innovation Award 2024
Axa Assurance Maroc - Insurer Innovation Award 2024Axa Assurance Maroc - Insurer Innovation Award 2024
Axa Assurance Maroc - Insurer Innovation Award 2024
 
Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024
Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024
Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024
 
Strategies for Landing an Oracle DBA Job as a Fresher
Strategies for Landing an Oracle DBA Job as a FresherStrategies for Landing an Oracle DBA Job as a Fresher
Strategies for Landing an Oracle DBA Job as a Fresher
 
Boost Fertility New Invention Ups Success Rates.pdf
Boost Fertility New Invention Ups Success Rates.pdfBoost Fertility New Invention Ups Success Rates.pdf
Boost Fertility New Invention Ups Success Rates.pdf
 
Repurposing LNG terminals for Hydrogen Ammonia: Feasibility and Cost Saving
Repurposing LNG terminals for Hydrogen Ammonia: Feasibility and Cost SavingRepurposing LNG terminals for Hydrogen Ammonia: Feasibility and Cost Saving
Repurposing LNG terminals for Hydrogen Ammonia: Feasibility and Cost Saving
 
Why Teams call analytics are critical to your entire business
Why Teams call analytics are critical to your entire businessWhy Teams call analytics are critical to your entire business
Why Teams call analytics are critical to your entire business
 
Tata AIG General Insurance Company - Insurer Innovation Award 2024
Tata AIG General Insurance Company - Insurer Innovation Award 2024Tata AIG General Insurance Company - Insurer Innovation Award 2024
Tata AIG General Insurance Company - Insurer Innovation Award 2024
 
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
 
Connector Corner: Accelerate revenue generation using UiPath API-centric busi...
Connector Corner: Accelerate revenue generation using UiPath API-centric busi...Connector Corner: Accelerate revenue generation using UiPath API-centric busi...
Connector Corner: Accelerate revenue generation using UiPath API-centric busi...
 
2024: Domino Containers - The Next Step. News from the Domino Container commu...
2024: Domino Containers - The Next Step. News from the Domino Container commu...2024: Domino Containers - The Next Step. News from the Domino Container commu...
2024: Domino Containers - The Next Step. News from the Domino Container commu...
 
The 7 Things I Know About Cyber Security After 25 Years | April 2024
The 7 Things I Know About Cyber Security After 25 Years | April 2024The 7 Things I Know About Cyber Security After 25 Years | April 2024
The 7 Things I Know About Cyber Security After 25 Years | April 2024
 

An Introduction to MapReduce

  • 1. An Introduction to MapReduce Presented by Frane Bandov at the Operating Complex IT-Systems seminar Berlin, 1/26/2010
  • 2. Outline •  Introduction •  Google MapReduce –  Idea –  Overview –  Fault Tolerance –  GFS: Google File System –  Job Example •  Alternative Implementations •  Reception and Criticism •  Trends and Future Development •  Conclusion 2/16/10 An Introduction to MapReduce 2
  • 3. Outline • Introduction • Google MapReduce – Idea – Overview – Fault Tolerance – GFS: Google File System – Job Example • Alternative Implementations • Reception and Criticism • Trends and Future Development • Conclusion
  • 4. Introduction – Problem: Sometimes we have to deal with huge amounts of data. [Bar chart: data volume in TBytes, axis from 0 to 250, for You, Facebook, Yahoo! Groups, and the German Climate Computing Centre]
  • 5. Introduction – Problem: The data needs to be processed, but how? We can't process all of this data on one machine → Distribute the processing to many machines
  • 6. Introduction – Approach: Distributed computing is the solution. “Let’s write our own distributed computing software as a solution to our problem.” Checklist: design protocols, design data structures, write the code, assure failure tolerance → Development takes a long time → Expensive: cost-benefit ratio? Build complex software for simple computations?
  • 7. Outline • Introduction • Google MapReduce – Idea – Overview – Fault Tolerance – GFS: Google File System – Job Example • Alternative Implementations • Reception and Criticism • Trends and Future Development • Conclusion
  • 8. Google MapReduce – Idea: A framework for distributed computing. Don't care about protocols, failure tolerance, etc. Just write your simple computation.
  • 9. Google MapReduce – Idea: MapReduce Paradigm. Map: Apply a function to all elements of a list. square x = x * x; map square [1, 2, 3, 4, 5]; → [1, 4, 9, 16, 25]. Reduce: Combine all elements of a list. reduce (+) [1, 2, 3, 4, 5]; → 15
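The map and reduce calls on this slide can be tried directly in Python. This is only a minimal sketch of the functional paradigm on a single machine, not of Google's distributed framework. The names `numbers`, `squares`, and `total` are chosen here for illustration.

```python
from functools import reduce
from operator import add


def square(x):
    """Return x multiplied by itself, as defined on the slide."""
    return x * x


numbers = [1, 2, 3, 4, 5]

# Map: apply square to every element of the list.
squares = list(map(square, numbers))

# Reduce: combine all elements of the list with +.
total = reduce(add, numbers)

print(squares)  # [1, 4, 9, 16, 25]
print(total)    # 15
```

Each call to `square` is independent of the others, which is what makes the map step easy to spread across many machines.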
  • 10. Google MapReduce – Idea: Basic functioning: Input → Map → Reduce → Output
  • 11. Google MapReduce – Overview: [Diagram: a MapReduce-based user program, a master, and workers. Input file in GFS (splits 1 to 5) → workers in the map phase → intermediate files 1 to 3 → workers in the reduce phase → output files 1 and 2 in GFS]
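The data flow in this overview (splits, map phase, intermediate files, reduce phase, output files) can be imitated in a single process. The sketch below assumes a word count job, which is an illustration chosen here and not necessarily the job shown in the slides. The function names, the number of reducers, and the partitioning rule are all hypothetical.

```python
from collections import defaultdict

NUM_REDUCERS = 2  # hypothetical: two reduce workers, one output each


def map_words(split):
    """Map task: emit a (word, 1) pair for every word in one input split."""
    return [(word, 1) for word in split.split()]


def partition(key, num_reducers=NUM_REDUCERS):
    """Pick the reduce worker for a key with a simple deterministic rule."""
    return sum(key.encode("utf-8")) % num_reducers


def reduce_counts(word, counts):
    """Reduce task: combine all values that were emitted for one key."""
    return word, sum(counts)


def run_job(splits):
    # Map phase: each split is processed independently; the emitted pairs
    # are grouped by key into one intermediate bucket per reduce worker.
    intermediate = [defaultdict(list) for _ in range(NUM_REDUCERS)]
    for split in splits:
        for word, count in map_words(split):
            intermediate[partition(word)][word].append(count)

    # Reduce phase: each reduce worker turns its bucket into one output.
    outputs = []
    for bucket in intermediate:
        outputs.append(dict(reduce_counts(w, c) for w, c in sorted(bucket.items())))
    return outputs


splits = ["the quick brown fox", "the lazy dog", "the fox"]
outputs = run_job(splits)

# Merge the per-reducer outputs only to make the result easy to inspect.
merged = {word: n for output in outputs for word, n in output.items()}
print(merged)
```

Because every occurrence of a key goes to the same bucket, each reduce worker can finish its keys without talking to the others.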
  • 12. MapReduce – Fault Tolerance • Workers are periodically pinged by the master • No answer within a certain time → worker failed. Mapper fails: – Reset its map job as idle – Do this even if the job was completed, because its intermediate files are inaccessible – Notify reducers where to get the new intermediate file. Reducer fails: – Reset its job as idle
  • 13. MapReduce – Fault Tolerance. Master fails: – The master periodically sets checkpoints – In case of failure, the MapReduce operation is aborted – The operation can be restarted from the last checkpoint
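The worker-failure rules from the two fault tolerance slides can be sketched as the bookkeeping a master might do. This is a simplified illustration, not Google's implementation: the `Master` class, its method names, and the 10-second timeout are hypothetical. It also assumes, following the published MapReduce paper rather than the slides, that a completed reduce job is not reset because its output is already in the distributed file system.

```python
PING_TIMEOUT = 10.0  # hypothetical: seconds of silence before a worker counts as failed


class Master:
    """Tracks ping answers and job states (a sketch, not a real scheduler)."""

    def __init__(self):
        self.last_seen = {}  # worker id -> time of the last ping answer
        self.jobs = {}       # job id -> {"kind", "state", "worker"}

    def add_job(self, job_id, kind):
        self.jobs[job_id] = {"kind": kind, "state": "idle", "worker": None}

    def assign(self, job_id, worker):
        self.jobs[job_id].update(state="in_progress", worker=worker)

    def complete(self, job_id):
        self.jobs[job_id]["state"] = "completed"

    def record_answer(self, worker, now):
        self.last_seen[worker] = now

    def check_workers(self, now):
        """Mark silent workers as failed and reset their jobs to idle."""
        failed = [w for w, t in self.last_seen.items() if now - t > PING_TIMEOUT]
        for worker in failed:
            del self.last_seen[worker]
            for job in self.jobs.values():
                if job["worker"] != worker:
                    continue
                # Map jobs are reset even when completed, because their
                # intermediate files lived on the failed machine.
                # Reduce jobs are reset only while still in progress (assumption).
                if job["kind"] == "map" or job["state"] == "in_progress":
                    job.update(state="idle", worker=None)
        return failed


master = Master()
master.add_job("map-1", "map")
master.add_job("reduce-1", "reduce")
master.assign("map-1", "worker-a")
master.complete("map-1")
master.assign("reduce-1", "worker-b")
master.record_answer("worker-a", now=0.0)
master.record_answer("worker-b", now=8.0)

failed = master.check_workers(now=12.0)
print(failed, master.jobs["map-1"]["state"], master.jobs["reduce-1"]["state"])
```

Passing `now` explicitly instead of reading the clock keeps the sketch deterministic. Notifying reducers about the new intermediate file location is left out.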
  • 14. Google MapReduce – GFS: Google File System • In-house distributed file system at Google • Stores all input and output files • Stores files… – divided into 64 MB blocks – on at least 3 different machines • Machines running GFS also run MapReduce
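The two storage figures on this slide (64 MB blocks, at least 3 copies) allow a quick back-of-the-envelope calculation. The sketch below is plain arithmetic, not a GFS API. The 1000 MB file size is a hypothetical example, and the storage estimate assumes that only the bytes actually written are replicated.

```python
import math

BLOCK_SIZE_MB = 64  # block size stated on the slide
MIN_REPLICAS = 3    # each block is stored on at least 3 different machines


def block_count(file_size_mb):
    """Number of 64 MB blocks needed to hold a file of the given size."""
    return math.ceil(file_size_mb / BLOCK_SIZE_MB)


def min_stored_mb(file_size_mb):
    """Minimum raw storage when every byte is kept MIN_REPLICAS times."""
    return file_size_mb * MIN_REPLICAS


size_mb = 1000  # hypothetical input file

print(block_count(size_mb))                 # 16
print(block_count(size_mb) * MIN_REPLICAS)  # 48
print(min_stored_mb(size_mb))               # 3000
```

The block count also gives a rough idea of how many independent pieces of input are available to hand out to map workers.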
  • 15. Google MapReduce – Job Example
  • 16. Google MapReduce – Job Example
  • 17. Google MapReduce – Job Example
  • 18. Google MapReduce – Job Example
  • 19. Outline • Introduction • Google MapReduce – Idea – Overview – Fault Tolerance – GFS: Google File System – Job Example • Alternative Implementations • Reception and Criticism • Trends and Future Development • Conclusion
  • 20. Alternative Implementations: Apache Hadoop • Open-source implementation in Java • Jobs can be written in C++, Java, Python, etc. • Used by Yahoo!, Facebook, Amazon and others • Most commonly used implementation • HDFS as an open-source implementation of GFS • Can also use Amazon S3, HTTP(S) or FTP • Extensions: Hive, Pig, HBase
  • 21. Alternative Implementations • Mars: MapReduce implementation for NVIDIA GPUs using the CUDA framework • MapReduce-Cell: Implementation for the Cell multi-core processor • Qizmt: MySpace’s implementation of MapReduce in C#
  • 22. Alternative Implementations: There are many other open- and closed-source implementations of MapReduce!
  • 23. Outline • Introduction • Google MapReduce – Idea – Overview – Fault Tolerance – GFS: Google File System – Job Example • Alternative Implementations • Reception and Criticism • Trends and Future Development • Conclusion
  • 24. Reception and Criticism • Yahoo!: Hadoop on a 10,000-server cluster • Facebook analyses the daily log (25 TB) on a 1,000-server cluster • Amazon Elastic MapReduce: Hadoop clusters for rent on EC2 and S3 • IBM and Google: Support university courses in distributed programming • UC Berkeley announced that it will teach freshmen to program with MapReduce
  • 25. Reception and Criticism
  • 26. Reception and Criticism • Criticism mainly by RDBMS experts DeWitt and Stonebraker • MapReduce – is a step backwards in database access – is a poor implementation – is not novel – is missing features that are routinely provided by modern DBMSs – is incompatible with the DBMS tools
  • 27. Reception and Criticism: Response to criticism • MapReduce is not an RDBMS • It is well suited for processing and structuring huge amounts of unstructured data • MapReduce's big innovation is that it enables distributing data processing across a network of cheap and possibly unreliable computers
  • 28. Outline • Introduction • Google MapReduce – Idea – Overview – Fault Tolerance – GFS: Google File System – Job Example • Alternative Implementations • Reception and Criticism • Trends and Future Development • Conclusion
  • 29. Trends and Future Development: Trend of utilizing MapReduce/Hadoop as a parallel database • Hive: Query language for Hadoop • HBase: Column-oriented distributed database (modeled after Google’s BigTable) • Map-Reduce-Merge: Adding merge to the paradigm allows implementing features of relational algebra
  • 30. Trends and Future Development: Trend of using the MapReduce paradigm to better utilize multi-core CPUs • Qt Concurrent – Simplified C++ version of MapReduce for distributing tasks between multiple processor cores • Mars • MapReduce-Cell
  • 31. Outline • Introduction • Google MapReduce – Idea – Overview – Fault Tolerance – GFS: Google File System – Job Example • Alternative Implementations • Reception and Criticism • Trends and Future Development • Conclusion
  • 32. Conclusion: MapReduce • provides an easy solution for the processing of large amounts of data • brings a paradigm shift in programming • changed the world, i.e. made data processing more efficient and cheaper • is the foundation of many other approaches and solutions
  • 33. Questions?
  • 34. Thank You!