An Introduction to MapReduce
             Presented by Frane Bandov
    at the Operating Complex IT-Systems seminar
                  Berlin, 1/26/2010
Outline
•  Introduction
•  Google MapReduce
    –  Idea
    –  Overview
    –  Fault Tolerance
    –  GFS: Google File System
    –  Job Example
•  Alternative Implementations
•  Reception and Criticism
•  Trends and Future Development
•  Conclusion
2/16/10             An Introduction to MapReduce   2
Introduction – Problem
Sometimes we have to deal with huge amounts of data
[Figure: bar chart of data volume in TBytes (axis from 0 to 250) for You, Facebook, Yahoo! Groups, and the German Climate Computing Centre]
Introduction – Problem
The data needs to be processed, but how?
Can't process all of this data on one machine
=> Distribute the processing to many machines
Introduction – Approach
Distributed computing is the solution
“Let’s write our own distributed computing software as a solution to our problem”
Checklist
•  design protocols
•  design data structures
•  write the code
•  assure failure tolerance
=> Development takes a long time
=> Expensive: Cost-benefit ratio?
Build complex software for simple computations?
Google MapReduce – Idea
A framework for distributed computing
Don't care about protocols, failure tolerance, etc.
Just write your simple computation
Google MapReduce – Idea
MapReduce Paradigm
Map: Apply function to all elements of a list
    square x = x * x;
    map square [1, 2, 3, 4, 5];
    => [1, 4, 9, 16, 25]
Reduce: Combine all elements of a list
    reduce (+) [1, 2, 3, 4, 5];
    => 15




Google MapReduce – Idea
Basic functioning
Input -> Map -> Reduce -> Output
Google MapReduce – Overview
[Figure: a MapReduce-based user program with one Master and several Workers. Map phase: workers read the input file from GFS as Splits 1 to 5 and write Intermediate Files 1 to 3. Reduce phase: workers read the intermediate files and write output Files 1 and 2 to GFS.]
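The data flow in the overview can be imitated on a single machine. The sketch below is a toy word count, not Google's implementation: `map_fn`, `reduce_fn`, `partition` and `run_job` are hypothetical names, the splits are in-memory strings instead of GFS files, and the "intermediate files" are plain dictionaries, one per reducer.

```python
from collections import defaultdict


def map_fn(_split_name, text):
    """User-written map: emit (word, 1) for every word in one split."""
    for word in text.split():
        yield word.lower(), 1


def reduce_fn(word, counts):
    """User-written reduce: combine all values emitted for one key."""
    return word, sum(counts)


def partition(key, num_reducers):
    """Pick the reducer for a key (deterministic, unlike hash() on str)."""
    return sum(key.encode("utf-8")) % num_reducers


def run_job(splits, num_reducers=2):
    # Map phase: every split is mapped; output is bucketed per reducer,
    # standing in for the intermediate files.
    intermediate = [defaultdict(list) for _ in range(num_reducers)]
    for name, text in splits.items():
        for key, value in map_fn(name, text):
            intermediate[partition(key, num_reducers)][key].append(value)

    # Reduce phase: each reducer handles its bucket in key order and
    # produces one output "file".
    return [
        dict(reduce_fn(key, bucket[key]) for key in sorted(bucket))
        for bucket in intermediate
    ]


splits = {
    "split1": "the quick brown fox",
    "split2": "the lazy dog",
    "split3": "the fox",
}
output_files = run_job(splits)
merged = {word: n for part in output_files for word, n in part.items()}
print(merged)
```

Each key ends up in exactly one output file, which is why the final result is a set of files rather than a single one.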
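MapReduce – Fault Tolerance
•  Workers are periodically pinged by the master
•  No answer within a certain time => worker is considered failed

Mapper fails:
     –  Reset its map job to idle so it can be rescheduled
     –  Even if the job was completed, because its intermediate files (on the failed machine) are inaccessible
     –  Notify reducers where to get the new intermediate file
Reducer fails:
     –  Reset its job to idle (only if still in progress; completed output is already stored in GFS)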
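The bookkeeping on the master side might look roughly like the sketch below. All names (`Task`, `handle_timeouts`, the 10 second timeout) are assumptions made for illustration, not part of Google's code. It also assumes, following the published MapReduce design, that completed reduce tasks are left alone because their output already sits in the global file system, while map tasks on a failed worker are reset even when completed.

```python
from dataclasses import dataclass
from typing import Optional

IDLE, IN_PROGRESS, COMPLETED = "idle", "in_progress", "completed"


@dataclass
class Task:
    kind: str                     # "map" or "reduce"
    state: str = IDLE
    worker: Optional[str] = None  # worker the task is assigned to


def handle_timeouts(tasks, last_pong, now, timeout=10.0):
    """Mark workers that did not answer in time as failed and
    reset the affected tasks so they can be rescheduled."""
    failed = {w for w, seen in last_pong.items() if now - seen > timeout}
    for task in tasks:
        if task.worker not in failed:
            continue
        # Map output lives on the failed machine's local disk, so even
        # completed map tasks are redone. Reduce tasks are only reset
        # while in progress.
        if task.kind == "map" or task.state == IN_PROGRESS:
            task.state = IDLE
            task.worker = None
    return failed


tasks = [
    Task("map", COMPLETED, "w1"),
    Task("reduce", COMPLETED, "w1"),
    Task("reduce", IN_PROGRESS, "w1"),
    Task("map", IN_PROGRESS, "w2"),
]
last_pong = {"w1": 80.0, "w2": 95.0}   # time of last ping answer
failed = handle_timeouts(tasks, last_pong, now=100.0)
print(failed, [t.state for t in tasks])
```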
MapReduce – Fault Tolerance
Master fails:
     –  The master periodically writes checkpoints
     –  In case of failure the MapReduce operation is aborted
     –  The operation can be restarted from the last checkpoint
Google MapReduce – GFS
Google File System
•  In-house distributed file system at Google
•  Stores all input and output files
•  Stores files…
     – divided into 64 MB blocks
     – on at least 3 different machines
•  Machines running GFS also run MapReduce
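A quick back-of-the-envelope calculation shows what the two numbers mean for a single file. The helper `storage_footprint` is hypothetical, it uses the minimum of 3 copies, and it assumes that a partially filled last block only occupies its actual size.

```python
import math

BLOCK_SIZE_MB = 64   # block size from the slide
REPLICAS = 3         # minimum number of copies from the slide


def storage_footprint(file_size_mb):
    """Return (blocks, block copies in the cluster, raw MB stored)."""
    blocks = math.ceil(file_size_mb / BLOCK_SIZE_MB)
    return blocks, blocks * REPLICAS, file_size_mb * REPLICAS


blocks, block_copies, raw_mb = storage_footprint(1000)
print(blocks, block_copies, raw_mb)  # 16 48 3000
```

A 1,000 MB input file is therefore spread over 16 blocks, which gives the scheduler up to 16 places to start map work close to the data.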
Google MapReduce – Job Example




Alternative Implementations
Apache Hadoop
•    Open-source implementation in Java
•    Jobs can be written in C++, Java, Python, etc.
•    Used by Yahoo!, Facebook, Amazon and others
•    Most commonly used implementation
•    HDFS as open-source implementation of GFS
•    Can also use Amazon S3, HTTP(S) or FTP
•    Extensions: Hive, Pig, HBase
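One common way to write a Hadoop job in Python is Hadoop Streaming, which is not named on the slide and is assumed here: mapper and reducer are ordinary programs that read lines and write tab-separated `key<TAB>value` lines, and the framework sorts the mapper output by key before the reducer sees it. The sketch below keeps both steps as plain functions so they can be tried without a cluster; in a real streaming job they would read `sys.stdin` and print their results.

```python
from itertools import groupby


def mapper(lines):
    """Emit 'word<TAB>1' for every word of every input line."""
    for line in lines:
        for word in line.split():
            yield f"{word}\t1"


def reducer(sorted_lines):
    """Sum the counts per word; input must be sorted by key."""
    pairs = (line.rstrip("\n").split("\t", 1) for line in sorted_lines)
    for word, group in groupby(pairs, key=lambda kv: kv[0]):
        yield f"{word}\t{sum(int(count) for _, count in group)}"


# Local stand-in for the cluster: map, sort by key, reduce.
result = list(reducer(sorted(mapper(["to be or", "not to be"]))))
print(result)
```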
Alternative Implementations
Mars
MapReduce implementation for NVIDIA GPUs using the CUDA framework

MapReduce-Cell
Implementation for the Cell multi-core processor

Qizmt
MySpace’s implementation of MapReduce in C#
Alternative Implementations


There are many other open- and closed-source implementations of MapReduce!
Reception and Criticism
•  Yahoo!: Hadoop on a 10,000 server cluster
•  Facebook analyses the daily log (25 TB) on a 1,000 server cluster
•  Amazon Elastic MapReduce: Hadoop clusters for rent on EC2 and S3
•  IBM and Google: Support university courses in distributed programming
•  UC Berkeley announced that it will teach freshmen to program with MapReduce
Reception and Criticism
•  Criticism mainly by RDBMS experts
   DeWitt and Stonebraker
•  MapReduce
     – is a step backwards in database access
     – is a poor implementation
     – is not novel
     – is missing features that are routinely provided
       by modern DBMSs
     – is incompatible with the DBMS tools
Reception and Criticism
Response to criticism
MapReduce is not an RDBMS
It is well suited to processing and structuring huge amounts of unstructured data
MapReduce's big innovation is that it enables distributing data processing across a network of cheap and possibly unreliable computers
Trends and Future Development
Trend of utilizing MapReduce/Hadoop as a parallel database
•  Hive: Query language for Hadoop
•  HBase: Column-oriented distributed database (modeled after Google’s BigTable)
•  Map-Reduce-Merge: Adding a merge step to the paradigm allows implementing features of relational algebra
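To see why an extra merge step helps with relational operations, consider a join. The sketch below is a loose, single-machine illustration of the idea and not the published Map-Reduce-Merge interface: two independent map/reduce passes produce keyed results, and a hypothetical `merge` function combines entries with matching keys, which amounts to an equi-join. All data and names are invented for the example.

```python
from collections import defaultdict


def map_reduce(records, map_fn, reduce_fn):
    """Minimal in-memory map/reduce returning {key: reduced value}."""
    groups = defaultdict(list)
    for record in records:
        for key, value in map_fn(record):
            groups[key].append(value)
    return {key: reduce_fn(key, values) for key, values in groups.items()}


def merge(left, right, merge_fn):
    """Merge two reduced outputs on equal keys (an equi-join)."""
    return [merge_fn(k, left[k], right[k]) for k in sorted(left.keys() & right.keys())]


employees = [("alice", "d1", 100), ("bob", "d1", 50), ("carol", "d2", 70)]
departments = [("d1", 1.5), ("d2", 2.0), ("d3", 1.0)]

# First lineage: total bonus per department.
bonus_by_dept = map_reduce(employees, lambda e: [(e[1], e[2])], lambda k, vs: sum(vs))
# Second lineage: adjustment factor per department.
factor_by_dept = map_reduce(departments, lambda d: [(d[0], d[1])], lambda k, vs: vs[0])

# Merge: join both outputs on the department key.
joined = merge(bonus_by_dept, factor_by_dept, lambda k, bonus, factor: (k, bonus * factor))
print(joined)  # [('d1', 225.0), ('d2', 140.0)]
```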
Trends and Future Development
Trend to use the MapReduce paradigm to better utilize multi-core CPUs
•  Qt Concurrent
     –  Simplified C++ version of MapReduce for distributing tasks between multiple processor cores
•  Mars
•  MapReduce-Cell
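The single-machine variant of the paradigm can be sketched with Python's standard library. The helper name `mapped_reduced` is borrowed from Qt Concurrent's `mappedReduced`, but this is not Qt's API, only an illustration of the same shape: map items on a worker pool, then fold the results. A thread pool is used to keep the example simple; for CPU-bound work in CPython a `ProcessPoolExecutor` would be the closer match to using several cores.

```python
from concurrent.futures import ThreadPoolExecutor
from functools import reduce


def count_words(chunk):
    """Map step: number of words in one chunk of text."""
    return len(chunk.split())


def mapped_reduced(items, map_fn, reduce_fn, initial, workers=4):
    """Run map_fn on a pool of workers, then fold the results."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        mapped = pool.map(map_fn, items)  # results arrive in input order
        return reduce(reduce_fn, mapped, initial)


chunks = ["a b c", "d e", "f"]
total_words = mapped_reduced(chunks, count_words, lambda acc, n: acc + n, 0)
print(total_words)  # 6
```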


Conclusion
MapReduce
•  provides an easy solution for the processing of large amounts of data
•  brings a paradigm shift in programming
•  changed the world, i.e. made data processing more efficient and cheaper, and is the foundation of many other approaches and solutions
Questions?




Thank You!



