An Introduction to MapReduce
    An Introduction to MapReduce: Presentation Transcript

    • An Introduction to MapReduce. Presented by Frane Bandov at the Operating Complex IT-Systems seminar, Berlin, 1/26/2010
    • Outline
      • Introduction
      • Google MapReduce
          – Idea
          – Overview
          – Fault Tolerance
          – GFS: Google File System
          – Job Example
      • Alternative Implementations
      • Reception and Criticism
      • Trends and Future Development
      • Conclusion
      2/16/10 An Introduction to MapReduce 2
    • Introduction – Problem
      Sometimes we have to deal with huge amounts of data.
      [Bar chart: data volume in TBytes, axis from 0 to 250; categories: You, Facebook, Yahoo! Groups, German Climate Computing Centre; bar values not recoverable from the transcript]
    • Introduction – Problem
      The data needs to be processed, but how?
      Can't process all of this data on one machine → Distribute the processing to many machines
    • Introduction – Approach
      Distributed computing is the solution.
      “Let’s write our own distributed computing software as a solution to our problem”
      Checklist:
          – design protocols
          – design data structures
          – write the code
          – ensure failure tolerance
      → Development takes a long time
      → Expensive: Cost-benefit ratio?
      Build complex software for simple computations?
    • Google MapReduce – Idea
      • A framework for distributed computing
      • Don't care about protocols, failure tolerance, etc.
      • Just write your simple computation
    • Google MapReduce – Idea: MapReduce Paradigm
      • Map: Apply function to all elements of a list
          square x = x * x;
          map square [1, 2, 3, 4, 5];  →  [1, 4, 9, 16, 25]
      • Reduce: Combine all elements of a list
          reduce (+) [1, 2, 3, 4, 5];  →  15
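The two primitives on this slide can be tried out directly. The following is a minimal sketch in Python rather than the functional notation used on the slide; the names `square`, `numbers`, `squares` and `total` are chosen here for illustration only.

```python
from functools import reduce
from operator import add


def square(x):
    """The per-element function from the slide: square x = x * x."""
    return x * x


numbers = [1, 2, 3, 4, 5]

# Map: apply a function to every element of the list.
squares = list(map(square, numbers))

# Reduce: combine all elements of the list into a single value.
total = reduce(add, numbers)

print(squares)  # [1, 4, 9, 16, 25]
print(total)    # 15
```

Because `square` is applied to each element independently, the map step can in principle be spread across many machines, which is the property the framework relies on.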
    • Google MapReduce – Idea: Basic functioning
      Input → Map → Reduce → Output
    • Google MapReduce – Overview
      [Diagram; recoverable labels: MapReduce-based user program; Master; GFS; input file (Split 1 to Split 5); Map Phase (workers); Intermediate File 1 to 3; Reduce Phase (workers); output files (File 1, File 2)]
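The data flow in the overview diagram (input splits, map phase, intermediate data, reduce phase, output files) can be imitated in a single process. This is only a sketch under assumptions: word counting is picked here as the example job, all function names are hypothetical, and everything runs sequentially in memory instead of on workers backed by GFS.

```python
import zlib
from collections import defaultdict


def map_words(split):
    """Map task: emit a (word, 1) pair for every word in one input split."""
    return [(word.lower(), 1) for word in split.split()]


def partition(pairs, num_reducers):
    """Route each intermediate pair to one reduce task, grouped by key."""
    buckets = [defaultdict(list) for _ in range(num_reducers)]
    for key, value in pairs:
        # A stable hash keeps the key-to-reducer assignment reproducible.
        index = zlib.crc32(key.encode("utf-8")) % num_reducers
        buckets[index][key].append(value)
    return buckets


def reduce_counts(bucket):
    """Reduce task: sum the values collected for each key."""
    return {key: sum(values) for key, values in bucket.items()}


def run_job(splits, num_reducers=2):
    intermediate = []
    for split in splits:  # map phase: one map task per split
        intermediate.extend(map_words(split))
    buckets = partition(intermediate, num_reducers)
    return [reduce_counts(bucket) for bucket in buckets]  # one output per reducer


splits = ["the quick brown fox", "the lazy dog", "the fox"]
outputs = run_job(splits)

# Each key lands in exactly one output, so the outputs can simply be combined.
result = {}
for output in outputs:
    result.update(output)
print(result)
```

The user of the framework would only supply the equivalents of `map_words` and `reduce_counts`; splitting, routing and scheduling are what the framework takes over.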
    • MapReduce – Fault Tolerance
      • Workers are periodically pinged by the master
      • No answer within a certain time → worker failed
      Mapper fails:
          – Reset map job as idle
          – Even if the job was completed → intermediate files are inaccessible
          – Notify reducers where to get the new intermediate file
      Reducer fails:
          – Reset its job as idle
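The worker-failure rules above can be sketched as a small bookkeeping class. This is an illustration, not Google's implementation: the `Master` and `Task` classes, the timeout value and the worker names are all hypothetical, and time is passed in explicitly so the example is deterministic. Leaving completed reduce tasks untouched is an assumption that goes beyond the slide (their output is taken to be safely stored already), and notifying reducers about re-created intermediate files is not modelled.

```python
from dataclasses import dataclass
from typing import Optional

IDLE, IN_PROGRESS, COMPLETED = "idle", "in_progress", "completed"


@dataclass
class Task:
    kind: str                     # "map" or "reduce"
    state: str = IDLE
    worker: Optional[str] = None  # worker the task is assigned to, if any


class Master:
    def __init__(self, timeout):
        self.timeout = timeout
        self.last_pong = {}  # worker name -> time of last ping reply
        self.tasks = {}      # task id -> Task

    def record_pong(self, worker, now):
        self.last_pong[worker] = now

    def failed_workers(self, now):
        """Workers that have not answered a ping within the timeout."""
        return [w for w, t in self.last_pong.items() if now - t > self.timeout]

    def handle_failures(self, now):
        """Reset the tasks of failed workers to idle and return their ids."""
        rescheduled = []
        for worker in self.failed_workers(now):
            for task_id, task in self.tasks.items():
                if task.worker != worker:
                    continue
                # Map tasks are reset even when completed, because their
                # intermediate files lived on the failed machine.
                # Assumption: completed reduce tasks are left alone.
                if task.kind == "map" or task.state == IN_PROGRESS:
                    task.state, task.worker = IDLE, None
                    rescheduled.append(task_id)
            del self.last_pong[worker]
        return rescheduled


master = Master(timeout=10.0)
master.tasks = {
    "map-1": Task("map", COMPLETED, "worker-a"),
    "map-2": Task("map", IN_PROGRESS, "worker-b"),
    "reduce-1": Task("reduce", IN_PROGRESS, "worker-a"),
    "reduce-2": Task("reduce", COMPLETED, "worker-a"),
}
master.record_pong("worker-a", now=0.0)
master.record_pong("worker-b", now=8.0)

rescheduled = master.handle_failures(now=12.0)
print(rescheduled)  # ['map-1', 'reduce-1']
```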
    • MapReduce – Fault Tolerance
      Master fails:
          – Periodically sets checkpoints
          – In case of failure the MapReduce operation is aborted
          – Operation can be restarted from the last checkpoint
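Checkpointing and restarting might look roughly like the following. The slide does not say what a checkpoint contains, so this sketch assumes it is simply the master's task table serialized as JSON; the function names and the rule that unfinished tasks restart as idle are likewise assumptions made for illustration.

```python
import json


def save_checkpoint(task_states):
    """Serialize the master's task table (task id -> state) to a string."""
    return json.dumps(task_states, sort_keys=True)


def restore_checkpoint(blob):
    """Rebuild the task table after a restart.

    Assumption: only completed tasks keep their state; anything that was
    idle or in progress at checkpoint time is scheduled again as idle.
    """
    states = json.loads(blob)
    return {
        task: (state if state == "completed" else "idle")
        for task, state in states.items()
    }


checkpoint = save_checkpoint(
    {"map-1": "completed", "map-2": "in_progress", "reduce-1": "idle"}
)

# ... the master fails, the operation is aborted, and it is started again ...
restored = restore_checkpoint(checkpoint)
print(restored)
```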
    • Google MapReduce – GFS: Google File System
      • In-house distributed file system at Google
      • Stores all input and output files
      • Stores files…
          – divided into 64 MB blocks
          – on at least 3 different machines
      • Machines running GFS also run MapReduce
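The two storage figures on this slide (64 MB blocks, at least 3 machines per block) translate into simple arithmetic. The sketch below computes the number of blocks for a file and assigns replicas with a round-robin rule; that placement rule and the machine names are hypothetical and only serve to keep the example short, since the slide does not describe how GFS actually chooses machines.

```python
import math

BLOCK_SIZE = 64 * 1024 * 1024  # 64 MB per block, as stated on the slide
REPLICAS = 3                   # each block is stored on at least 3 machines


def num_blocks(file_size):
    """Number of blocks needed for a file of file_size bytes."""
    return math.ceil(file_size / BLOCK_SIZE)


def place_replicas(file_size, machines, replicas=REPLICAS):
    """Assign every block to `replicas` distinct machines (round-robin)."""
    if len(machines) < replicas:
        raise ValueError("need at least as many machines as replicas")
    placement = {}
    for block in range(num_blocks(file_size)):
        placement[block] = [
            machines[(block + i) % len(machines)] for i in range(replicas)
        ]
    return placement


one_gib = 1024 * 1024 * 1024
machines = ["m0", "m1", "m2", "m3", "m4"]
placement = place_replicas(one_gib, machines)

print(num_blocks(one_gib))  # 16
print(placement[0])         # ['m0', 'm1', 'm2']
```

Since the same machines also run MapReduce, a scheduler can try to start a map task on a machine that already holds a replica of its input block.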
    • Google MapReduce – Job Example
      [Four slides of figures; their content is not recoverable from the transcript]
    • Alternative Implementations: Apache Hadoop
      • Open-source implementation in Java
      • Jobs can be written in C++, Java, Python, etc.
      • Used by Yahoo!, Facebook, Amazon and others
      • Most commonly used implementation
      • HDFS as an open-source implementation of GFS
      • Can also use Amazon S3, HTTP(S) or FTP
      • Extensions: Hive, Pig, HBase
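The slide only states that Hadoop jobs can be written in Python, among other languages. One way this is commonly done is Hadoop Streaming, where the mapper and reducer are ordinary programs exchanging tab-separated key/value lines, and the reducer receives its input sorted by key. The sketch below assumes that mechanism and simulates it locally; in a real job the two functions would read from standard input and print to standard output, and the function names are chosen here for illustration.

```python
from itertools import groupby


def mapper(lines):
    """Emit one 'word<TAB>1' line per word, in the streaming line format."""
    for line in lines:
        for word in line.split():
            yield f"{word}\t1"


def reducer(sorted_lines):
    """Sum the counts per word; the input must already be sorted by key."""
    pairs = (line.split("\t", 1) for line in sorted_lines)
    for word, group in groupby(pairs, key=lambda pair: pair[0]):
        yield f"{word}\t{sum(int(count) for _, count in group)}"


# Local simulation of the pipeline: map, sort by key, reduce.
mapped = sorted(mapper(["a b a", "b c"]))
result = list(reducer(mapped))
print(result)  # ['a\t2', 'b\t2', 'c\t1']
```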
    • Alternative Implementations
      • Mars: MapReduce implementation for NVIDIA GPUs using the CUDA framework
      • MapReduce-Cell: Implementation for the Cell multi-core processor
      • Qizmt: MySpace’s implementation of MapReduce in C#
    • Alternative Implementations
      There are many other open- and closed-source implementations of MapReduce!
    • Reception and Criticism
      • Yahoo!: Hadoop on a 10,000-server cluster
      • Facebook analyses the daily log (25 TB) on a 1,000-server cluster
      • Amazon Elastic MapReduce: Hadoop clusters for rent on EC2 and S3
      • IBM and Google: Support university courses in distributed programming
      • UC Berkeley announced that it would teach freshmen MapReduce programming
    • Reception and Criticism
      • Criticism mainly by RDBMS experts DeWitt and Stonebraker
      • MapReduce
          – is a step backwards in database access
          – is a poor implementation
          – is not novel
          – is missing features that are routinely provided by modern DBMSs
          – is incompatible with the DBMS tools
    • Reception and Criticism: Response to criticism
      • MapReduce is not an RDBMS
      • It is well suited to processing and structuring huge amounts of unstructured data
      • MapReduce's big innovation is that it enables distributing data processing across a network of cheap and possibly unreliable computers
    • Trends and Future Development
      Trend of utilizing MapReduce/Hadoop as a parallel database
      • Hive: Query language for Hadoop
      • HBase: Column-oriented distributed database (modeled after Google’s BigTable)
      • Map-Reduce-Merge: Adding merge to the paradigm allows implementing features of relational algebra
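To give a feel for why an extra merge step helps with relational operations, here is a small sketch in which two datasets are each map-reduced by a shared key and the two results are then joined. It is only loosely inspired by the Map-Reduce-Merge idea named on the slide and is not the algorithm from that work; the helper names and the sample data are hypothetical.

```python
from collections import defaultdict


def map_reduce(records, key_fn, value_fn, reduce_fn):
    """Tiny in-memory map-reduce: group mapped values by key, then reduce."""
    groups = defaultdict(list)
    for record in records:  # map: emit (key, value)
        groups[key_fn(record)].append(value_fn(record))
    return {key: reduce_fn(values) for key, values in groups.items()}  # reduce


def merge(left, right):
    """Merge step: join two reduced outputs on the keys they share."""
    return {key: (left[key], right[key]) for key in left.keys() & right.keys()}


employees = [("alice", "eng"), ("bob", "eng"), ("carol", "ops")]
expenses = [("eng", 100), ("eng", 50), ("sales", 70)]

headcount = map_reduce(employees, lambda r: r[1], lambda r: 1, sum)
spending = map_reduce(expenses, lambda r: r[0], lambda r: r[1], sum)

joined = merge(headcount, spending)
print(joined)  # {'eng': (2, 150)}
```

The merge behaves like an inner join: only the key present in both reduced outputs survives.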
    • Trends and Future Development
      Trend to use the MapReduce paradigm to better utilize multi-core CPUs
      • Qt Concurrent
          – Simplified C++ version of MapReduce for distributing tasks between multiple processor cores
      • Mars
      • MapReduce-Cell
    • Conclusion
      MapReduce
          – provides an easy solution for the processing of large amounts of data
          – brings a paradigm shift in programming
          – changed the world, i.e. made data processing more efficient and cheaper
          – is the foundation of many other approaches and solutions
    • Questions?
    • Thank You!