Map and Reduce
Map & Reduce, presentation at RWTH

  • These days the web is all about data. All major and important websites rely on huge amounts of data in some form in order to provide services to users. For example Google … and Facebook …. Facilities like the LHC will also produce data measured in petabytes each year. However, it takes about 2.5 hours to read one terabyte off a typical hard drive. The solution that comes immediately to mind, of course, is going parallel. Concrete example [TODO], [context on Cloud Computing]
  • Parallel programming is still hard. Programmers have to deal with a lot of boilerplate code and have to write code manually for things like scheduling and load balancing. People also want to use the company cluster in parallel, so something like a batch system is needed. As more and more companies work with huge amounts of data, a kind of standard framework has emerged in recent years: the Map/Reduce framework.
  • Map/Reduce has been known for years as a functional programming concept
  • Actual execution and scheduling
  • http://www4.informatik.uni-erlangen.de/Lehre/WS10/V_MW/Uebung/folien/05-Map-Reduce-Framework.pdf
  • Transcript

    • 1. Map & Reduce
      Christopher Schleiden, Christian Corsten, Michael Lottko, Jinhui Li
      1
      The slides are licensed under a Creative Commons Attribution 3.0 License
    • 2. Outline
      Motivation
      Concept
      Parallel Map & Reduce
      Google’s MapReduce
      Example: Word Count
      Demo: Hadoop
      Summary
      Web Technologies
      2
    • 3. Today the web is all about data!
      Google
      Processing of 20 PB/day (2008)
      LHC
      Will generate about 15 PB/year
      Facebook
      2.5 PB of data
      + 15 TB/day (4/2009)
      3
      BUT: It takes ~2.5 hours to read one terabyte off a typical hard disk!
    • 4. 4
      Solution: Going Parallel!
      However, parallel programming is hard!
      Data Distribution
      Synchronization
      Load Balancing

    • 5. Map & Reduce
      Programming model and Framework
      Designed for large volumes of data in parallel
      Based on functional map and reduce concept
      i.e., the output of a function depends only on its input; there are no side-effects
      5
    • 6. Functional Concept
      Map
      Apply function to each value of a sequence
      map(k, v) → <k’, v’>*
      Reduce/Fold
      Combine all elements of a sequence using binary operator
      reduce(k’, <v’>*) → <k’, v’>*
      6
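The two functional primitives can be tried directly in Python, whose built-in `map` and `functools.reduce` match the concept on this slide (the sample sequence is just an illustration):

```python
from functools import reduce

# Map: apply a function to each value of a sequence (no side effects)
squares = list(map(lambda x: x * x, [1, 2, 3, 4]))  # [1, 4, 9, 16]

# Reduce/fold: combine all elements using a binary operator
total = reduce(lambda acc, x: acc + x, squares, 0)  # 30
```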
    • 7. Typical problem
      Iterate over a large number of records
      Extract something interesting (Map)
      Shuffle & sort intermediate results
      Aggregate intermediate results (Reduce)
      Write final output
      7
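As a rough sketch, the steps above can be walked through in plain Python; the record format (hypothetical HTTP-log-style lines) and the "interesting" field extracted here are made up for illustration:

```python
from collections import defaultdict

# Hypothetical input records: HTTP-log-style lines
records = ["GET /index", "GET /about", "POST /login", "GET /index"]

# 1. Iterate over records and 2. extract something interesting (the method) -> Map
intermediate = [(line.split()[0], 1) for line in records]

# 3. Shuffle & sort the intermediate results by key
groups = defaultdict(list)
for key, value in sorted(intermediate):
    groups[key].append(value)

# 4. Aggregate each group and 5. write the final output -> Reduce
output = {key: sum(values) for key, values in groups.items()}
# {'GET': 3, 'POST': 1}
```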
    • 8. Parallel Map & Reduce
      8
    • 9. Parallel Map & Reduce
      Published (2004) and patented (2010) by Google Inc
      C++ Runtime with Bindings to Java/Python
      Other Implementations:
      Apache Hadoop/Hive project (Java)
      Developed at Yahoo!
      Used by:
      Facebook
      Hulu
      IBM
      And many more
      Microsoft COSMOS (Scope, based on SQL and C#)
      Starfish (Ruby)

      9
    • 10. Parallel Map & Reduce /2
      Parallel execution of Map and Reduce stages
      Scheduling through Master/Worker pattern
      Runtime handles:
      Assigning workers to map and reduce tasks
      Data distribution
      Detects crashed workers
      10
    • 11. Parallel Map & Reduce Execution
      11
      [Diagram: input DATA flows through parallel Map tasks, Shuffle & Sort, and parallel Reduce tasks to the output RESULT]
    • 12. Components in Google’s MapReduce
      12
    • 13. Google Filesystem (GFS)
      Stores…
      Input data
      Intermediate results
      Final results
      …in 64MB chunks on at least three different machines
      13
      [Diagram: a file split into chunks spread across nodes]
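A small back-of-the-envelope sketch (not a GFS API; the function name is hypothetical) of what 64 MB chunking with three-fold replication means for storage:

```python
CHUNK_SIZE = 64 * 2**20  # 64 MB chunks, as on the slide
REPLICAS = 3             # each chunk stored on at least three machines

def chunk_layout(file_size_bytes):
    """Return (number of chunks, total chunk copies across the cluster)."""
    n_chunks = -(-file_size_bytes // CHUNK_SIZE)  # ceiling division
    return n_chunks, n_chunks * REPLICAS

# A 1 GB file occupies 16 chunks, hence 48 chunk copies cluster-wide
```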
    • 14. Scheduling (Master/Worker)
      One master, many workers
      Input data split into M map tasks (~64 MB in size; GFS)
      Reduce phase partitioned into R tasks
      Tasks are assigned to workers dynamically
      Master assigns each map task to a free worker
      Master assigns each reduce task to a free worker
      Fault handling via redundancy
      Master checks if a worker is still alive via heart-beat
      Reschedules the work item if a worker has died
      14
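The master's bookkeeping described above can be sketched as follows; all names (`Master`, `TIMEOUT`, the task strings) are hypothetical, and a real implementation would use RPC between machines rather than direct method calls:

```python
import time

TIMEOUT = 10.0  # seconds without a heart-beat before a worker counts as dead

class Master:
    def __init__(self, tasks):
        self.pending = list(tasks)  # map/reduce tasks not yet assigned
        self.running = {}           # worker_id -> (task, last_heartbeat)

    def assign(self, worker_id):
        """Hand the next pending task to a free worker, if any is left."""
        if self.pending:
            task = self.pending.pop()
            self.running[worker_id] = (task, time.time())
            return task
        return None

    def heartbeat(self, worker_id):
        """Record that this worker is still alive."""
        task, _ = self.running[worker_id]
        self.running[worker_id] = (task, time.time())

    def reap_dead_workers(self):
        """Treat workers with stale heart-beats as crashed and reschedule their tasks."""
        now = time.time()
        for worker_id, (task, last) in list(self.running.items()):
            if now - last > TIMEOUT:
                del self.running[worker_id]
                self.pending.append(task)  # reschedule the work item
```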
    • 15. Scheduling Example
      15
      [Diagram: the Master assigns map tasks and reduce tasks to Workers; input DATA flows through map Workers to temp files, then through reduce Workers to the output RESULT]
    • 16. Google’s M&R vs. Hadoop
      Google MapReduce
      Main language: C++
      Google Filesystem (GFS)
      GFS master
      GFS chunkserver
      Hadoop MapReduce
      Main language: Java
      Hadoop Filesystem (HDFS)
      Hadoop namenode
      Hadoop datanode
      16
    • 17. Word Count
      The Map & Reduce “Hello World” example
      17
    • 18. Word Count - Input
      Set of text files:
      Expected Output:
      sweet (1), this (2), is (2), the (2), foo (1), bar (1), file (1)
      18
      bar.txt
      This is the bar file
      foo.txt
      Sweet, this is the foo file
    • 19. Word Count - Map
      mapper(filename, file-contents):
        for each word in file-contents:
          emit(word, 1)
      Output
      this (1)
      is (1)
      the (1)
      sweet (1)
      this (1)
      the (1)
      is (1)
      foo (1)
      bar (1)
      file (1)
      19
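A runnable version of the mapper pseudocode, assuming simple lowercasing and comma-stripped whitespace tokenization (the slide does not specify how words are split):

```python
def mapper(filename, contents):
    """Emit a (word, 1) pair for every word in the file."""
    for word in contents.lower().replace(",", "").split():
        yield (word, 1)

pairs = list(mapper("foo.txt", "Sweet, this is the foo file"))
# [('sweet', 1), ('this', 1), ('is', 1), ('the', 1), ('foo', 1), ('file', 1)]
```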
    • 20. Word Count – Shuffle Sort
      this (1)
      is (1)
      the (1)
      sweet (1)
      this (1)
      the (1)
      is (1)
      foo (1)
      bar (1)
      file (1)
      this (1)
      this (1)
      is (1)
      is (1)
      the (1)
      the (1)
      sweet (1)
      foo (1)
      bar (1)
      file (1)
      20
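The shuffle & sort stage shown above amounts to sorting the intermediate pairs and grouping equal keys; in Python this can be sketched with `itertools.groupby`:

```python
from itertools import groupby
from operator import itemgetter

# Intermediate (word, 1) pairs as emitted by the two mappers on the slide
pairs = [("this", 1), ("is", 1), ("the", 1), ("sweet", 1), ("this", 1),
         ("the", 1), ("is", 1), ("foo", 1), ("bar", 1), ("file", 1)]

# Sort by key, then group equal keys together
grouped = {key: [v for _, v in group]
           for key, group in groupby(sorted(pairs), key=itemgetter(0))}
# grouped["this"] == [1, 1], grouped["sweet"] == [1], ...
```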
    • 21. Word Count - Reduce
      reducer(word, values):
        sum = 0
        for each value in values:
          sum = sum + value
        emit(word, sum)
      Output
      sweet (1)
      this (2)
      is (2)
      the (2)
      foo (1)
      bar (1)
      file (1)
      21
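A runnable version of the reducer, applied to the grouped pairs produced by the shuffle stage; the dictionary literal just restates the slide's intermediate data:

```python
def reducer(word, values):
    """Sum all counts for one word, as in the pseudocode above."""
    return (word, sum(values))

grouped = {"this": [1, 1], "is": [1, 1], "the": [1, 1],
           "sweet": [1], "foo": [1], "bar": [1], "file": [1]}
output = dict(reducer(w, vs) for w, vs in grouped.items())
# {'this': 2, 'is': 2, 'the': 2, 'sweet': 1, 'foo': 1, 'bar': 1, 'file': 1}
```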
    • 22. DEMO
      Hadoop – Word Count
      22
    • 23. Summary
      Lots of data processed on the web (e.g., Google)
      Performance solution: Go parallel
      Input, Map, Shuffle & Sort, Reduce, Output
      Google File System
      Scheduling: Master/Worker
      Word Count example
      Hadoop
      Questions?
      23
    • 24. References
      Inspirations for presentation
      http://www4.informatik.uni-erlangen.de/Lehre/WS10/V_MW/Uebung/folien/05-Map-Reduce-Framework.pdf
      http://www.scribd.com/doc/23844299/Map-Reduce-Hadoop-Pig
      RWTH Map Reduce Talk: http://bit.ly/f5oM7p
      Paper
      Dean, J., Ghemawat, S.: MapReduce: Simplified Data Processing on Large Clusters. OSDI '04: Sixth Symposium on Operating System Design and Implementation, San Francisco, CA, December 2004.
      Ghemawat, S., Gobioff, H., Leung, S.-T.: The Google File System. 19th ACM Symposium on Operating Systems Principles, Lake George, NY, October 2003.
      24