Map and Reduce

Map & Reduce, presentation at RWTH

Notes for slides:
  • These days the web is all about data. All major websites rely on huge amounts of data in some form in order to provide services to their users, for example Google … and Facebook …. Facilities like the LHC will also produce data measured in petabytes each year. However, it takes about 2.5 hours to read one terabyte off a typical hard drive. The solution that comes immediately to mind, of course, is going parallel. [TODO: concrete example], [context for cloud computing]
  • Parallel programming is still hard. Programmers have to deal with a lot of boilerplate code and have to manually write code for things like scheduling and load balancing. Also, people want to use the company cluster in parallel, so something like a batch system is needed. As more and more companies work with huge amounts of data, a kind of standard framework or platform has emerged in recent years, and that is the Map/Reduce framework.
  • Map and Reduce have been known for years as concepts from functional programming
  • Actual execution and scheduling
  • http://www4.informatik.uni-erlangen.de/Lehre/WS10/V_MW/Uebung/folien/05-Map-Reduce-Framework.pdf
  • Transcript

    • 1. Map & Reduce
      Christopher Schleiden, Christian Corsten, Michael Lottko, Jinhui Li
      The slides are licensed under a Creative Commons Attribution 3.0 License
    • 2. Outline
      Motivation
      Concept
      Parallel Map & Reduce
      Google’s MapReduce
      Example: Word Count
      Demo: Hadoop
      Summary
    • 3. Today the web is all about data!
      Google: processing of 20 PB/day (2008)
      LHC: will generate about 15 PB/year
      Facebook: 2.5 PB of data, +15 TB/day (4/2009)
      BUT: it takes ~2.5 hours to read one terabyte off a typical hard disk!
    • 4. Solution: Going Parallel!
      However, parallel programming is hard:
      Data Distribution
      Synchronization
      Load Balancing
      …
    • 5. Map & Reduce
      Programming model and framework
      Designed for processing large volumes of data in parallel
      Based on the functional map and reduce concept, i.e., the output of a function depends only on its input and there are no side effects
    • 6. Functional Concept
      Map: apply a function to each value of a sequence
      map(k, v) → <k’, v’>*
      Reduce/Fold: combine all elements of a sequence using a binary operator
      reduce(k’, <v’>*) → <k’, v’>*
      (A short Python sketch follows below.)
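      A minimal sketch of the functional concept in plain Python (not part of the original slides; the sequence and functions are invented for illustration):

          from functools import reduce

          values = [1, 2, 3, 4]

          # Map: apply a function to each value of a sequence
          squared = list(map(lambda v: v * v, values))        # [1, 4, 9, 16]

          # Reduce/fold: combine all elements using a binary operator
          total = reduce(lambda acc, v: acc + v, squared, 0)  # 30

          print(squared, total)

      Because each function's output depends only on its input, the map step can process the elements in any order, which is exactly what makes the model easy to parallelize.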
    • 7. Typical problem
      Map: iterate over a large number of records and extract something interesting
      Shuffle & sort intermediate results
      Reduce: aggregate intermediate results and write final output
    • 8. Parallel Map & Reduce
    • 9. Parallel Map & Reduce
      Published (2004) and patented (2010) by Google Inc.
      C++ runtime with bindings to Java/Python
      Other implementations:
      Apache Hadoop/Hive project (Java), developed at Yahoo!; used by Facebook, Hulu, IBM, and many more
      Microsoft COSMOS (Scope, based on SQL and C#)
      Starfish (Ruby)
      …
    • 10. Parallel Map & Reduce /2
      Parallel execution of the Map and Reduce stages
      Scheduling through the Master/Worker pattern
      The runtime handles:
      Assigning workers to map and reduce tasks
      Data distribution
      Detecting crashed workers
    • 11. Parallel Map & Reduce Execution
      [Diagram: input data flows through parallel Map tasks, Shuffle & Sort, and parallel Reduce tasks to the output result]
    • 12. Components in Google’s MapReduce
    • 13. Google Filesystem (GFS)
      Stores input data, intermediate results, and final results in 64 MB chunks on at least three different machines
      [Diagram: a file split into chunks distributed across nodes]
      (A toy placement sketch follows below.)
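      A toy sketch of the chunk placement idea, assuming a hypothetical round-robin policy and invented machine names (GFS's real placement policy is more involved):

          import itertools

          CHUNK_SIZE = 64 * 2**20                 # 64 MB, as on the slide
          machines = ["node-a", "node-b", "node-c", "node-d"]

          def place_chunks(file_size, replicas=3):
              # Split a file into 64 MB chunks and assign each chunk to
              # `replicas` distinct machines (at least three, per the slide).
              n_chunks = -(-file_size // CHUNK_SIZE)  # ceiling division
              ring = itertools.cycle(machines)
              return {i: [next(ring) for _ in range(replicas)]
                      for i in range(n_chunks)}

          print(place_chunks(200 * 2**20))  # a 200 MB file -> 4 chunks x 3 replicas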
    • 14. Scheduling (Master/Worker)
      One master, many workers
      Input data is split into M map tasks (~64 MB in size; GFS)
      The reduce phase is partitioned into R tasks
      Tasks are assigned to workers dynamically:
      The master assigns each map task to a free worker
      The master assigns each reduce task to a free worker
      Fault handling via redundancy:
      The master checks whether a worker is still alive via heartbeat
      It reschedules the work item if the worker has died
      (A minimal sketch of this loop follows below.)
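      A minimal, single-process sketch of the master/worker loop (class and method names are invented for illustration; the real runtime is distributed):

          import time

          class Master:
              HEARTBEAT_TIMEOUT = 5.0  # seconds of silence before a worker is presumed dead

              def __init__(self, map_tasks):
                  self.pending = list(map_tasks)  # tasks not yet assigned
                  self.assigned = {}              # worker_id -> task
                  self.last_heartbeat = {}        # worker_id -> timestamp

              def assign_task(self, worker_id):
                  # Hand the next pending task to a free worker ("assign map").
                  self.heartbeat(worker_id)
                  if self.pending and worker_id not in self.assigned:
                      task = self.pending.pop(0)
                      self.assigned[worker_id] = task
                      return task
                  return None

              def heartbeat(self, worker_id):
                  # Workers ping the master periodically to signal they are alive.
                  self.last_heartbeat[worker_id] = time.time()

              def reap_dead_workers(self):
                  # Re-queue the tasks of workers that stopped sending heartbeats.
                  now = time.time()
                  for worker_id, ts in list(self.last_heartbeat.items()):
                      if now - ts > self.HEARTBEAT_TIMEOUT and worker_id in self.assigned:
                          self.pending.append(self.assigned.pop(worker_id))

          master = Master(["split-0", "split-1", "split-2"])
          print(master.assign_task("worker-a"))  # -> 'split-0'

      The same pattern covers the reduce phase: once all map tasks are done, the master fills `pending` with the R reduce tasks and hands them out the same way.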
    • 15. Scheduling Example
      [Diagram: the master assigns map tasks and reduce tasks to workers; input data flows through map workers to temporary files, then through reduce workers to the result]
    • 16. Google’s M&R vs. Hadoop
      Google MapReduce: main language C++; Google Filesystem (GFS), with GFS master and GFS chunkserver
      Hadoop MapReduce: main language Java; Hadoop Filesystem (HDFS), with Hadoop namenode and Hadoop datanode
    • 17. Word Count
      The Map & Reduce “Hello World” example
    • 18. Word Count – Input
      Set of text files:
      foo.txt: “Sweet, this is the foo file”
      bar.txt: “This is the bar file”
      Expected output:
      sweet (1), this (2), is (2), the (2), foo (1), bar (1), file (2)
    • 19. Word Count – Map
      mapper(filename, file-contents):
          for each word in file-contents:
              emit(word, 1)
      Output:
      this (1), is (1), the (1), sweet (1), this (1), the (1), is (1), foo (1), bar (1), file (1), file (1)
      (A runnable Python version follows below.)
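      A runnable Python version of this mapper (a sketch; the tokenization regex is an assumption, not from the slides):

          import re

          def mapper(filename, contents):
              # Emit (word, 1) for every word in the file.
              for word in re.findall(r"[a-z]+", contents.lower()):
                  yield (word, 1)

          print(list(mapper("foo.txt", "Sweet, this is the foo file")))
          # [('sweet', 1), ('this', 1), ('is', 1), ('the', 1), ('foo', 1), ('file', 1)]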
    • 20. Word Count – Shuffle & Sort
      Before (map output):
      this (1), is (1), the (1), sweet (1), this (1), the (1), is (1), foo (1), bar (1), file (1), file (1)
      After (grouped by key):
      this (1), this (1); is (1), is (1); the (1), the (1); sweet (1); foo (1); bar (1); file (1), file (1)
      (A grouping sketch follows below.)
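      The shuffle step can be sketched as grouping the intermediate pairs by key, which is what the framework does between map and reduce:

          from collections import defaultdict

          def shuffle(pairs):
              # Sort intermediate (word, count) pairs and group the counts by word.
              groups = defaultdict(list)
              for word, count in sorted(pairs):
                  groups[word].append(count)
              return groups.items()  # e.g. ('bar', [1]), ('file', [1, 1]), ('foo', [1]), ...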
    • 21. Word Count – Reduce
      reducer(word, values):
          sum = 0
          for each value in values:
              sum = sum + value
          emit(word, sum)
      Output:
      sweet (1), this (2), is (2), the (2), foo (1), bar (1), file (2)
      (A runnable version, combined with the sketches above, follows below.)
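      A runnable reducer, plus the whole pipeline end to end, reusing the `mapper` and `shuffle` sketches above:

          def reducer(word, values):
              # Sum the counts collected for one word.
              return (word, sum(values))

          files = {"foo.txt": "Sweet, this is the foo file",
                   "bar.txt": "This is the bar file"}

          pairs = [p for name, text in files.items() for p in mapper(name, text)]
          result = dict(reducer(word, counts) for word, counts in shuffle(pairs))
          print(result)
          # {'bar': 1, 'file': 2, 'foo': 1, 'is': 2, 'sweet': 1, 'the': 2, 'this': 2}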
    • 22. DEMO
      Hadoop – Word Count
    • 23. Summary
      Lots of data is processed on the web (e.g., by Google)
      Performance solution: go parallel
      Input, Map, Shuffle & Sort, Reduce, Output
      Google File System
      Scheduling: Master/Worker
      Word Count example
      Hadoop
      Questions?
    • 24. References
      Inspirations for the presentation:
      http://www4.informatik.uni-erlangen.de/Lehre/WS10/V_MW/Uebung/folien/05-Map-Reduce-Framework.pdf
      http://www.scribd.com/doc/23844299/Map-Reduce-Hadoop-Pig
      RWTH Map Reduce talk: http://bit.ly/f5oM7p
      Papers:
      Dean et al., “MapReduce: Simplified Data Processing on Large Clusters”, OSDI ’04: Sixth Symposium on Operating System Design and Implementation, San Francisco, CA, December 2004.
      Ghemawat et al., “The Google File System”, 19th ACM Symposium on Operating Systems Principles, Lake George, NY, October 2003.
