MapReduce
Transcript

  • 1. Introduction to MapReduce, Zuhair Khayyat, 3/11/2012
  • 2. What is MapReduce
    ● A programming model introduced by Google at OSDI '04 for processing large datasets efficiently.
    ● Features:
      – Automatic parallelization; no parallel programming experience required.
      – Data and process redundancy for failure recovery.
      – Automatic scheduling and load balancing.
      – Easy to program, based on two simple functions:
        ● Map
        ● Reduce
    CS245 - 2012 Introduction to MapReduce
  • 3. Why MapReduce?
    ● For a cluster of:
      – 2000 machines.
      – Total 16 TB RAM (≈ 8 GB each).
      – Total 2 PB disk space (≈ 1 TB each).
    ● Use the maximum capacity of the cluster to:
      – Implement a parallel word count for an input of size 100 TB.
  • 4. Why MapReduce?
    ● For a cluster of:
      – 2000 machines.
      – Total 16 TB RAM (≈ 8 GB each).
      – Total 2 PB disk space (≈ 1 TB each).
    ● Use the maximum capacity of the cluster to:
      – Implement a parallel word count for an input of size 100 TB.
      – Implement a parallel sort for the same input file.
    ● Can you use the same code for both applications?
  • 5. How Fast is MapReduce (Hadoop)
    ● Sort Benchmark competition (http://sortbenchmark.org/):
      – 2009: 100 TB in 173 minutes using 3452 nodes:
        ● 2 x quad-core Xeons @ 2.5 GHz.
        ● 8 GB RAM.
      – 2008: 1 TB in 3.48 minutes using 910 nodes:
        ● 4 x dual-core Xeons @ 2.0 GHz.
        ● 8 GB RAM.
  • 6. Who uses MapReduce?
  • 7. Map & Reduce functions
    ● The Mapper (pick a key):
      – Input: reads one partition of the input from disk.
      – Output: creates <key, value> pairs, known as intermediate pairs.
      – More input partitions == more parallel Mappers.
    ● The Reducer (process values):
      – Input: the list of intermediate <key, value> pairs that share one unique key.
      – Output: one or more <key, value> pairs.
      – More unique keys == more parallel Reducers.
  • 8. How MapReduce Works
    1) Partition the input file into M partitions.
    2) Create M Map tasks, which read the M partitions in parallel and emit intermediate <key, value> pairs. Store the pairs in local storage.
    3) Wait for all Map workers to finish, then sort and partition the intermediate <key, value> pairs into R regions.
    4) Start R reduce workers; each reads the list of intermediate pairs for a unique key from remote disks.
    5) Write the output of the reduce workers to file(s).
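
The five steps above can be sketched as a single-process simulation. This is only an illustration under stated assumptions, not how Hadoop or Google's framework is implemented: `split_input` and `run_mapreduce` are hypothetical names, tasks run sequentially instead of in parallel, and regions are chosen by hashing the key.

```python
from collections import defaultdict

def split_input(records, m):
    """Step 1: deal the input records round-robin into m partitions."""
    return [records[i::m] for i in range(m)]

def run_mapreduce(partitions, mapper, reducer, num_regions):
    """Run steps 2 to 5 in one process (a real framework runs tasks in parallel)."""
    # Step 2: one map task per partition; each emits intermediate (key, value) pairs.
    map_outputs = [list(mapper(partition)) for partition in partitions]

    # Step 3: only after every map task has finished, partition the pairs
    # into R regions (by hash of the key) and group the values of each key.
    regions = [defaultdict(list) for _ in range(num_regions)]
    for pairs in map_outputs:
        for key, value in pairs:
            regions[hash(key) % num_regions][key].append(value)

    # Step 4: one reduce task per region; it handles each of its keys in sorted order.
    # Step 5: collect the reducer results (a real job would write them to files).
    output = {}
    for region in regions:
        for key in sorted(region):
            output[key] = reducer(key, region[key])
    return output
```

The mapper receives a whole partition (a list of records) and the reducer receives one key together with all of its values, which mirrors the description of the two functions on the previous slide.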
  • 9. Example – Word count
    ● Assume the following input:
        cat flower picture
        snow cat cat
        prince flower sun
        king queen AC
  • 10. Example – Word count
    ● Step 1: Partition the input file into M partitions.
        Partition 1: cat flower picture
        Partition 2: snow cat cat
        Partition 3: prince flower sun
        Partition 4: king queen AC
  • 11. Example – Word count
    ● Step 2: Create M Map tasks, which read the M partitions in parallel and emit intermediate <key, value> pairs. Store the pairs in local storage.
        cat flower picture  ->  Mapper 1  ->  <cat,1> <flower,1> <picture,1>
        snow cat cat        ->  Mapper 2  ->  <snow,1> <cat,1> <cat,1>
        prince flower sun   ->  Mapper 3  ->  <prince,1> <flower,1> <sun,1>
        king queen AC       ->  Mapper 4  ->  <king,1> <queen,1> <AC,1>
  • 12. Example – Word count
    ● Step 3: Wait for all Map workers to finish, then sort and partition the intermediate <key, value> pairs into R regions.
      – Map output: <cat,1> <flower,1> <picture,1> <snow,1> <cat,1> <cat,1> <prince,1> <flower,1> <sun,1> <king,1> <queen,1> <AC,1>
      – Sorted: <AC,1> <cat,1> <cat,1> <cat,1> <flower,1> <flower,1> <king,1> <picture,1> <prince,1> <queen,1> <snow,1> <sun,1>
      – Partitioned by key: <AC,1> | <cat,1> <cat,1> <cat,1> | <flower,1> <flower,1> | <king,1> | <picture,1> | <prince,1> | <queen,1> | <snow,1> | <sun,1>
  • 13. Example – Word count
    ● Step 4: Start R reduce workers; each reads the list of intermediate pairs for a unique key from remote disks.
        <AC,1>                   ->  Reducer 1  ->  <AC,1>
        <cat,1> <cat,1> <cat,1>  ->  Reducer 2  ->  <cat,3>
        <flower,1> <flower,1>    ->  Reducer 3  ->  <flower,2>
        ... (<king,1>, <picture,1>, <prince,1>, <queen,1>, <snow,1>) ...
        <sun,1>                  ->  Reducer 9  ->  <sun,1>
  • 14. Example – Word count
    ● Step 5: Write the output of the reduce workers to file(s).
        <AC,1> <cat,3> <flower,2> <king,1> <picture,1> <prince,1> <queen,1> <snow,1> <sun,1>
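
The word count example can be reproduced with a few lines of Python. This is a minimal sketch that uses the example input from the slides; the names `map_words` and `reduce_counts` are illustrative, and the sort-then-group step stands in for the shuffle that a real framework performs across machines.

```python
from itertools import groupby
from operator import itemgetter

# The four input partitions from the example.
partitions = [
    "cat flower picture",
    "snow cat cat",
    "prince flower sun",
    "king queen AC",
]

def map_words(partition):
    """Mapper: emit an intermediate <word, 1> pair for every word."""
    return [(word, 1) for word in partition.split()]

def reduce_counts(word, ones):
    """Reducer: sum all values that share one key."""
    return (word, sum(ones))

# Map every partition, then sort the intermediate pairs by key.
intermediate = [pair for p in partitions for pair in map_words(p)]
intermediate.sort(key=itemgetter(0))

# Group pairs with the same key and reduce each group.
result = [
    reduce_counts(word, [value for _, value in group])
    for word, group in groupby(intermediate, key=itemgetter(0))
]
print(result)
# [('AC', 1), ('cat', 3), ('flower', 2), ('king', 1), ('picture', 1),
#  ('prince', 1), ('queen', 1), ('snow', 1), ('sun', 1)]
```

"AC" sorts first because Python compares strings by code point, so uppercase letters come before lowercase ones.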
  • 15. MapReduce framework
  • 16. MapReduce Failure Recovery
    ● The framework follows the master-worker paradigm.
    ● The master keeps a record of the work done on each worker.
    ● If a worker fails, the master assigns the same work to another worker.
    ● If a worker is late, another copy of the same work is assigned to another worker.
    ● If the master fails, a backup copy of the master can pick up and continue execution from the last checkpoint.
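
The worker-failure rule can be illustrated with a toy master loop. This is a sketch under simplifying assumptions, not the scheduler of any real framework: `run_with_reassignment` is a hypothetical name, a failure is modelled as a `RuntimeError`, tasks run one at a time, and neither late workers nor master checkpoints are covered.

```python
def run_with_reassignment(tasks, workers, execute):
    """Toy master: try each task on the workers in turn until one succeeds.

    `execute(worker, task)` is assumed to raise RuntimeError when the worker fails.
    Returns the master's record: task -> (worker that finished it, result).
    """
    record = {}
    for task in tasks:
        for worker in workers:
            try:
                record[task] = (worker, execute(worker, task))
                break  # task finished; move on to the next one
            except RuntimeError:
                continue  # worker failed; assign the same task to another worker
        else:
            raise RuntimeError(f"no worker could finish task {task!r}")
    return record
```

Because a map or reduce task only reads its own input and writes its own output, running the same task again on a different worker gives the same result, which is what makes this simple re-execution strategy safe.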
  • 17. Advantages of MapReduce
    ● Parallel IO: hides disk latency.
    ● Parallel processing:
      – Map functions work independently in parallel, each processing one unique partition.
      – Reduce functions work independently in parallel, each on a unique intermediate key.
    ● Using large clusters of commodity machines gives better results than small expensive clusters.
  • 18. Advantages of MapReduce
    ● Parallel IO: hides disk latency.
    ● Parallel processing:
      – Map functions work independently in parallel, each processing one unique partition.
      – Reduce functions work independently in parallel, each on a unique intermediate key.
    ● Using large clusters of commodity machines gives results comparable to those of small expensive clusters.
  • 19. Hadoop vs. others
    ● Algorithm: sorting 100 TB of data.
        System       Hadoop                        DEMSort                       TritonSort
        Node count   3452                          195                           47
        Processor    2x quad-core Xeons @ 2.5 GHz  2x quad-core Xeons @ 2.6 GHz  2x quad-core Xeons @ 2.27 GHz
        Memory       8 GB                          16 GB                         24 GB
        Network      1 Gigabit Ethernet            InfiniBand                    10 Gigabit Fiber
        Throughput   0.578 TB/min                  0.564 TB/min                  0.582 TB/min
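
The total throughput of the three systems is nearly identical, but the node counts differ widely. A short calculation makes the per-node difference explicit. It uses only the numbers from the table and from the 2009 Sort Benchmark entry above; the conversion 1 TB = 1000 GB is an assumption, and `per_node_gb_per_min` is an illustrative helper.

```python
# (node count, throughput in TB/min), copied from the comparison table.
systems = {
    "Hadoop": (3452, 0.578),
    "DEMSort": (195, 0.564),
    "TritonSort": (47, 0.582),
}

GB_PER_TB = 1000  # assumption: decimal units

def per_node_gb_per_min(nodes, tb_per_min):
    """Throughput contributed by a single node, in GB per minute."""
    return tb_per_min * GB_PER_TB / nodes

per_node = {name: per_node_gb_per_min(*row) for name, row in systems.items()}

# Cross-check: 100 TB in 173 minutes is the Hadoop figure in the table.
hadoop_check = 100 / 173

for name, value in per_node.items():
    print(f"{name}: {value:.3f} GB/min per node")
```

Under these assumptions each Hadoop node sorts about 0.17 GB per minute, against about 2.9 for DEMSort and about 12.4 for TritonSort, so a TritonSort node does roughly 74 times the work of a Hadoop node.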
  • 20. MapReduce weak points
    ● The overhead of MapReduce is huge.
    ● Data-dependent applications may need multiple iterations of MapReduce, for example:
      – K-means.
      – PageRank.
    ● Complex algorithms can be very hard to implement, for example:
      – Range queries.
    ● Sensitive to skewed distributions of <key, value> pairs.
  • 21. Implementations of MapReduce
    ● Hadoop in Java.
    ● Mars in C++ & CUDA.
    ● Skynet in Ruby.
    ● Phoenix in C++.
    ● Microsoft Dryad:
      – Schedules multiple levels of "MapReduce"-like operations.
  • 22. MapReduce in Database
  • 23. MapReduce in Database - Ex1
    ● Select Name from Students where age = 23;
        Students:
        Name   ID    Age
        Ahmed  1177  23
        Bob    1131  20
        Sara   1197  22
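
One possible way to express Ex1 is a map-only job: the mapper applies the filter and the projection, and no reducer is needed. This is a sketch, not necessarily the solution intended in the lecture; `map_select` is an illustrative name and the rows are taken from the Students table above.

```python
# Rows of the Students table: (Name, ID, Age).
students = [("Ahmed", 1177, 23), ("Bob", 1131, 20), ("Sara", 1197, 22)]

def map_select(row):
    """Mapper: keep rows with age = 23 and project the Name column."""
    name, _student_id, age = row
    if age == 23:
        yield name

# Map-only job: the output is the concatenation of all mapper outputs.
selected = [name for row in students for name in map_select(row)]
print(selected)  # ['Ahmed']
```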
  • 24. MapReduce in Database - Ex2
    ● Select COUNT(Name) from Students where age > 20 group by Name;
        Students:
        Name   ID    Age
        Ahmed  1177  23
        Bob    1131  20
        Sara   1197  22
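
Ex2 has the same shape as word count: the mapper applies the filter and emits the grouping column as the key, and the reducer counts the values of each key. The following is a sketch of one possible formulation, with illustrative function names and the rows of the Students table above.

```python
from collections import defaultdict

# Rows of the Students table: (Name, ID, Age).
students = [("Ahmed", 1177, 23), ("Bob", 1131, 20), ("Sara", 1197, 22)]

def map_group(row):
    """Mapper: filter on age > 20 and emit <Name, 1>."""
    name, _student_id, age = row
    if age > 20:
        yield name, 1

def reduce_count(name, ones):
    """Reducer: COUNT(Name) for one group."""
    return sum(ones)

# Group the intermediate pairs by key, then reduce each group.
groups = defaultdict(list)
for row in students:
    for key, value in map_group(row):
        groups[key].append(value)

counts = {name: reduce_count(name, ones) for name, ones in groups.items()}
print(counts)  # {'Ahmed': 1, 'Sara': 1}
```

Bob is dropped by the mapper because his age is exactly 20, which does not satisfy age > 20.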
  • 25. MapReduce in Database - Ex3
    ● Select Name, Term from Students, Enrolment where ID = SID and age != 20;
        Students:
        Name   ID    Age
        Ahmed  1177  23
        Bob    1131  20
        Sara   1197  22
        Enrolment:
        CID      SID   Term
        CS290    1177  042
        CS260    1177  052
        ME222    1131  051
        AMCS220  1197  051
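
An equi-join such as Ex3 is commonly expressed as a reduce-side join: both tables are mapped to the join key, each value is tagged with its table of origin, and the reducer pairs the two sides. The sketch below is one such formulation, not necessarily the one intended in the lecture; the tags "S" and "E" and the function names are illustrative, and Term is kept as a string to preserve the leading zero.

```python
from collections import defaultdict

# Students: (Name, ID, Age). Enrolment: (CID, SID, Term).
students = [("Ahmed", 1177, 23), ("Bob", 1131, 20), ("Sara", 1197, 22)]
enrolment = [
    ("CS290", 1177, "042"),
    ("CS260", 1177, "052"),
    ("ME222", 1131, "051"),
    ("AMCS220", 1197, "051"),
]

def map_student(row):
    """Mapper for Students: apply age != 20, key by ID, tag the value with 'S'."""
    name, student_id, age = row
    if age != 20:
        yield student_id, ("S", name)

def map_enrolment(row):
    """Mapper for Enrolment: key by SID, tag the value with 'E'."""
    _cid, sid, term = row
    yield sid, ("E", term)

def reduce_join(_key, tagged_values):
    """Reducer: pair every student name with every term that shares the key."""
    names = [value for tag, value in tagged_values if tag == "S"]
    terms = [value for tag, value in tagged_values if tag == "E"]
    return [(name, term) for name in names for term in terms]

# Shuffle: group the tagged values of both tables by join key.
groups = defaultdict(list)
for row in students:
    for key, value in map_student(row):
        groups[key].append(value)
for row in enrolment:
    for key, value in map_enrolment(row):
        groups[key].append(value)

joined = [pair for key in sorted(groups) for pair in reduce_join(key, groups[key])]
print(joined)  # [('Ahmed', '042'), ('Ahmed', '052'), ('Sara', '051')]
```

This works because equal keys always meet in the same reducer. A condition such as ID != SID gives no such guarantee, which is what makes the later exercises harder.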
  • 26. MapReduce in Database - Ex4
    ● Select Name, Term from Students, Enrolment where ID != SID;
        Students:
        Name   ID    Age
        Ahmed  1177  23
        Bob    1131  20
        Sara   1197  22
        Enrolment:
        CID      SID   Term
        CS290    1177  042
        CS260    1177  052
        ME222    1131  051
        AMCS220  1197  051
    ● What if the condition is ID > SID?
  • 27. MapReduce in Database - Ex5
    ● Select Name, Term from Students, Enrolment where ID = SID and Admission != Term;
        Students:
        Name   ID    Age  Admission
        Ahmed  1177  23   042
        Bob    1131  20   051
        Sara   1197  22   042
        Enrolment:
        CID      SID   Term
        CS290    1177  042
        CS260    1177  052
        ME222    1131  051
        AMCS220  1197  051
  • 28. MapReduce in Database - Ex6
    ● Select y from R, S, T where R.x = S.x and T.a = S.a;
        R: columns x, y, z
        S: columns a, b, x
        T: columns m, n, a
  • 29. MapReduce in Academic Papers
    ● NIPS 07: Map-Reduce for Machine Learning on Multicore.
    ● Escience 08: CloudBLAST: Combining MapReduce and Virtualization on Distributed Resources for Bioinformatics Applications.
    ● KDD 09: Large-scale behavioral targeting.
    ● GCC 09: Spatial Queries Evaluation with MapReduce.
    ● SIGIR 09: On single-pass indexing with MapReduce.
    ● MDAC 10: A novel approach to multiple sequence alignment using hadoop data grids.
    ● VLDB Endowment 11: Social Content Matching in MapReduce.
    ● VLDB 12: Building Wavelet Histograms on Large Data in MapReduce.
  • 30. Links
    ● http://code.google.com/edu/parallel/mapreduce-tutorial.html
    ● http://hadoop.apache.org/mapreduce/
    ● http://www.cse.ust.hk/gpuqp/Mars.html
    ● http://skynet.rubyforge.org/
    ● http://mapreduce.stanford.edu/
    ● http://wiki.apache.org/hadoop/PoweredBy
    ● http://atbrox.com/2011/05/16/mapreduce-hadoop-algorithms-in-academic-papers-4th-update-may-2011/