MapReduce: Simplified Data Processing on Large Clusters


Simplified Data Processing on Large Clusters by MapReduce

Published in: Education, Technology

  1. MapReduce: Simplified Data Processing on Large Clusters. Jeffrey Dean and Sanjay Ghemawat. Presented by Ashraf Uddin, South Asian University, 11 February 2014
  2. MapReduce ● A programming model and an associated implementation for processing and generating large datasets ● Programs written in this model are automatically parallelized ● The runtime takes care of – Partitioning the input data – Scheduling the program's execution – Handling machine failures – Managing inter-machine communication
  3. MapReduce: Programming Model ● MapReduce expresses the computation as two functions: Map and Reduce ● Map: an input key/value pair --> a set of intermediate key/value pairs ● Reduce: an intermediate key and all values for that key --> output values
  4. MapReduce: Examples ● Word frequency in a large collection of documents ● Distributed grep ● Count of URL access frequency ● Reverse web-link graph ● Inverted index ● Distributed sort ● Term vector per host
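The word frequency example can be sketched in a few lines of Python. This is a single-process illustration only: the names `map_fn`, `reduce_fn`, and `run_word_count` are hypothetical, and the in-memory grouping stands in for the work a real MapReduce library does across machines.

```python
from collections import defaultdict
from typing import Dict, Iterable, Iterator, List, Tuple


def map_fn(doc_name: str, contents: str) -> Iterator[Tuple[str, int]]:
    """Map: one input pair -> intermediate (word, 1) pairs."""
    for word in contents.split():
        yield word.lower(), 1


def reduce_fn(word: str, counts: Iterable[int]) -> int:
    """Reduce: one intermediate key and all of its values -> one output value."""
    return sum(counts)


def run_word_count(documents: Dict[str, str]) -> Dict[str, int]:
    # Group intermediate values by key. A real system does this step
    # across many machines; here it is a plain dictionary.
    grouped: Dict[str, List[int]] = defaultdict(list)
    for name, contents in documents.items():
        for word, count in map_fn(name, contents):
            grouped[word].append(count)
    return {word: reduce_fn(word, counts) for word, counts in grouped.items()}


docs = {"a.txt": "the quick fox", "b.txt": "The lazy dog the end"}
word_counts = run_word_count(docs)
```

The user writes only `map_fn` and `reduce_fn`; everything in `run_word_count` is what the framework would supply.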
  5. Implementation ● Many different implementations of the MapReduce interface are possible ● The right choice depends on the environment – A small shared-memory machine – A large NUMA multi-processor – A large collection of networked machines
  6. Implementation: Execution Overview ● Map invocations are distributed across multiple machines by automatically partitioning the input data into a set of M splits. ● Reduce invocations are distributed by partitioning the intermediate key space into R pieces. ● There are M map tasks and R reduce tasks to assign. The master picks idle workers and assigns each one a map task or a reduce task.
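A minimal sketch of the assignment loop, assuming a simplified master in which every task finishes immediately so its worker becomes idle again. The function name `assign_tasks` and the task labels are hypothetical, and a real master would not start reduce work before the map output it needs exists.

```python
from collections import deque
from typing import List, Tuple


def assign_tasks(num_map: int, num_reduce: int,
                 workers: List[str]) -> List[Tuple[str, str]]:
    """Hand out M map tasks and R reduce tasks to idle workers."""
    if not workers:
        raise ValueError("at least one worker is required")
    pending = deque([f"map-{i}" for i in range(num_map)] +
                    [f"reduce-{j}" for j in range(num_reduce)])
    idle = deque(workers)
    assignments = []
    while pending:
        worker = idle.popleft()   # the master picks an idle worker
        task = pending.popleft()  # and gives it the next unassigned task
        assignments.append((worker, task))
        idle.append(worker)       # simplification: the task completes at once
    return assignments


assignments = assign_tasks(num_map=4, num_reduce=2, workers=["w0", "w1"])
```

With two workers and six tasks, each worker ends up running three tasks in turn.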
  7. Implementation: Execution Overview. Fig: How MapReduce works and data flow. Source:
  8. Implementation: Execution Overview. Fig: Input data values in the MapReduce model. Source: Google Developers
  9. Master Data Structure ● For each map task and reduce task, the master stores the state (idle, in-progress, or completed) and the identity of the worker machine. ● For each completed map task, the master stores the locations and sizes of the R intermediate file regions produced by the map task. ● This information is pushed incrementally to workers that have in-progress reduce tasks.
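The bookkeeping described above might look like the following sketch. The class and field names (`TaskState`, `Region`, `Task`, `complete_map_task`) are illustrative assumptions, not the actual structures used by the implementation.

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import List, Optional


class TaskState(Enum):
    IDLE = "idle"
    IN_PROGRESS = "in-progress"
    COMPLETED = "completed"


@dataclass
class Region:
    location: str    # hypothetical: where the intermediate region is stored
    size_bytes: int


@dataclass
class Task:
    kind: str                            # "map" or "reduce"
    state: TaskState = TaskState.IDLE
    worker: Optional[str] = None         # identity of the worker machine
    # Filled in only for completed map tasks: one entry per reduce partition.
    regions: List[Region] = field(default_factory=list)


def complete_map_task(task: Task, regions: List[Region]) -> None:
    """Record the R intermediate regions reported by a finished map task."""
    task.state = TaskState.COMPLETED
    task.regions = regions


task = Task(kind="map")
task.state, task.worker = TaskState.IN_PROGRESS, "worker-7"
complete_map_task(task, [Region("/local/m0-r0", 10), Region("/local/m0-r1", 20)])
```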
  10. Fault Tolerance: Worker Failure ● The master pings every worker periodically ● No response means the worker is marked as failed ● All map tasks completed by or in progress on the failed worker are reset to the idle state and re-executed on other machines, because map output is stored on the failed machine's local disk. ● In-progress reduce tasks on the failed machine are rescheduled, but completed reduce tasks do not need to be re-executed.
  11. Fault Tolerance: Worker Failure ● When a map task is executed first by worker A and later executed by worker B (because A failed), all workers executing reduce tasks are notified of the re-execution. ● Any reduce task that has not already read data from worker A will read data from B.
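The rescheduling rule for a failed worker can be sketched as follows. The dictionary layout and the function name `handle_worker_failure` are assumptions made for illustration.

```python
from typing import Dict, List


def handle_worker_failure(tasks: List[Dict], failed_worker: str) -> List[str]:
    """Reset every task that must be re-executed; return the reset task ids."""
    reset_ids = []
    for task in tasks:
        if task["worker"] != failed_worker:
            continue
        is_map = task["kind"] == "map"
        # Completed map tasks are redone as well, because their output lived
        # on the failed machine. Completed reduce tasks are left alone.
        if task["state"] == "in-progress" or (is_map and task["state"] == "completed"):
            task["state"] = "idle"
            task["worker"] = None
            reset_ids.append(task["id"])
    return reset_ids


tasks = [
    {"id": "m1", "kind": "map", "state": "completed", "worker": "A"},
    {"id": "m2", "kind": "map", "state": "in-progress", "worker": "A"},
    {"id": "r1", "kind": "reduce", "state": "completed", "worker": "A"},
    {"id": "r2", "kind": "reduce", "state": "in-progress", "worker": "A"},
    {"id": "m3", "kind": "map", "state": "completed", "worker": "B"},
]
reset_ids = handle_worker_failure(tasks, failed_worker="A")
```

Only `r1` (a completed reduce task) and `m3` (on a healthy worker) keep their state.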
  12. Fault Tolerance: Semantics in the Presence of Failures ● When the Map and Reduce operators are deterministic functions, this implementation produces the same output as would have been produced by a non-faulting sequential execution.
  13. Implementation: Locality ● Network bandwidth is a relatively scarce resource in the computing environment. ● The input data, managed by GFS, is stored on the local disks of the machines in the cluster. ● GFS divides each file into 64 MB blocks and stores several copies of each block.
  14. Implementation: Locality ● The MapReduce master takes the location information of the input files into account and attempts to schedule a map task on a machine that contains a replica of the corresponding input data. ● Failing that, it attempts to schedule the map task near a replica of that task's input data. ● When a large MapReduce operation runs on a significant fraction of the workers in a cluster, most input is read locally and consumes no network bandwidth.
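A hedged sketch of the scheduling preference: first a worker that holds a replica, then a nearby worker, then any idle worker. Here "near" is assumed to mean "on the same rack or switch", and `pick_worker` and `rack_of` are hypothetical names.

```python
from typing import Dict, List, Optional, Set


def pick_worker(replica_hosts: Set[str], idle_workers: List[str],
                rack_of: Dict[str, str]) -> Optional[str]:
    """Choose an idle worker for a map task, preferring data locality."""
    # 1. Best case: an idle worker that already stores a replica of the input.
    for worker in idle_workers:
        if worker in replica_hosts:
            return worker
    # 2. Next best: an idle worker on the same rack as some replica
    #    (an assumed definition of "near").
    replica_racks = {rack_of[h] for h in replica_hosts if h in rack_of}
    for worker in idle_workers:
        if rack_of.get(worker) in replica_racks:
            return worker
    # 3. Otherwise any idle worker; the input then crosses the network.
    return idle_workers[0] if idle_workers else None


rack_of = {"w1": "rack-1", "w2": "rack-1", "w3": "rack-2"}
local_choice = pick_worker({"w2"}, ["w1", "w2", "w3"], rack_of)
nearby_choice = pick_worker({"w2"}, ["w3", "w1"], rack_of)
remote_choice = pick_worker({"w2"}, ["w3"], rack_of)
```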
  15. Implementation: Task Granularity ● M and R should be much larger than the number of worker machines. ● Having each worker perform many different tasks improves dynamic load balancing and also speeds up recovery. ● The master makes O(M+R) scheduling decisions and keeps O(M*R) state in memory.
  16. Implementation: Task Granularity ● R is often constrained by users because the output of each reduce task ends up in a separate output file. ● Choose M so that each individual task processes roughly 16 MB to 64 MB of input data, which is the range where the locality optimization works best.
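As a worked example of these sizes, the sketch below derives M from an input of 10^10 100-byte records split into 64 MB pieces, with R = 4000 (the figures used in the sort benchmark later in this deck). It assumes 1 MB = 2^20 bytes; the helper `num_map_tasks` is hypothetical.

```python
import math


def num_map_tasks(input_bytes: int, split_bytes: int) -> int:
    """Number of splits needed to cover the input."""
    return math.ceil(input_bytes / split_bytes)


input_bytes = 10**10 * 100      # 10^10 records of 100 bytes, about 1 TB
split_bytes = 64 * 2**20        # 64 MB per split (assuming MB = 2^20 bytes)

M = num_map_tasks(input_bytes, split_bytes)
R = 4000

scheduling_decisions = M + R    # grows as O(M + R)
state_entries = M * R           # grows as O(M * R)
```

The result, M = 14902, is consistent with the rounded M = 15000 quoted for the benchmarks, and the M * R product of roughly 60 million entries shows why R cannot be made arbitrarily large.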
  17. Implementation: Backup Tasks ● A “straggler” is a machine that takes an unusually long time to complete one of the last few map or reduce tasks. ● For example, a machine with a bad disk may experience frequent correctable errors that slow its read performance. ● When a MapReduce operation is close to completion, the master schedules backup executions of the remaining in-progress tasks.
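One way to sketch the backup-task decision is shown below. The slides do not say how "close to completion" is measured, so the completion threshold and the name `tasks_to_back_up` are purely illustrative assumptions.

```python
from typing import Dict, List


def tasks_to_back_up(states: Dict[str, str], threshold: float = 0.9) -> List[str]:
    """Return in-progress tasks to duplicate once the job is nearly done.

    `threshold` (fraction of tasks completed) is an assumed tuning knob.
    """
    total = len(states)
    if total == 0:
        return []
    completed = sum(1 for state in states.values() if state == "completed")
    if completed / total < threshold:
        return []  # not close enough to completion yet
    return sorted(task for task, state in states.items() if state == "in-progress")


# 19 of 20 tasks are done; one straggler remains.
nearly_done = {f"t{i}": "completed" for i in range(19)}
nearly_done["t19"] = "in-progress"

# Only half of the tasks are done, so no backups are scheduled.
halfway = {f"t{i}": ("completed" if i < 5 else "in-progress") for i in range(10)}

backups_late = tasks_to_back_up(nearly_done)
backups_early = tasks_to_back_up(halfway)
```

Whichever copy of a task finishes first, primary or backup, would mark the task as completed.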
  18. Refinement: Partitioning Function ● Data gets partitioned across the R reduce tasks using a partitioning function on the intermediate key (e.g. hash(key) mod R). ● This tends to result in fairly well-balanced partitions.
  19. Refinement: Ordering Guarantees ● Within a given partition, the intermediate key/value pairs are processed in increasing key order. ● This ordering guarantee makes it easy to generate a sorted output file per partition.
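Partitioning by hash(key) mod R and the per-partition ordering can be sketched together. `zlib.crc32` is used here only as a stand-in hash that is stable across runs (Python's built-in `hash` for strings is randomized per process); the function names are hypothetical.

```python
import zlib
from typing import List, Tuple

Pair = Tuple[str, int]


def partition(key: str, num_reduce: int) -> int:
    """hash(key) mod R, with crc32 as an assumed stable hash."""
    return zlib.crc32(key.encode("utf-8")) % num_reduce


def partition_and_sort(pairs: List[Pair], num_reduce: int) -> List[List[Pair]]:
    """Split intermediate pairs into R partitions, each in increasing key order."""
    partitions: List[List[Pair]] = [[] for _ in range(num_reduce)]
    for key, value in pairs:
        partitions[partition(key, num_reduce)].append((key, value))
    return [sorted(p, key=lambda kv: kv[0]) for p in partitions]


pairs = [("b", 1), ("a", 1), ("b", 1), ("c", 1), ("a", 1)]
partitions = partition_and_sort(pairs, num_reduce=3)
```

Because the partition depends only on the key, every value for a given key reaches the same reduce task.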
  20. Refinement: Combiner Function ● Each map task may produce hundreds or thousands of records with the same key. ● A combiner function does partial merging of this data before it is sent over the network. ● The combiner function is executed on each machine that performs a map task. ● It significantly speeds up certain classes of MapReduce operations (e.g. word frequency).
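For word frequency, a combiner simply sums the counts emitted by one map task before anything leaves the machine. A minimal sketch, with `map_words` and `combine` as hypothetical names:

```python
from collections import Counter
from typing import List, Tuple


def map_words(contents: str) -> List[Tuple[str, int]]:
    """Plain map output: one (word, 1) record per word occurrence."""
    return [(word, 1) for word in contents.split()]


def combine(records: List[Tuple[str, int]]) -> List[Tuple[str, int]]:
    """Partial merge on the map machine: one record per distinct word."""
    merged: Counter = Counter()
    for word, count in records:
        merged[word] += count
    return sorted(merged.items())


raw = map_words("to be or not to be")
combined = combine(raw)
```

Here six raw records shrink to four before the shuffle; on real text with many repeated words the saving is far larger.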
  21. Refinement: Skipping Bad Records ● Sometimes there are bugs in user code that cause the Map or Reduce functions to crash deterministically on certain records. ● If the bug is in a third-party library for which source code is not available, it cannot be fixed. ● Also, sometimes it is acceptable to ignore a few records (e.g. statistical analysis on a large dataset). ● The MapReduce library detects which records cause deterministic crashes and skips them.
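The idea can be sketched as a retry loop that remembers which record was being processed when a crash happened. This is an in-process simplification: the rule "skip a record after it has crashed more than once", the attempt limit, and the name `run_with_skipping` are assumptions for illustration.

```python
from typing import Callable, Dict, List, Set, Tuple


def run_with_skipping(records: List[str], fn: Callable[[str], int],
                      max_failures: int = 1,
                      max_attempts: int = 10) -> Tuple[List[int], List[int]]:
    """Re-run a task until it succeeds, skipping records that keep crashing."""
    failures: Dict[int, int] = {}  # record index -> crashes seen so far
    skipped: Set[int] = set()
    for _ in range(max_attempts):
        results = []
        current = -1
        try:
            for index, record in enumerate(records):
                if index in skipped:
                    continue
                current = index          # remember what we were working on
                results.append(fn(record))
            return results, sorted(skipped)
        except Exception:
            failures[current] = failures.get(current, 0) + 1
            if failures[current] > max_failures:
                skipped.add(current)     # deterministic crash: skip next time
    raise RuntimeError("task did not complete within max_attempts")


results, skipped = run_with_skipping(["1", "2", "x", "4"], int)
```

The record `"x"` crashes `int` twice, is then skipped, and the third attempt completes.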
  22. Refinement: Status Information ● The master runs an internal HTTP server and exports a set of status pages for human consumption. ● The pages show the progress of the computation, such as how many tasks have been completed, how many are in progress, bytes of input data, bytes of intermediate data, processing rate, etc.
  23. Refinement: Counters ● Counters count occurrences of various events. ● User code creates a named counter object and then increments the counter appropriately in the Map and/or Reduce function:
        Counter* uppercase;
        uppercase = GetCounter("uppercase");

        map(String name, String contents):
          for each word w in contents:
            if (IsCapitalized(w)):
              uppercase->Increment();
            EmitIntermediate(w, "1");
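A runnable Python analogue of the counter pseudocode is sketched below. It keeps the counter in a local `Counter` object; in a distributed run the per-worker values would have to be sent to the master and aggregated, which this sketch does not model. The name `count_uppercase_map` is hypothetical.

```python
from collections import Counter
from typing import List, Tuple

# Named counters, keyed by counter name.
counters: Counter = Counter()


def count_uppercase_map(name: str, contents: str) -> List[Tuple[str, str]]:
    """Emit (word, "1") for every word and count the capitalized ones."""
    emitted = []
    for word in contents.split():
        if word[:1].isupper():
            counters["uppercase"] += 1
        emitted.append((word, "1"))   # emitted for every word, as in the pseudocode
    return emitted


emitted = count_uppercase_map("doc-1", "Hello world Foo bar")
```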
  24. Performance: Cluster Configuration ● Approximately 1800 machines ● 2 GHz Intel Xeon processors ● 4 GB of memory ● Two 160 GB IDE disks ● A gigabit Ethernet link ● Switched network with 100-200 Gbps of aggregate bandwidth available at the root ● Round-trip time was less than a millisecond
  25. Performance: Grep ● Scans through 10^10 100-byte records, searching for a relatively rare three-character pattern (present in 92,337 records) ● The input is split into 64 MB pieces (M=15000) ● The entire output is placed in one file (R=1)
  26. Performance: Grep
  27. Performance: Grep ● 150 seconds from start to finish ● The startup overhead is due to – the propagation of the program to all worker machines – delays interacting with GFS to open the set of 1000 input files – getting the information needed for the locality optimization
  28. Performance: Sort ● Sorts 10^10 100-byte records (approximately 1 terabyte of data) ● The sorting program consists of less than 50 lines of code ● A three-line Map function extracts a 10-byte sorting key from a text line and emits the key and the original text line ● M=15000, R=4000 ● The final sorted output is written to a set of 2-way replicated GFS files (i.e., 2 terabytes of data)
  29. Performance: Sort
  30. Performance: Sort ● The input rate is higher than the shuffle rate and the output rate because of the locality optimization. ● The shuffle rate is higher than the output rate because the output phase writes two copies of the sorted data. ● With backup tasks disabled, execution takes 44% longer. ● With worker processes killed during the run, execution takes 5% longer than the normal execution time of 891 seconds.
  31. Experience ● Using MapReduce (instead of the ad-hoc distributed passes in the prior version of the indexing system) has provided several benefits: – The indexing code is simpler, smaller, and easier to understand (3800 lines reduced to only 700 lines) – A change that took a few months to make in the old system took only a few days to implement in the new system – Machine failures, slow machines, and networking hiccups are dealt with automatically
  32. THANK YOU