
Introduction to MapReduce



In this presentation, I provide in-depth information about how MapReduce works. It covers the execution steps, fault tolerance, and master/worker responsibilities.

Published in: Data & Analytics


  1. 1. Introduction to MapReduce. Mohamed Baddar, Senior Data Engineer
  2. 2. Contents: Computation Models for Distributed Computing; MPI; MapReduce; Why MapReduce?; How MapReduce Works; Simple Example; References
  3. 3. Distributed Computing. Why? The boom in big data generation (social media, e-commerce, banks, etc.). Big data with machine learning, data mining, and AI became like bread and butter: better results come from analyzing bigger data sets. How does it work? Data partitioning: divide the data among multiple tasks, each applying the same procedure (computation) at a specific phase to its data segment. Task partitioning: assign different tasks to different computation units. Hardware for distributed computing: multiple processors (multi-core processors).
  4. 4. Metrics. How do we judge a computational model's suitability? Simplicity: the level of developer experience required. Scalability: adding more computational nodes increases throughput / improves response time. Fault tolerance: support for recovering computed results when a node goes down. Maintainability: how easy it is to fix bugs and add features. Cost: the need for special hardware (multi-core processors, large RAM, InfiniBand) versus using a cluster of commodity machines on common Ethernet. No one size fits all; sometimes it is better to use hybrid computational models.
  5. 5. MPI (Message Passing Interface) ● Workload is divided among different processes (each process may have multiple threads) ● Communication is via message passing ● Data exchange is via shared memory (physical / virtual) ● Pros ○ Flexibility: the programmer can customize messages and communication between nodes ○ Speed: relies on sharing data via memory
  6. 6. MapReduce. Objective: design a scalable parallel programming framework to be deployed on a large cluster of commodity machines. Data is divided into splits, each processed by map functions, whose output is processed by reduce functions. MapReduce originated, with its first practical implementation, at Google in 2004. MapReduce implementations: Apache Hadoop (computation).
  7. 7. MapReduce Execution (1). # Mappers (M=3), # Reducers (R=2). MapReduce functions: map(K1, V1) → list(K2, V2); reduce(K2, list(V2)) → list(V2)
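The two signatures above can be sketched with the classic word-count example. This is a minimal illustration, not a real framework API; the function names `map_fn` and `reduce_fn` are made up for the sketch.

```python
def map_fn(key, value):
    """map(K1, V1) -> list(K2, V2): emit (word, 1) for each word in a line."""
    return [(word, 1) for word in value.split()]

def reduce_fn(key, values):
    """reduce(K2, list(V2)) -> list(V2): sum the counts for one word."""
    return [sum(values)]

# One input record (line offset, line text) becomes a list of (word, 1) pairs:
pairs = map_fn(0, "the quick the")
# pairs == [("the", 1), ("quick", 1), ("the", 1)]

# The framework groups values by key before calling reduce:
counts = reduce_fn("the", [1, 1])
# counts == [2]
```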
  8. 8. MapReduce Execution (2). Platform: nodes communicating over an Ethernet network via TCP/IP. Two main types of processes: Master: orchestrates the work. Worker: processes data. Units of work: Job: a MapReduce job is a unit of work that the client wants performed; it consists of the input data, the MapReduce program, and configuration. Task: can be a map task (processes input into intermediate data) or a reduce task (processes intermediate data into output). A job is divided into several map/reduce tasks.
  9. 9. MapReduce Execution (3) 1. A copy of the master process is created. 2. Input data is divided into M splits, each of 16 to 64 MB (user configurable). 3. M map tasks are created and given unique IDs; each parses the key/value pairs of its split, starts processing, and writes its output into a memory buffer. 4. Map output is partitioned into R partitions. When buffers are full, they are spilled to local hard disks, and the master is notified of the saved buffer locations. All records with the same key are put in the same partition. Note: map output is stored in the local worker file system, not the distributed file system, as it is intermediate data and this avoids complexity. 5. Shuffling: when a reduce worker receives a notification from the master that one of the map tasks has finished, it reads its partition of that map task's output from the map worker's local disk.
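Step 4 above can be sketched as follows. This is a hypothetical illustration of partitioning with hash(key) mod R; the function names are made up, and a real framework would use a stable hash so all workers partition identically.

```python
R = 2  # number of reduce tasks, matching the earlier slide

def partition(key, num_reducers=R):
    # Python's built-in hash() suffices for a single-process sketch;
    # real frameworks use a deterministic hash across machines.
    return hash(key) % num_reducers

def partition_map_output(pairs, num_reducers=R):
    """Split one mapper's (key, value) output into R partition buckets."""
    partitions = [[] for _ in range(num_reducers)]
    for key, value in pairs:
        partitions[partition(key, num_reducers)].append((key, value))
    return partitions

parts = partition_map_output([("a", 1), ("b", 1), ("a", 1)])
# Every ("a", 1) record lands in the same partition, so one reduce
# task will see all values for key "a".
```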
  10. 10. MapReduce Execution (4) 6. When the reduce worker has received all of its intermediate input, it sorts it by key (sorting is needed because a reduce task may handle several keys). 7. When sorting finishes, the reduce worker iterates over each key, passing the key and its list of values to the reduce function. 8. The output of the reduce function is appended to the output file corresponding to this reduce worker. 9. For each HDFS block of the reduce task's output, one replica is stored locally on the reduce worker and the other two (assuming a replication factor of 3) are replicated on off-rack nodes for reliability.
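Steps 6 and 7 above can be sketched as a sort followed by a group-by-key. The name `run_reduce` is illustrative, not a real framework function.

```python
from itertools import groupby
from operator import itemgetter

def run_reduce(intermediate, reduce_fn):
    """Sort fetched (key, value) pairs by key, then call reduce_fn per key."""
    output = []
    # Step 6: sort by key, since one reduce task may own several keys.
    intermediate.sort(key=itemgetter(0))
    # Step 7: iterate over each key with its full list of values.
    for key, group in groupby(intermediate, key=itemgetter(0)):
        values = [v for _, v in group]
        output.extend((key, r) for r in reduce_fn(key, values))
    return output

result = run_reduce([("b", 1), ("a", 1), ("b", 1)],
                    lambda k, vs: [sum(vs)])
# result == [("a", 1), ("b", 2)]
```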
  11. 11. Master responsibilities. Find idle nodes (workers) to assign map and reduce tasks. Monitor each task's status (idle, in-progress, finished). Keep track of the locations of the R intermediate output partitions on each map worker machine. Keep a record of worker IDs and other info (CPU, memory, disk size). Continuously push information about intermediate map output to reduce workers.
  12. 12. Fault tolerance (1). Objective: handle machine failures gracefully, i.e. the programmer doesn't need to handle them or be aware of the details. Two types of failures: master failure and worker failure. Two main activities: failure detection, and recovering lost (computed) data with the least re-computation.
  13. 13. Fault tolerance (2). Worker failure. Detection: timeout on the master's ping; mark the worker as failed and remove it from the list of available workers. For all map tasks assigned to that worker: mark these tasks as idle, making them eligible for re-scheduling on other workers. Map tasks are re-executed because their output is stored in the failed machine's local file system. All reduce workers are notified of the re-execution so they can fetch any intermediate data they haven't read yet. There is no need to re-execute completed reduce tasks, as their output is stored in the distributed file system.
  14. 14. Semantics in the Presence of Failures. Deterministic and nondeterministic functions: deterministic functions always return the same result any time they are called with a specific set of input values; nondeterministic functions may return different results each time they are called with a specific set of input values. If the map and reduce functions are deterministic, the distributed implementation of the MapReduce framework must produce the same output as a non-faulting sequential execution of the program. Several copies of the same map/reduce task might run on different nodes for the sake of reliability and fault tolerance.
  15. 15. Semantics in the Presence of Failures (2). Mappers always write their output to temporary files (atomic commits). When a map task finishes: it renames the temporary file to the final output file and sends a message to the master with the filename. If another copy of the same map task finished earlier, the master ignores the message; otherwise it records the filename. Reducers do the same; if multiple copies of the same reduce task finish, the MapReduce framework relies on the atomic rename provided by the file system. If map tasks are non-deterministic and multiple copies of a map task run on different machines, a weaker semantic condition can arise: two reducers may read output produced by different executions of the same map task.
  16. 16. Semantics in the Presence of Failures (3) ● Workers #1 and #2 run copies of the same map task M1. ● Reduce task R1 reads its input for M1 from worker #1. ● Reduce task R2 reads its input for M1 from worker #2, as worker #1 has failed by the time R2 starts. ● If M1's function is deterministic, we have complete consistency. ● If M1's function is nondeterministic, R1 and R2 may receive different results from M1.
  17. 17. Task granularity. Load balancing: fine-grained is better; faster machines tend to take on more tasks than slower machines over time, leading to lower overall job execution time. Failure recovery: less time to re-execute failed tasks. Very fine-grained tasks may not be desirable: there is overhead to managing them, and too much data shuffling consumes valuable bandwidth. Optimal granularity: split size = HDFS block size (128 MB by default). An HDFS block is guaranteed to be on the same node, and we want to maximize the work one mapper does locally. If split size < block size: we do not fully exploit local data processing. If split size > block size: data transfer may be needed for the map function to complete.
  18. 18. Data locality. Network bandwidth is a valuable resource. We assume rack-server hardware. The MapReduce scheduler works as follows: 1. Try to assign the map task to the node where the corresponding split block(s) reside; if it is free, assign the task, else go to step 2. 2. Try to find a free node in the same rack to assign the map task; if none is found, assign the task to a free off-rack node. ● More complex implementations use a network cost model.
  19. 19. Backup tasks. Stragglers: machines that run their assigned (MapReduce) tasks very slowly. Slow running can have many causes: a bad disk, a slow network, a low-speed CPU. Other tasks scheduled on stragglers add more load and further lengthen execution time. Solution mechanism: when a MapReduce job is close to finishing, issue backup tasks for all the "in-progress" tasks; a task is marked complete when either its primary or its backup execution finishes.
  20. 20. Refinements. Partitioning function: partitions the output of map tasks into R partitions (one per reduce task). A good function should make the partitions as equal as possible. Default: hash(key) mod R. This usually works fine; problems arise when specific keys have many more records than others, in which case you need to design a custom hash function or change the key. Combiner function: reduces the size of the map intermediate output.
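A combiner can be sketched as a local reduce run on each mapper's output before the shuffle, shrinking the intermediate data sent over the network. The sketch below assumes word count, where the combiner is the same summation as the reducer; the name `combine` is illustrative, not a framework API.

```python
from collections import defaultdict

def combine(map_output):
    """Locally merge (word, count) pairs from one mapper before shuffling."""
    totals = defaultdict(int)
    for word, count in map_output:
        totals[word] += count
    return sorted(totals.items())

combined = combine([("the", 1), ("quick", 1), ("the", 1), ("the", 1)])
# combined == [("quick", 1), ("the", 3)] -- four records shrunk to two
```

Note that a combiner is only safe when the reduce operation is associative and commutative (like summation); otherwise combining locally can change the final result.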
  21. 21. Refinements (2). Skipping bad records: a bug in a third-party library that can't be fixed causes the code to crash on specific records. Terminating a job that has been running for hours or days is more expensive than sacrificing a small percentage of accuracy (if the context allows, e.g. statistical analysis of large data). How does MapReduce handle this? 1. Each worker process installs a signal handler that catches segmentation violations, bus errors, and other possibly fatal errors. 2. Before a map/reduce task processes a record, the MapReduce library stores the record's sequence number in a global variable. 3. When the map/reduce function code generates a signal, the worker sends a UDP packet containing that sequence number to the master, which skips the record on re-execution.
  22. 22. References 1. MapReduce: Simplified Data Processing on Large Clusters 2. Hadoop: The Definitive Guide, Ch. 3 3. Hadoop: The Definitive Guide, Ch. 1 & 2