MapReduce in Cloud Computing



MapReduce in Cloud Computing
Mohammad Mustaqeem
M.Tech 2nd Year, Computer Science and Engineering
Reg. No: 2011CS17
Department of Computer Science and Engineering
Motilal Nehru National Institute of Technology Allahabad
Contents

1 Introduction
  1.1 Map and Reduce in Functional Programming
  1.2 Structure of MapReduce Framework
2 Motivations
3 Description of First Paper
  3.1 Issues
  3.2 Approach used to Tackle the Issue
    3.2.1 Hadoop Distributed File System
    3.2.2 MapReduce Programming Model
  3.3 An Example: Word Count
4 Description of Second Paper
  4.1 Issues
  4.2 Approach used to Tackle the Issue
    4.2.1 System Model
    4.2.2 Architecture
    4.2.3 System Mechanism
  4.3 Example
5 Integration of both Papers
6 Conclusion
List of Figures

1 HDFS Architecture
2 Execution phase of a generic MapReduce application
3 Word Count Execution
4 System model described through the UML Class Diagram
5 Behaviour of a generic node described by an UML State Diagram
6 General Architecture of P2P-MapReduce
1 Introduction

Cloud computing is designed to provide on-demand resources or services over the Internet, usually at the scale and with the reliability level of a data center. MapReduce is a software framework that allows developers to write programs that process massive amounts of unstructured data in parallel across a distributed cluster of processors or stand-alone computers. It was developed at Google for indexing Web pages.

The model is inspired by the map and reduce functions commonly used in functional programming languages (such as LISP, Scheme, and Racket) [3], although their purpose in the MapReduce framework is not the same as in their original forms.

1.1 Map and Reduce in Functional Programming

• Map: The map function in Racket has the form

    (map f list1) → list2 [4]

  where f is a function and list1 and list2 are lists. It applies the function f to each element of list1 and returns a list list2 containing the results of f in order. For example:

    (map (lambda (x) (* x x)) '(1 2 3 4 5)) → '(1 4 9 16 25)

• Reduce: There are two variations of the reduce function in Racket. Their forms are

    (foldl f init list1) → any and (foldr f init list1) → any [4]

  Like map, foldl applies a function to the elements of one or more lists. But whereas map combines the return values into a list, foldl combines them in an arbitrary way that is determined by f. In foldl, list1 is traversed from left to right, while in foldr it is traversed from right to left. For example:

    (foldl - 0 '(1 2 3 4 5 6)) → 3
    (foldr - 0 '(1 2 3 4 5 6)) → -3

1.2 Structure of MapReduce Framework

The framework is divided into two parts:

• Map: It distributes work to the different nodes in the distributed cluster.
• Reduce: It collects the work and resolves the results into a single value.

The MapReduce framework is fault-tolerant because each node in the cluster is expected to report back periodically with completed work and status updates. If a node remains silent for longer than the expected interval, a master node takes note and re-assigns that node's work to other nodes.

2 Motivations

Computations that process large amounts of raw data, such as crawled documents or web request logs, to compute various kinds of derived data, such as inverted indices, representations of the graph structure of web documents, summaries of the number of pages crawled per host, or the set of most frequent queries in a given day, are very complex. Most such computations are conceptually straightforward. However, the input data is usually large, and the computations have to be distributed across hundreds or thousands of machines (a cluster) in order to finish in a reasonable amount of time. Moreover, some machines may fail during the computation. We therefore need a solution that copes well with these issues.

The MapReduce framework handles exactly these issues: how to parallelize the computation, how to distribute the data, and how to handle failures of nodes during the computation. Besides these features, writing MapReduce programs is very easy. Programmers have to define just two functions, map and reduce; the rest of the work is done by the MapReduce framework.

3 Description of First Paper

Gaizhen Yang, "The Application of MapReduce in the Cloud Computing"

3.1 Issues

In cloud computing, commodity hardware needs to process enormous amounts of data that cannot be handled by a single machine. Real-life examples of such processing are Reverse Web-Link Graph, web access analysis, Term-Vector per Host, inverted index clustering, Count of URL Access Frequency, Distributed Sort, etc. [3].
Because of the size of this data, we need to process it in parallel, in a distributed manner, on large clusters of machines so that the processing can be done in a reasonable amount of time.
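One of the workloads listed above, Count of URL Access Frequency, illustrates how such processing fits the map/reduce style. The following is a minimal single-process Python sketch, not code from the paper; the log format and the function names map_fn and reduce_fn are illustrative assumptions:

```python
from collections import defaultdict

# Hypothetical access-log records: "timestamp url status"
LOG = [
    "t1 /index.html 200",
    "t2 /about.html 200",
    "t3 /index.html 404",
]

def map_fn(record):
    # Emit (url, 1) for every access: map(k1, v1) -> list(k2, v2)
    _, url, _ = record.split()
    yield (url, 1)

def reduce_fn(url, counts):
    # Merge all counts for one url: reduce(k2, list(v2)) -> v3
    return (url, sum(counts))

# Sequential driver standing in for the distributed framework:
# group the intermediate pairs by key, then reduce each group.
groups = defaultdict(list)
for record in LOG:
    for key, value in map_fn(record):
        groups[key].append(value)

result = dict(reduce_fn(k, v) for k, v in groups.items())
print(result)  # {'/index.html': 2, '/about.html': 1}
```

In the real framework the grouping step is done by the distributed shuffle, not by a local dictionary, but the two user-supplied functions keep exactly this shape.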
3.2 Approach used to Tackle the Issue

Hadoop is an open-source Java framework for processing and querying vast amounts of data on large clusters of commodity hardware (a cloud), and it has been applied at many sites such as Amazon, Facebook, and Yahoo [1]. It takes advantage of a distributed system infrastructure to process enormous amounts of data in almost real time. It can also tolerate node failures because it keeps multiple copies of the data.

Hadoop has two main components: MapReduce and the Hadoop Distributed File System (HDFS) [1].

3.2.1 Hadoop Distributed File System

HDFS provides the underlying support for distributed storage. As in a traditional file system, we can create, delete, and rename files and directories, but these files and directories are stored in a distributed fashion among the nodes. In HDFS there are two types of nodes: the Name Node and the Data Nodes [1]. The Name Node provides the metadata services while the Data Nodes provide the actual storage. A Hadoop cluster contains only one Name Node and multiple Data Nodes. In HDFS, files are divided into blocks that are copied to multiple Data Nodes to provide a reliable file system. The HDFS architecture is shown below.

Figure 1: HDFS Architecture

• Name Node: The Name Node is a process that runs on a separate machine. It provides all the metadata services, that is, file system management and maintenance of the file system tree. In reality, the Name Node stores only the metadata of the files and directories. A programmer does not need the actual location of the files but can access them through the Name Node, which does all the underlying work for the users.
• Data Node: A Data Node is a process that runs on the individual machines of the cluster. The file blocks are stored in the local file system of these nodes, which periodically send the metadata of the stored blocks to the Name Node. Clients write blocks directly to the Data Nodes; after writing, deleting, or copying blocks, the Data Nodes inform the Name Node.

The sequence of operations to write a file in HDFS is:

1. The client sends a request to the Name Node to write a file.
2. According to the file size and the file block configuration, the Name Node returns to the client the block information for its management section.
3. The client divides the file into multiple blocks and, using the Data Node address information, writes the blocks to the Data Nodes.

3.2.2 MapReduce Programming Model

MapReduce is the key concept behind Hadoop. It is widely recognized as the most important programming model for cloud computing. MapReduce is a technique for dividing work across a distributed system.

In the MapReduce programming model, users have to define only two functions: a map function and a reduce function.

The map function processes a (key, value) pair and returns a list of (intermediate key, value) pairs:

    map (k1, v1) → list(k2, v2)

The reduce function merges the intermediate values having the same intermediate key:

    reduce (k2, list(v2)) → list(v3)

Execution phase of a generic MapReduce application - The following sequence of actions occurs when a user submits a MapReduce job:

1. The MapReduce library in the user program first splits the input files into M pieces, typically ranging from 16 MB to 64 MB in size. It then starts up copies of the program on multiple machines of the cluster.
2. Among these program copies, one is the master and the others are workers (slaves). There are M map tasks and R reduce tasks in total. The master picks idle workers and assigns each one a map or a reduce task.
Figure 2: Execution phase of a generic MapReduce application

3. A map worker reads the contents of its assigned input split. It parses key-value pairs out of the input data and passes each pair to the user-defined map function. The intermediate key-value pairs produced are buffered in memory.
4. The buffered pairs are written to the local disk, and the locations of these pairs are passed back to the master. The master then forwards these locations to the reduce workers.
5. When a reduce worker gets these locations, it uses remote procedure calls to read the data from the map workers. After reading all the intermediate pairs, the reduce worker sorts them by the intermediate keys so that all occurrences of the same key are grouped together.
6. For each intermediate key, the user-defined reduce function is applied to the corresponding intermediate values, and the output of the reduce function is appended to the final output file.
7. When all map tasks and reduce tasks have been completed, the master wakes up the user program. At this point, the MapReduce call in the user program returns back to the user code.

After successful execution of these steps, the output is stored in R output files (one per reduce task).
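The execution phase above can be sketched as a toy, single-process stand-in for the distributed runtime. This is an illustrative Python sketch, not the actual Hadoop implementation: scheduling, disks, and RPCs are elided, but the split/map/partition/sort/reduce structure and the "R output files" result are kept:

```python
from collections import defaultdict

def run_mapreduce(splits, map_fn, reduce_fn, R=2):
    # Steps 1-3: run the user map function on each input split,
    # buffering the intermediate (key, value) pairs.
    intermediate = []
    for split in splits:
        intermediate.extend(map_fn(split))

    # Steps 4-5: partition the pairs among R reduce tasks (hash(key) mod R)
    # and group values by key within each partition, as the sort phase would.
    partitions = [defaultdict(list) for _ in range(R)]
    for key, value in intermediate:
        partitions[hash(key) % R][key].append(value)

    # Steps 6-7: apply the user reduce function per key; one output
    # "file" (here, a dict) per reduce task.
    outputs = []
    for part in partitions:
        outputs.append({k: reduce_fn(k, vs) for k, vs in sorted(part.items())})
    return outputs

# Word count expressed in this model
def wc_map(text):
    return [(w, 1) for w in text.split()]

def wc_reduce(word, counts):
    return sum(counts)

files = run_mapreduce(["the quick brown fox", "the fox ate the mouse"],
                      wc_map, wc_reduce)
merged = {k: v for f in files for k, v in f.items()}
print(merged["the"])  # 3
```

Because each key hashes to exactly one partition, merging the R output files never sees the same key twice, which is the same property the real framework relies on.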
3.3 An Example: Word Count

A simple MapReduce program can be written to determine how many times different words appear in a set of files. Let the content of a file be:

    the quick brown fox
    the fox ate the mouse
    how now brown cow

The whole MapReduce process is depicted in the figure below.

Figure 3: Word Count Execution

1. The MapReduce library splits the file content into three parts.
After splitting the data, it starts up many copies of the program on a cluster of machines.

2. The master assigns the map tasks to 3 map workers. The map function looks like:

    mapper (filename, file-contents):
        for each word in file-contents:
            emit (word, 1)

3. The map function is applied to each split, generating intermediate key-value pairs such as (the, 1), (quick, 1), (brown, 1), (fox, 1) for the first split.

4. When a map worker is done, it reports to the master and gives the memory location of its output.

5. When all the map tasks are done, the master starts reduce tasks on the idle machines and gives them the memory locations from which the reduce workers copy the intermediate key-value pairs.

6. After receiving all the intermediate key-value pairs, each reduce worker sorts these pairs so as to group them by intermediate key.

7. At this point, the reduce function is applied to each intermediate key and its list of values. The pseudocode of the reduce function is:

    reducer (word, values):
        sum = 0
        for each value in values:
            sum = sum + value
        emit (word, sum)
8. The final output of the reduce function is:

    brown, 2
    fox, 2
    how, 1
    now, 1
    the, 3
    ate, 1
    cow, 1
    mouse, 1
    quick, 1

4 Description of Second Paper

Fabrizio Marozzo, Domenico Talia, Paolo Trunfio, "P2P-MapReduce: Parallel data processing in dynamic Cloud environments"

4.1 Issues

MapReduce is a development model that allows developers to write programs that process massive amounts of unstructured data in parallel across a distributed cluster of machines. In a cloud, nodes may leave and join at runtime, so we need a system that can handle such conditions. The MapReduce implementation discussed so far is based on a centralized architecture and cannot cope with a dynamic infrastructure in which nodes join and leave the network at high rates. This paper describes an adaptive P2P-MapReduce system that can handle the situation in which a master node fails.

4.2 Approach used to Tackle the Issue

The main goal of P2P-MapReduce is to provide an infrastructure in which nodes may join and leave the cluster without affecting the MapReduce functionality. This is required because cloud environments exhibit high levels of churn. To achieve this goal, P2P-MapReduce adopts a peer-to-peer model in which a wide set of autonomous nodes can act either as masters or as slaves. The master and slave roles are interchanged dynamically in such a way that the ratio between the number of masters and the number of slaves remains constant.

In P2P-MapReduce, to prevent the loss of computation in case of a master failure, there are backup masters for each master. The master responsible for a job J, referred to as
the primary master for J, dynamically updates the job state on its backup nodes, which are referred to as the backup masters for J. If at some point the primary master fails, its place is taken by one of its backup masters.

4.2.1 System Model

The system model of P2P-MapReduce describes the characteristics of jobs, tasks, users, and nodes at an abstract level. The UML class diagram is given below.

Figure 4: System model described through the UML Class Diagram.

• Job: A job is modelled as the tuple

    job = ⟨jobId, code, input, output, M, R⟩

  where jobId is a job identifier, code includes the map and reduce functions, input and output represent the locations of the input and output data respectively, and M and R are the numbers of map tasks and reduce tasks respectively.

• Task: A task is modelled as the tuple
    task = ⟨taskId, jobId, type, code, input, output⟩

  where taskId and jobId are the task and job identifiers respectively, type is either MAP or REDUCE, code represents the map or reduce function (depending on the task type), and input and output represent the locations of the input and output data of the task.

• User: A user is modelled as the pair

    user = ⟨userId, userJobList⟩

  where userId is the user identifier and userJobList is the list of jobs submitted by the user.

• Node: A node is modelled as the tuple

    node = ⟨nodeId, role, primaryJobList, backupJobList, slaveTaskList⟩

  where nodeId is the node identifier, role identifies the node's role (MASTER or SLAVE), primaryJobList is the list of jobs the node manages as primary master, backupJobList is the list of jobs for which it acts as a backup master, and slaveTaskList is empty if the node's role is MASTER and otherwise contains the list of (map or reduce) tasks assigned to it.

• PrimaryJobType: The primaryJobList contains tuples of the form

    primaryJobType = ⟨job, userId, jobStatus, jobTaskList, backupMasterList⟩

  where job is a job descriptor, userId is the user identifier, jobStatus is the current status of the job, jobTaskList is the list of tasks contained in the job, and backupMasterList is the list of backup masters of the job.

• JobTaskType: The jobTaskList contains tuples of the form

    jobTaskType = ⟨task, slaveId, taskStatus⟩

  where task is a task descriptor, slaveId is the identifier of the slave node responsible for the task, and taskStatus is the current status of the task.

• BackupJobType: The backupJobList contains tuples of a backupJobType defined as

    backupJobType = ⟨job, userId, jobStatus, jobTaskList, backupMasterList, primaryId⟩
BackupJobType differs from primaryJobType in the presence of an additional field, primaryId, which represents the identifier of the primary master associated with the job.

• SlaveTaskType: The slaveTaskList contains tuples of the form

    slaveTaskType = ⟨task, primaryId, taskStatus⟩

  where task is a task descriptor, primaryId is the identifier of the primary master associated with the task, and taskStatus contains its status.

4.2.2 Architecture

There are three types of nodes in the P2P-MapReduce architecture: user, master, and slave nodes. Master nodes and slave nodes form two logical peer-to-peer networks referred to as M-net and S-net, respectively. The composition of M-net and S-net changes dynamically because, as described earlier, the master and slave roles are interchanged between nodes.

A user node submits a MapReduce job to one of the available master nodes; the master is selected on the basis of the current workload of the available masters.

Master nodes are at the core of the system. They perform three types of operations: management, recovery, and coordination. A master node acting as primary master for one or more jobs executes the management operation. A master node acting as backup master for one or more jobs executes the recovery operation. The coordination operation changes slaves into masters and vice versa, so as to keep the desired master/slave ratio.

A slave executes the tasks that are assigned to it by one or more primary masters.

Jobs and tasks are managed by processes called Job Managers and Task Managers, respectively. For each managed job, the primary master runs one Job Manager, while a slave runs one Task Manager for each assigned task. In addition, masters run a Backup Job Manager for each job for which they act as backup master.

4.2.3 System Mechanism

The behaviour of a generic node can be easily understood from its UML state diagram, which, along with the states, also gives the events by which the state of the node changes.
The UML state diagram of a node in the P2P-MapReduce architecture is given below.
Figure 5: Behaviour of a generic node described by an UML State Diagram.

The state diagram shows two macro-states, SLAVE and MASTER, which represent the two roles a node can have. The SLAVE macro-state has three states, IDLE, CHECK MASTER and ACTIVE, which represent, respectively: a slave waiting for task assignment; a slave checking the existence of at least one master in the network; and a slave executing one or more tasks. The MASTER macro-state is modelled with three parallel macro-states, which represent the different roles a master can perform concurrently: possibly acting as the primary master for one or more jobs (MANAGEMENT); possibly acting as a backup master for one or more jobs (RECOVERY); and coordinating the network for maintenance purposes (COORDINATION). The MANAGEMENT macro-state contains two states: NOT PRIMARY, which represents a master node currently not acting as the primary master for any job, and PRIMARY, which, in contrast, represents a master node currently managing at least one job as the primary master. Similarly, the RECOVERY macro-state includes two states: NOT BACKUP (the node is not managing any job as backup master) and BACKUP (at least one job is currently being backed up on this node). Finally, the COORDINATION macro-state includes four states: NOT COORDINATOR (the node is not acting as the coordinator), COORDINATOR (the node is acting as the coordinator), and WAITING COORDINATOR and ELECTING COORDINATOR for nodes currently participating in the election of the new coordinator, as specified later. The combination of the concurrent states [NOT PRIMARY, NOT BACKUP, NOT COORDINATOR] represents the abstract state MASTER.IDLE. The
transition from the master to the slave role is allowed only for masters in the MASTER.IDLE state. Similarly, the transition from the slave to the master role is allowed only for slaves that are not in the ACTIVE state.

4.3 Example

The whole system mechanism can be understood through a simple example, described by the following figure.

Figure 6: General Architecture of P2P-MapReduce.

Figure 6 shows that three jobs have been submitted in total: one job by User1 (Job1) and two jobs by User2 (Job2 and Job3). For Job1, Node1 is the primary master, and Node2 and Node3 are backup masters. Job1 is composed of five tasks: two of them are assigned to Node4, and one each to Node7, Node9 and Node11.

The following recovery procedure takes place when the primary master Node1 fails:

• Backup masters Node2 and Node3 detect the failure of Node1 and start a distributed procedure to elect the new primary master among themselves.

• Assuming that Node3 is elected as the new primary master, Node2 continues to play the backup role and, to keep the desired number of backup masters active (two,
in this example), another backup node is chosen by Node3. Node3 then binds to the connections that were previously associated with Node1 and proceeds to manage the job using its local replica of the job state.

As soon as the job is completed, the (new) primary master notifies the result to the user node that submitted the job.

5 Integration of both Papers

                 First Paper                              Second Paper
Issues           To perform data-intensive computation    To design a peer-to-peer MapReduce
                 in a Cloud environment in a              system that can handle all node
                 reasonable amount of time.               failures, including master node
                                                          failures.
Approaches used  The simple MapReduce implementation      A peer-to-peer architecture is used
                 (presented by Google) is used.           to handle the dynamic churn in a
                 MapReduce is based on the                cluster.
                 master-slave model; this
                 implementation is known as Hadoop.
Advantages       Hadoop is scalable, reliable and         P2P-MapReduce can manage node
                 distributed, and able to handle          churn, master failures and job
                 enormous amounts of data. It can         recovery in an effective way.
                 process big data in near real time.

Table 1: Comparison between the two papers.

6 Conclusion

MapReduce is scalable and reliable, and it exploits distributed systems to perform efficiently in a cloud environment. P2P-MapReduce is a novel approach to handling the real-world problems faced by data-intensive computing. P2P-MapReduce is more reliable than the basic MapReduce framework because it is able to manage node churn, master failures, and job recovery in a decentralized but effective way. Cloud-based programming models such as these are thus likely to be a future trend in programming.
References

[1] Gaizhen Yang, "The Application of MapReduce in the Cloud Computing", International Symposium on Intelligence Information Processing and Trusted Computing (IPTC), October 2011, pp. 154-156.

[2] Fabrizio Marozzo, Domenico Talia, Paolo Trunfio, "P2P-MapReduce: Parallel data processing in dynamic Cloud environments", Journal of Computer and System Sciences, vol. 78, issue 5, September 2012, pp. 1382-1402.

[3] Jeffrey Dean and Sanjay Ghemawat, "MapReduce: Simplified data processing on large clusters", OSDI '04: Proceedings of the 6th Symposium on Operating Systems Design & Implementation, vol. 6, 2004, pp. 10-10.

[4] The Racket Guide.

[5] Hadoop Tutorial, Yahoo! Developer Network (YDN).

[6]

[7] F. Marozzo, D. Talia, P. Trunfio, "A Peer-to-Peer Framework for Supporting MapReduce Applications in Dynamic Cloud Environments", in: N. Antonopoulos, L. Gillam (eds.), Cloud Computing: Principles, Systems and Applications, Springer, Chapter 7, pp. 113-125, 2010.

[8] IBM developerWorks, "Using MapReduce and load balancing on the cloud".