Next Generation Hadoop: High Availability for YARN


Final report for the ID2219 course. Project title: Next Generation Hadoop: High Availability for YARN


Arinto Murdopo (KTH Royal Institute of Technology, Hanstavägen 49 - 1065A, 164 53 Kista, Sweden, arinto@kth.se) and Jim Dowling (Swedish Institute of Computer Science, Isafjordsgatan 22, 164 40 Kista, Sweden, jdowling@kth.se)

ABSTRACT
Hadoop is one of the most widely adopted cluster computing frameworks for big data processing, but it is not free from limitations. Computer scientists and engineers are continuously working to eliminate those limitations and improve Hadoop. One such improvement is YARN, which removes the scalability limitation of the first generation of MapReduce. However, YARN still suffers from an availability limitation: the YARN resource-manager is a single point of failure. In this paper we propose an architecture that solves YARN's availability limitation. The novelty of this architecture lies in its stateless failure model, which enables multiple YARN resource-managers to run concurrently and thereby maintains high availability. MySQL Cluster (NDB) is proposed as the storage technology in our architecture. Furthermore, we implemented a proof-of-concept of the proposed architecture. The evaluations show that the proof-of-concept increases the availability of YARN. In addition, NDB is shown to have the highest throughput compared to the storages proposed by Apache (ZooKeeper and HDFS). Finally, the evaluations show that NDB achieves near-linear scalability, hence it is suitable for our proposed stateless failure model.

Categories and Subject Descriptors: D.4.7 [Operating Systems]: Distributed Systems, Batch Processing Systems

General Terms: Big Data, Storage Management

1. INTRODUCTION
Big data has become widespread across industries, especially web companies. It has reached the petabyte scale and will keep growing in the upcoming years. Traditional storage systems such as regular file systems and relational databases are not designed to handle data of this magnitude; scalability is their main issue in handling big data. This situation has given rise to several cluster computing frameworks designed to handle big data effectively.

One of the most widely adopted cluster computing frameworks among web companies is Hadoop (http://hadoop.apache.org/). It mainly consists of the Hadoop Distributed File System (HDFS) [11] to store the data. On top of HDFS, a MapReduce framework inspired by Google's MapReduce [1] was developed to process the stored data. Although Hadoop has arguably become the standard solution for managing big data, it is not free from limitations, and these limitations have triggered significant efforts from academia and industry to improve Hadoop. Cloudera tried to reduce the availability limitation of HDFS using NameNode replication [9]. KTHFS solved the HDFS availability limitation by utilizing MySQL Cluster to make the HDFS NameNode stateless [12]. The scalability of MapReduce has also become a prominent limitation: MapReduce reaches its scalability limit at around 4,000 nodes. To solve this limitation, the open source community proposed the next generation of MapReduce, called YARN (Yet Another Resource Negotiator) [8]. From the enterprise world, Corona was released by Facebook to overcome the same scalability limitation [2]. Another limitation is Hadoop's inability to perform fine-grained resource sharing between multiple computation frameworks. Mesos tried to solve this limitation by implementing a distributed two-level scheduling mechanism called resource offers [3].

However, few solutions have addressed the availability limitation of the MapReduce framework. When a MapReduce JobTracker failure occurs, the corresponding application cannot continue, reducing MapReduce's availability. The current YARN architecture does not solve this availability limitation either: the ResourceManager, the JobTracker equivalent in YARN, remains a single point of failure. The open source community has recently started to address this issue, but no final and proven solution is available yet (https://issues.apache.org/jira/browse/YARN-128). The current proposal from the open source community is to use ZooKeeper [4] or HDFS as a persistent storage for the ResourceManager's states. Upon failure, the ResourceManager is recovered using the stored states.

Solving this availability limitation will bring YARN into a cloud-ready state: YARN could then be executed in the cloud, such as on Amazon EC2, and be resistant to the failures that often happen there.
In this report, we present a new architecture for YARN. The main goal of the new architecture is to solve the aforementioned availability limitation in YARN. This architecture provides a better alternative to the existing ZooKeeper-based architecture, since it eliminates the potential scalability limitation caused by ZooKeeper's relatively limited throughput.

To achieve the desired availability, the new architecture utilizes a distributed in-memory database called MySQL Cluster (NDB, http://www.mysql.com/products/cluster/) to persist the ResourceManager states. NDB automatically replicates the stored data across different NDB data-nodes to ensure high availability. Moreover, NDB is able to handle up to 1.8 million write queries per second [5].

This report is organized as follows. Section 2 presents the existing YARN architecture, its availability limitation and the solution proposed by Apache. The proposed architecture is presented in Section 3. Section 4 presents our evaluation to verify the availability and the scalability of the proposed architecture. Related work on improving the availability of cluster computing frameworks is presented in Section 5. We conclude this report and propose future work in Section 6.

2. YARN ARCHITECTURE
This section explains the current YARN architecture, YARN's availability limitation, and Apache's proposed solution to overcome that limitation.

2.1 Architecture Overview
YARN's main goal is to provide more flexibility than Hadoop in terms of the data processing frameworks that can be executed on top of it [7]. It is equipped with a generic distributed application framework and resource-management components. Therefore, YARN supports not only MapReduce but also other data processing frameworks such as Apache Giraph, Apache Hama and Spark.

In addition, YARN aims to solve the scalability limitation in the original implementation of Apache's MapReduce [6]. To achieve this, YARN splits the MapReduce job-tracker's responsibilities of application scheduling, resource management and application monitoring into separate processes or daemons. The new processes that take over the job-tracker's responsibilities are the resource-manager, which handles global resource management and job scheduling, and the application-master, which is responsible for job monitoring, job life-cycle management and resource negotiation with the resource-manager. Each submitted job corresponds to one application-master process. Furthermore, YARN converts the original MapReduce task-tracker into the node-manager, which manages task execution in YARN's unit of resource called a container.

[Figure 1: YARN Architecture]

Figure 1 shows the current YARN architecture. The resource-manager has three core components:

1. Scheduler, which schedules submitted jobs based on a specific policy and the available resources. The policy is pluggable, which means we can implement our own scheduling policy to be used in our YARN deployment. YARN currently provides three policies to choose from: the fair-scheduler, the FIFO-scheduler and the capacity-scheduler. Ideally, the scheduler should use CPU, memory, disk and other computing resources as factors during scheduling; however, current YARN only supports memory as a scheduling factor.

2. Resource-tracker, which handles the management of computing nodes. "Computing nodes" in this context means nodes that run a node-manager process and offer computing resources. The management tasks include registering new nodes, handling requests from invalid or decommissioned nodes, and processing node heartbeats. The resource-tracker works closely with the node-liveness-monitor (the NMLivenessMonitor class), which keeps track of live and dead computing nodes based on their heartbeats, and the node-list-manager (the NodesListManager class), which stores the list of valid and excluded computing nodes based on the YARN configuration files.

3. Applications-manager, which maintains the collection of user-submitted jobs and a cache of completed jobs. It is the entry point for clients to submit their jobs.

In YARN, clients submit jobs through the applications-manager, and the submission triggers the scheduler to try to schedule the job. When the job is scheduled, the resource-manager allocates a container and launches a corresponding application-master. The application-master takes over and processes the job by splitting it into smaller tasks, requesting additional containers from the resource-manager, launching them with the help of node-managers, assigning the tasks to the available containers and keeping track of the job's progress. Clients learn the job progress by polling the application-master at an interval defined in the YARN configuration. When the job is completed, the application-master cleans up its working state.
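To make the submission path just described concrete, the sketch below shows how a client could hand a job to the resource-manager's applications-manager through Hadoop's YarnClient API. This is an illustrative assumption: the class and method names follow later Hadoop 2.x client releases rather than the exact 2012 trunk discussed in this report, and the application-master launch command is only a placeholder.

```java
import java.util.Collections;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.yarn.api.records.ApplicationId;
import org.apache.hadoop.yarn.api.records.ApplicationSubmissionContext;
import org.apache.hadoop.yarn.api.records.ContainerLaunchContext;
import org.apache.hadoop.yarn.api.records.Resource;
import org.apache.hadoop.yarn.client.api.YarnClient;
import org.apache.hadoop.yarn.client.api.YarnClientApplication;
import org.apache.hadoop.yarn.conf.YarnConfiguration;
import org.apache.hadoop.yarn.util.Records;

public class SubmitToYarn {
    public static void main(String[] args) throws Exception {
        Configuration conf = new YarnConfiguration();

        // Connects to the resource-manager's applications-manager component.
        YarnClient yarnClient = YarnClient.createYarnClient();
        yarnClient.init(conf);
        yarnClient.start();

        // Ask the resource-manager for a new application id and submission context.
        YarnClientApplication app = yarnClient.createApplication();
        ApplicationSubmissionContext appContext = app.getApplicationSubmissionContext();
        appContext.setApplicationName("example-job");

        // Describe the container that will run the application-master.
        ContainerLaunchContext amContainer = Records.newRecord(ContainerLaunchContext.class);
        amContainer.setCommands(Collections.singletonList(
                "<placeholder: command that launches the application-master>"));
        appContext.setAMContainerSpec(amContainer);
        // Current YARN schedules on memory only, as noted above.
        appContext.setResource(Resource.newInstance(1024, 1));

        // Submission goes through the applications-manager; the scheduler then
        // allocates a container and launches the application-master.
        ApplicationId appId = yarnClient.submitApplication(appContext);
        System.out.println("Submitted application " + appId);
    }
}
```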
2.2 Availability Limitation in YARN
Although YARN solves the scalability limitation of the original MapReduce, it still suffers from an availability limitation, namely the single-point-of-failure nature of the resource-manager. This section explains why the YARN resource-manager is a single point of failure.

Referring to Figure 1, container and task failures are handled by the node-manager. When a container fails or dies, the node-manager detects the failure event, launches a new container to replace the failing one and restarts the task execution in the new container.

In the event of an application-master failure, the resource-manager detects the failure and starts a new instance of the application-master in a new container. The ability to recover the associated job state depends on the application-master implementation. The MapReduce application-master has the ability to recover the state, but it is not enabled by default. Besides the resource-manager, the associated client also reacts to the failure: the client contacts the resource-manager to locate the new application-master's address.

Upon failure of a node-manager, the resource-manager updates its list of available node-managers. The application-master should recover the tasks that ran on the failing node-manager, but this again depends on the application-master implementation. The MapReduce application-master has the additional capability to recover the failed tasks and to blacklist node-managers that fail often.

Failure of the resource-manager is severe, since clients cannot submit new jobs and existing running jobs cannot negotiate and request new containers. Existing node-managers and application-masters try to reconnect to the failed resource-manager, and the job progress is lost when they are unable to reconnect. This loss of job progress is likely to frustrate the engineers and data scientists who use YARN, because typical production jobs that run on top of YARN are expected to run for a long time, typically on the order of a few hours. Furthermore, this limitation prevents YARN from being used efficiently in cloud environments (such as Amazon EC2), where node failures happen often.

2.3 Proposed Solution from Apache
To tackle this availability issue, Apache proposed a recovery failure model using ZooKeeper- or HDFS-based persistent storage (https://issues.apache.org/jira/browse/YARN-128). The proposed recovery failure model is transparent to clients, which means clients do not need to re-submit their jobs. In this model, the resource-manager saves the relevant information upon job submission.

This information currently includes the application-identification-number, the application-submission-context and the list of application-attempts. An application-submission-context contains information related to the job submission, such as the application name, the user who submitted the job and the amount of requested resources. An application-attempt represents each attempt of the resource-manager to run a job by creating a new application-master process. The saved information related to an application-attempt consists of the attempt identification number and the details of the first allocated container, such as the container identification number, the container's node, the requested resources and the job priority.

Upon restart, the resource-manager reloads the saved information and restarts all node-managers and application-masters. This restart mechanism does not retain the jobs that are currently executing in the cluster. In the worst case, all progress is lost and the job starts again from the beginning. To minimize this effect, a new application-master should be designed to read the states of the previous application-master that executed under the failed resource-manager. For example, the MapReduce application-master handles this case by storing the progress in another process called the job-history-server; upon restart, a new application-master obtains the job progress from the job-history-server.

The main drawback of this model is the downtime needed to start a new resource-manager process when the old one fails. If the downtime is too long, all processes reach their time-outs and clients need to re-submit their jobs to the new resource-manager. Furthermore, HDFS is not suitable for storing lots of small data items (in this case, the application states and application-attempts). ZooKeeper is suitable for the current data size, but it is likely to introduce problems as the amount of stored data grows, since ZooKeeper is designed to store typically small configuration data.

3. YARN WITH HIGH AVAILABILITY
In this section we explain our proposed failure model and architecture to solve YARN's availability limitation, as well as the implementation of the proposal.

3.1 Stateless Failure Model
We propose a stateless failure model, in which all the necessary information and states used by the resource-manager are stored in a persistent storage. Based on our observation, this information includes:

1. Application-related information such as the application-id, the application-submission-context and the application-attempts.

2. Resource-related information such as the list of node-managers and the available resources.

[Figure 2: Stateless Failure Model]

Figure 2 shows the architecture of the stateless failure model.
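As a rough illustration of what the stateless model has to externalize, the hypothetical interface below lists the operations a resource-manager would delegate to the shared store instead of keeping in its own memory. The names and signatures are our own sketch, not the API of the YARN-128 patch or of YARN-NDB.

```java
import java.io.IOException;
import java.util.List;

/**
 * Hypothetical storage contract for the stateless failure model: every
 * resource-manager reads and writes through this interface instead of
 * keeping the state in memory. Method names are illustrative only.
 */
public interface ResourceManagerStateStore {

    /** Persist the submission context the moment a client submits a job. */
    void storeApplication(long applicationId, long clusterTimestamp,
                          byte[] serializedSubmissionContext) throws IOException;

    /** Persist each application-attempt together with its master container details. */
    void storeApplicationAttempt(long applicationId, long clusterTimestamp,
                                 int attemptId, byte[] serializedMasterContainer)
            throws IOException;

    /** Track resource-related state: registered node-managers and their capacity. */
    void storeNodeManager(String nodeId, int availableMemoryMb) throws IOException;

    /** Remove all state belonging to a finished application. */
    void removeApplication(long applicationId, long clusterTimestamp) throws IOException;

    /** A freshly started (or concurrently running) resource-manager reloads everything. */
    List<byte[]> loadAllApplications() throws IOException;
}
```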
Since all the necessary information is stored in persistent storage, it is possible to have more than one resource-manager running at the same time. All of the resource-managers share the information through the storage and none of them holds the information in its own memory.

When a resource-manager fails, the other resource-managers can easily take over its work, since all the needed states are stored in the storage. Clients, node-managers and application-masters need to be modified so that they can point to a new resource-manager upon failure.

To achieve high availability through this failure model, we need a storage that satisfies the following requirements:

1. The storage should be highly available; it must not have a single point of failure.

2. The storage should be able to handle high read and write rates for small data (on the order of at most a few kilobytes), since this failure model performs very frequent reads and writes to the storage.

ZooKeeper and HDFS satisfy the first requirement, but not the second: ZooKeeper is not designed as a persistent data store, and HDFS is not designed to handle high read and write rates for small data. We therefore need another storage technology, and MySQL Cluster (NDB) is suitable for these requirements. Section 3.2 explains NDB in more detail.

[Figure 3: YARN with High Availability Architecture]

Figure 3 shows the high-level diagram of the proposed architecture. NDB is introduced to store the resource-manager states.

3.2 MySQL Cluster (NDB)
MySQL Cluster (NDB) is a scalable in-memory distributed database. It is designed for availability, which means there is no single point of failure in an NDB cluster. Furthermore, it complies with the ACID transactional properties. Horizontal scalability is achieved by automatic data sharding based on a user-defined partition key. The latest benchmark from Oracle shows that MySQL Cluster version 7.2 achieves horizontal scalability: when the number of data-nodes is increased 15 times, the throughput increases 13.63 times [5].

Regarding performance, NDB has fast read and write rates; the aforementioned benchmark [5] shows that a 30-node NDB cluster supports 19.5 million writes per second. NDB supports fine-grained locking, which means only the affected rows are locked during a transaction, so updates on two different rows in the same table can be executed concurrently. Both SQL and NoSQL interfaces are supported, which makes NDB highly flexible depending on users' needs and requirements.

3.3 NDB Storage Module
As a proof-of-concept of our proposed architecture, we designed and implemented an NDB storage module for the YARN resource-manager. Due to limited time, the recovery failure model was used in our implementation. In this report, we refer to this proof-of-concept of NDB-based YARN as YARN-NDB.

3.3.1 Database Design
We designed two NDB tables to store application states and their corresponding application-attempts, called applicationstate and attemptstate. Table 1 shows the columns of the applicationstate table. id is a running number and is only unique within a resource-manager. clustertimestamp is the timestamp at which the corresponding resource-manager was started. When more than one resource-manager runs at a time (as in the stateless failure model), we need to differentiate the applications that run among them; therefore, the primary key of this table is (id, clustertimestamp). appcontext is a serialized ApplicationSubmissionContext object, hence its type is varbinary.

Table 1: Properties of application state
  Column            Type
  id                int
  clustertimestamp  bigint
  submittime        bigint
  appcontext        varbinary(13900)

The columns of the attemptstate table are shown in Table 2. applicationid and clustertimestamp are foreign keys to the applicationstate table. attemptid is the id of an attempt, and mastercontainer contains serialized information about the first container assigned to the corresponding application-master. The primary key of this table is (attemptid, applicationid, clustertimestamp).

Table 2: Properties of attempt state
  Column            Type
  attemptid         int
  applicationid     int
  clustertimestamp  bigint
  mastercontainer   varbinary(13900)

To enhance table performance in terms of read and write throughput, a partitioning technique was used (http://dev.mysql.com/doc/refman/5.5/en/partitioning-key.html). Both tables were partitioned by applicationid and clustertimestamp. With this technique, NDB locates the desired data without contacting NDB's location resolver service, and is therefore faster than with unpartitioned NDB tables.
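As an illustration of how this schema could be mapped with ClusterJ (the NDB Java connector used in Section 3.3.2), a sketch is shown below. The @PersistenceCapable, @PartitionKey, @PrimaryKey and @Column annotations are real ClusterJ constructs, but the interfaces, property values and the short store-and-reload example are our own reconstruction under assumed names, not the actual YARN-NDB code.

```java
import com.mysql.clusterj.ClusterJHelper;
import com.mysql.clusterj.Session;
import com.mysql.clusterj.SessionFactory;
import com.mysql.clusterj.annotation.Column;
import com.mysql.clusterj.annotation.PartitionKey;
import com.mysql.clusterj.annotation.PersistenceCapable;
import com.mysql.clusterj.annotation.PrimaryKey;
import java.util.Properties;

public class NdbSchemaSketch {

    /** Mapping of the applicationstate table (Table 1). */
    @PersistenceCapable(table = "applicationstate")
    @PartitionKey(columns = { @Column(name = "id"), @Column(name = "clustertimestamp") })
    public interface ApplicationState {
        @PrimaryKey int getId();
        void setId(int id);

        @PrimaryKey long getClustertimestamp();
        void setClustertimestamp(long ts);

        long getSubmittime();
        void setSubmittime(long t);

        byte[] getAppcontext();            // serialized ApplicationSubmissionContext
        void setAppcontext(byte[] ctx);
    }

    /** Mapping of the attemptstate table (Table 2). */
    @PersistenceCapable(table = "attemptstate")
    @PartitionKey(columns = { @Column(name = "applicationid"), @Column(name = "clustertimestamp") })
    public interface AttemptState {
        @PrimaryKey int getAttemptid();
        void setAttemptid(int id);

        @PrimaryKey int getApplicationid();
        void setApplicationid(int id);

        @PrimaryKey long getClustertimestamp();
        void setClustertimestamp(long ts);

        byte[] getMastercontainer();       // serialized first container of the application-master
        void setMastercontainer(byte[] c);
    }

    public static void main(String[] args) {
        // Connection settings are placeholders for a local NDB management server.
        Properties props = new Properties();
        props.setProperty("com.mysql.clusterj.connectstring", "localhost:1186");
        props.setProperty("com.mysql.clusterj.database", "yarn");
        SessionFactory factory = ClusterJHelper.getSessionFactory(props);
        Session session = factory.getSession();

        // Store one application state, as the resource-manager would on job submission.
        ApplicationState app = session.newInstance(ApplicationState.class);
        app.setId(1);
        app.setClustertimestamp(System.currentTimeMillis());
        app.setSubmittime(System.currentTimeMillis());
        app.setAppcontext(new byte[0]);
        session.persist(app);

        // Reload it by the composite primary key (id, clustertimestamp).
        ApplicationState loaded = session.find(ApplicationState.class,
                new Object[] { 1, app.getClustertimestamp() });
        System.out.println("Recovered application " + loaded.getId());
        session.close();
    }
}
```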
3.3.2 Integration with the Resource-Manager
We developed YARN-NDB using ClusterJ (http://dev.mysql.com/doc/ndbapi/en/mccj.html) over two development iterations, based on patches released by Apache. The first YARN-NDB implementation is based on the YARN-128.full-code.5 patch on top of the Hadoop trunk dated 11 November 2012. The second implementation (https://github.com/arinto/hadoop-common) is based on the YARN-231-2 patch (https://issues.apache.org/jira/browse/YARN-231) on top of the Hadoop trunk dated 23 December 2012. In this report, we refer to the second implementation of YARN-NDB unless otherwise specified. The NDB storage module in YARN-NDB has the same functionality as Apache YARN's HDFS and ZooKeeper storage modules, such as adding and deleting application states and attempts.

Furthermore, we developed a unit test module for the storage module. Figure 4 shows the flowchart of this unit test. In this module, three MapReduce jobs are submitted to YARN-NDB. The first job finishes its execution before the resource-manager fails. The second job is successfully submitted and scheduled, so its application-master is launched, but no container is allocated. The third job is successfully submitted but not yet scheduled. These three jobs represent three different scenarios at the moment a resource-manager fails.

[Figure 4: NDB Storage Unit Test Flowchart]

Restarting a resource-manager is achieved by connecting the existing application-masters and node-managers to the new resource-manager. All application-master and node-manager processes are rebooted by the new resource-manager, and all unfinished jobs are re-executed with a new application-attempt.

4. EVALUATION
We designed two types of evaluation in this project. The first evaluation tests whether the NDB storage module works as expected. The second evaluation investigates and compares the throughput of ZooKeeper, HDFS and NDB when storing YARN's application state.

4.1 NDB Storage Module Evaluation
4.1.1 Unit Test
This evaluation used the unit test class explained in Section 3.3.2. It was performed on a single-node NDB cluster (i.e. two NDB datanode processes on one node) running on a computer with 4 GB of RAM and an Intel dual-core i3 CPU at 2.40 GHz. We changed ClusterJ's Java properties file accordingly to point to our single-node NDB cluster. The unit test class was executed using Maven and NetBeans, and the result was positive. We tested consistency by executing the unit test class several times, and it always passed.

4.1.2 Actual Resource-Manager Failure Test
In this evaluation, we used the Swedish Institute of Computer Science (SICS) cluster. Each node in SICS's cluster had 30 GB of RAM and two six-core AMD Opteron processors at 2.6 GHz, which could effectively run 12 threads without significant context-switching overhead. Ubuntu 11.04 with Linux kernel 2.6.38-12-server was installed as the operating system, and Java(TM) SE Runtime Environment (JRE) version 1.6.0 was the Java runtime environment.

NDB was deployed on a 6-node cluster and YARN-NDB was configured with a single-node setting. We executed the pi and bbp examples that come with the Hadoop distribution. In the middle of the pi and bbp execution, we terminated the resource-manager process using the Linux kill command. A new resource-manager with the same address and port was started three seconds after the old one was successfully terminated.

We observed that the currently running job finished properly, which means the resource-manager was correctly restarted. Several connection-retry attempts to contact the resource-manager by node-managers, application-masters and MapReduce clients were observed. To check for consistency, we submitted a new job to the new resource-manager, and the new job finished correctly. We repeated this experiment several times and observed the same results, i.e. the new resource-manager was successfully restarted and correctly took over the killed resource-manager's roles.

4.2 NDB Performance Evaluation
We utilised the same set of machines in the SICS cluster as in the evaluation of Section 4.1.2. NDB was deployed on the same 6-node cluster, and ZooKeeper was deployed on three SICS nodes. The maximum memory for each ZooKeeper process was set to 5 GB of RAM. HDFS was also deployed on three SICS nodes and used the same maximum memory configuration of 5 GB of RAM.
[Figure 5: zkndb Architecture]

4.2.1 zkndb Framework
We developed the zkndb framework (https://github.com/4knahs/zkndb) to benchmark storage systems effectively with minimum effort. Figure 5 shows the architecture of the zkndb framework. The framework consists of three main packages:

1. The storage package, which contains the configurable load generator (StorageImpl), configured in terms of the number of reads and writes per time unit.

2. The metrics package, which contains the metrics parameters (MetricsEngine), for example write or read requests and acknowledgements. Additionally, this package contains the metrics logging mechanism (ThroughputEngineImpl).

3. The benchmark package, which contains the benchmark applications and manages benchmark executions.

The zkndb framework offers flexibility in integrating new storage technologies, defining new metrics and storing benchmark results. To integrate a new storage technology, framework users implement the storage interface in the storage package. A new metric can be developed by implementing the metric interface in the metrics package. Additionally, framework users can design a new metrics logging mechanism by implementing the throughput-engine interface in the metrics package. The data produced by ThroughputEngineImpl were processed further by our custom scripts for analysis. For this evaluation, three storage implementations were added to the framework: NDB, HDFS and ZooKeeper.

4.2.2 Benchmark Implementation in zkndb
For ZooKeeper and HDFS, we ported YARN's storage module implementations based on the YARN-128.full-code.5 patch (https://issues.apache.org/jira/browse/YARN-128) into our benchmark. The first iteration of YARN-NDB's NDB storage module was ported into our zkndb NDB storage implementation.

Each data-write into the storage consisted of an application identification and application state information. The application identification was a Java long data type with a size of eight bytes. The application state information was an array of random bytes with a length of 53 bytes; this length was determined after observing the actual application state information stored when executing YARN-NDB jobs. Each data-read consisted of reading an application identification and its corresponding application state information.

Three types of workload were used in our experiment:

1. Read-intensive. One set of data was written into the database, and zkndb always read that written data.

2. Write-intensive. No reads were performed; zkndb always wrote a new set of data into a different location.

3. Read-write balanced. Reads and writes were performed alternately.

Furthermore, we varied the throughput rate by configuring the number of threads that accessed the database for reading and writing. To maximize the throughput, no delay was configured between reads and writes. We compared the throughput of ZooKeeper, HDFS and NDB for equal configurations of the number of threads and workload type. In addition, the scalability of each storage was investigated by increasing the number of threads while keeping the other configurations unchanged.
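The sketch below illustrates, under our own naming, how a zkndb-style benchmark worker could drive the three workload types and count completed requests. It is a simplified reconstruction of the design described above, not the actual zkndb code.

```java
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicLong;

/** Illustrative zkndb-style throughput benchmark; names are ours, not the framework's. */
public class ThroughputBenchmark {

    /** Minimal stand-in for the storage interface (NDB, ZooKeeper or HDFS behind it). */
    public interface Storage {
        void write(long appId, byte[] state);   // store application id + 53-byte state
        byte[] read(long appId);                 // read it back
    }

    public enum Workload { READ_INTENSIVE, WRITE_INTENSIVE, READ_WRITE_BALANCED }

    public static long run(Storage storage, Workload workload,
                           int numThreads, long durationMillis) throws InterruptedException {
        AtomicLong completed = new AtomicLong();
        byte[] state = new byte[53];             // same size as the observed application state
        long deadline = System.currentTimeMillis() + durationMillis;

        Thread[] workers = new Thread[numThreads];
        for (int t = 0; t < numThreads; t++) {
            final long threadId = t;
            workers[t] = new Thread(() -> {
                long nextId = threadId * 1_000_000L;   // each thread uses its own key range
                storage.write(nextId, state);          // seed one row for the read-only case
                while (System.currentTimeMillis() < deadline) {
                    switch (workload) {
                        case READ_INTENSIVE:
                            storage.read(nextId);                 // always re-read the same row
                            break;
                        case WRITE_INTENSIVE:
                            storage.write(++nextId, state);       // always a new location
                            break;
                        case READ_WRITE_BALANCED:
                            storage.write(++nextId, state);       // alternate write then read
                            storage.read(nextId);
                            completed.incrementAndGet();          // count the extra read
                            break;
                    }
                    completed.incrementAndGet();   // no artificial delay between requests
                }
            });
            workers[t].start();
        }
        for (Thread w : workers) {
            w.join();
        }
        // Completed requests per second, as plotted in Figures 6-8.
        return completed.get() / TimeUnit.MILLISECONDS.toSeconds(durationMillis);
    }
}
```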
4.2.3 Throughput Benchmark Result
Figure 6 shows the throughput benchmark results for eight threads and one minute of execution, with the three workload types and the three storage implementations: ZooKeeper, NDB and HDFS.

[Figure 6: zkndb Throughput Benchmark Result for 8 Threads and 1 Minute of Benchmark Execution (completed requests/s for read-intensive, write-intensive and read/write-intensive workloads)]

For all three workload types, NDB had the highest throughput compared to ZooKeeper and HDFS. These results can be attributed to the nature of NDB as a high-performance persistent storage that is capable of handling high read and write request rates. Referring to the error bars in Figure 6, NDB shows a large deviation between its average and its lowest value during the experiment. This large deviation could be attributed to infrequent interventions by the NDB management process to recalculate the data index for fast access.

Interestingly, ZooKeeper's throughput was stable across all workload types. This stability can be accounted for by ZooKeeper's behaviour of linearizing incoming requests, which causes read and write requests to have approximately the same execution time. Another possible explanation for ZooKeeper's throughput stability is YARN's ZooKeeper storage module implementation, whose code could make the read and write execution times equal.

As expected, HDFS had the lowest throughput for all workload types. HDFS's low throughput may be attributed to NameNode-locking overhead and an inefficient data access pattern when processing lots of small files. Each time HDFS receives a read or write request, the HDFS NameNode needs to acquire a lock on the file path so that HDFS can return a valid result; acquiring locks frequently increases data access time and hence decreases throughput. The inefficient data access pattern in HDFS is due to splitting the data to fit into HDFS blocks and to data replication. Furthermore, the need to write the data to disk in HDFS decreases throughput, as observed in the write-intensive and read-write balanced workloads.

4.2.4 Scalability Benchmark Result
Figure 7 shows the increase in throughput when we increased the number of threads for the read-intensive workload. All of the storage implementations increased their throughput when the number of threads was increased. NDB had the highest increase compared to HDFS and ZooKeeper: for NDB, doubling the number of threads increased the throughput by a factor of 1.69, which is close to linear scalability.

[Figure 7: Scalability Benchmark Results for Read-Intensive Workload (completed requests/s vs. number of threads: 4, 8, 12, 16, 24, 36)]

The same trend was observed for the write-intensive workload, as shown in Figure 8. NDB still had the highest increase in throughput compared to HDFS and ZooKeeper; for NDB, doubling the number of threads increased the throughput by a factor of 1.67. On the other hand, HDFS performed very poorly for this workload: the highest throughput achieved by HDFS, with 36 threads, was only 534.92 requests per second. The poor performance of HDFS can be attributed to the same reasons as explained in Section 4.2.3, namely NameNode-locking overhead and the inefficient data access pattern for small files.

[Figure 8: Scalability Benchmark Results for Write-Intensive Workload (completed requests/s vs. number of threads: 4, 8, 12, 16, 24, 36)]

5. RELATED WORK
5.1 Corona
Corona [2] introduces a new process called the cluster-manager to take over cluster management functions from the MapReduce job-tracker. The main purpose of the cluster-manager is to keep track of the amount of free resources in the cluster and to manage the cluster's nodes. Corona utilizes push-based scheduling, i.e. the cluster-manager pushes the allocated resources back to the job-tracker after it receives resource requests. Furthermore, Corona claims that scheduling latency is low since no periodic heartbeat is involved in this resource scheduling. Although Corona solves the MapReduce scalability limitation, it has a single point of failure in the cluster-manager, so the MapReduce availability limitation is still present.

5.2 KTHFS
KTHFS [12] solves the scalability and availability limitations of HDFS NameNodes. The filesystem metadata of the HDFS NameNodes is stored in NDB, making the HDFS NameNodes fully stateless. By being stateless, more than one HDFS NameNode can run simultaneously, and the failure of a NameNode can easily be mitigated by the remaining live NameNodes. Furthermore, KTHFS has linear throughput scalability, i.e. throughput can be increased by adding HDFS NameNodes or NDB DataNodes. KTHFS inspired the use of NDB to solve the YARN availability limitation.

5.3 Mesos
Mesos [3] is a resource management platform that enables commodity-cluster sharing between different cluster computing frameworks. Cluster utilization is improved due to this sharing mechanism.
Mesos has several master processes with roles similar to that of the YARN resource-manager. The availability of Mesos is achieved by having several standby master processes ready to replace the failed active master process. Mesos utilizes ZooKeeper to monitor the group of master processes, and during master process failures ZooKeeper performs leader election to choose the new active master process. Reconstruction of state is performed by the newly active master process; this reconstruction mechanism may introduce a significant delay when the state is big.

5.4 Apache HDFS-1623
Apache utilizes a failover recovery model to solve the HDFS NameNode single-point-of-failure limitation [9, 10]. In this solution, additional HDFS NameNodes are introduced as standby NameNodes. The active NameNode writes all changes to the file system namespace into a write-ahead log in persistent storage. Overhead when storing data is likely to be introduced, and its magnitude depends on the choice of storage system. This solution supports automatic failover, but the solution's complexity increases due to the additional processes that act as failure detectors. These failure detectors trigger the automatic failover mechanism when they detect NameNode failures.

6. CONCLUSION AND FUTURE WORK
We have presented an architecture for a highly available cluster computing management framework. The proposed architecture incorporates a stateless failure model into the existing Apache YARN. To achieve high availability and the stateless failure model, MySQL Cluster (NDB) was proposed as the storage technology for the necessary state information.

As a proof-of-concept, we implemented Apache YARN's recovery failure model using NDB (YARN-NDB), and we developed the zkndb benchmark framework to test it. The availability and scalability of the implementation have been examined and demonstrated using a unit test, an actual resource-manager failure test and throughput benchmark experiments. The results showed that YARN-NDB is better in terms of throughput and ability to scale than the existing ZooKeeper- and HDFS-based solutions.

For future work, we plan to develop YARN-NDB further with a fully stateless failure model. As a first step, a more detailed analysis of the resource-manager states is needed. After the states are analysed, we plan to re-design the database to accommodate the additional state information from the analysis. In addition, modifications to the YARN-NDB code are needed to remove the information from memory and always access NDB when the information is needed. Next, we will perform evaluations to measure the throughput and overhead of the new implementation. Finally, after the new implementation passes the evaluations, we should deploy YARN-NDB on a significantly bigger cluster with real-world workloads to check its actual scalability. The resulting YARN-NDB is expected to run properly in cloud environments and to handle node failures correctly.

7. ACKNOWLEDGEMENT
The authors would like to thank our partner Mário Almeida for his contribution to the project. We would also like to thank our colleagues Ümit Çavuş Büyükşahin, Strahinja Lazetic and Vasiliki Kalavri for providing feedback throughout this project. Additionally, we would like to thank our EMDC friends Muhammad Anis uddin Nasir, Emmanouil Dimogerontakis, Maria Stylianou and Mudit Verma for their continuous support throughout the report writing process.

8. REFERENCES
[1] J. Dean and S. Ghemawat. MapReduce: simplified data processing on large clusters. Commun. ACM, 51(1):107-113, Jan. 2008.
[2] Facebook. Under the hood: Scheduling MapReduce jobs more efficiently with Corona, Nov. 2012. Retrieved November 18, 2012 from http://on.fb.me/109FHPD.
[3] B. Hindman, A. Konwinski, M. Zaharia, A. Ghodsi, A. D. Joseph, R. Katz, S. Shenker, and I. Stoica. Mesos: a platform for fine-grained resource sharing in the data center. In Proceedings of the 8th USENIX Conference on Networked Systems Design and Implementation, NSDI'11, page 22, Berkeley, CA, USA, 2011. USENIX Association.
[4] P. Hunt, M. Konar, F. P. Junqueira, and B. Reed. ZooKeeper: wait-free coordination for internet-scale systems. In USENIX ATC, volume 10, 2010.
[5] M. Keep. MySQL Cluster 7.2 GA released, delivers 1 BILLION queries per minute, Apr. 2012. Retrieved November 18, 2012 from http://dev.mysql.com/tech-resources/articles/mysql-cluster-7.2-ga.html.
[6] A. C. Murthy. The next generation of Apache Hadoop MapReduce, Feb. 2011. Retrieved November 18, 2012 from http://developer.yahoo.com/blogs/hadoop/posts/2011/02/mapreduce-nextgen/.
[7] A. C. Murthy. Introducing Apache Hadoop YARN, Aug. 2012. Retrieved November 11, 2012 from http://hortonworks.com/blog/introducing-apache-hadoop-yarn/.
[8] A. C. Murthy, C. Douglas, M. Konar, O. O'Malley, S. Radia, S. Agarwal, and V. KV. Architecture of next generation Apache Hadoop MapReduce framework. Retrieved November 18, 2012 from https://issues.apache.org/jira/secure/attachment/12486023/MapR.
[9] A. Myers. High availability for the Hadoop Distributed File System (HDFS), Mar. 2012. Retrieved November 18, 2012 from http://bit.ly/ZT1xIc.
[10] S. Radia. High availability framework for HDFS NN, Feb. 2011. Retrieved January 4, 2012 from https://issues.apache.org/jira/browse/HDFS-1623.
[11] K. Shvachko, H. Kuang, S. Radia, and R. Chansler. The Hadoop Distributed File System. In 2010 IEEE 26th Symposium on Mass Storage Systems and Technologies (MSST), pages 1-10, May 2010.
[12] M. Wasif. A distributed namespace for a distributed file system, 2012. Retrieved November 18, 2012 from http://kth.diva-portal.org/smash/record.jsf?searchId=1&pid=diva2:548037.
