Next Generation Hadoop: High Availability for YARN

Arinto Murdopo
KTH Royal Institute of Technology
Hanstavägen 49 - 1065A, 164 53 Kista, Sweden
arinto@kth.se

Jim Dowling
Swedish Institute of Computer Science
Isafjordsgatan 22, 164 40 Kista, Sweden
jdowling@kth.se
ABSTRACT

Hadoop is one of the most widely adopted cluster computing frameworks for big data processing, but it is not free from limitations, and computer scientists and engineers continuously work to eliminate them and improve Hadoop. One such improvement is YARN, which removes the scalability limitation of first-generation MapReduce. However, YARN still suffers from an availability limitation: its resource-manager is a single point of failure. In this paper we propose an architecture that solves this availability limitation. The novelty of the architecture lies in its stateless failure model, which enables multiple YARN resource-managers to run concurrently and thereby maintains high availability. MySQL Cluster (NDB) is proposed as the storage technology in our architecture, and we implemented a proof-of-concept of the proposed design. Our evaluations show that the proof-of-concept increases the availability of YARN. In addition, NDB is shown to have higher throughput than the storage systems proposed by Apache (ZooKeeper and HDFS). Finally, the evaluations show that NDB achieves near-linear scalability, making it suitable for our proposed stateless failure model.

Categories and Subject Descriptors

D.4.7 [Operating Systems]: Distributed Systems, Batch Processing Systems

General Terms

Big Data, Storage Management

1. INTRODUCTION

Big data has become widespread across industries, especially among web companies. Data volumes have reached the petabyte scale and will keep growing in the upcoming years. Traditional storage systems such as regular file systems and relational databases were not designed to handle data of this magnitude; scalability is their main problem in handling big data. This situation has given rise to several cluster computing frameworks designed to handle big data effectively.

One widely adopted cluster computing framework, commonly used by web companies, is Hadoop (http://hadoop.apache.org/). It mainly consists of the Hadoop Distributed File System (HDFS) [11] to store data and, on top of HDFS, a MapReduce framework inspired by Google's MapReduce [1] to process that data. Although Hadoop has arguably become the standard solution for managing big data, it is not free from limitations, and these limitations have triggered significant efforts from academia and industry to improve Hadoop. Cloudera tried to reduce the availability limitation of HDFS using NameNode replication [9]. KTHFS solved the HDFS availability limitation by utilizing MySQL Cluster to make the HDFS NameNode stateless [12]. The scalability of MapReduce has become a prominent limitation: MapReduce reaches its scalability limit at around 4000 nodes. To solve this, the open source community proposed the next-generation MapReduce called YARN (Yet Another Resource Negotiator) [8], and from the enterprise world, Facebook released Corona to overcome the same scalability limitation [2]. Another limitation is Hadoop's inability to perform fine-grained resource sharing between multiple computation frameworks; Mesos tried to solve this by implementing a distributed two-level scheduling mechanism called resource offers [3].

However, few solutions have addressed the availability limitation of the MapReduce framework. When a MapReduce JobTracker failure occurs, the corresponding application cannot continue, reducing MapReduce's availability. The current YARN architecture does not solve this availability limitation either: the ResourceManager, the JobTracker's equivalent in YARN, remains a single point of failure. The open source community has recently started to address this issue, but no final and proven solution is available yet (https://issues.apache.org/jira/browse/YARN-128). The current proposal from the community is to use ZooKeeper [4] or HDFS as persistent storage for the ResourceManager's state; upon failure, the ResourceManager is recovered from the stored state.

Solving this availability limitation will make YARN cloud-ready: YARN could then run in the cloud, for example on Amazon EC2, and withstand the failures that often happen there.
In this report, we present a new architecture for YARN. Its main goal is to solve the aforementioned availability limitation in YARN. The architecture provides a better alternative to the existing ZooKeeper-based proposal, since it eliminates the potential scalability bottleneck caused by ZooKeeper's relatively limited throughput.

To achieve the desired availability, the new architecture uses a distributed in-memory database called MySQL Cluster (NDB, http://www.mysql.com/products/cluster/) to persist the ResourceManager state. NDB automatically replicates the stored data across different NDB data nodes to ensure high availability. Moreover, NDB is able to handle up to 1.8 million write queries per second [5].

This report is organized as follows. Section 2 presents the existing YARN architecture, its availability limitation, and the solution proposed by Apache. Our proposed architecture is presented in Section 3. Section 4 presents our evaluation of the availability and scalability of the proposed architecture. Related work on improving the availability of cluster computing frameworks is presented in Section 5, and Section 6 concludes the report and proposes future work.

2. YARN ARCHITECTURE

This section explains the current YARN architecture, YARN's availability limitation, and Apache's proposed solution to overcome it.

2.1 Architecture Overview

YARN's main goal is to provide more flexibility than Hadoop in terms of the data processing frameworks that can be executed on top of it [7]. It is equipped with a generic distributed application framework and resource-management components; therefore, YARN supports not only MapReduce but also other data processing frameworks such as Apache Giraph, Apache Hama and Spark.

In addition, YARN aims to solve the scalability limitation of the original implementation of Apache's MapReduce [6]. To achieve this, YARN splits the MapReduce job-tracker's responsibilities of application scheduling, resource management and application monitoring into separate processes: a resource-manager, which handles global resource management and job scheduling, and an application-master, which is responsible for job monitoring, job life-cycle management and resource negotiation with the resource-manager. Each submitted job corresponds to one application-master process. Furthermore, YARN converts the original MapReduce task-tracker into a node-manager, which manages task execution in YARN's unit of resource, the container.

Figure 1 shows the current YARN architecture.

Figure 1: YARN Architecture

The resource-manager has three core components:

1. The scheduler, which schedules submitted jobs based on a specific policy and the available resources. The policy is pluggable, so we can implement our own scheduling policy for our YARN deployment; YARN currently provides three policies to choose from: the fair-scheduler, the FIFO-scheduler and the capacity-scheduler. Ideally, the scheduler should take CPU, memory, disk and other computing resources into account during scheduling; however, current YARN only supports memory as a scheduling factor.

2. The resource-tracker, which handles the management of computing nodes, i.e. nodes that run a node-manager process and provide computing resources. The management tasks include registering new nodes, handling requests from invalid or decommissioned nodes, and processing node heartbeats. The resource-tracker works closely with the node-liveness-monitor (the NMLivenessMonitor class), which keeps track of live and dead computing nodes based on their heartbeats, and the node-list-manager (the NodesListManager class), which stores the lists of valid and excluded computing nodes based on the YARN configuration files.

3. The applications-manager, which maintains the collection of user-submitted jobs and a cache of completed jobs. It is the entry point for clients to submit their jobs.

In YARN, clients submit jobs through the applications-manager, and each submission triggers the scheduler to try to schedule the job. When the job is scheduled, the resource-manager allocates a container and launches a corresponding application-master in it. The application-master then takes over: it splits the job into smaller tasks, requests additional containers from the resource-manager, launches them with the help of the node-managers, assigns the tasks to the available containers and keeps track of the job's progress. Clients learn the job's progress by polling the application-master at an interval defined in the YARN configuration. When the job is completed, the application-master cleans up its working state.
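As a rough illustration of the submission flow just described, the following sketch models a job's life-cycle in plain Java. The class and method names are our own illustrative inventions, not YARN's actual API:

```java
import java.util.ArrayList;
import java.util.List;

// Illustrative sketch of the YARN job life-cycle described above:
// submit -> schedule (application-master container) -> run (task
// containers) -> finish (clean up). Names are assumptions, not YARN code.
public class JobLifecycle {
    enum State { SUBMITTED, SCHEDULED, RUNNING, FINISHED }

    static class Job {
        State state = State.SUBMITTED;            // client submitted via applications-manager
        final List<String> containers = new ArrayList<>();

        void schedule() {                         // scheduler picks the job
            containers.add("application-master"); // resource-manager allocates the AM container
            state = State.SCHEDULED;
        }

        void run(int tasks) {                     // AM splits the job and requests containers
            for (int i = 0; i < tasks; i++) {
                containers.add("task-" + i);      // launched with the help of node-managers
            }
            state = State.RUNNING;
        }

        void finish() {                           // AM cleans up its working state
            containers.clear();
            state = State.FINISHED;
        }
    }

    public static void main(String[] args) {
        Job job = new Job();
        job.schedule();
        job.run(3);
        job.finish();
        System.out.println(job.state); // FINISHED
    }
}
```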
2.2 Availability Limitation in YARN

Although YARN solves the scalability limitation of the original MapReduce, it still suffers from an availability limitation: the single-point-of-failure nature of the resource-manager. This section explains why the YARN resource-manager is a single point of failure.

Referring to Figure 1, container and task failures are handled by the node-manager. When a container fails or dies, the node-manager detects the failure event, launches a new container to replace the failing one, and restarts the task execution in the new container.

In the event of an application-master failure, the resource-manager detects the failure and starts a new instance of the application-master in a new container. Whether the associated job state can be recovered depends on the application-master implementation: the MapReduce application-master is able to recover the state, but this capability is not enabled by default. Besides the resource-manager, the associated client also reacts to the failure, contacting the resource-manager to locate the new application-master's address.

Upon failure of a node-manager, the resource-manager updates its list of available node-managers. The tasks that were running on the failed node-manager should be recovered, but this again depends on the application-master implementation. The MapReduce application-master has the additional capability to recover failed tasks and to blacklist node-managers that fail too often.

Failure of the resource-manager is severe: clients cannot submit new jobs, and existing running jobs cannot negotiate and request new containers. Existing node-managers and application-masters try to reconnect to the failed resource-manager, and the job progress is lost when they are unable to reconnect. This loss of job progress is likely to frustrate the engineers and data scientists who use YARN, because typical production jobs running on top of YARN are expected to be long-running, typically on the order of a few hours. Furthermore, this limitation prevents YARN from being used efficiently in cloud environments (such as Amazon EC2), where node failures happen frequently.

2.3 Proposed Solution from Apache

To tackle this availability issue, Apache proposed a recovery failure model using ZooKeeper- or HDFS-based persistent storage (https://issues.apache.org/jira/browse/YARN-128). The proposed recovery failure model is transparent to clients, meaning that clients do not need to re-submit their jobs. In this model, the resource-manager saves the relevant information upon job submission.

This information currently includes the application identification number, the application-submission-context and the list of application-attempts. An application-submission-context contains information related to the job submission, such as the application name, the user who submitted the job, and the amount of requested resources. An application-attempt represents one attempt by the resource-manager to run a job by creating a new application-master process. The saved information related to an application-attempt comprises the attempt identification number and the details of the first allocated container, such as the container identification number, the container's node, the requested resources and the job priority.

Upon restart, the resource-manager reloads the saved information and restarts all node-managers and application-masters. This restart mechanism does not retain the jobs currently executing in the cluster: in the worst case, all progress is lost and the jobs start over from the beginning. To minimize this effect, a new application-master should be designed to read the states of the previous application-master that executed under the failed resource-manager. For example, the MapReduce application-master handles this case by storing the job progress in a separate process called the job-history-server; upon restart, the new application-master obtains the job progress from the job-history-server.

The main drawback of this model is the downtime taken to start a new resource-manager process when the old one fails. If the downtime is too long, all processes reach their timeouts and clients need to re-submit their jobs to the new resource-manager. Furthermore, HDFS is not suitable for storing many small data items (in this case, the application states and application-attempts). ZooKeeper is suitable for the current data size, but it is likely to become a problem as the amount of stored data increases, since ZooKeeper is designed to store only small configuration data.

3. YARN WITH HIGH AVAILABILITY

We explain our proposed failure model and the architecture that solves the YARN availability limitation, followed by the implementation of the proposal.

3.1 Stateless Failure Model

We propose a stateless failure model, in which all necessary information and states used by the resource-manager are stored in persistent storage. Based on our observations, this information includes:

1. Application-related information, such as the application-id, application-submission-context and application-attempts.

2. Resource-related information, such as the list of node-managers and the available resources.

Figure 2: Stateless Failure Model

Figure 2 shows the architecture of the stateless failure model.
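The two kinds of externalized state listed above can be sketched as plain Java value classes. The field names follow the lists above, but the classes themselves are illustrative assumptions, not the actual YARN-NDB code:

```java
import java.util.ArrayList;
import java.util.List;

// Rough sketch of the state a stateless resource-manager would externalize
// to shared storage (illustrative, not the real YARN-NDB classes).
public class ExternalizedState {
    static class ApplicationState {
        final int applicationId;
        final long clusterTimestamp;        // disambiguates ids across resource-managers
        final byte[] submissionContext;     // serialized application-submission-context
        final List<Integer> attemptIds = new ArrayList<>();

        ApplicationState(int applicationId, long clusterTimestamp, byte[] submissionContext) {
            this.applicationId = applicationId;
            this.clusterTimestamp = clusterTimestamp;
            this.submissionContext = submissionContext.clone();
        }
    }

    static class ResourceState {
        final List<String> nodeManagers = new ArrayList<>(); // registered node-managers
        int availableMemoryMb;              // memory is the only scheduling factor today
    }

    public static void main(String[] args) {
        ApplicationState app = new ApplicationState(1, System.currentTimeMillis(),
                new byte[]{0x0a, 0x0b});
        app.attemptIds.add(1);              // first application-attempt
        System.out.println(app.attemptIds.size()); // 1
    }
}
```

Because none of this lives in a resource-manager's own memory, any running resource-manager can load these records and continue the work.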
Since all the necessary information is stored in persistent storage, it is possible to have more than one resource-manager running at the same time. All of the resource-managers share the information through the storage, and none of them holds the information in its own memory.

When a resource-manager fails, the other resource-managers can easily take over its jobs, since all the needed state is in the storage. Clients, node-managers and application-masters need to be modified so that they can point to a new resource-manager upon failure.

To achieve high availability through this failure model, we need a storage system that satisfies the following requirements:

1. The storage should be highly available, with no single point of failure.

2. The storage should be able to handle high read and write rates for small data items (at most a few kilobytes each), since this failure model performs very frequent reads and writes.

ZooKeeper and HDFS satisfy the first requirement but not the second: ZooKeeper is not designed as a persistent storage for data, and HDFS is not designed to handle high read and write rates for small data. We therefore need a different storage technology, and MySQL Cluster (NDB) fits these requirements. Section 3.2 explains NDB in more detail.

Figure 3: YARN with High Availability Architecture

Figure 3 shows the high-level diagram of the proposed architecture: NDB is introduced to store the resource-manager states.

3.2 MySQL Cluster (NDB)

MySQL Cluster (NDB) is a scalable in-memory distributed database. It is designed for availability, meaning there is no single point of failure in an NDB cluster, and it complies with the ACID transactional properties. Horizontal scalability is achieved by automatic data sharding based on a user-defined partition key. The latest benchmark from Oracle shows that MySQL Cluster version 7.2 achieves near-horizontal scalability: when the number of data nodes is increased 15-fold, throughput increases 13.63-fold [5].

Regarding performance, NDB has fast read and write rates; the aforementioned benchmark [5] shows a 30-node NDB cluster supporting 19.5 million writes per second. NDB supports fine-grained locking, which means only the affected rows are locked during a transaction, so updates on two different rows of the same table can execute concurrently. Both SQL and NoSQL interfaces are supported, which makes NDB highly flexible with respect to users' needs and requirements.

3.3 NDB Storage Module

As a proof-of-concept of our proposed architecture, we designed and implemented an NDB storage module for the YARN resource-manager. Due to limited time, the recovery failure model was used in our implementation. In this report, we refer to this proof-of-concept of NDB-based YARN as YARN-NDB.

3.3.1 Database Design

We designed two NDB tables to store the application states and their corresponding application-attempts: applicationstate and attemptstate. Table 1 shows the columns of the applicationstate table.

Column            Type
id                int
clustertimestamp  bigint
submittime        bigint
appcontext        varbinary(13900)

Table 1: Properties of application state

id is a running number and is unique only within one resource-manager. clustertimestamp is the timestamp at which the corresponding resource-manager was started: when more than one resource-manager runs at a time (as in the stateless failure model), we need to differentiate the applications running among them, so the primary key of this table is (id, clustertimestamp). appcontext is a serialized ApplicationSubmissionContext object, hence its varbinary type.

The columns of the attemptstate table are shown in Table 2. applicationid and clustertimestamp are foreign keys referencing the applicationstate table. attemptid is the id of an attempt, and mastercontainer contains serialized information about the first container assigned to the corresponding application-master. The primary key of this table is (attemptid, applicationid, clustertimestamp).

To enhance table performance in terms of read and write throughput, we used key-based partitioning (http://dev.mysql.com/doc/refman/5.5/en/partitioning-key.html): both tables were partitioned by (applicationid, clustertimestamp). With this technique, NDB locates the desired data without contacting its location resolver service, and is hence faster than with unpartitioned tables.
Column            Type
attemptid         int
applicationid     int
clustertimestamp  bigint
mastercontainer   varbinary(13900)

Table 2: Properties of attempt state

3.3.2 Integration with Resource-Manager

We developed YARN-NDB using ClusterJ (http://dev.mysql.com/doc/ndbapi/en/mccj.html) in two development iterations, based on patches released by Apache. The first YARN-NDB implementation is based on the YARN-128.full-code.5 patch on top of Hadoop trunk dated 11 November 2012. The second implementation (https://github.com/arinto/hadoop-common) is based on the YARN-231-2 patch (https://issues.apache.org/jira/browse/YARN-231) on top of Hadoop trunk dated 23 December 2012. In this report, we refer to the second implementation of YARN-NDB unless otherwise specified. The NDB storage module in YARN-NDB has the same functionality as Apache YARN's HDFS and ZooKeeper storage modules, such as adding and deleting application states and attempts.

Furthermore, we developed a unit test module for the storage module; Figure 4 shows its flowchart. In this module, three MapReduce jobs are submitted to YARN-NDB. The first job finishes its execution before the resource-manager fails. The second job is successfully submitted and scheduled, so its application-master is launched, but no container has been allocated yet. The third job is successfully submitted but not yet scheduled. These three jobs represent three different scenarios at the moment a resource-manager fails.

Figure 4: NDB Storage Unit Test Flowchart

Restarting a resource-manager consists of reconnecting the existing application-masters and node-managers to the new resource-manager. All application-master and node-manager processes are rebooted by the new resource-manager, and all unfinished jobs are re-executed with a new application-attempt.

4. EVALUATION

We designed two types of evaluation in this project. The first evaluation tests whether the NDB storage module works as expected; the second investigates and compares the throughput of ZooKeeper, HDFS and NDB when storing YARN's application state.

4.1 NDB Storage Module Evaluation

4.1.1 Unit Test

This evaluation used the unit test class explained in Section 3.3.2. It was performed on a single-node NDB cluster, i.e. two NDB data-node processes on one node, on a computer with 4 GB of RAM and an Intel dual-core i3 CPU at 2.40 GHz. We adjusted ClusterJ's Java properties file to point to our single-node NDB cluster. The unit test class was executed using Maven and NetBeans, and the result was positive. We tested for consistency by executing the unit test class several times, and it passed every time.

4.1.2 Actual Resource-Manager Failure Test

In this evaluation, we used the Swedish Institute of Computer Science (SICS) cluster. Each node in the SICS cluster had 30 GB of RAM and two six-core AMD Opteron processors at 2.6 GHz, which could effectively run 12 threads without significant context-switching overhead. Ubuntu 11.04 with Linux kernel 2.6.38-12-server was installed as the operating system, and the Java(TM) SE Runtime Environment (JRE) version 1.6.0 was the Java runtime.

NDB was deployed on a 6-node cluster, and YARN-NDB was configured in a single-node setting. We executed the pi and bbp examples that come with the Hadoop distribution. In the middle of the pi and bbp execution, we terminated the resource-manager process using the Linux kill command. A new resource-manager with the same address and port was started three seconds after the old one was successfully terminated.

We observed that the currently running job finished properly, which means the resource-manager was correctly restarted. Several connection-retry attempts to contact the resource-manager by node-managers, application-masters and MapReduce clients were observed. To check for consistency, we submitted a new job to the new resource-manager, and it finished correctly. We repeated this experiment several times and observed the same results, i.e. the new resource-manager was successfully restarted and correctly took over the killed resource-manager's roles.

4.2 NDB Performance Evaluation

We utilised the same set of SICS machines as in Section 4.1.2. NDB was deployed on the same 6-node cluster, and ZooKeeper was deployed on three SICS nodes, with the maximum memory for each ZooKeeper process set to 5 GB of RAM. HDFS was also deployed on three SICS nodes, using the same 5 GB maximum memory configuration as ZooKeeper.
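The connection-retry behaviour observed in the failure test of Section 4.1.2 can be sketched as a simple bounded retry loop: node-managers, application-masters and clients keep trying until the restarted resource-manager answers or the retry budget runs out. The names and retry policy here are illustrative assumptions, not Hadoop's actual RPC retry code:

```java
import java.util.concurrent.atomic.AtomicInteger;
import java.util.function.BooleanSupplier;

// Hedged sketch of the reconnect behaviour seen in the failure test.
public class ReconnectLoop {
    static boolean reconnect(BooleanSupplier tryConnect, int maxAttempts, long delayMillis)
            throws InterruptedException {
        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            if (tryConnect.getAsBoolean()) {
                return true;                 // the new resource-manager is up
            }
            Thread.sleep(delayMillis);       // back off before the next attempt
        }
        return false;                        // budget exhausted: the client must re-submit
    }

    public static void main(String[] args) throws InterruptedException {
        AtomicInteger calls = new AtomicInteger();
        // Simulated resource-manager that only answers from the third attempt on,
        // mimicking the three-second restart gap in the experiment.
        BooleanSupplier rm = () -> calls.incrementAndGet() >= 3;
        System.out.println(reconnect(rm, 10, 1L)); // true
    }
}
```

If the loop returns false, the situation matches the timeout case described in Section 2.3, where clients have to re-submit their jobs.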
Figure 5: zkndb Architecture

4.2.1 zkndb Framework

We developed the zkndb framework (https://github.com/4knahs/zkndb) to benchmark storage systems effectively with minimal effort. Figure 5 shows the architecture of the framework, which consists of three main packages:

1. The storage package, which contains the configurable load generator (StorageImpl), parameterized by the number of reads and writes per time unit.

2. The metrics package, which contains the metric parameters (MetricsEngine), for example write or read requests and acknowledgements, as well as the metrics logging mechanism (ThroughputEngineImpl).

3. The benchmark package, which contains the benchmark applications and manages benchmark executions.

zkndb offers flexibility in integrating new storage technologies, defining new metrics and storing benchmark results. To integrate a new storage technology, users implement the storage interface in the storage package; a new metric is added by implementing the metric interface in the metrics package; and a new metrics logging mechanism is added by implementing the throughput-engine interface, also in the metrics package. The data produced by ThroughputEngineImpl were post-processed by our custom scripts for further analysis. For this evaluation, three storage implementations were added to the framework: NDB, HDFS and ZooKeeper.

4.2.2 Benchmark Implementation in zkndb

For ZooKeeper and HDFS, we ported YARN's storage module implementations based on the YARN-128.full-code.5 patch (https://issues.apache.org/jira/browse/YARN-128) into our benchmark. For NDB, the first iteration of YARN-NDB's storage module was ported into our zkndb NDB storage implementation.

Each data-write into the storage consisted of an application identification and application state information. The application identification was a Java long, eight bytes in size. The application state information was an array of random bytes with a length of 53 bytes; this length was chosen after observing the actual application state information stored when executing YARN-NDB jobs. Each data-read consisted of reading an application identification and its corresponding application state information.

Three types of workload were used in our experiments:

1. Read-intensive: one set of data was written into the database, and zkndb repeatedly read that written data.

2. Write-intensive: no reads were performed; zkndb always wrote a new set of data to a different location.

3. Read-write balanced: reads and writes were performed alternately.

Furthermore, we varied the offered load by configuring the number of threads accessing the database for reading and writing. To maximize throughput, no delay was configured between successive reads and writes. We compared the throughput of ZooKeeper, HDFS and NDB under equal configurations of thread count and workload type. In addition, the scalability of each storage system was investigated by increasing the number of threads while keeping the other configurations unchanged.

4.2.3 Throughput Benchmark Result

Figure 6: zkndb Throughput Benchmark Result for 8 Threads and 1 Minute of Benchmark Execution

Figure 6 shows the throughput benchmark results for eight threads and one minute of execution, across the three workload types and the three storage implementations: ZooKeeper, NDB and HDFS.

For all three workload types, NDB had the highest throughput, ahead of ZooKeeper and HDFS. These results can be attributed to the nature of NDB as a high-performance persistent storage system capable of handling high read and write request rates.
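The zkndb worker loop described in Section 4.2.2 can be sketched as follows: each thread issues requests with no delay against a pluggable storage backend, and a counter yields the completed requests per run. The Storage interface and the sizes mirror the text (8-byte id, 53-byte state), but the names are assumptions, not zkndb's real API:

```java
import java.util.HashMap;
import java.util.Map;

// Minimal sketch of a zkndb-style write-intensive worker loop.
public class BenchWorker {
    interface Storage {
        void write(long applicationId, byte[] state);
        byte[] read(long applicationId);
    }

    static long runWriteIntensive(Storage storage, long durationMillis) {
        byte[] state = new byte[53];             // observed application-state size
        long deadline = System.currentTimeMillis() + durationMillis;
        long completed = 0;
        for (long id = 0; System.currentTimeMillis() < deadline; id++) {
            storage.write(id, state);            // always a new location, never a read
            completed++;
        }
        return completed;                        // divide by run time for requests/s
    }

    public static void main(String[] args) {
        Map<Long, byte[]> backing = new HashMap<>();
        Storage inMemory = new Storage() {       // stand-in for ZooKeeper/HDFS/NDB
            public void write(long id, byte[] s) { backing.put(id, s.clone()); }
            public byte[] read(long id) { return backing.get(id); }
        };
        long completed = runWriteIntensive(inMemory, 50);
        System.out.println(completed > 0);       // true
    }
}
```

Swapping the in-memory stand-in for a real ZooKeeper, HDFS or NDB client is what makes the comparison in Figure 6 an apples-to-apples one: only the Storage implementation changes between runs.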
Figure 7: Scalability Benchmark Results for Read-Intensive Workload

Figure 8: Scalability Benchmark Results for Write-Intensive Workload

Referring to the error bars in Figure 6, NDB shows a large deviation between its average and its lowest value during the experiment. This deviation could be attributed to infrequent intervention by the NDB management process to recalculate the data index for fast access.

Interestingly, ZooKeeper's throughput was stable across all workload types. This stability can be attributed to ZooKeeper's linearization of incoming requests, which causes read and write requests to have approximately the same execution time. Another possible explanation is the implementation of YARN's ZooKeeper storage module, whose code could make read and write execution times equal.

As expected, HDFS had the lowest throughput for all workload types. HDFS's low throughput may be attributed to its NameNode-locking overhead and its inefficient data access pattern when processing many small files. Each time HDFS receives a read or write request, the NameNode needs to acquire a lock on the file path so that HDFS can return a valid result; acquiring a lock this frequently increases data access time and hence decreases throughput. The inefficient data access pattern in HDFS is due to splitting the data to fit into HDFS blocks and to data replication. Furthermore, the need to write data to disk in HDFS decreases throughput, as observed in the write-intensive and read-write balanced workloads.

4.2.4 Scalability Benchmark Result

Figure 7 shows the increase in throughput as we increased the number of threads for the read-intensive workload. All of the storage implementations increased their throughput as the number of threads increased. NDB had the highest increase compared to HDFS and ZooKeeper: for NDB, doubling the number of threads increased the throughput by a factor of 1.69, close to linear scalability.

The same trend was observed for the write-intensive workload, as shown in Figure 8. NDB again had the highest increase in throughput compared to HDFS and ZooKeeper: for NDB, doubling the number of threads increased the throughput by a factor of 1.67. On the other hand, HDFS performed very poorly for this workload; the highest throughput achieved by HDFS, with 36 threads, was only 534.92 requests per second. The poor performance of HDFS can be attributed to the same reasons explained in Section 4.2.3, namely NameNode-locking overhead and an inefficient data access pattern for small files.

5. RELATED WORK

5.1 Corona

Corona [2] introduces a new process called the cluster-manager that takes over the cluster management functions of the MapReduce job-tracker. The main purposes of the cluster-manager are to keep track of the amount of free resources in the cluster and to manage its nodes. Corona uses push-based scheduling, i.e. the cluster-manager pushes allocated resources back to the job-tracker after receiving resource requests; Corona claims that scheduling latency is low since no periodic heartbeat is involved in resource scheduling. Although Corona solves the MapReduce scalability limitation, its cluster-manager is a single point of failure, so the MapReduce availability limitation remains.

5.2 KTHFS

KTHFS [12] solves the scalability and availability limitations of the HDFS NameNode. The filesystem metadata of the HDFS NameNodes is stored in NDB, making the NameNodes fully stateless. Being stateless, more than one HDFS NameNode can run simultaneously, and the failure of a NameNode is easily mitigated by the remaining live NameNodes. Furthermore, KTHFS has linear throughput scalability: throughput can be increased by adding HDFS NameNodes or NDB DataNodes. KTHFS inspired the use of NDB to solve the YARN availability limitation.

5.3 Mesos

Mesos [3] is a resource management platform that enables commodity-cluster sharing between different cluster computing frameworks. Cluster utilization is improved by the sharing mechanism. It has several master processes that
8. have similar roles compared to YARN resource-manager. ¨
The availability of Mesos is achieved by having several stand-by master processes that can replace a failed active master process. Mesos utilizes ZooKeeper to monitor the group of master processes. When the active master process fails, ZooKeeper performs leader-election to choose the new active master process. The newly active master process then reconstructs the state. This reconstruction mechanism may introduce a significant delay when the state is large.
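The Mesos failover sequence described above (stand-by masters, leader-election, then state reconstruction) can be sketched as follows. This is an in-process simulation of ZooKeeper-style election over ephemeral sequential nodes, not the real ZooKeeper API, and all names are illustrative.

```python
# Simulation of ZooKeeper-style leader election among stand-by
# masters: each candidate holds a sequence number, the lowest live
# sequence number is the leader, and a newly elected leader must
# reconstruct state before serving (the source of failover delay).
def reconstruct_state(master):
    print(f"{master} rebuilding cluster state before serving requests")

class ElectionSim:
    def __init__(self):
        self.seq = 0
        self.candidates = {}  # candidate name -> sequence number

    def register(self, name):
        self.candidates[name] = self.seq
        self.seq += 1

    def leader(self):
        return min(self.candidates, key=self.candidates.get)

    def fail(self, name):
        # an ephemeral node disappears when its owner's session dies
        del self.candidates[name]
        new_leader = self.leader()
        reconstruct_state(new_leader)  # delay grows with state size
        return new_leader

zk = ElectionSim()
for m in ["master-a", "master-b", "master-c"]:
    zk.register(m)
print(zk.leader())          # prints master-a (lowest sequence number)
print(zk.fail("master-a"))  # prints master-b after reconstruction
```

The stateless model proposed in this paper avoids the `reconstruct_state` step entirely, since no master holds state that needs rebuilding.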
5.4 Apache HDFS-1623
Apache uses a failover-recovery model to address the HDFS NameNode single-point-of-failure limitation [9, 10]. In this solution, additional HDFS NameNodes are introduced as stand-by NameNodes. The active NameNode writes all changes to the filesystem namespace into a write-ahead-log in persistent storage. This is likely to introduce overhead when storing data, and the magnitude of the overhead depends on the choice of storage system. The solution supports automatic failover, but its complexity increases due to the additional processes that act as failure detectors. These failure detectors trigger the automatic-failover mechanism when they detect NameNode failures.
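The write-ahead-log scheme described above can be sketched as a minimal example. A Python list stands in for the persistent shared storage, and the class names are illustrative, not the actual HDFS-1623 implementation.

```python
# Minimal write-ahead-log sketch: the active node appends every
# namespace change to a durable log *before* applying it in memory,
# so a stand-by node can replay the log and take over after failure.
class ActiveNameNode:
    def __init__(self, log):
        self.log = log          # stands in for persistent shared storage
        self.namespace = {}

    def mkdir(self, path):
        self.log.append(("mkdir", path))  # durable first: the overhead
        self.namespace[path] = {}

class StandbyNameNode:
    def __init__(self, log):
        self.log = log
        self.namespace = {}

    def take_over(self):
        # replay the write-ahead-log to rebuild the namespace
        for op, path in self.log:
            if op == "mkdir":
                self.namespace[path] = {}
        return self.namespace

shared_log = []
active = ActiveNameNode(shared_log)
active.mkdir("/user")
active.mkdir("/user/alice")
standby = StandbyNameNode(shared_log)
print(standby.take_over())  # stand-by rebuilds the same namespace
```

The per-write log append is exactly where the storage-dependent overhead mentioned above arises, and the replay step is why failover is not instantaneous in this model.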
6. CONCLUSION AND FUTURE WORK
We have presented an architecture for a highly-available cluster computing management framework. The proposed architecture incorporates a state-less failure model into the existing Apache YARN. To achieve high availability and the state-less failure model, MySQL Cluster (NDB) is proposed as the storage technology for the necessary state information.

As a proof-of-concept, we implemented Apache YARN's recovery failure model using NDB (YARN-NDB), and we developed the zkndb benchmark framework to test it. The availability and scalability of the implementation have been examined and demonstrated using unit tests, actual resource-manager failure tests and throughput benchmark experiments. The results show that YARN-NDB is better in terms of throughput and ability to scale than the existing ZooKeeper- and HDFS-based solutions.

For future work, we plan to further develop YARN-NDB with a fully state-less failure model. As a first step, a more detailed analysis of the resource-manager states is needed. After the states are analysed, we plan to re-design the database to accommodate the additional state information from the analysis. In addition, modifications to the YARN-NDB code are needed to remove the state information from memory and always access NDB when the information is needed. Next, we will evaluate the throughput and overhead of the new implementation. Finally, after the new implementation passes the evaluations, we should deploy YARN-NDB in a significantly larger cluster with real-world workloads to verify its scalability. The resulting YARN-NDB is expected to run well in a cloud environment and to handle node failures properly.

7. ACKNOWLEDGEMENT
The authors would like to thank our partner Mário Almeida for his contribution to the project. We would also like to thank our colleagues Ümit Çavuş Büyükşahin, Strahinja Lazetic and Vasiliki Kalavri for providing feedback throughout this project. Additionally, we would like to thank our EMDC friends Muhammad Anis uddin Nasir, Emmanouil Dimogerontakis, Maria Stylianou and Mudit Verma for continuous support throughout the report writing process.

8. REFERENCES
[1] J. Dean and S. Ghemawat. MapReduce: simplified data processing on large clusters. Commun. ACM, 51(1):107–113, Jan. 2008.
[2] Facebook. Under the hood: Scheduling MapReduce jobs more efficiently with Corona, Nov. 2012. Retrieved November 18, 2012 from http://on.fb.me/109FHPD.
[3] B. Hindman, A. Konwinski, M. Zaharia, A. Ghodsi, A. D. Joseph, R. Katz, S. Shenker, and I. Stoica. Mesos: a platform for fine-grained resource sharing in the data center. In Proceedings of the 8th USENIX Conference on Networked Systems Design and Implementation, NSDI'11, page 22, Berkeley, CA, USA, 2011. USENIX Association.
[4] P. Hunt, M. Konar, F. P. Junqueira, and B. Reed. ZooKeeper: wait-free coordination for internet-scale systems. In USENIX ATC, volume 10, 2010.
[5] M. Keep. MySQL Cluster 7.2 GA released, delivers 1 BILLION queries per minute, Apr. 2012. Retrieved November 18, 2012 from http://dev.mysql.com/tech-resources/articles/mysql-cluster-7.2-ga.html.
[6] A. C. Murthy. The next generation of Apache Hadoop MapReduce, Feb. 2011. Retrieved November 18, 2012 from http://developer.yahoo.com/blogs/hadoop/posts/2011/02/mapreduce-nextgen/.
[7] A. C. Murthy. Introducing Apache Hadoop YARN, Aug. 2012. Retrieved November 11, 2012 from http://hortonworks.com/blog/introducing-apache-hadoop-yarn/.
[8] A. C. Murthy, C. Douglas, M. Konar, O. O'Malley, S. Radia, S. Agarwal, and V. KV. Architecture of next generation Apache Hadoop MapReduce framework. Retrieved November 18, 2012 from https://issues.apache.org/jira/secure/attachment/12486023/MapR.
[9] A. Myers. High availability for the Hadoop distributed file system (HDFS), Mar. 2012. Retrieved November 18, 2012 from http://bit.ly/ZT1xIc.
[10] S. Radia. High availability framework for HDFS NN, Feb. 2011. Retrieved January 4, 2012 from https://issues.apache.org/jira/browse/HDFS-1623.
[11] K. Shvachko, H. Kuang, S. Radia, and R. Chansler. The Hadoop distributed file system. In 2010 IEEE 26th Symposium on Mass Storage Systems and Technologies (MSST), pages 1–10, May 2010.
[12] M. Wasif. A distributed namespace for a distributed file system, 2012. Retrieved November 18, 2012 from http://kth.diva-portal.org/smash/record.jsf?searchId=1&pid=diva2:548037.