Architecture, Benefits and Challenges of Hadoop
Kirti Jayadevan
Introduction to Big Data Concepts, Technologies and Deployment
Alakh Verma
2-28-2016
Abstract: Hadoop is designed to scale from a single server to thousands of machines. Unlike a relational database management system, it provides data storage over a distributed system by partitioning data into blocks and executing computation in parallel. This paper provides an overview of the architecture, benefits, and challenges of Hadoop, comparing it with the Lambda architecture and Spark clusters.
Hadoop is an open source Apache framework used to process large datasets. It is built on a distributed file system design that communicates over TCP/IP, and servers can be added dynamically without any interruption to the service (Shvachko, et al., 2010, pg. 1). HDFS (Hadoop Distributed File System) is the storage component of Hadoop.
The HDFS architecture includes one name node and multiple data nodes. The name node stores the file system metadata. An HDFS client first contacts the name node to learn the location of the data, then reads from the nearest data node that holds it. File content is split into blocks of 128 MB, and each block is stored and replicated on three data nodes (Shvachko, et al., 2010, pg. 1). On a data node, each block replica is represented by two files: one containing the data itself and a second holding the block's metadata. At each start-up, the name node and data node perform a handshake that verifies the namespace ID and the data node's software version; this registers the data node with the name node. After registration, the data node sends the name node an hourly block report containing an up-to-date view of where its block replicas are located. Data nodes also send heartbeats every 3 seconds, which let the name node know that the node is operating and its block replicas are available (Shvachko, et al., 2010, pg. 2). The name node replies to a heartbeat with instructions to the data node: replicate a block, remove a block, or shut down the node (Shvachko, et al., 2010, pg. 2).
To protect the file system metadata, the name node's role can also be filled by a checkpoint node or a backup node. The checkpoint node maintains a persistent record of the files and directories in the namespace, which is written to disk (Shvachko, et al., 2010, pg. 3). Thus Hadoop does not depend on specialized hardware for fault tolerance. To avoid data corruption during system upgrades, the name node creates a snapshot that saves the current state of the file system and, during the handshake, instructs data nodes to create local snapshots (Shvachko, et al., 2010, pg. 4).
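The block replication and heartbeat mechanics described above can be sketched in a few lines of Python. This is a simplified illustration under stated assumptions: the class and field names are invented for this sketch, not Hadoop's actual APIs, and real HDFS replica placement is rack-aware rather than this naive policy.

```python
import time

BLOCK_SIZE = 128 * 1024 * 1024  # default HDFS block size: 128 MB
REPLICATION = 3                 # each block is stored on three data nodes

def split_into_blocks(file_size, block_size=BLOCK_SIZE):
    # A file is stored as a sequence of fixed-size blocks; the last may be partial.
    return [(i * block_size, min(block_size, file_size - i * block_size))
            for i in range((file_size + block_size - 1) // block_size)]

class NameNode:
    """Tracks block locations and data node liveness (illustrative only)."""
    def __init__(self):
        self.block_locations = {}  # block_id -> set of data node ids
        self.last_heartbeat = {}   # datanode_id -> timestamp

    def receive_block_report(self, datanode_id, block_ids):
        # Hourly block report: refresh the view of which replicas this node holds.
        for block_id in block_ids:
            self.block_locations.setdefault(block_id, set()).add(datanode_id)

    def receive_heartbeat(self, datanode_id):
        # Sent every 3 seconds; the reply may carry an instruction for the node.
        self.last_heartbeat[datanode_id] = time.time()
        under_replicated = any(len(nodes) < REPLICATION
                               for nodes in self.block_locations.values())
        return "REPLICATE" if under_replicated else "NOOP"

blocks = split_into_blocks(300 * 1024 * 1024)  # a 300 MB file needs 3 blocks
nn = NameNode()
nn.receive_block_report("dn1", ["blk_1"])      # only one replica reported so far
command = nn.receive_heartbeat("dn1")          # name node asks for re-replication
```

The 300 MB file yields two full 128 MB blocks plus a 44 MB remainder, and the name node's heartbeat reply requests replication because "blk_1" has fewer than three known replicas.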
Applications communicate with HDFS through the HDFS client, which references files and directories by paths in the namespace. User programs use the MapReduce framework, developed at Google, to handle distributed computing over large datasets. When a user program calls the MapReduce function, the framework splits the input files into pieces and uses a master-worker relationship to distribute the work on those pieces (Dean & Ghemawat, 2004, pg. 4). The master tracks and assigns jobs, while the workers execute the tasks given by the master. The master assigns map tasks to workers, which read the input pieces and store intermediate key-value pairs locally. These locations are passed to the master, which in turn assigns reduce tasks to other workers; those workers read the intermediate data from the reported locations (Dean & Ghemawat, 2004, pg. 4), sort it, and append the reduced results to the output files. Once all map and reduce tasks are completed, the master wakes up the user program (Dean & Ghemawat, 2004, pg. 4). This lets Hadoop work through very large datasets quickly in parallel. Hadoop also uses many other tools and frameworks, such as HBase, Pig, Avro, and Hive, for data access, data serialization, and related needs (Shvachko, et al., 2010, pg. 1).
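The map and reduce phases described above can be illustrated with the classic word-count example. This is a single-machine sketch of the programming model only; in Hadoop, the master would distribute the same map and reduce functions across worker nodes.

```python
from collections import defaultdict

def map_phase(document):
    # Map: emit an intermediate (word, 1) pair for every word in the input piece.
    return [(word, 1) for word in document.split()]

def reduce_phase(pairs):
    # Reduce: group the intermediate pairs by key and sum the counts.
    counts = defaultdict(int)
    for word, n in pairs:
        counts[word] += n
    return dict(counts)

docs = ["big data on hadoop", "hadoop stores big data"]
# Each document piece would be mapped by a separate worker; the shuffle
# collects all intermediate pairs for the reduce workers.
intermediate = [pair for d in docs for pair in map_phase(d)]
result = reduce_phase(intermediate)
```

Here `result` counts each word across both documents, e.g. "hadoop" and "big" each appear twice.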
Spark, developed at UC Berkeley, is faster than Hadoop on iterative workloads. It is a programming interface for in-memory data mining on clusters. Spark uses resilient distributed datasets (RDDs), which enable data reuse and allow it to perform in-memory computations with low latency (Zaharia, M., et al., 2011). The Lambda architecture, described by Nathan Marz, frames where Hadoop fits in a larger system. It is designed to provide fault tolerance and scalability without interrupting the service, and its batch, speed, and serving layers are widely used in big data technologies. HDFS holds the batch layer's master dataset, from which views are precomputed so they can be queried with low latency. The speed layer of the Lambda architecture deals with click-stream or other recent data; Hadoop does not address the speed layer, and it is not ACID compliant.
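The division of labor between the batch and speed layers can be sketched as a query that merges a precomputed batch view with counts over recent events. The data and function names here are illustrative assumptions, not part of any Hadoop or Lambda-architecture API.

```python
# Batch layer: a view precomputed periodically from the master dataset
# (which, in a Hadoop deployment, would live on HDFS).
batch_view = {"page_a": 1000, "page_b": 250}

# Speed layer: events that arrived after the last batch computation.
recent_events = ["page_a", "page_a", "page_c"]

def realtime_view(events):
    # Incremental counts over events not yet absorbed by the batch layer.
    view = {}
    for e in events:
        view[e] = view.get(e, 0) + 1
    return view

def query(key):
    # Serving layer: combine the batch and real-time views at read time.
    rt = realtime_view(recent_events)
    return batch_view.get(key, 0) + rt.get(key, 0)
```

A query for "page_a" returns the batch count plus the two recent hits, while "page_c" is visible only through the speed layer until the next batch run absorbs it.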
References:
1. Shvachko, K., Kuang, H., Radia, S., & Chansler, R. (2010). The Hadoop Distributed File System. Proceedings of the 2010 IEEE 26th Symposium on Mass Storage Systems and Technologies. Yahoo!.
2. Dean, J., & Ghemawat, S. (2004). MapReduce: Simplified Data Processing on Large Clusters. OSDI: Operating Systems Design and Implementation Conference.
3. Lambda Architecture. MapR Technologies. Retrieved from: https://www.mapr.com/developercentral/lambda-architecture
4. Hadoop Introduction. Retrieved from: http://www.tutorialspoint.com/hadoop/hadoop_introduction.htm
5. Zaharia, M., Chowdhury, M., Das, T., Dave, A., Ma, J., McCauley, M., Franklin, M. J., Shenker, S., & Stoica, I. (2011). Resilient Distributed Datasets: A Fault-Tolerant Abstraction for In-Memory Cluster Computing. Electrical Engineering and Computer Sciences, University of California at Berkeley, Technical Report No. UCB/EECS-2011-82. http://www.eecs.berkeley.edu/Pubs/TechRpts/2011/EECS-2011-82.pdf