Architecture, benefits and challenges of Hadoop
Kirti Jayadevan
Introduction to Big Data Concepts, Technologies and deployment
Alakh Verma
2-28-2016
Abstract: Hadoop is designed to scale out from a single server to thousands of machines.
Unlike a relational database management system, it stores data across a distributed system by
partitioning the data and executing computation in parallel. This paper gives an overview of
the architecture, benefits and challenges of Hadoop and compares it with the Lambda
architecture and Spark.
Hadoop is an open source Apache project used to process large data sets. It is built on
a distributed file system design in which servers communicate over TCP-based protocols, and
servers can be added to a cluster dynamically without interrupting service (Shvachko, et al.,
2010, pg.1). HDFS (Hadoop Distributed File System) is the file system component of
Hadoop.
The HDFS architecture consists of a single name node and multiple data nodes. The name
node stores the file system metadata. An HDFS client first contacts the name node to learn
the locations of the data and then contacts the nearest data node to read it. File content is
split into blocks, typically 128 MB each, and each block is typically replicated on three
data nodes (Shvachko, et al., 2010, pg.1). On a data node, each block replica is represented
by two files: one that contains the data itself and a second that stores the block's
metadata. At startup, each data node performs a handshake with the name node, which verifies
the namespace ID and software version of the data node. After a successful handshake the data
node registers with the name node. Once registered, the data node sends block reports, which
give the name node an up-to-date view of where block replicas are located; the first report
is sent immediately after registration and subsequent reports every hour. During normal
operation the data nodes also send heartbeats, every 3 seconds by default, which tell the
name node that the data node is operating and that its block replicas are available
(Shvachko, et al., 2010, pg.2). The name node replies to heartbeats with instructions to the
data node, such as replicating blocks to other nodes, removing local block replicas, or
shutting down the node (Shvachko, et al., 2010, pg.2). The name node
can alternatively be started in the role of a checkpoint node or a backup node, either of
which protects the file system metadata. The checkpoint node periodically creates a new
checkpoint, the persistent record of the namespace (the files and directories that organize
application data), which is written to disk (Shvachko, et al., 2010, pg.3). Because the
metadata is checkpointed and block replicas are kept on several data nodes, Hadoop does not
depend on specialized hardware for fault tolerance. To avoid data corruption during system
upgrades, the name node creates a snapshot that saves the current state of the file system
and instructs the data nodes, during the handshake, to create local snapshots (Shvachko, et
al., 2010, pg.4).
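The block splitting, replication and heartbeat behaviour described above can be illustrated with a minimal Python sketch. It is not HDFS code: the function names, the round-robin placement rule and the 600-second timeout are assumptions made for illustration only, and real HDFS placement also takes rack topology into account.

```python
BLOCK_SIZE = 128 * 1024 * 1024   # typical HDFS block size cited in the text
REPLICATION = 3                  # typical replica count cited in the text
DEAD_AFTER = 600                 # seconds; illustrative timeout (assumption)


def split_into_blocks(file_size, block_size=BLOCK_SIZE):
    """Return the byte length of each block for a file of file_size bytes."""
    if file_size < 0:
        raise ValueError("file_size must be non-negative")
    full, rest = divmod(file_size, block_size)
    return [block_size] * full + ([rest] if rest else [])


def place_replicas(num_blocks, data_nodes, replication=REPLICATION):
    """Assign each block to `replication` distinct data nodes.

    Round-robin placement is a simplification chosen for this sketch.
    """
    if replication > len(data_nodes):
        raise ValueError("not enough data nodes for the requested replication")
    n = len(data_nodes)
    return {
        block: [data_nodes[(block + r) % n] for r in range(replication)]
        for block in range(num_blocks)
    }


def live_nodes(last_heartbeat, now, dead_after=DEAD_AFTER):
    """Return the data nodes whose last heartbeat is recent enough."""
    return sorted(node for node, seen in last_heartbeat.items()
                  if now - seen <= dead_after)


if __name__ == "__main__":
    blocks = split_into_blocks(300 * 1024 * 1024)       # a 300 MB file
    print([b // (1024 * 1024) for b in blocks])         # block sizes in MB
    print(place_replicas(len(blocks), ["dn1", "dn2", "dn3", "dn4"]))
    print(live_nodes({"dn1": 100, "dn2": 900}, now=1000))
```

A 300 MB file yields two full blocks and one partial block, and each block lands on three different data nodes, so the loss of any single node leaves two replicas of every block.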
To communicate with HDFS, user applications use the HDFS client, which references files
and directories by paths in the namespace. For computation, the user program uses the
MapReduce framework, a programming model introduced by Google, to handle distributed
computing over large data sets. When the user program calls the MapReduce function, the
library splits the input files into pieces and uses a master-worker arrangement to
distribute the work on those pieces (Dean & Ghemawat, 2004, pg.4). The master assigns tasks
and tracks their progress, while the workers execute the tasks they are given. A worker that
is assigned a map task reads its input split and writes the intermediate key/value pairs to
its local disk. The locations of these pairs are passed back to the master, which forwards
them to the workers assigned reduce tasks; those workers read the intermediate data from
these locations (Dean & Ghemawat, 2004, pg.4). The reduce workers then sort the data by key,
apply the reduce function to each group, and append the result to the output file. Once all
map and reduce tasks are completed, the master wakes up the user program (Dean & Ghemawat,
2004, pg.4). In this way large data sets are processed in parallel across many machines. The
Hadoop project also includes other tools and frameworks, such as HBase, Pig, Hive and Avro,
for data access, data serialization and related tasks (Shvachko, et al., 2010, pg.1).
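The map, sort and reduce steps described above can be sketched as a single-process word count in Python. This only illustrates the data flow and does not use the Hadoop API: the function names are hypothetical, and a real job would run the map and reduce tasks on separate workers and keep the intermediate pairs on their local disks.

```python
from collections import defaultdict


def map_task(split):
    """Map: emit an intermediate (word, 1) pair for every word in the split."""
    return [(word.lower(), 1) for word in split.split()]


def shuffle(pairs):
    """Group intermediate values by key, as reduce workers do before reducing."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_task(key, values):
    """Reduce: combine all values for one key into a single result."""
    return key, sum(values)


def run_job(splits):
    """Play the role of the master: run map tasks, then reduce tasks."""
    intermediate = []
    for split in splits:                      # one map task per input split
        intermediate.extend(map_task(split))
    grouped = shuffle(intermediate)
    # Reduce workers process keys in sorted order.
    return dict(reduce_task(k, v) for k, v in sorted(grouped.items()))


if __name__ == "__main__":
    print(run_job(["big data big", "data nodes"]))
```

Because each map task only sees its own split and each reduce task only sees the values for its own keys, the tasks can be spread over many workers without sharing state.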
Spark, developed at UC Berkeley, is faster than Hadoop MapReduce for iterative
workloads. It provides a programming interface for in-memory computation, such as data
mining, on clusters. Spark uses resilient distributed datasets (RDDs), which enable data
reuse across computations and in-memory computation with low latency (Zaharia, M., et al.,
2011). The Lambda architecture, proposed by Nathan Marz, describes a framework in which
Hadoop can be placed. It is designed to provide fault tolerance and scalability without
interrupting the service, and it consists of a batch layer, a speed layer and a serving
layer. In the batch layer, the dataset is held in a distributed file system such as HDFS,
and the views computed from it can be queried with low latency. The speed layer deals with
click-stream or other recent data. Hadoop does not cover the speed layer, and it is not ACID
compliant.
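The split between the batch, speed and serving layers can be illustrated with a small Python sketch that counts page views. The event format and the function names are assumptions made for this example; in a real deployment the batch view would be computed by a system such as Hadoop and the speed view by a separate stream processor.

```python
from collections import Counter


def build_batch_view(master_dataset):
    """Batch layer: recompute the view from the complete stored dataset."""
    return Counter(event["page"] for event in master_dataset)


def update_speed_view(speed_view, event):
    """Speed layer: fold one recent event into the real-time view."""
    speed_view[event["page"]] += 1
    return speed_view


def query(page, batch_view, speed_view):
    """Serving side: answer by merging the batch view with the recent view."""
    return batch_view.get(page, 0) + speed_view.get(page, 0)


if __name__ == "__main__":
    stored = [{"page": "/home"}, {"page": "/home"}, {"page": "/docs"}]
    batch_view = build_batch_view(stored)

    speed_view = Counter()
    update_speed_view(speed_view, {"page": "/home"})   # arrived after the batch run

    print(query("/home", batch_view, speed_view))
    print(query("/docs", batch_view, speed_view))
```

The batch view is complete but stale, while the speed view is fresh but covers only recent events; a query that reads both sees all the data.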
References:
1. Shvachko, K., Kuang, H., Radia, S., & Chansler, R. (2010). The Hadoop Distributed File
System. Proceedings of the 2010 IEEE 26th Symposium on Mass Storage Systems and
Technologies. Yahoo.
2. Dean, J., & Ghemawat, S. (2004). MapReduce: Simplified Data Processing on Large
Clusters. OSDI 2004, Operating Systems Design and Implementation Conference.
3. Lambda Architecture. MapR Technologies. Retrieved from:
https://www.mapr.com/developercentral/lambda-architecture
4. Hadoop Introduction. Retrieved from:
http://www.tutorialspoint.com/hadoop/hadoop_introduction.htm
5. Zaharia, M., Chowdhury, M., Das, T., Dave, A., Ma, J., McCauley, M., Franklin, M. J.,
Shenker, S., & Stoica, I. (2011). Resilient Distributed Datasets: A Fault-Tolerant
Abstraction for In-Memory Cluster Computing. Technical Report No. UCB/EECS-2011-82,
Electrical Engineering and Computer Sciences, University of California at Berkeley.
http://www.eecs.berkeley.edu/Pubs/TechRpts/2011/EECS-2011-82.pdf
