Hadoop
Simon Alex ATWIINE Nambaale Bakule RAYMOND
2019/HD05/25235U 2019/HD05/25253U
satwiine@cis.mak.ac.ug nambaaler@gmail.com
Abstract—We live in the data age, where the amount of data is rising exponentially from Terabytes to Petabytes,
Exabytes, Zettabytes and Yottabytes. This means that the amount of data generated and stored electronically is huge:
"Big Data". Data can be classified as Big Data using three data-management dimensions, also known as the
3Vs: volume (the size of the data), velocity (the rate at which data can be received and analyzed) and variety
(the different kinds of data being generated from various sources). In this paper, entitled "Hadoop", we discuss Hadoop,
which distinguishes itself as a solution to Big Data.
1 INTRODUCTION
HADOOP is an open source software framework
for storing and processing big data on lots of
commodity machines[11]. The underlying technol-
ogy on which Hadoop is built was developed by
Google to index rich textual and structural informa-
tion they were collecting, and then present mean-
ingful and actionable results to users[13]. There was
nothing on the market that would let them do that,
so they built their own platform. Google's innovations, as published in the Google File System (GFS) and
MapReduce papers in 2003-2004, were incorporated into Nutch, an open source project, and Hadoop
was later spun off from that project at Yahoo.
2 CORE HADOOP ECOSYSTEM
The Hadoop Ecosystem is a framework of various complex and evolving tools and components, each well suited to solving particular problems. Some of these elements may be very dissimilar from each other in terms of architecture; what keeps them together under a single roof, however, is that they all derive their functionality from the scalability and power of Hadoop. The Hadoop Ecosystem can be viewed as consisting of four layers: data storage, data processing, data access and data management[9].
2.1 Data Storage
The data storage layer is where data is kept in a distributed file system; it consists of the Hadoop Distributed File System (HDFS) and HBase column-oriented storage. HDFS is a file system for storing very large files with streaming data access on the cluster[14]. HBase is a scalable, distributed and column-oriented open source non-relational database built on top of HDFS[14].
Figure 1. Hadoop Ecosystem
2.2 Data Processing
This layer consists of MapReduce [9], a software framework for distributed processing of large data sets that serves as the compute layer of Hadoop and processes vast amounts of data in parallel on large clusters of commodity hardware in a reliable and fault-tolerant manner, and YARN (Yet Another Resource Negotiator), which manages resources on the computing cluster.
2.3 Data Access
This is the layer through which requests from the management layer are sent to the data processing layer. Several projects serve this layer, among them: Hive, a data warehouse infrastructure that provides data summarization and ad hoc querying; Pig, a platform for analyzing large data sets; Mahout, a scalable machine learning and data mining library; Avro, a data serialization system; and Sqoop, a connectivity tool for moving data from non-Hadoop data stores, such as relational databases and data warehouses, into Hadoop[14].
2.4 Data Management
This is the layer through which the user interacts with the system. It has components such as: Chukwa, a data collection system for managing large distributed systems; Zookeeper, a high-performance coordination service for distributed applications; Oozie, a workflow processing system that lets users define a series of jobs written in multiple languages; and Flume, a distributed, reliable and available service for efficiently collecting, aggregating and moving large amounts of log data[9].
3 HDFS ARCHITECTURE
Hadoop includes two core components: HDFS and
MapReduce. HDFS is a distributed file-system that
stores very large amounts of data (on a petabyte
scale)[6]. Distributed file-systems manage the stor-
age across a network of machines[15]. In an HDFS
cluster, there is one Namenode (master node)
and an arbitrary number of Datanodes (slave
nodes)[6]. The Namenode serves all metadata oper-
ations on the file-system including creating, open-
ing, or renaming files and directories[6]. On the
other hand, Datanodes are the workhorses of the
file-system. They store and retrieve blocks when
they are told to (by clients or the namenode), and
they report back to the namenode periodically with
lists of blocks that they are storing[15].
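To make these metadata operations concrete, the following minimal Java sketch (not part of the paper's own experiments) lists a directory through Hadoop's FileSystem API; the NameNode address and the /user/hadoop path are placeholder assumptions.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class ListHdfsDirectory {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Assumed NameNode address; in practice this usually comes from core-site.xml.
        conf.set("fs.defaultFS", "hdfs://namenode:8020");

        FileSystem fs = FileSystem.get(conf);

        // listStatus is a metadata operation: it is answered by the NameNode,
        // and no DataNode is contacted.
        for (FileStatus status : fs.listStatus(new Path("/user/hadoop"))) {
            System.out.printf("%s\t%d bytes%n", status.getPath(), status.getLen());
        }
        fs.close();
    }
}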
3.1 HDFS Write Operation
First, the HDFS client contacts the NameNode with a write request. The NameNode then grants the client write permission and provides the IP addresses of the DataNodes to which the file blocks will eventually be copied. The selection of DataNodes is randomized, subject to availability, replication factor and rack awareness. The whole data copy process happens in three stages: pipeline setup; data streaming and replication; and pipeline shutdown (the acknowledgement stage)[1].
Figure 2. Writing data to HDFS
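As an illustration of the client side of this write path, the sketch below again uses the FileSystem Java API; block placement and the DataNode pipeline are handled transparently by the client library. The file path and its contents are hypothetical.

import java.nio.charset.StandardCharsets;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsWriteExample {
    public static void main(String[] args) throws Exception {
        FileSystem fs = FileSystem.get(new Configuration());

        // create() asks the NameNode for write permission and target DataNodes;
        // the returned stream then pipelines data blocks to those DataNodes.
        Path file = new Path("/user/hadoop/students.csv");   // hypothetical path
        try (FSDataOutputStream out = fs.create(file, true)) {
            out.write("2019,4.35\n2018,4.10\n".getBytes(StandardCharsets.UTF_8));
        }
        fs.close();
    }
}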
3.2 HDFS Read Operation
The client contacts the NameNode asking for the block metadata of the file. The NameNode returns the list of DataNodes where the blocks are stored. The client then connects to those DataNodes and reads the data from them in parallel. Once the client has all the required file blocks, it combines them to form the file[1].
Figure 3. Reading the file from HDFS
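A corresponding read sketch, again an illustration rather than the paper's own code, is shown below; the file path is the same hypothetical one used in the write example.

import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.nio.charset.StandardCharsets;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsReadExample {
    public static void main(String[] args) throws Exception {
        FileSystem fs = FileSystem.get(new Configuration());

        // open() fetches block locations from the NameNode; the stream then
        // reads each block directly from a DataNode holding a replica.
        Path file = new Path("/user/hadoop/students.csv");   // hypothetical path
        try (FSDataInputStream in = fs.open(file);
             BufferedReader reader =
                     new BufferedReader(new InputStreamReader(in, StandardCharsets.UTF_8))) {
            String line;
            while ((line = reader.readLine()) != null) {
                System.out.println(line);
            }
        }
        fs.close();
    }
}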
3.3 HDFS Resilience
HDFS is robust against a number of failure types:
• DataNode failure: Each DataNode periodically sends a heartbeat message to the NameNode. If the NameNode does not receive a heartbeat from a DataNode for a specific amount of time, the node is considered dead and no further operations are scheduled for it. To prevent data loss when a DataNode is lost, the NameNode starts a new replication process to bring the affected blocks back to the configured replication factor[8].
• HDFS federation: NameNodes can be set up to manage different namespace volumes on a computer cluster to mitigate the impact of data loss[8].
• NameNode failure: As the NameNode is a single point of failure, it is critical that its data can be restored. Therefore, one can configure a standby NameNode or write the EditLog and FsImage to both the local disk and a Network File System (NFS)[8].
4 MAPREDUCE
MapReduce is a core component of Apache Hadoop
[3], and is defined as a programming model for
processing and generating large datasets across
a cluster of computers[16][15]. This programming
model is inspired by the map and reduce prim-
itives of functional programming languages such
as Lisp[16]. In functional programming languages, map takes a function and a sequence of values as input and applies the function to each value in the sequence, while reduce takes a sequence of values as input and combines all the values using a binary operator.
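As a small illustration of these primitives (using Java streams rather than Lisp, purely as an analogy; the values are arbitrary):

import java.util.Arrays;
import java.util.List;
import java.util.stream.Collectors;

public class MapReducePrimitives {
    public static void main(String[] args) {
        List<Integer> values = Arrays.asList(1, 2, 3, 4);

        // map: apply a function (here, squaring) to every value in the sequence
        List<Integer> squares = values.stream()
                .map(v -> v * v)
                .collect(Collectors.toList());

        // reduce: combine all values with a binary operator (here, addition)
        int sum = values.stream().reduce(0, Integer::sum);

        System.out.println(squares); // [1, 4, 9, 16]
        System.out.println(sum);     // 10
    }
}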
4.1 How MapReduce Works
MapReduce works by breaking the processing
into two phases: the map phase and the reduce
phase[3][16]. Each phase has key-value pairs as
input and output, the types of which may be chosen
by the programmer. The programmer also specifies
two functions: the map function and the reduce
function[3].
Figure 4. MapReduce logical data flow, showing the highest Cumulative Grade Point Average (CGPA) for each year in the students' dataset
In Figure 4 above, the map function extracts the year and CGPA from each record of the students' dataset and emits them as its output. The MapReduce framework then sorts, shuffles and groups the (year, CGPA) key-value pairs by year. Finally, the reduce function iterates through the list of CGPAs for each year and picks the highest one.
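A hedged sketch of this job using the Hadoop MapReduce Java API follows; the comma-separated record layout (studentId,year,cgpa), the class names and the paths are assumptions for illustration, not taken from an actual dataset used in this paper.

import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.DoubleWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class MaxCgpaByYear {

    // Map phase: extract (year, CGPA) from each record of the students' dataset.
    public static class CgpaMapper
            extends Mapper<Object, Text, Text, DoubleWritable> {
        @Override
        protected void map(Object key, Text value, Context context)
                throws IOException, InterruptedException {
            // Assumed record layout: studentId,year,cgpa
            String[] fields = value.toString().split(",");
            if (fields.length == 3) {
                context.write(new Text(fields[1].trim()),
                        new DoubleWritable(Double.parseDouble(fields[2].trim())));
            }
        }
    }

    // Reduce phase: for each year, keep the highest CGPA seen.
    public static class MaxCgpaReducer
            extends Reducer<Text, DoubleWritable, Text, DoubleWritable> {
        @Override
        protected void reduce(Text year, Iterable<DoubleWritable> cgpas, Context context)
                throws IOException, InterruptedException {
            double max = Double.NEGATIVE_INFINITY;
            for (DoubleWritable cgpa : cgpas) {
                max = Math.max(max, cgpa.get());
            }
            context.write(year, new DoubleWritable(max));
        }
    }

    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "max cgpa by year");
        job.setJarByClass(MaxCgpaByYear.class);
        job.setMapperClass(CgpaMapper.class);
        job.setReducerClass(MaxCgpaReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(DoubleWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}

Such a job would typically be packaged into a jar and submitted with the hadoop jar command, passing the input and output HDFS paths as arguments.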
5 FUTURE HADOOP
This section discusses the recent developments in
Hadoop. On January 30th 2019, Hadoop developers
gathered to share their latest work, with presenta-
tions by engineers from community members like
Microsoft, Cloudera, Uber, and LinkedIn[4]. Some
of the latest works that were presented include:
TonY: TensorFlow on YARN, an application for
running TensorFlow and other distributed machine
learning jobs on Hadoop. TensorFlow is a popular
open-source machine learning library for creating
machine learning and deep learning applications[4].
Hadoop Encryption: Key Management Server
(KMS) and column-level encryption. As the breadth
of Hadoop expands and new data protection rules
like General Data Protection Regulation (GDPR) are
put in place, there is increasing pressure for Hadoop
administrators to encrypt their data. The key de-
velopment here is a schema-controlled column-level
access control via encryption[4].
HDFS High Availability Enhancement: In HDFS,
a single metadata server, known as the NameNode,
serves all of the client requests for information
about and updates to the file system. To keep the
system running with high availability, there are
replicas of the NameNode known as Standby Na-
meNodes, which maintain an up-to-date view of the
file system, ready to take over in case of failures.
This new feature enables a client to read meta-
data from this standby node and greatly increases
the number of read requests an HDFS cluster can
serve[4].
Ozone: This is a new storage project within
Hadoop that is closely related to HDFS except that
it is an object store as opposed to a file system. This
means it has semantics similar to those of Amazon’s
S3, rather than those of a file system. This allows for
enormous scalability gains[4].
6 STRENGTHS AND WEAKNESSES
In the digital world, it is important to examine strengths and weaknesses, and to see whether the strengths outweigh the weaknesses. Hadoop comes with its own set of strengths and weaknesses.
6.1 Strengths
Following are the advantages or strengths of
Hadoop:
• Varied Data Sources: Hadoop accepts a variety of data, both structured and unstructured[2].
• Cost-effective: Hadoop is an economical so-
lution as it uses a cluster of commodity hard-
ware to store data[2].
• Performance: Hadoop with its distributed
processing and distributed storage architec-
ture processes huge amounts of data with
high speed[2].
• Fault-Tolerant: A key advantage of using Hadoop is its fault tolerance[12]. Even if some of the data nodes fail, there is no loss of data because the same data is replicated on other data nodes[10].
• High Availability: Hadoop 3.0 supports multiple standby NameNodes, making the system even more highly available as it can continue functioning even if two or more NameNodes crash[2].
• Low Network Traffic: In Hadoop, each job submitted by the user is split into a number of independent sub-tasks which are assigned to the data nodes, thereby moving a small amount of code to the data rather than moving huge amounts of data to the code, which leads to low network traffic[2].
• Scalable: Hadoop works on the principle of horizontal scalability, i.e. entire machines are added to the cluster of nodes rather than changing the configuration of a machine by adding RAM, disk and so on (which is known as vertical scalability). Nodes can be added to a Hadoop cluster on the fly, making it a scalable framework[2].
6.2 Weaknesses
Following are the drawbacks or weaknesses of
Hadoop:
• Issue With Small Files: Hadoop is suitable for a small number of large files, but it struggles with applications that deal with a large number of small files. A small file is one that is significantly smaller than Hadoop's block size, which is 128 MB or 256 MB by default. A large number of small files overloads the Namenode, which stores the namespace for the file system, and makes it difficult for Hadoop to function[2].
• Processing Overhead: In Hadoop, data is read from and written to disk, which makes read/write operations very expensive when dealing with terabytes and petabytes of data. Hadoop cannot do in-memory calculations, hence it incurs processing overhead[2].
• Supports Only Batch Processing: At the core,
Hadoop has a batch processing engine which
is not efficient in stream processing. It can-
not produce output in real-time with low
latency. It only works on data which we
collect and store in a file in advance before
processing[2].
• Iterative Processing: Hadoop cannot do iterative processing by itself. Machine learning and other iterative workloads have a cyclic data flow, whereas Hadoop has data flowing in a chain of stages where the output of one stage becomes the input of the next[2].
7 CASE STUDY: LINKEDIN ASSESSMENTS
In LinkedIn assessments, the Rasch Model is used
to calibrate question difficulty based on answer
response data. Each question has three statuses:
draft, limited ramp, and full ramp. Newly added
questions are in the “draft” status. Subsequently,
questions marked as “draft” are randomly picked
and surfaced as actual assessment questions to test
takers, thus entering the “limited ramp” stage[7].
The three Hadoop workflows at LinkedIn to manage these statuses are as follows:
• Initial calibration: For new assessments, be-
fore fully releasing them to the public,
enough answer response data is collected to
calibrate the initial difficulty for each ques-
tion. This data mainly comes from internal
employees or a small group of dynamically-
selected LinkedIn members. This Hadoop
workflow will calibrate the answer re-
sponse data for all questions within the new
assessment[7].
• On-the-fly calibration: New assessment ques-
tions go through the “draft” and “limited
ramp” flow before they are “fully ramped”.
This Hadoop workflow only calibrates the
answer response data for those new ques-
tions within an assessment[7].
• Recalibration: The difficulty of fully ramped
questions is recalibrated periodically with
new answer response data, so that their dif-
ficulty will always be relevant[7].
8 HADOOP ALTERNATIVES
Some of the popular open source alternatives to Hadoop are Disco, Spark, BashReduce, GraphLab,
Storm, HPCC Systems, etc.[5]. Disco allows developers to write MapReduce jobs in Python; its backend is built using Erlang, a functional language with built-in support for consistency, fault tolerance and job scheduling. Disco does not use HDFS; rather, it uses a fault-tolerant distributed file system called DDFS[5]. Spark, developed at UC Berkeley, allows in-memory MapReduce processing distributed across multiple machines and is implemented in Scala[5]. BashReduce, developed by Richard Crowley, implements MapReduce for standard Unix commands such as sort, awk, grep and join. GraphLab was developed at Carnegie Mellon and is designed for use in machine learning[5]. Storm, developed by Nathan Marz, is a distributed, reliable and fault-tolerant stream processing system that processes data in parallel[5]. High-Performance Computing Cluster (HPCC), developed by LexisNexis Risk Solutions and written in C++, makes writing parallel-processing workflows easier through its Enterprise Control Language (ECL).
9 RESULTS
Figure 5. Running Hadoop in the Ambari user interface
10 CONCLUSION
The Hadoop Distributed File System is optimized to work on large datasets. It is designed to hold terabytes or petabytes of data and provides high-throughput access to this data.
REFERENCES
[1] B. Ashish, “Apache hadoop hdfs architec-
ture,” 2019, [Online; accessed 23-October-2019].
[Online]. Available: https://www.edureka.co/blog/
apache-hadoop-hdfs-architecture/
Figure 6. Running a MapReduce job in the command-line interface (PuTTY)
[2] T. DataFlair, “Top advantages and disadvantages
of hadoop 3,” 2019, [Online; accessed 22-October-
2019]. [Online]. Available: https://data-flair.training/
blogs/advantages-and-disadvantages-of-hadoop/
[3] J. Dean and S. Ghemawat, “Mapreduce: a flexible data
processing tool,” Communications of the ACM, vol. 53, no. 1,
pp. 72–77, 2010.
[4] K. Erik, “The present and future of apache hadoop:
A community meetup at linkedin,” 2019, [Online;
accessed 22-October-2019]. [Online]. Available: http:
//bit.ly/322O84w
[5] R. Gauri and Y. Nandita, “A comprehensive survey on
big data issues and alternative approaches to hadoop
mapreduce,” International Journal of Emerging Technology
and Advanced Engineering, vol. 5, no. 2, 2015.
[6] M. Kerzner and S. Maniyam, “Hadoop illuminated,”
Hadoop illuminated LLC, 2013.
[7] Linkedin, “The building blocks of linkedin skill assess-
ments,” 2019, [Online; accessed 27-October-2019]. [On-
line]. Available: https://engineering.linkedin.com/blog/
2019/the-building-blocks-of-linkedin-skill-assessments
[8] M. Martin, “Apache hadoop tutorial: The ultimate guide,”
2016.
[9] S. Mehta and V. Mehta, “Hadoop ecosystem: An intro-
duction,” International Journal of Science and Research (IJSR),
vol. 5, no. 6, pp. 557–562, 2016.
[10] T. Nathan, “Advantages of hadoop,” Scientific & Engineer-
ing Research, vol. 6, no. 1, 2015.
[11] O. O’Malley, “Integrating kerberos into apache hadoop,”
in Kerberos Conference, 2010, pp. 26–27.
[12] P. Steve, “Hadoop advantages & disadvan-
tages,” 2015, [Online; accessed 27-October-2019].
[Online]. Available: https://www.mindsmapped.com/
hadoop-advantages-and-disadvantages/
[13] J. Turner, “Hadoop: What it is, how it works, and what it
can do,” 2011.
[14] C. Uzunkaya, T. Ensari, and Y. Kavurucu, “Hadoop
ecosystem and its analysis on tweets,” Procedia-Social and
Behavioral Sciences, vol. 195, pp. 1890–1897, 2015.
[15] T. White, Hadoop: The Definitive Guide. O'Reilly Media, Inc., 2012.
[16] J. Zhao and J. Pjesivac-Grbovic, “Mapreduce: The pro-
gramming model and practice,” 2009.
