Big Data Analysis and Its Validation Using
Hadoop
Ankeeta Battalwar
Department of Computer Engineering
Fr. C. Rodrigues Institute of Technology, Vashi
Navi Mumbai, India.
Abstract-This paper introduces the concepts of Big Data and Big Data analytics. It provides an overview of Big Data and of Big Data analysis using Hadoop, a popular analytics framework, and discusses the validation of Big Data within a Hadoop system. The study of these concepts establishes a foundation on which a data validation system can be built. The proposed system validates data arriving from various sources before it enters the Big Data system, so that only data that is atomic, consistent, integrated and relevant to the system is stored.
I. INTRODUCTION
Big data is data that exceeds the processing
capacity of conventional database systems. The data
is too big, moves too fast, or doesn’t fit the strictures
of your database architectures. To gain value from
this data, you must choose an alternative way to
process it. The fundamentals of big data analysis can
only be understood by getting familiar with the term
“Big Data”.
Big data is the term for a collection of data
sets so large and complex that it becomes difficult to
process using on-hand database management tools or
traditional data processing applications. Big Data
usually includes data sets with sizes beyond the
ability of commonly used software tools to capture,
curate, manage, and process the data within a
tolerable elapsed time.
Big data analysis involves analysing and carefully processing data so that it can be made available to consumers. It encompasses the organization, administration and governance of large volumes of both structured and unstructured data. The goal of big data management is to ensure a high level of data quality and accessibility for business intelligence and big data analytics applications.
II. OVERVIEW OF BIG DATA
A. Tools available for Big data analysis
Among the many tools available for Big Data analysis, a few popular ones are listed below:
Hadoop: Apache Hadoop is an open-source
software framework for storage and large-scale
processing of data-sets on clusters of commodity
hardware. Hadoop is an Apache top-level project
being built and used by a global community of
contributors and users. It is licensed under the
Apache License 2.0.
MongoDB: MongoDB is a cross-platform
document-oriented database system. Released under
a combination of the GNU Affero General Public
License and the Apache License, MongoDB is free
and open-source software.
Splunk: Splunk captures, indexes and correlates
real-time data in a searchable repository from which
it can generate graphs, reports, alerts, dashboards and
visualizations.
MapReduce: Originally developed by Google,
the MapReduce website describes it as “a
programming model and software framework for
writing applications that rapidly process vast amounts
of data in parallel on large clusters of compute
nodes.” It’s used by Hadoop, as well as many other
data processing applications.
Operating System: OS Independent.
GridGain: GridGain offers an alternative to
Hadoop’s MapReduce that is compatible with the
Hadoop Distributed File System. It offers in-memory
processing for fast analysis of real-time data.
Operating System: Windows, Linux, OS X.
Storm: Now owned by Twitter, Storm offers
distributed real-time computation capabilities and is
often described as the “Hadoop of real time.” It’s
highly scalable, robust, fault-tolerant and works with
nearly all programming languages.
Operating System: Linux.
HBase: Another Apache project, HBase is
the non-relational data store for Hadoop. Features
include linear and modular scalability, strictly
consistent reads and writes, automatic failover
support and much more.
Operating System: OS Independent.
Hibari: Used by many telecom companies,
Hibari is a key-value, big data store with strong
consistency, high availability and fast performance.
Support is available through Gemini Mobile.
Operating System: OS Independent.
B. Traditional Data Warehouse Analytics vs. Big Data Analytics

1. Traditional analytics analyzes a known, well-understood data terrain. Most data warehouses have elaborate ETL processes and database constraints, which means that the data loaded into a data warehouse is well understood, cleansed and in line with the business metadata. Big Data analytics, by contrast, targets unstructured data that lies outside the traditional means of capturing data, so there is no guarantee that the incoming data is well formed, clean and free of errors. This makes the analysis more challenging, but it also gives scope for much deeper insight into the data.

2. Traditional analytics is built on top of the relational data model: relationships between the subjects of interest are created inside the system and the analysis is based on them. In practice it is very difficult to establish such relationships formally for all information, so unstructured data in the form of images, videos, mobile-generated information, RFID data and the like has to be handled by big data analytics.

3. Traditional analytics is batch oriented; we need to wait for ETL and transformation jobs to complete before the required insight is obtained. Big Data analytics aims at near-real-time analysis of the data using software designed for that purpose.

4. Parallelism in a traditional analytics system is achieved through costly hardware such as MPP (Massively Parallel Processing) and/or SMP systems. While appliances are available on the market for Big Data analytics, the same parallelism can also be achieved through commodity hardware and a new generation of analytical software such as Hadoop or other analytical databases.
III. OVERVIEW OF HADOOP
Hadoop is a widely preferred Big Data analysis tool for several reasons: it is inexpensive, being open-source software, and it is highly fault tolerant and effective at Big Data storage and processing. The Hadoop architecture performs the required functions through its two most vital components, MapReduce and the Hadoop Distributed File System (HDFS). Both components are essential to understanding how Hadoop handles Big Data.
A. Hadoop Architecture
Hadoop consists of the Hadoop Common
package, which provides filesystem and OS level
abstractions, a MapReduce engine (either
MapReduce/MR1 or YARN/MR2) and the Hadoop
Distributed File System (HDFS). The Hadoop
Common package contains the necessary Java
archive files and scripts needed to start Hadoop. The
package also provides source code, documentation
and a contribution section that includes projects from
the Hadoop Community.
A node in the Hadoop architecture is any end point of processing or storage activity. The storage system is divided, both physically and virtually, into racks (collections of smaller storage units). For effective
scheduling of work, every Hadoop-compatible file
system should provide location awareness, that is, the
name of the rack (more precisely, of the network
switch) where a worker node is.
A small Hadoop cluster includes a single
master and multiple worker nodes. The master node
consists of a JobTracker, TaskTracker, NameNode
and DataNode. A slave or worker node acts as both a
DataNode and TaskTracker, though it is possible to
have data-only worker nodes and compute-only
worker nodes. Hadoop requires Java Runtime
Environment (JRE) 1.6 or higher. The standard start-up and
shutdown scripts require Secure Shell (SSH) to be set
up between nodes in the cluster.
In a larger cluster, the HDFS is managed
through a dedicated NameNode server to host the file
system index, and a secondary NameNode that can
generate snapshots of the NameNode’s memory
structures, thus preventing file-system corruption and
reducing loss of data. Similarly, a standalone
JobTracker server can manage job scheduling. In
clusters where the Hadoop MapReduce engine is
deployed against an alternate file system, the
NameNode, secondary NameNode and DataNode
architecture of HDFS is replaced by the file-system-
specific equivalent.
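As a rough illustration of how a client locates the NameNode described above, the following minimal sketch uses the public Hadoop Java API to obtain a FileSystem handle; the NameNode address is a hypothetical placeholder, not a value taken from this paper.

```java
import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;

public class HdfsClient {
    public static void main(String[] args) throws IOException {
        Configuration conf = new Configuration();
        // Point the client at the cluster's NameNode (hypothetical address).
        conf.set("fs.defaultFS", "hdfs://namenode-host:9000");
        FileSystem fs = FileSystem.get(conf);
        System.out.println("Connected to: " + fs.getUri());
        fs.close();
    }
}
```

In a real deployment these properties would normally come from the cluster's configuration files rather than being set in code; the sketch only shows the connection pattern.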
B. Hadoop Distributed File System (HDFS)
HDFS is Hadoop's file system for handling large data sets. It stores data across a number of small repositories built from simple commodity hardware that both stores and processes it. Each such storage and processing unit is a node, and together these nodes form a cluster over which the distributed file system is spread. This file system, where all the data resides, provides scalable, fault-tolerant, redundant storage.
Hadoop uses the strategy of bringing the processing code to the data, thus reducing large data transfers over the distributed network. The data, whether structured or unstructured, is split across the network and each part is processed separately. All data is replicated, and every change in the system is recorded. HDFS is a network of racks of low-cost storage hardware, where each rack holds smaller storage units on which data is stored as blocks (64 megabytes by default in early Hadoop versions). The racks use TCP/IP-based protocols and Remote Procedure Calls (RPC) to communicate with each other. HDFS implements 3-way replication: two copies of the data on the same rack and one more copy on a different rack. A single DataNode may hold one or many such blocks.
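To make the block and replication behaviour described above concrete, the sketch below writes a small file to HDFS and reads back its replication factor and block size through the standard FileSystem API; the NameNode address and file path are illustrative assumptions.

```java
import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class ReplicationCheck {
    public static void main(String[] args) throws IOException {
        Configuration conf = new Configuration();
        conf.set("fs.defaultFS", "hdfs://namenode-host:9000"); // hypothetical NameNode
        FileSystem fs = FileSystem.get(conf);

        // Write a small file; HDFS splits files into blocks and replicates them.
        Path file = new Path("/tmp/replication-demo.txt");      // hypothetical path
        FSDataOutputStream out = fs.create(file, true);
        out.writeUTF("sample record");
        out.close();

        // Inspect how the cluster stored it.
        FileStatus status = fs.getFileStatus(file);
        System.out.println("Replication factor: " + status.getReplication());
        System.out.println("Block size (bytes): " + status.getBlockSize());
        fs.close();
    }
}
```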
1) Architecture
The HDFS architecture is organized into namespaces, each managed by a NameNode that is connected to a set of DataNodes. It is a master-slave architecture in which the NameNode acts as the master controlling the DataNode slaves. Both are pieces of software that enable recognition and manipulation of data in HDFS.
The different components and functions are
listed as follows:
NameNode: A master server that manages
the file system namespace and regulates access to
files by clients. It manages data through DataNodes.
It executes file system namespace operations like
opening, closing, and renaming files and directories.
DataNode: It manages storage attached to the node that it runs on. A file is split into one or more blocks, and these blocks are stored in a set of DataNodes. A DataNode serves read and write requests from the file system's clients and performs block creation, deletion and replication upon instruction from the NameNode.
HeartBeat: Receipt of a Heartbeat implies
that the DataNode is functioning properly. Using
HeartBeats, the system knows which nodes are
running and where failure may have occurred. This
helps in real time process tracking and allows counter
measures to be applied to maintain throughput.
BlockReport: A BlockReport contains a list
of all blocks on a DataNode. BlockReport along with
HeartBeat keeps track of working nodes.
Safemode: On startup, the NameNode enters
a special state called Safemode. Replication of data
blocks does not occur when the NameNode is in the
Safemode state. The NameNode receives Heartbeat and BlockReport messages from the DataNodes, and once a sufficient fraction of blocks is confirmed as safely replicated, it exits Safemode and then replicates any remaining under-replicated blocks.
EditLog: The NameNode uses a transaction
log called the EditLog to persistently record every
change that occurs in file system metadata.
FsImage: The NameNode keeps an image of the file system namespace, called the FsImage, stored as a file in its local file system; it acts as a checkpoint. In case of a crash or malfunction, instead of rebuilding the namespace from scratch, the FsImage can be used to restore it quickly.
DFSAdmin: The DFSAdmin command set
is used for administering an HDFS cluster. These are
commands that are used only by an HDFS
administrator.
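As a brief illustration of the namespace operations the NameNode coordinates and the block storage the DataNodes provide, the following sketch exercises a few standard FileSystem calls (directory creation, rename, listing, delete); the paths and NameNode address are hypothetical.

```java
import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class NamespaceOps {
    public static void main(String[] args) throws IOException {
        Configuration conf = new Configuration();
        conf.set("fs.defaultFS", "hdfs://namenode-host:9000"); // hypothetical NameNode
        FileSystem fs = FileSystem.get(conf);

        Path dir = new Path("/demo/input");          // hypothetical directory
        fs.mkdirs(dir);                              // namespace operation: create directory
        fs.rename(dir, new Path("/demo/renamed"));   // namespace operation: rename

        // List entries under /demo and show their replication factors.
        for (FileStatus s : fs.listStatus(new Path("/demo"))) {
            System.out.println(s.getPath() + " replication=" + s.getReplication());
        }

        fs.delete(new Path("/demo"), true);          // recursive delete
        fs.close();
    }
}
```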
C. MapReduce
The MapReduce framework handles distributed parallel computation on Big Data data sets. It works on local blocks of data and is designed to reduce traffic on the network. It uses the JobTracker and TaskTracker services to handle parallel, batch-wise processing of Big Data.
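The canonical word-count job illustrates this model: the map phase emits a (word, 1) pair for every token read from local input blocks, and the reduce phase sums the counts for each word. The sketch below follows the standard Hadoop MapReduce Java API; input and output paths are passed on the command line.

```java
import java.io.IOException;
import java.util.StringTokenizer;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

    // Map: for each input line, emit (word, 1) for every token.
    public static class TokenizerMapper
            extends Mapper<LongWritable, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();

        @Override
        protected void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            StringTokenizer itr = new StringTokenizer(value.toString());
            while (itr.hasMoreTokens()) {
                word.set(itr.nextToken());
                context.write(word, ONE);
            }
        }
    }

    // Reduce: sum the counts emitted for each word.
    public static class IntSumReducer
            extends Reducer<Text, IntWritable, Text, IntWritable> {
        private final IntWritable result = new IntWritable();

        @Override
        protected void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable val : values) {
                sum += val.get();
            }
            result.set(sum);
            context.write(key, result);
        }
    }

    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Job job = Job.getInstance(conf, "word count");
        job.setJarByClass(WordCount.class);
        job.setMapperClass(TokenizerMapper.class);
        job.setCombinerClass(IntSumReducer.class);
        job.setReducerClass(IntSumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
```

Such a job would typically be packaged as a jar and submitted with the hadoop jar command against input already stored in HDFS; the JobTracker schedules map tasks on the nodes holding the relevant blocks.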
IV. VALIDATION OF DATA
Big Data presents challenges in all phases of data analytics. Testing Big Data systems is one of the major challenges faced by organizations because of a lack of knowledge about what to test and how much data should be tested. Since big data comprises a wide variety and large volume of data arriving at high velocity, organizations struggle to define and implement test strategies for structured and unstructured data validation.
Setting up an optimal test environment is essential for testing large volumes of data. Working with non-relational databases and performing non-functional testing are further challenges. These challenges lead to poor-quality data reaching production, delayed implementations and increased cost. Robust testing approaches need to be defined for validating structured and semi-structured data, and testing activities need to start early so that possible defects are identified early in the implementation life cycle, reducing the overall cost, time and effort spent on testing these heterogeneous data processing systems.
A. Challenges to Data Validation
While many organizations have well-defined IT
security policies, those policies mostly revolve
around trust and procedure, both of which can break
down and fail. A major complicating factor is that
nearly all physical forms of documentation and transactions have digitized versions, which means
organizations must validate duplicate sets of data and
be able to respond quickly to any invalid data set.
At the same time, cybercrime has emerged as a
very profitable business for criminals, underlining the growing need to validate and verify all kinds of data. Another area of concern is the increasing
dominance of mobile communications within most
organizations. This mobile dimension opens up
another deep fissure of insecurity, as the myriad types
of mobile devices and mobile networks can create
new avenues for malicious practices.
To test and validate the data to be processed in big data systems, a test automation tool is needed that can validate the data in both the RDBMS and the big data system, so that only correct data is processed, accurate results are obtained, and the data is demonstrably tested and validated.
B. Suggested system for Data Validation
Thus, we suggest building a utility that fetches data from both the RDBMS and the Big Data system (Hadoop) and compares them using a comparison engine. The engine validates whether records from an RDBMS such as a SQL Server database and records stored in Hadoop (HDFS) match each other, ensuring that there are no mismatched or extra records. A minimal sketch of such a comparison engine follows the list below.
• Through a user interface, the system extracts the data from these databases; a comparison engine then validates the records and generates appropriate test results and reports once the test cases have been executed.
• The reports will specify whether there are
any mismatched records, extra or duplicate
records, faulty records etc. The matched
records will be processed and carried
forward, thus ensuring that only the correct
data is being inserted into the Big data
systems.
• The detailed multi-level reports can be
accessed by different users of the system
depending on the access privileges. The
utility will provide detailed test result
summary and allow the tester/user to
perform different operations.
• This system ensures data validation and achieves a degree of automation. The utility fetches data from the SQL Server database and from Hadoop (HDFS) and allows the records to be compared in order to find mismatching records, remove any duplicate records and perform data validations on the records.
• The system validates data quality by employing a structured testing technique; using the utility, both structured and semi-structured database records can be compared and the data validation process automated.
• The system provides full control to the
administrator to access all the files and data
and reports.
• Various analysis reports can be generated
for improving the process of data validation
and providing 100% test coverage.
• The system is scalable and can reliably
validate and process high volumes of data.
• The use of algorithms for data validation
process and adaptability for varying
requirements makes the utility robust and
reliable.
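A minimal sketch of the comparison engine suggested above is given below. It assumes, purely for illustration, that each record carries a primary key, that the RDBMS is reachable over JDBC, and that the corresponding HDFS data is stored as comma-delimited text with the key in the first field; the connection string, table, column and path names are hypothetical placeholders.

```java
import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.nio.charset.StandardCharsets;
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;
import java.util.HashSet;
import java.util.Set;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class RecordComparator {
    public static void main(String[] args) throws Exception {
        // 1. Fetch primary keys from the RDBMS (hypothetical SQL Server connection).
        Set<String> rdbmsKeys = new HashSet<>();
        try (Connection con = DriverManager.getConnection(
                 "jdbc:sqlserver://dbhost;databaseName=sales", "user", "password");
             Statement st = con.createStatement();
             ResultSet rs = st.executeQuery("SELECT order_id FROM orders")) {
            while (rs.next()) {
                rdbmsKeys.add(rs.getString(1));
            }
        }

        // 2. Fetch the same keys from HDFS (hypothetical delimited text file).
        Set<String> hdfsKeys = new HashSet<>();
        Configuration conf = new Configuration();
        conf.set("fs.defaultFS", "hdfs://namenode-host:9000");
        try (FileSystem fs = FileSystem.get(conf);
             BufferedReader reader = new BufferedReader(new InputStreamReader(
                 fs.open(new Path("/data/orders/part-00000")), StandardCharsets.UTF_8))) {
            String line;
            while ((line = reader.readLine()) != null) {
                hdfsKeys.add(line.split(",")[0]);   // key is assumed to be the first field
            }
        }

        // 3. Compare the two key sets and report mismatches.
        Set<String> missingInHdfs = new HashSet<>(rdbmsKeys);
        missingInHdfs.removeAll(hdfsKeys);
        Set<String> extraInHdfs = new HashSet<>(hdfsKeys);
        extraInHdfs.removeAll(rdbmsKeys);

        System.out.println("Records missing in HDFS : " + missingInHdfs.size());
        System.out.println("Extra records in HDFS   : " + extraInHdfs.size());
    }
}
```

A full utility would extend this per-key comparison to field-level checks and duplicate detection, and would run the HDFS side as a MapReduce job for very large volumes.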
V. CONCLUSION
Our proposed system for data validation built on the Hadoop architecture allows big data to be handled efficiently. It uses MapReduce and HDFS to handle large amounts of data by distributing them across a cluster, providing features such as authentication, replication, data integrity and space reclamation.
Once the data has been successfully loaded into the cluster, the validation system uses a comparison engine to compare this stored data with the traditional relational database from which it was derived. This ensures that the data entering the big data system is authentic, integrated and consistent.
VI. FUTURE SCOPE
Having seen the importance of Big Data analysis in the current workplace, it is safe to assume that it will have many applications in the foreseeable future. Millions of terabytes of organizational data can be converted into useful, beneficial information using the concepts of HDFS and MapReduce in Hadoop. For example, with the right big data analytics platforms in place, an enterprise can boost sales, increase efficiency, and improve operations, customer service and risk management.
The scope of the application can be expanded by
making it available for larger and more complex data
sources. Large organizations can use an optimum test
environment like the one suggested above to
successfully store and validate data so as to benefit
the organization.