Big Data Analysis and Its Validation Using
Hadoop
Ankeeta Battalwar
Department of Computer Engineering
Fr. C. Rodrigues Institute of Technology, Vashi
Navi Mumbai, India.
Abstract-This paper introduces the concepts of Big Data and Big Data analytics. It provides an overview of Big Data and of Big Data analysis using Hadoop, a popular analytics framework, and discusses the validation of Big Data within a Hadoop system. The study of these concepts establishes a foundation on which a data validation system can be built. The proposed system validates data arriving from various sources before it enters the Big Data system, so that only data that is atomic, consistent, integrated and relevant to the system is stored.
I. INTRODUCTION
Big data is data that exceeds the processing
capacity of conventional database systems. The data
is too big, moves too fast, or doesn’t fit the strictures
of your database architectures. To gain value from
this data, you must choose an alternative way to
process it. The fundamentals of big data analysis can
only be understood by getting familiar with the term
“Big Data”.
Big data is the term for a collection of data
sets so large and complex that it becomes difficult to
process using on-hand database management tools or
traditional data processing applications. Big Data
usually includes data sets with sizes beyond the
ability of commonly used software tools to capture,
curate, manage, and process the data within a
tolerable elapsed time.
Big data analysis involves analysing and carefully processing data so that it can be made available to consumers. It encompasses the organization, administration and governance of large volumes of both structured and unstructured data. The goal of big data management is to ensure a high level of data quality and accessibility for business intelligence and big data analytics applications.
II. OVERVIEW OF BIG DATA
A. Tools available for Big data analysis
Among the many tools available for Big Data analysis, a few popular ones are listed below:
Hadoop: Apache Hadoop is an open-source
software framework for storage and large-scale
processing of data-sets on clusters of commodity
hardware. Hadoop is an Apache top-level project
being built and used by a global community of
contributors and users. It is licensed under the
Apache License 2.0.
MongoDB: MongoDB is a cross-platform
document-oriented database system. Released under
a combination of the GNU Affero General Public
License and the Apache License, MongoDB is free
and open-source software.
Splunk: Splunk captures, indexes and correlates
real-time data in a searchable repository from which
it can generate graphs, reports, alerts, dashboards and
visualizations.
MapReduce: Originally developed by Google,
the MapReduce website describes it as “a
programming model and software framework for
writing applications that rapidly process vast amounts
of data in parallel on large clusters of compute
nodes.” It’s used by Hadoop, as well as many other
data processing applications.
Operating System: OS Independent.
GridGain: GridGain offers an alternative to
Hadoop’s MapReduce that is compatible with the
Hadoop Distributed File System. It offers in-memory
processing for fast analysis of real-time data.
Operating System: Windows, Linux, OS X.
Storm: Now owned by Twitter, Storm offers
distributed real-time computation capabilities and is
often described as the “Hadoop of real time.” It’s
highly scalable, robust, fault-tolerant and works with
nearly all programming languages.
Operating System: Linux.
HBase: Another Apache project, HBase is
the non-relational data store for Hadoop. Features
include linear and modular scalability, strictly
consistent reads and writes, automatic failover
support and much more.
Operating System: OS Independent.
Hibari: Used by many telecom companies,
Hibari is a key-value, big data store with strong
consistency, high availability and fast performance.
Support is available through Gemini Mobile.
Operating System: OS Independent.
B. Traditional Data Warehouse Analytics vs. Big Data Analytics

1. Traditional analytics analyzes a known, well-understood data terrain. Most data warehouses have elaborate ETL processes and database constraints, which means that the data loaded into a data warehouse is well understood, cleansed and in line with the business metadata. Big Data analytics, by contrast, targets unstructured data that lies outside the traditional means of capturing data, so there is no guarantee that the incoming data is well formed, clean and free of errors. This makes the analysis more challenging, but it also gives scope for much deeper insight into the data.

2. Traditional analytics is built on top of the relational data model: relationships between the subjects of interest are created inside the system and the analysis is based on them. In practice it is very difficult to establish such relationships formally for all information, so unstructured data in the form of images, videos, mobile-generated information, RFID data and the like has to be handled by big data analytics.

3. Traditional analytics is batch oriented; we need to wait for ETL and transformation jobs to complete before the required insight is obtained. Big Data analytics aims at near-real-time analysis of the data using software designed for that purpose.

4. Parallelism in a traditional analytics system is achieved through costly hardware such as MPP (Massively Parallel Processing) and/or SMP systems. While appliances are available on the market for Big Data analytics, the same parallelism can also be achieved through commodity hardware and a new generation of analytical software such as Hadoop or other analytical databases.
III. OVERVIEW OF HADOOP
Hadoop is a widely preferred Big Data analysis tool for several reasons: it is inexpensive, being open-source software, and it is highly fault tolerant and effective at Big Data storage and processing. The Hadoop architecture performs the required functions through its two most vital components, MapReduce and the Hadoop Distributed File System (HDFS). Both components are essential to understanding how Hadoop handles Big Data.
A. Hadoop Architecture
Hadoop consists of the Hadoop Common
package, which provides filesystem and OS level
abstractions, a MapReduce engine (either
MapReduce/MR1 or YARN/MR2) and the Hadoop
Distributed File System (HDFS). The Hadoop
Common package contains the necessary Java
archive files and scripts needed to start Hadoop. The
package also provides source code, documentation
and a contribution section that includes projects from
the Hadoop Community.
A node in the Hadoop architecture is any end point of processing or storage activity. The storage system is divided, both physically and virtually, into racks (collections of smaller storage units). For effective
scheduling of work, every Hadoop-compatible file
system should provide location awareness, that is, the
name of the rack (more precisely, of the network
switch) where a worker node is.
A small Hadoop cluster includes a single
master and multiple worker nodes. The master node
consists of a JobTracker, TaskTracker, NameNode
and DataNode. A slave or worker node acts as both a
DataNode and TaskTracker, though it is possible to
have data-only worker nodes and compute-only
worker nodes. Hadoop requires Java Runtime
Environment (JRE) 1.6 or higher. The standard start-up and
shutdown scripts require Secure Shell (SSH) to be set
up between nodes in the cluster.
In a larger cluster, the HDFS is managed
through a dedicated NameNode server to host the file
system index, and a secondary NameNode that can
generate snapshots of the NameNode’s memory
structures, thus preventing file-system corruption and
reducing loss of data. Similarly, a standalone
JobTracker server can manage job scheduling. In
clusters where the Hadoop MapReduce engine is
deployed against an alternate file system, the
NameNode, secondary NameNode and DataNode
architecture of HDFS is replaced by the file-system-
specific equivalent.
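As a rough illustration of how a client locates the NameNode described above, the following minimal sketch uses the public Hadoop Java API to obtain a FileSystem handle; the NameNode address is a hypothetical placeholder, not a value taken from this paper.

```java
import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;

public class HdfsClient {
    public static void main(String[] args) throws IOException {
        Configuration conf = new Configuration();
        // Point the client at the cluster's NameNode (hypothetical address).
        conf.set("fs.defaultFS", "hdfs://namenode-host:9000");
        FileSystem fs = FileSystem.get(conf);
        System.out.println("Connected to: " + fs.getUri());
        fs.close();
    }
}
```

In a real deployment these properties would normally come from the cluster's configuration files rather than being set in code; the sketch only shows the connection pattern.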
B. Hadoop Distributed File System (HDFS)
HDFS is Hadoop's file system for handling large data sets. It stores data across a number of small repositories built from simple commodity hardware that both stores and processes it. Each such storage and processing unit is a node, and together these nodes form a cluster over which the distributed file system is spread. This file system, where all the data resides, provides scalable, fault-tolerant, redundant storage.
Hadoop uses the strategy of bringing the processing code to the data, thus reducing large data transfers over the distributed network. The data, whether structured or unstructured, is split across the network and each part is processed separately. All data is replicated, and every change in the system is recorded. HDFS is a network of racks of low-cost storage hardware, where each rack holds smaller storage units on which data is stored as blocks (64 megabytes by default in early Hadoop versions). The racks use TCP/IP-based protocols and Remote Procedure Calls (RPC) to communicate with each other. HDFS implements 3-way replication: two copies of the data on the same rack and one more copy on a different rack. A single DataNode may hold one or many such blocks.
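To make the block and replication behaviour described above concrete, the sketch below writes a small file to HDFS and reads back its replication factor and block size through the standard FileSystem API; the NameNode address and file path are illustrative assumptions.

```java
import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class ReplicationCheck {
    public static void main(String[] args) throws IOException {
        Configuration conf = new Configuration();
        conf.set("fs.defaultFS", "hdfs://namenode-host:9000"); // hypothetical NameNode
        FileSystem fs = FileSystem.get(conf);

        // Write a small file; HDFS splits files into blocks and replicates them.
        Path file = new Path("/tmp/replication-demo.txt");      // hypothetical path
        FSDataOutputStream out = fs.create(file, true);
        out.writeUTF("sample record");
        out.close();

        // Inspect how the cluster stored it.
        FileStatus status = fs.getFileStatus(file);
        System.out.println("Replication factor: " + status.getReplication());
        System.out.println("Block size (bytes): " + status.getBlockSize());
        fs.close();
    }
}
```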
1) Architecture
The HDFS architecture is organized into namespaces, each managed by a NameNode that is connected to a set of DataNodes. It is a master-slave architecture in which the NameNode acts as the master controlling the DataNode slaves. Both are pieces of software that enable recognition and manipulation of data in HDFS.
The different components and functions are
listed as follows:
NameNode: A master server that manages
the file system namespace and regulates access to
files by clients. It manages data through DataNodes.
It executes file system namespace operations like
opening, closing, and renaming files and directories.
DataNode: It manages storage attached to the node that it runs on. A file is split into one or more blocks, and these blocks are stored in a set of DataNodes. A DataNode serves read and write requests from the file system's clients and performs block creation, deletion and replication upon instruction from the NameNode.
HeartBeat: Receipt of a Heartbeat implies
that the DataNode is functioning properly. Using
HeartBeats, the system knows which nodes are
running and where failure may have occurred. This
helps in real time process tracking and allows counter
measures to be applied to maintain throughput.
BlockReport: A BlockReport contains a list
of all blocks on a DataNode. BlockReport along with
HeartBeat keeps track of working nodes.
Safemode: On startup, the NameNode enters
a special state called Safemode. Replication of data
blocks does not occur when the NameNode is in the
Safemode state. The NameNode receives Heartbeat and BlockReport messages from the DataNodes, and once a sufficient fraction of blocks is confirmed as safely replicated, it exits Safemode and then replicates any remaining under-replicated blocks.
EditLog: The NameNode uses a transaction
log called the EditLog to persistently record every
change that occurs in file system metadata.
FsImage: The NameNode keeps an image of the file system namespace, called the FsImage, stored as a file in its local file system; it acts as a checkpoint. In case of a crash or malfunction, instead of rebuilding the namespace from scratch, the FsImage can be used to restore it quickly.
DFSAdmin: The DFSAdmin command set
is used for administering an HDFS cluster. These are
commands that are used only by an HDFS
administrator.
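As a brief illustration of the namespace operations the NameNode coordinates and the block storage the DataNodes provide, the following sketch exercises a few standard FileSystem calls (directory creation, rename, listing, delete); the paths and NameNode address are hypothetical.

```java
import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class NamespaceOps {
    public static void main(String[] args) throws IOException {
        Configuration conf = new Configuration();
        conf.set("fs.defaultFS", "hdfs://namenode-host:9000"); // hypothetical NameNode
        FileSystem fs = FileSystem.get(conf);

        Path dir = new Path("/demo/input");          // hypothetical directory
        fs.mkdirs(dir);                              // namespace operation: create directory
        fs.rename(dir, new Path("/demo/renamed"));   // namespace operation: rename

        // List entries under /demo and show their replication factors.
        for (FileStatus s : fs.listStatus(new Path("/demo"))) {
            System.out.println(s.getPath() + " replication=" + s.getReplication());
        }

        fs.delete(new Path("/demo"), true);          // recursive delete
        fs.close();
    }
}
```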
C. MapReduce
The MapReduce framework handles distributed parallel computation on Big Data data sets. It works on local blocks of data and is designed to reduce traffic on the network. It uses the JobTracker and TaskTracker services to handle parallel, batch-wise processing of Big Data.
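The canonical word-count job illustrates this model: the map phase emits a (word, 1) pair for every token read from local input blocks, and the reduce phase sums the counts for each word. The sketch below follows the standard Hadoop MapReduce Java API; input and output paths are passed on the command line.

```java
import java.io.IOException;
import java.util.StringTokenizer;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

    // Map: for each input line, emit (word, 1) for every token.
    public static class TokenizerMapper
            extends Mapper<LongWritable, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();

        @Override
        protected void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            StringTokenizer itr = new StringTokenizer(value.toString());
            while (itr.hasMoreTokens()) {
                word.set(itr.nextToken());
                context.write(word, ONE);
            }
        }
    }

    // Reduce: sum the counts emitted for each word.
    public static class IntSumReducer
            extends Reducer<Text, IntWritable, Text, IntWritable> {
        private final IntWritable result = new IntWritable();

        @Override
        protected void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable val : values) {
                sum += val.get();
            }
            result.set(sum);
            context.write(key, result);
        }
    }

    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Job job = Job.getInstance(conf, "word count");
        job.setJarByClass(WordCount.class);
        job.setMapperClass(TokenizerMapper.class);
        job.setCombinerClass(IntSumReducer.class);
        job.setReducerClass(IntSumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
```

Such a job would typically be packaged as a jar and submitted with the hadoop jar command against input already stored in HDFS; the JobTracker schedules map tasks on the nodes holding the relevant blocks.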
IV. VALIDATION OF DATA
Big Data presents challenges in all phases of data analytics. Testing Big Data systems is one of the major challenges faced by organizations because of a lack of knowledge about what to test and how much data should be tested. Since big data comprises a wide variety and large volume of data arriving at high velocity, organizations struggle to define and implement test strategies for structured and unstructured data validation.
Setting up an optimal test environment is essential for testing large volumes of data. Working with non-relational databases and performing non-functional testing are further challenges. These challenges lead to poor-quality data reaching production, delayed implementations and increased cost. Robust testing approaches need to be defined for validating structured and semi-structured data, and testing activities need to start early so that possible defects are identified early in the implementation life cycle, reducing the overall cost, time and effort spent on testing these heterogeneous data processing systems.
A. Challenges to Data Validation
While many organizations have well-defined IT
security policies, those policies mostly revolve
around trust and procedure, both of which can break
down and fail. A major complicating factor is that
nearly all physical forms of documentation and transactions have digitized versions, which means
organizations must validate duplicate sets of data and
be able to respond quickly to any invalid data set.
At the same time, cybercrime has emerged as a
very profitable business for criminals, underlining the growing need to validate and verify all kinds of data. Another area of concern is the increasing
dominance of mobile communications within most
organizations. This mobile dimension opens up
another deep fissure of insecurity, as the myriad types
of mobile devices and mobile networks can create
new avenues for malicious practices.
To test and validate the data to be processed in big data systems, a test automation tool is needed that can validate the data in both the RDBMS and the big data system, so that only correct data is processed, accurate results are obtained, and the data is demonstrably tested and validated.
B. Suggested system for Data Validation
Thus, we suggest building a utility that fetches data from both the RDBMS and the Big Data system (Hadoop) and compares them using a comparison engine. The engine validates whether records from an RDBMS such as a SQL Server database and records stored in Hadoop (HDFS) match each other, ensuring that there are no mismatched or extra records. A minimal sketch of such a comparison engine follows the list below.
• Through a user interface, the system extracts the data from these databases; a comparison engine then validates the records and generates appropriate test results and reports once the test cases have been executed.
• The reports will specify whether there are
any mismatched records, extra or duplicate
records, faulty records etc. The matched
records will be processed and carried
forward, thus ensuring that only the correct
data is being inserted into the Big data
systems.
• The detailed multi-level reports can be
accessed by different users of the system
depending on the access privileges. The
utility will provide detailed test result
summary and allow the tester/user to
perform different operations.
• This system ensures data validation and achieves a degree of automation. The utility fetches data from the SQL Server database and from Hadoop (HDFS) and allows the records to be compared in order to find mismatching records, remove any duplicate records and perform data validations on the records.
• The system validates data quality by employing a structured testing technique; using the utility, both structured and semi-structured database records can be compared and the data validation process automated.
• The system provides full control to the
administrator to access all the files and data
and reports.
• Various analysis reports can be generated
for improving the process of data validation
and providing 100% test coverage.
• The system is scalable and can reliably
validate and process high volumes of data.
• The use of algorithms for data validation
process and adaptability for varying
requirements makes the utility robust and
reliable.
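A minimal sketch of the comparison engine suggested above is given below. It assumes, purely for illustration, that each record carries a primary key, that the RDBMS is reachable over JDBC, and that the corresponding HDFS data is stored as comma-delimited text with the key in the first field; the connection string, table, column and path names are hypothetical placeholders.

```java
import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.nio.charset.StandardCharsets;
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;
import java.util.HashSet;
import java.util.Set;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class RecordComparator {
    public static void main(String[] args) throws Exception {
        // 1. Fetch primary keys from the RDBMS (hypothetical SQL Server connection).
        Set<String> rdbmsKeys = new HashSet<>();
        try (Connection con = DriverManager.getConnection(
                 "jdbc:sqlserver://dbhost;databaseName=sales", "user", "password");
             Statement st = con.createStatement();
             ResultSet rs = st.executeQuery("SELECT order_id FROM orders")) {
            while (rs.next()) {
                rdbmsKeys.add(rs.getString(1));
            }
        }

        // 2. Fetch the same keys from HDFS (hypothetical delimited text file).
        Set<String> hdfsKeys = new HashSet<>();
        Configuration conf = new Configuration();
        conf.set("fs.defaultFS", "hdfs://namenode-host:9000");
        try (FileSystem fs = FileSystem.get(conf);
             BufferedReader reader = new BufferedReader(new InputStreamReader(
                 fs.open(new Path("/data/orders/part-00000")), StandardCharsets.UTF_8))) {
            String line;
            while ((line = reader.readLine()) != null) {
                hdfsKeys.add(line.split(",")[0]);   // key is assumed to be the first field
            }
        }

        // 3. Compare the two key sets and report mismatches.
        Set<String> missingInHdfs = new HashSet<>(rdbmsKeys);
        missingInHdfs.removeAll(hdfsKeys);
        Set<String> extraInHdfs = new HashSet<>(hdfsKeys);
        extraInHdfs.removeAll(rdbmsKeys);

        System.out.println("Records missing in HDFS : " + missingInHdfs.size());
        System.out.println("Extra records in HDFS   : " + extraInHdfs.size());
    }
}
```

A full utility would extend this per-key comparison to field-level checks and duplicate detection, and would run the HDFS side as a MapReduce job for very large volumes.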
V. CONCLUSION
Our proposed system for data validation built on the Hadoop architecture allows big data to be handled efficiently. It uses MapReduce and HDFS to handle large amounts of data by distributing them across a cluster, providing features such as authentication, replication, data integrity and space reclamation.
Once the data has been successfully loaded into the cluster, the validation system uses a comparison engine to compare this stored data with the traditional relational database from which it was derived. This ensures that the data entering the big data system is authentic, integrated and consistent.
VI. FUTURE SCOPE
Having seen the importance of Big Data analysis in the current workplace, it is safe to assume that it will have many applications in the foreseeable future. Millions of terabytes of organizational data can be converted into useful, beneficial information using the concepts of HDFS and MapReduce in Hadoop. For example, with the right big data analytics platforms in place, an enterprise can boost sales, increase efficiency, and improve operations, customer service and risk management.
The scope of the application can be expanded by
making it available for larger and more complex data
sources. Large organizations can use an optimum test
environment like the one suggested above to
successfully store and validate data so as to benefit
the organization.