Big Data Analysis and Its Validation Using Hadoop
Ankeeta Battalwar
Department of Computer Engineering
Fr. C. Rodrigues Institute of Technology, Vashi
Navi Mumbai, India
Abstract: This paper introduces the concept of Big Data and Big Data analytics. It provides an overview of Big Data and of Big Data analysis using Hadoop, a popular analytical framework. It also deals with the validation of Big Data in the Hadoop system. The study of these concepts establishes a strong foundation on which a data validation system can be built. The proposed system intends to validate the data coming from various data sources before it enters the Big Data system, thereby storing only data that is atomic, consistent, integrated and relevant to the system.
I. INTRODUCTION
Big data is data that exceeds the processing capacity of conventional database systems. The data is too big, moves too fast, or does not fit the strictures of existing database architectures. To gain value from this data, an alternative way to process it must be chosen. The fundamentals of big data analysis can only be understood by becoming familiar with the term "Big Data".

Big data is the term for a collection of data sets so large and complex that it becomes difficult to process them using on-hand database management tools or traditional data processing applications. Big Data usually includes data sets with sizes beyond the ability of commonly used software tools to capture, curate, manage, and process within a tolerable elapsed time.

Big data analysis involves analysing data and carefully processing it so that it can be made available to customers. It covers the organization, administration and governance of large volumes of both structured and unstructured data. The goal of big data management is to ensure a high level of data quality and accessibility for business intelligence and big data analytics applications.
II. OVERVIEW OF BIG DATA
A. Tools Available for Big Data Analysis
Among the many tools available for Big Data analysis, a few popular ones are listed below:

Hadoop: Apache Hadoop is an open-source software framework for storage and large-scale processing of data sets on clusters of commodity hardware. Hadoop is an Apache top-level project being built and used by a global community of contributors and users. It is licensed under the Apache License 2.0.
MongoDB: MongoDB is a cross-platform document-oriented database system. Released under a combination of the GNU Affero General Public License and the Apache License, MongoDB is free and open-source software.

Splunk: Splunk captures, indexes and correlates real-time data in a searchable repository from which it can generate graphs, reports, alerts, dashboards and visualizations.
MapReduce: Originally developed by Google, the MapReduce website describes it as "a programming model and software framework for writing applications that rapidly process vast amounts of data in parallel on large clusters of compute nodes." It is used by Hadoop, as well as many other data processing applications. Operating System: OS Independent.

GridGain: GridGain offers an alternative to Hadoop's MapReduce that is compatible with the Hadoop Distributed File System. It offers in-memory processing for fast analysis of real-time data. Operating System: Windows, Linux, OS X.
Storm: Now owned by Twitter, Storm offers distributed real-time computation capabilities and is often described as the "Hadoop of real time." It is highly scalable, robust, fault-tolerant and works with nearly all programming languages. Operating System: Linux.

HBase: Another Apache project, HBase is the non-relational data store for Hadoop. Features include linear and modular scalability, strictly consistent reads and writes, automatic failover support and much more. Operating System: OS Independent.

Hibari: Used by many telecom companies, Hibari is a key-value, big data store with strong consistency, high availability and fast performance. Support is available through Gemini Mobile. Operating System: OS Independent.
B. Traditional Data v/s Big Data

The main differences between traditional data warehouse analytics and Big Data analytics are summarized below.

1. Traditional analytics analyzes a known and well-understood data terrain. Most data warehouses have elaborate ETL processes and database constraints, which means that the data loaded into a data warehouse is well understood, cleansed and in line with the business metadata. The biggest advantage of Big Data is that it targets unstructured data outside the traditional means of capturing data, so there is no guarantee that the incoming data is well formed, clean and devoid of errors. This makes the analysis more challenging, but at the same time it gives scope for much more insight into the data.

2. Traditional analytics is built on top of the relational data model: relationships between the subjects of interest are created inside the system and the analysis is done based on them. In the real world it is very difficult to establish relationships between all the information in a formal way, and hence unstructured data in the form of images, videos, mobile-generated information, RFID feeds and so on has to be considered in Big Data analytics.

3. Traditional analytics is batch oriented: one needs to wait for ETL and transformation jobs to complete before the required insight is obtained. Big Data analytics is aimed at near-real-time analysis of the data, using software designed for that purpose.

4. Parallelism in a traditional analytics system is achieved through costly hardware such as MPP (Massively Parallel Processing) and/or SMP systems. While there are appliances in the market for Big Data analytics, the same parallelism can also be achieved through commodity hardware and a new generation of analytical software such as Hadoop or other analytical databases.
III. OVERVIEW OF HADOOP
Hadoop is the widely preferred Big Data analytical tool for many reasons, ranging from its low cost as open-source software to its fault tolerance and its effectiveness at Big Data storage and processing. The Hadoop architecture performs the required functions with the help of two of its most vital components: MapReduce and the Hadoop Distributed File System (HDFS). Both components are essential to understanding how Hadoop handles Big Data.
A. Hadoop Architecture
Hadoop consists of the Hadoop Common package, which provides filesystem and OS-level abstractions, a MapReduce engine (either MapReduce/MR1 or YARN/MR2) and the Hadoop Distributed File System (HDFS). The Hadoop Common package contains the necessary Java archive files and scripts needed to start Hadoop. The package also provides source code, documentation and a contribution section that includes projects from the Hadoop community.

A node in the Hadoop architecture is any end point of processing or storage activity. The storage system is divided, physically and also virtually, into racks (collections of smaller storage units). For effective scheduling of work, every Hadoop-compatible file system should provide location awareness, that is, the name of the rack (more precisely, of the network switch) where a worker node is located.
A small Hadoop cluster includes a single master and multiple worker nodes. The master node consists of a JobTracker, TaskTracker, NameNode and DataNode. A slave or worker node acts as both a DataNode and TaskTracker, though it is possible to have data-only worker nodes and compute-only worker nodes. Hadoop requires Java Runtime Environment 1.6 or higher. The standard start-up and shutdown scripts require Secure Shell (SSH) to be set up between nodes in the cluster.
In a larger cluster, HDFS is managed through a dedicated NameNode server that hosts the file system index, and a secondary NameNode that can generate snapshots of the NameNode's memory structures, thus preventing file-system corruption and reducing loss of data. Similarly, a standalone JobTracker server can manage job scheduling. In clusters where the Hadoop MapReduce engine is deployed against an alternate file system, the NameNode, secondary NameNode and DataNode architecture of HDFS is replaced by the file-system-specific equivalent.
B. Hadoop Distributed File System (HDFS)
HDFS is Hadoop's file system for handling large data sets. It stores data over a number of small storage units, using simple commonplace hardware to store and process it. Such a collection of storage and processing units is called a cluster, and across it a distributed file system is created. This file system, where all data resides, is scalable, fault-tolerant, redundant storage.

Hadoop uses the strategy of bringing the processing code to the data, thus reducing large data transfers over the distributed network. The data, be it structured or unstructured, is split over the network and each piece is processed separately. All data is replicated and every change in the system is constantly recorded. HDFS is a network of racks of low-cost storage hardware, where each rack holds smaller storage units (data blocks, typically around 64 megabytes) in which the data is stored. The racks use TCP/IP-based protocols and Remote Procedure Calls (RPC) to communicate with each other. HDFS implements 3-way replication: two copies of the data on the same rack and one more copy on some other rack. A single DataNode may hold one or many of these small storage blocks.
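To make the storage model concrete, the following is a minimal sketch, not part of the paper's proposed system, of writing a file to HDFS through the Java FileSystem client API with an explicit replication factor. The NameNode address and file path are illustrative assumptions.

// Minimal sketch: writing a file to HDFS with 3-way replication.
// The NameNode address and paths below are illustrative assumptions.
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsWriteExample {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        conf.set("fs.defaultFS", "hdfs://namenode.example.com:8020"); // hypothetical NameNode endpoint
        conf.set("dfs.replication", "3"); // three copies, matching the 3-way replication described above

        try (FileSystem fs = FileSystem.get(conf)) {
            Path file = new Path("/user/demo/sample.txt");
            try (FSDataOutputStream out = fs.create(file)) {
                out.writeBytes("hello hdfs\n"); // data is split into blocks and replicated across racks
            }
            // Confirm the replication factor recorded by the NameNode for this file.
            System.out.println("Replication: " + fs.getFileStatus(file).getReplication());
        }
    }
}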
1) Architecture
The architecture of HDFS is organized around a file system namespace maintained by NameNodes, which in turn coordinate the DataNodes. It is a master-slave architecture in which a NameNode acts as the master controlling the DataNode slaves. Both are pieces of software that enable recognition and manipulation of data in HDFS.

The different components and their functions are listed as follows:
NameNode: A master server that manages the file system namespace and regulates access to files by clients. It manages data through the DataNodes and executes file system namespace operations such as opening, closing, and renaming files and directories.

DataNode: Manages the storage attached to the node it runs on. A file is split into one or more blocks, and these blocks are stored in a set of DataNodes. A DataNode serves read and write requests and performs block creation, deletion, and replication upon instruction from the NameNode.
HeartBeat: Receipt of a Heartbeat implies that a DataNode is functioning properly. Using Heartbeats, the system knows which nodes are running and where a failure may have occurred. This helps with real-time process tracking and allows counter-measures to be applied to maintain throughput.

BlockReport: A BlockReport contains a list of all blocks on a DataNode. BlockReports, along with Heartbeats, keep track of working nodes.

Safemode: On startup, the NameNode enters a special state called Safemode. Replication of data blocks does not occur while the NameNode is in the Safemode state. The NameNode receives Heartbeat and Blockreport messages from the DataNodes and then starts replication; once replication is under way, Safemode is exited.
EditLog: The NameNode uses a transaction log called the EditLog to persistently record every change that occurs in the file system metadata.

FsImage: The NameNode creates a snapshot of the file system, called the FsImage. This is stored as a file in the system itself and acts as a checkpoint. In case of a crash or malfunction, instead of rebuilding the system from scratch, the FsImage can be used to recreate the system quickly.

DFSAdmin: The DFSAdmin command set is used for administering an HDFS cluster. These commands are used only by an HDFS administrator.
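As a hedged illustration of the namespace operations described above (creating, renaming, listing and deleting files and directories), the following sketch uses the Hadoop FileSystem client API; the directory and file paths are assumptions made for the example.

// Sketch of HDFS namespace operations via the FileSystem API; paths are illustrative.
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class NamespaceOperations {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration(); // picks up core-site.xml / hdfs-site.xml if present
        try (FileSystem fs = FileSystem.get(conf)) {
            Path dir = new Path("/user/demo/reports");
            fs.mkdirs(dir);                                        // create a directory (recorded in the EditLog)
            fs.rename(new Path("/user/demo/raw.txt"),
                      new Path("/user/demo/reports/raw.txt"));     // rename/move a file
            for (FileStatus status : fs.listStatus(dir)) {         // list the directory contents
                System.out.println(status.getPath() + " " + status.getLen() + " bytes");
            }
            fs.delete(new Path("/user/demo/obsolete.txt"), false); // delete a single file (non-recursive)
        }
    }
}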
C. MapReduce
The MapReduce framework handles distributed
parallel computation on the data sets of Big Data. It
works on local block of data and functions in a
manner to reduce traffic on the network. It uses
JobTracker and TaskTracker applications to handle
parallel and batch-wise processing of Big Data.
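To illustrate how a MapReduce job is expressed, the following is the canonical WordCount program written against the org.apache.hadoop.mapreduce Java API. It is a standard sketch for illustration only and is not part of the proposed validation system; input and output HDFS paths are supplied as command-line arguments.

// Canonical WordCount example: counts how often each word occurs in the input files.
import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

    // Map phase: emit (word, 1) for every token in the local block of input text.
    public static class TokenizerMapper extends Mapper<Object, Text, Text, IntWritable> {
        private final static IntWritable one = new IntWritable(1);
        private final Text word = new Text();

        public void map(Object key, Text value, Context context)
                throws IOException, InterruptedException {
            StringTokenizer itr = new StringTokenizer(value.toString());
            while (itr.hasMoreTokens()) {
                word.set(itr.nextToken());
                context.write(word, one);
            }
        }
    }

    // Reduce phase: sum the counts for each word.
    public static class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
        private final IntWritable result = new IntWritable();

        public void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable val : values) {
                sum += val.get();
            }
            result.set(sum);
            context.write(key, result);
        }
    }

    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Job job = Job.getInstance(conf, "word count");
        job.setJarByClass(WordCount.class);
        job.setMapperClass(TokenizerMapper.class);
        job.setCombinerClass(IntSumReducer.class);
        job.setReducerClass(IntSumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));   // input directory in HDFS
        FileOutputFormat.setOutputPath(job, new Path(args[1])); // output directory in HDFS
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}

Packaged into a JAR, such a job would typically be submitted with the hadoop jar command, passing the input and output HDFS paths as the two arguments.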
IV. VALIDATION OF DATA
Big Data presents challenges in all phases of data analytics. The testing of big data systems is one of the major challenges faced by organizations because of a lack of knowledge regarding what to test and how much data should be tested. Since big data consists of a wide variety and large volume of data arriving at a high rate, organizations face challenges in defining and implementing test strategies for structured and unstructured data validation.

Setting up an optimal test environment is essential for testing large volumes of data. Working with non-relational databases and performing non-functional testing are further challenges. These challenges lead to poor-quality data going into production, delayed implementation and increased cost. Robust testing approaches need to be defined for validating structured and semi-structured data, and testing activities should start early so that possible defects are identified early in the implementation life cycle, thereby reducing the overall cost, time and effort spent on testing these data processing systems.
A. Challenges to Data Validation
While many organizations have well-defined IT security policies, those policies mostly revolve around trust and procedure, both of which can break down and fail. A major complicating factor is that nearly all physical forms of documentation and transactions have digitized versions, which means organizations must validate duplicate sets of data and be able to respond quickly to any invalid data set.

At the same time, cybercrime has emerged as a very profitable business for criminals, underlining the growing need to validate and verify all kinds of data. Another area of concern is the increasing dominance of mobile communications within most organizations. This mobile dimension opens up another deep fissure of insecurity, as the myriad types of mobile devices and mobile networks can create new avenues for malicious practices.
For testing and validating the data to be processed in big data systems, we need a test automation tool that can validate the data in both the RDBMS and the big data system, so that only correct data is processed, accurate results are obtained, and the data is demonstrably tested and validated.
B. Suggested System for Data Validation
Thus, we suggest building a utility that will fetch data from the RDBMS and from the Big Data system (Hadoop) and compare it using a computing engine. The engine will validate whether the records from the RDBMS (for example, a SQL Server database) and the Hadoop (HDFS) store match each other, ensuring that there are no mismatched or extra records. A minimal sketch of such a comparison engine is shown after the list below.
• Using a user interface, the system extracts the data from these databases; a comparison engine then validates the records and generates appropriate test results and reports once the test cases have been executed.
• The reports specify whether there are any mismatched, extra, duplicate or otherwise faulty records. The matched records are processed and carried forward, thus ensuring that only correct data is inserted into the Big Data system.
• The detailed multi-level reports can be accessed by different users of the system depending on their access privileges. The utility provides a detailed test result summary and allows the tester/user to perform different operations.
• The system ensures data validation and helps achieve some level of automation. The utility fetches data from the SQL Server database and from Hadoop (HDFS) and compares the records to find mismatching records, remove duplicate records and perform data validations on the records.
• The system validates data quality by employing a structured testing technique; using the utility, both structured and semi-structured database records can be compared and the data validation process automated.
• The system provides full control to the administrator to access all files, data and reports.
• Various analysis reports can be generated to improve the data validation process and move toward complete test coverage.
• The system is scalable and can reliably validate and process high volumes of data.
• The use of algorithms for the data validation process and adaptability to varying requirements make the utility robust and reliable.
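A minimal sketch of the comparison engine described above is given below. The JDBC connection string, table name, HDFS path and record layout (a key followed by a comma-separated payload) are illustrative assumptions, not details prescribed by the proposed system.

// Hedged sketch of the proposed comparison engine. The connection string, table,
// HDFS path and record layout are assumptions made for illustration.
import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;
import java.util.HashMap;
import java.util.Map;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class RecordValidator {
    public static void main(String[] args) throws Exception {
        // 1. Fetch source records from the SQL Server database (key -> payload).
        Map<String, String> sqlRecords = new HashMap<>();
        try (Connection con = DriverManager.getConnection(
                 "jdbc:sqlserver://dbhost;databaseName=sales", "user", "password");
             Statement st = con.createStatement();
             ResultSet rs = st.executeQuery("SELECT id, payload FROM orders")) {
            while (rs.next()) {
                sqlRecords.put(rs.getString("id"), rs.getString("payload"));
            }
        }

        // 2. Fetch the corresponding records stored in HDFS (assumed CSV lines: id,payload).
        Map<String, String> hdfsRecords = new HashMap<>();
        Configuration conf = new Configuration();
        try (FileSystem fs = FileSystem.get(conf);
             BufferedReader reader = new BufferedReader(
                 new InputStreamReader(fs.open(new Path("/data/orders.csv"))))) {
            String line;
            while ((line = reader.readLine()) != null) {
                String[] parts = line.split(",", 2);
                if (parts.length < 2) {
                    System.out.println("Faulty record in HDFS: " + line);
                } else if (hdfsRecords.put(parts[0], parts[1]) != null) {
                    System.out.println("Duplicate record in HDFS: " + parts[0]);
                }
            }
        }

        // 3. Compare both sides and report mismatched, missing and extra records.
        for (Map.Entry<String, String> e : sqlRecords.entrySet()) {
            String hdfsValue = hdfsRecords.remove(e.getKey());
            if (hdfsValue == null) {
                System.out.println("Missing in HDFS: " + e.getKey());
            } else if (!hdfsValue.equals(e.getValue())) {
                System.out.println("Mismatched record: " + e.getKey());
            }
        }
        for (String extraKey : hdfsRecords.keySet()) {
            System.out.println("Extra record in HDFS: " + extraKey);
        }
    }
}

Matched records would then be carried forward into the Big Data system, while the reported mismatches, duplicates and extras feed the multi-level reports described above.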
V. CONCLUSION

Our proposed system for data validation using the Hadoop architecture allows big data to be handled efficiently. It uses the concepts of MapReduce and HDFS to handle large amounts of data by grouping them into clusters, thus providing features such as authentication, replication, data integrity and space reclamation.

Once the data is successfully segregated into clusters, the validation system uses a computing engine to compare this stored data with the traditional relational database from which the data was derived. This ensures that the data entering the big data system is authentic, integrated and consistent.
VI. FUTURE SCOPE

Having seen the importance of big data analysis in the current workplace, it is safe to assume that it has many applications in the foreseeable future. Millions of terabytes of organizational data can be converted into useful and beneficial information using the concepts of HDFS and MapReduce in Hadoop. For example, with the right big data analytics platforms in place, an enterprise can boost sales, increase efficiency, and improve operations, customer service and risk management.

The scope of the application can be expanded by making it available for larger and more complex data sources. Large organizations can use an optimal test environment like the one suggested above to successfully store and validate data to the benefit of the organization.