Harnessing Hadoop: Understanding the Big Data Processing Options for Optimizing Analytical Workloads

As Hadoop's ecosystem evolves, new open source tools and capabilities are emerging that can help IT leaders build more capable and business-relevant big data analytics environments.

Cognizant 20-20 Insights | November 2013
Executive Summary
Asserting that data is vital to business is an understatement. Organizations have generated more and more data for years, yet they struggle to use it effectively; clearly, data has more important uses than merely ensuring compliance with regulatory requirements. In addition, data is being generated with ever greater velocity, due to the advent of pervasive devices (e.g., smartphones and tablets), social Web sites (e.g., Facebook, Twitter, LinkedIn) and other sources such as GPS, Google Maps and heat/pressure sensors.
Moreover, a plethora of bits and bytes is being created by data warehouses, repositories, backups and archives. IT leaders need to work with the lines of business to take advantage of the wealth of insight that can be mined from these repositories. This search for data patterns resembles the oil and gas industry's search for precious resources; similar business value can be uncovered if organizations improve how raw, "dirty" data is processed and refined into key business decisions.
According to IDC1 and CenturyLink,2 our planet now holds 2.75 zettabytes (ZB) of data, a figure likely to reach 8ZB by 2015. This situation has given rise to the buzzword "big data."
Today, many organizations are applying big data analytics to challenges such as fraud detection, risk/threat analysis, sentiment analysis, equities analysis, weather forecasting, product recommendations and classification. All of these activities require a platform that can process this data and deliver useful (often predictive) insights.
This white paper discusses how to leverage a big
data analytics platform, dissecting its capabilities,
limitations and vendor choices.
Big Data’s Ground Zero
Apache Hadoop3 is a Java-based open source platform that makes it possible to process very large data sets across thousands of distributed nodes. Hadoop's development resulted from the publication of two Google-authored papers (on the Google File System and Google MapReduce). There are numerous vendor-specific distributions available
based on Apache Hadoop, from companies such
as Cloudera (CDH3/4), Hortonworks (1.x/2.x) and
MapR (M3/M5). In addition, appliance-based
solutions are offered by many vendors, such as
IBM, EMC, Oracle, etc.
Figure 1 illustrates Apache Hadoop’s two base
components.
The Hadoop Distributed File System (HDFS) is the base component of the Hadoop framework that manages data storage. It stores data in the form of data blocks (default size: 64MB) on the local hard disks of the cluster's nodes. The size and shape of the input data should guide the block size used in the cluster; for large file sets, a block size of 128MB is a good choice. Keeping block sizes large means fewer blocks need to be tracked, thereby minimizing the memory required on the master node (commonly known as the NameNode) to store the metadata. Block size also affects job execution time, and thereby cluster performance. A file can also be stored in HDFS with a non-default block size specified during the upload process. For files smaller than the block size, the Hadoop cluster may not perform optimally.
Example: A 512MB file requires eight blocks at the default 64MB block size, so the NameNode needs memory to track 25 items (1 filename + 8 blocks * 3 copies). The same file stored with a 128MB block size requires memory for just 13 items (1 + 4 * 3).
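To make the arithmetic above concrete, here is a minimal Python sketch of that counting model (one metadata item for the filename plus one per block replica); it illustrates the example's bookkeeping, not the NameNode's actual data structures.

```python
# Sketch of the NameNode metadata arithmetic from the example above.
# Model: 1 item for the filename + (number of blocks x replication factor).

def namenode_items(file_size_mb, block_size_mb, replication=3):
    """Count metadata items the NameNode tracks for a single file."""
    blocks = -(-file_size_mb // block_size_mb)  # ceiling division
    return 1 + blocks * replication

print(namenode_items(512, 64))   # 25 items (1 + 8 * 3)
print(namenode_items(512, 128))  # 13 items (1 + 4 * 3)
```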
HDFS has built-in fault tolerance, thereby eliminating the need to set up a redundant array of inexpensive disks (RAID) on the data nodes. Data blocks stored in HDFS are automatically replicated across multiple data nodes according to the replication factor (default: 3), which determines how many copies of each block are created and distributed across other data nodes. The larger the replication factor, the better the cluster's read performance; on the other hand, more time is needed to replicate the blocks and more storage is consumed by the copies.
For example, storing a 1GB file requires around 4GB of raw capacity, due to the replication factor of 3 plus an additional 20% to 30% of space consumed during data processing.
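A back-of-the-envelope helper, sketched below under the same assumptions (replication factor of 3, 20% to 30% working space during processing), shows where the "around 4GB" figure comes from.

```python
# Rough raw-capacity estimate for a file stored in HDFS.
# raw = logical size x replication factor x (1 + processing overhead)

def raw_storage_gb(logical_gb, replication=3, overhead=0.30):
    return logical_gb * replication * (1 + overhead)

print(round(raw_storage_gb(1), 1))  # 3.9 -> roughly 4GB for a 1GB file
```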
Both block size and replication factor affect Hadoop cluster performance, and the right values for both parameters depend on the input data type and the application workload that will run on the cluster. In a development environment, the replication factor can be set lower (say, 1 or 2), but it is advisable to set a higher value if the data on the cluster is crucial. Keeping the default replication factor of 3 is a good choice from both a data redundancy and a cluster performance perspective.
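For illustration, both parameters live in hdfs-site.xml. The sketch below uses the Hadoop 1.x property names that match the MRv1 era discussed here; the values are examples rather than recommendations.

```xml
<!-- Example hdfs-site.xml fragment (Hadoop 1.x property names). -->
<configuration>
  <property>
    <name>dfs.block.size</name>
    <value>134217728</value>  <!-- 128MB block size, in bytes -->
  </property>
  <property>
    <name>dfs.replication</name>
    <value>3</value>  <!-- default number of copies per block -->
  </property>
</configuration>
<!-- A per-file block size can also be set at upload time, e.g.:
     hadoop fs -Ddfs.block.size=134217728 -put bigfile.dat /data/ -->
```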
Figure 2 depicts the NameNode and DataNodes that constitute the HDFS service.
MapReduce is the second base component of the Hadoop framework; it manages data processing on the Hadoop cluster. A cluster runs MapReduce jobs written in Java, or in other languages (Python, Ruby, R, etc.) via Hadoop Streaming. A single job may consist of multiple tasks, and the number of tasks that can run on a data node is governed by the amount of memory installed; having more memory on each data node generally yields better cluster performance. Depending on the application workload that will run on the cluster, a data node may require high compute power, high disk I/O or additional memory.
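To show what the streaming option looks like in practice, below is a minimal word-count job written as a single Python script usable as both mapper and reducer under Hadoop Streaming; the input/output paths and the streaming jar location are placeholders that vary by distribution.

```python
#!/usr/bin/env python
# wordcount_streaming.py -- a minimal Hadoop Streaming job in Python.
# Run with "mapper" or "reducer" as the first argument; Streaming feeds
# records on stdin and expects tab-separated key/value pairs on stdout.
import sys

def mapper():
    for line in sys.stdin:
        for word in line.split():
            print("%s\t%d" % (word, 1))

def reducer():
    # Streaming delivers mapper output sorted by key, so counts for the
    # same word arrive contiguously and can be summed in one pass.
    current, total = None, 0
    for line in sys.stdin:
        word, count = line.rstrip("\n").split("\t")
        if word != current and current is not None:
            print("%s\t%d" % (current, total))
            total = 0
        current = word
        total += int(count)
    if current is not None:
        print("%s\t%d" % (current, total))

if __name__ == "__main__":
    mapper() if sys.argv[1] == "mapper" else reducer()

# Example submission (jar path varies by distribution):
# hadoop jar $HADOOP_HOME/contrib/streaming/hadoop-streaming-*.jar \
#   -mapper "python wordcount_streaming.py mapper" \
#   -reducer "python wordcount_streaming.py reducer" \
#   -file wordcount_streaming.py -input /data/in -output /data/out
```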
Figure 1: Hadoop Base Components. MapReduce (distributed programming framework) runs on top of HDFS (Hadoop Distributed File System).
Job execution is handled by the JobTracker and TaskTracker services (see Figure 2). The type of job scheduler chosen for a cluster depends on the application workload: if the cluster will mostly run short-running jobs, the FIFO scheduler will serve the purpose; for a more balanced cluster, one can choose the Fair Scheduler. In MRv1, there is no high availability (HA) for the JobTracker service. If the node running the JobTracker fails, all jobs submitted to the cluster are discontinued and must be resubmitted once the node is recovered or another node is made available. This issue has been fixed in the newer release, MRv2, also known as Yet Another Resource Negotiator (YARN).4
Figure 2 also depicts the interaction among these
five service components.
The Secondary NameNode plays the vital role of checkpointing: it off-loads from the NameNode the work of creating and regularly updating the metadata image. It is recommended to keep its machine configuration the same as the NameNode's, so that in an emergency this node can be promoted to NameNode.
The Hadoop Ecosystem
Hadoop's base components are surrounded by numerous ecosystem5 projects. Many of them are already top-level Apache (open source) projects; some, however, remain in the incubation stage. Figure 3 highlights a few of the ecosystem components. The most common ones include:
• Pig provides a high-level procedural language known as Pig Latin. Programs written in Pig Latin are automatically converted into MapReduce jobs. Pig is well suited to situations where developers are not familiar with languages such as Java but have strong scripting knowledge and expertise; they can also create their own custom functions. It is predominantly recommended for processing large amounts of semi-structured data (see the sketch after this list).
• Hive provides a SQL-like language known as HiveQL (Hive query language). Analysts can run SQL-like queries against data stored in HDFS, and the queries are automatically converted into MapReduce jobs. Hive is recommended where there are large amounts of unstructured data from which particular details must be fetched without writing MapReduce jobs. Tools from companies like Revolution Analytics6 can leverage Hadoop clusters running Hive via RHadoop connectors; RHadoop provides connectors for HDFS, MapReduce and HBase.
• HBase is a NoSQL database that offers a scalable and distributed database service. However, it is not optimized for transactional applications or relational analytics; be aware that HBase is not a replacement for a relational database. A cluster with HBase installed requires an additional 24GB to 32GB of memory on each data node, and it is recommended that data nodes running the HBase RegionServer run a smaller number of MapReduce tasks.
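As a taste of the first two components, here is a short Pig Latin sketch that counts hits per URL in a tab-separated web log; the path, schema and field names are invented for illustration. Hive would express the same aggregation as a single HiveQL statement (roughly, SELECT url, COUNT(*) FROM weblogs GROUP BY url) over a table mapped onto the files in HDFS.

```pig
-- Count hits per URL from a tab-separated web log (illustrative schema).
logs   = LOAD '/data/weblogs' USING PigStorage('\t')
         AS (ip:chararray, url:chararray, bytes:long);
by_url = GROUP logs BY url;
counts = FOREACH by_url GENERATE group AS url, COUNT(logs) AS hits;
STORE counts INTO '/data/url_counts';
```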
Figure 2: Hadoop Services Schematic. The JobTracker coordinates TaskTrackers on the MapReduce (processing) side, while the NameNode, Secondary NameNode and DataNodes make up the HDFS (storage) side.
Hadoop clusters have mainly been used for R&D-related activities, though many enterprises are now running Hadoop with production workloads. The ecosystem components required depend upon enterprise needs. Enterprises with large data warehouses normally opt for production-ready tools from proven vendors such as IBM, Oracle and EMC. Also, not every enterprise requires a Hadoop cluster; a cluster is mainly needed by business units that are developing Hadoop-based solutions or running analytics workloads. Business units that are just entering analytics may start small by building a 5-10 node cluster.
As shown in Figure 4, a Hadoop cluster can scale from just a few data nodes to thousands as the need arises or data grows. Hadoop clusters exhibit linear scalability, in both performance and storage capacity, as the node count increases. Apache Hadoop runs primarily on Intel x86/x86_64-based architectures.
Clusters must be designed with a particular workload in mind (e.g., data-intensive loads, processor-intensive loads, mixed loads, real-time clusters, etc.). As there are many Hadoop distributions available, the choice of implementation depends upon feature availability, licensing cost, performance reports, hardware requirements, skills available within the enterprise and the workload requirements.
Per a comScore report,7 MapR performs better than Cloudera and works out to be less costly. Hortonworks, on the other hand, has a strong relationship with Microsoft; its distribution is also available for the Microsoft platform (HDP v1.3 for Windows), which makes it popular within the Microsoft community.
Hadoop Appliance
An alternative approach to building a Hadoop cluster is to use a Hadoop appliance: a customizable, ready-to-use complete stack from a vendor such as IBM (Netezza), Oracle (Big Data Appliance), EMC (Data Computing Appliance), HP (AppSystem) or Teradata (Aster). An appliance-based approach speeds go-to-market and results in an integrated solution from a single (and accountable) vendor.

These appliances come in different sizes and configurations, and every vendor uses a different Hadoop distribution. For example, Oracle Big Data is based on a Cloudera distribution; EMC uses MapR; and HP AppSystem can be configured with Cloudera, MapR or Hortonworks.
Figure 3: Scanning the Hadoop Project Ecosystem. Around the base components (MapReduce and HDFS) sit Pig (data flow), Hive (SQL), HBase (NoSQL store), Zookeeper (coordination), Ambari (system management), HCatalog (table/schema management), Avro (serialization) and Sqoop (CLI for data movement), all running atop an OS layer and an infrastructure layer of physical machines, virtual machines, storage and network.
These appliances are designed to solve a particular type of problem or with a particular workload in mind.
• IBM Netezza 1000 is designed for high-performance business intelligence and advanced analytics. It is a data warehouse appliance that can run complex and mixed workloads. In addition, it supports multiple languages (C/C++, Java, Python, FORTRAN), frameworks (Hadoop) and tools (IBM SPSS, R and SAS).

• The Teradata Aster appliance comes with 50 pre-built MapReduce functions optimized for graph, text and behavior/marketing analysis. It also comes with the Aster Database and Apache Hadoop to process structured, unstructured and semi-structured data. A full Hadoop cabinet includes two masters and 16 data nodes, which can deliver up to 152TB of user space.

• The HP AppSystem appliance comes with pre-configured HP ProLiant DL380e Gen8 servers running Red Hat Enterprise Linux 6.1 and a Cloudera distribution. Enterprises can use this appliance with HP Vertica Community Edition to run real-time analysis of structured data.
HDFS and Hadoop
HDFS is not the only platform for performing big data tasks. Vendors have emerged with solutions that offer a replacement for HDFS: for example, DataStax8 uses Cassandra; Amazon Web Services uses S3; IBM has the General Parallel File System (GPFS); EMC relies on Isilon; and MapR uses Direct NFS. (To learn more, read about these alternate technologies.)9 The EMC appliance uses Isilon scale-out NAS technology along with HDFS, a combination that eliminates the need to export/import data into HDFS (see Figure 5). IBM uses GPFS instead of HDFS in its solution, after finding that GPFS performed better than HDFS (see Figure 6).
Many IT shops are scrambling to build Hadoop clusters in their enterprises. The question is: what are they going to do with these clusters? Not every enterprise needs its own cluster; it all depends on the nature of the business and whether the cluster serves a legitimate and compelling business use case. For example, enterprises that process financial transactions would definitely require an analytics platform for fraud detection, portfolio analysis, churn analysis, etc.
Figure 4: Hadoop Cluster: An Anatomical View. Client (edge) nodes and Kerberos security connect through a network switch to the master nodes (NameNodes with manual/automatic failover), the JobTracker/Resource Manager (scheduler), a cluster manager (management and monitoring) and the data (slave) nodes.
Similarly, enterprises engaged in research work require large clusters for faster processing and large-scale data analysis.
Hadoop-as-a-Service
There are a couple of ways to leverage Hadoop clusters beyond building a new cluster or taking an appliance-based approach. A more cost-effective option is Amazon's10 Elastic MapReduce (EMR) service: enterprises can build a cluster on demand and pay only for the duration they use it. Amazon offers compute instances running open source Apache Hadoop as well as the MapR (M3/M5/M7) distributions.
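As a sketch of this on-demand model, the snippet below uses boto, the era's Python AWS library, to launch a transient EMR cluster that runs one streaming step. The S3 paths, names and instance settings are placeholders, and the calls reflect boto 2's EMR module as an illustration rather than a tested recipe.

```python
# Launch an on-demand EMR cluster with one Hadoop Streaming step (boto 2).
# All S3 paths, names and instance settings below are placeholders.
from boto.emr.connection import EmrConnection
from boto.emr.step import StreamingStep

conn = EmrConnection()  # AWS credentials picked up from the environment

step = StreamingStep(
    name='wordcount',
    mapper='s3://my-bucket/code/mapper.py',
    reducer='s3://my-bucket/code/reducer.py',
    input='s3://my-bucket/input/',
    output='s3://my-bucket/output/')

jobflow_id = conn.run_jobflow(
    name='on-demand-analytics',
    log_uri='s3://my-bucket/logs/',
    steps=[step],
    num_instances=5,                   # 1 master + 4 core nodes
    master_instance_type='m1.large',
    slave_instance_type='m1.large')

print('Started job flow: %s' % jobflow_id)  # cluster shuts down after the step
```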
Other vendors that offer "Hadoop-as-a-Service" include:

• Joyent, which offers Hadoop services based on the Hortonworks distribution.

• Skytap, which provides Hadoop services based on the Cloudera (CDH4) distribution.

• Mortar, which delivers Hadoop services based on the MapR distribution.
All of these vendors charge on a pay-per-use basis. New vendors will eventually emerge as market demand increases and more enterprises require additional analytics, business intelligence (BI) and data warehousing options. As the competitive stakes rise, we recommend comparing prices and capabilities to get the best price/performance for your big data environment.
Figure 5: EMC Subs Isilon for HDFS. Tools such as R (RHIPE), Pig, Mahout, Hive and HBase, together with the JobTracker/TaskTracker and NameNode/DataNode services, access Isilon storage over Ethernet through an HDFS-compatible file system (HCFS).
Figure 6: Benchmarking HDFS vs. GPFS. Execution times for the Postmark, Teragen, Grep, CacheTest and Terasort workloads, comparing HDFS with GPFS.
The Hadoop platform by itself can't do anything; enterprises need applications that can leverage it, and developing an analytics application that offers business value requires data scientists and data analysts. These applications can perform many different types of jobs, from pure analytical and BI jobs through data processing. Because Hadoop can run all of these workload types, it is becoming increasingly important to design an effective and efficient cluster using the latest technologies and frameworks.
Hadoop Competitors
The Hadoop framework provides a batch processing environment that works well with historical data in data warehouses and archives. But what if there is a need for real-time processing, as in fraud detection, bank transactions, etc.? Enterprises have begun assessing alternative approaches and technologies to solve this issue.

Storm is an open source project that helps in setting up an environment for real-time (stream) processing. There are also other options to consider, such as massively parallel processing (MPP) databases, the high performance computing cluster (HPCC) platform and Hadapt. All of these technologies, however, require a strong understanding of the workload that will be run. If the workload is processing-intensive, HPCC is generally the recommended choice: HPCC uses a declarative programming language that is extensible, homogeneous and complete, and its Open Data Model is not constrained by key-value pairs, so data processing can happen in parallel. To learn more, read HPCC vs. Hadoop.11 Massively parallel processing databases, which have been around since before MapReduce hit the market, use SQL as the interface to the system; SQL is a language commonly used by data scientists and analysts. MPP systems are mainly used in high-end data warehouse implementations.
Looking Ahead
Hadoop is a strong framework for running data-intensive jobs, but it is still evolving as the ecosystem projects around it mature and expand. Market demand is spurring new technological developments daily, and with these innovations come various options to choose from: building a new cluster, leveraging an appliance or making use of Hadoop-as-a-Service offerings.

Moreover, the market is filling with new companies, which is creating more competitive technologies and options to consider. For starters, enterprises need to make the right decision in choosing their Hadoop distribution. They can start small by picking a distribution and building a 5-10 node cluster to test its features, functionality and performance; most reputable vendors provide a free version of their distribution for evaluation purposes. We advise enterprises to test two or three vendor distributions against their business requirements before making any final decisions. IT decision-makers should also pay close attention to emerging competitors to position their organizations for impending technology enhancements.
Footnotes

1. IDC report: Worldwide Big Data Technology and Services 2012-2015 Forecast.

2. www.centurylink.com/business/artifacts/pdf/resources/big-data-defining-the-digital-deluge.pdf.

3. http://en.wikipedia.org/wiki/Hadoop.

4. http://hadoop.apache.org/docs/current/hadoop-yarn/hadoop-yarn-site/YARN.html.

5. http://pig.apache.org/; http://hive.apache.org/; http://hbase.apache.org; "Comparing Pig Latin and SQL for Constructing Data Processing Pipelines," developer.yahoo.com/blogs/hadoop/comparing-pig-latin-sql-constructing-data-processing-pipelines-444.html; "Process a Million Songs with Apache Pig," blog.cloudera.com/blog/2012/08/process-a-million-songs-with-apache-pig/.

6. www.revolutionanalytics.com/products/revolution-enterprise.php.

7. comScore report: Switching to MapR for Reliable Insights into Global Internet Consumer Behavior.

8. www.datastax.com/wp-content/uploads/2012/09/WP-DataStax-HDFSvsCFS.pdf.

9. gigaom.com/2012/07/11/because-hadoop-isnt-perfect-8-ways-to-replace-hdfs/.

10. aws.amazon.com/elasticmapreduce/pricing/.

11. hpccsystems.com/Why-HPCC/HPCC-vs-Hadoop.
About Cognizant
Cognizant (NASDAQ: CTSH) is a leading provider of information technology, consulting, and business process out-
sourcing services, dedicated to helping the world’s leading companies build stronger businesses. Headquartered in
Teaneck, New Jersey (U.S.), Cognizant combines a passion for client satisfaction, technology innovation, deep industry
and business process expertise, and a global, collaborative workforce that embodies the future of work. With over 50
delivery centers worldwide and approximately 164,300 employees as of June 30, 2013, Cognizant is a member of the
NASDAQ-100, the S&P 500, the Forbes Global 2000, and the Fortune 500 and is ranked among the top performing
and fastest growing companies in the world. Visit us online at www.cognizant.com or follow us on Twitter: Cognizant.
World Headquarters
500 Frank W. Burr Blvd.
Teaneck, NJ 07666 USA
Phone: +1 201 801 0233
Fax: +1 201 801 0243
Toll Free: +1 888 937 3277
Email: inquiry@cognizant.com
European Headquarters
1 Kingdom Street
Paddington Central
London W2 6BD
Phone: +44 (0) 20 7297 7600
Fax: +44 (0) 20 7121 0102
Email: infouk@cognizant.com
India Operations Headquarters
#5/535, Old Mahabalipuram Road
Okkiyam Pettai, Thoraipakkam
Chennai, 600 096 India
Phone: +91 (0) 44 4209 6000
Fax: +91 (0) 44 4209 6060
Email: inquiryindia@cognizant.com
© Copyright 2013, Cognizant. All rights reserved. No part of this document may be reproduced, stored in a retrieval system, transmitted in any form or by any
means, electronic, mechanical, photocopying, recording, or otherwise, without the express written permission from Cognizant. The information contained herein is
subject to change without notice. All other trademarks mentioned herein are the property of their respective owners.
About the Authors
Harish Chauhan is an Associate Director and leads new services development around big data and other emerging technologies for Cognizant's General Technology Office. He has over 21 years of experience and holds a bachelor of engineering degree in computer science. His areas of expertise include virtualization, cloud computing and distributed computing (Hadoop/HPC). He has contributed technical articles, co-authored an IBM Redbook on the Virtualization Engine and filed two patents in the area of virtualization. He can be reached at Harish.Chauhan@cognizant.com.
Jerald Murphy is Cognizant's Worldwide Practice Leader for Cognizant Business Consulting's IT Infrastructure Services Practice. He also leads Cognizant's Technology Office for Enterprise Computing, Cloud Services and Emerging Technology. Jerry has spent over 20 years in the computer industry, in various technical, marketing, sales, senior management and industry analyst positions. He is one of the industry's leading authorities on enterprise IT infrastructure design, security and management, and has spent 20-plus years conducting engineering research and helping to develop infrastructure and security strategies for many of the largest global companies and service providers. Jerry's areas of expertise include data center infrastructure and enterprise systems management; additional focus areas include managed services and IT outsourcing. Jerry has a B.S.E.E. from The United States Military Academy and a master's degree in electrical engineering from Princeton University. He can be reached at Jerald.Murphy@cognizant.com.
More Related Content

What's hot

Survey Paper on Big Data and Hadoop
Survey Paper on Big Data and HadoopSurvey Paper on Big Data and Hadoop
Survey Paper on Big Data and Hadoop
IRJET Journal
 
Introduction to Big Data and Hadoop using Local Standalone Mode
Introduction to Big Data and Hadoop using Local Standalone ModeIntroduction to Big Data and Hadoop using Local Standalone Mode
Introduction to Big Data and Hadoop using Local Standalone Mode
inventionjournals
 
Survey on Performance of Hadoop Map reduce Optimization Methods
Survey on Performance of Hadoop Map reduce Optimization MethodsSurvey on Performance of Hadoop Map reduce Optimization Methods
Survey on Performance of Hadoop Map reduce Optimization Methods
paperpublications3
 
Hadoop
HadoopHadoop
Hadoop
RittikaBaksi
 
Hadoop Seminar Report
Hadoop Seminar ReportHadoop Seminar Report
Hadoop Seminar Report
Atul Kushwaha
 
IRJET- Big Data-A Review Study with Comparitive Analysis of Hadoop
IRJET- Big Data-A Review Study with Comparitive Analysis of HadoopIRJET- Big Data-A Review Study with Comparitive Analysis of Hadoop
IRJET- Big Data-A Review Study with Comparitive Analysis of Hadoop
IRJET Journal
 
Managing Big data with Hadoop
Managing Big data with HadoopManaging Big data with Hadoop
Managing Big data with Hadoop
Nalini Mehta
 
Introduction to Hadoop Technology
Introduction to Hadoop TechnologyIntroduction to Hadoop Technology
Introduction to Hadoop Technology
Manish Borkar
 
Infrastructure Considerations for Analytical Workloads
Infrastructure Considerations for Analytical WorkloadsInfrastructure Considerations for Analytical Workloads
Infrastructure Considerations for Analytical Workloads
Cognizant
 
Hadoop Cluster Analysis and Assessment
Hadoop Cluster Analysis and AssessmentHadoop Cluster Analysis and Assessment
Report Hadoop Map Reduce
Report Hadoop Map ReduceReport Hadoop Map Reduce
Report Hadoop Map Reduce
Urvashi Kataria
 
Cppt
CpptCppt
Big data
Big dataBig data
Big data
Abilash Mavila
 
A data aware caching 2415
A data aware caching 2415A data aware caching 2415
A data aware caching 2415
SANTOSH WAYAL
 
Which NoSQL Database to Combine with Spark for Real Time Big Data Analytics?
Which NoSQL Database to Combine with Spark for Real Time Big Data Analytics?Which NoSQL Database to Combine with Spark for Real Time Big Data Analytics?
Which NoSQL Database to Combine with Spark for Real Time Big Data Analytics?
IJCSIS Research Publications
 

What's hot (16)

Survey Paper on Big Data and Hadoop
Survey Paper on Big Data and HadoopSurvey Paper on Big Data and Hadoop
Survey Paper on Big Data and Hadoop
 
Introduction to Big Data and Hadoop using Local Standalone Mode
Introduction to Big Data and Hadoop using Local Standalone ModeIntroduction to Big Data and Hadoop using Local Standalone Mode
Introduction to Big Data and Hadoop using Local Standalone Mode
 
Survey on Performance of Hadoop Map reduce Optimization Methods
Survey on Performance of Hadoop Map reduce Optimization MethodsSurvey on Performance of Hadoop Map reduce Optimization Methods
Survey on Performance of Hadoop Map reduce Optimization Methods
 
Seminar Report Vaibhav
Seminar Report VaibhavSeminar Report Vaibhav
Seminar Report Vaibhav
 
Hadoop
HadoopHadoop
Hadoop
 
Hadoop Seminar Report
Hadoop Seminar ReportHadoop Seminar Report
Hadoop Seminar Report
 
IRJET- Big Data-A Review Study with Comparitive Analysis of Hadoop
IRJET- Big Data-A Review Study with Comparitive Analysis of HadoopIRJET- Big Data-A Review Study with Comparitive Analysis of Hadoop
IRJET- Big Data-A Review Study with Comparitive Analysis of Hadoop
 
Managing Big data with Hadoop
Managing Big data with HadoopManaging Big data with Hadoop
Managing Big data with Hadoop
 
Introduction to Hadoop Technology
Introduction to Hadoop TechnologyIntroduction to Hadoop Technology
Introduction to Hadoop Technology
 
Infrastructure Considerations for Analytical Workloads
Infrastructure Considerations for Analytical WorkloadsInfrastructure Considerations for Analytical Workloads
Infrastructure Considerations for Analytical Workloads
 
Hadoop Cluster Analysis and Assessment
Hadoop Cluster Analysis and AssessmentHadoop Cluster Analysis and Assessment
Hadoop Cluster Analysis and Assessment
 
Report Hadoop Map Reduce
Report Hadoop Map ReduceReport Hadoop Map Reduce
Report Hadoop Map Reduce
 
Cppt
CpptCppt
Cppt
 
Big data
Big dataBig data
Big data
 
A data aware caching 2415
A data aware caching 2415A data aware caching 2415
A data aware caching 2415
 
Which NoSQL Database to Combine with Spark for Real Time Big Data Analytics?
Which NoSQL Database to Combine with Spark for Real Time Big Data Analytics?Which NoSQL Database to Combine with Spark for Real Time Big Data Analytics?
Which NoSQL Database to Combine with Spark for Real Time Big Data Analytics?
 

Viewers also liked

Though The Lens of an iPhone: San Francisco
Though The Lens of an iPhone: San FranciscoThough The Lens of an iPhone: San Francisco
Though The Lens of an iPhone: San Francisco
Paul Brown
 
Diabetes care
Diabetes careDiabetes care
Ryan mulvihill pretak-ap_proposal
Ryan mulvihill pretak-ap_proposalRyan mulvihill pretak-ap_proposal
Ryan mulvihill pretak-ap_proposal
Ryan Mulvihill-Pretak
 
แคปเจอร์การแปลงพิกัด Porj4js
แคปเจอร์การแปลงพิกัด Porj4jsแคปเจอร์การแปลงพิกัด Porj4js
แคปเจอร์การแปลงพิกัด Porj4js
SP Chang
 
PRESENTACIÓN ALTO AL CRIMEN
PRESENTACIÓN ALTO AL CRIMENPRESENTACIÓN ALTO AL CRIMEN
PRESENTACIÓN ALTO AL CRIMEN
Renzo Reggiardo
 
Barg e sehra www,pakistanipoint.com
Barg e sehra www,pakistanipoint.comBarg e sehra www,pakistanipoint.com
Barg e sehra www,pakistanipoint.com
akhtar_Salik
 
Actors analysis
Actors   analysisActors   analysis
Actors analysis
FroggyFresh2012
 
Unit Plan Powerpoint Revised
Unit Plan Powerpoint RevisedUnit Plan Powerpoint Revised
Unit Plan Powerpoint Revised
Carley2017
 
Multivibrator bistable.
Multivibrator  bistable.Multivibrator  bistable.
Multivibrator bistable.
1000agung
 
2012 Messages in the Media
2012 Messages in the Media2012 Messages in the Media
2012 Messages in the MediaChris Granicolo
 
Azaab e-deed by muhsin naqvi (systematic@itdunya.com) akif
Azaab e-deed by muhsin naqvi (systematic@itdunya.com) akifAzaab e-deed by muhsin naqvi (systematic@itdunya.com) akif
Azaab e-deed by muhsin naqvi (systematic@itdunya.com) akif
akhtar_Salik
 
Plomo
PlomoPlomo
Plomo
merygi95
 
Matrix Warehouse Referral
Matrix Warehouse ReferralMatrix Warehouse Referral
Matrix Warehouse ReferralElen Roos
 
Resume
ResumeResume
Processo do Meridiano de Greenwich
Processo do Meridiano de GreenwichProcesso do Meridiano de Greenwich
Processo do Meridiano de Greenwich
... ...
 
Papaya
Papaya Papaya

Viewers also liked (20)

Though The Lens of an iPhone: San Francisco
Though The Lens of an iPhone: San FranciscoThough The Lens of an iPhone: San Francisco
Though The Lens of an iPhone: San Francisco
 
Final Dissertation
Final DissertationFinal Dissertation
Final Dissertation
 
Diabetes care
Diabetes careDiabetes care
Diabetes care
 
Lista 4
  Lista 4  Lista 4
Lista 4
 
Ryan mulvihill pretak-ap_proposal
Ryan mulvihill pretak-ap_proposalRyan mulvihill pretak-ap_proposal
Ryan mulvihill pretak-ap_proposal
 
แคปเจอร์การแปลงพิกัด Porj4js
แคปเจอร์การแปลงพิกัด Porj4jsแคปเจอร์การแปลงพิกัด Porj4js
แคปเจอร์การแปลงพิกัด Porj4js
 
PRESENTACIÓN ALTO AL CRIMEN
PRESENTACIÓN ALTO AL CRIMENPRESENTACIÓN ALTO AL CRIMEN
PRESENTACIÓN ALTO AL CRIMEN
 
Barg e sehra www,pakistanipoint.com
Barg e sehra www,pakistanipoint.comBarg e sehra www,pakistanipoint.com
Barg e sehra www,pakistanipoint.com
 
Actors analysis
Actors   analysisActors   analysis
Actors analysis
 
ragy abd el sallam 2015
ragy abd el sallam 2015ragy abd el sallam 2015
ragy abd el sallam 2015
 
Unit Plan Powerpoint Revised
Unit Plan Powerpoint RevisedUnit Plan Powerpoint Revised
Unit Plan Powerpoint Revised
 
Multivibrator bistable.
Multivibrator  bistable.Multivibrator  bistable.
Multivibrator bistable.
 
2012 Messages in the Media
2012 Messages in the Media2012 Messages in the Media
2012 Messages in the Media
 
Azaab e-deed by muhsin naqvi (systematic@itdunya.com) akif
Azaab e-deed by muhsin naqvi (systematic@itdunya.com) akifAzaab e-deed by muhsin naqvi (systematic@itdunya.com) akif
Azaab e-deed by muhsin naqvi (systematic@itdunya.com) akif
 
Plomo
PlomoPlomo
Plomo
 
Matrix Warehouse Referral
Matrix Warehouse ReferralMatrix Warehouse Referral
Matrix Warehouse Referral
 
Resume
ResumeResume
Resume
 
Processo do Meridiano de Greenwich
Processo do Meridiano de GreenwichProcesso do Meridiano de Greenwich
Processo do Meridiano de Greenwich
 
Introduction
IntroductionIntroduction
Introduction
 
Papaya
Papaya Papaya
Papaya
 

Similar to Harnessing Hadoop and Big Data to Reduce Execution Times

G017143640
G017143640G017143640
G017143640
IOSR Journals
 
Design Issues and Challenges of Peer-to-Peer Video on Demand System
Design Issues and Challenges of Peer-to-Peer Video on Demand System Design Issues and Challenges of Peer-to-Peer Video on Demand System
Design Issues and Challenges of Peer-to-Peer Video on Demand System
cscpconf
 
Hadoop by kamran khan
Hadoop by kamran khanHadoop by kamran khan
Hadoop by kamran khan
KamranKhan587
 
Hadoop
HadoopHadoop
Hadoop
Ankit Prasad
 
Big data
Big dataBig data
Big data
revathireddyb
 
Big data
Big dataBig data
Big data
revathireddyb
 
project report on hadoop
project report on hadoopproject report on hadoop
project report on hadoop
Manoj Jangalva
 
Hadoop a Natural Choice for Data Intensive Log Processing
Hadoop a Natural Choice for Data Intensive Log ProcessingHadoop a Natural Choice for Data Intensive Log Processing
Hadoop a Natural Choice for Data Intensive Log Processing
Hitendra Kumar
 
2.1-HADOOP.pdf
2.1-HADOOP.pdf2.1-HADOOP.pdf
2.1-HADOOP.pdf
MarianJRuben
 
Bigdata and Hadoop Introduction
Bigdata and Hadoop IntroductionBigdata and Hadoop Introduction
Bigdata and Hadoop Introduction
umapavankumar kethavarapu
 
PERFORMANCE EVALUATION OF BIG DATA PROCESSING OF CLOAK-REDUCE
PERFORMANCE EVALUATION OF BIG DATA PROCESSING OF CLOAK-REDUCEPERFORMANCE EVALUATION OF BIG DATA PROCESSING OF CLOAK-REDUCE
PERFORMANCE EVALUATION OF BIG DATA PROCESSING OF CLOAK-REDUCE
ijdpsjournal
 
International Journal of Distributed and Parallel systems (IJDPS)
International Journal of Distributed and Parallel systems (IJDPS)International Journal of Distributed and Parallel systems (IJDPS)
International Journal of Distributed and Parallel systems (IJDPS)
ijdpsjournal
 
Overview of big data & hadoop v1
Overview of big data & hadoop   v1Overview of big data & hadoop   v1
Overview of big data & hadoop v1Thanh Nguyen
 
Understanding hadoop
Understanding hadoopUnderstanding hadoop
Understanding hadoop
RexRamos9
 
BDA Mod2@AzDOCUMENTS.in.pdf
BDA Mod2@AzDOCUMENTS.in.pdfBDA Mod2@AzDOCUMENTS.in.pdf
BDA Mod2@AzDOCUMENTS.in.pdf
KUMARRISHAV37
 
Cppt Hadoop
Cppt HadoopCppt Hadoop
Cppt Hadoop
chunkypandey12
 
Cppt
CpptCppt
Hadoop and Mapreduce Introduction
Hadoop and Mapreduce IntroductionHadoop and Mapreduce Introduction
Hadoop and Mapreduce Introduction
rajsandhu1989
 

Similar to Harnessing Hadoop and Big Data to Reduce Execution Times (20)

G017143640
G017143640G017143640
G017143640
 
Design Issues and Challenges of Peer-to-Peer Video on Demand System
Design Issues and Challenges of Peer-to-Peer Video on Demand System Design Issues and Challenges of Peer-to-Peer Video on Demand System
Design Issues and Challenges of Peer-to-Peer Video on Demand System
 
Hadoop by kamran khan
Hadoop by kamran khanHadoop by kamran khan
Hadoop by kamran khan
 
Hadoop
HadoopHadoop
Hadoop
 
hadoop
hadoophadoop
hadoop
 
Big data
Big dataBig data
Big data
 
Big data
Big dataBig data
Big data
 
project report on hadoop
project report on hadoopproject report on hadoop
project report on hadoop
 
Hadoop a Natural Choice for Data Intensive Log Processing
Hadoop a Natural Choice for Data Intensive Log ProcessingHadoop a Natural Choice for Data Intensive Log Processing
Hadoop a Natural Choice for Data Intensive Log Processing
 
2.1-HADOOP.pdf
2.1-HADOOP.pdf2.1-HADOOP.pdf
2.1-HADOOP.pdf
 
Bigdata and Hadoop Introduction
Bigdata and Hadoop IntroductionBigdata and Hadoop Introduction
Bigdata and Hadoop Introduction
 
PERFORMANCE EVALUATION OF BIG DATA PROCESSING OF CLOAK-REDUCE
PERFORMANCE EVALUATION OF BIG DATA PROCESSING OF CLOAK-REDUCEPERFORMANCE EVALUATION OF BIG DATA PROCESSING OF CLOAK-REDUCE
PERFORMANCE EVALUATION OF BIG DATA PROCESSING OF CLOAK-REDUCE
 
International Journal of Distributed and Parallel systems (IJDPS)
International Journal of Distributed and Parallel systems (IJDPS)International Journal of Distributed and Parallel systems (IJDPS)
International Journal of Distributed and Parallel systems (IJDPS)
 
Overview of big data & hadoop v1
Overview of big data & hadoop   v1Overview of big data & hadoop   v1
Overview of big data & hadoop v1
 
Understanding hadoop
Understanding hadoopUnderstanding hadoop
Understanding hadoop
 
Big Data & Hadoop
Big Data & HadoopBig Data & Hadoop
Big Data & Hadoop
 
BDA Mod2@AzDOCUMENTS.in.pdf
BDA Mod2@AzDOCUMENTS.in.pdfBDA Mod2@AzDOCUMENTS.in.pdf
BDA Mod2@AzDOCUMENTS.in.pdf
 
Cppt Hadoop
Cppt HadoopCppt Hadoop
Cppt Hadoop
 
Cppt
CpptCppt
Cppt
 
Hadoop and Mapreduce Introduction
Hadoop and Mapreduce IntroductionHadoop and Mapreduce Introduction
Hadoop and Mapreduce Introduction
 

Recently uploaded

一比一原版(CBU毕业证)卡普顿大学毕业证成绩单
一比一原版(CBU毕业证)卡普顿大学毕业证成绩单一比一原版(CBU毕业证)卡普顿大学毕业证成绩单
一比一原版(CBU毕业证)卡普顿大学毕业证成绩单
nscud
 
Ch03-Managing the Object-Oriented Information Systems Project a.pdf
Ch03-Managing the Object-Oriented Information Systems Project a.pdfCh03-Managing the Object-Oriented Information Systems Project a.pdf
Ch03-Managing the Object-Oriented Information Systems Project a.pdf
haila53
 
Best best suvichar in gujarati english meaning of this sentence as Silk road ...
Best best suvichar in gujarati english meaning of this sentence as Silk road ...Best best suvichar in gujarati english meaning of this sentence as Silk road ...
Best best suvichar in gujarati english meaning of this sentence as Silk road ...
AbhimanyuSinha9
 
Criminal IP - Threat Hunting Webinar.pdf
Criminal IP - Threat Hunting Webinar.pdfCriminal IP - Threat Hunting Webinar.pdf
Criminal IP - Threat Hunting Webinar.pdf
Criminal IP
 
一比一原版(NYU毕业证)纽约大学毕业证成绩单
一比一原版(NYU毕业证)纽约大学毕业证成绩单一比一原版(NYU毕业证)纽约大学毕业证成绩单
一比一原版(NYU毕业证)纽约大学毕业证成绩单
ewymefz
 
Chatty Kathy - UNC Bootcamp Final Project Presentation - Final Version - 5.23...
Chatty Kathy - UNC Bootcamp Final Project Presentation - Final Version - 5.23...Chatty Kathy - UNC Bootcamp Final Project Presentation - Final Version - 5.23...
Chatty Kathy - UNC Bootcamp Final Project Presentation - Final Version - 5.23...
John Andrews
 
Jpolillo Amazon PPC - Bid Optimization Sample
Jpolillo Amazon PPC - Bid Optimization SampleJpolillo Amazon PPC - Bid Optimization Sample
Jpolillo Amazon PPC - Bid Optimization Sample
James Polillo
 
standardisation of garbhpala offhgfffghh
standardisation of garbhpala offhgfffghhstandardisation of garbhpala offhgfffghh
standardisation of garbhpala offhgfffghh
ArpitMalhotra16
 
Investigate & Recover / StarCompliance.io / Crypto_Crimes
Investigate & Recover / StarCompliance.io / Crypto_CrimesInvestigate & Recover / StarCompliance.io / Crypto_Crimes
Investigate & Recover / StarCompliance.io / Crypto_Crimes
StarCompliance.io
 
一比一原版(QU毕业证)皇后大学毕业证成绩单
一比一原版(QU毕业证)皇后大学毕业证成绩单一比一原版(QU毕业证)皇后大学毕业证成绩单
一比一原版(QU毕业证)皇后大学毕业证成绩单
enxupq
 
一比一原版(ArtEZ毕业证)ArtEZ艺术学院毕业证成绩单
一比一原版(ArtEZ毕业证)ArtEZ艺术学院毕业证成绩单一比一原版(ArtEZ毕业证)ArtEZ艺术学院毕业证成绩单
一比一原版(ArtEZ毕业证)ArtEZ艺术学院毕业证成绩单
vcaxypu
 
Levelwise PageRank with Loop-Based Dead End Handling Strategy : SHORT REPORT ...
Levelwise PageRank with Loop-Based Dead End Handling Strategy : SHORT REPORT ...Levelwise PageRank with Loop-Based Dead End Handling Strategy : SHORT REPORT ...
Levelwise PageRank with Loop-Based Dead End Handling Strategy : SHORT REPORT ...
Subhajit Sahu
 
一比一原版(CBU毕业证)不列颠海角大学毕业证成绩单
一比一原版(CBU毕业证)不列颠海角大学毕业证成绩单一比一原版(CBU毕业证)不列颠海角大学毕业证成绩单
一比一原版(CBU毕业证)不列颠海角大学毕业证成绩单
nscud
 
一比一原版(YU毕业证)约克大学毕业证成绩单
一比一原版(YU毕业证)约克大学毕业证成绩单一比一原版(YU毕业证)约克大学毕业证成绩单
一比一原版(YU毕业证)约克大学毕业证成绩单
enxupq
 
Innovative Methods in Media and Communication Research by Sebastian Kubitschk...
Innovative Methods in Media and Communication Research by Sebastian Kubitschk...Innovative Methods in Media and Communication Research by Sebastian Kubitschk...
Innovative Methods in Media and Communication Research by Sebastian Kubitschk...
correoyaya
 
Q1’2024 Update: MYCI’s Leap Year Rebound
Q1’2024 Update: MYCI’s Leap Year ReboundQ1’2024 Update: MYCI’s Leap Year Rebound
Q1’2024 Update: MYCI’s Leap Year Rebound
Oppotus
 
Tabula.io Cheatsheet: automate your data workflows
Tabula.io Cheatsheet: automate your data workflowsTabula.io Cheatsheet: automate your data workflows
Tabula.io Cheatsheet: automate your data workflows
alex933524
 
一比一原版(RUG毕业证)格罗宁根大学毕业证成绩单
一比一原版(RUG毕业证)格罗宁根大学毕业证成绩单一比一原版(RUG毕业证)格罗宁根大学毕业证成绩单
一比一原版(RUG毕业证)格罗宁根大学毕业证成绩单
vcaxypu
 
Empowering Data Analytics Ecosystem.pptx
Empowering Data Analytics Ecosystem.pptxEmpowering Data Analytics Ecosystem.pptx
Empowering Data Analytics Ecosystem.pptx
benishzehra469
 
1.Seydhcuxhxyxhccuuxuxyxyxmisolids 2019.pptx
1.Seydhcuxhxyxhccuuxuxyxyxmisolids 2019.pptx1.Seydhcuxhxyxhccuuxuxyxyxmisolids 2019.pptx
1.Seydhcuxhxyxhccuuxuxyxyxmisolids 2019.pptx
Tiktokethiodaily
 

Recently uploaded (20)

一比一原版(CBU毕业证)卡普顿大学毕业证成绩单
一比一原版(CBU毕业证)卡普顿大学毕业证成绩单一比一原版(CBU毕业证)卡普顿大学毕业证成绩单
一比一原版(CBU毕业证)卡普顿大学毕业证成绩单
 
Ch03-Managing the Object-Oriented Information Systems Project a.pdf
Ch03-Managing the Object-Oriented Information Systems Project a.pdfCh03-Managing the Object-Oriented Information Systems Project a.pdf
Ch03-Managing the Object-Oriented Information Systems Project a.pdf
 
Best best suvichar in gujarati english meaning of this sentence as Silk road ...
Best best suvichar in gujarati english meaning of this sentence as Silk road ...Best best suvichar in gujarati english meaning of this sentence as Silk road ...
Best best suvichar in gujarati english meaning of this sentence as Silk road ...
 
Criminal IP - Threat Hunting Webinar.pdf
Criminal IP - Threat Hunting Webinar.pdfCriminal IP - Threat Hunting Webinar.pdf
Criminal IP - Threat Hunting Webinar.pdf
 
一比一原版(NYU毕业证)纽约大学毕业证成绩单
一比一原版(NYU毕业证)纽约大学毕业证成绩单一比一原版(NYU毕业证)纽约大学毕业证成绩单
一比一原版(NYU毕业证)纽约大学毕业证成绩单
 
Chatty Kathy - UNC Bootcamp Final Project Presentation - Final Version - 5.23...
Chatty Kathy - UNC Bootcamp Final Project Presentation - Final Version - 5.23...Chatty Kathy - UNC Bootcamp Final Project Presentation - Final Version - 5.23...
Chatty Kathy - UNC Bootcamp Final Project Presentation - Final Version - 5.23...
 
Jpolillo Amazon PPC - Bid Optimization Sample
Jpolillo Amazon PPC - Bid Optimization SampleJpolillo Amazon PPC - Bid Optimization Sample
Jpolillo Amazon PPC - Bid Optimization Sample
 
standardisation of garbhpala offhgfffghh
standardisation of garbhpala offhgfffghhstandardisation of garbhpala offhgfffghh
standardisation of garbhpala offhgfffghh
 
Investigate & Recover / StarCompliance.io / Crypto_Crimes
Investigate & Recover / StarCompliance.io / Crypto_CrimesInvestigate & Recover / StarCompliance.io / Crypto_Crimes
Investigate & Recover / StarCompliance.io / Crypto_Crimes
 
一比一原版(QU毕业证)皇后大学毕业证成绩单
一比一原版(QU毕业证)皇后大学毕业证成绩单一比一原版(QU毕业证)皇后大学毕业证成绩单
一比一原版(QU毕业证)皇后大学毕业证成绩单
 
一比一原版(ArtEZ毕业证)ArtEZ艺术学院毕业证成绩单
一比一原版(ArtEZ毕业证)ArtEZ艺术学院毕业证成绩单一比一原版(ArtEZ毕业证)ArtEZ艺术学院毕业证成绩单
一比一原版(ArtEZ毕业证)ArtEZ艺术学院毕业证成绩单
 
Levelwise PageRank with Loop-Based Dead End Handling Strategy : SHORT REPORT ...
Levelwise PageRank with Loop-Based Dead End Handling Strategy : SHORT REPORT ...Levelwise PageRank with Loop-Based Dead End Handling Strategy : SHORT REPORT ...
Levelwise PageRank with Loop-Based Dead End Handling Strategy : SHORT REPORT ...
 
一比一原版(CBU毕业证)不列颠海角大学毕业证成绩单
一比一原版(CBU毕业证)不列颠海角大学毕业证成绩单一比一原版(CBU毕业证)不列颠海角大学毕业证成绩单
一比一原版(CBU毕业证)不列颠海角大学毕业证成绩单
 
一比一原版(YU毕业证)约克大学毕业证成绩单
一比一原版(YU毕业证)约克大学毕业证成绩单一比一原版(YU毕业证)约克大学毕业证成绩单
一比一原版(YU毕业证)约克大学毕业证成绩单
 
Innovative Methods in Media and Communication Research by Sebastian Kubitschk...
Innovative Methods in Media and Communication Research by Sebastian Kubitschk...Innovative Methods in Media and Communication Research by Sebastian Kubitschk...
Innovative Methods in Media and Communication Research by Sebastian Kubitschk...
 
Q1’2024 Update: MYCI’s Leap Year Rebound
Q1’2024 Update: MYCI’s Leap Year ReboundQ1’2024 Update: MYCI’s Leap Year Rebound
Q1’2024 Update: MYCI’s Leap Year Rebound
 
Tabula.io Cheatsheet: automate your data workflows
Tabula.io Cheatsheet: automate your data workflowsTabula.io Cheatsheet: automate your data workflows
Tabula.io Cheatsheet: automate your data workflows
 
一比一原版(RUG毕业证)格罗宁根大学毕业证成绩单
一比一原版(RUG毕业证)格罗宁根大学毕业证成绩单一比一原版(RUG毕业证)格罗宁根大学毕业证成绩单
一比一原版(RUG毕业证)格罗宁根大学毕业证成绩单
 
Empowering Data Analytics Ecosystem.pptx
Empowering Data Analytics Ecosystem.pptxEmpowering Data Analytics Ecosystem.pptx
Empowering Data Analytics Ecosystem.pptx
 
1.Seydhcuxhxyxhccuuxuxyxyxmisolids 2019.pptx
1.Seydhcuxhxyxhccuuxuxyxyxmisolids 2019.pptx1.Seydhcuxhxyxhccuuxuxyxyxmisolids 2019.pptx
1.Seydhcuxhxyxhccuuxuxyxyxmisolids 2019.pptx
 

Harnessing Hadoop and Big Data to Reduce Execution Times

  • 1. Harnessing Hadoop: Understanding the Big Data Processing Options for Optimizing Analytical Workloads As Hadoop’s ecosystem evolves, new open source tools and capabilities are emerging that can help IT leaders build more capable and business- relevant big data analytics environments. Executive Summary Asserting that data is vital to business is an understatement. Organizations have generated more and more data for years, but struggle to use it effectively. Clearly data has more important uses than ensuring compliance with regulatory requirements. In addition, data is being generated with greater velocity, due to the advent of new pervasive devices (e.g., smartphones, tablets, etc.), social Web sites (e.g., Facebook, Twitter, LinkedIn, etc.) and other sources like GPS, Google Maps, heat/pressure sensors, etc. Moreover, a plethora of bits and bytes is being created by data warehouses, repositories, backups and archives. IT leaders need to work with the lines of business to take advantage of the wealth of insight that can be mined from these reposito- ries. This search for data patterns will be similar to the searches the oil and gas industry conduct for precious resources. Similar business value can be uncovered if organizations can improve how “dirty” data is processed to refine key business decisions. According to IDC1 and Century Link,2 our planet now has 2.75 zettabytes (ZB) worth of data; this number is likely to reach 8ZB by 2015. This situation has given rise to the new buzzword known as “big data.” Today, many organizations are applying big data analytics to challenges such as fraud detection, risk/threat analysis, sentiment analysis, equities analysis, weather forecasting, product recom- mendations, classifications, etc. All of these activities require a platform that can process this data and deliver useful (often predictive) insights. This white paper discusses how to leverage a big data analytics platform, dissecting its capabilities, limitations and vendor choices. Big Data’s Ground Zero Apache Hadoop3 is a Java-based open source platform that makes processing of large data possible over thousands of distributed nodes. Hadoop’s development resulted from publication of two Google authored whitepapers (i.e., Google File System and Google MapReduce). There are numerous vendor-specific distributions available • Cognizant 20-20 Insights cognizant 20-20 insights | november 2013 http://withDrDavid.com
  • 2. 2 based on Apache Hadoop, from companies such as Cloudera (CDH3/4), Hortonworks (1.x/2.x) and MapR (M3/M5). In addition, appliance-based solutions are offered by many vendors, such as IBM, EMC, Oracle, etc. Figure 1 illustrates Apache Hadoop’s two base components. Hadoop distributed file system (HDFS) is a base component of the Hadoop framework that manages the data storage. It stores data in the form of data blocks (default size: 64MB) on the local hard disk. The input data size defines the block size to be used in the cluster. A block size of 128MB for a large file set is a good choice. Keeping large block sizes means a small number of blocks can be stored, thereby minimizing the memory requirement on the master node (commonly known as NameNode) for storing the metadata information. Block size also impacts the job execution time, and thereby cluster per- formance. A file can be stored in HDFS with a different block size during the upload process. For file sizes smaller than the block size, the Hadoop cluster may not perform optimally. Example: A file size of 512MB requires eight blocks (default: 64MB), so it will need memory to store information of 25 items (1 filename + 8 * 3 copies). The same file with a block size of 128MB will require memory for just 13 items. HDFS has built-in fault tolerance capability, thereby eliminating the need for setting up a redundant array of inexpensive disks (RAID) on the data nodes. The data blocks stored in HDFS automatically get replicated across multiple data nodes based on the replication factor (default: 3). The replication factor determines the number of copies of a data block to be created and distrib- uted across other data nodes. The larger the rep- lication factor, the better the read performance of the cluster. On the other hand, it requires more time to replicate the data blocks and more storage for replicated copies. For example, storing a file size of 1GB requires around 4GB of storage space, due to a replica- tion factor of 3 and an additional 20% to 30% of space required during data processing. Both block size and replication factor impact Hadoop cluster performance. The value for both the parameters depends on the input data type and the application workload to run on this cluster. In a development environment, the rep- lication factor can be set to a smaller number — say, 1 or 2 — but it is advisable to set this value to a higher factor, if the data available on the cluster is crucial. Keeping a default replication factor of 3 is a good choice from a data redundancy and cluster performance perspective. Figure 2 (next page) depicts the NameNode and DataNode that constitute the HDFS service. MapReduce is the second base component of a Hadoop framework. It manages the data processing part on the Hadoop cluster. A cluster runs MapReduce jobs written in Java or any other language (Python, Ruby, R, etc.) via streaming. A single job may consist of multiple tasks. The number of tasks that can run on a data node is governed by the amount of memory installed. It is recommended to have more memory installed on each data node for better cluster performance. Based on the application workload that is going to run on the cluster, a data node may require high compute power, high disk I/O or additional memory. cognizant 20-20 insights Figure 1 Hadoop Base Components MapReduce (Distributed Programming Framework) HDFS (Hadoop Distributed File System)
  • 3. 3cognizant 20-20 insights The job execution process is handled by the JobTracker and TaskTracker services (see Figure 2). The type of job scheduler chosen for a cluster depends on the application workload. If the cluster will be running short running jobs, then FIFO scheduler will serve the purpose. For a much more balanced cluster, one can choose to use Fair Scheduler. In MRv1.0, there’s no high availability (HA) for the JobTracker service. If the node running JobTracker service fails, all jobs submitted in the cluster will be discontinued and must be resubmitted once the node is recovered or another node is made available. This issue has been fixed in the newer release — i.e., MRv2.0, also known as yet another resource negotiator (YARN).4 Figure 2 also depicts the interaction among these five service components. The Secondary Name Node plays the vital role of check-pointing. It is this node that off-loads the work of metadata image creation and updating on a regular basis. It is recommended to keep the machine configuration the same as Name Node so that in the case of an emergency this node can be promoted as a Name Node. The Hadoop Ecosystem Hadoop base components are surrounded by numerous ecosystem5 projects. Many of them are already top-level Apache (open source) projects; some, however, remain in the incubation stage. Figure 3 (next page) highlights a few of the ecosys- tem components. The most common ones include: • Pig provides a high-level procedural language known as PigLatin. Programs written in PigLatin are automatically converted into map-reduce jobs. This is very suitable in situations where developers are not familiar with high-level languages such as Java, but have strong scripting knowledge and expertise. They can create their own customized functions. It is predominantly recommended for processing large amounts of semi-structured data. • Hive provides a SQL-like language known as HiveQL (query language). Analysts can run queries against the data stored in HDFS using SQL-like queries that automatically get converted into map-reduce jobs. It is rec- ommended where we have large amounts of unstructured data and we want to fetch particular details without writing map-reduce jobs. Companies like RevolutionAnalytics6 can leverage Hadoop clusters running Hive via RHadoop connectors. RHadoop provides a connector for HDFS, MapReduce and HBase connectivity. • HBase is a type of NOSQL database and offers scalable and distributed database service. However, it is not optimized for transactional applications and relational analytics. Be aware: HBase is not a replacement for a relational database.AclusterwithHBaseinstalledrequires an additional 24GB to 32GB of memory on each data node. It is recommended that the data nodes running HBase Region Server should run a smaller number of MapReduce tasks. Figure 2 Hadoop Services Schematic HDFS – Hadoop Distributed File System Task Tracker Data Node Job Tracker Task Tracker Task Tracker .... Name Node Data Node Data Node.... MapReduce (Processing) Secondary Name Node HDFS (Storage)
  • 4. cognizant 20-20 insights 4 Hadoop clusters are mainly used in R&D-relat- ed activities. Many enterprises are now using Hadoop with production workloads. The type of ecosystem components required depends upon the enterprise needs. Enterprises with large data warehouses normally go for production-ready tools from proven vendors such as IBM, Oracle, EMC, etc. Also, not every enterprise requires a Hadoop cluster. A cluster is mainly required by business units that are developing Hadoop- based solutions or running analytics workloads. Business units that are just entering into analytics may start small by building a 5-10 node cluster. As shown in Figure 4 (next page), a Hadoop cluster can scale from just a few data nodes to thousands of data nodes as the need arises or data grows. Hadoop clusters show linear scal- ability, both in terms of performance and storage capacity, as the count of nodes increase. Apache Hadoop runs primarily on the Intel x86/x86_64- based architecture. Clusters must be designed with a particular workload in mind (e.g., data-intensive loads, processor-intensive loads, mixed loads, real-time clusters, etc.). As there are many Hadoop dis- tributions available, implementation selection depends upon feature availability, licensing cost, performance reports, hardware requirements, skills availability within the enterprise and the workload requirements. Per a comStore report,7 MapR performs better than Cloudera and works out to be less costly. Hortonworks, on the other hand, has a strong relationship with Microsoft. Its distribution is also available for a Microsoft platform (HDP v1.3 for Windows), which makes this distribution popular among the Microsoft community. Hadoop Appliance An alternate approach to building a Hadoop cluster is to use a Hadoop appliance. These are customizable and ready-to-use complete stacks from vendors such as IBM (IBM Netezza), Oracle (Big Data Appliance), EMC (Data Computing Appliance), HP (AppSystem), Teradata Aster, etc. Using an appliance-based approach speeds go- to-market and results in an integrated solution from a single (and accountable) vendor like IBM, Teradata, HP, Oracle, etc. These appliances come in different sizes and con- figurations. Every vendor uses a different Hadoop distribution. For example: Oracle Big Data is based on a Cloudera distribution; EMC uses MapR and HP AppSystem can be configured with Cloudera, Figure 3 Scanning the Hadoop Project Ecosystem MapReduce (Distributed Programming Framework) Pig (Data Flow) Zookeeper (Coordination) Ambari (SystemManagement) Hive (SQL) HBase (NoSQLStore) HDFS (Hadoop Distributed File System) HCatalog (Table/SchemaMgmt.) Avro (Serialization) Sqoop (CLIforDataMovement) Hadoop Ecosystem OS Layer Physical Machines Virtual Machines Storage Network Infrastructure Layer
  • 5. cognizant 20-20 insights 5 MapR or Hortonworks. These appliances are designed to solve a particular type of problem or for a particular workload in mind. • IBM Netezza 1000 is designed for high-per- formance business intelligence and advanced analytics. It’s a data warehouse appliance that can run complex and mixed workloads. In addition, it supports multiple languages (C/ C++, Java, Python, FORTRAN), frameworks (Hadoop) and tools (IBM SPSS, R and SAS). • The Teradata Aster appliance comes with 50 pre-built MapReduce functions optimized for graph, text and behavior/marketing analysis. It also comes with Aster Database and Apache Hadoop to process structured, unstructured and semi-structured data. A full Hadoop cabinet includes two masters and 16 data nodes which can deliver up to 152TB of user space. • The HP AppSystem appliance comes with pre- configured HP ProLiant DL380e Gen8 servers running Red Hat Enterprise Linux 6.1 and a Cloudera distribution. Enterprises can use this appliance with HP Vertica Community Edition to run real-time analysis of structured data. HDFS and Hadoop HDFS is not the only platform for performing big data tasks. Vendors have emerged with solutions that offer a replacement for HDFS. For example, DataStax8 uses Cassandra; Amazon Web Services uses S3; IBM has General Parallel File System (GPFS); EMC relies on Isilon; MapR uses Direct NFS; etc. (To learn more, read about alternate technologies.)9 An EMC appliance uses Isilon scale-out NAS technology, along with HDFS. This combination eliminates the need to export/import the data into HDFS (see Figure 5, next page). IBM uses GPFS instead of HDFS in its solution after finding that GPFS performed better than HDFS (see Figure 6, also next page). Many IT shops are scrambling to build Hadoop clusters in their enterprise. The question is: What are they going to do with these clusters? Not every enterprise needs to have its own cluster; it all depends on the nature of the business and whether the cluster serves a legitimate and compelling business use case. For example, enter- prises into financial transactions would definitely require an analytic platform for fraud detection, Figure 4 Hadoop Cluster: An Anatomical View Data Nodes (Slave Nodes) Network Switch Client Nodes Kerberos (Security) Name Nodes Manual/Automatic Failover (Master Node) Cluster Manager (Management & Monitoring) Job Tracker/ Resource Manager (Scheduler) (Edge Nodes)
Hadoop-as-a-Service
There are a couple of ways to leverage Hadoop clusters beyond building a new cluster or taking an appliance-based approach. A more cost-effective approach is to use the Amazon Elastic MapReduce (EMR)10 service. Enterprises can build a cluster on demand and pay only for the duration they use it (a brief sketch follows Figure 6). Amazon offers compute instances running open source Apache Hadoop as well as the MapR (M3/M5/M7) distributions. Other vendors that offer Hadoop-as-a-Service include:

• Joyent, which offers Hadoop services based on the Hortonworks distribution.
• Skytap, which provides Hadoop services based on the Cloudera (CDH4) distribution.
• Mortar, which delivers Hadoop services based on the MapR distribution.

All of these vendors charge on a pay-per-use basis. New vendors will emerge as market demand increases and more enterprises require additional analytics, business intelligence (BI) and data warehousing options. As the competitive stakes rise, we recommend comparing prices and capabilities to get the best price/performance for your big data environment.

Figure 5
EMC Subs Isilon for HDFS
[Diagram: analytics tools (R/RHIPE, Pig, Hive, HBase, Mahout) and Hadoop services (job tracker, task tracker, name node, data node) reaching Isilon storage over Ethernet through HCFS.]

Figure 6
Benchmarking HDFS vs. GPFS
[Chart: execution times for HDFS vs. GPFS across the Postmark, Teragen, Grep, CacheTest and Terasort workloads.]
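As a rough illustration of the on-demand model, the sketch below launches a small EMR cluster with the AWS SDK for Java (v1-era API). The credentials, log bucket, instance types and node count are illustrative assumptions, not recommendations.

import com.amazonaws.auth.BasicAWSCredentials;
import com.amazonaws.services.elasticmapreduce.AmazonElasticMapReduceClient;
import com.amazonaws.services.elasticmapreduce.model.JobFlowInstancesConfig;
import com.amazonaws.services.elasticmapreduce.model.RunJobFlowRequest;
import com.amazonaws.services.elasticmapreduce.model.RunJobFlowResult;

public class LaunchEvaluationCluster {
  public static void main(String[] args) {
    // Placeholder credentials; real deployments should use a credentials
    // provider rather than literals.
    AmazonElasticMapReduceClient emr = new AmazonElasticMapReduceClient(
        new BasicAWSCredentials("ACCESS_KEY", "SECRET_KEY"));
    RunJobFlowRequest request = new RunJobFlowRequest()
        .withName("evaluation-cluster")
        .withLogUri("s3://my-log-bucket/emr/")        // hypothetical bucket
        .withInstances(new JobFlowInstancesConfig()
            .withInstanceCount(5)                     // small starter cluster
            .withMasterInstanceType("m1.large")
            .withSlaveInstanceType("m1.large")
            .withKeepJobFlowAliveWhenNoSteps(true));  // keep it up for testing
    RunJobFlowResult result = emr.runJobFlow(request);
    System.out.println("Started job flow: " + result.getJobFlowId());
  }
}

Billing stops once the cluster is terminated, which is what makes this model attractive for short evaluation and proof-of-concept runs.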
The Hadoop platform by itself can't do anything; enterprises need applications that can leverage it. Developing analytics applications that offer business value requires data scientists and data analysts. These applications can perform many different types of jobs, ranging from pure analytical and BI jobs through heavy data processing. Because Hadoop can run all of these workload types, it is becoming increasingly important to design an effective and efficient cluster using the latest technologies and frameworks.

Hadoop Competitors
The Hadoop framework provides a batch-processing environment that works well with historical data in data warehouses and archives. But what if there is a need for real-time processing, as in fraud detection or bank transactions? Enterprises have begun assessing alternative approaches and technologies to solve this issue.

Storm is an open source project that helps in setting up an environment for real-time (stream) processing (a minimal topology sketch appears at the end of this paper, just before the footnotes). There are also other options to consider, such as massively parallel processing (MPP), the high-performance computing cluster (HPCC) platform and Hadapt. All of these technologies require a strong understanding of the workload that will be run. If the workload is more processing-intensive, HPCC is generally the recommended choice. HPCC uses a declarative programming language that is extensible, homogeneous and complete, and it employs an open data model that is not constrained by key-value pairs; data processing can happen in parallel. To learn more, read HPCC vs. Hadoop.11

Massively parallel processing databases have been around since before MapReduce hit the market. They use SQL, a language familiar to data scientists and analysts, as the interface for interacting with the system. MPP systems are mainly used in high-end data warehouse implementations.

Looking Ahead
Hadoop is a strong framework for running data-intensive jobs, but it is still evolving as the ecosystem projects around it mature and expand. Market demand is spurring new technological developments daily. With these innovations come various options to choose from, such as building a new cluster, leveraging an appliance or making use of a Hadoop-as-a-Service offering. Moreover, the market is filling with new companies, which is creating more competitive technologies and options to consider.

For starters, enterprises need to make the right decision in choosing their Hadoop distribution. Enterprises can start small by picking a distribution and building a 5-10 node cluster to test its features, functionality and performance. Most reputable vendors provide a free version of their distribution for evaluation purposes. We advise enterprises to test two or three vendor distributions against their business requirements before making any final decision. IT decision-makers should also pay close attention to emerging competitors to position their organizations for impending technology enhancements.
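As referenced under "Hadoop Competitors," the sketch below shows how little wiring a Storm topology (0.9.x-era backtype.storm API) requires. TransactionSpout and FraudScoringBolt are hypothetical placeholders standing in for an enterprise's own event source and scoring logic; only the wiring shown is Storm API.

import backtype.storm.Config;
import backtype.storm.LocalCluster;
import backtype.storm.topology.TopologyBuilder;
import backtype.storm.tuple.Fields;

public class FraudDetectionTopology {
  public static void main(String[] args) {
    TopologyBuilder builder = new TopologyBuilder();
    // Hypothetical spout emitting a continuous stream of transaction tuples.
    builder.setSpout("transactions", new TransactionSpout(), 2);
    // Hypothetical bolt scoring each transaction; fieldsGrouping routes all
    // tuples for the same account to the same bolt instance, so per-account
    // state stays local.
    builder.setBolt("score", new FraudScoringBolt(), 4)
           .fieldsGrouping("transactions", new Fields("accountId"));
    // In-process cluster for development; production topologies are
    // submitted to a running Storm cluster instead.
    LocalCluster cluster = new LocalCluster();
    cluster.submitTopology("fraud-detection", new Config(),
        builder.createTopology());
  }
}

Unlike a MapReduce job, this topology runs continuously, scoring events as they arrive rather than in periodic batches.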
Footnotes
1 IDC report: Worldwide Big Data Technology and Services 2012-2015 Forecast.
2 www.centurylink.com/business/artifacts/pdf/resources/big-data-defining-the-digital-deluge.pdf.
3 http://en.wikipedia.org/wiki/Hadoop.
4 http://hadoop.apache.org/docs/current/hadoop-yarn/hadoop-yarn-site/YARN.html.
5 http://pig.apache.org/; http://hive.apache.org/; http://hbase.apache.org; "Comparing Pig Latin and SQL for Constructing Data Processing Pipelines," developer.yahoo.com/blogs/hadoop/comparing-pig-latin-sql-constructing-data-processing-pipelines-444.html; "Process a Million Songs with Apache Pig," blog.cloudera.com/blog/2012/08/process-a-million-songs-with-apache-pig/.
6 www.revolutionanalytics.com/products/revolution-enterprise.php.
7 comScore report: Switching to MapR for Reliable Insights into Global Internet Consumer Behavior.
8 www.datastax.com/wp-content/uploads/2012/09/WP-DataStax-HDFSvsCFS.pdf.
9 gigaom.com/2012/07/11/because-hadoop-isnt-perfect-8-ways-to-replace-hdfs/.
10 aws.amazon.com/elasticmapreduce/pricing/.
11 hpccsystems.com/Why-HPCC/HPCC-vs-Hadoop.
About the Authors
Harish Chauhan is an Associate Director and leads new services development around big data and emerging technologies for Cognizant's General Technology Office. He has over 21 years of experience and holds a bachelor of engineering degree in computer science. His areas of expertise include virtualization, cloud computing and distributed computing (Hadoop/HPC). He has contributed technical articles, co-authored an IBM Redbook on the virtualization engine and filed two patents in the area of virtualization. He can be reached at Harish.Chauhan@cognizant.com.

Jerald Murphy is Cognizant's Worldwide Practice Leader for Cognizant Business Consulting's IT Infrastructure Services Practice. He also leads Cognizant's Technology Office for Enterprise Computing, Cloud Services and Emerging Technology. Jerry has spent over 20 years in the computer industry in various technical, marketing, sales, senior management and industry analyst positions. He is one of the industry's leading authorities on enterprise IT infrastructure design, security and management, and has spent 20-plus years conducting engineering research and helping to develop infrastructure and security strategies for many of the largest global companies and service providers. Jerry's areas of expertise include data center infrastructure and enterprise systems management; additional focus areas include managed services and IT outsourcing. Jerry has a B.S.E.E. from the United States Military Academy and a master's degree in electrical engineering from Princeton University. He can be reached at Jerald.Murphy@cognizant.com.

About Cognizant
Cognizant (NASDAQ: CTSH) is a leading provider of information technology, consulting, and business process outsourcing services, dedicated to helping the world's leading companies build stronger businesses. Headquartered in Teaneck, New Jersey (U.S.), Cognizant combines a passion for client satisfaction, technology innovation, deep industry and business process expertise, and a global, collaborative workforce that embodies the future of work. With over 50 delivery centers worldwide and approximately 164,300 employees as of June 30, 2013, Cognizant is a member of the NASDAQ-100, the S&P 500, the Forbes Global 2000, and the Fortune 500, and is ranked among the top performing and fastest growing companies in the world. Visit us online at www.cognizant.com or follow us on Twitter: Cognizant.

World Headquarters
500 Frank W. Burr Blvd.
Teaneck, NJ 07666 USA
Phone: +1 201 801 0233
Fax: +1 201 801 0243
Toll Free: +1 888 937 3277
Email: inquiry@cognizant.com

European Headquarters
1 Kingdom Street
Paddington Central
London W2 6BD
Phone: +44 (0) 20 7297 7600
Fax: +44 (0) 20 7121 0102
Email: infouk@cognizant.com

India Operations Headquarters
#5/535, Old Mahabalipuram Road
Okkiyam Pettai, Thoraipakkam
Chennai, 600 096 India
Phone: +91 (0) 44 4209 6000
Fax: +91 (0) 44 4209 6060
Email: inquiryindia@cognizant.com

© Copyright 2013, Cognizant. All rights reserved. No part of this document may be reproduced, stored in a retrieval system, transmitted in any form or by any means, electronic, mechanical, photocopying, recording, or otherwise, without the express written permission from Cognizant. The information contained herein is subject to change without notice. All other trademarks mentioned herein are the property of their respective owners.