Strengthening the Quality of Big Data Implementations
Open-source technologies are helping organizations across industries gain strategic insights from the torrents of data that now flow through IT systems. A robust big-data validation framework can significantly improve high-volume, big-data testing — helping to fortify quality assurance, refine analytical thinking and enhance the overall customer experience.
Executive Summary
Today’s enterprises rely on big-data analytics to inform everything from new product development and customer service to strategic decision making across the business landscape. Yet transforming these insights into serviceable and competitive products challenges IT departments to derive meaning more effectively from the vast pools of structured, unstructured and semi-structured data they must manage. According to a recent report from McKinsey & Company,1 decoding big data requires next-generation data-integration platforms to ensure that all relevant data is gathered and prepared for further analysis — a task that involves enhancing quality assurance (QA) mechanisms to confirm data integrity.
Considering the increasing demand for QA, it is essential that IT organizations deploy customizable testing platforms. To keep pace with these requirements, industry experts are building quality-assurance frameworks around open-source platforms like Hadoop.
Since big data is a specialized domain, it requires specialized testing — a job that is traditionally performed manually. Automating this task remains a work in progress. Nonetheless, given the exponential growth of big data, testers need to stay abreast of ever-evolving technologies, and leverage the latest and best methods for validating these assets. This compels the testing community to develop innovative accelerators and automated solutions for assuring the quality of big data.
In this white paper, we touch upon the additional skill sets data warehouse testers will need to achieve success in validating big data. We also examine ways to test “the 3 Vs” of big data: volume (scale of data), variety (different forms of data) and velocity (analysis of streaming data in microseconds). Finally, we illustrate the focal points for testing and validating the end-to-end architecture of Hadoop and its integrated systems.
Cognizant 20-20 Insights | February 2015
The Need for a Big-Data Validation Framework
Leading enterprises are scrambling to get a big slice of the big-data pie — putting pressure on big-data experts to actively formulate new and improved testing frameworks. Considering that technologies often change before they take root, it is imperative for enterprises to develop a ready-to-use product within a short time frame — making automated testing a critical part of the big-data equation. Best-of-breed testing frameworks empower businesses and their IT teams to rapidly develop, test and deploy applications — justifying investments in sophisticated testing tools and accelerators. Big-data validation frameworks should be comprehensive yet easy to use — designed to test high-volume, data-centric implementations.
These frameworks can enable testing and QA communities to:

•	 Reuse testing components.
•	 Detect defects earlier — preventing data anomalies in the first stages of the software development life cycle (SDLC).
•	 Dramatically improve requirements traceability.
•	 Automate test planning, test design and test execution at every stage of the big-data testing life cycle.
•	 Empower non-technical business users to create automated test processes and test cases without any prior scripting knowledge (a minimal sketch of this idea follows the list).
•	 Gain deeper insights to support key business decisions and strategic planning.
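To make the last two goals concrete, the sketch below shows one way a framework might let non-technical users define tests declaratively: each line of a plain-text file names a check and its parameters, and a small harness dispatches them. This is a minimal illustration under assumed conventions — the file format, check names and class are hypothetical, not part of this paper.

// Minimal sketch of a declarative test harness. The checks here only print
// what they would do; a real framework would dispatch to reusable validators.
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.List;

public class DeclarativeTestRunner {
    public static void main(String[] args) throws Exception {
        // Hypothetical test-case file, one check per line, e.g.:
        //   ROW_COUNT_MATCH,src_db.orders,hive.orders
        //   NOT_NULL,hive.orders,order_id
        List<String> cases = Files.readAllLines(Paths.get("testcases.csv"));
        for (String line : cases) {
            String[] f = line.split(",");
            switch (f[0].trim()) {
                case "ROW_COUNT_MATCH":
                    System.out.printf("Would compare row counts of %s and %s%n", f[1], f[2]);
                    break;
                case "NOT_NULL":
                    System.out.printf("Would assert %s.%s contains no NULLs%n", f[1], f[2]);
                    break;
                default:
                    System.err.println("Unknown check: " + f[0]);
            }
        }
    }
}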
Experts are utilizing open-source platforms such as Hadoop to transform testing methodologies. Considering the wide spectrum of Hadoop components that are built to accommodate diverse technologies, IT organizations can take full advantage of this comprehensive, one-stop framework for big-data testing.
Dive Deep into Hadoop
Traditional data-sampling techniques are no longer viable options for big-data testing. Today, testers must develop automated scripts that help achieve better test coverage and higher quality in less time. This requires big-data testers to assemble tailored short programs, scripts and accelerators that can be packaged as a one-stop testing framework.

Figure 1: The House of Hadoop — core components: HDFS, MapReduce; essential components: HBase, Hive, NoSQL; supporting components: Pig, Sqoop, Flume.
The Hadoop architecture (see Figure 1) comprises two core components — HDFS and MapReduce — with the flexibility to integrate additional elements (Hive, HBase, Pig, Sqoop, Flume). NoSQL databases (MongoDB, Cassandra, MarkLogic), traditional databases (Oracle, MySQL, SQL Server) and business intelligence tools such as Tableau, Datameer and Splunk can also be integrated for analytics and reporting.
Understanding the Hadoop Testing Spectrum
Big-data testing is integral to translating business insights harvested from big data into high-quality products. Testers will be better equipped to utilize superior testing mechanisms by understanding the big-data spectrum (see the Quick Take case study below).
Hadoop Testing at a Glance

Testing in Hadoop can be categorized along the following pillars:

•	 Core components testing (HDFS, MapReduce)
•	 Essential components testing (Hive, HBase, NoSQL)
•	 Supporting components testing (Flume, Sqoop)
Core Hadoop Components Testing

At its core, Hadoop is an open-source MapReduce implementation. Testing Hadoop can be a challenge, since the distributed processing of large data sets happens across clusters of computers. However, testers can help ensure efficient test coverage of the Hadoop ecosystem with systematic test planning. To achieve the desired results, Hadoop components should be categorized and validation points finalized for each component to ensure successful end-to-end data flow.
•	 Hadoop cluster setup testing. Hadoop supports hundreds of nodes that sustain distributed processing and high availability. Multiple NameNodes and DataNodes, along with TaskTrackers and JobTrackers, run simultaneously. To ensure that the setup is ready prior to loading data into Hadoop, testers should perform simple smoke tests, such as HDFS format checks; health-status checks of the NameNode, DataNodes and Secondary NameNode; and HDFS file copy and listing checks. Additionally, Hadoop is equipped with pre-built benchmark and test programs such as TeraSort, MRBench, NNBench and MiniDFSCluster/MiniMRCluster for testing the infrastructure setup. A minimal smoke-test sketch follows.
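As a hedged illustration of such a smoke test, the sketch below uses the standard Hadoop FileSystem API to confirm that HDFS is reachable and that a round-trip file copy succeeds before heavier test cycles begin. The NameNode URI and paths are assumptions, not values from the paper.

// HDFS smoke test: reachability, listing, and a copy round-trip.
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import java.net.URI;

public class HdfsSmokeTest {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Illustrative NameNode address -- replace with the cluster's own.
        FileSystem fs = FileSystem.get(URI.create("hdfs://namenode:8020"), conf);

        // Health check: the root directory must be listable.
        fs.listStatus(new Path("/"));

        // Round-trip check: copy a local file in and confirm it exists.
        Path target = new Path("/tmp/smoke_test.txt");
        fs.copyFromLocalFile(new Path("file:///tmp/smoke_test.txt"), target);
        if (!fs.exists(target)) {
            throw new IllegalStateException("HDFS copy/list smoke test failed");
        }
        fs.delete(target, true);  // clean up the probe file
        System.out.println("HDFS smoke test passed");
    }
}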
Quick Take: Helping a Telecom Assure the Quality of Its Big-Data Implementation
A Fortune 500 media and telecommunications service provider migrated a huge data set (in both volume and variety) — captured from the Internet, set-top boxes, and mobile, home-security and control systems — to a big-data platform consisting of the following Hadoop components: HDFS, Hive, Pig, Sqoop and Flume. Cognizant worked with the client to implement a big-data validation framework that included:

•	A customized tool that automates the process of capturing and validating data.
•	A custom-built result-reporting engine, which highlights discrepancies and presents results on a user-friendly dashboard.
•	Easy connectivity to, and validation of, traditional RDBMS, NoSQL databases and Hadoop components.

Our big-data validation framework helped the client to:

•	Reduce testing efforts by 60% — enabling faster time-to-market.
•	Achieve ROI in six straight releases.
•	Deliver comprehensive test coverage with complex data.
•	 HDFS testing. HDFS, a distributed file system designed to run on commodity hardware, uses a master-slave architecture. To examine the highly complex architecture of HDFS, QA teams need to verify that file storage is in accordance with the defined block size; perform replication checks to ensure availability across the NameNode and DataNodes; check for file load and input-file split per the defined block size; and execute HDFS file upload/download checks to ensure data integrity. A sketch of such metadata checks follows.
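The sketch below shows what the block-size and replication checks might look like against the Hadoop FileSystem API; the file path and expected values are assumptions for illustration.

// HDFS metadata checks: block size and replication factor of a stored file.
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsStorageChecks {
    public static void main(String[] args) throws Exception {
        FileSystem fs = FileSystem.get(new Configuration());
        // Illustrative file under test.
        FileStatus status = fs.getFileStatus(new Path("/data/input/events.log"));

        long expectedBlockSize = 128L * 1024 * 1024;  // e.g. a 128 MB block size
        short expectedReplication = 3;                // e.g. replication factor 3

        if (status.getBlockSize() != expectedBlockSize) {
            System.err.println("FAIL: block size is " + status.getBlockSize());
        }
        if (status.getReplication() != expectedReplication) {
            System.err.println("FAIL: replication factor is " + status.getReplication());
        }
    }
}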
•	 Validation of MapReduce jobs. Programming Hadoop at the MapReduce level means working with the Java APIs and manually loading data files into HDFS. Testing MapReduce therefore requires testers who are skilled in white-box testing — testers need to begin thinking like developers. QA teams need to validate whether transformation and aggregation are handled correctly by the MapReduce code. MapReduce core business logic can be validated with JUnit, while Mapper and Reducer code is tested with MRUnit, a JUnit extension designed for testing MapReduce code (see the sketch below).
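A minimal MRUnit sketch follows. WordCountMapper is a hypothetical Mapper under test (assumed to emit a count of 1 for each token); the MapDriver/withInput/withOutput pattern is MRUnit's standard usage.

// MRUnit test exercising Mapper logic in isolation from the cluster.
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mrunit.mapreduce.MapDriver;
import org.junit.Test;

public class WordCountMapperTest {
    @Test
    public void emitsOneCountPerToken() throws Exception {
        // WordCountMapper is the hypothetical class under test.
        MapDriver<LongWritable, Text, Text, IntWritable> driver =
            MapDriver.newMapDriver(new WordCountMapper());

        driver.withInput(new LongWritable(0), new Text("big data"))
              .withOutput(new Text("big"), new IntWritable(1))
              .withOutput(new Text("data"), new IntWritable(1))
              .runTest();  // fails the test if actual output differs
    }
}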
Essential Hadoop Components

Working with the Java APIs directly is a tedious and often error-prone process. To make programming easier, Hadoop offers components such as Hive, HBase and Pig. Each of these is developed using different technologies, so a concrete testing framework that covers them can be highly beneficial in speeding time to market.
•	 HBase testing. HBase, a non-relational (NoSQL) database that runs on top of HDFS, provides fault-tolerant storage and quick access to large quantities of sparse data. Its purpose is to host very large tables with billions of rows and millions of columns. HBase testing involves validation of RowKey (primary key) usage for all access attempts to HBase tables; verification of version settings in query output; verification that the latency of individual reads and writes is within the defined threshold; and checks on ZooKeeper to ensure it properly coordinates the region servers. A read-path sketch follows.
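As a hedged sketch, the check below uses the HBase client API (the Connection/Table interfaces of HBase 1.0 and later) to fetch a row by RowKey and time the read against a latency threshold. The table name, row key and 50 ms threshold are illustrative assumptions.

// HBase read check: row existence by RowKey plus a simple latency gate.
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.util.Bytes;

public class HBaseReadCheck {
    public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();
        try (Connection conn = ConnectionFactory.createConnection(conf);
             Table table = conn.getTable(TableName.valueOf("events"))) {

            long start = System.nanoTime();
            Result result = table.get(new Get(Bytes.toBytes("row-00001")));
            long elapsedMs = (System.nanoTime() - start) / 1_000_000;

            if (result.isEmpty()) {
                System.err.println("FAIL: expected row not found");
            }
            if (elapsedMs > 50) {  // illustrative read-latency threshold
                System.err.println("FAIL: read latency " + elapsedMs + " ms");
            }
        }
    }
}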
•	 Hive testing. Hive enables Hadoop to operate as a data warehouse. It superimposes structure on data in HDFS, then permits queries over the data using a familiar SQL-like syntax. Hive’s core capabilities are extensible through user-defined functions (UDFs). Since Hive is recommended for analysis of terabytes of data, the volume and velocity of big data are extensively covered in Hive testing. (Hive testing can be classified into Hive functional testing and Hive UDF testing.)
From a functional standpoint, Hive testing incorporates validation of successful setup of the Hive meta-store database; data integrity between HDFS and Hive, and between Hive and the MySQL meta-store; correctness of query and data-transformation logic; checks on the number of MapReduce jobs triggered for each piece of business logic; export/import of data from/to Hive; and data-integrity and redundancy checks when MapReduce jobs fail.
Hive’s core capabilities are made extensible by writing custom UDFs. These UDFs must be tested for correctness and completeness, which can be achieved using tools such as JUnit (see the sketch below). Testers should also exercise each UDF in the Hive CLI directly, especially when functions deal with complex data types such as struct, map and array.
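As a minimal illustration, the JUnit sketch below tests a hypothetical custom UDF named MaskEmail (assumed to extend Hive's UDF base class with an evaluate() method that hides the local part of an e-mail address) for both correctness and NULL handling. The UDF itself is an assumption, not from the paper.

// JUnit test for a hypothetical Hive UDF: correctness plus completeness (NULLs).
import static org.junit.Assert.assertEquals;
import static org.junit.Assert.assertNull;
import org.apache.hadoop.io.Text;
import org.junit.Test;

public class MaskEmailUdfTest {
    private final MaskEmail udf = new MaskEmail();  // hypothetical UDF under test

    @Test
    public void masksLocalPart() {
        assertEquals(new Text("***@example.com"),
                     udf.evaluate(new Text("jane@example.com")));
    }

    @Test
    public void handlesNullInput() {
        // Completeness check: UDFs must behave sensibly on NULL input.
        assertNull(udf.evaluate(null));
    }
}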
•	 NoSQL testing. NoSQL (“not only SQL”) databases are next-generation databases that are non-relational, distributed, open-source and horizontally scalable. Some of the widely used NoSQL databases are HBase, Cassandra, MongoDB, CouchDB, Riak and MarkLogic. A data warehouse tester who has hands-on experience with any one traditional database should be able to work with other traditional RDBMS with minimal effort. However, a NoSQL database tester will need to acquire supplementary skills for each NoSQL database in order to perform quality testing.
Testing NoSQL databases can validate big data’s “variety” aspect. Because each NoSQL database has its own query language or API, testers need to be well versed in the languages used to develop against each NoSQL database they test. Also, NoSQL databases are independent of any specific application or schema, and can operate on a variety of platforms and operating systems.
QA areas for these databases include data-type checks; count checks of the documents in a collection; CRUD operations checks; master-slave integrity checks; timestamp and timestamp-format checks; checks related to cluster failure handling; and data-integrity and redundancy checks on task failure. A MongoDB-based sketch of some of these checks follows.
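The sketch below illustrates two of these checks — a document count and a CRUD round-trip — against MongoDB using its modern Java driver (the com.mongodb.client API). The host, database and collection names are assumptions.

// MongoDB QA checks: document count plus an insert/read/delete round-trip.
import com.mongodb.client.MongoClient;
import com.mongodb.client.MongoClients;
import com.mongodb.client.MongoCollection;
import org.bson.Document;

public class MongoQaChecks {
    public static void main(String[] args) {
        try (MongoClient client = MongoClients.create("mongodb://mongo-host:27017")) {
            MongoCollection<Document> col =
                client.getDatabase("qa").getCollection("orders");

            // Count check: compare against the expected number of documents.
            long count = col.countDocuments();
            System.out.println("orders document count: " + count);

            // CRUD round-trip: insert a probe document, read it back, clean up.
            Document probe = new Document("_id", "qa-probe").append("ok", true);
            col.insertOne(probe);
            Document found = col.find(new Document("_id", "qa-probe")).first();
            if (found == null || !found.getBoolean("ok", false)) {
                System.err.println("FAIL: CRUD round-trip check failed");
            }
            col.deleteOne(new Document("_id", "qa-probe"));
        }
    }
}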
A manual testing approach around Hadoop components can thus be hardened into a robust QA framework. Moreover, the framework can be progressively customized to a specific domain; in turn, this customized framework can facilitate key business decisions and strategic planning.
Supporting Components

Flume and Sqoop improve interoperability with the rest of the data world. Sqoop is a tool designed to import data from relational databases into Hadoop, and vice versa. Flume directly imports streaming data into HDFS.
•	 Flume and Sqoop testing. In a data warehouse, the database supports the import/export of data. The big-data ecosystem is likewise equipped with data-ingestion tools such as Flume and Sqoop, which can be used to move data into and out of Hadoop. Instead of writing a stand-alone application to move data to HDFS, it is worth considering these existing tools for ingesting data, since they offer most of the common functions. General QA checkpoints include successfully generating streaming data from Web sources using Flume, and checks on data propagation from conventional data stores into Hive and HBase, and vice versa. A reconciliation sketch for Sqoop imports follows.
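One common Sqoop checkpoint is reconciling row counts between the source RDBMS and the Hive table the import landed in. The sketch below does this over JDBC, assuming the MySQL and HiveServer2 JDBC drivers are on the classpath; connection strings, credentials and table names are illustrative.

// Post-Sqoop-import reconciliation: source row count vs. Hive row count.
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class SqoopImportCheck {
    /** Runs SELECT COUNT(*) against a table over the given connection. */
    static long count(Connection c, String table) throws Exception {
        try (Statement s = c.createStatement();
             ResultSet rs = s.executeQuery("SELECT COUNT(*) FROM " + table)) {
            rs.next();
            return rs.getLong(1);
        }
    }

    public static void main(String[] args) throws Exception {
        try (Connection mysql = DriverManager.getConnection(
                 "jdbc:mysql://source-db:3306/crm", "qa_user", "secret");
             Connection hive = DriverManager.getConnection(
                 "jdbc:hive2://hive-server:10000/default", "qa_user", "secret")) {
            long src = count(mysql, "customers");
            long dst = count(hive, "customers");
            System.out.println(src == dst
                ? "PASS: Sqoop import counts match (" + src + ")"
                : "FAIL: source=" + src + " hive=" + dst);
        }
    }
}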
Looking Forward: The Only Constant Is Change

Technology, more often than not, follows nature’s patterns: business insights are harvested from abundant, untapped big data. Considering the increasing demand for big-data QA, it is essential that IT organizations deploy customizable testing platforms and frameworks. To keep pace, experts must build a robust QA framework for ensuring accuracy and consistency — enabling businesses and their IT departments to:

•	 Optimize the resources — hardware, people and time — used to perform testing.
•	 Employ domain-specific, built-in libraries that enable non-technical business users to perform testing.
•	 Introduce automation early in every phase of the big-data software testing life cycle.
Footnotes

1. “How to Get the Most from Big Data,” McKinsey & Company, December 2014.
About the Author

Sushmitha Geddam is a Project Manager in the R&D team of Cognizant’s Data Testing Center of Excellence and leads the company’s big-data QA initiatives. During her 11-year diversified career, she has delivered multifaceted enterprise business software projects in a wide array of domains and specialized testing areas (big data, DW/ETL, data migration). She is responsible for providing delivery oversight, thought leadership and best practices; developing internal talent in technology CoEs; and defining the quality and testing processes and tool strategy for big-data implementations. She is PMP-certified, a Cloudera-certified developer for Apache Hadoop and a Datameer-certified analyst. She can be reached at Sushmitha.Geddam@cognizant.com.
About Cognizant

Cognizant (NASDAQ: CTSH) is a leading provider of information technology, consulting, and business process outsourcing services, dedicated to helping the world’s leading companies build stronger businesses. Headquartered in Teaneck, New Jersey (U.S.), Cognizant combines a passion for client satisfaction, technology innovation, deep industry and business process expertise, and a global, collaborative workforce that embodies the future of work. With over 75 development and delivery centers worldwide and approximately 199,700 employees as of September 30, 2014, Cognizant is a member of the NASDAQ-100, the S&P 500, the Forbes Global 2000, and the Fortune 500 and is ranked among the top performing and fastest growing companies in the world. Visit us online at www.cognizant.com or follow us on Twitter: Cognizant.
World Headquarters
500 Frank W. Burr Blvd.
Teaneck, NJ 07666 USA
Phone: +1 201 801 0233
Fax: +1 201 801 0243
Toll Free: +1 888 937 3277
Email: inquiry@cognizant.com
European Headquarters
1 Kingdom Street
Paddington Central
London W2 6BD
Phone: +44 (0) 20 7297 7600
Fax: +44 (0) 20 7121 0102
Email: infouk@cognizant.com
India Operations Headquarters
#5/535, Old Mahabalipuram Road
Okkiyam Pettai, Thoraipakkam
Chennai, 600 096 India
Phone: +91 (0) 44 4209 6000
Fax: +91 (0) 44 4209 6060
Email: inquiryindia@cognizant.com
© Copyright 2015, Cognizant. All rights reserved. No part of this document may be reproduced, stored in a retrieval system, transmitted in any form or by any means, electronic, mechanical, photocopying, recording, or otherwise, without the express written permission from Cognizant. The information contained herein is subject to change without notice. All other trademarks mentioned herein are the property of their respective owners. TL Codex 1009