Big Data Testing Approach - Rohit Kharabe

[Diagram: Big Data testing pipeline and validation points]
Flow: Data Source (RDBMS, web logs, social media, etc.) -> Data Lake (HDFS cluster) -> Refined Data (HDFS cluster) -> Enterprise Data Warehouse (data factory for query and analysis) -> BI
Validation points:
1) Data Stage Validation
2) "MapReduce" Process Validation (clustering, data aggregation or segregation rules, key-value pairs validation)
3) Algorithm / Output Validation
4) Report / Business Requirements Validation
Other labels: Python; Data Integration and Refinement; Data Synthesis; Data Preparation; Structured Data; ETL; ETL Process Validation

Data Staging Validation
The first stage is process validation:
1) Validate the data from the various sources (RDBMS, web logs, social media, etc.) to make sure that the correct data is pulled into the system
2) Compare the source data with the data pushed into the Hadoop system to make sure they match
3) Verify that the right data is extracted and loaded into the correct HDFS location
4) Tools such as Talend and Datameer can be used for data staging validation
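
The source-to-Hadoop comparison in step 2 can be sketched in Python as a count and fingerprint check. This is only an illustration under stated assumptions: the records are small in-memory tuples, and the names record_fingerprint and compare_source_to_staged are hypothetical helpers, not part of Talend, Datameer, or Hadoop.

```python
import hashlib
from collections import Counter

def record_fingerprint(record):
    """Return a stable SHA-256 digest for one record (a tuple of field values)."""
    joined = "\x1f".join(str(field) for field in record)
    return hashlib.sha256(joined.encode("utf-8")).hexdigest()

def compare_source_to_staged(source_rows, staged_rows):
    """Compare two collections of records, ignoring row order."""
    source = Counter(record_fingerprint(r) for r in source_rows)
    staged = Counter(record_fingerprint(r) for r in staged_rows)
    return {
        "source_count": sum(source.values()),
        "staged_count": sum(staged.values()),
        # Records present in the source but not landed in the staging area
        "missing_in_staged": sum((source - staged).values()),
        # Records in the staging area that the source never contained
        "unexpected_in_staged": sum((staged - source).values()),
    }

# Hypothetical sample data: one source row did not reach the staging area
source_rows = [(1, "alice", "2016-01-01"), (2, "bob", "2016-01-02"), (3, "carol", "2016-01-03")]
staged_rows = [(2, "bob", "2016-01-02"), (1, "alice", "2016-01-01")]
result = compare_source_to_staged(source_rows, staged_rows)
print(result)
```

On real volumes the same idea would normally be applied per partition or per file rather than by holding every fingerprint in memory.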

"MapReduce" Validation
The second stage is the validation of "MapReduce". In this stage, the tester verifies the business logic on each node and then validates it again after running against multiple nodes, ensuring that:
1) The MapReduce process works correctly
2) Data aggregation or segregation rules are implemented on the data
3) Key-value pairs are generated
4) The data is valid after the MapReduce process
Big data tools used at this stage include Hadoop, Spark, Hive, Pig, Cascading, Oozie, Kafka, S4, MapR, and Flume
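
The checks listed above (key-value pairs generated, aggregation rules applied, data valid after the job) can be illustrated with a tiny in-memory job. This is a sketch only: map_sale, reduce_sales, and run_job are hypothetical names, the "region,amount" input format is an assumption, and no Hadoop API is involved.

```python
from collections import defaultdict

def map_sale(line):
    """Map step: emit one (region, amount) pair per input line 'region,amount'."""
    region, amount = line.split(",")
    yield region.strip(), int(amount)

def reduce_sales(region, amounts):
    """Reduce step: aggregate all amounts for one region."""
    return region, sum(amounts)

def run_job(lines):
    """Run map, shuffle (group by key), and reduce in memory."""
    grouped = defaultdict(list)
    for line in lines:
        for key, value in map_sale(line):
            grouped[key].append(value)
    return dict(reduce_sales(k, v) for k, v in grouped.items())

lines = ["east,10", "west,5", "east,7", "north,3", "west,1"]
output = run_job(lines)

# Check 1: every intermediate key-value pair has the expected types
pairs = [pair for line in lines for pair in map_sale(line)]
pairs_well_formed = all(isinstance(k, str) and isinstance(v, int) for k, v in pairs)

# Check 2: the aggregation rule preserves the grand total of the input
totals_match = sum(output.values()) == sum(v for _, v in pairs)

print(output, pairs_well_formed, totals_match)
```

The second check is an invariant test: it does not need a hand-written expected result, so it can also be run against large production-like inputs.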

Testing Methods for Hadoop MapReduce Processes
1) MRUnit - Unit testing for MR jobs:
MRUnit lets users define key-value pairs to be given to map and reduce functions, and it tests that the correct key-value pairs are emitted from each of these functions.
2) Local job runner testing - Running MR jobs on a single machine in a single JVM:
The local job runner lets you run Hadoop on a local machine, in one JVM, making MR jobs a little easier to debug when a job fails.
3) Pseudo-distributed testing - Running MR jobs on a single machine using Hadoop:
A pseudo-distributed cluster is composed of a single machine running all Hadoop daemons. This cluster is still relatively easy to manage (though harder than the local job runner) and tests integration with Hadoop better than the local job runner does.
4) Full integration testing - Running MR jobs on a QA cluster:
Used to test your MR jobs by running them on a QA cluster composed of at least a few machines. By running your MR jobs on a QA cluster, you test all aspects of both your job and its integration with Hadoop.
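
The idea behind the MRUnit method (feed defined key-value pairs to a map or reduce function and check the emitted pairs) can be shown in Python. MRUnit itself is a Java library; the sketch below does not use or imitate its API. The word-count functions and the run_map / run_reduce helpers are hypothetical.

```python
def wordcount_mapper(key, value):
    """Map function under test: emit (word, 1) for each word in the line."""
    for word in value.lower().split():
        yield word, 1

def wordcount_reducer(key, values):
    """Reduce function under test: emit (word, total count)."""
    yield key, sum(values)

def run_map(mapper, key, value):
    """Feed one input pair to the mapper and collect the emitted pairs."""
    return list(mapper(key, value))

def run_reduce(reducer, key, values):
    """Feed one key and its grouped values to the reducer and collect the output."""
    return list(reducer(key, values))

# Defined input pairs and the pairs each function is expected to emit
map_result = run_map(wordcount_mapper, 0, "Big data big tests")
reduce_result = run_reduce(wordcount_reducer, "big", [1, 1])

assert map_result == [("big", 1), ("data", 1), ("big", 1), ("tests", 1)]
assert reduce_result == [("big", 2)]
```

Testing the map and reduce functions separately keeps failures easy to localize before moving on to the local job runner or a cluster.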

Output Validation Phase
The third stage of Big Data testing is the output validation process. The output data files are generated and ready to be moved to an EDW (Enterprise Data Warehouse) or any other system, depending on the requirement.
Activities in the third stage include:
1) Checking that the transformation rules are correctly applied
2) Checking data integrity and the successful data load into the target system
3) Checking that there is no data corruption by comparing the target data with the HDFS file system data
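
The three activities above can be sketched as one comparison routine. Everything here is assumed for illustration: the transformation rule (trim the name, upper-case the country code), the record layout, and the function names are hypothetical.

```python
def transform(record):
    """Hypothetical transformation rule: trim the name, upper-case the country code."""
    return {
        "id": record["id"],
        "name": record["name"].strip(),
        "country": record["country"].upper(),
    }

def validate_output(hdfs_records, target_records):
    """Return a list of issues found when comparing target data with HDFS data."""
    issues = []
    expected = {r["id"]: transform(r) for r in hdfs_records}
    actual = {r["id"]: r for r in target_records}
    # Integrity: ids in the target must be unique
    if len(actual) != len(target_records):
        issues.append("duplicate ids in target")
    for rec_id, exp in expected.items():
        if rec_id not in actual:
            issues.append(f"id {rec_id}: missing in target")
        elif actual[rec_id] != exp:
            issues.append(f"id {rec_id}: transformation mismatch")
    # Records in the target that have no origin in HDFS
    for rec_id in sorted(actual.keys() - expected.keys()):
        issues.append(f"id {rec_id}: not present in HDFS data")
    return issues

hdfs_records = [
    {"id": 1, "name": " Ann ", "country": "us"},
    {"id": 2, "name": "Raj", "country": "in"},
    {"id": 3, "name": "Li", "country": "cn"},
]
target_records = [
    {"id": 1, "name": "Ann", "country": "US"},
    {"id": 2, "name": "Raj", "country": "in"},  # rule not applied
]
issues = validate_output(hdfs_records, target_records)
print(issues)
```

For the sample data this reports a transformation mismatch for id 2 and a missing record for id 3.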

Report Validation
In this final phase of testing, reports are checked against the target data warehouse. The report data should match the data in the target data warehouse.
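
A report check of this kind can be sketched by recomputing the report figures from warehouse rows and listing any differences. The row layout (region, amount) and the function names are assumptions made for the example.

```python
from collections import defaultdict

def summarize_warehouse(rows):
    """Recompute per-region totals from (region, amount) warehouse rows."""
    totals = defaultdict(int)
    for region, amount in rows:
        totals[region] += amount
    return dict(totals)

def report_discrepancies(report, warehouse_rows):
    """Return the regions where the report figure differs from the warehouse."""
    expected = summarize_warehouse(warehouse_rows)
    diffs = {}
    for region in sorted(set(report) | set(expected)):
        if report.get(region) != expected.get(region):
            diffs[region] = {
                "report": report.get(region),
                "warehouse": expected.get(region),
            }
    return diffs

# Hypothetical data: the report overstates the "west" total
warehouse_rows = [("east", 100), ("east", 50), ("west", 70)]
report = {"east": 150, "west": 75}
diffs = report_discrepancies(report, warehouse_rows)
print(diffs)
```

An empty result means every figure in the report is backed by the warehouse data.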

[Diagram: Example technologies at each pipeline stage]
Data sources: Oracle, NoSQL, log files, social media, etc.
Data lake: Apache Hadoop HDFS, MapR-FS, Cloudera
Refined data: Apache Hadoop HDFS, MapR-FS, HBase, VoltDB
Target: Enterprise Data Warehouse -> BI
Tool labels: Flume, Sqoop, LogStash, etc.; MapReduce, Spark, Pig Latin, etc.; Pig, Hive, R, etc.; Tableau, Datameer, d3.js, etc.
Other labels: Python; Data Integration and Refinement; Data Synthesis; Data Preparation

Tools for validating pre-Hadoop processing
1) Apache Flume -
Used to ingest log files from application servers or other systems
2) Apache Sqoop -
Used to import data from a MySQL or Oracle database into HDFS
3) Hive -
Hive structures data in Hadoop as relational-like tables and allows queries using a subset of SQL. It provides infrastructure and tools for easy extraction, transformation, and loading of data, and it lets users embed custom mappers and reducers.
4) Apache Pig -
Pig provides an alternative language to SQL, called Pig Latin, for querying data stored in HDFS
5) NoSQL -
Used to store and retrieve data with the features of a NoSQL database. Some NoSQL databases available are CouchDB, MongoDB, Cassandra, Redis, and HBase (ZooKeeper, often listed alongside them, is a coordination service rather than a database).
6) MapR-FS -
MapR-FS is a POSIX file system that provides distributed, reliable, high-performance, scalable, full read/write data storage for the MapR Converged Data Platform. It supports the HDFS API, fast NFS access, access controls (MapR ACEs), and transparent data compression.
7) Apache Spark -
Apache Spark is an open source big data processing framework built around speed, ease of use, and sophisticated analytics. It provides a comprehensive, unified framework for big data processing across data sets that are diverse in nature (text data, graph data, etc.) and in source (batch vs. real-time streaming data).
8) HBase -
HBase is a column-oriented database management system that runs on top of HDFS. It is well suited for sparse data sets, which are common in many big data use cases.
9) Lucene/Solr -
Lucene is one of the most popular open source tools for indexing large blocks of unstructured text

Performance Testing
Performance testing for Big Data includes:
Data ingestion and throughput: In this stage, the tester verifies how fast the system can consume data from various data sources. Testing involves identifying how many messages the queue can process in a given time frame. It also covers how quickly data can be inserted into the underlying data store, for example the insertion rate into a MongoDB or Cassandra database.
Data processing: This involves verifying the speed at which queries or MapReduce jobs are executed. It also includes testing the data processing in isolation when the underlying data store is populated with the data sets, for example running MapReduce jobs on the underlying HDFS.
Sub-component performance: These systems are made up of multiple components, and it is essential to test each of them in isolation, for example how quickly messages are indexed and consumed, MapReduce jobs, query performance, search, etc.
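
The insertion-rate measurement described under data ingestion can be sketched as a small timing harness. This is an assumption-laden illustration: measure_ingestion is a hypothetical helper, and a Python list stands in for the real data store (a real test would pass the insert call of a MongoDB or Cassandra client instead).

```python
import time

def measure_ingestion(records, insert):
    """Insert every record through `insert` and report the achieved rate."""
    start = time.perf_counter()
    count = 0
    for record in records:
        insert(record)
        count += 1
    elapsed = time.perf_counter() - start
    rate = count / elapsed if elapsed > 0 else float("inf")
    return {"records": count, "seconds": elapsed, "records_per_second": rate}

# Stand-in data store: an in-memory list instead of a real database
store = []
stats = measure_ingestion(({"id": i} for i in range(10_000)), store.append)
print(f"{stats['records']} records at {stats['records_per_second']:.0f} records/s")
```

The measured rate depends on the machine and the store, so a real test would compare it against an agreed threshold rather than a fixed number.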

Parameters for Performance Testing
The parameters to be verified in performance testing are:
Data storage: How data is stored across the different nodes
Commit logs: How large the commit log is allowed to grow
Concurrency: How many threads can perform write and read operations
Caching: Tuning of the cache settings "row cache" and "key cache"
Timeouts: Values for connection timeout, query timeout, etc.
JVM parameters: Heap size, GC algorithms, etc.
MapReduce performance: Sort, merge, etc.
Message queue: Message rate, size, etc.

• Installation Testing - Quality assurance work that focuses on what customers need to do to install and set up the new big data application successfully. The testing may cover full, partial, or upgrade install/uninstall processes.
• End-to-End Test Environment Operational Testing - Provides complete end-to-end testing of the application, verifying the process from the first phase, when data is fetched into the data lake, to the last phase, the output validation of the machine learning algorithms.
• Backup and Restore Testing - Ensures that the backup nodes are working correctly, that the cluster is properly configured for backup nodes, and that all nodes in the cluster interact with each other as expected.
• Failover Testing - Ensures that, in case of failure, the backup nodes come into action and the system resumes its work with the help of the backup node without much degradation in performance.
• Recovery Testing - Ensures that information can be recovered easily from a backup node if a data node fails.
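
The failover and recovery checks can be illustrated with a toy model of replication. This is not Hadoop code: ToyCluster is a hypothetical in-memory class, the round-robin placement is an assumption chosen for simplicity, and real HDFS replica placement works differently.

```python
class ToyCluster:
    """In-memory stand-in for a cluster that copies each block to several nodes."""

    def __init__(self, node_names, replication=2):
        self.nodes = {name: {} for name in node_names}
        self.alive = set(node_names)
        self.replication = replication
        self._next = 0  # round-robin pointer for replica placement

    def write(self, block_id, data):
        """Store the block on `replication` consecutive nodes."""
        names = sorted(self.nodes)
        for offset in range(self.replication):
            target = names[(self._next + offset) % len(names)]
            self.nodes[target][block_id] = data
        self._next += 1

    def fail(self, name):
        """Simulate the failure of one node."""
        self.alive.discard(name)

    def read(self, block_id):
        """Read the block from any surviving node that holds a replica."""
        for name in sorted(self.alive):
            if block_id in self.nodes[name]:
                return self.nodes[name][block_id]
        raise KeyError(block_id)

def lost_blocks(cluster, block_ids):
    """Return the blocks that can no longer be read from any live node."""
    lost = []
    for block_id in block_ids:
        try:
            cluster.read(block_id)
        except KeyError:
            lost.append(block_id)
    return lost

cluster = ToyCluster(["n1", "n2", "n3"], replication=2)
block_ids = [f"b{i}" for i in range(6)]
for block_id in block_ids:
    cluster.write(block_id, f"data-{block_id}")

cluster.fail("n1")
after_one_failure = lost_blocks(cluster, block_ids)   # expect no loss
cluster.fail("n2")
after_two_failures = lost_blocks(cluster, block_ids)  # replicas on n1+n2 only are lost
print(after_one_failure, after_two_failures)
```

With two replicas per block, a single node failure loses nothing, while a second failure loses exactly the blocks whose replicas both sat on the failed nodes. A real failover test makes the same kind of assertion against the actual cluster.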