[Figure: Big data testing pipeline. Data Source (RDBMS, web logs, social media, etc.) -> Data Lake (HDFS cluster) -> Refined Data (HDFS cluster) -> Enterprise Data Warehouse (data factory for query and analysis) -> BI. Stage labels: Data Preparation; Data Integration and Refinement (Python); Data Synthesis; Structured Data; ETL. Validation points: 1) Data Stage Validation; 2) "MapReduce" Process Validation (clustering, data aggregation or segregation rules, key-value pair validation); 3) Algorithm / Output Validation; 4) Report / Business Requirements Validation; plus ETL Process Validation.]
Data Staging Validation
The first stage is process validation of data staging:
1) Validate data from the various sources (RDBMS, web logs, social media, etc.) to make
sure that the correct data is pulled into the system
2) Compare the source data with the data pushed into the Hadoop system to make sure they match
3) Verify that the right data is extracted and loaded into the correct HDFS location
4) Tools such as Talend and Datameer can be used for data staging validation
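The source-to-Hadoop comparison in step 2 can be sketched as a record count plus an order-independent checksum. This is a minimal illustration, not a Talend or Datameer workflow: the names `fingerprint` and `reconcile` and the sample rows are hypothetical, and in practice the two inputs would be read from the source system and from the HDFS landing location.

```python
import hashlib

MODULUS = 2 ** 256


def fingerprint(records):
    """Return (row_count, checksum) for an iterable of text records.

    The checksum adds the SHA-256 digests of the rows modulo 2**256, so it
    does not depend on row order (landed files are often written in a
    different order than the source emits them).
    """
    count = 0
    checksum = 0
    for record in records:
        digest = hashlib.sha256(record.strip().encode("utf-8")).digest()
        checksum = (checksum + int.from_bytes(digest, "big")) % MODULUS
        count += 1
    return count, checksum


def reconcile(source_records, landed_records):
    """Compare source and landed data; return a list of mismatch messages."""
    src_count, src_sum = fingerprint(source_records)
    dst_count, dst_sum = fingerprint(landed_records)
    problems = []
    if src_count != dst_count:
        problems.append(f"row count differs: source={src_count}, landed={dst_count}")
    if src_sum != dst_sum:
        problems.append("content checksum differs")
    return problems


# Hypothetical sample data standing in for an RDBMS extract and an HDFS file.
source = ["1,alice,30", "2,bob,25", "3,carol,41"]
landed_ok = ["3,carol,41", "1,alice,30", "2,bob,25"]   # same rows, new order
landed_bad = ["1,alice,30", "2,bob,52", "3,carol,41"]  # one corrupted value

print(reconcile(source, landed_ok))   # no mismatches reported
print(reconcile(source, landed_bad))  # checksum mismatch reported
```

A count-plus-checksum comparison only says whether the two sides differ; locating the offending rows needs a keyed, row-level comparison.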
"MapReduce" Validation
The second stage is validation of "MapReduce". In this stage, the tester verifies the business logic
on every node and then validates it again after running against multiple nodes, ensuring that:
1) The MapReduce process works correctly
2) Data aggregation or segregation rules are implemented on the data
3) Key-value pairs are generated
4) The data is valid after the MapReduce process
Big data tools used at this stage include Hadoop, Spark, Hive, Pig, Cascading, Oozie, Kafka, S4, MapR, and
Flume
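The checks listed above can be illustrated with a small in-memory simulation of the map, shuffle, and reduce phases. This is a sketch only: it does not run on Hadoop, and the sales records, the `map_sale` and `reduce_sales` functions, and the aggregation rule (total amount per region) are assumptions made up for the example.

```python
from collections import defaultdict


def map_sale(line):
    """Map phase: emit one (region, amount) pair per input line."""
    region, amount = line.split(",")
    yield region.strip(), float(amount)


def reduce_sales(region, amounts):
    """Reduce phase: the aggregation rule is 'total amount per region'."""
    return region, sum(amounts)


def run_job(lines):
    """Run map, shuffle (group by key), and reduce in a single process."""
    groups = defaultdict(list)
    for line in lines:
        for key, value in map_sale(line):
            groups[key].append(value)
    return dict(reduce_sales(k, v) for k, v in groups.items())


def validate(lines, output):
    """Check the job output against invariants derived from the input."""
    input_regions = {line.split(",")[0].strip() for line in lines}
    input_total = sum(float(line.split(",")[1]) for line in lines)
    errors = []
    if set(output) != input_regions:
        errors.append("output keys do not match input regions")
    if abs(sum(output.values()) - input_total) > 1e-9:
        errors.append("grand total changed during aggregation")
    return errors


lines = ["north, 10.0", "south, 5.5", "north, 2.5", "east, 7.0"]
result = run_job(lines)
print(result)
print(validate(lines, result))
```

The invariants (same key set, same grand total) are computed without reusing the reducer, so a bug in the aggregation rule is not masked by the check itself.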
Testing Methods for Hadoop MapReduce Processes
1) MRUnit - Unit testing for MR jobs:
MRUnit lets users define key-value pairs to be given to map and reduce functions, and it tests that the correct
key-value pairs are emitted from each of these functions
2) Local job runner testing - Running MR jobs on a single machine in a single JVM:
The local job runner lets you run Hadoop on a local machine, in one JVM, making MR jobs a little easier to debug
when a job fails
3) Pseudo-distributed testing - Running MR jobs on a single machine using Hadoop:
A pseudo-distributed cluster is composed of a single machine running all Hadoop daemons. This cluster is still
relatively easy to manage (though harder than the local job runner) and tests integration with Hadoop better than
the local job runner does
4) Full integration testing - Running MR jobs on a QA cluster:
The MR jobs are run on a QA cluster composed of at least a few machines. By running
your MR jobs on a QA cluster, you test all aspects of both your job and its integration with Hadoop.
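MRUnit itself is a Java library, so the following is only a Python analogue of the idea behind method 1: feed known key-value pairs to the map and reduce functions in isolation and assert on the pairs they emit. The word-count `mapper` and `reducer` and the test names are hypothetical; the harness is the standard `unittest` module.

```python
import unittest


def mapper(_offset, line):
    """Emit (word, 1) for every word in the line; the input key is unused."""
    for word in line.lower().split():
        yield word, 1


def reducer(word, counts):
    """Emit (word, total) for all counts grouped under one key."""
    yield word, sum(counts)


class WordCountTest(unittest.TestCase):
    def test_mapper_emits_one_pair_per_word(self):
        emitted = list(mapper(0, "Big data big tests"))
        self.assertEqual(
            emitted, [("big", 1), ("data", 1), ("big", 1), ("tests", 1)]
        )

    def test_reducer_sums_counts_for_a_key(self):
        self.assertEqual(list(reducer("big", [1, 1, 1])), [("big", 3)])

    def test_mapper_ignores_blank_lines(self):
        self.assertEqual(list(mapper(0, "   ")), [])


# Run the suite without exiting the interpreter.
suite = unittest.defaultTestLoader.loadTestsFromTestCase(WordCountTest)
outcome = unittest.TextTestRunner(verbosity=0).run(suite)
```

Tests of this kind need no cluster, which is why they sit at the bottom of the four methods: they are the cheapest to run and the first to catch logic errors.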
Output Validation Phase
The third stage of Big Data testing is the output validation process. The output data files are generated
and ready to be moved to an EDW (Enterprise Data Warehouse) or any other system, based on the
requirement.
Activities in the third stage include:
1) Checking that the transformation rules are correctly applied
2) Checking data integrity and that the data is loaded successfully into the target system
3) Checking that there is no data corruption by comparing the target data with the HDFS file system data
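The corruption check in activity 3 can be sketched as a keyed comparison between the HDFS output and the rows loaded into the target. This assumes both sides can be exported as CSV text with a primary key column; the `load_rows` and `diff_rows` helpers and the sample data are hypothetical.

```python
import csv
import io


def load_rows(csv_text, key_field):
    """Parse CSV text into a dict keyed by the given column."""
    reader = csv.DictReader(io.StringIO(csv_text))
    return {row[key_field]: row for row in reader}


def diff_rows(hdfs_rows, target_rows):
    """Return keys missing from the target, unexpected in it, or changed."""
    missing = sorted(set(hdfs_rows) - set(target_rows))
    unexpected = sorted(set(target_rows) - set(hdfs_rows))
    changed = sorted(
        key for key in set(hdfs_rows) & set(target_rows)
        if hdfs_rows[key] != target_rows[key]
    )
    return {"missing": missing, "unexpected": unexpected, "changed": changed}


# Hypothetical exports: the HDFS output file and the rows loaded into the EDW.
hdfs_csv = "id,region,total\n1,north,12.5\n2,south,5.5\n3,east,7.0\n"
edw_csv = "id,region,total\n1,north,12.5\n2,south,55\n4,west,1.0\n"

report = diff_rows(load_rows(hdfs_csv, "id"), load_rows(edw_csv, "id"))
print(report)
```

Values are compared as strings here, so a formatting difference such as `7.0` versus `7` would be flagged as a change; a real comparison would normalise types first.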
Report Validation
In this final phase of testing, reports are checked against the target data warehouse. Report data should match
the data in the target data warehouse.
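One way to automate this comparison is to recompute each report figure from the warehouse and compare it with the value shown in the report. The sketch below uses an in-memory SQLite database as a stand-in for the warehouse; the `sales` table, the report figures, and the tolerance are all assumptions for illustration.

```python
import sqlite3

# In-memory SQLite stands in for the target data warehouse (assumption).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (region TEXT, amount REAL)")
conn.executemany(
    "INSERT INTO sales VALUES (?, ?)",
    [("north", 10.0), ("north", 2.5), ("south", 5.5), ("east", 7.0)],
)

# Hypothetical figures as they appear in the BI report.
report_totals = {"north": 12.5, "south": 5.5, "east": 70.0}


def check_report(connection, reported, tolerance=0.01):
    """Return {region: (reported, warehouse)} for every figure that differs."""
    warehouse = dict(
        connection.execute("SELECT region, SUM(amount) FROM sales GROUP BY region")
    )
    mismatches = {}
    for region in sorted(set(reported) | set(warehouse)):
        shown = reported.get(region)
        actual = warehouse.get(region)
        if shown is None or actual is None or abs(shown - actual) > tolerance:
            mismatches[region] = (shown, actual)
    return mismatches


print(check_report(conn, report_totals))
```

Regions that appear on only one side are reported with `None` for the missing value, so a figure dropped from the report is caught as well as a wrong one.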
[Figure: Tools across the pipeline. Data sources: Oracle, NoSQL, log files, social media, etc. Data lake storage: Apache Hadoop HDFS, MapR-FS, Cloudera. Refined data storage: Apache Hadoop HDFS, MapR-FS, HBase, VoltDB. Downstream: Enterprise Data Warehouse, BI. Stage labels: Data Preparation; Data Integration and Refinement (Python); Data Synthesis. Tool groups: Flume, Sqoop, LogStash, etc.; MapReduce, Spark, Pig Latin, etc.; Pig, Hive, R, etc.; Tableau, Datameer, d3.js, etc.]
Tools for validating pre-Hadoop processing
1) Apache Flume -
Used to ingest log files from application servers or other systems
2) Apache Sqoop -
Used to import data from relational databases such as MySQL or Oracle into HDFS
3) Hive -
Hive is a tool that structures data in Hadoop into the form of relational-like tables and allows queries
using a subset of SQL. It provides infrastructure and tools for easy extraction, transformation, and
loading (ETL) of data, and it allows users to embed custom mappers and reducers.
4) Apache Pig -
Pig provides an alternative language to SQL, called Pig Latin, for querying data stored in HDFS
5) NoSQL -
NoSQL databases store and retrieve data without a fixed relational schema. Some NoSQL
databases available are CouchDB, MongoDB, Cassandra, Redis, and HBase; ZooKeeper, a coordination
service rather than a database, is often deployed alongside them.
6) MapR-FS -
MapR-FS is a POSIX file system that provides distributed, reliable, high-performance, scalable, and
full read/write data storage for the MapR Converged Data Platform. MapR-FS supports the HDFS API,
fast NFS access, access controls (MapR ACEs), and transparent data compression.
7) Apache Spark -
Apache Spark is an open source big data processing framework built around speed, ease of use, and
sophisticated analytics. Spark provides a comprehensive, unified framework for managing big data
processing requirements across data sets that are diverse in nature (text data, graph data, etc.)
as well as in source (batch vs. real-time streaming data).
8) HBase -
HBase is a column-oriented database management system that runs on top of HDFS. It is well suited
for sparse data sets, which are common in many big data use cases.
9) Lucene/Solr -
Lucene is a widely used open source library for indexing large blocks of unstructured text; Solr is a
search server built on top of it
Performance Testing
Performance testing for Big Data includes:
Data ingestion and throughput: In this stage, the tester verifies how fast the system can
consume data from various data sources. Testing involves identifying how many messages the
queue can process in a given time frame. It also includes how quickly data can be inserted into the
underlying data store, for example the insertion rate into a MongoDB or Cassandra database.
Data processing: It involves verifying the speed with which queries or MapReduce jobs are
executed. It also includes testing the data processing in isolation when the underlying data store
is populated with the data sets, for example running MapReduce jobs on the underlying
HDFS.
Sub-component performance: These systems are made up of multiple components, and it is
essential to test each of these components in isolation, for example how quickly messages are
indexed and consumed, MapReduce jobs, query performance, search, etc.
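Ingestion throughput can be measured by timing how long a batch of messages takes to reach the data store. The sketch below is a stand-in only: a Python list plays the role of the store, and `measure_throughput`, the `insert` callable, and the message format are hypothetical. For a real test, `insert` would wrap the client of the system under test.

```python
import time


def measure_throughput(insert, messages):
    """Insert every message and return (count, seconds, messages_per_second)."""
    start = time.perf_counter()
    count = 0
    for message in messages:
        insert(message)
        count += 1
    elapsed = time.perf_counter() - start
    rate = count / elapsed if elapsed > 0 else float("inf")
    return count, elapsed, rate


store = []  # stand-in for the underlying data store (assumption)
messages = ({"id": i, "payload": "x" * 100} for i in range(50_000))

count, elapsed, rate = measure_throughput(store.append, messages)
print(f"inserted {count} messages in {elapsed:.4f}s ({rate:,.0f} msg/s)")
```

Because the messages come from a generator, the measured time includes message creation; pre-building the batch isolates the insert cost if that distinction matters for the test.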
Parameters for Performance Testing
Various parameters to be verified for performance testing are:
Data storage: How data is stored across different nodes
Commit logs: How large the commit log is allowed to grow
Concurrency: How many threads can perform write and read operations
Caching: Tuning of the cache settings "row cache" and "key cache"
Timeouts: Values for connection timeout, query timeout, etc.
JVM parameters: Heap size, GC algorithms, etc.
MapReduce performance: Sort, merge, etc.
Message queue: Message rate, size, etc.
• Installation Testing - Quality assurance work that focuses on
what customers need to do to install and set up the new big data application successfully.
The testing process may cover full, partial, or upgrade install/uninstall processes.
• End-to-End Test Environment Operational Testing - Provides complete end-to-end testing of the
application, verifying the process from the first phase, i.e., when data is fetched into the
data lake, to the last phase, i.e., output validation of machine learning algorithms.
• Backup and Restore Testing - Ensures that backup nodes are working correctly, that the
cluster is properly configured for backup nodes, and that all nodes in the cluster interact
with each other as expected.
• Failover Testing - Makes sure that in the case of failure, backup nodes come into action and
the system resumes its work as before with the help of the backup node, without much
degradation in performance.
• Recovery Testing - Makes sure that information can be recovered easily from a backup node in
case of failure of a data node.
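Failover and recovery checks follow a common pattern: write data, take a node down, then assert that reads still succeed and that no records were lost. The toy `Node` and `ReplicatedStore` classes below are purely hypothetical and keep everything in process memory; in a real test the failure would be injected into an actual cluster node.

```python
class NodeDown(Exception):
    """Raised when a request reaches a node that has been stopped."""


class Node:
    def __init__(self, name):
        self.name = name
        self.data = {}
        self.alive = True

    def put(self, key, value):
        if not self.alive:
            raise NodeDown(self.name)
        self.data[key] = value

    def get(self, key):
        if not self.alive:
            raise NodeDown(self.name)
        return self.data[key]


class ReplicatedStore:
    """Writes go to every live node; reads fall back to the next live node."""

    def __init__(self, nodes):
        self.nodes = nodes

    def put(self, key, value):
        for node in self.nodes:
            if node.alive:
                node.put(key, value)

    def get(self, key):
        for node in self.nodes:
            try:
                return node.get(key)
            except NodeDown:
                continue  # fail over to the next replica
        raise NodeDown("all replicas are down")


primary, backup = Node("primary"), Node("backup")
store = ReplicatedStore([primary, backup])
for i in range(100):
    store.put(f"row-{i}", i)

primary.alive = False  # inject the failure
recovered = [store.get(f"row-{i}") for i in range(100)]
print(recovered == list(range(100)))
```

The same structure extends to the performance side of failover testing by timing the reads before and after the failure is injected.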
Zycus
 
Orca: Nocode Graphical Editor for Container Orchestration
Orca: Nocode Graphical Editor for Container OrchestrationOrca: Nocode Graphical Editor for Container Orchestration
Orca: Nocode Graphical Editor for Container Orchestration
Pedro J. Molina
 
Hands-on with Apache Druid: Installation & Data Ingestion Steps
Hands-on with Apache Druid: Installation & Data Ingestion StepsHands-on with Apache Druid: Installation & Data Ingestion Steps
Hands-on with Apache Druid: Installation & Data Ingestion Steps
servicesNitor
 
如何办理(hull学位证书)英国赫尔大学毕业证硕士文凭原版一模一样
如何办理(hull学位证书)英国赫尔大学毕业证硕士文凭原版一模一样如何办理(hull学位证书)英国赫尔大学毕业证硕士文凭原版一模一样
如何办理(hull学位证书)英国赫尔大学毕业证硕士文凭原版一模一样
gapen1
 

Recently uploaded (20)

Safelyio Toolbox Talk Softwate & App (How To Digitize Safety Meetings)
Safelyio Toolbox Talk Softwate & App (How To Digitize Safety Meetings)Safelyio Toolbox Talk Softwate & App (How To Digitize Safety Meetings)
Safelyio Toolbox Talk Softwate & App (How To Digitize Safety Meetings)
 
Flutter vs. React Native: A Detailed Comparison for App Development in 2024
Flutter vs. React Native: A Detailed Comparison for App Development in 2024Flutter vs. React Native: A Detailed Comparison for App Development in 2024
Flutter vs. React Native: A Detailed Comparison for App Development in 2024
 
bgiolcb
bgiolcbbgiolcb
bgiolcb
 
What is Continuous Testing in DevOps - A Definitive Guide.pdf
What is Continuous Testing in DevOps - A Definitive Guide.pdfWhat is Continuous Testing in DevOps - A Definitive Guide.pdf
What is Continuous Testing in DevOps - A Definitive Guide.pdf
 
Unlock the Secrets to Effortless Video Creation with Invideo: Your Ultimate G...
Unlock the Secrets to Effortless Video Creation with Invideo: Your Ultimate G...Unlock the Secrets to Effortless Video Creation with Invideo: Your Ultimate G...
Unlock the Secrets to Effortless Video Creation with Invideo: Your Ultimate G...
 
The Power of Visual Regression Testing_ Why It Is Critical for Enterprise App...
The Power of Visual Regression Testing_ Why It Is Critical for Enterprise App...The Power of Visual Regression Testing_ Why It Is Critical for Enterprise App...
The Power of Visual Regression Testing_ Why It Is Critical for Enterprise App...
 
A Comprehensive Guide on Implementing Real-World Mobile Testing Strategies fo...
A Comprehensive Guide on Implementing Real-World Mobile Testing Strategies fo...A Comprehensive Guide on Implementing Real-World Mobile Testing Strategies fo...
A Comprehensive Guide on Implementing Real-World Mobile Testing Strategies fo...
 
Beginner's Guide to Observability@Devoxx PL 2024
Beginner's  Guide to Observability@Devoxx PL 2024Beginner's  Guide to Observability@Devoxx PL 2024
Beginner's Guide to Observability@Devoxx PL 2024
 
DECODING JAVA THREAD DUMPS: MASTER THE ART OF ANALYSIS
DECODING JAVA THREAD DUMPS: MASTER THE ART OF ANALYSISDECODING JAVA THREAD DUMPS: MASTER THE ART OF ANALYSIS
DECODING JAVA THREAD DUMPS: MASTER THE ART OF ANALYSIS
 
42 Ways to Generate Real Estate Leads - Sellxpert
42 Ways to Generate Real Estate Leads - Sellxpert42 Ways to Generate Real Estate Leads - Sellxpert
42 Ways to Generate Real Estate Leads - Sellxpert
 
Optimizing Your E-commerce with WooCommerce.pptx
Optimizing Your E-commerce with WooCommerce.pptxOptimizing Your E-commerce with WooCommerce.pptx
Optimizing Your E-commerce with WooCommerce.pptx
 
The Role of DevOps in Digital Transformation.pdf
The Role of DevOps in Digital Transformation.pdfThe Role of DevOps in Digital Transformation.pdf
The Role of DevOps in Digital Transformation.pdf
 
Alluxio Webinar | 10x Faster Trino Queries on Your Data Platform
Alluxio Webinar | 10x Faster Trino Queries on Your Data PlatformAlluxio Webinar | 10x Faster Trino Queries on Your Data Platform
Alluxio Webinar | 10x Faster Trino Queries on Your Data Platform
 
Migration From CH 1.0 to CH 2.0 and Mule 4.6 & Java 17 Upgrade.pptx
Migration From CH 1.0 to CH 2.0 and  Mule 4.6 & Java 17 Upgrade.pptxMigration From CH 1.0 to CH 2.0 and  Mule 4.6 & Java 17 Upgrade.pptx
Migration From CH 1.0 to CH 2.0 and Mule 4.6 & Java 17 Upgrade.pptx
 
Superpower Your Apache Kafka Applications Development with Complementary Open...
Superpower Your Apache Kafka Applications Development with Complementary Open...Superpower Your Apache Kafka Applications Development with Complementary Open...
Superpower Your Apache Kafka Applications Development with Complementary Open...
 
Photoshop Tutorial for Beginners (2024 Edition)
Photoshop Tutorial for Beginners (2024 Edition)Photoshop Tutorial for Beginners (2024 Edition)
Photoshop Tutorial for Beginners (2024 Edition)
 
How GenAI Can Improve Supplier Performance Management.pdf
How GenAI Can Improve Supplier Performance Management.pdfHow GenAI Can Improve Supplier Performance Management.pdf
How GenAI Can Improve Supplier Performance Management.pdf
 
Orca: Nocode Graphical Editor for Container Orchestration
Orca: Nocode Graphical Editor for Container OrchestrationOrca: Nocode Graphical Editor for Container Orchestration
Orca: Nocode Graphical Editor for Container Orchestration
 
Hands-on with Apache Druid: Installation & Data Ingestion Steps
Hands-on with Apache Druid: Installation & Data Ingestion StepsHands-on with Apache Druid: Installation & Data Ingestion Steps
Hands-on with Apache Druid: Installation & Data Ingestion Steps
 
如何办理(hull学位证书)英国赫尔大学毕业证硕士文凭原版一模一样
如何办理(hull学位证书)英国赫尔大学毕业证硕士文凭原版一模一样如何办理(hull学位证书)英国赫尔大学毕业证硕士文凭原版一模一样
如何办理(hull学位证书)英国赫尔大学毕业证硕士文凭原版一模一样
 

Big Data Testing Approach - Rohit Kharabe

  • 2. [Diagram: big data testing pipeline] Data Source (RDBMS, web logs, social media, etc.) → Data Lake (HDFS cluster) → Refined Data (HDFS cluster) → Enterprise Data Warehouse (data factory for query and analysis) → BI. Validation points: (1) Data Stage Validation; (2) "MapReduce" Process Validation (clustering, data aggregation or segregation rules, key-value pair validation); (3) Algorithm / Output Validation; (4) Report / Business Requirements Validation. Other labels in the diagram: Python; Data Integration and Refinement; Data Synthesis; Structured Data; ETL; Data Preparation; ETL Process Validation.
  • 3. Data Staging Validation
       The first stage validates the staging process:
       1) Validate data from sources such as RDBMS, web logs, and social media to make sure that the correct data is pulled into the system.
       2) Compare the source data with the data pushed into the Hadoop system to make sure they match.
       3) Verify that the right data is extracted and loaded into the correct HDFS location.
       4) Tools such as Talend and Datameer can be used for data staging validation.
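The source-versus-staged comparison in step 2 can be sketched in a few lines of Python. This is a minimal illustration only: the function names `record_fingerprint` and `compare_datasets` and the in-memory row lists are assumptions made for the example, not part of Talend, Datameer, or Hadoop. In practice the rows would be read from the source system and from the HDFS staging location.

```python
import hashlib
from collections import Counter


def record_fingerprint(record):
    """Return a stable SHA-256 digest for one record (a sequence of field values)."""
    # The unit separator keeps ("ab", "c") distinct from ("a", "bc").
    joined = "\x1f".join(str(field) for field in record)
    return hashlib.sha256(joined.encode("utf-8")).hexdigest()


def compare_datasets(source_rows, staged_rows):
    """Compare two row collections by count and by per-record fingerprint."""
    source = Counter(record_fingerprint(row) for row in source_rows)
    staged = Counter(record_fingerprint(row) for row in staged_rows)
    missing = sum((source - staged).values())      # in source, not staged
    unexpected = sum((staged - source).values())   # staged, not in source
    return {
        "source_count": sum(source.values()),
        "staged_count": sum(staged.values()),
        "missing": missing,
        "unexpected": unexpected,
        "match": missing == 0 and unexpected == 0,
    }


# Hypothetical sample data standing in for an RDBMS extract and its staged copy.
source_rows = [(1, "alice"), (2, "bob"), (3, "carol")]
staged_rows = [(1, "alice"), (3, "carol"), (4, "dave")]
print(compare_datasets(source_rows, staged_rows))
```

Counting fingerprints with `Counter` rather than a set means duplicated rows are compared too, so a record loaded twice shows up as unexpected.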
  • 4. "MapReduce" Validation
       The second stage is the validation of "MapReduce". In this stage, the tester verifies the business logic on every node and then validates it again after running against multiple nodes, ensuring that:
       1) The MapReduce process works correctly.
       2) Data aggregation or segregation rules are implemented on the data.
       3) Key-value pairs are generated.
       4) The data is valid after the MapReduce process completes.
       Big data tools used in this stage include Hadoop, Spark, Hive, Pig, Cascading, Oozie, Kafka, S4, MapR, and Flume.
  • 5. Testing Methods for Hadoop MapReduce Processes
       1) MRUnit - Unit testing for MR jobs: MRUnit lets users define the key-value pairs passed to the map and reduce functions, and it tests that each of these functions emits the correct key-value pairs.
       2) Local job runner testing - Running MR jobs on a single machine in a single JVM: The local job runner lets you run Hadoop on a local machine, in one JVM, which makes a failing MR job a little easier to debug.
       3) Pseudo-distributed testing - Running MR jobs on a single machine using Hadoop: A pseudo-distributed cluster is a single machine running all Hadoop daemons. This cluster is still relatively easy to manage (though harder than the local job runner) and tests integration with Hadoop better than the local job runner does.
       4) Full integration testing - Running MR jobs on a QA cluster: The MR jobs are run on a QA cluster composed of at least a few machines. Running your MR jobs on a QA cluster tests all aspects of both the job and its integration with Hadoop.
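The idea behind unit testing map and reduce functions (feed in known key-value pairs, check the emitted key-value pairs) can be shown with a small Python analogue. MRUnit itself is a Java library, so this sketch does not use it; the word-count functions `map_words` and `reduce_counts` and the helper `run_local` are hypothetical names chosen for the example, and `run_local` only imitates the shuffle step in memory.

```python
from collections import defaultdict


def map_words(_key, line):
    """Mapper: emit (word, 1) for every word in the input line."""
    for word in line.lower().split():
        yield word, 1


def reduce_counts(word, counts):
    """Reducer: emit (word, total) for all counts grouped under one word."""
    yield word, sum(counts)


def run_local(mapper, reducer, records):
    """Run mapper, group values by key, then run reducer, all in one process."""
    groups = defaultdict(list)
    for key, value in records:
        for out_key, out_value in mapper(key, value):
            groups[out_key].append(out_value)
    results = []
    for out_key in sorted(groups):  # sorted keys give a deterministic output order
        results.extend(reducer(out_key, groups[out_key]))
    return results


# Unit-level checks: known input pairs, expected emitted pairs.
assert list(map_words(0, "Big data big")) == [("big", 1), ("data", 1), ("big", 1)]
assert list(reduce_counts("big", [1, 1])) == [("big", 2)]

# End-to-end check on the in-memory runner.
records = [(0, "big data"), (1, "Big tests")]
print(run_local(map_words, reduce_counts, records))
```

Testing the mapper and reducer separately first, and the combined run second, mirrors the progression from unit testing to local job runner testing described above.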
  • 6. Output Validation Phase
       The third stage of big data testing is the output validation process. The output data files have been generated and are ready to be moved to an EDW (Enterprise Data Warehouse) or any other system, depending on the requirement. Activities in the third stage include:
       1) Checking that the transformation rules are applied correctly.
       2) Checking data integrity and the successful load of data into the target system.
       3) Checking that there is no data corruption by comparing the target data with the HDFS file system data.
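The three checks above can be combined into one comparison: apply the expected transformation rule to the HDFS-side records and compare the result with what was loaded into the target. The sketch below assumes records are available as Python dictionaries keyed by `id`; the rule in `transform` (upper-casing a country code and converting cents to currency units) is a made-up example, not a rule from the slides.

```python
def transform(record):
    """Hypothetical transformation rule expected between HDFS and the target."""
    return {
        "id": record["id"],
        "country": record["country"].strip().upper(),
        "amount": record["amount_cents"] / 100,
    }


def validate_output(hdfs_records, target_records, rule):
    """Report ids that are missing, unexpected, or transformed incorrectly."""
    expected = {rec["id"]: rule(rec) for rec in hdfs_records}
    actual = {rec["id"]: rec for rec in target_records}
    shared = expected.keys() & actual.keys()
    return {
        "missing_in_target": sorted(expected.keys() - actual.keys()),
        "unexpected_in_target": sorted(actual.keys() - expected.keys()),
        "mismatched": sorted(k for k in shared if expected[k] != actual[k]),
    }


# Hypothetical sample data: record 2 was loaded with the rule not applied,
# record 3 was never loaded.
hdfs_records = [
    {"id": 1, "country": " us", "amount_cents": 1250},
    {"id": 2, "country": "de", "amount_cents": 300},
    {"id": 3, "country": "fr", "amount_cents": 500},
]
target_records = [
    {"id": 1, "country": "US", "amount": 12.5},
    {"id": 2, "country": "de", "amount": 3.0},
]
print(validate_output(hdfs_records, target_records, transform))
```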
  • 7. Report Validation
       In this final phase of testing, reports are checked against the target data warehouse. The report data should match the data in the target data warehouse.
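One way to check a report against the warehouse is to recompute the report's aggregates from warehouse rows and list any figure that differs. The sketch below assumes a report that shows a sales total per region; the function names, the `(region, amount)` row shape, and the tolerance value are illustrative assumptions.

```python
import math
from collections import defaultdict


def aggregate_sales(warehouse_rows):
    """Sum the amount per region from (region, amount) warehouse rows."""
    totals = defaultdict(float)
    for region, amount in warehouse_rows:
        totals[region] += amount
    return dict(totals)


def report_discrepancies(report_totals, warehouse_rows, rel_tol=1e-9):
    """Return {region: (report_value, warehouse_value)} for every mismatch."""
    expected = aggregate_sales(warehouse_rows)
    problems = {}
    for region in sorted(set(report_totals) | set(expected)):
        shown = report_totals.get(region)
        actual = expected.get(region)
        if shown is None or actual is None or not math.isclose(shown, actual, rel_tol=rel_tol):
            problems[region] = (shown, actual)
    return problems


# Hypothetical figures: the report overstates the "south" total.
warehouse_rows = [("north", 100.0), ("north", 50.0), ("south", 75.0)]
report_totals = {"north": 150.0, "south": 80.0}
print(report_discrepancies(report_totals, warehouse_rows))
```

The comparison uses `math.isclose` because totals computed by summing floating-point values can differ from report figures in the last digits without indicating a real defect.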
  • 9. Tools for validating pre-Hadoop processing
       1) Apache Flume - Used by enterprises to ingest log files from application servers or other systems.
       2) Apache Sqoop - Used to import data from a MySQL or Oracle database into HDFS.
       3) Hive - A tool that structures data in Hadoop into relational-like tables and allows queries using a subset of SQL. It is an infrastructure that provides various tools for easy extraction, transformation, and loading of data. Hive also allows its users to embed customized mappers and reducers.
       4) Apache Pig - Provides an alternative language to SQL, called Pig Latin, for querying data stored in HDFS.
       5) NoSQL - Lets users store and retrieve data with all the features of a NoSQL database. Some of the NoSQL databases available are CouchDB, MongoDB, Cassandra, Redis, ZooKeeper, and HBase.
  • 10. 6) MapR-FS - A POSIX file system that provides distributed, reliable, high-performance, scalable, full read/write data storage for the MapR Converged Data Platform. MapR-FS supports the HDFS API, fast NFS access, access controls (MapR ACEs), and transparent data compression.
       7) Apache Spark - An open source big data processing framework built around speed, ease of use, and sophisticated analytics. Spark provides a comprehensive, unified framework for managing big data processing requirements across data sets that are diverse both in nature (text data, graph data, etc.) and in source (batch vs. real-time streaming data).
       8) HBase - A column-oriented database management system that runs on top of HDFS. It is well suited for sparse data sets, which are common in many big data use cases.
       9) Lucene/Solr - Lucene is a widely used open source tool for indexing large blocks of unstructured text.
  • 11. Performance Testing
       Performance testing for big data includes:
       Data ingestion and throughput: In this stage, the tester verifies how fast the system can consume data from various data sources. Testing involves identifying how many messages the queue can process in a given time frame. It also covers how quickly data can be inserted into the underlying data store, for example the insertion rate into a MongoDB or Cassandra database.
       Data processing: This involves verifying the speed with which queries or MapReduce jobs are executed. It also includes testing the data processing in isolation once the underlying data store is populated with the data sets, for example by running MapReduce jobs on the underlying HDFS.
       Sub-component performance: These systems are made up of multiple components, and it is essential to test each of these components in isolation, for example how quickly messages are indexed and consumed, MapReduce jobs, query performance, search, etc.
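The insertion-rate measurement described under data ingestion can be sketched as a timing loop around whatever call writes one message to the data store. In the example below the function name `measure_ingestion` is an assumption, and a plain Python list stands in for the real store; with MongoDB or Cassandra, `insert` would be replaced by the client's own write call.

```python
import time


def measure_ingestion(insert, messages):
    """Time how long `insert` takes to consume every message and derive a rate."""
    start = time.perf_counter()
    count = 0
    for message in messages:
        insert(message)
        count += 1
    elapsed = time.perf_counter() - start
    # Guard against a zero interval on very small inputs.
    rate = count / elapsed if elapsed > 0 else float("inf")
    return {"messages": count, "seconds": elapsed, "per_second": rate}


# Stand-in data store: an in-memory list. A real test would write to the
# actual queue or database under test.
store = []
stats = measure_ingestion(store.append, ({"id": i} for i in range(10_000)))
print(f"{stats['messages']} messages at {stats['per_second']:.0f} msg/s")
```

A single run like this only gives one sample; repeating the measurement and varying message size and concurrency would give a more representative picture.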
  • 12. Parameters for Performance Testing
       Various parameters to be verified for performance testing are:
       Data storage: How data is stored in different nodes
       Commit logs: How large the commit log is allowed to grow
       Concurrency: How many threads can perform write and read operations
       Caching: Tuning of the cache settings "row cache" and "key cache"
       Timeouts: Values for connection timeout, query timeout, etc.
       JVM parameters: Heap size, GC algorithms, etc.
       MapReduce performance: Sorts, merges, etc.
       Message queue: Message rate, size, etc.
  • 13. - Installation Testing: Installation testing is a kind of quality assurance work that focuses on what customers will need to do to install and set up the new big data application successfully. The testing process may involve full, partial, or upgrade install/uninstall processes.
       - End-to-End Test Environment Operational Testing: Provides complete end-to-end testing of the application. It verifies the process from the first phase, when data is fetched into the data lake, to the last phase, the output validation of machine learning algorithms.
       - Backup and Restore Testing: Ensures that the backup nodes are working correctly, that the cluster is properly configured for backup nodes, and that all the nodes in the cluster interact with each other as expected.
       - Failover Testing: Makes sure that, in the case of a failure, the backup nodes come into action and the system resumes its work with the help of the backup node without much degradation in performance.
       - Recovery Testing: Makes sure that information can be recovered easily from the backup node if a data node fails.

Editor's Notes

  1. A data lake is a storage repository that holds a vast amount of raw data in its native format until it is needed. While a hierarchical data warehouse stores data in files or folders, a data lake uses a flat architecture to store data. Each data element in a lake is assigned a unique identifier and tagged with a set of extended metadata tags. When a business question arises, the data lake can be queried for relevant data, and that smaller set of data can then be analyzed to help answer the question.
  2. Apache Flume is a distributed, reliable, and available service for efficiently collecting, aggregating, and moving large amounts of streaming data into the Hadoop Distributed File System (HDFS).