Webinar
Testing Big Data:
Automated ETL Testing of Hadoop

Laura Poggi
Marketing Manager
RTTS

Bill Hayduk
CEO/President
RTTS

Jeff Bocarsly, Ph.D.
Chief Architect
RTTS

built by
Today’s Agenda
• About Big Data and Hadoop
• Data Warehouse refresher
• Hadoop and DWH Use Case
• How to test Big Data
• Demo of QuerySurge & Hadoop

Topic: Testing Big Data: Automated ETL Testing of Hadoop
Host: RTTS
Date: Thursday, January 30, 2014
Time: 1:00 pm, Eastern Standard Time (New York, GMT-05:00)
Session number: 630 771 732
About RTTS
RTTS is the leading provider of software quality for critical business systems

FACTS
Founded: 1996
Primary Focus: consulting services, software
Locations: New York, Atlanta, Philly, Phoenix
Geographic region: North America
Customer profile: Fortune 1000, > 600 clients
Software: QuerySurge
Facebook handles 300 million photos a day and about 105 terabytes of data every 30 minutes.
- TechCrunch

The big data market will grow from $3.2 billion in 2010 to $32.4 billion in 2017.
- Research Firm IDC

65% of…advanced analytics will have Hadoop embedded (in them) by 2015.
- Gartner
about Big Data
Big data – defined as too much volume, velocity and variety to work on normal database architectures.
Size
Defined as 5 petabytes or more
1 petabyte = 1,000 terabytes
1,000 terabytes = 1,000,000 gigabytes
1,000,000 gigabytes = 1,000,000,000 megabytes
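
The unit chain above can be checked with a few lines of arithmetic. This is only a sketch: it assumes the decimal (factor of 1,000) units used on the slide, and the function names are made up for illustration.

```python
# Decimal (SI) storage units, as on the slide: each step is a factor of 1,000.
TB_PER_PB = 1_000
GB_PER_TB = 1_000
MB_PER_GB = 1_000

# The slide's size threshold for "big data" (5 petabytes).
BIG_DATA_THRESHOLD_PB = 5


def petabytes_to_units(pb):
    """Express a petabyte count in terabytes, gigabytes and megabytes."""
    tb = pb * TB_PER_PB
    gb = tb * GB_PER_TB
    mb = gb * MB_PER_GB
    return {"TB": tb, "GB": gb, "MB": mb}


def is_big_data(size_tb):
    """True when a size given in terabytes meets the 5 PB threshold."""
    return size_tb >= BIG_DATA_THRESHOLD_PB * TB_PER_PB


units = petabytes_to_units(BIG_DATA_THRESHOLD_PB)
print(units)
```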

What is Hadoop?

Hadoop is an open source project that develops software for scalable, distributed computing.
• Hadoop is a framework for the distributed processing of large data sets across clusters of computers using simple programming models.
• Hadoop easily deals with the complexities of high volume, velocity and variety of data.
• Hadoop scales from single servers to 1,000’s of machines, each offering local computation and storage.
• Hadoop detects and handles failures at the application layer.
Key Attributes of Hadoop
• Redundant and reliable
• Extremely powerful
• Easy to program distributed apps
• Runs on commodity hardware
Basic Hadoop Architecture
MapReduce – processing part that manages the programming jobs. (a.k.a. Task Tracker)
HDFS (Hadoop Distributed File System) – stores data on the machines. (a.k.a. Data Node)

[Diagram: a single machine running MapReduce (Task Tracker) and HDFS (Data Node)]
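
To make the MapReduce idea concrete, here is a minimal single-process sketch of the map, shuffle and reduce steps, using a word count as the job. It does not use any Hadoop API; the function names are hypothetical, and a real cluster would spread the map and reduce calls across Task Trackers.

```python
from collections import defaultdict


def map_phase(line):
    """Map step: emit a (word, 1) pair for every word in one input line."""
    return [(word.lower(), 1) for word in line.split()]


def shuffle(pairs):
    """Group the emitted values by key, as happens between the two phases."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(key, values):
    """Reduce step: collapse all values for one key into a single total."""
    return key, sum(values)


def word_count(lines):
    """Run map over every line, then reduce each group of values."""
    emitted = [pair for line in lines for pair in map_phase(line)]
    return dict(reduce_phase(k, v) for k, v in shuffle(emitted).items())


counts = word_count(["big data big tests", "data tests data"])
print(counts)
```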

Basic Hadoop Architecture (continued)
Cluster

Add more machines for scaling – from 1 to 100 to 1,000
Job Tracker accepts jobs, assigns tasks, identifies failed machines

[Diagram: a cluster of 12 machines, each running a Task Tracker and a Data Node]

Name Node
Coordination for HDFS. Inserts and extractions are communicated through the Name Node.
Apache Hive
Apache Hive - a data warehouse infrastructure built on top of Hadoop for providing data summarization, query, and analysis.
Hive provides a mechanism to query the data using a SQL-like language called HiveQL that interacts with the HDFS files.

HiveQL statements:
• create
• insert
• update
• delete
• select

[Diagram: HiveQL statements pass through MapReduce (Task Tracker) to the files in HDFS (Data Node)]
Data Warehouse Review

about Data Warehouses…
Data Warehouse
• typically a relational database that is designed for query and analysis rather than for transaction processing
• a place where historical data is stored for archival, analysis and security purposes
• contains either raw data or formatted data
• combines data from multiple sources:
  • sales
  • salaries
  • operational data
  • human resource data
  • inventory data
  • web logs
  • social networks
  • Internet text and docs
  • other
Data Warehousing: the ETL process
ETL = Extract, Transform, Load
Why ETL?
The data warehouse needs to be loaded regularly (daily/weekly) so that it can serve its purpose of facilitating business analysis.
Extract – data is extracted from one or more OLTP systems and copied into the warehouse.
Transform – removing inconsistencies, assembling to a common format, adding missing fields, summarizing detailed data and deriving new fields to store calculated data.
Load – map the data and load it into the DWH.
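
The three steps can be sketched end to end in a few lines. This is an illustration only: it uses two in-memory SQLite databases to stand in for the OLTP source and the DWH, and the table names (`orders`, `fact_orders`), the columns and the `tier` rule are all hypothetical.

```python
import sqlite3


def extract(source):
    """Extract: read raw order rows from the (hypothetical) OLTP table."""
    return source.execute("SELECT id, customer, amount FROM orders").fetchall()


def transform(rows):
    """Transform: normalize names, fill a missing amount, derive a new field."""
    cleaned = []
    for order_id, customer, amount in rows:
        name = customer.strip().title()               # remove inconsistencies
        amount = 0.0 if amount is None else amount    # fill a missing value
        tier = "large" if amount >= 100 else "small"  # derived field
        cleaned.append((order_id, name, amount, tier))
    return cleaned


def load(target, rows):
    """Load: map the transformed rows into the warehouse table."""
    target.execute(
        "CREATE TABLE IF NOT EXISTS fact_orders "
        "(id INTEGER PRIMARY KEY, customer TEXT, amount REAL, tier TEXT)"
    )
    target.executemany("INSERT INTO fact_orders VALUES (?, ?, ?, ?)", rows)
    target.commit()


# Stand-in OLTP source with deliberately messy data.
source = sqlite3.connect(":memory:")
source.execute("CREATE TABLE orders (id INTEGER, customer TEXT, amount REAL)")
source.executemany(
    "INSERT INTO orders VALUES (?, ?, ?)",
    [(1, "  alice smith", 250.0), (2, "BOB JONES ", None)],
)

# Stand-in warehouse.
warehouse = sqlite3.connect(":memory:")
load(warehouse, transform(extract(source)))
loaded = warehouse.execute("SELECT * FROM fact_orders ORDER BY id").fetchall()
print(loaded)
```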

Data Warehouse – the ETL process
[Diagram: Source Data (Legacy DB, CRM/ERP DB, Finance DB) -> ETL Process (Extract, Transform, Load) -> Target DWH]
Data Warehouse & Hadoop:
A Use Case
DWH & Hadoop: A Use Case

USE CASE***
Use Hadoop as a landing zone for big data & raw data
1) bring all raw, big data into Hadoop
2) perform some pre-processing of this data
3) determine which data goes to EDWH
4) extract, transform and load (ETL) pertinent data into EDWH

***Source: Vijay Ramaiah, IBM product manager, datanami magazine, June 10, 2013
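
The four steps of the use case can be walked through with a toy example. Everything here is assumed for illustration: the feed format, the "sales only" selection rule and the cents-to-dollars transform are made up, and plain Python lists stand in for Hadoop and the EDWH.

```python
# Hypothetical raw feed; in the use case this would be files landed in Hadoop.
RAW_FEED = [
    "2014-01-30,web,click,42",
    "2014-01-30,pos,sale,1999",
    "malformed line",
    "2014-01-30,pos,sale,250",
]


def land(raw_lines):
    """Step 1: bring all raw data into the landing zone unchanged."""
    return list(raw_lines)


def preprocess(landed):
    """Step 2: parse records and drop lines that do not have four fields."""
    records = []
    for line in landed:
        parts = line.split(",")
        if len(parts) == 4:
            date, source, event, value = parts
            records.append({"date": date, "source": source,
                            "event": event, "value": int(value)})
    return records


def select_for_edwh(records):
    """Step 3: decide which data goes on to the warehouse (here: sales only)."""
    return [r for r in records if r["event"] == "sale"]


def etl_to_edwh(records):
    """Step 4: convert cents to dollars and shape rows for an in-memory EDWH."""
    return [(r["date"], r["source"], r["value"] / 100) for r in records]


landing_zone = land(RAW_FEED)
edwh_rows = etl_to_edwh(select_for_edwh(preprocess(landing_zone)))
print(len(landing_zone), edwh_rows)
```

Note that the landing zone keeps all four raw lines, including the malformed one, while only the two sale records reach the warehouse.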

DWH & Hadoop: A Use Case
Use case data flow
[Diagram: Source Data -> Hadoop -> ETL Process -> Target DWH]
Testing Big Data

Testing Big Data: Entry Points
Recommended functional test strategy: Test every entry point in the
system (feeds, databases, internal messaging, front-end transactions).

The goal: provide rapid localization of data issues between points

[Diagram: Source Data -> Hadoop -> ETL Process -> Target DWH -> BI, with two test entry points marked along the flow]
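
One way to picture "rapid localization" is to compare snapshots taken at adjacent entry points and stop at the first hop where data goes missing or changes. The sketch below assumes, for simplicity, that a keyed value should pass through every stage unchanged; the stage names and snapshot data are hypothetical.

```python
def diff_stage(upstream, downstream):
    """Return ids that are missing or whose value changed between two stages."""
    missing = sorted(set(upstream) - set(downstream))
    changed = sorted(k for k in upstream
                     if k in downstream and upstream[k] != downstream[k])
    return {"missing": missing, "changed": changed}


def localize(stages):
    """Walk adjacent entry points; report the first hop that loses or alters data."""
    for (up_name, up), (down_name, down) in zip(stages, stages[1:]):
        result = diff_stage(up, down)
        if result["missing"] or result["changed"]:
            return f"{up_name} -> {down_name}", result
    return None, {"missing": [], "changed": []}


# Hypothetical snapshots (id -> amount) captured at each entry point.
stages = [
    ("source", {1: 10.0, 2: 20.0, 3: 30.0}),
    ("hadoop", {1: 10.0, 2: 20.0, 3: 30.0}),
    ("dwh",    {1: 10.0, 2: 25.0}),
]

hop, issues = localize(stages)
print(hop, issues)
```

Because the source and Hadoop snapshots agree, the problem is pinned to the hop between Hadoop and the DWH, which is the ETL step in the flow above.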

Testing Big Data: 3 Big Issues
- We need to verify more data and to do it faster
- We need to automate the testing effort
- We need to be able to test across different platforms

We need a testing tool!
About QuerySurge

What is QuerySurge?
QuerySurge is the premier test tool built to automate Data Warehouse testing and the ETL Testing Process
What does QuerySurge™ do?
QuerySurge finds bad data
• Most firms test < 1% of their data
• BI apps sit on top of DWHs that have, at best, untested data & at worst, bad data
• CEOs, CFOs, CTOs, executives rely on BI apps to make strategic decisions
• Bad data will cause execs to make decisions that will cost them $millions
• QuerySurge tests up to 100% of your data quickly & finds bad data
QuerySurge Roles & Uses
Testers
- functional testing
- regression testing

ETL Developers
- unit testing

Data Analysts
- review, analyze data
- verify mappings and failures

Operations teams
- monitoring
QuerySurge™ Architecture

[Diagram: QuerySurge architecture, connecting Sources and Target]
QuerySurge™ Modules
Design Library
• Create Query Pairs (source & target queries)

Scheduling
• Build groups of Query Pairs
• Schedule Test Runs
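
The Query Pair idea (one query against the source, one against the target, results compared) can be sketched as follows. This is not QuerySurge's implementation, which the deck does not describe; it is a minimal illustration using two in-memory SQLite databases, and the table names, queries and `run_query_pair` helper are hypothetical.

```python
import sqlite3
from collections import Counter


def run_query_pair(source_conn, source_sql, target_conn, target_sql):
    """Run both queries and return the rows found on only one side.

    Counter (a multiset) is used so that duplicate rows are not silently ignored.
    """
    source_rows = Counter(source_conn.execute(source_sql).fetchall())
    target_rows = Counter(target_conn.execute(target_sql).fetchall())
    return {
        "only_in_source": sorted((source_rows - target_rows).elements()),
        "only_in_target": sorted((target_rows - source_rows).elements()),
    }


# Stand-in source system.
source = sqlite3.connect(":memory:")
source.execute("CREATE TABLE customers (id INTEGER, first TEXT, last TEXT)")
source.executemany("INSERT INTO customers VALUES (?, ?, ?)",
                   [(1, "Ada", "Lovelace"), (2, "Alan", "Turing")])

# Stand-in target warehouse, with one deliberately bad row.
target = sqlite3.connect(":memory:")
target.execute("CREATE TABLE dim_customer (id INTEGER, full_name TEXT)")
target.executemany("INSERT INTO dim_customer VALUES (?, ?)",
                   [(1, "Ada Lovelace"), (2, "Alan Turning")])

# The source query applies the expected transformation (name concatenation).
mismatches = run_query_pair(
    source, "SELECT id, first || ' ' || last FROM customers",
    target, "SELECT id, full_name FROM dim_customer",
)
print(mismatches)
```

An empty result on both sides would mean the pair passes; here the misspelled name in the target shows up as one row on each side.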
QuerySurge™ Modules (continued)
Run Dashboard
• View real-time execution
• Analyze real-time results

Deep-Dive Reporting
• Examine and automatically email test results
the QuerySurge solution…
verifies more data
• verifies upwards of 100% of all data quickly

automates the testing effort
• the kickoff, the tests, the comparison, emailing the results

tests across different platforms
• any JDBC-compliant db, DWH, DMart, flat file, XML, Hadoop

speeds up testing
• up to 1,000 times faster than manual testing
QuerySurge Value-Add
QuerySurge provides value by either:
• an increase in testing data coverage from < 1% to upwards of 100%
• a decrease in testing time by as much as 1,000 x
• a combination of an increase in test coverage and a decrease in testing time
Return on Investment (ROI)
• redeployment of head count because of an increase in coverage
• a savings over manual testing (minus queries, manual compares, other)
• an increase in better data due to a shorter / more thorough testing cycle, possibly saving $ millions by preventing key decisions made on bad data
Demonstration

Jeff Bocarsly, Ph.D.
Chief Architect
RTTS

Ensuring Data Warehouse Quality


  • 1. Webinar Testing Big Data: Automated ETL Testing of Hadoop Laura Poggi Marketing Manager RTTS Bill Hayduk CEO/President RTTS Jeff Bocarsly, Ph.D. Chief Architect RTTS built by
  • 2. Today’s Agenda • About Big Data and Hadoop • Data Warehouse refresher • Hadoop and DWH Use Case • How to test Big Data • Demo of QuerySurge & Hadoop AGENDA Topic: Testing Big Data: Automated ETL Testing of Hadoop Host: RTTS Date: Thursday, January 30, 2014 Time: 1:00 pm, Eastern Standard Time (New York, GMT-05:00) Session number: 630 771 732
  • 3. FACTS Founded: 1996 About Primary Focus: consulting services, software Locations: New York, Atlanta, Philly, Phoenix Geographic region: North America Customer profile: Fortune 1000, > 600 clients Software: RTTS is the leading provider of software quality for critical business systems
  • 4. Facebook handles 300 million photos a day and about 105 terabytes of data every 30 minutes. - TechCrunch The big data market will grow from $3.2 billion in 2010 to $32.4 billion in 2017. - Research Firm IDC 65% of…advanced analytics will have Hadoop embedded (in them) by 2015. - Gartner
  • 5. About Big Data Big data – defined as too much volume, velocity and variety to work on normal database architectures. Size: defined as 5 petabytes or more. 1 petabyte = 1,000 terabytes; 1,000 terabytes = 1,000,000 gigabytes; 1,000,000 gigabytes = 1,000,000,000 megabytes
  • 6. What is Hadoop? Hadoop is an open source project that develops software for scalable, distributed computing. • is a framework for distributed processing of large data sets across clusters of computers using simple programming models. • easily deals with complexities of high volume, velocity and variety of data • scales up from single servers to 1,000’s of machines, each offering local computation and storage. • detects and handles failures at the application layer
  • 7. Key Attributes of Hadoop • Redundant and reliable • Extremely powerful • Easy to program distributed apps • Runs on commodity hardware
  • 8. Basic Hadoop Architecture MapReduce – processing part that manages the programming jobs. (a.k.a. Task Tracker) HDFS (Hadoop Distributed File System) – stores data on the machines. (a.k.a. Data Node) Diagram labels: one machine with MapReduce (Task Tracker) and HDFS (Data Node)
  • 9. Basic Hadoop Architecture (continued) Cluster: add more machines for scaling – from 1 to 100 to 1,000. Job Tracker: accepts jobs, assigns tasks, identifies failed machines. Diagram: 12 machines, each with a Task Tracker and a Data Node. Name Node: coordination for HDFS. Inserts and extraction are communicated through the Name Node.
  • 10. Apache Hive Apache Hive - a data warehouse infrastructure built on top of Hadoop for providing data summarization, query, and analysis. Hive provides a mechanism to query the data using a SQL-like language called HiveQL that interacts with the HDFS files. HiveQL operations: • create • insert • update • delete • select Diagram labels: MapReduce (Task Tracker), HiveQL, HDFS (Data Node)
  • 12. About Data Warehouses… Data Warehouse • typically a relational database that is designed for query and analysis rather than for transaction processing • a place where historical data is stored for archival, analysis and security purposes • contains either raw data or formatted data • combines data from multiple sources: sales, salaries, operational data, human resource data, inventory data, web logs, social networks, Internet text and docs, other
  • 13. Data Warehousing: the ETL process ETL = Extract, Transform, Load Why ETL? The data warehouse needs to be loaded regularly (daily/weekly) so that it can serve its purpose of facilitating business analysis. Extract – data is extracted from one or more OLTP systems and copied into the warehouse. Transform – removing inconsistencies, assembling to a common format, adding missing fields, summarizing detailed data and deriving new fields to store calculated data. Load – map the data and load it into the DWH.
  • 14. Data Warehouse – the ETL process Diagram: Source Data (Legacy DB, CRM/ERP DB, Finance DB) → ETL Process (Extract, Transform, Load) → Target DWH
  • 15. Data Warehouse & Hadoop: A Use Case DWH Hadoop
  • 16. DWH & Hadoop: A Use Case USE CASE*** Use Hadoop as a landing zone for big data & raw data 1) bring all raw, big data into Hadoop 2) perform some pre-processing of this data 3) determine which data goes to EDWH 4) Extract, transform and load (ETL) pertinent data into EDWH ***Source: Vijay Ramaiah, IBM product manager, datanami magazine, June 10, 2013
  • 17. DWH & Hadoop: A Use Case Use case data flow. Diagram labels: Source Data; Source; ETL Process; Target DWH; ETL
  • 19. Testing Big Data: Entry Points Recommended functional test strategy: Test every entry point in the system (feeds, databases, internal messaging, front-end transactions). The goal: provide rapid localization of data issues between points. Diagram labels: test entry point (two instances); Source Data; Source; Hadoop; ETL Process; Target DWH; BI; ETL
  • 20. Testing Big Data: 3 Big Issues - We need to verify more data and to do it faster - We need to automate the testing effort - We need to be able to test across different platforms We need a testing tool!
  • 22. What is QuerySurge? QuerySurge is the premier test tool built to automate Data Warehouse testing and the ETL Testing Process
  • 23. What does QuerySurge™ do? QuerySurge finds bad data • Most firms test < 1% of their data • BI apps sit on top of DWHs that have, at best, untested data and, at worst, bad data • CEOs, CFOs, CTOs and other executives rely on BI apps to make strategic decisions • Bad data will cause execs to make decisions that will cost them $millions • QuerySurge tests up to 100% of your data quickly & finds bad data
  • 24. QuerySurge Roles & Uses Testers - functional testing - regression testing. ETL Developers - unit testing. Data Analysts - review, analyze data - verify mappings and failures. Operations teams - monitoring.
  • 26. QuerySurge™ Modules Design Library: create Query Pairs (source & target queries). Scheduling: build groups of Query Pairs; schedule Test Runs.
  • 27. QuerySurge™ Modules Run Dashboard: view real-time execution; analyze real-time results. Deep-Dive Reporting: examine and automatically email test results.
  • 28. The QuerySurge solution… verifies more data: verifies upwards of 100% of all data quickly. Automates the testing effort: the kickoff, the tests, the comparison, emailing the results. Tests across different platforms: any JDBC-compliant db, DWH, DMart, flat file, XML, Hadoop. Speeds up testing: up to 1,000 times faster than manual testing.
  • 29. QuerySurge Value-Add QuerySurge provides value by either: an increase in testing data coverage from < 1% to upwards of 100%; a decrease in testing time by as much as 1,000x; or a combination of an increase in test coverage while decreasing testing time.
  • 30. Return on Investment (ROI) • redeployment of head count because of an increase in coverage • a savings over manual testing (minus queries, manual compares, other) • an increase in better data due to a shorter / more thorough testing cycle, possibly saving $ millions by preventing key decisions made on bad data.
  • 31. Demonstration Jeff Bocarsly, Ph.D. Chief Architect RTTS Ensuring Data Warehouse Quality
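
The "Query Pair" idea from slides 26 to 28 (one query against the source, one against the target, then a comparison of the two result sets) can be illustrated with a minimal Python sketch. This is not QuerySurge's API or implementation; it assumes two in-memory SQLite databases standing in for a source system and a target warehouse, and the table names, column names and the helpers `run_query` and `compare_query_pair` are all hypothetical.

```python
import sqlite3
from collections import Counter


def run_query(conn, sql):
    """Run a query and return all rows as a list of tuples."""
    return conn.execute(sql).fetchall()


def compare_query_pair(source_rows, target_rows):
    """Compare two result sets as multisets.

    Returns (missing, unexpected): rows the source query produced that the
    target lacks, and rows the target holds that the source did not produce.
    """
    src, tgt = Counter(source_rows), Counter(target_rows)
    missing = sorted((src - tgt).elements())
    unexpected = sorted((tgt - src).elements())
    return missing, unexpected


# Hypothetical source system and target warehouse (assumptions for the sketch).
source = sqlite3.connect(":memory:")
target = sqlite3.connect(":memory:")

source.execute("CREATE TABLE orders (id INTEGER, amount_cents INTEGER)")
source.executemany(
    "INSERT INTO orders VALUES (?, ?)", [(1, 1050), (2, 2000), (3, 399)]
)

target.execute("CREATE TABLE fact_orders (order_id INTEGER, amount_dollars REAL)")
# Simulated ETL load with one defect: order 3 was loaded with the wrong amount.
target.executemany(
    "INSERT INTO fact_orders VALUES (?, ?)", [(1, 10.50), (2, 20.00), (3, 3.90)]
)

# The query pair: the target query undoes the cents-to-dollars transformation
# so that both sides can be compared row for row.
source_sql = "SELECT id, amount_cents FROM orders"
target_sql = (
    "SELECT order_id, CAST(ROUND(amount_dollars * 100) AS INTEGER) "
    "FROM fact_orders"
)

missing, unexpected = compare_query_pair(
    run_query(source, source_sql), run_query(target, target_sql)
)
print("missing from target:", missing)
print("unexpected in target:", unexpected)
```

Comparing full result sets rather than a sample is what allows coverage to approach 100% of the data; a real tool would run the same comparison over JDBC connections to the actual source, Hadoop/Hive and warehouse systems.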

Editor's Notes

  1. Largest known cluster is 4500 nodes
  2. Designing and maintaining the ETL process is often considered one of the most difficult and resource-intensive portions of a data warehouse project. Many data warehousing projects use ETL tools to manage this process. Other data warehouse builders create their own ETL tools and processes, either inside or outside the database. Besides the support of extraction, transformation, and loading, there are some other tasks that are important for a successful ETL implementation as part of the daily operations of the data warehouse and its support for further enhancements.
  3. Web browsers: Internet Explorer, Chrome, Firefox and Safari. Operating systems: Windows & Linux.