Testing Big Data: Automated ETL Testing of Hadoop

Learn why testing your enterprise's data is pivotal for success with Big Data and Hadoop. See how to increase your testing speed, boost your testing coverage (up to 100%), and improve the level of quality within your data warehouse - all with one ETL testing tool.

Slide notes
  • Largest known cluster is 4500 nodes
  • Designing and maintaining the ETL process is often considered one of the most difficult and resource-intensive portions of a data warehouse project. Many data warehousing projects use ETL tools to manage this process; other data warehouse builders create their own ETL tools and processes, either inside or outside the database. Besides the support of extraction, transformation, and loading, some other tasks are important for a successful ETL implementation as part of the daily operations of the data warehouse and its support for further enhancements.
  • Web browsers: Internet Explorer, Chrome, Firefox and Safari. Operating systems: Windows & Linux.

Transcript

1. Webinar: Testing Big Data: Automated ETL Testing of Hadoop
   Laura Poggi, Marketing Manager, RTTS
   Bill Hayduk, CEO/President, RTTS
   Jeff Bocarsly, Ph.D., Chief Architect, RTTS
2. Today’s Agenda
   • About Big Data and Hadoop
   • Data Warehouse refresher
   • Hadoop and DWH Use Case
   • How to test Big Data
   • Demo of QuerySurge & Hadoop
   Topic: Testing Big Data: Automated ETL Testing of Hadoop
   Host: RTTS
   Date: Thursday, January 30, 2014
   Time: 1:00 pm, Eastern Standard Time (New York, GMT-05:00)
   Session number: 630 771 732
3. About RTTS
   Founded: 1996
   Primary focus: consulting services, software
   Locations: New York, Atlanta, Philly, Phoenix
   Geographic region: North America
   Customer profile: Fortune 1000, > 600 clients
   Software: RTTS is the leading provider of software quality for critical business systems
4. “Facebook handles 300 million photos a day and about 105 terabytes of data every 30 minutes.” – TechCrunch
   “The big data market will grow from $3.2 billion in 2010 to $32.4 billion in 2017.” – research firm IDC
   “65% of…advanced analytics will have Hadoop embedded (in them) by 2015.” – Gartner
5. About Big Data
   Big data – defined as too much volume, velocity and variety to work on normal database architectures.
   Size: defined as 5 petabytes or more
   1 petabyte = 1,000 terabytes
   1,000 terabytes = 1,000,000 gigabytes
   1,000,000 gigabytes = 1,000,000,000 megabytes
6. What is Hadoop?
   Hadoop is an open source project that develops software for scalable, distributed computing.
   • a framework for the distributed processing of large data sets across clusters of computers using simple programming models
   • easily deals with the complexities of high volumes of data, scaling from single servers to 1,000’s of machines, each offering local computation and storage
   • detects and handles failures at the application layer
7. Key Attributes of Hadoop
   • Redundant and reliable
   • Extremely powerful
   • Easy to program distributed apps
   • Runs on commodity hardware
8. Basic Hadoop Architecture
   MapReduce – the processing part that manages the programming jobs (a.k.a. Task Tracker).
   HDFS (Hadoop Distributed File System) – stores data on the machines (a.k.a. Data Node).
   Diagram: each machine in the cluster runs MapReduce (Task Tracker) on top of HDFS (Data Node).
9. Basic Hadoop Architecture (continued)
   Cluster: add more machines for scaling – from 1 to 100 to 1,000.
   Job Tracker: accepts jobs, assigns tasks, identifies failed machines.
   Name Node: coordination for HDFS; inserts and extractions are communicated through the Name Node.
   Diagram: many machines, each running a Task Tracker and a Data Node, coordinated by the Job Tracker and Name Node.
10. Apache Hive
    Apache Hive – a data warehouse infrastructure built on top of Hadoop for providing data summarization, query, and analysis.
    Hive provides a mechanism to query the data using a SQL-like language called HiveQL that interacts with the HDFS files.
    Diagram: HiveQL statements (create, insert, update, delete, select) are executed by MapReduce (Task Tracker) against HDFS (Data Node).
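To make that concrete, here is a minimal HiveQL sketch (the web_logs table and its columns are hypothetical, not from the deck). Hive compiles statements like these into MapReduce jobs that run over the files in HDFS:

    -- define a table over tab-delimited files in HDFS (hypothetical schema)
    CREATE TABLE web_logs (ip STRING, url STRING, hits INT)
      ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t';

    -- a familiar SQL-style aggregation, executed as a MapReduce job
    SELECT url, SUM(hits) AS total_hits
    FROM web_logs
    GROUP BY url
    ORDER BY total_hits DESC
    LIMIT 10;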
11. Data Warehouse Review
12. About Data Warehouses…
    A data warehouse:
    • is typically a relational database that is designed for query and analysis rather than for transaction processing
    • is a place where historical data is stored for archival, analysis and security purposes
    • contains either raw data or formatted data
    • combines data from multiple sources: sales, salaries, operational data, human resource data, inventory data, web logs, social networks, Internet text and docs, other
13. Data Warehousing: the ETL Process
    ETL = Extract, Transform, Load
    Why ETL? The data warehouse needs to be loaded regularly (daily/weekly) so that it can serve its purpose of facilitating business analysis.
    Extract – data is read from one or more OLTP systems and copied into the warehouse.
    Transform – removing inconsistencies, assembling to a common format, adding missing fields, summarizing detailed data and deriving new fields to store calculated data.
    Load – map the data and load it into the DWH.
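As a sketch of a single transform-and-load step (the staging.sales and dwh.daily_sales tables are hypothetical, not from the deck), the Transform and Load phases might look like this in SQL:

    -- transform: normalize the region field to a common format and
    -- summarize detailed rows, then load the result into the target table
    INSERT INTO dwh.daily_sales (sale_date, region, total_amount)
    SELECT s.sale_date,
           UPPER(TRIM(s.region)) AS region,   -- assemble to a common format
           SUM(s.amount) AS total_amount      -- summarize detailed data
    FROM staging.sales s
    WHERE s.sale_date = CURRENT_DATE
    GROUP BY s.sale_date, UPPER(TRIM(s.region));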
14. Data Warehouse – the ETL Process
    Diagram: source data (Legacy DB, CRM/ERP DB, Finance DB) flows through the ETL process (Extract → Transform → Load) into the target DWH.
15. Data Warehouse & Hadoop: A Use Case
16. DWH & Hadoop: A Use Case
    Use Hadoop as a landing zone for big data & raw data:
    1) bring all raw, big data into Hadoop
    2) perform some pre-processing of this data
    3) determine which data goes to the EDWH
    4) extract, transform and load (ETL) the pertinent data into the EDWH
    Source: Vijay Ramaiah, IBM product manager, datanami magazine, June 10, 2013
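A minimal HiveQL sketch of the landing-zone idea (all table names and HDFS paths here are hypothetical): expose the raw files landed in Hadoop as a table, then pre-process and stage only the pertinent rows for the ETL feed into the EDWH:

    -- step 1: expose raw files already landed in HDFS as a Hive table
    CREATE EXTERNAL TABLE raw_events (event_ts STRING, user_id STRING, payload STRING)
      ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
      LOCATION '/landing/events';

    -- steps 2-4: pre-process and keep only the rows pertinent to the EDWH
    INSERT OVERWRITE DIRECTORY '/outbound/edwh_events'
    SELECT event_ts, user_id, payload
    FROM raw_events
    WHERE user_id IS NOT NULL AND payload <> '';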
17. DWH & Hadoop: A Use Case
    Use case data flow. Diagram: source data flows through an ETL process into the target DWH, with Hadoop serving as the landing zone.
18. Testing Big Data
19. Testing Big Data: Entry Points
    Recommended functional test strategy: test every entry point in the system (feeds, databases, internal messaging, front-end transactions).
    The goal: provide rapid localization of data issues between points.
    Diagram: test at each entry point along the flow: source data → Hadoop → ETL process → target DWH → BI.
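One simple way to localize an issue between two adjacent entry points is to run the same measure on both sides of a hop; a sketch with hypothetical table names:

    -- count today's rows at the feed side and at the Hadoop landing side;
    -- a mismatch localizes the data issue to this hop
    SELECT COUNT(*) FROM source_db.orders    WHERE load_date = CURRENT_DATE;
    SELECT COUNT(*) FROM hive_landing.orders WHERE load_date = CURRENT_DATE;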
20. Testing Big Data: 3 Big Issues
    • We need to verify more data and to do it faster.
    • We need to automate the testing effort.
    • We need to be able to test across different platforms.
    We need a testing tool!
21. About QuerySurge
22. What is QuerySurge?
    QuerySurge is the premier test tool built to automate Data Warehouse testing and the ETL testing process.
23. What does QuerySurge™ do?
    QuerySurge finds bad data.
    • Most firms test < 1% of their data.
    • BI apps sit on top of DWHs that have, at best, untested data and, at worst, bad data.
    • CEOs, CFOs, CTOs and other executives rely on BI apps to make strategic decisions.
    • Bad data will cause execs to make decisions that will cost them $millions.
    • QuerySurge tests up to 100% of your data quickly and finds bad data.
24. QuerySurge Roles & Uses
    Testers – functional testing, regression testing
    ETL Developers – unit testing
    Data Analysts – review and analyze data, verify mappings and failures
    Operations teams – monitoring
25. QuerySurge™ Architecture
    Diagram: QuerySurge connects to both the sources and the target, querying each side for comparison.
26. QuerySurge™ Modules
    Design Library
    • Create Query Pairs (source & target queries)
    Scheduling
    • Build groups of Query Pairs
    • Schedule test runs
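To illustrate the idea, a Query Pair is a source query and a target query whose result sets should match once the ETL has run; a sketch with hypothetical tables, where the source query re-applies the expected transformation:

    -- source query (run against the source system)
    SELECT customer_id, UPPER(TRIM(last_name)) AS last_name, credit_limit
    FROM crm.customers;

    -- target query (run against the DWH); the two result sets are compared
    SELECT customer_id, last_name, credit_limit
    FROM dwh.dim_customer;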
27. QuerySurge™ Modules
    Run Dashboard
    • View real-time execution
    • Analyze real-time results
    Deep-Dive Reporting
    • Examine and automatically email test results
28. The QuerySurge Solution…
    Verifies more data: verifies upwards of 100% of all data quickly.
    Automates the testing effort: the kickoff, the tests, the comparison, emailing the results.
    Tests across different platforms: any JDBC-compliant db, DWH, DMart, flat file, XML, Hadoop.
    Speeds up testing: up to 1,000 times faster than manual testing.
29. QuerySurge Value-Add
    QuerySurge provides value through either:
    • an increase in testing data coverage, from < 1% to upwards of 100%
    • a reduction in testing time, by as much as 1,000x
    • a combination of the two: reducing testing time while increasing test coverage
30. Return on Investment (ROI)
    • redeployment of head count because of an increase in coverage
    • a savings over manual testing (manual queries, manual compares, other)
    • better data due to a shorter, more thorough testing cycle, possibly saving $millions by preventing key decisions made on bad data
31. Demonstration
    Jeff Bocarsly, Ph.D., Chief Architect, RTTS
    Ensuring Data Warehouse Quality