
Testing Big Data: Automated Testing of Hadoop with QuerySurge

Are You Ready? Stepping Up To The Big Data Challenge In 2016 - Learn why Testing is pivotal to the success of your Big Data Strategy.

According to a new report by analyst firm IDG, 70% of enterprises have either deployed or are planning to deploy big data projects and programs this year due to the increase in the amount of data they need to manage.

The growing variety of new data sources is pushing organizations to look for streamlined ways to manage complexities and get the most out of their data-related investments. The companies that do this correctly are realizing the power of big data for business expansion and growth.

Learn why testing your enterprise's data is pivotal for success with big data and Hadoop. Learn how to increase your testing speed, boost your testing coverage (up to 100%), and improve the level of quality within your data - all with one data testing tool.

Published in: Technology, Business

Testing Big Data: Automated Testing of Hadoop with QuerySurge

  1. Testing Big Data: Automated ETL Testing of Hadoop. Bill Hayduk, CEO/President, RTTS. Jeff Bocarsly, Ph.D., Chief Architect, QuerySurge Division, RTTS. Built by QuerySurge™. Automate your Data Warehouse & Big Data Testing and Reap the Benefits.
  2. Today’s Agenda: About Big Data and Hadoop; Data Warehouse refresher; Hadoop and DWH Use Case; How to test Big Data; Demo of QuerySurge & Hadoop. Topic: Testing Big Data: Automated ETL Testing of Hadoop. Host: RTTS. Date: Thursday, January 30, 2014. Time: 1:00 pm, Eastern Standard Time (New York, GMT-05:00). Session number: 630 771 732.
  3. About RTTS. Facts: Founded: 1996. Locations: New York (HQ), Atlanta, Philadelphia, Phoenix. Strategic Partners: IBM, Microsoft, HP, Oracle, Teradata, HortonWorks, Cloudera, Amazon. Software: QuerySurge. RTTS is the leading provider of software & data quality for critical business systems.
  4. Facebook handles 300 million photos a day and about 105 terabytes of data every 30 minutes. - TechCrunch. The big data market will grow from $3.2 billion in 2010 to $32.4 billion in 2017. - Research Firm IDC. 65% of…advanced analytics will have Hadoop embedded (in them) by 2015. - Gartner.
  5. The Executive Office and Big Data. CxOs are using Business Intelligence & Analytics to make critical business decisions, with the assumption that the underlying data is fine. “The average organization loses $8.2 million annually through poor Data Quality.” - Gartner. Diagram labels: Data Architecture; ETL; Business Intelligence (BI) software; potential problem areas.
  6. About Big Data. Big data: defined as too much volume, velocity and variety to work on normal database architectures. Size: defined as 5 petabytes or more. 1 petabyte = 1,000 terabytes; 1,000 terabytes = 1,000,000 gigabytes; 1,000,000 gigabytes = 1,000,000,000 megabytes.
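The unit chain on slide 6 can be checked with a few lines of Python. This is only a sketch: it assumes decimal units (powers of 1,000), as the slide states, and the `UNITS` list and `convert` helper are illustrative names, not part of the presentation.

```python
# Decimal storage units, smallest to largest, matching the slide's
# "1 petabyte = 1,000 terabytes" convention (not binary 1,024 multiples).
UNITS = ["megabytes", "gigabytes", "terabytes", "petabytes"]


def convert(value, from_unit, to_unit):
    """Convert between decimal storage units by powers of 1,000."""
    steps = UNITS.index(from_unit) - UNITS.index(to_unit)
    return value * 1000 ** steps


# The slide's size threshold for "big data", expressed in smaller units.
threshold_tb = convert(5, "petabytes", "terabytes")
threshold_gb = convert(5, "petabytes", "gigabytes")
one_pb_in_mb = convert(1, "petabytes", "megabytes")
```

Under these assumptions the 5-petabyte threshold is 5,000 terabytes, or 5,000,000 gigabytes.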
  7. Big Data Impact. [Company shown as a logo] handles more than 1 million customer transactions every hour: data imported into databases that contain > 2.5 petabytes of data, the equivalent of 167 times the information contained in all the books in the US Library of Congress. Others: Facebook handles 40 billion photos from its user base; Google processes 1 Terabyte per hour; Twitter processes 85 million tweets per day; eBay processes 80 Terabytes per day.
  8. Big Data Solutions. Requires exceptional technologies to efficiently process large quantities of data within tolerable elapsed times. Technologies include: massively parallel processing (MPP) databases; data warehouses; data mining grids; distributed file systems; distributed databases; cloud computing platforms; the Internet; and scalable storage systems.
  9. What is Hadoop? Hadoop is an open source project that develops software for scalable, distributed computing. It easily deals with the complexities of high volume, velocity and variety of data; is a framework for distributed processing of large data sets across clusters of computers using simple programming models; scales from single servers to 1,000’s of machines, each offering local computation and storage; detects and handles failures at the application layer.
  10. Key Attributes of Hadoop: redundant and reliable; extremely powerful; easy to program distributed apps; runs on commodity hardware.
  11. Top Vendors. “Spending on Hadoop software and subscriptions will increase to approximately $677 million by the end of 2017, with overall big data market anticipated to reach the $50 billion mark.” - Wikibon
  12. Basic Hadoop Architecture. MapReduce: the processing part that manages the programming jobs (a.k.a. Task Tracker). HDFS (Hadoop Distributed File System): stores data on the machines (a.k.a. Data Node). Diagram: one machine containing MapReduce (Task Tracker) and HDFS (Data Node).
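Slide 12 names MapReduce only as the processing part of Hadoop. As a rough illustration of the general map, shuffle and reduce model, here is a word count in plain Python. It runs in a single process and does not use Hadoop's API; all function names are hypothetical.

```python
from collections import defaultdict


def map_phase(line):
    # Emit a (word, 1) pair for every word in one input line.
    return [(word.lower(), 1) for word in line.split()]


def shuffle(pairs):
    # Group all values by key, the step a framework performs between map and reduce.
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(key, values):
    # Collapse the grouped values for one key into a single result.
    return key, sum(values)


def word_count(lines):
    pairs = [pair for line in lines for pair in map_phase(line)]
    return dict(reduce_phase(key, values) for key, values in shuffle(pairs).items())


counts = word_count(["big data", "Big Hadoop data"])
```

In a real cluster the map and reduce calls would run on many machines in parallel; the sketch only shows the data flow.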
  13. Basic Hadoop Architecture (continued). Cluster: add more machines for scaling, from 1 to 100 to 1,000. Job Tracker: accepts jobs, assigns tasks, identifies failed machines. Name Node: coordination for HDFS; inserts and extractions are communicated through the Name Node. Diagram: a cluster of machines, each running a Task Tracker and a Data Node.
  14. Apache Hive: a data warehouse infrastructure built on top of Hadoop for providing data summarization, query, and analysis. Hive provides a mechanism to query the data using a SQL-like language called HiveQL that interacts with the HDFS files: create; insert; update; delete; select.
  15. Data Warehouse Review
  16. About Data Warehouses. A data warehouse is typically a relational database that is designed for query and analysis rather than for transaction processing; a place where historical data is stored for archival, analysis & security purposes; contains either raw or formatted data; combines data from multiple sources: sales, salaries, operational data, human resource data, inventory data, web logs, social networks, internet text and docs, other.
  17. Data Warehouse: the ETL process. ETL: Extract, Transform, Load. Why ETL? The data warehouse needs to be loaded regularly (daily/weekly) so that it can serve its purpose of facilitating business analysis. Extract: data is taken from one or more OLTP systems and copied into the warehouse. Transform: removing inconsistencies, assembling to a common format, adding missing fields, summarizing detailed data and deriving new fields to store calculated data. Load: map the data and load it into the DW.
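The three ETL steps on slide 17 can be sketched end to end in Python. Everything concrete here is an assumption made for illustration: the sample rows stand in for an OLTP extract, the `sales_fact` table and its columns are invented, and an in-memory SQLite database plays the role of the warehouse.

```python
import sqlite3

# Hypothetical rows standing in for data pulled from an OLTP system.
SOURCE_ROWS = [
    {"id": 1, "name": " alice ", "qty": 2, "unit_price": 10.0, "region": "NY"},
    {"id": 2, "name": "BOB", "qty": 1, "unit_price": 4.5, "region": None},
]


def extract():
    # Extract: copy the rows out of the source system.
    return list(SOURCE_ROWS)


def transform(rows):
    # Transform: normalize formats, fill missing fields, derive a calculated field.
    return [
        (
            row["id"],
            row["name"].strip().title(),   # common name format
            row["region"] or "UNKNOWN",    # fill a missing field
            row["qty"] * row["unit_price"],  # derived total
        )
        for row in rows
    ]


def load(conn, rows):
    # Load: map the transformed rows onto the target table.
    conn.execute(
        "CREATE TABLE IF NOT EXISTS sales_fact "
        "(id INTEGER PRIMARY KEY, customer TEXT, region TEXT, total REAL)"
    )
    conn.executemany("INSERT INTO sales_fact VALUES (?, ?, ?, ?)", rows)
    conn.commit()


conn = sqlite3.connect(":memory:")
load(conn, transform(extract()))
loaded = conn.execute(
    "SELECT customer, region, total FROM sales_fact ORDER BY id"
).fetchall()
```

Each transform rule is a place where data can silently go wrong, which is why the later slides focus on testing the ETL process.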
  18. Data Warehouse: the Marketplace. “The data warehousing market will see a compound annual growth rate of 11.5% …to reach a total of $13.2 billion in revenue.” - consulting specialist The 451 Group. Data Warehouse size: small data warehouses: < 5 TB; midsize data warehouses: 5 TB - 20 TB; large data warehouses: > 20 TB. - Analyst firm Gartner. Leaders in Data Warehouse Data Management Systems (vendor logos on slide). - Analyst firm Gartner’s ‘Magic Quadrant for Data Warehouse Database Management Systems’
  19. Testing the Data Warehouse: the ETL process. Diagram labels: Source Data (Legacy DB, CRM/ERP DB, Finance DB); ETL Process (Extract, Transform, Load); Target Data Warehouse.
  20. Testing Big Data
  21. Data Warehouse & Hadoop: 2 Use Cases. Diagram labels: Data Warehouse, Hadoop; NoSQL, Hadoop, Data Warehouse.
  22. Use Case #1: Data Warehouse & Hadoop***. Use Hadoop as a landing zone for big data & raw data: 1) bring all raw, big data into Hadoop; 2) perform some pre-processing of this data; 3) determine which data goes to Data Warehouse; 4) extract, transform and load (ETL) pertinent data into Data Warehouse. ***Source: Vijay Ramaiah, IBM product manager, datanami magazine, June 10, 2013
  23. Use Case #1: Data Warehouse & Hadoop. Recommended functional test strategy: test every entry point in the system (feeds, databases, internal messaging, front-end transactions). The goal: provide rapid localization of data issues between points. Diagram labels: Source Data; ETL; Source Hadoop; ETL Process; Target DWH; Business Intelligence software; with test entry points marked along the flow.
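The goal stated on slide 23, localizing a data issue between two test points, can be illustrated with a small Python sketch. It assumes, for simplicity, that every record is expected to reach every stage (in practice the pre-processing step may legitimately filter data), and the stage names and `localize_issue` helper are hypothetical.

```python
def localize_issue(stages):
    """Return the first hop that loses records.

    `stages` is an ordered list of (stage_name, set_of_record_ids) snapshots
    taken at each test entry point. Returns (from_stage, to_stage, missing_ids)
    or None if no hop loses records.
    """
    for (src_name, src_ids), (dst_name, dst_ids) in zip(stages, stages[1:]):
        missing = src_ids - dst_ids
        if missing:
            return src_name, dst_name, sorted(missing)
    return None


# Hypothetical snapshots of record IDs captured at three entry points.
snapshots = [
    ("source feed", {1, 2, 3, 4}),
    ("hadoop landing zone", {1, 2, 3, 4}),
    ("data warehouse", {1, 2, 4}),
]
finding = localize_issue(snapshots)
```

With a check at every entry point, the missing record is pinned to the hop between the landing zone and the warehouse rather than merely being noticed in a final report.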
  24. Use Case #2: MongoDB, Hadoop, DWH & Relational DB & Data Warehousing. Diagram labels: Source Data; Ingestion; BI, Analytics & Reporting; test entry point (marked five times).
  25. Testing Big Data: 3 Big Issues. We need to verify more data and to do it faster; we need to automate the testing effort; we need to be able to test across different platforms. We need a testing tool!
  26. About QuerySurge™
  27. What is QuerySurge™? The collaborative Big Data Testing solution that finds bad data & provides a holistic view of your data’s health.
  28. The QuerySurge advantage. Automate the entire testing cycle: automate kickoff, tests, comparison, auto-emailed results. Create tests easily with no SQL programming: ensures minimal time & effort to create tests / obtain results. Test across different platforms: Hadoop, data warehouses, NoSQL, database, flat file, XML. Collaborate with team: Data Health dashboard, shared tests & auto-emailed reports. Verify more data & do it quickly: verifies up to 100% of all data up to 1,000 x faster. Integrate for Continuous Delivery: integrates with most Build, ETL & QA management software.
  29. QuerySurge™ Architecture. Web-based. Installs on: Linux. Connects to: Flat Files …or any other JDBC compliant data source. Components: QuerySurge Controller; QuerySurge Server; QuerySurge Agents.
  30. QuerySurge™ Modules: Design Library; Scheduling; Deep-Dive Reporting; Run Dashboard; Query Wizards; Data Health Dashboard.
  31. QuerySurge™ Modules. Fast and Easy. No programming needed. Perform 80% of all data tests with no SQL coding needed; opens up testing to novices & non-technical team members; speeds up testing for skilled SQL coders; provides a huge Return-On-Investment.
  32. QuerySurge™ Modules. Design Library: create Query Pairs (source & target SQLs); great for team members skilled with SQL. Scheduling: build groups of Query Pairs; schedule Test Runs.
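The Query Pair idea on slide 32, one SQL statement against the source and one against the target with the results compared, can be sketched generically in Python. This is not QuerySurge's implementation or API: the `run_query_pair` helper, the table names and the use of two in-memory SQLite databases are all assumptions made for illustration.

```python
import sqlite3
from collections import Counter


def run_query_pair(source_conn, source_sql, target_conn, target_sql):
    """Run one source query and one target query and diff the result rows."""
    src = Counter(source_conn.execute(source_sql).fetchall())
    tgt = Counter(target_conn.execute(target_sql).fetchall())
    return {
        "missing_in_target": sorted((src - tgt).elements()),
        "unexpected_in_target": sorted((tgt - src).elements()),
    }


# Hypothetical source system.
source = sqlite3.connect(":memory:")
source.execute("CREATE TABLE orders (id INTEGER, amount REAL)")
source.executemany(
    "INSERT INTO orders VALUES (?, ?)", [(1, 10.0), (2, 20.0), (3, 30.0)]
)

# Hypothetical target warehouse: one row altered, one row never loaded.
target = sqlite3.connect(":memory:")
target.execute("CREATE TABLE fact_orders (order_id INTEGER, amount REAL)")
target.executemany("INSERT INTO fact_orders VALUES (?, ?)", [(1, 10.0), (2, 25.0)])

result = run_query_pair(
    source, "SELECT id, amount FROM orders",
    target, "SELECT order_id, amount FROM fact_orders",
)
```

Counting rows with `Counter` rather than using plain sets keeps duplicate rows visible, so a row loaded twice would also show up as a difference.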
  33. QuerySurge™ Modules. Deep-Dive Reporting: examine and automatically email test results. Run Dashboard: view real-time execution; analyze real-time results.
  34. QuerySurge™ Modules. View data reliability & pass rate; add, move, filter, zoom-in on any data widget & underlying data; verify build success or failure.
  35. Free Trials & Training. (1) Trial in the Cloud of QuerySurge™, including self-learning tutorial that works with sample data for 3 days. (2) Downloaded Trial of QuerySurge™, including self-learning tutorial with sample data or your data for 15 days. For more information on our Trials, please visit: www.querysurge.com/compare-trial-options. Big Data Testing Courses: filled with examples and labs, this hands-on training teaches concepts and HQL techniques used in Big Data testing. For more information on our Big Data Testing classes, please visit: http://www.rttsweb.com/training/courses/big-data-testing-courses
  36. A last word about Hadoop… “Big Data and Hadoop are on the verge of revolutionizing enterprise data management architectures.” - DeZyre. To see the video of this webinar please visit: http://www.querysurge.com/solutions/testing-big-data/big-data-testing-for-hadoop
