Testing Big Data:
Automated ETL Testing of Hadoop
Bill Hayduk
CEO/President
RTTS
Jeff Bocarsly, Ph.D.
Chief Architect
QuerySurge Division, RTTS
QuerySurge™, built by RTTS
Automate your Data Warehouse & Big Data Testing
and Reap the Benefits
Today’s Agenda
• About Big Data and Hadoop
• Data Warehouse refresher
• Hadoop and DWH Use Case
• How to test Big Data
• Demo of QuerySurge & Hadoop
Topic: Testing Big Data: Automated ETL Testing of Hadoop
Host: RTTS
Date: Thursday, January 30, 2014
Time: 1:00 pm, Eastern Standard Time (New York, GMT-05:00)
Session number: 630 771 732
About RTTS
RTTS is the leading provider of software & data quality for critical business systems
FACTS
Founded: 1996
Locations: New York (HQ), Atlanta, Philadelphia, Phoenix
Strategic Partners: IBM, Microsoft, HP, Oracle, Teradata, Hortonworks, Cloudera, Amazon
Software: QuerySurge
Facebook handles 300 million photos a day and
about 105 terabytes of data every 30 minutes.
- TechCrunch
The big data market will grow from $3.2 billion in
2010 to $32.4 billion in 2017.
- Research Firm IDC
65% of…advanced analytics will have Hadoop
embedded (in them) by 2015.
- Gartner
The Executive Office and Big Data
CxOs are using Business Intelligence & Analytics to make critical business decisions, with the assumption that the underlying data is fine.
“The average organization loses $8.2 million annually through poor Data Quality.”
- Gartner
[Diagram: data architecture with ETL and Business Intelligence (BI) software; potential problem areas highlighted]
About Big Data
Big data: defined as too much volume, velocity and variety to work on normal database architectures.
Size
Defined as 5 petabytes or more
1 petabyte = 1,000 terabytes
1,000 terabytes = 1,000,000 gigabytes
1,000,000 gigabytes = 1,000,000,000 megabytes
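The unit arithmetic on this slide can be checked with a few lines of Python. This is a minimal sketch that assumes decimal (SI) units, where each step is a factor of 1,000, as the slide does; the `convert` helper is a hypothetical name introduced only for illustration.

```python
# Decimal (SI) storage units, matching the slide's 1 PB = 1,000 TB convention.
UNITS = ["MB", "GB", "TB", "PB"]


def convert(value, from_unit, to_unit):
    """Convert between decimal storage units (each step is a factor of 1,000)."""
    steps = UNITS.index(from_unit) - UNITS.index(to_unit)
    return value * 1000 ** steps


threshold_pb = 5  # the slide's size threshold for "big data"
for unit in ("TB", "GB", "MB"):
    print(f"{threshold_pb} PB = {convert(threshold_pb, 'PB', unit):,} {unit}")
```

Binary units (1 PiB = 1,024 TiB) would give slightly different figures; the sketch follows the slide's decimal convention.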
Big Data Impact
Walmart handles more than 1 million customer transactions every hour.
• data is imported into databases that contain > 2.5 petabytes of data
• the equivalent of 167 times the information contained in all the books in the US Library of Congress
Others
• Facebook handles 40 billion photos from its user base
• Google processes 1 Terabyte per hour
• Twitter processes 85 million tweets per day
• eBay processes 80 Terabytes per day
Big Data Solutions
Big data requires exceptional technologies to efficiently process large quantities of data within tolerable elapsed times.
Technologies include:
• massively parallel processing (MPP) databases
• data warehouses
• data mining grids
• distributed file systems
• distributed databases
• cloud computing platforms
• the Internet, and
• scalable storage systems
What is Hadoop?
Hadoop is an open source project that develops software for scalable, distributed computing.
• is a framework for the distributed processing of large data sets across clusters of computers using simple programming models
• easily deals with the complexities of high volume, velocity and variety of data
• scales up from single servers to 1,000’s of machines, each offering local computation and storage
• detects and handles failures at the application layer
Key Attributes of Hadoop
• Redundant and reliable
• Extremely powerful
• Easy to program distributed apps
• Runs on commodity hardware
Top Vendors
“Spending on Hadoop software and subscriptions will increase
to approximately $677 million by the end of 2017, with overall
big data market anticipated to reach the $50 billion mark.”
- Wikibon
Basic Hadoop Architecture
MapReduce: processing part that manages the programming jobs (a.k.a. Task Tracker)
HDFS (Hadoop Distributed File System): stores data on the machines (a.k.a. Data Node)
[Diagram: one machine running MapReduce (Task Tracker) on top of HDFS (Data Node)]
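To make the MapReduce programming model concrete, here is a minimal word-count sketch in plain Python. It does not use any Hadoop API; it only simulates the map, shuffle and reduce phases in a single process, and the function names and sample input are assumptions made for illustration.

```python
from collections import defaultdict


def map_phase(line):
    """Map: emit a (word, 1) pair for every word in one input line."""
    return [(word.lower(), 1) for word in line.split()]


def shuffle(pairs):
    """Shuffle: group all emitted values by key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Reduce: collapse the values for one key into a single result."""
    return key, sum(values)


def word_count(lines):
    """Run map, shuffle and reduce in sequence over the input lines."""
    mapped = [pair for line in lines for pair in map_phase(line)]
    return dict(reduce_phase(k, v) for k, v in shuffle(mapped).items())


# Hypothetical input; on a cluster each line could live on a different Data Node.
sample = ["big data big tests", "data tests"]
print(word_count(sample))
```

On a real cluster the map and reduce functions would run in parallel on many machines, close to the HDFS blocks they read; the in-process version only shows the shape of the computation.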
Basic Hadoop Architecture (continued)
Cluster
Add more machines for scaling, from 1 to 100 to 1,000
Job Tracker: accepts jobs, assigns tasks, identifies failed machines
Name Node: coordination for HDFS. Inserts and extractions are communicated through the Name Node.
[Diagram: a cluster of 12 machines, each running a Task Tracker and a Data Node, coordinated by the Name Node]
Apache Hive
Apache Hive - a data warehouse infrastructure built on top of Hadoop for providing data summarization, query, and analysis.
Hive provides a mechanism to query the data using a SQL-like language called HiveQL that interacts with the HDFS files
• create
• insert
• update
• delete
• select
[Diagram: HiveQL queries running on top of MapReduce (Task Tracker) and HDFS (Data Node)]
Data Warehouse Review
about Data Warehouses…
Data Warehouse
• typically a relational database that is designed for query and analysis rather than
for transaction processing
• a place where historical data is stored for archival, analysis & security purposes.
• contains either raw or formatted data
• combines data from multiple sources:
o sales
o salaries
o operational data
o human resource data
o inventory data
o web logs
o social networks
o internet text and docs
o other
Data Warehouse: the ETL process
ETL: Extract, Transform, Load
Why ETL?
Need to load the data warehouse regularly (daily/weekly) so that it
can serve its purpose of facilitating business analysis.
Extract: data is extracted from one or more OLTP systems and copied into the warehouse
Transform: removing inconsistencies, assembling to a common format, adding missing fields, summarizing detailed data and deriving new fields to store calculated data
Load: map the data and load it into the DW
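The three ETL steps can be sketched end to end in a few lines. This is a minimal illustration, assuming an in-memory SQLite database as a stand-in for the data warehouse; the source rows, the `sales` table and its columns are hypothetical and not taken from the webinar.

```python
import sqlite3

# Hypothetical rows standing in for an OLTP extract; names and values are illustrative.
SOURCE_ROWS = [
    {"id": 1, "name": " alice ", "amount": "10.50"},
    {"id": 2, "name": "BOB", "amount": "4.25"},
    {"id": 3, "name": "Carol", "amount": None},  # missing field
]


def extract():
    """Extract: copy rows out of the source system."""
    return list(SOURCE_ROWS)


def transform(rows):
    """Transform: normalize names, fill missing amounts, derive a cents column."""
    cleaned = []
    for row in rows:
        amount = float(row["amount"]) if row["amount"] is not None else 0.0
        name = row["name"].strip().title()
        cleaned.append((row["id"], name, amount, round(amount * 100)))
    return cleaned


def load(conn, rows):
    """Load: map the transformed rows onto the warehouse table."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS sales "
        "(id INTEGER PRIMARY KEY, name TEXT, amount REAL, amount_cents INTEGER)"
    )
    conn.executemany("INSERT INTO sales VALUES (?, ?, ?, ?)", rows)
    conn.commit()


warehouse = sqlite3.connect(":memory:")
load(warehouse, transform(extract()))
print(warehouse.execute("SELECT * FROM sales ORDER BY id").fetchall())
```

Each transformation here (trimming, defaulting, deriving a field) is also a place where data can silently go wrong, which is what ETL testing is meant to catch.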
Data Warehouse: the Marketplace
“The data warehousing market will see a compound annual growth rate of
11.5% …to reach a total of $13.2 billion in revenue.”
- consulting specialist The 451 Group
Data Warehouse size
Small data warehouses: < 5 TB
Midsize data warehouses: 5 TB - 20 TB
Large data warehouses: >20 TB
- Analyst firm Gartner
Leaders in Data Warehouse Database Management Systems
[vendor logos]
- Analyst firm Gartner’s ‘Magic Quadrant for Data Warehouse Database Management Systems’
Testing the Data Warehouse: the ETL process
[Diagram: Source Data (Legacy DB, CRM/ERP DB, Finance DB) → ETL Process (Extract, Transform, Load) → Target Data Warehouse]
Testing Big Data
Data Warehouse & Hadoop: 2 Use Cases
[Diagram: (1) Data Warehouse and Hadoop; (2) NoSQL, Hadoop and Data Warehouse]
Use Case #1: Data Warehouse & Hadoop***
Use Hadoop as a landing zone for big data & raw data
1) bring all raw, big data into Hadoop
2) perform some pre-processing of this data
3) determine which data goes to Data Warehouse
4) Extract, transform and load (ETL) pertinent data into Data Warehouse
***Source: Vijay Ramaiah, IBM product manager, datanami magazine, June 10, 2013
Use Case #1: Data Warehouse & Hadoop
Recommended functional test strategy: test every entry point in the system (feeds, databases, internal messaging, front-end transactions).
The goal: provide rapid localization of data issues between points
[Diagram: Source Data, Source Hadoop, ETL Process, Target DWH and Business Intelligence software, with test entry points marked]
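The idea of localizing a data issue between entry points can be sketched as follows. This is a minimal illustration, assuming a snapshot of the same record set can be pulled at each entry point and that no transformation is expected between them; the stage names, rows and helper functions are hypothetical.

```python
import hashlib


def fingerprint(rows):
    """Summarize a data set as (row count, order-independent checksum)."""
    digest = hashlib.sha256()
    for line in sorted(repr(tuple(row)) for row in rows):
        digest.update(line.encode("utf-8"))
    return len(rows), digest.hexdigest()


def first_divergence(stages):
    """Return the first adjacent pair of stages whose fingerprints differ, or None."""
    for (name_a, rows_a), (name_b, rows_b) in zip(stages, stages[1:]):
        if fingerprint(rows_a) != fingerprint(rows_b):
            return name_a, name_b
    return None


# Hypothetical snapshots taken at each entry point of Use Case #1.
source = [(1, "a"), (2, "b"), (3, "c")]
hadoop = [(3, "c"), (1, "a"), (2, "b")]  # same rows, different order
dwh = [(1, "a"), (2, "b")]               # one row lost during ETL

stages = [("source", source), ("hadoop", hadoop), ("dwh", dwh)]
print(first_divergence(stages))
```

Where a stage is expected to transform the data, the same transformation would have to be applied to the upstream snapshot before comparing; the sketch leaves that out.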
Use Case #2:
MongoDB, Hadoop, DWH &
[Diagram: Source Data, Ingestion, Relational DB & Data Warehousing, and BI, Analytics & Reporting, with five test entry points marked]
Testing Big Data: 3 Big Issues
- We need to verify more data and to do it faster
- We need to automate the testing effort
- We need to be able to test across different platforms
We need a testing tool!
About QuerySurge™
What is QuerySurge™?
the collaborative
Big Data Testing solution
that finds bad data &
provides a holistic view of
your data’s health
the QuerySurge advantage
Automate the entire testing cycle
✓ Automate kickoff, tests, comparison, auto-emailed results
Create Tests easily with no SQL programming
✓ ensures minimal time & effort to create tests / obtain results
Test across different platforms
✓ Hadoop, data warehouses, NoSQL, database, flat file, XML
Collaborate with team
✓ Data Health dashboard, shared tests & auto-emailed reports
Verify more data & do it quickly
✓ verifies up to 100% of all data up to 1,000 x faster
Integrate for Continuous Delivery
✓ Integrates with most Build, ETL & QA management software
QuerySurge™ Architecture
Web-based…
Installs on...
Linux
Connects to…
…or any other JDBC compliant data source
[Diagram: QuerySurge Controller, QuerySurge Server and QuerySurge Agents; data sources include Flat Files]
QuerySurge™ Modules
Design Library
Scheduling
Deep-Dive Reporting
Run Dashboard
Query Wizards
Data Health Dashboard
Fast and Easy.
No programming needed.
QuerySurge™ Modules
Query Wizards
• Perform 80% of all data tests, no SQL coding needed
• Open up testing to novices & non-technical team members
• Speed up testing for skilled SQL coders
• Provide a huge Return-On-Investment
Design Library
• Create Query Pairs (source & target SQLs)
• Great for team members skilled with SQL
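The Query Pair idea (one SQL statement against the source, one against the target, then a comparison of the two result sets) can be sketched generically. This is not QuerySurge's implementation or API; it is a minimal illustration that assumes two in-memory SQLite databases as stand-ins for a source system and a warehouse, with hypothetical table names and a hypothetical `run_query_pair` helper.

```python
import sqlite3


def run_query_pair(source_conn, source_sql, target_conn, target_sql):
    """Run both queries and return the rows found on only one side."""
    source_rows = set(source_conn.execute(source_sql).fetchall())
    target_rows = set(target_conn.execute(target_sql).fetchall())
    return {
        "missing_in_target": sorted(source_rows - target_rows),
        "unexpected_in_target": sorted(target_rows - source_rows),
    }


# Two in-memory SQLite databases stand in for a source system and a warehouse.
source = sqlite3.connect(":memory:")
source.execute("CREATE TABLE orders (id INTEGER, amount REAL)")
source.executemany(
    "INSERT INTO orders VALUES (?, ?)", [(1, 10.0), (2, 20.0), (3, 30.0)]
)

target = sqlite3.connect(":memory:")
target.execute("CREATE TABLE fact_orders (order_id INTEGER, amount REAL)")
target.executemany("INSERT INTO fact_orders VALUES (?, ?)", [(1, 10.0), (2, 25.0)])

result = run_query_pair(
    source, "SELECT id, amount FROM orders",
    target, "SELECT order_id, amount FROM fact_orders",
)
print(result)
```

Comparing sets ignores duplicate rows, so a real comparison would also check row counts. For a Hadoop source, the source-side statement would be HiveQL run over a Hive connection rather than SQLite.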
QuerySurge™ Modules
Scheduling
✓ Build groups of Query Pairs
✓ Schedule Test Runs
Deep-Dive Reporting
✓ Examine and automatically email test results
Run Dashboard
✓ View real-time execution
✓ Analyze real-time results
QuerySurge™ Modules
Data Health Dashboard
• view data reliability & pass rate
• add, move, filter, zoom-in on any data widget & underlying data
• verify build success or failure
Free Trials & Training
(1) Trial in the Cloud of QuerySurge™, including self-learning tutorial that works with sample data for 3 days
(2) Downloaded Trial of QuerySurge™, including self-learning tutorial with sample data or your data for 15 days
For more information on our Trials, please visit:
www.querysurge.com/compare-trial-options
Big Data Testing Courses
Filled with examples and labs, this hands-on training teaches concepts and HQL techniques used in Big Data testing.
For more information on our Big Data Testing classes, please visit:
http://www.rttsweb.com/training/courses/big-data-testing-courses
A last word about Hadoop…
Big Data and Hadoop are on the verge of revolutionizing enterprise data management architectures.
- DeZyre
To see the video of this webinar please visit:
http://www.querysurge.com/solutions/testing-big-data/big-data-testing-for-hadoop

Slack (or Teams) Automation for Bonterra Impact Management (fka Social Soluti...
Jeffrey Haguewood
 
Monitoring Java Application Security with JDK Tools and JFR Events
Monitoring Java Application Security with JDK Tools and JFR EventsMonitoring Java Application Security with JDK Tools and JFR Events
Monitoring Java Application Security with JDK Tools and JFR Events
Ana-Maria Mihalceanu
 
FIDO Alliance Osaka Seminar: Passkeys and the Road Ahead.pdf
FIDO Alliance Osaka Seminar: Passkeys and the Road Ahead.pdfFIDO Alliance Osaka Seminar: Passkeys and the Road Ahead.pdf
FIDO Alliance Osaka Seminar: Passkeys and the Road Ahead.pdf
FIDO Alliance
 
Mission to Decommission: Importance of Decommissioning Products to Increase E...
Mission to Decommission: Importance of Decommissioning Products to Increase E...Mission to Decommission: Importance of Decommissioning Products to Increase E...
Mission to Decommission: Importance of Decommissioning Products to Increase E...
Product School
 
FIDO Alliance Osaka Seminar: Passkeys at Amazon.pdf
FIDO Alliance Osaka Seminar: Passkeys at Amazon.pdfFIDO Alliance Osaka Seminar: Passkeys at Amazon.pdf
FIDO Alliance Osaka Seminar: Passkeys at Amazon.pdf
FIDO Alliance
 
Epistemic Interaction - tuning interfaces to provide information for AI support
Epistemic Interaction - tuning interfaces to provide information for AI supportEpistemic Interaction - tuning interfaces to provide information for AI support
Epistemic Interaction - tuning interfaces to provide information for AI support
Alan Dix
 
UiPath Test Automation using UiPath Test Suite series, part 3
UiPath Test Automation using UiPath Test Suite series, part 3UiPath Test Automation using UiPath Test Suite series, part 3
UiPath Test Automation using UiPath Test Suite series, part 3
DianaGray10
 
LF Energy Webinar: Electrical Grid Modelling and Simulation Through PowSyBl -...
LF Energy Webinar: Electrical Grid Modelling and Simulation Through PowSyBl -...LF Energy Webinar: Electrical Grid Modelling and Simulation Through PowSyBl -...
LF Energy Webinar: Electrical Grid Modelling and Simulation Through PowSyBl -...
DanBrown980551
 
Unsubscribed: Combat Subscription Fatigue With a Membership Mentality by Head...
Unsubscribed: Combat Subscription Fatigue With a Membership Mentality by Head...Unsubscribed: Combat Subscription Fatigue With a Membership Mentality by Head...
Unsubscribed: Combat Subscription Fatigue With a Membership Mentality by Head...
Product School
 
Elevating Tactical DDD Patterns Through Object Calisthenics
Elevating Tactical DDD Patterns Through Object CalisthenicsElevating Tactical DDD Patterns Through Object Calisthenics
Elevating Tactical DDD Patterns Through Object Calisthenics
Dorra BARTAGUIZ
 
Essentials of Automations: Optimizing FME Workflows with Parameters
Essentials of Automations: Optimizing FME Workflows with ParametersEssentials of Automations: Optimizing FME Workflows with Parameters
Essentials of Automations: Optimizing FME Workflows with Parameters
Safe Software
 
DevOps and Testing slides at DASA Connect
DevOps and Testing slides at DASA ConnectDevOps and Testing slides at DASA Connect
DevOps and Testing slides at DASA Connect
Kari Kakkonen
 
Accelerate your Kubernetes clusters with Varnish Caching
Accelerate your Kubernetes clusters with Varnish CachingAccelerate your Kubernetes clusters with Varnish Caching
Accelerate your Kubernetes clusters with Varnish Caching
Thijs Feryn
 
Builder.ai Founder Sachin Dev Duggal's Strategic Approach to Create an Innova...
Builder.ai Founder Sachin Dev Duggal's Strategic Approach to Create an Innova...Builder.ai Founder Sachin Dev Duggal's Strategic Approach to Create an Innova...
Builder.ai Founder Sachin Dev Duggal's Strategic Approach to Create an Innova...
Ramesh Iyer
 
Securing your Kubernetes cluster_ a step-by-step guide to success !
Securing your Kubernetes cluster_ a step-by-step guide to success !Securing your Kubernetes cluster_ a step-by-step guide to success !
Securing your Kubernetes cluster_ a step-by-step guide to success !
KatiaHIMEUR1
 
GDG Cloud Southlake #33: Boule & Rebala: Effective AppSec in SDLC using Deplo...
GDG Cloud Southlake #33: Boule & Rebala: Effective AppSec in SDLC using Deplo...GDG Cloud Southlake #33: Boule & Rebala: Effective AppSec in SDLC using Deplo...
GDG Cloud Southlake #33: Boule & Rebala: Effective AppSec in SDLC using Deplo...
James Anderson
 
Leading Change strategies and insights for effective change management pdf 1.pdf
Leading Change strategies and insights for effective change management pdf 1.pdfLeading Change strategies and insights for effective change management pdf 1.pdf
Leading Change strategies and insights for effective change management pdf 1.pdf
OnBoard
 

Recently uploaded (20)

How world-class product teams are winning in the AI era by CEO and Founder, P...
How world-class product teams are winning in the AI era by CEO and Founder, P...How world-class product teams are winning in the AI era by CEO and Founder, P...
How world-class product teams are winning in the AI era by CEO and Founder, P...
 
The Future of Platform Engineering
The Future of Platform EngineeringThe Future of Platform Engineering
The Future of Platform Engineering
 
FIDO Alliance Osaka Seminar: FIDO Security Aspects.pdf
FIDO Alliance Osaka Seminar: FIDO Security Aspects.pdfFIDO Alliance Osaka Seminar: FIDO Security Aspects.pdf
FIDO Alliance Osaka Seminar: FIDO Security Aspects.pdf
 
Slack (or Teams) Automation for Bonterra Impact Management (fka Social Soluti...
Slack (or Teams) Automation for Bonterra Impact Management (fka Social Soluti...Slack (or Teams) Automation for Bonterra Impact Management (fka Social Soluti...
Slack (or Teams) Automation for Bonterra Impact Management (fka Social Soluti...
 
Monitoring Java Application Security with JDK Tools and JFR Events
Monitoring Java Application Security with JDK Tools and JFR EventsMonitoring Java Application Security with JDK Tools and JFR Events
Monitoring Java Application Security with JDK Tools and JFR Events
 
FIDO Alliance Osaka Seminar: Passkeys and the Road Ahead.pdf
FIDO Alliance Osaka Seminar: Passkeys and the Road Ahead.pdfFIDO Alliance Osaka Seminar: Passkeys and the Road Ahead.pdf
FIDO Alliance Osaka Seminar: Passkeys and the Road Ahead.pdf
 
Mission to Decommission: Importance of Decommissioning Products to Increase E...
Mission to Decommission: Importance of Decommissioning Products to Increase E...Mission to Decommission: Importance of Decommissioning Products to Increase E...
Mission to Decommission: Importance of Decommissioning Products to Increase E...
 
FIDO Alliance Osaka Seminar: Passkeys at Amazon.pdf
FIDO Alliance Osaka Seminar: Passkeys at Amazon.pdfFIDO Alliance Osaka Seminar: Passkeys at Amazon.pdf
FIDO Alliance Osaka Seminar: Passkeys at Amazon.pdf
 
Epistemic Interaction - tuning interfaces to provide information for AI support
Epistemic Interaction - tuning interfaces to provide information for AI supportEpistemic Interaction - tuning interfaces to provide information for AI support
Epistemic Interaction - tuning interfaces to provide information for AI support
 
UiPath Test Automation using UiPath Test Suite series, part 3
UiPath Test Automation using UiPath Test Suite series, part 3UiPath Test Automation using UiPath Test Suite series, part 3
UiPath Test Automation using UiPath Test Suite series, part 3
 
LF Energy Webinar: Electrical Grid Modelling and Simulation Through PowSyBl -...
LF Energy Webinar: Electrical Grid Modelling and Simulation Through PowSyBl -...LF Energy Webinar: Electrical Grid Modelling and Simulation Through PowSyBl -...
LF Energy Webinar: Electrical Grid Modelling and Simulation Through PowSyBl -...
 
Unsubscribed: Combat Subscription Fatigue With a Membership Mentality by Head...
Unsubscribed: Combat Subscription Fatigue With a Membership Mentality by Head...Unsubscribed: Combat Subscription Fatigue With a Membership Mentality by Head...
Unsubscribed: Combat Subscription Fatigue With a Membership Mentality by Head...
 
Elevating Tactical DDD Patterns Through Object Calisthenics
Elevating Tactical DDD Patterns Through Object CalisthenicsElevating Tactical DDD Patterns Through Object Calisthenics
Elevating Tactical DDD Patterns Through Object Calisthenics
 
Essentials of Automations: Optimizing FME Workflows with Parameters
Essentials of Automations: Optimizing FME Workflows with ParametersEssentials of Automations: Optimizing FME Workflows with Parameters
Essentials of Automations: Optimizing FME Workflows with Parameters
 
DevOps and Testing slides at DASA Connect
DevOps and Testing slides at DASA ConnectDevOps and Testing slides at DASA Connect
DevOps and Testing slides at DASA Connect
 
Accelerate your Kubernetes clusters with Varnish Caching
Accelerate your Kubernetes clusters with Varnish CachingAccelerate your Kubernetes clusters with Varnish Caching
Accelerate your Kubernetes clusters with Varnish Caching
 
Builder.ai Founder Sachin Dev Duggal's Strategic Approach to Create an Innova...
Builder.ai Founder Sachin Dev Duggal's Strategic Approach to Create an Innova...Builder.ai Founder Sachin Dev Duggal's Strategic Approach to Create an Innova...
Builder.ai Founder Sachin Dev Duggal's Strategic Approach to Create an Innova...
 
Securing your Kubernetes cluster_ a step-by-step guide to success !
Securing your Kubernetes cluster_ a step-by-step guide to success !Securing your Kubernetes cluster_ a step-by-step guide to success !
Securing your Kubernetes cluster_ a step-by-step guide to success !
 
GDG Cloud Southlake #33: Boule & Rebala: Effective AppSec in SDLC using Deplo...
GDG Cloud Southlake #33: Boule & Rebala: Effective AppSec in SDLC using Deplo...GDG Cloud Southlake #33: Boule & Rebala: Effective AppSec in SDLC using Deplo...
GDG Cloud Southlake #33: Boule & Rebala: Effective AppSec in SDLC using Deplo...
 
Leading Change strategies and insights for effective change management pdf 1.pdf
Leading Change strategies and insights for effective change management pdf 1.pdfLeading Change strategies and insights for effective change management pdf 1.pdf
Leading Change strategies and insights for effective change management pdf 1.pdf
 

Testing Big Data: Automated Testing of Hadoop with QuerySurge

  • 1. Testing Big Data: Automated ETL Testing of Hadoop. Bill Hayduk, CEO/President, RTTS. Jeff Bocarsly, Ph.D., Chief Architect, QuerySurge Division, RTTS. QuerySurge™ (built by RTTS): Automate your Data Warehouse & Big Data Testing and Reap the Benefits.
  • 2. Today’s Agenda: • About Big Data and Hadoop • Data Warehouse refresher • Hadoop and DWH Use Case • How to test Big Data • Demo of QuerySurge & Hadoop. Topic: Testing Big Data: Automated ETL Testing of Hadoop. Host: RTTS. Date: Thursday, January 30, 2014. Time: 1:00 pm, Eastern Standard Time (New York, GMT-05:00). Session number: 630 771 732.
  • 3. About RTTS. Facts: Founded: 1996. Locations: New York (HQ), Atlanta, Philadelphia, Phoenix. Strategic Partners: IBM, Microsoft, HP, Oracle, Teradata, Hortonworks, Cloudera, Amazon. Software: QuerySurge. RTTS is the leading provider of software & data quality for critical business systems.
  • 4. “Facebook handles 300 million photos a day and about 105 terabytes of data every 30 minutes.” - TechCrunch. “The big data market will grow from $3.2 billion in 2010 to $32.4 billion in 2017.” - Research Firm IDC. “65% of…advanced analytics will have Hadoop embedded (in them) by 2015.” - Gartner
  • 5. The Executive Office and Big Data. CxOs are using Business Intelligence & Analytics to make critical business decisions, with the assumption that the underlying data is fine. “The average organization loses $8.2 million annually through poor Data Quality.” - Gartner. [Diagram: Data Architecture, showing ETL and Business Intelligence (BI) software with potential problem areas marked.]
  • 6. About Big Data. Big data: defined as too much volume, velocity and variety to work on normal database architectures. Size: defined as 5 petabytes or more. 1 petabyte = 1,000 terabytes = 1,000,000 gigabytes = 1,000,000,000 megabytes.
  • 7. Big Data Impact. Handles more than 1 million customer transactions every hour. • data imported into databases that contain > 2.5 petabytes of data • the equivalent of 167 times the information contained in all the books in the US Library of Congress. Others: Facebook handles 40 billion photos from its user base. Google processes 1 Terabyte per hour. Twitter processes 85 million tweets per day. eBay processes 80 Terabytes per day.
  • 8. Big Data Solutions. Big data requires exceptional technologies to efficiently process large quantities of data within tolerable elapsed times. Technologies include: • massively parallel processing (MPP) databases • data warehouses • data mining grids • distributed file systems • distributed databases • cloud computing platforms • the Internet, and • scalable storage systems
  • 9. What is Hadoop? Hadoop is an open source project that develops software for scalable, distributed computing. • easily deals with complexities of high volume, velocity and variety of data • is a framework that allows for the distributed processing of large data sets across clusters of computers using simple programming models • scales up from single servers to 1,000’s of machines, each offering local computation and storage • detects and handles failures at the application layer
  • 10. Key Attributes of Hadoop • Redundant and reliable • Extremely powerful • Easy to program distributed apps • Runs on commodity hardware
  • 11. Top Vendors. “Spending on Hadoop software and subscriptions will increase to approximately $677 million by the end of 2017, with overall big data market anticipated to reach the $50 billion mark.” - Wikibon
  • 12. Basic Hadoop Architecture. Each machine runs two parts. MapReduce: the processing part that manages the programming jobs (a.k.a. Task Tracker). HDFS (Hadoop Distributed File System): stores data on the machines (a.k.a. Data Node).
  • 13. Basic Hadoop Architecture (continued). Cluster: add more machines for scaling, from 1 to 100 to 1,000. Job Tracker: accepts jobs, assigns tasks, identifies failed machines. Name Node: coordination for HDFS; inserts and extractions are communicated through the Name Node. [Diagram: a Name Node and a cluster of machines, each running a Task Tracker and a Data Node.]
  • 14. Apache Hive. Apache Hive is a data warehouse infrastructure built on top of Hadoop for providing data summarization, query, and analysis. Hive provides a mechanism to query the data using a SQL-like language called HiveQL that interacts with the HDFS files: • create • insert • update • delete • select. [Diagram: HiveQL layered on top of MapReduce (Task Tracker) and HDFS (Data Node).]
  • 16. About Data Warehouses. A data warehouse: • is typically a relational database that is designed for query and analysis rather than for transaction processing • is a place where historical data is stored for archival, analysis & security purposes • contains either raw or formatted data • combines data from multiple sources: sales, salaries, operational data, human resource data, inventory data, web logs, social networks, internet text and docs, other
  • 17. Data Warehouse: the ETL process. ETL: Extract, Transform, Load. Why ETL? The data warehouse needs to be loaded regularly (daily/weekly) so that it can serve its purpose of facilitating business analysis. Extract: data is extracted from one or more OLTP systems and copied into the warehouse. Transform: removing inconsistencies, assembling to a common format, adding missing fields, summarizing detailed data and deriving new fields to store calculated data. Load: map the data and load it into the DW.
  • 18. Data Warehouse: the Marketplace. “The data warehousing market will see a compound annual growth rate of 11.5% …to reach a total of $13.2 billion in revenue.” - consulting specialist The 451 Group. Data warehouse size: small data warehouses: < 5 TB; midsize data warehouses: 5 TB - 20 TB; large data warehouses: > 20 TB. - Analyst firm Gartner. Leaders in Data Warehouse Data Management Systems: [vendor logos not captured in the transcript] - Analyst firm Gartner’s ‘Magic Quadrant for Data Warehouse Database Management Systems’
  • 19. Testing the Data Warehouse: the ETL process. [Diagram: Source Data (Legacy DB, CRM/ERP DB, Finance DB), ETL Process (Extract, Transform, Load), Target Data Warehouse.]
  • 21. Data Warehouse & Hadoop: 2 Use Cases. [Diagram labels, first use case: Data Warehouse, Hadoop. Second use case: NoSQL, Hadoop, Data Warehouse.]
  • 22. Use Case #1: Data Warehouse & Hadoop. Use Hadoop as a landing zone for big data & raw data***: 1) bring all raw, big data into Hadoop 2) perform some pre-processing of this data 3) determine which data goes to Data Warehouse 4) Extract, transform and load (ETL) pertinent data into Data Warehouse. ***Source: Vijay Ramaiah, IBM product manager, datanami magazine, June 10, 2013
  • 23. Use Case #1: Data Warehouse & Hadoop. Recommended functional test strategy: test every entry point in the system (feeds, databases, internal messaging, front-end transactions). The goal: provide rapid localization of data issues between points. [Diagram labels: Source Data, Source Hadoop, ETL Process, Target DWH, Business Intelligence software; test entry points are marked along the flow.]
  • 24. Use Case #2: MongoDB, Hadoop, DWH & Relational DB. [Diagram labels: Source Data, Ingestion, Data Warehousing, BI, Analytics & Reporting; five test entry points are marked along the flow.]
  • 25. Testing Big Data: 3 Big Issues - we need to verify more data and to do it faster - we need to automate the testing effort - we need to be able to test across different platforms. We need a testing tool!
  • 27. What is QuerySurge™? The collaborative Big Data Testing solution that finds bad data & provides a holistic view of your data’s health.
  • 28. The QuerySurge advantage. Automate the entire testing cycle: automate kickoff, tests, comparison, auto-emailed results. Create tests easily with no SQL programming: ensures minimal time & effort to create tests / obtain results. Test across different platforms: Hadoop, data warehouses, NoSQL, database, flat file, XML. Collaborate with team: Data Health dashboard, shared tests & auto-emailed reports. Verify more data & do it quickly: verifies up to 100% of all data up to 1,000 x faster. Integrate for Continuous Delivery: integrates with most Build, ETL & QA management software.
  • 29. QuerySurge™ Architecture. Web-based: used from a web browser. Installs on: Windows & Linux. Connects to: the data sources shown on the slide, flat files, or any other JDBC-compliant data source. Components: QuerySurge Controller, QuerySurge Server, QuerySurge Agents.
  • 30. QuerySurge™ Modules: Design Library, Scheduling, Deep-Dive Reporting, Run Dashboard, Query Wizards, Data Health Dashboard.
  • 31. QuerySurge™ Modules: Query Wizards. Fast and easy. No programming needed. • Perform 80% of all data tests with no SQL coding needed • Opens up testing to novices & non-technical team members • Speeds up testing for skilled SQL coders • Provides a huge Return-On-Investment
  • 32. QuerySurge™ Modules. Design Library: • Create Query Pairs (source & target SQLs) • Great for team members skilled with SQL. Scheduling: • Build groups of Query Pairs • Schedule Test Runs
  • 33. QuerySurge™ Modules. Deep-Dive Reporting: • Examine and automatically email test results. Run Dashboard: • View real-time execution • Analyze real-time results
  • 34. QuerySurge™ Modules: Data Health Dashboard. • view data reliability & pass rate • add, move, filter, zoom-in on any data widget & underlying data • verify build success or failure
  • 35. Free Trials & Training. (1) Trial in the Cloud of QuerySurge™, including a self-learning tutorial that works with sample data, for 3 days. (2) Downloaded Trial of QuerySurge™, including a self-learning tutorial with sample data or your data, for 15 days. For more information on our Trials, please visit: www.querysurge.com/compare-trial-options. Big Data Testing Courses: filled with examples and labs, this hands-on training teaches concepts and HQL techniques used in Big Data testing. For more information on our Big Data Testing classes, please visit: http://www.rttsweb.com/training/courses/big-data-testing-courses
  • 36. A last word about Hadoop… “Big Data and Hadoop are on the verge of revolutionizing enterprise data management architectures.” - DeZyre. To see the video of this webinar please visit: http://www.querysurge.com/solutions/testing-big-data/big-data-testing-for-hadoop
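
Slides 23 and 32 describe testing by pairing a source query with a target query and comparing what each returns. The sketch below illustrates that idea only; it is not QuerySurge's API. It assumes an in-memory SQLite database standing in for both the source system and the warehouse, and the table names (`src_orders`, `dwh_orders`) and the function `compare_query_pair` are hypothetical.

```python
import sqlite3
from collections import Counter


def compare_query_pair(conn, source_sql, target_sql):
    """Run a source/target query pair and report rows that differ.

    Rows are compared as multisets, so duplicated rows are counted too.
    """
    source_rows = Counter(conn.execute(source_sql).fetchall())
    target_rows = Counter(conn.execute(target_sql).fetchall())
    return {
        "missing_in_target": sorted((source_rows - target_rows).elements()),
        "unexpected_in_target": sorted((target_rows - source_rows).elements()),
    }


def build_demo_db():
    """Create hypothetical source and warehouse tables with one bad row."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(
        """
        CREATE TABLE src_orders (id INTEGER, amount_cents INTEGER);
        CREATE TABLE dwh_orders (id INTEGER, amount_dollars REAL);
        INSERT INTO src_orders VALUES (1, 1000), (2, 2550), (3, 400);
        -- Row 3 was loaded with the wrong amount (40.0 instead of 4.0).
        INSERT INTO dwh_orders VALUES (1, 10.0), (2, 25.5), (3, 40.0);
        """
    )
    return conn


conn = build_demo_db()
result = compare_query_pair(
    conn,
    # The source query applies the same transformation the ETL should apply.
    "SELECT id, amount_cents / 100.0 FROM src_orders",
    "SELECT id, amount_dollars FROM dwh_orders",
)
print(result)
```

Running one such pair at each test entry point is one way to localize a defect to the stage between two points, which is the goal stated on slide 23.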

Editor's Notes

  1. Largest known cluster is 4500 nodes
  2. Designing and maintaining the ETL process is often considered one of the most difficult and resource-intensive portions of a data warehouse project. Many data warehousing projects use ETL tools to manage this process. Other data warehouse builders create their own ETL tools and processes, either inside or outside the database. Besides the support of extraction, transformation, and loading, there are some other tasks that are important for a successful ETL implementation as part of the daily operations of the data warehouse and its support for further enhancements.
  3. Informatica’s software is the premier tool used for ETL, but it was not mentioned in Gartner’s report because Informatica does not have DW software.
  4. QuerySurge provides insight into the health of your data throughout your organization through BI dashboards and reporting at your fingertips. It is a collaborative tool that allows for distributed use of the tool throughout your organization and provides for a sharable, holistic view of your data’s health and your organization’s level of maturity of your data management.
  5. Your distributed team from around the world can use any of these web browsers: Internet Explorer, Chrome, Firefox and Safari. QuerySurge installs on Windows and Linux operating systems. It connects to any JDBC-compliant data source, even if it is not listed here.
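
The Extract, Transform and Load steps from slide 17 and editor's note 2 can be sketched as three small functions. This is an illustrative sketch under stated assumptions, not a production ETL tool: the input rows, the `fact_sales` table and the function names are all hypothetical, and an in-memory SQLite database stands in for the data warehouse.

```python
import sqlite3


def extract():
    """Extract: return rows as they might come from an OLTP system (hypothetical data)."""
    return [
        {"id": 1, "region": " east ", "qty": 2, "unit_price": 5.0},
        {"id": 2, "region": "WEST", "qty": 1, "unit_price": 12.5},
        {"id": 3, "region": None, "qty": 4, "unit_price": 1.25},
    ]


def transform(rows):
    """Transform: remove inconsistencies, fill missing fields, derive a new field."""
    cleaned = []
    for row in rows:
        # Common format for region; a missing region becomes "unknown".
        region = (row["region"] or "unknown").strip().lower()
        # Derived field that stores calculated data.
        total = row["qty"] * row["unit_price"]
        cleaned.append((row["id"], region, total))
    return cleaned


def load(conn, rows):
    """Load: map the transformed rows onto the warehouse table."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS fact_sales "
        "(id INTEGER PRIMARY KEY, region TEXT, total REAL)"
    )
    conn.executemany("INSERT INTO fact_sales VALUES (?, ?, ?)", rows)
    conn.commit()


warehouse = sqlite3.connect(":memory:")
load(warehouse, transform(extract()))
loaded = warehouse.execute(
    "SELECT id, region, total FROM fact_sales ORDER BY id"
).fetchall()
print(loaded)
```

Keeping the three steps as separate functions means each one can be checked on its own, which matches the deck's advice to test at every entry point.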