SlideShare a Scribd company logo
1 of 50
Page1
Applying Testing Techniques to
Hadoop Development
Mark Johnson
markfjohnson@gmail.com
Page2
Who Am I?
• Too many years of programming experience
• Created lots of bugs in my career…but hate for them
to be found by others
Mark Johnson
Regional Director Services -
Hortonworks
Page3
Who are You?
• Technologist
• Programmer
• Want to
produce low
defect or
defect free
code in
Hadoop
Page4
Good Hadoop Development is
Hard!
Page5
Lots of Data !
Page6
• Volume
• Velocity
• Variety
Page7
Fast Release Pressure!
Page8
BUGS!
Page9
Will anyone really notice that
little problem which affects only
1% of the rows?
Page10
YES!
1% of the financial transactions on
Wall Street is a lot of money!
Big Data can amplify small
problems!
Page11
Problem #1: Speed of Light
Page12
Problem #2: Data Permutations
Page13
Problem #3: Processing distributed
across multiple nodes
Page14
How are you currently
preventing BUGS?
Page15
Consequences
The Longer you wait to find
and fix a problem the more
time it will take to fix it!
Page16
Verify “DONE”
Page17
What 4 things do we need to
Verify?
Page18
1.Does it run?
Page19
2.) Functional Correctness
to Requirements
• Data Filters
• Loops and branches
• Calculations
• Output
Page20
3. Resources Correctly
Referenced
• Missing files
• Bad file names
• Etc.
Page21
4. Exceptions and errors
properly handled
• System left in a safe state
• User / SysAdmin informed of the failure
• Idempotency
Page22
Materiality Rule
Materiality Rule: Start with High value tests (easy to test
AND valued by business)
Light & Fast : Don’t create an overly heavy test process
Positive Tests:
• Tests should contain one general data condition
• Individual tests only for those determined by specific
data conditions
Negative Tests:
• Data not present
Page23
Data
1Sample
Uncollected data
Page24
Traditional Sampling Problems
• Sample error reflects the risk that, purely by chance, a
randomly chosen sample of opinions does not reflect the
true views of the population. The “margin of error”
reported in opinion polls reflects this risk and the larger
the sample, the smaller the margin of error
• sampling bias is a bias in which a sample is collected in
such a way that some members of the intended
population are less likely to be included than others.
(wikipedia)
Page25
PIG SAMPLE
Page26
Manual Sampling
• Build the sample dataset based on your program’s
logic.
• Does not need to be more than a few rows for each
test.
• Can get embedded within your test program to
facilitate test management.
Page27
• MapReduce testing framework
• Accepts a small record sample for each test
• Test one thing per test
• Does not reference HDFS directly
• Tests Map and Reduce methods separetly
Page28
Example: Word Count Program -
Mapper
• Does the tokenize work properly
• Does the Mapper properly filter ‘stop’ words
• Is the counter properly initialized
Page29
Example: Word Count - Reducer
• Keys counter values grouped properly
• Counter values are properly aggregated
Page30
Setup MRUnit test suite
• MapDriver:
• ReduceDriver:
• MapReduceDriver:
Page31
MRUnit Mapper test
Page32
Test Results
Page33
MRUnit Reducer Test
Page34
Test Results
Page35
Traditional Pig Functional Testing tools
DUMP
ILLUSTRATE
DESCRIBE
EXPLAIN
Page36
PigUnit to test Pig scripts
• Only one LOAD operation
per test script.
• PigUnit overrides STORE,
LOAD and DUMP
Page37
PigUnit – Setup
• PigTest references
your unmodified pig
script for testing.
• Input:
• Include just the rows required
to test logic
Page38
PigUnit: Assert
Page39
Page40
PigUnit: Output Job Order
Page41
BeeTest
Developed by: Adam Kawa
GitHub Project: https://github.com/kawaa/Beetest.git
Beetest provides a simple ‘unit’ test capability on a given
Hadoop Hive script which runs on a small cluster to
validate a Hive script
Page42
Beetest: Inputs and Outputs
Directory containing the following files defining the
test process:
setup.hql – The HQL script to setup the environment
select.hql – The HQL script to test
Input.tsv – input data
Expected.txt – the script’s expected output
variables.properties – the Beetest properties
Page43
Setting up BeeTest
Page44
BeeTest: Executing the test
• ${table} – value defined in variables.properties
Page45
BeeTest – Test Results
Page46
BeeTest: Test failures
Page47
Getting started with your test initiative
1. Start Simple
2. Materiality rule: Focus on Highest value and easiest
tests first
3. Data Sampling: Use the smallest datasets possible
4. Keep tests “light and fast”
5. Keep Hadoop code and tests in a Source Code
Management system
6. Use Automated test environment (Jenkins, AntHill
Pro, etc.)
7. Maintain and publicly publish historical test results
Page48
Hadoop Test Tool Wrap Up
• Tools and Techniques
– Data Sampling
– MRUnit
– PigUnit
– Beetest
Testing environment still imature but good enough to start
using now
Page49
Mark Johnson
Regional Director Services
Hortonworks
markfjohnson@gmail.com
mjohnson@hortonworks.com
Linkedin:markfjohnson
Twitter: markfjohnson
Source:
https://github.com/mfjohnson/HadoopTesting.
git
Page50

More Related Content

What's hot

Data Warehouse Testing in the Pharmaceutical Industry
Data Warehouse Testing in the Pharmaceutical IndustryData Warehouse Testing in the Pharmaceutical Industry
Data Warehouse Testing in the Pharmaceutical Industry
RTTS
 
Radical Speed for SQL Queries on Databricks: Photon Under the Hood
Radical Speed for SQL Queries on Databricks: Photon Under the HoodRadical Speed for SQL Queries on Databricks: Photon Under the Hood
Radical Speed for SQL Queries on Databricks: Photon Under the Hood
Databricks
 

What's hot (20)

Testing Big Data: Automated ETL Testing of Hadoop
Testing Big Data: Automated ETL Testing of HadoopTesting Big Data: Automated ETL Testing of Hadoop
Testing Big Data: Automated ETL Testing of Hadoop
 
Big Data Testing: Ensuring MongoDB Data Quality
Big Data Testing: Ensuring MongoDB Data QualityBig Data Testing: Ensuring MongoDB Data Quality
Big Data Testing: Ensuring MongoDB Data Quality
 
Completing the Data Equation: Test Data + Data Validation = Success
Completing the Data Equation: Test Data + Data Validation = SuccessCompleting the Data Equation: Test Data + Data Validation = Success
Completing the Data Equation: Test Data + Data Validation = Success
 
Testistanbul 2016 - Keynote: "Enterprise Challenges of Test Data" by Rex Black
Testistanbul 2016 - Keynote: "Enterprise Challenges of Test Data" by Rex BlackTestistanbul 2016 - Keynote: "Enterprise Challenges of Test Data" by Rex Black
Testistanbul 2016 - Keynote: "Enterprise Challenges of Test Data" by Rex Black
 
Whitepaper: Volume Testing Thick Clients and Databases
Whitepaper:  Volume Testing Thick Clients and DatabasesWhitepaper:  Volume Testing Thick Clients and Databases
Whitepaper: Volume Testing Thick Clients and Databases
 
Test Automation for NoSQL Databases
Test Automation for NoSQL DatabasesTest Automation for NoSQL Databases
Test Automation for NoSQL Databases
 
Testing Big Data: Automated Testing of Hadoop with QuerySurge
Testing Big Data: Automated  Testing of Hadoop with QuerySurgeTesting Big Data: Automated  Testing of Hadoop with QuerySurge
Testing Big Data: Automated Testing of Hadoop with QuerySurge
 
Data Warehousing in Pharma: How to Find Bad Data while Meeting Regulatory Req...
Data Warehousing in Pharma: How to Find Bad Data while Meeting Regulatory Req...Data Warehousing in Pharma: How to Find Bad Data while Meeting Regulatory Req...
Data Warehousing in Pharma: How to Find Bad Data while Meeting Regulatory Req...
 
the Data World Distilled
the Data World Distilledthe Data World Distilled
the Data World Distilled
 
Data Warehouse Testing in the Pharmaceutical Industry
Data Warehouse Testing in the Pharmaceutical IndustryData Warehouse Testing in the Pharmaceutical Industry
Data Warehouse Testing in the Pharmaceutical Industry
 
QuerySurge Slide Deck for Big Data Testing Webinar
QuerySurge Slide Deck for Big Data Testing WebinarQuerySurge Slide Deck for Big Data Testing Webinar
QuerySurge Slide Deck for Big Data Testing Webinar
 
Machine learning in production
Machine learning in productionMachine learning in production
Machine learning in production
 
Continuous Integration & Continuous Delivery
Continuous Integration & Continuous DeliveryContinuous Integration & Continuous Delivery
Continuous Integration & Continuous Delivery
 
DataEngConf SF16 - Methods for Content Relevance at LinkedIn
DataEngConf SF16 - Methods for Content Relevance at LinkedInDataEngConf SF16 - Methods for Content Relevance at LinkedIn
DataEngConf SF16 - Methods for Content Relevance at LinkedIn
 
Radical Speed for SQL Queries on Databricks: Photon Under the Hood
Radical Speed for SQL Queries on Databricks: Photon Under the HoodRadical Speed for SQL Queries on Databricks: Photon Under the Hood
Radical Speed for SQL Queries on Databricks: Photon Under the Hood
 
Big Data Day LA 2016/ Hadoop/ Spark/ Kafka track - Building an Event-oriented...
Big Data Day LA 2016/ Hadoop/ Spark/ Kafka track - Building an Event-oriented...Big Data Day LA 2016/ Hadoop/ Spark/ Kafka track - Building an Event-oriented...
Big Data Day LA 2016/ Hadoop/ Spark/ Kafka track - Building an Event-oriented...
 
Semantic Image Logging Using Approximate Statistics & MLflow
Semantic Image Logging Using Approximate Statistics & MLflowSemantic Image Logging Using Approximate Statistics & MLflow
Semantic Image Logging Using Approximate Statistics & MLflow
 
Building Data Science into Organizations: Field Experience
Building Data Science into Organizations: Field ExperienceBuilding Data Science into Organizations: Field Experience
Building Data Science into Organizations: Field Experience
 
How to Productionize Your Machine Learning Models Using Apache Spark MLlib 2....
How to Productionize Your Machine Learning Models Using Apache Spark MLlib 2....How to Productionize Your Machine Learning Models Using Apache Spark MLlib 2....
How to Productionize Your Machine Learning Models Using Apache Spark MLlib 2....
 
Data Security and Protection in DevOps
Data Security and Protection in DevOps Data Security and Protection in DevOps
Data Security and Protection in DevOps
 

Similar to Applying Testing Techniques for Big Data and Hadoop

Verification Bug Metrics: A Different Approach
Verification Bug Metrics: A Different ApproachVerification Bug Metrics: A Different Approach
Verification Bug Metrics: A Different Approach
DVClub
 
An Agile Approach to Machine Learning
An Agile Approach to Machine LearningAn Agile Approach to Machine Learning
An Agile Approach to Machine Learning
Randy Shoup
 
Driving application development through behavior driven development
Driving application development through behavior driven developmentDriving application development through behavior driven development
Driving application development through behavior driven development
Einar Ingebrigtsen
 

Similar to Applying Testing Techniques for Big Data and Hadoop (20)

Small is Beautiful- Fully Automate your Test Case Design
Small is Beautiful- Fully Automate your Test Case DesignSmall is Beautiful- Fully Automate your Test Case Design
Small is Beautiful- Fully Automate your Test Case Design
 
Performance tuning Grails applications SpringOne 2GX 2014
Performance tuning Grails applications SpringOne 2GX 2014Performance tuning Grails applications SpringOne 2GX 2014
Performance tuning Grails applications SpringOne 2GX 2014
 
Can we induce change with what we measure?
Can we induce change with what we measure?Can we induce change with what we measure?
Can we induce change with what we measure?
 
Bugday bkk-2014 nitisak-auto_perf
Bugday bkk-2014 nitisak-auto_perfBugday bkk-2014 nitisak-auto_perf
Bugday bkk-2014 nitisak-auto_perf
 
Verification Bug Metrics: A Different Approach
Verification Bug Metrics: A Different ApproachVerification Bug Metrics: A Different Approach
Verification Bug Metrics: A Different Approach
 
MyHeritage - End 2 End testing Infra
MyHeritage - End 2 End testing InfraMyHeritage - End 2 End testing Infra
MyHeritage - End 2 End testing Infra
 
Continuous Deployment To The Cloud
Continuous Deployment To The CloudContinuous Deployment To The Cloud
Continuous Deployment To The Cloud
 
TLC2018 Thomas Haver: The Automation Firehose - Be Strategic and Tactical
TLC2018 Thomas Haver: The Automation Firehose - Be Strategic and TacticalTLC2018 Thomas Haver: The Automation Firehose - Be Strategic and Tactical
TLC2018 Thomas Haver: The Automation Firehose - Be Strategic and Tactical
 
Test workload otochkin_ppt
Test workload otochkin_pptTest workload otochkin_ppt
Test workload otochkin_ppt
 
Testing in Production - presentation & webinar by Amber Race
Testing in Production - presentation & webinar by Amber RaceTesting in Production - presentation & webinar by Amber Race
Testing in Production - presentation & webinar by Amber Race
 
An Agile Approach to Machine Learning
An Agile Approach to Machine LearningAn Agile Approach to Machine Learning
An Agile Approach to Machine Learning
 
Load testing with Visual Studio and Azure - Andrew Siemer
Load testing with Visual Studio and Azure - Andrew SiemerLoad testing with Visual Studio and Azure - Andrew Siemer
Load testing with Visual Studio and Azure - Andrew Siemer
 
Application Quality Gates in Continuous Delivery: Deliver Better Software Fas...
Application Quality Gates in Continuous Delivery: Deliver Better Software Fas...Application Quality Gates in Continuous Delivery: Deliver Better Software Fas...
Application Quality Gates in Continuous Delivery: Deliver Better Software Fas...
 
Tool up your lamp stack
Tool up your lamp stackTool up your lamp stack
Tool up your lamp stack
 
Tool Up Your LAMP Stack
Tool Up Your LAMP StackTool Up Your LAMP Stack
Tool Up Your LAMP Stack
 
Driving application development through behavior driven development
Driving application development through behavior driven developmentDriving application development through behavior driven development
Driving application development through behavior driven development
 
Introduction to Unit Tests and TDD
Introduction to Unit Tests and TDDIntroduction to Unit Tests and TDD
Introduction to Unit Tests and TDD
 
Lecture #6. automation testing (andrey oleynik)
Lecture #6. automation testing (andrey oleynik)Lecture #6. automation testing (andrey oleynik)
Lecture #6. automation testing (andrey oleynik)
 
The Fifth Elephant 2016: Self-Serve Performance Tuning for Hadoop and Spark
The Fifth Elephant 2016: Self-Serve Performance Tuning for Hadoop and SparkThe Fifth Elephant 2016: Self-Serve Performance Tuning for Hadoop and Spark
The Fifth Elephant 2016: Self-Serve Performance Tuning for Hadoop and Spark
 
BTV PHP - Building Fast Websites
BTV PHP - Building Fast WebsitesBTV PHP - Building Fast Websites
BTV PHP - Building Fast Websites
 

Recently uploaded

Architecting Cloud Native Applications
Architecting Cloud Native ApplicationsArchitecting Cloud Native Applications
Architecting Cloud Native Applications
WSO2
 
Why Teams call analytics are critical to your entire business
Why Teams call analytics are critical to your entire businessWhy Teams call analytics are critical to your entire business
Why Teams call analytics are critical to your entire business
panagenda
 
+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...
+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...
+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...
?#DUbAI#??##{{(☎️+971_581248768%)**%*]'#abortion pills for sale in dubai@
 

Recently uploaded (20)

Architecting Cloud Native Applications
Architecting Cloud Native ApplicationsArchitecting Cloud Native Applications
Architecting Cloud Native Applications
 
ProductAnonymous-April2024-WinProductDiscovery-MelissaKlemke
ProductAnonymous-April2024-WinProductDiscovery-MelissaKlemkeProductAnonymous-April2024-WinProductDiscovery-MelissaKlemke
ProductAnonymous-April2024-WinProductDiscovery-MelissaKlemke
 
Why Teams call analytics are critical to your entire business
Why Teams call analytics are critical to your entire businessWhy Teams call analytics are critical to your entire business
Why Teams call analytics are critical to your entire business
 
AXA XL - Insurer Innovation Award Americas 2024
AXA XL - Insurer Innovation Award Americas 2024AXA XL - Insurer Innovation Award Americas 2024
AXA XL - Insurer Innovation Award Americas 2024
 
Artificial Intelligence Chap.5 : Uncertainty
Artificial Intelligence Chap.5 : UncertaintyArtificial Intelligence Chap.5 : Uncertainty
Artificial Intelligence Chap.5 : Uncertainty
 
Strategies for Landing an Oracle DBA Job as a Fresher
Strategies for Landing an Oracle DBA Job as a FresherStrategies for Landing an Oracle DBA Job as a Fresher
Strategies for Landing an Oracle DBA Job as a Fresher
 
TrustArc Webinar - Unlock the Power of AI-Driven Data Discovery
TrustArc Webinar - Unlock the Power of AI-Driven Data DiscoveryTrustArc Webinar - Unlock the Power of AI-Driven Data Discovery
TrustArc Webinar - Unlock the Power of AI-Driven Data Discovery
 
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
 
Connector Corner: Accelerate revenue generation using UiPath API-centric busi...
Connector Corner: Accelerate revenue generation using UiPath API-centric busi...Connector Corner: Accelerate revenue generation using UiPath API-centric busi...
Connector Corner: Accelerate revenue generation using UiPath API-centric busi...
 
+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...
+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...
+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...
 
Automating Google Workspace (GWS) & more with Apps Script
Automating Google Workspace (GWS) & more with Apps ScriptAutomating Google Workspace (GWS) & more with Apps Script
Automating Google Workspace (GWS) & more with Apps Script
 
Strategize a Smooth Tenant-to-tenant Migration and Copilot Takeoff
Strategize a Smooth Tenant-to-tenant Migration and Copilot TakeoffStrategize a Smooth Tenant-to-tenant Migration and Copilot Takeoff
Strategize a Smooth Tenant-to-tenant Migration and Copilot Takeoff
 
TrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
TrustArc Webinar - Stay Ahead of US State Data Privacy Law DevelopmentsTrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
TrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
 
Axa Assurance Maroc - Insurer Innovation Award 2024
Axa Assurance Maroc - Insurer Innovation Award 2024Axa Assurance Maroc - Insurer Innovation Award 2024
Axa Assurance Maroc - Insurer Innovation Award 2024
 
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
 
Apidays New York 2024 - Accelerating FinTech Innovation by Vasa Krishnan, Fin...
Apidays New York 2024 - Accelerating FinTech Innovation by Vasa Krishnan, Fin...Apidays New York 2024 - Accelerating FinTech Innovation by Vasa Krishnan, Fin...
Apidays New York 2024 - Accelerating FinTech Innovation by Vasa Krishnan, Fin...
 
MINDCTI Revenue Release Quarter One 2024
MINDCTI Revenue Release Quarter One 2024MINDCTI Revenue Release Quarter One 2024
MINDCTI Revenue Release Quarter One 2024
 
EMPOWERMENT TECHNOLOGY GRADE 11 QUARTER 2 REVIEWER
EMPOWERMENT TECHNOLOGY GRADE 11 QUARTER 2 REVIEWEREMPOWERMENT TECHNOLOGY GRADE 11 QUARTER 2 REVIEWER
EMPOWERMENT TECHNOLOGY GRADE 11 QUARTER 2 REVIEWER
 
Corporate and higher education May webinar.pptx
Corporate and higher education May webinar.pptxCorporate and higher education May webinar.pptx
Corporate and higher education May webinar.pptx
 
presentation ICT roal in 21st century education
presentation ICT roal in 21st century educationpresentation ICT roal in 21st century education
presentation ICT roal in 21st century education
 

Applying Testing Techniques for Big Data and Hadoop

Editor's Notes

  1. Example:23 mb/sec File Size: 500mb -> 22 seconds File Size: 1,000 -> 43 seconds
  2. More data often means more possible data permutations.
  3. Define standard sample problems…then transition to our relevant sample errors
  4. Random dataset…Only parameter is % of the dataset Testing though sometime needs to reference the data datasets!