
Applying Testing Techniques for Big Data and Hadoop


Testing “Big Data” can mean a big time investment; several hours are often spent just to realize you made a simple typo. You fix the typo and then wait another couple of hours for your script to, hopefully, run to completion this time. And even if the Big Data script or program ran to completion, are you sure your data analysis is correct? Getting programs to run to completion, and assuring functional accuracy per the requirements, are some of the biggest hidden problems in big data today.

During this overview presentation we will first introduce unit and functional testing techniques and high-level concepts to consider in the Hadoop ecosystem. In the second half of the presentation we will explore real testing examples using tools such as PigUnit, JUnit for UDF testing, BeeTest, and Hive testing with limited test data sets.

Published in: Technology


  1. Applying Testing Techniques to Hadoop Development. Mark Johnson, markfjohnson@gmail.com
  2. Who Am I? Mark Johnson, Regional Director, Services, Hortonworks • Too many years of programming experience • Created lots of bugs in my career… but hate for them to be found by others
  3. Who Are You? • Technologist • Programmer • Want to produce low-defect or defect-free code in Hadoop
  4. Good Hadoop Development is Hard!
  5. Lots of Data!
  6. • Volume • Velocity • Variety
  7. Fast Release Pressure!
  8. BUGS!
  9. Will anyone really notice that little problem which affects only 1% of the rows?
  10. YES! 1% of the financial transactions on Wall Street is a lot of money! Big Data can amplify small problems!
  11. Problem #1: Speed of Light
  12. Problem #2: Data Permutations
  13. Problem #3: Processing Distributed Across Multiple Nodes
  14. How are you currently preventing BUGS?
  15. Consequences: the longer you wait to find and fix a problem, the more time it will take to fix it!
  16. Verify “DONE”
  17. What 4 things do we need to verify?
  18. 1. Does it run?
  19. 2. Functional correctness to requirements • Data filters • Loops and branches • Calculations • Output
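Each of these functional checks can be pinned down with a small, deterministic unit test against a handful of known rows. A minimal sketch of the idea (the `filter_valid` and `total_amount` helpers and the record layout are illustrative assumptions, not from the deck):

```python
# Hypothetical record-level logic standing in for a Hadoop job's inner loop.

def filter_valid(rows):
    """Data filter: keep only rows with a positive amount."""
    return [r for r in rows if r["amount"] > 0]

def total_amount(rows):
    """Calculation: sum the amounts of the filtered rows."""
    return sum(r["amount"] for r in rows)

rows = [
    {"id": 1, "amount": 10.0},
    {"id": 2, "amount": -5.0},   # should be filtered out
    {"id": 3, "amount": 2.5},
]

kept = filter_valid(rows)
assert [r["id"] for r in kept] == [1, 3]   # filter matches the requirement
assert total_amount(kept) == 12.5          # calculation checked against a known output
```

The point is that the filter, the branch, and the calculation are each verified against an output you computed by hand, before the job ever touches a full dataset.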
  20. 3. Resources correctly referenced • Missing files • Bad file names • Etc.
  21. 4. Exceptions and errors properly handled • System left in a safe state • User/SysAdmin informed of the failure • Idempotency
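Idempotency here means that re-running the same job must not duplicate or corrupt output. One common pattern, sketched below with a completion marker written last (mirroring Hadoop's `_SUCCESS` file convention; the function itself is a hypothetical stand-in for a job's output step):

```python
import os
import tempfile

def write_output(out_dir, lines):
    """Idempotent output step: a re-run against a completed directory is a no-op."""
    marker = os.path.join(out_dir, "_SUCCESS")
    if os.path.exists(marker):        # already completed: safe to skip
        return False
    with open(os.path.join(out_dir, "part-00000"), "w") as f:
        f.write("\n".join(lines))
    open(marker, "w").close()         # commit marker written last, after the data
    return True

out = tempfile.mkdtemp()
assert write_output(out, ["a\t1", "b\t2"]) is True    # first run does the work
assert write_output(out, ["a\t1", "b\t2"]) is False   # re-run leaves state untouched
```

Because the marker is written only after the data, a crash mid-write leaves no marker and the retry safely redoes the work, which is the "system left in a safe state" bullet above.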
  22. Materiality Rule: start with high-value tests (easy to test AND valued by the business). Light & fast: don’t create an overly heavy test process. Positive tests: • Tests should contain one general data condition • Individual tests only for those determined by specific data conditions. Negative tests: • Data not present
  23. Data Sampling • Uncollected data
  24. Traditional Sampling Problems • Sampling error reflects the risk that, purely by chance, a randomly chosen sample does not reflect the true views of the population. The “margin of error” reported in opinion polls reflects this risk, and the larger the sample, the smaller the margin of error. • Sampling bias is a bias in which a sample is collected in such a way that some members of the intended population are less likely to be included than others. (Wikipedia)
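The margin-of-error claim is easy to make concrete with the standard formula for a proportion, z · √(p(1−p)/n), shown here at the usual 95% confidence level:

```python
import math

def margin_of_error(p, n, z=1.96):
    """95% margin of error for an observed proportion p in a sample of size n."""
    return z * math.sqrt(p * (1 - p) / n)

# The larger the sample, the smaller the margin of error:
assert margin_of_error(0.5, 100) > margin_of_error(0.5, 10_000)
assert round(margin_of_error(0.5, 1000), 3) == 0.031   # ~±3.1 points for n = 1000
```

Note that a bigger sample only shrinks sampling *error*; it does nothing about sampling *bias*, which is why the next slide advocates building samples by hand.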
  25. Pig SAMPLE
  26. Manual Sampling • Build the sample dataset based on your program’s logic. • It does not need to be more than a few rows for each test. • It can be embedded within your test program to facilitate test management.
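A manually built sample typically has one hand-picked row per logic branch, embedded right in the test file. A sketch of what that looks like (the (user, country, amount) layout and `us_revenue` logic are hypothetical):

```python
# Hand-picked sample: one row per branch of the logic under test,
# embedded in the test program rather than loaded from HDFS.
SAMPLE = [
    ("alice", "US", 120.0),   # normal row: passes the filter
    ("bob",   "US", 0.0),     # boundary case: zero amount is excluded
    ("carol", "",   35.0),    # missing country: exercises the reject branch
]

def us_revenue(rows):
    """Sum amounts for US rows with a positive amount."""
    return sum(amount for _, country, amount in rows
               if country == "US" and amount > 0)

assert us_revenue(SAMPLE) == 120.0   # only alice's row survives both conditions
```

Three rows are enough here because each row exists to exercise exactly one condition; a randomly drawn sample of thousands of rows might miss the boundary cases entirely.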
  27. MRUnit • MapReduce testing framework • Accepts a small record sample for each test • Tests one thing per test • Does not reference HDFS directly • Tests Map and Reduce methods separately
  28. Example: Word Count Program, Mapper • Does the tokenizer work properly? • Does the Mapper properly filter ‘stop’ words? • Is the counter properly initialized?
  29. Example: Word Count, Reducer • Are counter values grouped properly by key? • Are counter values properly aggregated?
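MRUnit itself is a Java library, so as a language-neutral sketch of the checks above, here is the word-count mapper and reducer logic with its tests written in Python (the stop-word list and whitespace tokenizer are illustrative assumptions):

```python
STOP_WORDS = {"the", "a", "of"}   # illustrative stop-word list

def map_word_count(line):
    """Mapper: tokenize, filter stop words, emit (word, 1) pairs."""
    for token in line.lower().split():
        if token not in STOP_WORDS:
            yield (token, 1)      # counter initialized to 1 per occurrence

def reduce_word_count(word, counts):
    """Reducer: aggregate all counts grouped under one key."""
    return (word, sum(counts))

# Mapper checks: tokenizing, stop-word filtering, counter initialization.
assert list(map_word_count("the cat saw the cat")) == [
    ("cat", 1), ("saw", 1), ("cat", 1)]

# Reducer check: values grouped under one key are aggregated correctly.
assert reduce_word_count("cat", [1, 1]) == ("cat", 2)
```

Note the MRUnit discipline this mirrors: the map and reduce functions are tested separately, on a handful of records, with no cluster or HDFS involved.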
  30. Setting up the MRUnit test suite • MapDriver • ReduceDriver • MapReduceDriver
  31. MRUnit Mapper Test
  32. Test Results
  33. MRUnit Reducer Test
  34. Test Results
  35. Traditional Pig functional testing tools: DUMP, ILLUSTRATE, DESCRIBE, EXPLAIN
  36. PigUnit to test Pig scripts • Only one LOAD operation per test script. • PigUnit overrides STORE, LOAD, and DUMP.
  37. PigUnit Setup • PigTest references your unmodified Pig script for testing. • Input: include just the rows required to test the logic.
  38. PigUnit: Assert
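The PigUnit assert pattern is: feed a tiny inline input to the script, then assert the exact tuples an output alias produces. The same pattern, sketched in Python since PigUnit itself is Java (the `group_count` function is a hypothetical stand-in for a Pig GROUP/COUNT pipeline):

```python
from collections import Counter

def group_count(rows):
    """Stand-in for a Pig pipeline: GROUP data BY word, then COUNT each group."""
    return sorted(Counter(rows).items())

input_rows = ["hadoop", "pig", "hadoop"]    # just the rows needed to test the logic
expected   = [("hadoop", 2), ("pig", 1)]    # the exact output tuples we assert on

assert group_count(input_rows) == expected
```

Because PigUnit overrides LOAD and STORE, the real script under test runs unmodified against this kind of inline input instead of HDFS.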
  40. PigUnit: Output Job Order
  41. BeeTest • Developed by Adam Kawa • GitHub project: https://github.com/kawaa/Beetest.git • Beetest provides a simple ‘unit’ test capability for a Hive script, running it on a small cluster to validate the script.
  42. Beetest: Inputs and Outputs. A directory containing the following files defines the test process: • setup.hql – the HQL script to set up the environment • select.hql – the HQL script to test • input.tsv – input data • expected.txt – the script’s expected output • variables.properties – the Beetest properties
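Following that layout, a hypothetical test directory for a word-count query might look like this (all file contents below are illustrative, not taken from the Beetest project):

```
my-test/
  setup.hql             -- CREATE TABLE IF NOT EXISTS ${table} (word STRING);
  select.hql            -- SELECT word, COUNT(*) FROM ${table} GROUP BY word;
  input.tsv             -- rows loaded into the table, one value per line
  expected.txt          -- the rows select.hql is expected to return
  variables.properties  -- table=words
```

Beetest then runs setup.hql and select.hql against the small cluster and diffs the actual output against expected.txt, which is the same tiny-input/expected-output discipline as the PigUnit and MRUnit examples.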
  43. Setting up BeeTest
  44. BeeTest: Executing the Test • ${table} – value defined in variables.properties
  45. BeeTest: Test Results
  46. BeeTest: Test Failures
  47. Getting started with your test initiative: 1. Start simple. 2. Materiality rule: focus on the highest-value and easiest tests first. 3. Data sampling: use the smallest datasets possible. 4. Keep tests “light and fast”. 5. Keep Hadoop code and tests in a source code management system. 6. Use an automated test environment (Jenkins, AnthillPro, etc.). 7. Maintain and publicly publish historical test results.
  48. Hadoop Test Tool Wrap-Up • Tools and techniques: data sampling, MRUnit, PigUnit, Beetest. • The testing environment is still immature, but good enough to start using now.
  49. Mark Johnson, Regional Director, Services, Hortonworks • markfjohnson@gmail.com • mjohnson@hortonworks.com • LinkedIn: markfjohnson • Twitter: markfjohnson • Source: https://github.com/mfjohnson/HadoopTesting.git
