Bug bites Elephant?Test-driven Quality Assurancein Big Data Application DevelopmentDr. Dominik Benz, inovex GmbH2013/06/03...
? TDD!???????Write/execute tests, specifyacceptance criteria, …2Who speaks… … the Elephant language?Class AextendsMapper…R...
3The road… … to Big Data QAour Big DataQA problemthe FitNesseapproachtest datadefinition /selectionjob & workflowcontrolre...
4QA problem Web Intelligence @ 1&1DWHHadoop Cluster~ 1 billion logevents / day,~ 1 TB (thrift)logfileschains of MR jobs,ru...
5QA problem An exemplary workflowLog Files(thrift)Log Files(thrift)Log Files(thrift)Inter-mediateresult(avro)MRjob 1… DWH(...
method tests what? issues for our usecaseJUnit isolated functions no integration, Java syntaxMRUnit 1 mapper + 1 reducer „...
7The road… … to Big Data QABig Data QAis different!the FitNesseapproachtest datadefinition /selectionjob & workflowcontrol...
8FitNesse In a nutshell„fully integratedstandalone wiki andacceptance testingframework”• „executable“ Wiki-Pages (returnin...
9FitNesse Architecture Overviewscript |check |num results |3 |BrowserFitNesseServerpublic intnumResults{ ... }System under...
10FitNesse An Exemplary Test
11FitNesse Exemplary Test Source!path /home/inovex/lib/*.jar| script | Hadoop || upload | viewLog.csv | to hdfs | /testdat...
12FitNesse Hadoop Fixture Java Codepublic class Hadoop {public boolean uploadToHdfs(String localFile,String remoteFile) {....
13The road… … to Big Data QABig Data QAis different!Fitnesse Wikitest execution!test datadefinition /selectionjob & workfl...
14Test Data CSV
‣ Big Data: Efficient data transfer among heterogeneous sources‣ Define Interface via IDL, Compiler for many languages15Te...
‣ Dev/Test Hadoop Cluster: Identical Hardware like Prod, butfewer nodes‣ (random/biased) sampling e.g. on daily basis‣ Fee...
17The road… … to Big Data QABig Data QAis different!FitNesse Wikitest execution!Define CSV /thrift / real-world test data!...
‣ Execute arbitrary (shell) commands‣ Mainly a wrapper around apache.commons.exec.CommandLine18Job Control Swiss Army Knif...
‣ Hide complexity from test authors‣ „define“ appropriate test language via (Java) method names‣ re-use other fixtures (Sh...
‣ FitNesse allows to group tests into suites‣ Can be used to simulate MR processing chains‣ SetupSuite / TearDownSuite for...
21The road… … to Big Data QABig Data QAis different!FitNesse Wikitest execution!Define CSV /thrift / real-world data!Use s...
‣ Validate RDBMS contents (via JDBC)‣ E.g. for checking the final result‣ Or use Hive + Hive-Server to query raw data22Res...
‣ Execute arbitrary pig commands from Wiki page‣ Inspect e.g. binary intermediate results (avro, …)23Results Pig
public class PigConsole extends PigServer {public void loadAvroFileUsingAlias(Stringfilename, String alias) {this.register...
25Results Server InfrastructureFitnesse MasterTestEnvironmentsProjA ProjBTestConfigurationsProjA ProjBdev qs live dev qs l...
26Thank you! dominik.benz@inovex.deBig Data QAis different!FitNesse Wikitest execution!Define CSV /thrift / real-world dat...
27Want more? Inovex trains you!Android Developer Training (3 days, Karlsruhe/München)Hadoop Developer Training (3 days, Ka...
28Inovex @bbuzzStefan Kathrin BernhardJörgAndrewChristian Christian
Upcoming SlideShare
Loading in …5
×

Bug bites Elephant? Test-driven Quality Assurance in Big Data Application Development

1,010 views
796 views

Published on

Around the currently available large piles of Big Data, there's happening quite a mixed gathering: Business Engineers define which insightswould be precious, Analysts build models, Hadoop programmers tame the flood of data, and Operations people setup machines and networks. It's exactly the interplay of all participants which is central to project success. This setup together with the distributed nature of processing poses new challenges to well-established models of assuring software artifact quality: How can non-programmers define acceptance criteria? How can functionalities be tested which depend on cluster execution, orchestration of, e.g., different hadoop jobs without delaying the development process? Which data selection is suited best for simulating the live environment? How can intermediate results in arbitrary serialization formats be inspected?

In this talk, experiences and best practices from approaching these problems in a large-scale log data analysis project will be presented. At 1&1, our team develops hadoop applications which process roughly 1 billion log events (~1 TB) per day. We will give an overview of the hard- and software setup of our quality assurance environment, which includes FitNesse as a wiki-style acceptance testing framework.Starting from a comparison with existing test frameworks like MRUnit, we will explain how we automate the parameterized deployment of our applications, choose test data sampling strategies, perform workflow management and orchestration of jobs / applications, and use Pig for inspection of intermediate results and definition of final acceptance criteria. Our conclusion is that test-driven development in the field of Big Data requires adaption of existing paradigms, but is crucial for maintaining high quality standards for the resulting applications.

Published in: Technology
0 Comments
3 Likes
Statistics
Notes
  • Be the first to comment

No Downloads
Views
Total views
1,010
On SlideShare
0
From Embeds
0
Number of Embeds
100
Actions
Shares
0
Downloads
0
Comments
0
Likes
3
Embeds 0
No embeds

No notes for slide

Bug bites Elephant? Test-driven Quality Assurance in Big Data Application Development

  1. 1. Bug bites Elephant?Test-driven Quality Assurancein Big Data Application DevelopmentDr. Dominik Benz, inovex GmbH2013/06/03, Berlin Buzzwords
  2. 2. ? TDD!???????Write/execute tests, specifyacceptance criteria, …2Who speaks… … the Elephant language?Class AextendsMapper…ROI, $$, …apt-getinstall…
  3. 3. 3The road… … to Big Data QAour Big DataQA problemthe FitNesseapproachtest datadefinition /selectionjob & workflowcontrolresultinspection
  4. 4. 4QA problem Web Intelligence @ 1&1DWHHadoop Cluster~ 1 billion logevents / day,~ 1 TB (thrift)logfileschains of MR jobs,running on20 nodes / 8 cores /96 GB RAM (CDH)BI reporting, webanalytics, …
  5. 5. 5QA problem An exemplary workflowLog Files(thrift)Log Files(thrift)Log Files(thrift)Inter-mediateresult(avro)MRjob 1… DWH(RDBMS)MRjob 2create(sample)input datacreate(sample)input data? inspect(binary)formatsinspect(binary)formats? controlworkflowscontrolworkflows?
  6. 6. method tests what? issues for our usecaseJUnit isolated functions no integration, Java syntaxMRUnit 1 mapper + 1 reducer „little“ integration, JavasyntaxiTest hadoop jobs/workflows Java / Groovy syntaxScripts/CLI (manual) scripting/inspect. „script chaos“, syntax6QA problem Existing ApproachesFitNesse as suitable addition / solution!
  7. 7. 7The road… … to Big Data QABig Data QAis different!the FitNesseapproachtest datadefinition /selectionjob & workflowcontrolresultinspection
  8. 8. 8FitNesse In a nutshell„fully integratedstandalone wiki andacceptance testingframework”• „executable“ Wiki-Pages (returning testresults)• (almost) naturallanguage testspecification• connection to SUTvia (Java-)“Fixtures“
  9. 9. 9FitNesse Architecture Overviewscript |check |num results |3 |BrowserFitNesseServerpublic intnumResults{ ... }System under TestFixtures„calling java methods fromwiki“, compare return valuesIntegrates with REST, Jenkins…
  10. 10. 10FitNesse An Exemplary Test
  11. 11. 11FitNesse Exemplary Test Source!path /home/inovex/lib/*.jar| script | Hadoop || upload | viewLog.csv | to hdfs | /testdata/ || hadoop job from jar | viewLog.jar | [...] || show | job output || check | number of output files | 3 |
  12. 12. 12FitNesse Hadoop Fixture Java Codepublic class Hadoop {public boolean uploadToHdfs(String localFile,String remoteFile) {...}public boolean hadoopJobFromJar(String jar,String input, String output) {...}public String jobOutput() {...}public String numberOfOutputFiles() {...}}
  13. 13. 13The road… … to Big Data QABig Data QAis different!Fitnesse Wikitest execution!test datadefinition /selectionjob & workflowcontrolresultinspection
  14. 14. 14Test Data CSV
  15. 15. ‣ Big Data: Efficient data transfer among heterogeneous sources‣ Define Interface via IDL, Compiler for many languages15Test Data Thrift
  16. 16. ‣ Dev/Test Hadoop Cluster: Identical Hardware like Prod, butfewer nodes‣ (random/biased) sampling e.g. on daily basis‣ Feedback loop:‣ identify „special cases“ from real data‣ include them in (manual) data definition‣ Gradually increase test coverage / artefact quality16Test Data Real World Data
  17. 17. 17The road… … to Big Data QABig Data QAis different!FitNesse Wikitest execution!Define CSV /thrift / real-world test data!job & workflowcontrolresultinspection
  18. 18. ‣ Execute arbitrary (shell) commands‣ Mainly a wrapper around apache.commons.exec.CommandLine18Job Control Swiss Army Knife: Shell
  19. 19. ‣ Hide complexity from test authors‣ „define“ appropriate test language via (Java) method names‣ re-use other fixtures (Shell, …) internally19Job Control Hadoop Fixture
  20. 20. ‣ FitNesse allows to group tests into suites‣ Can be used to simulate MR processing chains‣ SetupSuite / TearDownSuite for creating /destroying test conditions‣ Tests can still be executed individually20Job Control Workflows & SuitesMR job 1MR job 2
  21. 21. 21The road… … to Big Data QABig Data QAis different!FitNesse Wikitest execution!Define CSV /thrift / real-world data!Use suites & fixturesfor jobs/workflows!resultinspection
  22. 22. ‣ Validate RDBMS contents (via JDBC)‣ E.g. for checking the final result‣ Or use Hive + Hive-Server to query raw data22Results Data Warehouse / Hive
  23. 23. ‣ Execute arbitrary pig commands from Wiki page‣ Inspect e.g. binary intermediate results (avro, …)23Results Pig
  24. 24. public class PigConsole extends PigServer {public void loadAvroFileUsingAlias(Stringfilename, String alias) {this.registerQuery(alias + "= LOAD" + filename + "USING" +AVRO_STORAGE_LOADER + ";");}}24Results Pig Fixture extends PigServer
  25. 25. 25Results Server InfrastructureFitnesse MasterTestEnvironmentsProjA ProjBTestConfigurationsProjA ProjBdev qs live dev qs liveImport / edittests remotelyQSProjA SlaveDevProjA SlaveLiveProjA SlaveProjAQSProjA SlaveDevProjA SlaveLiveProjA SlaveImport / editconfig remotelydev qs live
  26. 26. 26Thank you! dominik.benz@inovex.deBig Data QAis different!FitNesse Wikitest execution!Define CSV /thrift / real-world data!Inspect resultsvia Pig/HiveUse suites & fixturesfor jobs/workflows!
  27. 27. 27Want more? Inovex trains you!Android Developer Training (3 days, Karlsruhe/München)Hadoop Developer Training (3 days, Karlsruhe/Köln)Certified Scrum Developer Training (5 days, Köln)Pentaho Data Integration Training (4 days, München/Köln)Liferay Portal-Admin Training (3 days, Karlsruhe)Liferay Portal-Developer Training (4 days, Karlsruhe)information and registration atwww.inovex.de/offene-trainings
  28. 28. 28Inovex @bbuzzStefan Kathrin BernhardJörgAndrewChristian Christian

×