Sound Empirical Evidence in Software Testing


In this presentation I show a set of important topics about empirical studies in Software Engineering that can be useful for increasing the quality of theses and monographs in general. Reading it should help you think about how to do good experimentation: stating objectives, choosing validation methods, posing questions and expected answers, and defining and measuring metrics. I also show how the researchers selected their data to avoid choosing case studies in a biased way, using the GQM methodology to organize the study in a simpler view.


Sound Empirical Evidence in Software Testing

  1. Jaguaraci Silva
  2. Today's schedule: 1 - SWE Empirical Studies (Brief Introduction); 2 - Analyzing the Study Using the GQM Methodology; 3 - Discussing Its Implications for SWE Experimentation; 4 - Conclusion
  3. SWE Empirical Studies (Very Brief Introduction)
  4. SWE Empirical Studies (Very Brief Introduction)
     • Scientific method. Scientists develop a theory to explain a phenomenon; they propose a hypothesis and then test alternative variations of the hypothesis. As they do so, they collect data to verify or refute the claims of the hypothesis.
     • Engineering method. Engineers develop and test a solution to a hypothesis. Based upon the results of the test, they improve the solution until it requires no further improvement.
     • Empirical method. A statistical method is proposed as a means to validate a given hypothesis. Unlike the scientific method, there may not be a formal model or theory describing the hypothesis. Data is collected to verify the hypothesis.
     • Analytical method. A formal theory is developed, and results derived from that theory can be compared with empirical observations.
  5. SWE Empirical Studies (Very Brief Introduction): Validation Methods
  6. SWE Empirical Studies (Very Brief Introduction): Validation Methods -> Evaluation
  7. SWE Empirical Studies (Very Brief Introduction): GQM Methodology
  8. From this point on, we will analyze the paper. Any questions?
  9. Project Plan
     • To list the inspected papers and tools together with statistics on their experiments;
     • To make explicit how many of the considered classes are container classes, where this was clearly specified;
     • To list how the evaluation classes were selected.
  10. Project Plan
  11. Defining the Objectives
     • Compare the results in terms of the achieved branch coverage;
     • Account for unsafe operations: I/O interaction with the environment, multi-threading, etc.
  12. What's branch coverage?
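Branch coverage measures the fraction of conditional branch outcomes (each `if` contributes a true outcome and a false outcome) that a test suite exercises. A minimal sketch, with an illustrative class and inputs chosen for this example:

```java
// A hypothetical class under test: each `if` contributes two branches
// (its true outcome and its false outcome), so there are 4 in total.
public class BranchCoverageDemo {

    static String classify(int x) {
        String result = "";
        if (x > 0) {          // branches: (x > 0) true, (x > 0) false
            result += "positive";
        } else {
            result += "non-positive";
        }
        if (x % 2 == 0) {     // branches: (x % 2 == 0) true, false
            result += ",even";
        } else {
            result += ",odd";
        }
        return result;
    }

    public static void main(String[] args) {
        // A suite with inputs {4, -3} exercises all 4 outcomes:
        //   4  -> (x > 0) true,  (x % 2 == 0) true
        //  -3  -> (x > 0) false, (x % 2 == 0) false
        // so branch coverage is 4/4 = 100%.
        // Inputs {2, 4} alone would cover only 2/4 outcomes = 50%.
        System.out.println(classify(4));   // positive,even
        System.out.println(classify(-3));  // non-positive,odd
    }
}
```

At the bytecode level (as EVOSUITE counts it), branches correspond to conditional jump instructions rather than source-level `if` statements, but the idea is the same.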
  13. Defining the Objectives
     • Test generation for such code is unsafe, as the tested code might interact with its environment in undesired ways;
     • To find out how many unsafe operations are attempted during test generation;
     • To discuss the potential implications of the choice of case studies.
  14. Questions
     RQ1: What is the probability distribution of achievable branch coverage on open source software?
     RQ2: How often do classes execute unsafe operations?
     RQ3: What are the consequences of choosing a small case study in a biased way?
  15. Collecting the Data
     • Out of 44 evaluations we considered in our literature survey, 29 selected their case study programs from open source programs;
     • In total, there are 8,784 classes and 291,639 bytecode branches reported by EVOSUITE in this case study.
  16. Collecting the Data
     • We based our selection on the dataset of all projects tagged as being written in the Java programming language;
     • We therefore had to consider 321 projects until we had a set of 100 projects in binary format.
  17. Collecting the Data
  18. Measurement of the Metrics
     • The branch distance estimates how close a branch is to evaluating to true or false for a particular run;
     • For each branch we consider the minimum branch distance over all test cases of a test suite;
     • The overall fitness of a test suite is the sum of these minimal values, such that an individual with 100% branch coverage has fitness 0.
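The branch-distance and suite-fitness computation above can be sketched for a single integer predicate. The class name, the constant `K`, and the predicate `x == 42` are illustrative assumptions; real tools compute distances over all bytecode branches of the class under test:

```java
import java.util.List;

// Illustrative branch-distance calculation for one predicate, `x == 42`.
public class BranchDistanceDemo {
    static final int K = 1; // small constant added when the branch is missed

    // Distance to making `x == 42` evaluate to true: 0 if already true.
    static int distTrue(int x)  { return Math.abs(x - 42); }

    // Distance to making `x == 42` evaluate to false: 0 if already false.
    static int distFalse(int x) { return x != 42 ? 0 : K; }

    // Per the slide: for each branch, take the minimum distance over all
    // test inputs, then sum the minima. Fitness 0 means both outcomes of
    // the predicate are covered.
    static int suiteFitness(List<Integer> inputs) {
        int minTrue = Integer.MAX_VALUE, minFalse = Integer.MAX_VALUE;
        for (int x : inputs) {
            minTrue = Math.min(minTrue, distTrue(x));
            minFalse = Math.min(minFalse, distFalse(x));
        }
        return minTrue + minFalse;
    }

    public static void main(String[] args) {
        System.out.println(suiteFitness(List.of(40, 7))); // 2: true branch missed by 2
        System.out.println(suiteFitness(List.of(42, 7))); // 0: both branches covered
    }
}
```

The key property is that the fitness is finer-grained than coverage alone: a suite that "almost" takes a branch gets a better (lower) score, giving the search a gradient to follow.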
  19. Achieving the Objectives
     • As the context of our experiment we chose the EVOSUITE tool, which automatically generates test suites for Java classes, targeting branch coverage;
     • EVOSUITE uses a genetic algorithm that evolves candidate individuals using operators (e.g., selection, crossover and mutation).
  20. Achieving the Objectives
     • Mutation of test cases may add, remove, or change the method calls in a sequence;
     • Fitness is calculated with respect to branch coverage, using a calculation based on the well-established branch distance measurement.
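The selection/crossover/mutation loop described on these two slides can be sketched as follows. This is a toy illustration, not EVOSUITE's implementation: the class name `TinyGa`, the averaging crossover, the mutation rate, and the target predicate `x == 1000` are all assumptions made for the example (EVOSUITE evolves whole test suites of method-call sequences, not single integers):

```java
import java.util.Random;

// A minimal genetic-algorithm loop: evolve an integer "test input" to
// minimize a branch-distance-style fitness (0 is optimal).
public class TinyGa {
    static final Random RNG = new Random(42); // fixed seed for repeatability

    // Fitness: distance to covering the branch guarded by `x == 1000`.
    static int fitness(int x) { return Math.abs(x - 1000); }

    // Selection: binary tournament, lower fitness wins.
    static int tournament(int[] pop) {
        int a = pop[RNG.nextInt(pop.length)], b = pop[RNG.nextInt(pop.length)];
        return fitness(a) <= fitness(b) ? a : b;
    }

    static int evolve(int populationSize, int generations) {
        int[] pop = new int[populationSize];
        for (int i = 0; i < pop.length; i++) pop[i] = RNG.nextInt(10000);
        for (int g = 0; g < generations; g++) {
            int[] next = new int[populationSize];
            for (int i = 0; i < populationSize; i++) {
                int a = tournament(pop), b = tournament(pop);
                int child = (a + b) / 2;            // crossover: average parents
                if (RNG.nextDouble() < 0.3)         // mutation: small random shift
                    child += RNG.nextInt(21) - 10;
                next[i] = child;
            }
            pop = next;
        }
        int best = pop[0];
        for (int x : pop) if (fitness(x) < fitness(best)) best = x;
        return best;
    }

    public static void main(String[] args) {
        int best = evolve(20, 50);
        System.out.println("best input: " + best + ", fitness: " + fitness(best));
    }
}
```

In EVOSUITE the same skeleton applies, but an individual is a test suite, crossover exchanges test cases between suites, and mutation edits method-call sequences as the slide describes.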
  21. Analyzing the Answers
  22. Analyzing the Answers
  23. Analyzing the Answers
  24. Analyzing the Answers
  25. Analyzing the Answers
     RQ1: This shows that there is a large number of classes which can easily be fully covered by EVOSUITE (coverage 90%-100%), and also a large number of classes with problems (coverage 0%-10%), while the rest is evenly distributed across the 10%-90% range.
     RQ2: Consequently, interactions with the environment are a prime source of problems in achieving coverage. It is striking that 71% of all classes lead to some kind of FilePermission check.
  26. Analyzing the Answers
     RQ3: Our empirical analysis has shown that 90.7% of classes may lead to interactions with their environment. When there are no interactions with the environment, EVOSUITE can achieve an average coverage as high as 90%.
  27. Analyzing the Answers
     To understand the problems in test generation better, we manually inspected:
     • 10 classes that had low coverage but no permission problems;
     • 10 classes that had file permission problems;
     • 10 classes that had network problems;
     • 10 classes with runtime permission problems.
  28. Analyzing the Answers
     Manual inspection: for the 10 classes that had low coverage but no permission problems, we identified the following main reasons for low coverage:
     1) complex string handling;
     2) Java generics and dynamic type handling;
     3) branches in exception-handling code.
     For the other groups, linking up with the environment when testing classes was the major technical obstacle.
  29. Threats to Validity
     • Our study focused on obtaining insights into the challenges of applying test generation tools in realistic settings; our research questions did not deal with comparisons of algorithms, so statistical tests were not required;
     • A possible threat to construct validity is how we evaluated whether there are unsafe operations when testing a class.
  30. Threats to Validity
     • Although we encountered high kurtosis (a measure of the "heaviness" of the tails of a distribution) in the number of classes per project (63) and branches per class (429), median values are not particularly affected by extreme outliers;
     • E.g.: a normal distribution (about 68% of values within 1 std, 95% within 2 std, and 99.7% within 3 std) has an excess kurtosis of 0; when the sample kurtosis deviates above 0, it means the tails are "heavier".
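The kurtosis statistic the slide mentions can be computed directly from central moments. A minimal sketch, using the biased moment estimator without small-sample correction (the class name and example data are assumptions for illustration):

```java
// Sample excess kurtosis: m4 / m2^2 - 3, where m_k is the k-th central
// moment. Roughly 0 for normally distributed data; positive ("heavier
// tails") when extreme outliers are present.
public class KurtosisDemo {
    static double excessKurtosis(double[] xs) {
        double mean = 0;
        for (double x : xs) mean += x;
        mean /= xs.length;
        double m2 = 0, m4 = 0;
        for (double x : xs) {
            double d = x - mean;
            m2 += d * d;
            m4 += d * d * d * d;
        }
        m2 /= xs.length;
        m4 /= xs.length;
        return m4 / (m2 * m2) - 3.0;
    }

    public static void main(String[] args) {
        // Evenly spread data: light tails, negative excess kurtosis.
        double[] uniform = {1, 2, 3, 4, 5, 6, 7, 8, 9, 10};
        // Same data with one extreme outlier: heavy tail, strongly positive.
        double[] heavy = {1, 2, 3, 4, 5, 6, 7, 8, 9, 100};
        System.out.println(excessKurtosis(uniform)); // negative
        System.out.println(excessKurtosis(heavy));   // strongly positive
    }
}
```

This illustrates the slide's argument: a single extreme value inflates kurtosis dramatically, while the median of either array barely moves.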
  31. Threats to Validity
     • To reduce this threat to validity, we used bootstrapping (a method for assigning measures of accuracy to sample estimates) to create confidence intervals for some statistics;
     • Our results might not extend to all open source projects, as other repositories (e.g., Google Code) might contain software with statistically different distribution properties (e.g., number of classes per project, difficulty of the software from the point of view of test data generation).
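A percentile bootstrap, the general technique the slide names, can be sketched as follows. The class name, resample count, seed, and example data are assumptions for illustration, not the paper's actual setup:

```java
import java.util.Arrays;
import java.util.Random;

// Percentile-bootstrap confidence interval for a median: resample the data
// with replacement many times, take the median of each resample, and read
// off the 2.5th and 97.5th percentiles of those medians.
public class BootstrapDemo {
    static double median(double[] xs) {
        double[] s = xs.clone();
        Arrays.sort(s);
        int n = s.length;
        return n % 2 == 1 ? s[n / 2] : (s[n / 2 - 1] + s[n / 2]) / 2.0;
    }

    static double[] bootstrapMedianCi(double[] xs, int resamples, long seed) {
        Random rng = new Random(seed);
        double[] medians = new double[resamples];
        for (int r = 0; r < resamples; r++) {
            double[] sample = new double[xs.length];
            for (int i = 0; i < xs.length; i++)
                sample[i] = xs[rng.nextInt(xs.length)]; // draw with replacement
            medians[r] = median(sample);
        }
        Arrays.sort(medians);
        return new double[] {
            medians[(int) (resamples * 0.025)],   // lower bound of ~95% CI
            medians[(int) (resamples * 0.975)]    // upper bound of ~95% CI
        };
    }

    public static void main(String[] args) {
        // Skewed data in the spirit of "branches per class": mostly small
        // values with a few huge outliers (made-up numbers).
        double[] data = {3, 4, 4, 5, 5, 6, 7, 8, 9, 12, 15, 40, 120, 429};
        double[] ci = bootstrapMedianCi(data, 2000, 42);
        System.out.println("median: " + median(data)
                + ", ~95% CI: [" + ci[0] + ", " + ci[1] + "]");
    }
}
```

Bootstrapping needs no assumption that the data are normal, which is exactly why it suits the heavy-tailed distributions discussed on the previous slide.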
  32. Conclusions
     • There are 23 projects for which EVOSUITE achieves, on average, coverage higher than 80%. If we wanted to boast and promote our research prototype EVOSUITE, we could have carried out an empirical analysis with only those 23 projects;
     • That would still have appeared to be a varied and large empirical analysis. In other words, any case study in which the selection of artifacts is not justified and not done in a systematic way tells very little about the actual performance of the analyzed techniques.
  33. Conclusions
     • As most of the research body in the software testing literature seems to ignore these issues, our empirical analysis is a valuable source of statistically valid information for understanding the real problems that need to be solved by the software testing research community;
     • With this paper, we challenge the research community to develop novel testing techniques that achieve at least 80% bytecode branch coverage on this SF100 corpus, because it is a valid representative of open source projects, and our EVOSUITE prototype only achieved 48% coverage on average.
  34. Thank you very much! Any questions?