Search-Based Software Engineering is now a mature area with numerous techniques developed to tackle some of the most challenging software engineering problems, from requirements to design, testing, fault localisation, and automated program repair. SBSE techniques have shown promising results, giving us hope that one day it will be possible for the tedious and labour-intensive parts of software development to be completely automated, or at least semi-automated. In this talk, I will focus on the problem of objective performance evaluation of SBSE techniques. To this end, I will introduce Instance Space Analysis (ISA), an approach that identifies features of SBSE problems which explain why a particular instance is difficult for an SBSE technique. ISA can be used to examine the diversity and quality of the benchmark datasets used by most researchers, and to analyse the strengths and weaknesses of existing SBSE techniques. The instance space is constructed to reveal areas of hard and easy problems, enabling the strengths and weaknesses of the different SBSE techniques to be identified. I will present how ISA enabled us to identify the strengths and weaknesses of SBSE techniques in two areas: Search-Based Software Testing and Automated Program Repair. Finally, I will conclude with future directions for the objective assessment of SBSE techniques.
Instance Space Analysis for Search Based Software Engineering
1. Keynote
On the Effectiveness of SBSE
Techniques through Instance
Space Analysis
Aldeida Aleti
Monash University, Australia
@AldeidaAleti aldeida.aleti@monash.edu
2. Effectiveness of SBSE - Status Quo
A large focus of SBSE research is on introducing new SBSE approaches
As part of the evaluation process, a set of experiments is usually conducted:
- A benchmark is selected, e.g., Defects4J
- The new approach is compared against the state of the art
- Averages/medians are reported
- Some statistical tests are conducted
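The statistical step above can be sketched as follows. A common choice in SBSE evaluations is the Vargha-Delaney A12 effect size, implemented here from its standard definition; the coverage numbers are invented for illustration.

```python
# Sketch: comparing two SBSE techniques over repeated runs using the
# Vargha-Delaney A12 effect size. The run results below are made up.

def a12(xs, ys):
    """Probability that a run of technique X beats a run of technique Y.
    0.5 means no difference; values above 0.5 favour X."""
    wins = sum(1.0 for x in xs for y in ys if x > y)
    ties = sum(0.5 for x in xs for y in ys if x == y)
    return (wins + ties) / (len(xs) * len(ys))

new_approach = [0.81, 0.84, 0.79, 0.86, 0.82]   # branch coverage per run
state_of_art = [0.78, 0.80, 0.77, 0.83, 0.79]

print(f"A12 = {a12(new_approach, state_of_art):.2f}")
```

The slide's point still stands: a single effect size or median summarises *overall* performance but hides on which instances each technique wins, which is what ISA recovers.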
3. Instance Space Analysis
1. to understand and visualise the strengths and weaknesses of different approaches
2. to help with the objective assessment of different approaches
a. Scrutinising how approaches perform under different conditions, and stress testing them
5. Problem 1: How were the problem instances selected?
Common benchmark problems are important for fair comparison, but are they
- demonstrably diverse
- unbiased
- representative of a range of real-world contexts
- challenging
- discriminating
7. Motivation 2: Reporting averages/medians obscures important information
A. Perera, A. Aleti, M. Böhme and B. Turhan, "Defect Prediction Guided Search-Based Software Testing," 2020 35th IEEE/ACM International Conference on Automated Software Engineering (ASE), 2020, pp. 448-460.
8. Problem 2: Performance is often problem-dependent (NFT)
- What are the strengths and weaknesses of the approaches?
- On which problem instances does an approach perform really well, and why?
- On which problem instances does an approach struggle, and why?
- How do features of the problem instances affect the performance of the approaches?
- Which features give an algorithm a competitive advantage?
- Given a problem instance with particular features, which approach should I use? Which algorithm is suitable for future problems?
9. Example
Which approach is better on the SF110 benchmark?
C. Oliveira, A. Aleti, L. Grunske and K. Smith-Miles, "Mapping the Effectiveness of Automated Test Suite Generation Techniques," in IEEE Transactions on Reliability, vol. 67, no. 3, pp. 771-785, Sept. 2018, doi: 10.1109/TR.2018.2832072.
11. Open Questions
● What impacts the effectiveness of SBSE techniques?
○ How can features of problem instances help us infer the strengths and weaknesses of different SBSE approaches?
○ How can we objectively assess different SBSE techniques?
● How easy or hard are existing benchmarks? How diverse are they? Are they biased towards a particular technique?
● Can we select the most suitable SBSE technique given a problem with particular features?
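The last question amounts to algorithm selection over instance features. A minimal sketch, assuming we already have instances with known winners: recommend the technique that won on the most similar known instance (nearest neighbour). The feature values and labels below are invented for illustration.

```python
# Sketch: feature-based technique selection via nearest-neighbour lookup.
# Features here are (coupling between object classes, response for a class);
# all values and winner labels are hypothetical.
import math

known = [
    ((2.0, 10.0), "RT"),    # simple class: random testing was enough
    ((8.0, 45.0), "MOSA"),  # highly coupled class: MOSA won
    ((5.0, 30.0), "WSA"),   # mid-range class: WSA won
]

def recommend(features):
    """Return the technique that won on the most similar known instance."""
    _, label = min(known, key=lambda kv: math.dist(kv[0], features))
    return label

print(recommend((7.0, 40.0)))
```

A real selector would use the 2D instance space that ISA constructs rather than raw features, but the principle is the same: match the new instance to the footprint it falls in.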
12. T. Durieux, F. Madeiral, M. Martinez and R. Abreu, "Empirical Review of Program Repair Tools: A Large-Scale Experiment on 2,141 Bugs and 23,551 Repair Attempts," ESEC/FSE Foundations of Software Engineering, 2019, doi: 10.1145/3338906.3338911.
14. Steps of ISA
1. Create the metadata
a. Features
b. SBSE performances
2. Create instance space
3. Visualise footprints
4. Explain strengths/weaknesses
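The first three steps can be sketched in a few lines. ISA learns an optimised linear projection into 2D; as a stand-in, this sketch standardises two features directly, and all feature and coverage values are invented.

```python
# Sketch of ISA steps 1-3: build metadata (features + per-technique
# performance), project instances to 2D, and label each point with its
# best technique so footprints can be visualised.

instances = [  # (feature vector, {technique: branch coverage}), all invented
    ([3.0, 12.0], {"WSA": 0.85, "MOSA": 0.80, "RT": 0.60}),
    ([9.0, 50.0], {"WSA": 0.55, "MOSA": 0.70, "RT": 0.30}),
    ([6.0, 28.0], {"WSA": 0.72, "MOSA": 0.71, "RT": 0.50}),
]

def standardise(col):
    """Zero-mean, unit-variance scaling of one feature column."""
    mean = sum(col) / len(col)
    sd = (sum((v - mean) ** 2 for v in col) / len(col)) ** 0.5
    return [(v - mean) / sd for v in col]

xs = standardise([f[0] for f, _ in instances])
ys = standardise([f[1] for f, _ in instances])
labels = [max(perf, key=perf.get) for _, perf in instances]

for x, y, lab in zip(xs, ys, labels):
    print(f"({x:+.2f}, {y:+.2f}) -> {lab}")
```

Plotting these labelled points (step 3) reveals the regions, or footprints, where each technique dominates; step 4 then relates those regions back to the features.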
18. Performance measure
● Branch coverage.
● An approach is considered superior if its branch coverage is at least 1% higher than
the other techniques; otherwise, we use the label “Equal.”
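The labelling rule above can be made concrete as follows; the technique names match the next slide, and the coverage values are invented.

```python
# Sketch of the slide's labelling rule: a technique "wins" an instance only
# if its branch coverage beats every other technique by at least one
# percentage point; otherwise the instance is labelled "Equal".

def label(coverage, margin=0.01):
    """coverage: {technique: branch coverage in [0, 1]}."""
    best = max(coverage, key=coverage.get)
    others = [v for k, v in coverage.items() if k != best]
    if all(coverage[best] - v >= margin for v in others):
        return best
    return "Equal"

print(label({"WSA": 0.82, "MOSA": 0.80, "RT": 0.55}))   # clear WSA win
print(label({"WSA": 0.82, "MOSA": 0.815, "RT": 0.55}))  # too close: Equal
```

The margin guards against declaring a winner on differences too small to matter in practice.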
19. Approaches
● Whole Test Suite with Archive (WSA)
● Many Objective Sorting Algorithm (MOSA)
● Random Testing (RT)
20. Significant features
● coupling between object classes
○ the number of classes coupled to a given class (via method calls, field accesses, inheritance, arguments, return types, and exceptions)
● response for a class
○ the number of different methods that can be executed when a method is invoked for an object of that class
30. (F1) MOA: Measure of Aggregation
(F2) CAM: Cohesion Among Methods
(F3) AMC: Average Method Complexity
(F4) PMC: Private Method Count
(F5) AECSL: Atomic Expression Comparison Same Left indicates the number of statements with a binary expression that contains more than one atomic expression (e.g., a variable access).
(F6) SPTWNG: Similar Primitive Type With Normal Guard indicates the number of statements that contain a variable (local or global) that is also used in another statement contained inside a guard (i.e., an if condition).
(F7) CVNI: Compatible Variable Not Included is the number of local primitive-type variables that are within the scope of a statement involving primitive variables but are not part of that statement.
(F8) VCTC: Variable Compatible Type in Condition measures the number of variables within an if condition that are compatible with another variable in the scope.
(F9) PUIA: Primitive Used In Assignment is the number of primitive variables in assignments.
32. ● Little overlap between IntroClassJava/Defects4J and the other datasets
● Bugs.jar has the most diverse bugs
34. For ISA to reveal useful insights
● Diverse features
● Diverse instances
● Diverse approaches
● A good performance measure
35. So What?
We have a responsibility to find the weaknesses of the approaches we develop.
We need to make sure that the chosen problem instances are demonstrably diverse, unbiased, representative of a range of real-world contexts, challenging, and discriminating with respect to approach performance.
To understand which approach is suitable for future problems, we must understand which features impact its performance.