Keynote
On the Effectiveness of SBSE Techniques through Instance Space Analysis
Aldeida Aleti
Monash University, Australia
@AldeidaAleti aldeida.aleti@monash.edu
Effectiveness of SBSE - Status Quo
A large focus of SBSE research is on introducing new SBSE approaches
As part of the evaluation process, a set of experiments is usually conducted:
- A benchmark is selected, e.g., Defects4J
- The new approach is compared against the state of the art
- Averages/medians are reported
- Some statistical tests are conducted
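To make this status quo concrete, here is a minimal, hypothetical sketch of such an evaluation in Python: two sets of synthetic coverage results are compared via medians, a Mann–Whitney U test, and the Vargha–Delaney Â12 effect size. The numbers are placeholders, not results from any study cited in this talk.

```python
# Minimal sketch of the status-quo evaluation style described above.
# The coverage values are synthetic placeholders, not results from the talk.
import numpy as np
from scipy.stats import mannwhitneyu

rng = np.random.default_rng(0)
new_approach = rng.uniform(0.60, 0.90, size=30)  # hypothetical branch coverage, 30 runs
baseline = rng.uniform(0.50, 0.85, size=30)

def a12(x, y):
    """Vargha-Delaney A12: probability that a run of x outperforms a run of y."""
    greater = sum(xi > yi for xi in x for yi in y)
    equal = sum(xi == yi for xi in x for yi in y)
    return (greater + 0.5 * equal) / (len(x) * len(y))

stat, p = mannwhitneyu(new_approach, baseline, alternative="two-sided")
print(f"median(new)={np.median(new_approach):.3f}  median(baseline)={np.median(baseline):.3f}")
print(f"Mann-Whitney p={p:.3f}  A12={a12(new_approach, baseline):.2f}")
```

The point of the slides that follow is that this kind of aggregated report says little about which instances each approach wins on, and why.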
Instance Space Analysis
1. To understand and visualise the strengths and weaknesses of different approaches
2. To help with the objective assessment of different approaches
a. Scrutinising how approaches perform under different conditions, and stress testing them
Motivation 1: Are the problem instances adequate?
Problem 1: How were the problem instances selected?
Common benchmark problems are important for fair comparison, but are they
- demonstrably diverse
- unbiased
- representative of a range of real-world contexts
- challenging
- discriminating?
ICSE 2022 review criteria
Motivation 2: Reporting averages/medians obscures important information
A. Perera, A. Aleti, M. Böhme and B. Turhan, "Defect Prediction Guided Search-Based
Software Testing," 2020 35th IEEE/ACM International Conference on Automated Software
Engineering (ASE), 2020, pp. 448-460.
Problem 2: Performance is often problem dependent (NFT)
- What are the strengths and weaknesses of the approaches?
- Which are the problem instances where an approach performs really well, and why?
- Which are the problem instances where an approach struggles, and why?
- How do features of the problem instances affect the performance of the approaches?
- Which features give an algorithm a competitive advantage?
- Given a problem instance with particular features, which approach should I use?
- Which algorithm is suitable for future problems?
Example
Which approach is better? SF110
C. Oliveira, A. Aleti, L. Grunske and K. Smith-Miles, "Mapping the Effectiveness of Automated Test Suite Generation
Techniques," in IEEE Transactions on Reliability, vol. 67, no. 3, pp. 771-785, Sept. 2018, doi: 10.1109/TR.2018.2832072.
Open Questions
● What impacts the effectiveness of SBSE techniques?
○ How can features of problem instances help us infer the strengths and weaknesses of different SBSE approaches?
○ How can we objectively assess different SBSE techniques?
● How easy or hard are existing benchmarks? How diverse are they? Are they biased towards a particular technique?
● Can we select the most suitable SBSE technique given a problem with particular features?
Empirical Review of Program Repair Tools: A Large-Scale Experiment on 2,141 Bugs and 23,551 Repair Attempts. T. Durieux, F. Madeiral, M. Martinez, R. Abreu. ESEC/FSE 2019. doi: 10.1145/3338906.3338911.
ISA
K. Smith-Miles et al., Computers & Operations Research 45 (2014) 12–24
Steps of ISA
1. Create the metadata
a. Features
b. SBSE performances
2. Create instance space
3. Visualise footprints
4. Explain strengths/weaknesses
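A minimal, hypothetical Python sketch of these four steps follows. It uses made-up metadata and PCA as a stand-in for the tailored 2D projection that the ISA toolkit learns; footprints are approximated crudely as the set of instances where an approach reaches a "good" performance threshold.

```python
# Hedged sketch of the ISA steps on hypothetical metadata.
# PCA stands in for the tailored 2D projection used by the real ISA toolkit.
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(1)

# Step 1: metadata = instance features + per-approach performance (all made up).
features = rng.normal(size=(200, 10))              # e.g., 10 code metrics per problem instance
performance = {"WSA": rng.uniform(0.0, 1.0, 200),  # e.g., branch coverage per instance
               "MOSA": rng.uniform(0.0, 1.0, 200)}

# Step 2: create the 2D instance space from the features.
instance_space = PCA(n_components=2).fit_transform(features)

# Steps 3-4: approximate each approach's footprint as the instances where it
# performs well, then inspect (or plot) those regions against the features.
GOOD = 0.8
for name, cov in performance.items():
    footprint = instance_space[cov >= GOOD]
    print(f"{name}: good on {len(footprint)}/{len(instance_space)} instances")
```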
Features (56)
What makes the problem easy or hard?
Problem instances: SF110
Performance measure
● Branch coverage.
● An approach is considered superior if its branch coverage is at least 1% higher than that of the other techniques; otherwise, we use the label “Equal.”
Approaches
● Whole Test Suite with Archive (WSA)
● Many Objective Sorting Algorithm (MOSA)
● Random Testing (RT)
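The labelling rule from the performance measure above can be made precise with a small sketch. This is my interpretation only: "1% higher" is read as one percentage point of branch coverage, and the study's exact tie-breaking may differ.

```python
# Hedged sketch of the performance labelling: an approach is "superior" on an
# instance only if its branch coverage beats every other approach by >= 0.01.
def label_instance(coverage, margin=0.01):
    """coverage: dict mapping approach name -> branch coverage in [0, 1]."""
    best = max(coverage, key=coverage.get)
    rest = [v for k, v in coverage.items() if k != best]
    return best if all(coverage[best] - v >= margin for v in rest) else "Equal"

print(label_instance({"WSA": 0.82, "MOSA": 0.80, "RT": 0.55}))   # -> WSA
print(label_instance({"WSA": 0.82, "MOSA": 0.815, "RT": 0.55}))  # -> Equal
```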
Significant features
● coupling between object classes
○ the number of classes coupled to a given class (via method calls, field accesses, inheritance, arguments, return types, and exceptions)
● response for a class
○ the number of different methods that can be executed when a method is invoked on an object of that class
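As an illustration only, these two metrics can be computed from a toy, hand-made summary of a class; the class contents below are invented, and a real study would extract them with a parser.

```python
# Toy illustration of the two significant features (CK metrics CBO and RFC)
# on an invented per-class summary; no real parsing is involved.
clazz = {
    "declared_methods": {"save", "load", "validate"},
    "coupled_classes": {"Logger", "Database", "IOException"},    # calls, fields, args, returns, exceptions
    "invoked_remote_methods": {"Logger.info", "Database.query"},
}

cbo = len(clazz["coupled_classes"])                                           # coupling between object classes
rfc = len(clazz["declared_methods"]) + len(clazz["invoked_remote_methods"])   # response for a class
print(f"CBO={cbo}, RFC={rfc}")  # CBO=3, RFC=5
```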
SBST Footprints
SBST selection
E-APR
Metadata
Features (146)
Observation-based features (Yu et al. 2019)
Significant Features (9)
(F1) MOA: Measure of Aggregation
(F2) CAM: Cohesion Among Methods
(F3) AMC: Average Method Complexity
(F4) PMC: Private Method Count
(F5) AECSL: Atomic Expression Comparison Same Left indicates the number of statements with a binary expression that have more than an atomic expression (e.g., variable access).
(F6) SPTWNG: Similar Primitive Type With Normal Guard indicates the number of statements that contain a variable (local or global) that is also used in another statement contained inside a guard (i.e., an If condition).
(F7) CVNI: Compatible Variable Not Included is the number of local primitive-type variables within the scope of a statement that involves primitive variables that are not part of that statement.
(F8) VCTC: Variable Compatible Type in Condition measures the number of variables within an If condition that are compatible with another variable in the scope.
(F9) PUIA: Primitive Used In Assignment is the number of primitive variables in assignments.
● Little overlap between IntroClassJava/Defects4J and the other datasets
● Bugs.jar has the most diverse bugs
APR selection
For ISA to reveal useful insights, we need:
● Diverse features
● Diverse instances
● Diverse approaches
● A good performance measure
So what
We have a responsibility to find the weaknesses of the approaches we develop
We need to make sure that the chosen problem instances are demonstrably diverse, unbiased, representative of a range of real-world contexts, challenging, and discriminating of approach performance.
To understand which approach is suitable for future problems, we must understand which features impact its performance.
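One way to operationalise that last point, sketched below under heavy assumptions: train a simple classifier that maps instance features to the best-performing approach, then use it to recommend an approach for a new instance and to inspect which features drive the choice. The features and labels here are entirely synthetic, and this is not the selection model from the talk.

```python
# Hedged sketch of feature-based approach selection on synthetic data.
import numpy as np
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(2)
X = rng.normal(size=(300, 5))                # e.g., CBO, RFC, ... per problem instance
y = np.where(X[:, 0] > 0.5, "MOSA", "WSA")   # invented "best approach" labels

selector = DecisionTreeClassifier(max_depth=3, random_state=0).fit(X, y)
new_instance = rng.normal(size=(1, 5))
print("recommended approach:", selector.predict(new_instance)[0])
print("feature importances:", np.round(selector.feature_importances_, 2))
```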
