On Parameter Tuning in Search-Based
Software Engineering:
A Replicated Empirical Study
Abdel Salam Sayyad
Katerina Goseva-Popstojanova
Tim Menzies
Hany Ammar
West Virginia University, USA
International Workshop on Replication in Software
Engineering Research (RESER)
Oct 9, 2013
Sound bites
Search-based Software Engineering
Is here… to stay.
A helper… Not an alternative to human SE

Randomness…
is an essential part of Search Algorithms
… hence the need for statistical examination (A lot to learn from Empirical SE)

Parameter Tuning
A real problem…
Default values (rules of thumb) do exist… and (sadly?) they are being followed

Default parameter values fail to optimize performance…
… As seen in the original study, and in this replication…
No Free Lunch Theorems for Optimization [Wolpert and Macready '97]:
the same parameter values don't optimize all algorithms for all problems.
2
Roadmap

① Randomness of Search
② The original study
③ The replication
④ Conclusion
Roadmap

① Randomness of Search
② The original study
③ The replication
④ Conclusion
Searching for what?
• Correct solutions…
– Conform to system relationships and constraints.

• Optimal solutions…
– Achieve user objectives/preferences…

• Complex problems have big search spaces
– Exhaustive search is not practical.
5
Genetic Algorithm
• Start with a large population of candidate solutions… (How large?)
• Evaluate the fitness of your solutions.
• Let your candidate solutions crossover – exchange genes… (How often?)
• Mutate a small portion of your solutions. (How small?)
• How do those choices affect performance? (A minimal sketch of these knobs follows below.)
6
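To make those knobs concrete, here is a minimal generational GA sketch in Python (illustrative only, not the algorithms used in this study); POP_SIZE, CROSSOVER_RATE and MUTATION_RATE are exactly the "how large / how often / how small" choices asked above:

```python
import random

# Illustrative knobs -- the questions on this slide ("How large? How often? How small?")
POP_SIZE = 100        # How large a population?
CROSSOVER_RATE = 0.8  # How often do parents exchange genes?
MUTATION_RATE = 0.01  # How small a portion gets mutated?
GENOME_LEN = 50
GENERATIONS = 100

def fitness(genome):
    # Toy fitness: count of 1-bits (stand-in for a real objective).
    return sum(genome)

def crossover(a, b):
    # Single-point crossover with probability CROSSOVER_RATE.
    if random.random() < CROSSOVER_RATE:
        point = random.randrange(1, GENOME_LEN)
        return a[:point] + b[point:]
    return a[:]

def mutate(genome):
    # Flip each bit independently with probability MUTATION_RATE.
    return [1 - g if random.random() < MUTATION_RATE else g for g in genome]

population = [[random.randint(0, 1) for _ in range(GENOME_LEN)] for _ in range(POP_SIZE)]
for _ in range(GENERATIONS):
    population.sort(key=fitness, reverse=True)
    parents = population[:POP_SIZE // 2]   # simple truncation selection
    population = [mutate(crossover(random.choice(parents), random.choice(parents)))
                  for _ in range(POP_SIZE)]

print(max(fitness(g) for g in population))
```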
Multi-objective Optimization

[Figure: a Pareto front of non-dominated solutions; higher-level decision making then picks the chosen solution]

7
Survival of the fittest
(according to NSGA-II [Deb et al. 2002])
Boolean dominance (x dominates y, or it does not):
- In no objective is x worse than y
- In at least one objective, x is better than y

Crowd pruning (crowding distance used as the tie-breaker among non-dominated solutions)

8
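The Boolean dominance test above is a one-liner in code; a minimal sketch, assuming all objectives are maximized:

```python
def dominates(x, y):
    """True if x Pareto-dominates y (all objectives maximized):
    x is no worse than y in every objective and strictly better in at least one."""
    no_worse = all(xi >= yi for xi, yi in zip(x, y))
    strictly_better = any(xi > yi for xi, yi in zip(x, y))
    return no_worse and strictly_better

# Example: x dominates y, but neither x nor z dominates the other.
x, y, z = (3, 5, 2), (3, 4, 1), (4, 1, 2)
print(dominates(x, y))                   # True
print(dominates(x, z), dominates(z, x))  # False False
```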
Indicator-Based Evolutionary Algorithm (IBEA) [Zitzler and Künzli '04]
1) For {old generation + new generation} do:
– Add up every individual's amount of dominance with respect to everyone else
– Sort all instances by F
– Delete worst, recalculate, delete worst, recalculate, …

2) Then, standard GA (crossover, mutation) on the survivors → create a new generation → back to 1.
9
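A rough sketch of that "delete worst, recalculate" loop, using the additive epsilon indicator as the continuous measure of dominance (KAPPA and the toy population are assumptions; real IBEA also normalizes objectives and indicator values):

```python
import math

KAPPA = 0.05  # scaling constant (assumed value)

def eps_indicator(a, b):
    # Additive epsilon indicator for minimization:
    # smallest shift needed for a to weakly dominate b.
    return max(ai - bi for ai, bi in zip(a, b))

def ibea_fitness(pop):
    # F(x) aggregates how strongly every other individual dominates x;
    # more negative F = more dominated = worse.
    fit = []
    for i, x in enumerate(pop):
        fit.append(sum(-math.exp(-eps_indicator(y, x) / KAPPA)
                       for j, y in enumerate(pop) if j != i))
    return fit

def environmental_selection(pop, keep):
    # Step 1 on the slide: delete worst, recalculate, delete worst, ...
    pop = list(pop)
    while len(pop) > keep:
        f = ibea_fitness(pop)
        pop.pop(f.index(min(f)))  # smallest fitness = most dominated
    return pop

# Toy 2-objective (minimization) population: merged old + new generation.
merged = [(1.0, 4.0), (2.0, 2.0), (4.0, 1.0), (3.0, 3.5), (5.0, 5.0)]
print(environmental_selection(merged, keep=3))
```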
NSGA-II… the default algorithm
• Much prior work in SBSE (*) used NSGA-II…
• … but didn't state why!

(*) Sayyad and Ammar, RAISE'13

10
Roadmap

① Randomness of Search
② The original study
③ The replication
④ Conclusion
The Original Study
• A. Arcuri and G. Fraser, "On Parameter Tuning in Search
Based Software Engineering," in Proc. SSBSE, 2011, pp.
33-47.
• A. Arcuri and G. Fraser, "Parameter Tuning or Default
Values? An Empirical Investigation in Search-Based
Software Engineering," Empirical Software Engineering,
Feb 2013.

• Problem: generating test vectors for object-oriented software.
• Fitness function: percentage of test coverage.
12
Results of original study
• Different parameter settings cause very large
variance in the performance.
• Default parameter settings perform relatively well,
but are far from optimal on individual problem
instances.

13
Roadmap

① Randomness of Search
② The original study
③ The replication
④ Conclusion
Feature-oriented domain analysis [Kang 1990]
• Feature models = a lightweight method for defining a space of options
• De facto standard for modeling variability, e.g. Software Product Lines (a toy encoding is sketched below)

[Figure: example feature model with its cross-tree constraints]
15
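A toy illustration of encoding a product over a feature model (hypothetical feature names, not the models used in the study): a selection of features plus Boolean checks for tree and cross-tree constraints.

```python
# Hypothetical toy feature model: a product is a mapping feature -> selected?
FEATURES = ["root", "gui", "db", "db_mysql", "db_sqlite", "encryption"]

def violations(sel):
    """Count violated constraints for a candidate product (0 = conforming)."""
    v = 0
    v += 0 if sel["root"] else 1                                  # root is mandatory
    v += 0 if (not sel["db_mysql"] or sel["db"]) else 1           # child requires parent
    v += 0 if (not sel["db_sqlite"] or sel["db"]) else 1
    v += 0 if not (sel["db_mysql"] and sel["db_sqlite"]) else 1   # alternative (xor) group
    v += 0 if (not sel["gui"] or sel["encryption"]) else 1        # cross-tree: gui requires encryption
    return v

product = dict.fromkeys(FEATURES, False)
product.update(root=True, gui=True, encryption=True, db=True, db_mysql=True)
print(violations(product))  # 0 -> a conforming product
```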
What are the user preferences?
• Suppose each feature had the following metrics:
1. Boolean USED_BEFORE?
2. Integer DEFECTS
3. Real COST
• Show me the space of "best options" according to the objectives:
1. Satisfy the most domain constraints (0 ≤ #violations ≤ 100%)
2. Offer the most features
3. Maximize overall features that were used before (promote re-use)
4. Minimize overall known defects
5. Minimize cost

16
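A sketch of how the five objectives above could be scored for one candidate product (the per-feature metrics and helper names are hypothetical, not the study's implementation); all values are written so that bigger is better:

```python
# Hypothetical per-feature metrics: (used_before, defects, cost)
METRICS = {
    "gui":        (True,  2, 10.0),
    "db":         (False, 5, 25.0),
    "encryption": (True,  1, 15.0),
}

def objectives(selected, num_violations):
    """Return the 5 objective values for a candidate selection of features.
    All are expressed as values to maximize (the 'minimize' ones are negated)."""
    chosen = [f for f in selected if selected[f]]
    return (
        -num_violations,                          # 1. satisfy constraints
        len(chosen),                              # 2. offer the most features
        sum(1 for f in chosen if METRICS[f][0]),  # 3. features used before
        -sum(METRICS[f][1] for f in chosen),      # 4. known defects
        -sum(METRICS[f][2] for f in chosen),      # 5. cost
    )

sel = {"gui": True, "db": False, "encryption": True}
print(objectives(sel, num_violations=0))  # (0, 2, 2, -3, -25.0)
```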
Previous Work [Sayyad et al. ICSE'13]
• IBEA (continuous dominance criterion) beats NSGA-II and a host of other algorithms based on a Boolean dominance criterion.
• Especially with a high number of objectives.
• Quality indicators:
– Percentage of conforming (useable) solutions
• We’re interested in 100% conforming solutions.

– Hypervolume (how close to optimal?)
– Spread (how diverse?)

17
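For intuition about the hypervolume indicator, a minimal 2-objective version (maximization, reference point at the origin); studies with more objectives use general n-dimensional implementations:

```python
def hypervolume_2d(front, ref=(0.0, 0.0)):
    """Area dominated by a 2-objective front (both objectives maximized)
    relative to a reference point -- bigger means closer to optimal."""
    # Sort by the first objective descending; sweep and add rectangles.
    pts = sorted(front, key=lambda p: p[0], reverse=True)
    area, prev_y = 0.0, ref[1]
    for x, y in pts:
        if y > prev_y:
            area += (x - ref[0]) * (y - prev_y)
            prev_y = y
    return area

# Example front (already non-dominated):
print(hypervolume_2d([(4.0, 1.0), (3.0, 2.0), (1.0, 3.0)]))  # 4*1 + 3*1 + 1*1 = 8.0
```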
Setup

18
What are “default settings”?
• Population size = 100
• Crossover rate = 80%
– Rule of thumb: 60% < crossover rate < 90% [A. E. Eiben and J. E. Smith, Introduction to Evolutionary Computing. Springer, 2003]
• Mutation rate = 1/(number of features)
– i.e. on average, one bit flipped out of the whole string
19
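The same defaults written as a small configuration sketch (illustrative only, not the experiment's actual harness); NUM_FEATURES stands for the length of the feature-selection bit string:

```python
NUM_FEATURES = 300  # illustrative; set to the size of the feature model at hand

DEFAULT_SETTINGS = {
    "population_size": 100,
    "crossover_rate": 0.80,              # rule of thumb: 0.60 < rate < 0.90 (Eiben & Smith 2003)
    "mutation_rate": 1.0 / NUM_FEATURES, # on average, one bit flipped per candidate
}

print(DEFAULT_SETTINGS)
```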
Research Questions

20
Results [10 sec / algorithm / FM]

21
Answer to RQ1
• RQ1: How Large is the Potential Impact of a
Wrong Choice of Parameter Settings?
• We confirm Arcuri and Fraser’s conclusion:
“Different parameter settings cause very large
variance in the performance.”

22
Answer to RQ2
• RQ2: How Does a “Default” Setting Compare to the
Best and Worst Achievable Performance?
• Arcuri and Fraser concluded that: “Default parameter
settings perform relatively well, but are far from
optimal on individual problem instances.”
• We make a stronger conclusion: “Default parameter
settings perform generally poorly, but might perform
relatively well on individual problem instances.”
23
Answer to RQ3
• RQ3: How does the performance of IBEA’s
best tuning compare to NSGA-II’s best
tuning?

• Our results show that “IBEA’s best tuning
performs generally much better than NSGA-II’s
best tuning.”

24
RQ4: Parameter Training
• If we find the best tuning for a group of problem instances and apply it to a new problem instance, will it also be the best tuning for the new problem?
• Arcuri and Fraser concluded that: "Tuning should be done on a very large sample of problem instances. Otherwise, the obtained parameter settings are likely to be worse than arbitrary default values."
• Our conclusion: "Tuning on a sample of problem instances does not, in general, result in the best parameter values for a new problem instance, but the obtained settings are generally better than the default settings."
25
Roadmap

① Randomness of Search
② The original study
③ The replication
④ Conclusion
Conclusion
• Default parameter values fail to optimize performance…
• And, sadly, many SBSE researchers choose "default" algorithms (e.g. NSGA-II) along with "default" parameters.
• Alternatives? A long way to go!
– Parameter control
– Adaptive parameter control

Acknowledgment
This research work was funded by the Qatar National Research Fund under the National Priorities Research Program.
27
