Viktor Kajml diploma thesis

343 views
240 views

Published on

Published in: Education
0 Comments
0 Likes
Statistics
Notes
  • Be the first to comment

  • Be the first to like this

No Downloads
Views
Total views
343
On SlideShare
0
From Embeds
0
Number of Embeds
1
Actions
Shares
0
Downloads
2
Comments
0
Likes
0
Embeds 0
No embeds

No notes for slide

Viktor Kajml diploma thesis

  1. 1. Czech Technical University in Prague Faculty of Electrical Engineering DIPLOMA THESIS Bc. Viktor Kajml Black box optimization: Restarting versus MetaMax algorithm Department of Cybernetics Project supervisor: Ing. Petr Posik, Ph.D. Prague, 2014
  2. 2. Abstrakt Tato diplomová práce se zabývá vyhodnocením nového perspektivního optimaliza£ního algoritmu, nazvaného MetaMax. Hlavním cílem je zhodnotit vhodnost jeho pouºití pro °e²ení problém· optimalizace £erné sk°í¬ky se spojitými parametry, obzvlá²t¥ v porovnání s ostatními metodami b¥ºn¥ pouºívanými v této oblasti. Za tímto ú£elem je MetaMax a vybrané tradi£ní restartovací strategie, podrobn¥ otestován na rozsáhlé sad¥ srovnávacích funkcí, za pouºití r·zných algoritm· lokálního prohledávání. Takto nam¥°ené výsledky jsou poté porovnány a vyhodnoceny. Druhotným cílem je navrhnout a implementovat modikace algoritmu MetaMax v jistých oblastech, kde je prostor pro zlep²ení jeho výkon·. Abstract This diploma thesis is focused on evaluating a new promising multi-start optimization algorithm called MetaMax. The main goal is to assess its utility it in the area of black-box continuous parameter optimization, especially in comparison with other strategies commonly used in this area. To achieve this, MetaMax and a selection of traditional restart strategies are thoroughly tested on a large set of benchmark problems and using multiple dierent local search algorithms. Their results are then compared and evaluated. An additional goal is to suggest and implement modications of the MetaMax algorithm, in certain areas where it seems that there could be a potential room for improvement.
  3. 3. I would like to thank: Mr. Petr Po²ík for his help on this thesis The Centre of Machine perception at the Czech Technical University in Prague for providing me with access to their computer grid My friends and family for their support
  4. 4. Contents 1 Introduction 1 2 Problem description and related work 3 2.1 Local search algorithms . . . . . . . . . . . . . . . . . . . . . . . . . . 3 2.2 Multi-start strategies . . . . . . . . . . . . . . . . . . . . . . . . . . . 6 3 MetaMax algorithm and its variants 3.1 4 Suggested modications . . . . . . . . . . . . . . . . . . . . . . . . . Experimental setup 9 13 16 4.1 18 Used metrics . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 22 4.3 5 Used multi-start strategies . . . . . . . . . . . . . . . . . . . . . . . . 4.2 Implementation details . . . . . . . . . . . . . . . . . . . . . . . . . . 24 Results 25 5.1 25 Nelder-Mead method . . . . . . . . . . . . . . . . . . . . . . . . . . . 31 5.3 BFGS 36 5.4 CMA-ES . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 41 5.5 6 Compass search . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 5.2 Additional results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 46 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Conclusion 48 A Used local search algorithms 51 A.1 Compass search . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 51 A.2 Nelder-Mead algorithm . . . . . . . . . . . . . . . . . . . . . . . . . . 51 A.3 BFGS . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 54 A.4 CMA-ES . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 54 B CD contents 56 C Acknowledgements 56 i
  5. 5. List of Tables 1 Benchmark function groups . . . . . . . . . . . . . . . . . . . . . . . 17 2 Algorithm specic restart strategies . . . . . . . . . . . . . . . . . . . 20 3 Tested multi-start strategies . . . . . . . . . . . . . . . . . . . . . . . 21 4 Compass search - best restart strategies for each dimensionality . . . 26 5 Compass search - results of restart strategies . . . . . . . . . . . . . . 26 6 Compass search - results of MetaMax(k) and corresponding xed restart strategies . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 28 7 Compass search - results of MetaMax strategies . . . . . . . . . . . . 30 8 Nelder-Mead - best restart strategies for each dimensionality . . . . . 31 9 Nelder-Mead - results of restart strategies . . . . . . . . . . . . . . . . 32 10 Nelder-Mead - results of MetaMax(k) and corresponding xed restart strategies . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 34 11 Nelder-Mead - results of MetaMax strategies . . . . . . . . . . . . . . 35 12 BFGS - best restart strategies for each dimensionality . . . . . . . . . 36 13 BFGS - results of restart strategies 37 14 BFGS - results of MetaMax(k) and corresponding xed restart strategies 38 15 BFGS - results of MetaMax strategies . . . . . . . . . . . . . . . . . . 40 16 CMA-ES - best restart strategies for each dimensionality . . . . . . . 41 17 CMA-ES - results of restart strategies . . . . . . . . . . . . . . . . . . 42 18 CMA-ES - results of MetaMax(k) and corresponding xed restart strategies . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 43 19 CMA-ES - results of MetaMax strategies . . . . . . . . . . . . . . . . 45 20 CD contents . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 56 ii
  6. 6. List of Figures 1 Restart condition based on function value stagnation 2 Example of monotone transformation of 3 MetaMax selection mechanisms 4 Example ECDF graph 5 Compass search - ECDF comparing MetaMax(k) with an equivalent . . . . . . . . . . . . . . . . . . . . . . . . . . . 27 6 Compass search - ECDF of MetaMax variants using 100 instances . . 29 7 Compass search - ECDF of MetaMax variants using 50d instances . . 31 8 Nelder-Mead - ECDF comparing MetaMax(k) strategies . . . . . . . 32 9 BFGS - ECDF of the best restart strategies . . . . . . . . . . . . . . 37 10 BFGS - ECDF of MetaMax variants using 50d instances . . . . . . . 39 11 CMA-ES - ECDF of function value stagnation based restart strategies 41 12 CMA-ES - ECDF comparison of MetaMax variants using 50d instances 44 13 MetaMax timing measurements 14 ECDF comparing MetaMax strategies using dierent instance selec- xed restart strategy tion methods 15 . . . . . . . . . 7 . . . . . . . . . . . . . . 15 . . . . . . . . . . . . . . . . . . . . . 16 . . . . . . . . . . . . . . . . . . . . . . . . . . 24 f (x) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Nelder-Mead algorithm in 2D 47 48 . . . . . . . . . . . . . . . . . . . . . . 52 1 Typical structure of a local search algorithm . . . . . . . . . . . . . . . 2 2 Variable neighbourhood search . . . . . . . . . . . . . . . . . . . . . . 9 3 MetaMax(k) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 11 4 MetaMax(∞) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 12 5 MetaMax . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 13 6 Compass search . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 51 7 Nelder-Mead method . . . . . . . . . . . . . . . . . . . . . . . . . . . . 53 8 BFGS algorithm . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 54 9 CMA-ES algorithm . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 55 List of Algorithms iii
  7. 7. 1 Introduction The goal of this thesis is to implement and evaluate the performance of the MetaMax optimization algorithm, particularly in comparison with other commonly used optimization strategies. MetaMax was proposed by György and Kocsis in [GK11] and the results they present seem very interesting and suggest that MetaMax might be a very competitive algorithm. Our goal is to more closely evaluate its performance on problems from the area of black-box continuous optimization, by performing a series of exhaustive measurements and comparing the results with those of several commonly used restart strategies. This text is organized as follows: Firstly there is a short overview of the subjects of mathematical, continuous and black-box optimization, local search algorithms and multi-start strategies. This is meant as an introduction for readers who might not be familiar with these topics. Readers who already have knowledge of these elds might wish to skip forward to the following sections, which describe the MetaMax algorithm, the experimental setup, used optimization strategies and the software implementation. In the last two sections, the measured results are summed up and evaluated. The mathematical optimization problem is dened as selecting the best element, according to some criteria, from a set of feasible elements. Most common form of x1,opt , . . . , xd,opt , where d is the problem dimension, for which a value of a given objective function f (x1 , . . . , xd ) is minimal, that is f (x1,opt , . . . , xd,opt ) ≤ f (x1 , . . . , xd ) for all possible values of x1 , . . . , xd . the problem is nding a set of parameters Within this eld of mathematical optimization, it is possible to dene several subelds based on the properties of the parameters information available about the objective function x1 , . . . , x d and the amount of f. The set of all possible solutions (possible combid nations of the parameter values) is nite. Usually some subset of N . Combinatorial optimization: Integer programming: All of the parameters are restricted to be integers: x1 , . . . , xd ∈ N. Can be considered to be a subset of combinatorial optimization. Mixed integer programming: Some parameters are real-valued and some are integers. Continuous optimization: The set of all possible solutions is innite. Usually x1 , . . . , x d ∈ R . black-box optimization: about f Assumes that only a bare minimum of information is given. It can evaluated at an arbitrary point tion value f (x), but besides that, no other properties of x, returning the funcf are known. In order to solve this kind of problems, we have to resolve to searching (The exact techniques are be described in more detail later in this text). Furthermore we are almost never guaranteed to nd the exact solution, just one that is suciently close to it, and there is almost always a non-zero probability that even an approximate solution might not be found at all. 1
  8. 8. White box optimization knowledge about f, deals with problems where we have some additional for example its gradient, which can obviously be very useful when looking for its minimum. In this text we will almost exclusively with black-box continuous optimization problems. For a practical example of a black-box optimization problem, imagine the process of trying to design an airfoil, which should have certain desired properties. It is possible to describe the airfoil by a vector of variables representing its various parameters - length, thickness, shape, etc. This will be the parameter vector run aerodynamic simulation with the airfoil described by x, x. Then, we can evaluate how closely it matches the desired properties, and based on that, assign a function value f (x) the parameter vector. In this way, the simulator becomes the black-box function and the problem is transformed into task of minimizing the objective function We can then use black-box optimization methods to nd the parameter vector to f f. xopt which will give us an airfoil with the desired properties. This example hopefully suciently illustrates the fact that, black-box optimization can be a very powerful tool, as it allows us to nd reasonably good solutions even for problems which we might not be able to, or would not know how to, solve otherwise. As already mentioned, the usual method for nding optima (the best possible set of parameters xopt ) in continuous mathematical optimization is searching. The structure of a typical local search algorithm is as follows: Algorithm 1: Typical structure of a local search algorithm 1 Select a starting solution x0 somehow (most commonly randomly) from the set of feasible solutions. 2 3 4 5 6 7 8 9 10 11 12 xc ← x0 f (xc ). Set current solution: Get function value while Stop condition not met do Generate a set of neighbour solutions Evaluate f at each Xn similar to xc xn ∈ X n Find the best neighbour solution ∗ if f (x ) f (xc ) then Update the current solution x∗ = argmaxxn ∈Xn f (x) xc ← x∗ else Modify the way of generating neighbour solutions return xc In the case of continuous optimization, a solution is represented simply by a d point in R . There are various ways of generating neighbour solutions. In general, two neighbouring solutions should be dierent from each other, but in some sense also similar. In continuous optimization, this usually means that the solutions are close in terms of Euclidean distance, but not identical. 2
  9. 9. The algorithm described above has the property that it always creates neighbour solutions close to the current solution and moves the current solution in the direction of decreasing f (x). This makes it a greedy algorithm, which works well in cases where the objective function is unimodal (has only one optimum), but for multimodal functions (functions with multiple local optima), the resulting behaviour will not be ideal. The algorithm will move in the direction of the nearest optimum (the optimum x0 ), with basin of attraction containng but when it gets there it will not move any further, as at this point, all the neighbour solutions will be worse than the current solution. Such algorithm can therefore be relied on to nd the nearest local optimum, but there is no guarantee that it will also be the global one. The global optimum will be found only when x0 happens to land in its basin of attraction. The method which is most commonly used to overcome this problem, is to run multiple instances of the local search algorithm from dierent starting positions x0 . Then it is probable that at least one of them will start in the basin of attraction of the global optimum and will be able nd it. There are various dierent multi-start strategies which implement this basic idea, with MetaMax, the main subject of this thesis, being one of them. More thorough description of the local search algorithms problem of getting stuck in a local optimum is described in section 2. Detailed description of the MetaMax algorithm and its variations is given in section 3. Structure of the performed experiments is described in section 4. Finally, the measured results are presented and evaluated in section 5. 2 Problem description and related work As mentioned in the previous section, local search algorithms have problems nding the global optimum of functions with multiple optima (also called multimodal functions). In this section we focus on this problem more thoroughly. We describe several common types of local search algorithms in more detail and discuss their susceptibility to getting stuck in a local optimum. Next, we will describe several methods to overcome this problem. 2.1 Local search algorithms Following are the descriptions of four commonly used kinds of local search algorithms, which we hope will give the reader a more concrete idea about the functioning of local search algorithms, than the very basic example described in algorithm 1. Line search algorithms try to solve the problem of minimizing a d-dimensional function f by using a series of one-dimensional minimization tasks, called line searches. During each step of the algorithm, an imaginary line is created starting at the current solution xc and going in a suitably chosen direction σ . Then, x, with the minimal value of f (x), and the current solution is updated: xc ← x. In this way, the algorithm will eventually converge on a nearest local optimum of f . the line is searched for a point 3
  10. 10. The question remains - How to chose the search direction σ? The most simple algorithms just use a preselected set of directions (usually vectors in an orthonormal positive d-dimensional base) and loop through them on successive iterations. This method is quite simple to implement, but it has trouble coping with ill-conditioned functions. An obvious idea might be to use information about the functions gradient for determining the search direction. However, this turns out not to be much more eective than simple alternating algorithms. The best results are achieved when information about both the functions gradient and its Hessian is used. Then, it is possible to get quite robust and well performing algorithms. Note, that for black-box optimization problems, it is necessary to obtain the gradient by estimation, as it is not explicitly available. Examples of this kind of algorithm are: Symmetric rank one method, gradient descent algorithm and Broyden-Fletcher-GoldfarbShanno algorithm Pattern search algorithms closely t the description given in algorithm 1. They generate the neighbour solutions relative to the current solution xn ∈ X n in dened positions (a pattern) xc . If any of the neighbour solutions is found to be better than the current one, it then becomes the new current solution, the next set of neighbour solutions is generated around it and so on. If none of the neighbour solutions is found to be better (an unsuccessful iteration), then the pattern is contracted so that in the next step the neighbour solutions are generated closer to xc . In this way the algorithm will converge to the nearest local optimum (for proof, please see [KLT03]). Advanced pattern search algorithms use patterns, which change size shape according to various rules, both on successful and unsuccessful iterations. Typical algorithms of this type are: Compass search (or coordinate search), Nelder-Mead simplex algorithm and Luus-Jakola algorithm. Population based algorithms keep track of a number of solutions, also called individuals, at one time, which together constitute a population. A new generation of solutions is generated each step, based on the properties of a set of selected (usually the best) individuals from the previous generation. Dierent algorithms vary in the exact implementation of this process. For example, in the family of genetic algorithms, this process is designed to emulate natural evolution: Properties of each individual (in case of continuous optimization, this means its position) are encoded into a genome and new individuals are created by combining parts of genomes of successful individuals from the previous generation, or by random mutation. Unsuccessful individuals are discarded, in an analogy with the natural principle of survival of the ttest. Other population based algorithms, such as CMA-ES take a somewhat more mathematical approach: New generations are populated by sampling a multivariate normal distribution, which is in turn updated every step, based on the properties of a number of best individuals from the previous generation. 4
  11. 11. Swarm intelligence algorithms are based on the observation, that it is possible to get quite well performing optimization algorithms by trying to emulate natural behaviours, such as ocking of birds or sh schools. Each solution represents one member of a swarm and moves around the search space according to a simple set of rules. For example, it might try to keep certain minimal distance form other ock members, while also heading in the direction with the best values of f (x). The specic rules vary a great deal between dierent algorithms, but in general even a simple individual behaviour is often enough to result in quite complex collective emergent behaviour. Because swarm intelligence algorithms keep track of multiple individuals/solutions during each step, they can also be considered to be a subset of population based algorithms. Some examples of this class of algorithms are the Particle swarm optimization algorithm and the Fish school search algorithm. Pattern search and line search algorithms, have the property that they always choose neighbour solutions close to the current solution and they move in direction of decreasing f (x). Thus, as was already described in the previous section, they are able to nd only the local optimum, which is nearest to their starting position x0 . Population based and swarm intelligence algorithms might be somewhat less susceptible to this behaviour, in the case where the initial population is spread over a large area of the search space. Then there is a chance that some individuals might land near to the global optimum, and eventually pull the others towards it. There are several modications of local search algorithms specically designed to overcome the problem of getting stuck in a local optimum. We shall now describe two basic ones - Simulated annealing and Tabu search. The main idea behind them, is to limit the local search algorithms greedy behaviour by sometimes taking steps other than those, which lead to the greatest decrease of f (x). Simulated annealing implements the above mentioned idea in a very straightfor- ward way: During each step, the local search algorithm may select any of the generated neighbour solutions with a non-zero probability, thus possibly not selecting the best one. The probability of f (xc ), f (xn ), P of choosing a particular neighbour solution and s, where s xn is a function is the number of steps already taken by the algorithm. Usually, it increases with the value of ∆f = f (xc ) − f (xn ), so that the best neighbour solutions are still likely to be picked the most often. The probability of choosing a neighbour solution other than the best one also usually decreases as s increases, so that the algorithm behaves more randomly in the beginning and then, as the time goes on, settles down to a more predictable behaviour and converges to the nearest optimum. This is somewhat similar to the metallurgical process of annealing, from where the algorithm takes its name. It is possible to apply this method to almost any of the previously mentioned local search algorithms, simply by adding the possibility of choosing neighbour solutions, which are not the best. In practice, the exact form of 5 P (f (xc ), f (xn ), s)
  12. 12. has to be ne-tuned for a given problem in order to get good results. Therefore, this algorithm is of limited usefulness in the area of black-box optimization. Tabu search works by keeping list of previously visited solutions, which is called the tabu list. It selects potential moves only from the set of neighbour solutions, which are not on the this list, even if it means choosing a solution, which is worse than the current one. The selected solution is then added to the tabu list and the oldest entry in the Tabu list is deleted. The list therefore works in a way similar to a cyclic buer. This method has been originally designed for solving combinatorial optimization problems and it requires certain modications in order to be useful in the area of continuous parameter optimization. At the very least, it is necessary to modify the method to not only discard neighbour solutions which are on the tabu list, but also solutions which are close to them. Without this, the algorithm would not work very well, as the probability of generating the exact d same solution twice in R is quite small. There is a multitude of advanced variations of this basic method, for example it is possible to add aspiration rules, which override the tabu status of solutions that would lead to a large decrease in f (x). For a detailed description of tabu search adapted for continuous optimization, please see [CS00]. 2.2 Multi-start strategies Multi-start strategies allow eectively using local search algorithms on functions with multiple local optima without making any modication to the way they work. The basic idea is, that if we run a search algorithm multiple times, each time from a dierent starting position x0 , then it is probable that at least one of the starting positions will be in the basin of attraction of the global optimum and thus the corresponding local search algorithm will be able to nd it. Of course, the probability of this depends on the number of algorithm instances that are run, relative to the number and properties of the functions optima. It is possible to think about multistart strategies as meta-heuristics, running above, and controlling multiple instances of local search algorithm sub-heuristics. Restart strategies are a subset of multi-start strategies, where multiple instances are run one at a time in succession. The most basic implementation of a restart strategy is to take the total amount of allowed resource budget (usually a set number of objective function evaluations), evenly divide it into multiple slots, and use each of them to run one instance of a local search algorithm. A very important choice is deciding the length of a single slot. The optimal length largely depends on the specic problem and type of used algorithm. If the length is set too low, then the algorithm might not have enough time to converge to its nearest optimum. If it is too long, then there is a possibility that resources will be wasted on running instances which are stuck in local optima and can no longer improve. Of course, all of the time slots do not have to be of the same length. A good strategy for black-box optimization is to start with low length and keep increasing 6
  13. 13. it for each subsequent slot. In this way, a reasonable performance can be achieved even if we are unable to choose the most suitable slot length for a given problem in advance. A dierent restart strategy is to keep each instance going as long as it needs to until it converges to an optimum. The most universal way to detect convergence is to look for stagnation of values of the objective function over a number of past function evaluations (or past local search algorithm iterations). If the best objective function value found so far does not improve by at least the limit tf over the last hf function evaluations, then the current algorithm instance is terminated and new one is started. For convenience, in the subsequent text we will call hf the function value history length and tf the function value history tolerance. An example of this restart condition is given in gure 1: The best solution found after v function evaluations ∗ ∗ is marked as xv and its corresponding function value as f (xv ). In the gure, we see that the restart condition is triggered because at the last function evaluation m, the ∗ ∗ following is true: f (xm−h ) ≤ f (xm ) + tf f m−hf 3000 f(xv ) 2500 2000 1500 1000 5000 5 10 15 20 v 25 30 35 ∗ f(xm ) + tf ∗ f(xm ) Figure 1: Restart condition based on function value stagnation Displays the objective function value f (xv ) (dashed black line) of evaluation v , and the best objective function value reached after v function evaluations f (x∗ ) (solid black line), over the interval 0, m function s evaluations. The values f (x∗ ), f( x∗ ) + tf and m − hf are highlighted. m m It is, of course, necessary to choose specic values of hf and tf but usually it is not overly dicult to nd a combination which works well for a large set of problems. Various dierent ways of detecting convergence and corresponding restart conditions can be used. For example, reaching zero-gradient for line-search algorithms, 7
  14. 14. reaching minimal size of pattern for pattern search algorithms, etc. There are also various ways of choosing the starting position search algorithm instances. The simplest one is to choose x0 x0 for new local by sampling random uniform distribution over the set of all feasible solutions. This is very easy to implement and often gives good results. However, it is also possible to use information gained by the previous instances, when choosing x0 for a new one. A simple algorithm, which utilizes this idea is the Iterated search: The rst instance i1 is started from an arbitrary position and it is run until it converges (or until it exhausts certain amount of resources) and returns the best solution it has ∗ found xi1 . Then, the starting position for the next instance is selected from the neigh∗ bourhood N of xi . Note, that N is a qualitatively dierent neighbourhood, than 1 what the instance i1 might be using to generate neighbourhood solutions each step. It is usually much larger, with the goal being to generate the new starting point for instance i2 by perturbing the best solution of i1 enough, to move it to a dierent x∗2 better than x∗1 , then the i i ∗ ∗ ∗ next instance is started from the neighbourhood N (xi ). If f (xi ) ≥ f (xi ) and a basin of attraction. If the new instance nds a solution 2 2 1 better solution is not found, then the next instance is started from the neighbour∗ hood N (xi ) again. This is repeated until a stop condition is triggered. An obvious 1 assumption, that this method makes, is that the minima of the objective function are grouped close together. If this is not the case, then it might be better to use uniform random sampling. The big question is, how to choose the size of the neighbourhood N? Too small, and the new instance might fall into the same basin of attraction as the previous one. Too big, and the results will be similar to choosing the starting position uniformly randomly. Another method, called the Variable neighbourhood search, which can, in a way, be considered to be an improved version of the iterated search, tackles this N1 , . . . , Nk of varying sizes, N1 is the smallest and the following neighbourhoods are successively larger, with Nk being the largest. The restarting procedure is the same as with iterated search, with the following modication: If a local search algorithm instance ik , started ∗ from the neighbourhood N1 (xi ) does not improve the current best solution, then k−1 ∗ the algorithm tries starting the next instance from N2 (xi ), then N3 (x∗k−1 ), and so i k−1 problem by using multiple neighbourhood structures where on. The structure of a basic variable neighbourhood search, as given in [HM03], page 10, is described in algorithm 2. This algorithm can also be used as a description of iterated search, if the set of neighbourhood structures contains only one element. Yet another group of methods, which aim to prevent local search algorithms from getting stuck in local optima is based on the idea that it is not necessary to run multiple local search algorithm instances one after another, but they can be run at the same time. Then, it is possible to evaluate the expected performance of each instance based on the results it obtained so far and allocate the resources to the best (or most promising) ones. This is somewhat similar to the well known multi-armed bandit problem. The basic implementation of this idea is called the explore and exploit strategy. It involves initially running all of its k algorithm instances until a certain fraction of its resource budget is expended. This is the exploration phase. Then, the best 8
  15. 15. Algorithm 2: Variable neighbourhood search : initial position input x0 , set of neighbourhood structures N1 , . . . , Nk of increasing size 1 2 3 4 5 6 7 8 9 10 11 x∗ ← local_search(x0 ) k←1 while Stop condition not met Generate random point y ∗ ← local_search(x ) ∗ ∗ if f (y ) f (x ) then ∗ ∗ x do from Nk (x∗ ) x ←y k←1 else k ←k+1 return x∗ instance is selected and run until the rest of the resource budget is used up - The exploitation phase. There is, again, an obvious trade o between the amount of resources allocated to each phase. The exploration phase should be long enough, so that, when it ends, it is possible to reliably identify the best instance. On the other hand, it is necessary to have enough resources left in the exploitation phase in order for the selected best instance to converge to the optimum. In practice, it is actually not that dicult to nd balance between these two phases, that gives good results for a wide range of problems. Methods like this, which run multiple local search algorithm instances at the same time, belong into the group of portfolio algorithms. We should, however, note that portfolio algorithms are usually used in a somewhat dierent way than described here. Most commonly, they run multiple instances of dierent local search algorithms, each of which is well suited for a dierent kind of problem. This allows the portfolio algorithm to select instances of that algorithm, which is able to solve the given problem the most eciently, even without knowing its properties a priori. The MetaMax algorithm, which is the main subject of this thesis, is also an portfolio algorithm. However, we use it only running one kind of local search algorithm at a time, to allow for a more fair and direct comparison with restart strategies, which typically use only one kind of local search algorithm. 3 MetaMax algorithm and its variants The MetaMax algoithm is a multi-start portfolio strategy presented by György and Koscis in [GK11]. There are, actually, three versions of the algorithm, which dier in certain details. They are called MetaMax(k), MetaMax(∞) and MetaMax and they will be described in detail in this section. Please note, that while in this text we usually presume all optimization prob- 9
  16. 16. lems to be minimization problems, the text in [GK11] assumes a maximization task. Therefore, while describing the workings of MetaMax algorithm in this section, we will keep to the convention in [GK11], but in the rest of the text we will refer to minimization tasks as usual. Our implementation of MetaMax was modied to work with minimization tasks. György and Kocsis demonstrate ([GK11], page 413, equation 2) that convergence of an instance of a local search algorithm, after s steps, can be optimistically esti- mated with large probability as: lim f (x∗ ) ≤ f (x∗ ) + gσ (s) s s (1) s→∞ Where f ( x∗ ) s is the best function value obtained by the local search algorithm in- stance up until the step s and gσ (s) is a non increasing, non negative function with lims→∞ g(s) = 0. Note, that the notation used here is a little dierent than in [GK11], but the meaning is the same. In practice, the exact form of g(s) is not known, so the right side of equation 1 has to be approximated as: f (x∗ ) + ch(s) s Where (2) c is an unknown constant and h(s) is a positive, monotone, decreasing function with the following properties: h(0) = 1, lims→∞ h(s) = 0 (3) h(s) = e−s . In the subsequent text, we One possible simple form of this function is shall call this function the estimate function. György and Kocsis do not use this name in their work. In fact, they do not use any name for this function at all and refer to it simply as h function. However, we think that this is not very convenient, hence we picked a suitable name. Based on equations 1 and 2, it is possible to create a strategy that allocates resources only to those instances, which are estimated to converge the most quickly and maximize the value of expression 2 for a certain range of the constant c. The problem of nding these instances can be solved eectively by transforming it into a problem of nding the upper right convex hull of a set of points in the following way: Ai si , the position xi,si of the best solution ∗ it has found so far and its corresponding function value f (xi,s ). If we represent the i set of the local search algorithm instances Ai , i = 1, . . . , k by a set of points: We assume, that there is k number of instances in total and that each instance keeps track of the number of steps it has taken P : {(h(si ), f (x∗ i )), i = 1, . . . , k} i,s (4) Then the instances which minimize the value of expression 2 for a certain range of c correspond to those points, which lie on the upper right convex hull of the set P. Because the term upper right convex hull is not quite standard, we should clarify that we understand it to mean an intersection of the upper convex hull and the right convex hull. 10
  17. 17. Note, that presumably for simplicity, the authors of [GK11] assumed only local search algorithms which use the same number of function evaluations every step. For algorithms where this is not true, it makes more sense to set of function evaluations used by the instance i si equal to the number so far instead. We believe that this is a better way to measure the use of resources by individual instances, which is also conrmed in [PG13]. György and Kocsis suggest using a form of estimate function, which changes based on the amount of resources used by all the local search algorithm instances, in order to encourage more exploratory behaviour as the MetaMax algorithm progresses. Therefore, in our implementation, we use the following estimate function, which is recommended in [GK11]: h(vi , vt ) = e−vi /vt Where vi (5) is the number of function evaluations used by instance i and vt is the total number of function evaluations used by all of the instances combined. The simplest of the three MetaMax variants is MetaMax(k). It uses k local search algorithm instances and is described in algorithm 3. For convenience and improved readability, we will use simplied notation, when describing MetaMax variants: vi for number of function evaluations used by local search algorithm instance i so far xi for position of the best solution found by instance fi for function value of i so far xi In the descriptions, we also assume that the estimate function h is a function of only one variable. Algorithm 3: MetaMax(k) : function to be optimized input f, number of algorithm instances monotone non-decreasing function h k and a with properties as given in equation 3 1 Step each of the variables 2 3 while v i , xi k local search algorithm instances and Ai and update their fi stop conditions not met do i = 1, . . . , k , select algorithm Ai if there exists c 0 so that: fi + ch(vi ) fj + ch(vj ) for all j = 1, . . . , k so that (vi , fi ) = (vj , fj ). If there are multiple algorithms with identical v and f , then select only one For of them at random. 4 5 6 7 Step each selected Ai and update its variables vi , xi and fi . b = argmin1,...,k (fi ). ∗ solution: x ← xb . Find the best instance: Update the best return x∗ As with a priori scheduled restart strategies, there is the question of choosing the right number of instances (parameter k) 11 to use. The other two versions of the
  18. 18. algorithm - MetaMax and MetaMax(∞) get around this problem by gradually increasing the number of instances, starting with a single one and adding a new one every round. Thus, the number of instances tends to innity as the algorithm keeps running. This allows to prove that the algorithm is consistent. That is, it will almost surely nd the global optimum if kept running for an innite amount of time. Please note, that in some literature, such as [Neu11], the term asymptotically complete is used, instead of consistent, but both of them mean the same thing. Also note, that we use the word round to refer to a step of the MetaMax algorithm, in order to avoid confusion with steps of local search algorithms. MetaMax and MetaMax(∞) are described in algorithms 5 and 4 respectively, also using the simplied notation. Algorithm 4: MetaMax( input ∞) : function to be optimized f, monotone non-decreasing function h with properties as given in equation 3 1 2 3 r←1 while stop conditions not met do Add a new local search algorithm instance Ar , step it once and initialize vr , xr and fr For i = 1, . . . , r , select algorithm Ai if there exists c 0 so that: fi + ch(vi ) fj + ch(vj ) for all j = 1, . . . , r so that (vi , fi ) = (vj , fj ). If there are multiple algorithms with identical v and f , then select only one its variables 4 of them at random. 5 6 7 8 9 Step each selected Ai and update its variables Find the best instance: vi , xi and fi . b = argmin1,...,r (fi ). x∗ ← x b . Update the best solution: r ←r+1 return x∗ MetaMax and MetaMax(∞) dier only in one point (lines 6 and 7 in algorithm 5): If, after stepping all selected instances, the best instance is a dierent one than in the previous round, MetaMax will step it until it overtakes the old best instance in terms of used resources. In [GK11] it is shown that MetaMax asymptotically approaches the performance of its best local search algorithm instance as the number of rounds increases. Theoretical analysis suggests that the number of instances increases at a rate of where √ Ω( vt ), vt is the total number of used function evaluations. However, practical results vt Ω( logvt ). Based on this, it can also be estimated ([GK11], page 439) that to nd the global optimum xopt , MetaMax needs only a logarithmic give a rate of growth only of factor more function evaluations than a local search algorithm instance, which would start in the basin of attraction of xopt . Note a small dierence in the way MetaMax and MetaMax(∞) are described in algorithms 5 and 4 from their descriptions in [GK11]. There, a new algorithm instance Ar is added with fr = 0 and sr = 0 and takes at most one step during the round that it is added. This is possible because in [GK11] a non-negative objective function f and a maximization task are assumed. Therefore, an algorithm instance 12
  19. 19. Algorithm 5: MetaMax input : function to be optimized f, monotone non-decreasing function h with properties as given in equation 3 1 2 3 r←1 while stop conditions not met do Ar , step it once and initialize vr , xr and fr For i = 1, . . . , r , select algorithm Ai if there exists c 0 so that: fi + ch(vi ) fj + ch(vj ) for all j = 1, . . . , r so that (vi , fi ) = (vj , fj ). If there are multiple algorithms with identical v and f , then select only one Add a new local search algorithm instance its variables 4 of them at random. 5 6 7 8 9 10 Step each selected Ai and update its variables vi , xi and fi . br = argmin1,...,r (fi ). If br = br−1 step instance Abr until vbr ≥ vbr−1 ∗ Update the best solution: x ← xb . r ←r+1 Find the best instance: return x∗ can be added without taking any steps rst, and assigned a function value fr = 0, which is guaranteed to not be better than any of the function values of the other instances. We are, however, dealing with a minimization problem with a known target value (see [Han+13b]) but no upper bound on of f. f and, consequently, no worst possible value Therefore, we made a little change and step the new instance Ar immediately after it is added. It can then also be stepped second time, during step 4 in algorithms 5 and 4. We believe, that this has no signicant impact on performance. 3.1 Suggested modications MetaMax and MetaMax(∞) will add a new instance each round as long as they are running, with no limit on the maximum number of instances. The authors of [GK11] state that the worst-case computational overhead of MetaMax and 2 MetaMax(∞) is O(r ), where r is the number of rounds. For the purpose of optimizing functions, where each function evaluation uses up a large amount of computational time (for which MetaMax was primarily designed), the overhead will be negligible compared to the time spent calculating function values and will not present a signicant problem. However, in comparison with restart strategies which have typically almost no overhead this is still a disadvantage for MetaMax. Therefore, it would be desirable to come up with some mechanism that would improve its computational complexity. An obvious solution would be to limit the total number of instances which can be added or slow down the rate at which they are added so that there will never be too many of them. However, this would make MetaMax and MetaMax(∞) behave basically in the same way as MetaMax(k) and lose their main property, which is the 13
  20. 20. consistency based on always generating new instances. A better solution would be to add a mechanism which would discard one of already existing instances every time a new one is added and therefore keep the total number of instances at any given time constant. The important question is: Which one of the existing instances should be discarded? We propose the following approach: Discard the instance which has not been selected for the longest time. If there are multiple instances which qualify, discard the one with the worst function value. The rationale behind this discarding mechanism is that MetaMax most often selects (allocates the most resources to) those instances, which have the best optimistic estimate of convergence. Therefore, the instances which are selected the least often will likely not give very good results in the future, and so make good candidates for deletion. An alternative, method may also be to discard the absolute worst instance (in terms of the best objective function value found so far). Which is even simpler, but we feel that it does not follow so naturally from the principles behind MetaMax. Therefore, for most of our experiments we will use the discarding of least selected instances. Another area where we think it might be benecial to modify the workings of MetaMax, is the mechanism of selecting instances to be stepped in each round. The original mechanism has two possible disadvantages: Firstly, it is not invariant to monotone transformation of the objective function values. By this we mean a f (x) → f (x), which itself is only a function of the value of f (x) and not the parameter vector x. The monotone property meaning, that if f (x1 ) f (x2 ) then f (x1 ) f (x2 ) for all possible x1 and x2 . Such a monotone transformation will not change the location of the optima of f (x). I will also not change the direction of gradient of f (x) for any x, but not necessarily its magnitude. An example of such mapping transformation is given in gure 2. Logically, it would not make much sense to require an optimization algorithm to be invariant to an objective function value transformation, which is not monotone, as it could change the position of the functions optima. The second possible disadvantage of the convex hull based instance selection mechanism is that it also behaves dierently based on the choice of the estimate function h. given, while This is not such a great disadvantage as the rst one, because h f (x) is can be chosen freely. However, it would still be benecial if we could entirely remove the need to choose h. To overcome these problems, we propose a new instance selection mechanism. It uses the same representation of local search algorithm instances as a set of points P, given in equation 4 but it select those instances, which correspond to non-dominated in the sense of maximizing fi and maximizing h(vi ) (or analogically fi and minimizing vi ). This method is clearly invariant to both monotone transformation of objective function values f → f and dierent choices of h, as determining non-dominated points depends only on their ordering along the axes fi and h(xi ), which will always be preserved due to the fact that both f → f and h are monotone. Moreover, the points which lie on the right upper convex hull of P , and thus maximize the optimistic estimate fi + ch(vi ), are always non-dominated, points of P maximizing and thus will always be selected. 14
  21. 21. 1 3 2 1 10 0 1 2 2 3 3 2 15 10 5 30 1 3 2 1 10 0 1 2 2 3 3 2 2 1 1 0 0 1 1 2 4000 3000 2000 1000 0 23 2 3 3 2 1 0 1 3 2 3 2 1 1 0 Figure 2: Example of monotone transformation of 2 f (x) Displays a 3D mesh plot of a Rastrigin like function f (x) in the top left, a transformed function f (x)3 in the top right and their respective contour plots on the bottom. It is clear, that the shape of the contours is the same, but their heights are not. A possible disadvantage of the proposed algorithm is, that at each round it selects many more points than the original convex hull mechanism. This might result in selecting instances with low convergence estimate too often, and not dedicating enough resources to the more promising ones. A visual comparison of the two selection mechanisms and demonstration of the inuence of choice of estimate function upon selection are presented in gure 3. 15
  22. 22. fi fi3 0.0 0.5 1.0 1.5 2.0 2.5 3.0 3.5 0 1 2 3 4 5 1e8 1e25 0.1 0.2 0.3 0.4 h(vi ) 0.1 0.2 0.3 0.4 h(vi ) Figure 3: MetaMax selection mechanisms Compares the original selection mechanism based on nding upper convex hull (left sub-gures), with the new proposed mechanism based on selecting non-dominated points (right sub-gures). Also demonstrates the eects of monotone transformation of the objective function values on the selection, with f (x) for the upper sub-gures and f (x)3 for those on the bottom. Selected points are marked as red diamonds, connected by a red line. Unselected points are marked as lled black circles. 4 Experimental setup All of the experiments were conducted using the COCO (Comparing continuous optimizers) framework [Han13a], which is an open-source set of tools for systematic evaluation and comparison of real-parameter optimization strategies. It provides a set of 24 benchmark functions of dierent types, chosen to thoroughly test the limits and capabilities of optimization algorithms. Also included are tools for running experiments on these functions and logging, processing and visualising the measured data. The library for running experiments is provided in versions for C, Java, R, Matlab and Python. The post processing part of the framework is available for Python only. The benchmark functions are divided into 6 groups, according to their properties. They are briey described in table 1. For detailed description, please see [Han+13a]. There are also multiple instances dened for each function, which are created by applying various transformations to the base formula. We shall now briey explain some of the functions properties mentioned in table 1. As already mentioned, the terms unimodal and multimodal refer to functions with 16
  23. 23. Name separ lcond hcond multi mult2 Functions Description 1-5 Separable functions 6-9 Functions with low or moderate conditionality 10-14 Unimodal functions with high conditionality 15-19 Multimodal structured functions 20-24 Multimodal functions with weak global structure Table 1: Benchmark function groups single optimum and multiple local optima respectively. Conditionality describes how much the functions gradient changes depending on direction. Simply put, functions with high conditionality (also called ill-conditioned functions), at certain points, grow rapidly in some directions but slowly in others. This often means that the gradient points away from the local optimum, which presents a dicult problem for some local search algorithms. To give a more visual description, one can imagine that 3D graphs of two-dimensional ill conditioned functions usually form sharp ridges, while those of well conditioned functions form gentle round hills. f (x1 , x2 , . . . , xd ) = f (x1 ) + f (x2 ) + ...+f (xd ), which means that they can be minimized by minimizing d one-dimensional functions, where d is the number of dimensions of the separable function. Separable functions have the following form: In order to exhaustively evaluate performance of the selected strategies, we decided to make the following series of measurements for each strategy: 1. Using four dierent local search algorithms - Compass search, Nelder-Mead method, BFGS and CMA-ES. In order to evaluate the eect of algorithm choice. 2. Using all of the 24 noiseless benchmark functions available in the COCO framework, to measure performance on a wide variety of dierent problems. 3. Using the following dimensionalities : d = 2, 3, 5, 10, 20. To see how much is the performance aected by the number of dimensions. 4. Using the rst fteen instances of each function. According to [Han+13b], this number is sucient to provide statistically sound data. Resource budget for minimizing a single function instance (a single trial) was set to 105 d, meaning 100000 times the number of dimensions of the instance. The reasons for choosing the four local search algorithms are: Compass search algorithm was chosen for its simplicity, in order to allow us to evaluate whether MetaMax can improve performance of such a basic algorithm. Nelder-Mead method was chosen as a more sophisticated representative of the group of pattern search algorithms, than compass search. BFGS was selected as a typical line search method. Finally, CMA-ES is there to represent population based algorithms. It is also the most advanced of the four algorithms and thus we expect that it will perform the best of the four selected algorithm. For a more detailed description of these algorithms, please see section A. 17
  24. 24. 4.1 Used multi-start strategies In this section, we describe the selected MetaMax and restart strategies, which were evaluated using the methods described above. For convenience, we assigned a shorthand name to each used strategy, so that we can write, for example csa-h-10d, instead of objective function stagnation based restart strategy with history length 10d using compass search algorithm, which is impractically verbose. The shorthand names have the following form: abbreviation of the used local search algorithm, dash, used multi-start strategy, dash, strategy parameters. A list of all used strategies and their shorthand names is given in table 3. We chose two commonly used restart strategies to compare with MetaMax: a xed restart strategy with a set number of resources allocated to each local search algorithm run, and a dynamic restart strategy with restart condition based on objective function value stagnation. Performance of these two strategies largely depends on the combination of the problem being solved and the strategy parameters. Therefore, we decided to use six xed restart strategies and six dynamic restart function stagnation strategies with dierent parameters: • Fixed restart strategies Run lengths: nf = 100d, 200d, 500d, 1000d, 2000d, 5000d evaluations. Shorthand names: • algorithm-f-nf Function value stagnation restart strategies hf = tf = 10−10 algorithm-h-hf Function value history lengths: 2d, 5d, 10d, 20d, 50d, 100d evaluations Function value tolerance: Shorthand names: Note, that the parameters depend on the number of dimensions of the measured function d. This is consistent with the fact that the total resource budget of the strategy also depends on d and that we can expect that for higher dimensionalities, the used local search algorithms will need longer runs to converge. The rationale behind choosing the used parameter values is the following: Using 5 the function evaluation budget of 10 d, run lengths longer than 5000d would give us less than 20 restarts per trial. This would result in a very low chance of nding the global optimum on most of the benchmark functions, some of which can have up to 10d optima. Also, it is probable that most local search algorithms will converge a long time before using up all 50000d function evaluations and then the rest of the allocated resources would be essentially wasted on running an instance which cannot improve any more. Conversely, run lengths smaller than 100d are probably not long enough to allow most local search algorithm instances to converge and so there would be little sense in using them. The choice of the upper bound of the function value history length hf as 100d is based on a similar idea: For values greater than 100d the restart condition would trigger too long after the local search algorithm has already converged, and so we would be needlessly wasting resources on it. The choice of the lower bound of depends on the used algorithm. For a restart strategy to function properly, 18 hf hf has to
  25. 25. be greater, or at least as much, as the number of function evaluations that the used local search algorithm uses during one step. The above stated value of hf = 2d is the minimal value for which the Nelder-Mead and BFGS algorithms work properly. For the other two algorithms, the minimal value is hf = 5d. We decided to base the function value history length on number of used function evaluations, rather than on number of taken steps, because it allows for a more direct comparison of performance of the same strategy using two dierent algorithms. Choosing the value of the function stagnation tolerance tf involved a little bit more guesswork. There is target function value dened for all of the benchmark functions, which is equal to the function value at their global optimum f (xopt ) plus −8 a tolerance value ftol = 10 . That is, the function instance is considered to be solved if we nd some point x f (x) ≤ f (xopt ) + ftol . We tf = 10−10 on ftol . with the function stagnation tolerance parameter tf one hundred times lower than ftol based our choice of Setting the value of should make it large enough to reliably detect convergence, while not being too large to trigger the reset condition prematurely, when the local search algorithm is still converging. The goal of using multiple strategies with dierent parameter values is to have at least one xed restart and one objective function value stagnation based strategy, that performs well on the set of all functions, for each measured dimensionality. For easier comparison of results of the xed restart strategies, we represent them all together, by choosing only the results of the best performing strategies for each dimensionality and collecting them into a best of collection of results, which we will refer to by the shorthand name algorithm-f-comb. This represents the results of running a xed restart strategy, which is able to choose the optimal run length (from the set of six used run lengths), based on dimensionality of the function being solved. The results of objective function value stagnation strategies are represented in an analogous way, under the name algorithm-h-comb. Besides the already mentioned restart strategies, we decided to add four more, each based on a restart condition specic to one of the used local search algorithms. Shorthand names for these strategies are algorithm-special. They are described in table 2. In order to save computing time, and as per recommendation in [Han+13b], we used an additional termination criterion that halts the execution of a restart strategy after 100 restarts, even if the resource budget has not yet been exhausted and the solution has not been found. This does not impact the accuracy of the measurements, as 100 restarts is enough to provide statistically signicant amount of data and the metrics which we use (see subsection 4.2) are not biased against results of runs which did not use up the entire resource budget. In fact, the xed restart strategies f-100d, f-200d and f-500d always reach 100 restarts before they can fully exhaust their resource budgets. The idea of using the original pure versions of MetaMax and MetaMax(∞) algorithms, which keep adding local search algorithm instances without limit, proved to be impractical due to its excessive computational resource requirements (for the length of experiments that were planned). Therefore, we performed measurements using only the modied versions of MetaMax and MetaMax(∞) with the added 19
  26. 26. Algorithm Compass search Description Restart when the variable a, which aects how far from the cur- rent solution the algorithm generates neighbour solutions, decreases −10 below 10 . It naturally decreases as the algorithm converges, so checking its value makes for a good restart condition. Nelder-Mead We chose a similar condition to the one mentioned above. Restart is triggered when distance between the two points of the simplex −10 which are the farthest apart from each other decreases below 10 . The rationale is similar as above: the simplex keeps growing smaller as the algorithm converges. It might be more mathematically proper to check the area (or volume, or hyper-volume, depending on the dimensionality) of the simplex, but we discarded this idea out of concern that it might be too computationally intensive. BFGS Restart condition is triggered if the norm of the gradient is smaller −10 than 10 . Since the algorithm already uses information about the gradient, it makes sense also to use it for detecting convergence. CMA-ES The recommended settings for CMA-ES given in [Han11] suggest using 9 dierent restart conditions. Here we use these recommended settings. Note that when using CMA-ES with the other restart strategies, we use only a single restart condition and the additional ones are disabled. In a sense, we are not using the algorithm to its full potential, but this allows for a more direct comparison with other local search algorithms. Table 2: Algorithm specic restart strategies mechanism (described in subsection 3.1) for limiting maximum number of instances. For all MetaMax strategies, we used the recommended form of estimate function: h = e−vi /vt . Measurements were performed using the following MetaMax strategies: 1. MetaMax(k), with k=20, k=50 and k=100. This gives the same total number of local search algorithm instances as when using xed restart strategies with run lengths equal to 5000d, 2000d and 1000d respectively. This makes it possible to evaluate the degree to which the MetaMax mechanism of selecting the most promising instances improves the performance over these corresponding restart strategies. The expectation is, that the success rate for MetaMax(k) will not increase, because the number of instances and thus the ability to explore the search space stays the same. However, MetaMax(k) should converge faster than the xed restart strategies, because it should be able to identify the best instances and allocate resources to them appropriately. 2. MetaMax and MetaMax(∞) with the maximum number of instances set to 100. This should allow us to asses the benets of the mechanism of adding new instances (and deleting old ones), by comparing the results with MetaMax(k), which uses the same number of instances each round, but does not add or delete any. Here, we would expect an increase in success rate on multimodal 20
  27. 27. functions, as the additional instances ,generated each round, should allow the algorithms to explore the search space more thoroughly. However, the limit of 100 instances will possibly still not be enough to get a good success rate for multimodal problems with high dimensionality. 3. MetaMax and MetaMax(∞) with maximum number of instances set to 50d. This should allow the algorithms to scale better with the number of dimensions and, hopefully, further improve their performance. The number 50d was chosen as a reasonable compromise between computation time and expected performance. We expect to get the best results here. algorithm-k-X for MetaMax(k), algorithm-m-X for MetaMax and algorithm-i-X for MetaMax(∞), where X is the maximum allowed number of instances (or, equivalently, the value of k for Shorthand names for MetaMax variants were chosen as MetaMax(k)). Fixed restart strategies f-100d f-200d f-500d f-1000d f-2000d f-5000d f-comb Run length = 100d evaluations Run length = 200d evaluations Run length = 500d evaluations Run length = 1000d evaluations Run length = 2000d evaluations Run length = 5000d evaluations Combined xed restart strategy Function value stagnation restart strategies h-2d h-5d h-10d h-20d h-50d h-100d h-comb History length = 2d evaluations History length = 5d evaluations History length = 10d evaluations History length = 20d evaluations History length = 50d evaluations History length = 100d evaluations Combined function value stagnation restart strategy Other restart strategies special Special restart strategy specic to each algorithm, see table 2 MetaMax variants k-20 k-50 k-100 k-50d m-100 m-50d i-100 i-50d MetaMax(k) with k=20 MetaMax(k) with k=50 MetaMax(k) with k=100 MetaMax(k) with k=50d MetaMax with maximum number of instances = 100 MetaMax with maximum number of instances = 50d MetaMax(∞) with maximum number of instances = 100 MetaMax(∞) with maximum number of instances = 50d Table 3: Tested multi-start strategies 21
  28. 28. There is a number of additional interesting aspects of the MetaMax variants, which would be worth testing and evaluating. For example: • Comparison of MetaMax and MetaMax(∞) with the limit on maximum number of instances and without it. • Performance of dierent methods of discarding old instances. • Inuence of dierent choices of estimate function on performance. • Performance of our proposed alternative method for selecting instances. However, it was not practically possible (mainly time-wise) to perform full sized 5 (10 d function evaluatoin budget) experiments which would test all of these features. Therefore, we decided to make a series of smaller measurements, with the maximum number of function evaluations per trial set to 5000d, using only dimensionalities d=5, d=10 and d=20 and using only the BFGS algorithm. This should allow us to test these features at least in a limited way and see if any of them warrant further attention. More specically, we made the following series of measurements: 1. MetaMax and MetaMax(∞) without limit on maximum number of instances 2. MetaMax and MetaMax(∞) with maximum instance limit 5d, 10d and 20d, discarding the most inactive instances 3. MetaMax and MetaMax(∞) with maximum instance limit 5d, 10d and 20d, discarding the worst instances 4. MetaMax(k) with k=5d, k=10d and k=20d These measurements were repeated three times, rst time using the recommended −v /v form of the estimate function h1 (vi , vt ) = e i t , second time with a simplied vi function h(vi ) = e and the third time using the proposed alternate instance selection method, based on selecting non-dominated points. 4.2 Used metrics In this section, we describe the various metrics that were used to compare results of dierent strategies. The simplest one is the success rate. For a set of trials U (usually of one strategy running on one or more benchmark functions) and a chosen target value t, it can be dened as: SR(U, t) = (6) |u : u ∈ U, fbest (u) ≤ t| is the number of trials which have found a solution at least as good as t. In the rest of this text we use a mean success rate, averaged over a set of target values T Where |U | |{u ∈ U : fbest (u) ≤ t}| |U | is the number of trials and SRm (U, T ) = 1 |T | 22 SR(U, t) t∈T (7)
  29. 29. The main metric used in the COCO framework is the expected running time, or ERT. It estimates the expected number of function evaluations that a selected strategy will take to reach a target function value trials U. for the rst time, over a set of It is dened as: ERT (U, t) = Where t evals(u, t) 1 evals(u, t) |{u ∈ U : fbest (u) ≤ t}| u∈U (8) u to reach u if it never reached t. Expression successful trials for target t. If there were is the number of function evaluations used by trial target t, or the total number of evaluations used by |{u ∈ U : fbest (U ) ≤ t}| is the number of no such trials, then ERT (U, t) = ∞. In the rest o this text we will use ERT averaged over a set of target values T , in a similar way to what is described in equation 7. We will also usually compute it using a set trials obtained by running the same strategy on multiple dierent functions, usually all functions in one of the function groups described in table 1. For comparing two or more strategies in terms of success rates and expected running times, we use graphs of the empirical cumulative distributive function of run lengths, or ECDF. Such a graph displays on the y-axis the percentage of trials for which ERT (averaged over a set of target values evaluations x, that for each where x x T) is lower than the number of is the corresponding value on the x-axis. It can also be said, it shows the expected average success rate, if a function evaluation budget equal to x was used. For easier comparison of ECDF graphs across dierent dimensionalities, the values on the x-axis are divided by the number of dimensions. The function displayed in the graph can then be dened as: y(x) = 1 |{t ∈ T : ERT (t, u) ≤ x}| d|T ||U | u∈U (9) An example ECDF graph, like ones that are used throughout the rest of the text, is given in gure 4. It shows ERTs of two sets of trials measured by running two dierent strategies on the set of all benchmark functions, for d=10 and averaged over a set of 50 target values. The target values are logarithmically distributed in −8 2 the interval 10 ; 10 . We use this same set of target values in all our ECDF graphs. The marker × denotes the median number of function evaluations of unsuccessful trials, divided by the number of dimensions. Values to the right of this marker are (mostly) estimated using bootstrapping (for details of the bootstrapping method, please refer to [Han+13b]). The fact that we use 15 trials for each strategy-function pair means, that the estimate is reliable only up to about fteen times the number of evaluations marked by ×. This is a fact that should be kept in mind when evaluating the results. The thick orange line in the plot is represents the best results obtained during the 2009 BBOB workshop for the same set of problems and is provided for reference. Since we are dealing with a very large amount of measured results, it would be desirable to have a method of comparing them, that is even more concise than ECDF graphs. To this end, we use metric called aggregate performance index (API), dened 23
  30. 30. Proportion of trials 1.0 f1-24,10-D best 2009 0.8 0.6 bfgs-k-100 0.4 0.2 nm-k-100 0.00 1 2 3 4 5 6 7 8 log10evaluations/D Figure 4: Example ECDF graph Comparison of the results of MetaMax(k), with k=100, using BFGS and Nelder-Mead local search algorithms, on the set of all benchmark functions. The strategy using BFGS clearly outperforms the other one, both in terms of success rate and speed of convergence. by Mr. Po²ík in a yet unpublished (at the time of writing this text) article [Po²13]. It is based on the idea that the ECDF graph of the results of an ideal strategy, which solves the given problem instantly, would be a straight horizontal line across the top of the plot. Conversely, for the worst possible strategy imaginable, the graph would be a straight line along the bottom. It is apparent, that the area above (or bellow) the graph makes for quite a natural measure of eectiveness of dierent strategies. Given a set of ERTs A, their aggregate performance index can be computed as: AP I(A) = exp 1 log a |A| a∈A 10 (10) For the purposes of computing API, the ERTs of unsuccessful trials which are by denition ∞ have to be replaced with a value that is higher than ERT of any suc- cessful trial. The choice of this value determines how much the unsuccessful trials are penalized and thus aects the nal ERT score. For our purposes, we chose the 8 value 10 d. Since we are computing API from the area above the graph this means that the lower its value the better the corresponding strategy performs. Using API essentially allows us to represent results of a set of trials by a single number and to easily compare performances of dierent optimization strategies. 4.3 Implementation details The software side of this project was implemented mostly in Python, with parts in C. The original plan was to write the project purely in Python, which was cho- 24
  31. 31. sen because of its ease of use and availability of many open-source scientic and mathematical libraries. However, during the project it was found out that a pure Python code performs too slowly and would not allow us to make all the necessary measurements. Therefore, parts of the program had to be changed over to C, which has improved performance to a reasonable level. The used implementations of BFGS and Nelder-Mead algorithms, are based on the code from open-source Scipy library. They were modied to allow running the algorithms in single steps, which is necessary in order for them to work with MetaMax. An open-source implementation of CMA-ES was used, available at [Han13b]. Implementation of MetaMax was written based on its description in [GK11]. It was however, necessary to make several little changes to it mainly because it is designed with a maximization task in mind but we needed to use it for minimization problems. For nding upper convex hulls we used Andrew's algorithm with some additional pre and post processing, to get the exact behaviour described in [GK11]. For description of the source code, please see the le source/readme.txt on the attached CD. 5 Results In this section we will evaluate results of the selected multi-start strategies. We decided to split the results into four subsections based on used local search algorithm. We present and compare the results mainly using tables, which list APIs and success rates for dierent groups of functions and dierent dimensionalities. For convenience, the best results are highlighted with bold text. We also show ECDF graphs to illustrate particularly interesting results. Results of the smaller experiments, described at the end of section 4.1 and results of timing measurements are summarized in subsection 5.5. The values of success rates and APIs, which are shown in this section, are com5 puted only using data bootstrapped up to the value of 10 d function evaluations. In our opinion, these values represent the real performance of the selected strategies better than if we were to use fully bootstrapped data, which are estimated to a large degree and therefore not so statistically reliable. In ECDF graphs, bootstrapped re7 sults are shown up to 10 d evaluations. All of the APIs and success rates are averaged over a set of multiple targets, as described in subsection 4.3. The measured data are provided in their entirety on the attached CD (see section B) in the form of tarballed Python pickle les, which can be processed using the BBOB post processing framework. It was not possible to provide the data in their original form, as text les, because their total size would be in the order of gigabytes, which would clearly not t on the attached medium. 5.1 Compass search Table 4 summarizes which of the used xed restart and function value stagnation restart strategies were best for each dimensionality and chosen for the best-of result collections cs-f-comb and cs-h-comb. Table 5 then compares these two sets of re- 25
  32. 32. sults together with results obtained by the compass search specic restart strategy cs-special. It is apparent, that for the best strategies the values of run length and function value history length increase with the number of dimensions. This is not unexpected as compass search uses 2d or 2d-1 function evaluations at each step. Dimensionality d=3 d=5 d=10 d=20 Stagnation based cs-f-100d cs-f-100d cs-f-200d cs-f-500d cs-f-500d d=2 Fixed cs-h-5d cs-h-5d cs-h-10d cs-h-10d cs-h-20d Table 4: Compass search - best restart strategies for each dimensionality The comparison of the best restart strategies suggests, that all of them have quite cs-h-comb being a little better than the others in cs-f-comb in terms of API. In the subsequent tables, we cs-f-comb for reference, as an exaple of a well tuned restart similar overall performance, with terms of success rate and will provide results of strategy. None of the strategies performs very well on multimodal and highly conditioned functions. This is to be expected, as the compass search algorithm is known to have trouble with ill conditioned problems and multimodal problems are dicult to solve mult2 multi hcond lcond separ all for any algorithm. cs-f-comb cs-h-comb cs-special cs-f-comb cs-h-comb cs-special cs-f-comb cs-h-comb cs-special cs-f-comb cs-h-comb cs-special cs-f-comb cs-h-comb cs-special cs-f-comb cs-h-comb cs-special 2D 3D 3.75 4.46 3.82 4.16 2.58 2.69 2.63 3.84 3.49 4.69 4.92 3.03 3.11 3.09 4.18 4.22 log10 API 5D 5.52 5.62 5.80 4.33 4.03 4.32 5.34 10D 6.31 20D 6.90 Success rate [%] 2D 3D 5D 10D 20D 85 74 53 41 34 6.28 6.86 6.91 5.50 85 74 100 100 48 69 4.86 5.43 100 100 84 5.43 100 100 100 84 98 6.39 4.93 4.95 6.35 7.44 5.15 5.97 6.99 56 44 43 66 39 37 62 66 63 66 63 84 72 51 69 63 50 4.18 5.38 100 100 72 5.38 5.99 6.63 7.03 82 52 43 35 33 5.17 6.59 7.34 7.80 80 26 10 44 74 18 17 66 14 76 70 52 4.10 4.71 3.86 3.36 3.89 5.21 6.08 4.48 4.74 4.89 6.50 6.45 6.95 6.99 5.31 5.38 5.76 6.91 6.89 7.53 7.61 6.30 6.08 6.20 7.31 7.30 7.98 7.96 6.85 6.63 6.65 47 37 79 72 80 100 85 35 32 63 66 72 29 31 55 Table 5: Compass search - results of restart strategies 26 64 26 3.63 6.07 6.20 7.26 84 68 4.19 5.40 5.83 4.29 6.27 78 28 28 11 10 42 51 40 26 26 7 8 36 50 50
  33. 33. Comparison of the results of three MetaMax(k) strategies with corresponding xed restart strategies which use the same total number of local search algorithm instances is given in table 6. They conrm our expectations, and show that, overall, MetaMax(k) converges faster than a comparable xed restart strategy. The only exception being the group separ. This can be explained by the fact that functions from this group are very simple and can be generally solved by a single, or only very few, runs of the local search algorithm. In this case, the MetaMax mechanism of selecting multiple instances each round is more of a hindrance than a benet. In terms of success rate, MetaMax(k) is always as good or even better than the comparable xed restart strategy, with the improvement being especially obvious on the groups lcond and mult2. Of the three tested variants of MetaMax(k), cs-m-100 ever, it is not better than a well tuned restart strategy like is the best overall. How- cs-f-comb. Figure 5 shows a behaviour which was observed across all function groups and dimensionalities when comparing MetaMax(k) with corresponding xed restart strategies: At rst, MetaMax(k) converges much more slowly than the restart strategy, as it is still in the phase of initialising all of its instances. However, as soon as this is nished, it starts converging quickly and overtakes the restart strategy for a certain interval. After that, its rate of convergence slows down again and it ends up with 5 success rate (for 10 d function evaluations) similar to that of the restart strategy. Proportion of trials This eect seems to get less pronounced with increasing number of dimensions. 1.0 f1-24,5-D best 2009 0.8 0.6 cs-f-2000d 0.4 0.2 cs-k-50 0.00 1 2 3 4 5 6 7 8 log10evaluations/D Figure 5: Compass search - ECDF comparing MetaMax(k) with an equivalent xed restart strategy Results of comparing cs-k-100, cs-m-100 and cs-i-100 are shown in table 7. It is apparent, that using the same number of instances at a time MetaMax and MetaMax(∞) clearly outperform MetaMax(k) on all function groups, both in terms of speed of convergence and success rates. In general, they also provide results at least as good, or better than the best 27
  34. 34. all separ lcond hcond multi mult2 cs-f-1000d cs-f-2000d cs-f-5000d cs-k-20 cs-k-50 cs-k-100 cs-f-comb cs-f-1000d cs-f-2000d cs-f-5000d cs-k-20 cs-k-50 cs-k-100 cs-f-comb cs-f-1000d cs-f-2000d cs-f-5000d cs-k-20 cs-k-50 cs-k-100 cs-f-comb cs-f-1000d cs-f-2000d cs-f-5000d cs-k-20 cs-k-50 cs-k-100 cs-f-comb cs-f-1000d cs-f-2000d cs-f-5000d cs-k-20 cs-k-50 cs-k-100 cs-f-comb cs-f-1000d cs-f-2000d cs-f-5000d cs-k-20 cs-k-50 cs-k-100 cs-f-comb 2D 4.60 4.82 5.18 4.55 4.16 4.00 3D 5.20 5.45 5.71 5.30 5.01 4.84 3.75 4.46 3.28 3.54 4.08 3.38 2.82 2.68 2.58 4.56 4.75 4.83 4.19 3.83 3.85 3.84 5.32 5.56 5.97 5.32 5.14 5.03 4.19 5.26 5.44 5.94 5.27 4.80 4.58 4.29 4.59 4.79 5.00 4.51 4.12 3.82 3.86 3.72 4.22 4.74 4.16 3.88 3.85 3.03 5.22 5.37 5.38 5.12 4.98 4.54 log10 API 5D 5.85 5.94 6.08 5.87 5.75 5.62 10D 6.36 6.46 6.59 6.42 6.37 6.29 4.58 4.61 4.86 4.62 4.63 4.71 6.31 5.01 5.08 5.18 5.21 5.24 5.29 4.33 4.93 5.52 5.75 5.84 5.96 5.70 5.58 5.45 6.33 6.45 6.53 6.27 7.11 7.03 4.18 5.34 6.29 6.39 6.52 6.35 6.23 6.13 5.38 5.99 6.63 6.18 6.34 6.51 6.15 5.93 5.81 6.81 6.87 7.00 6.76 6.62 6.47 4.99 5.23 5.66 5.09 4.48 6.59 5.80 5.94 6.05 5.88 5.66 4.25 5.30 5.17 4.48 5.31 6.90 5.44 5.49 5.51 5.75 5.81 5.80 5.50 7.42 7.36 7.48 6.28 6.28 6.35 6.66 6.89 6.93 6.83 6.79 6.70 5.92 6.06 6.21 5.95 5.77 5.67 20D 7.01 6.96 7.13 6.99 6.98 6.97 7.38 7.44 7.48 7.34 7.26 7.23 7.34 6.43 6.42 6.80 6.40 6.24 5.94 6.30 7.22 7.29 7.44 7.14 7.23 7.32 7.27 7.21 7.17 7.82 7.85 7.91 7.76 7.74 7.70 7.80 7.40 6.94 7.48 7.06 6.96 6.94 6.85 Success rate [%] 3D 5D 10D 20D 61 46 41 30 55 45 39 35 48 43 36 28 52 45 41 35 58 47 42 35 62 49 43 35 85 74 53 41 34 100 100 68 65 63 100 84 68 65 63 100 68 67 65 62 84 68 67 65 62 100 84 68 65 63 100 84 68 66 63 100 100 69 66 62 80 60 54 50 27 78 56 52 44 31 82 57 50 46 26 86 58 55 50 41 88 63 57 50 37 88 78 62 52 34 84 84 63 50 26 55 40 37 35 30 52 37 33 27 28 40 35 30 27 26 47 36 33 29 28 52 42 36 31 31 56 45 40 34 33 2D 74 72 67 67 74 75 82 62 59 45 50 62 62 80 72 70 70 70 72 73 80 52 35 31 26 27 31 34 63 70 69 54 69 70 70 74 43 22 21 17 18 21 25 26 52 52 52 52 54 54 66 35 14 13 12 12 13 33 10 10 9 9 10 14 10 14 10 44 47 34 50 50 51 42 18 41 18 34 34 34 36 Table 6: Compass search - results of MetaMax(k) and corresponding xed restart strategies 28
  35. 35. restart strategies. There is almost no dierence between the performance of and cs-i-100, cs-m-100 which corresponds with results presented in [GK11]. Dierences in performance seem to diminish with increasing dimensionality and, for d=10 and d=20. all of the MetaMax strategies which use 100 instances perform almost the same. ECDF graph in gure 6 shows an interesting behaviour, where cs-i-100 start converging right away and overtake cs-k-100 cs-m-100 and while it is still in the process of initializing all of its instances. After that, MetaMax(k) catches up and for a certain interval all of the strategies perform the same. Then, MetaMax(k) stops converging, the other two strategies overtake it again and ultimately achieve better success rates. The sudden stop in MetaMax(k) convergence presumably happens when all of its best instances have already found their local optima, after which there is no possibility of nding better solutions without adding new instances, which Proportion of trials MetaMax(k) cannot do. 1.0 f1-24,5-D best 2009 0.8 cs-m-100 0.6 0.4 cs-i-100 0.2 cs-k-100 0.00 1 2 3 4 5 6 7 8 log10evaluations/D Figure 6: Compass search - ECDF of MetaMax variants using 100 instances In the next set of measurements using cs-m-50d, cs-i-50d and cs-k-50d, it became apparent that the increased limit on maximum number of instances does not cause any noticeable increase in performance for MetaMax and MetaMax(∞). The performance of MetaMax(k) was somewhat improved, but overall it is still worse than the other two MetaMax variants and slightly worse than the best restart strategies. These results are also presented in table 7. cs-k-50d and cs-m50d compared with the collection of best xed restart strategy results cs-f-comb. We have omitted cs-i-50d as its performance is very similar to that of cs-m-50d. ECDF graph in gure 7 shows results of In conclusion, we can say that using the compass search algorithm MetaMax and MetaMax(∞) perform better than even well tuned restart strategies, and that increasing the maximum number of allowed instances does not have any signicant eect on their performance. 29
  36. 36. all separ lcond hcond multi mult2 cs-k-100 cs-m-100 cs-i-100 cs-k-50d cs-m-50d cs-i-50d cs-f-comb cs-k-100 cs-m-100 cs-i-100 cs-k-50d cs-m-50d cs-i-50d cs-f-comb cs-k-100 cs-m-100 cs-i-100 cs-k-50d cs-m-50d cs-i-50d cs-f-comb cs-k-100 cs-m-100 cs-i-100 cs-k-50d cs-m-50d cs-i-50d cs-f-comb cs-k-100 cs-m-100 cs-i-100 cs-k-50d cs-m-50d cs-i-50d cs-f-comb cs-k-100 cs-m-100 cs-i-100 cs-k-50d cs-m-50d cs-i-50d cs-f-comb 2D 4.00 3.33 3.34 3.97 3.35 3.35 3.75 2.68 2.44 2.42 2.67 2.45 2.45 2.58 3.85 3.14 3.18 4.05 3.21 3.20 3.84 5.03 4.14 4.22 5.00 4.16 4.24 4.19 4.58 3.61 3.58 4.32 3.59 3.58 4.29 3.82 3.30 3.28 3.84 3.30 3.27 3.86 3D 4.84 4.19 4.15 4.73 4.14 4.16 4.46 3.85 3.01 3.04 3.54 2.99 3.03 3.03 4.54 3.88 3.85 4.46 3.82 3.90 4.18 5.67 5.22 5.16 5.62 5.21 5.16 5.38 5.81 4.70 4.57 5.70 4.62 4.59 5.17 4.25 4.09 4.08 4.25 4.02 4.05 4.48 log10 API 5D 5.62 5.21 5.14 5.64 5.22 5.23 5.52 4.71 4.19 3.83 4.79 4.16 4.21 4.33 5.45 5.09 5.11 5.48 5.17 5.17 5.34 6.13 5.84 5.89 6.14 5.88 5.88 5.99 6.47 6.13 6.16 6.44 6.12 6.14 6.59 5.30 4.76 4.73 5.29 4.77 4.75 5.31 10D 6.29 6.15 6.16 6.29 20D 6.97 6.88 6.87 6.31 5.29 5.22 5.27 5.49 5.23 5.20 7.01 6.88 6.88 6.90 5.80 5.85 5.78 6.14 5.82 5.87 4.93 5.50 6.15 6.15 6.28 6.14 6.10 6.19 6.09 6.06 6.35 6.70 6.53 6.53 6.67 6.53 6.58 6.63 7.23 7.09 7.29 7.32 7.32 7.48 7.33 2D 75 92 91 76 91 91 85 74 84 100 100 100 100 100 100 100 100 100 100 100 100 100 78 69 84 84 69 62 100 92 70 88 99 84 100 7.03 82 7.68 90 7.17 7.68 7.08 7.68 89 68 89 7.68 90 7.08 7.09 7.34 5.94 5.78 5.80 5.94 5.77 5.80 6.30 7.69 62 7.80 6.94 6.51 6.47 6.53 6.49 6.49 6.85 80 73 66 64 100 7.44 7.17 7.12 7.18 7.29 7.17 7.15 7.70 79 50 61 61 53 68 84 98 84 56 80 77 56 79 77 7.29 Success rate [%] 3D 5D 10D 20D 62 49 43 35 78 61 46 39 79 79 91 80 90 91 84 45 57 58 46 58 37 36 52 34 68 74 74 90 75 72 80 48 74 60 90 90 62 68 68 63 40 47 41 47 47 43 25 35 34 26 63 70 74 74 72 74 90 70 74 34 26 54 71 71 55 71 71 66 Table 7: Compass search - results of MetaMax strategies 30 46 45 39 46 37 38 46 39 41 66 34 63 63 63 66 64 66 66 66 63 66 64 56 62 34 33 34 28 32 57 35 40 35 66 52 55 56 57 50 34 40 38 40 38 35 14 19 19 16 20 19 14 51 53 52 52 52 52 42 26 33 33 32 33 34 33 10 12 12 11 12 12 10 34 50 50 50 50 50 36
  37. 37. Proportion of trials 1.0 f1-24,10-D best 2009 0.8 cs-m-50d 0.6 0.4 cs-k-50d 0.2 cs-f-comb 0.00 1 2 3 4 5 6 7 8 log10evaluations/D Figure 7: Compass search - ECDF of MetaMax variants using 50d instances 5.2 Nelder-Mead method The best restart strategies for each dimensionality are listed in table 8 and their results are compared in table 9. For the xed restart strategies we see the expected behaviour, where run lengths of the best strategies increase with the number of dimensions. However, there seem to be only two best objective function stagnation based strategies: nm-h-10d and nm-h-100d. Interestingly enough, the switch between them occurs between d=5 and d=10, which is also the point where the overall performance of the Nelder-Mead algorithm decreases dramatically. Dimensionality d=2 d=3 d=5 d=10 d=20 Fixed nm-f-100d nm-f-100d nm-f-500d nm-f-1000d nm-f-5000d Stagnation based nm-h-10d nm-h-10d nm-h-10d nm-h-100d nm-h-100d Table 8: Nelder-Mead - best restart strategies for each dimensionality The algorithm performs very well for low number of dimensions - d=2, d=3 and to some extent also d=5. With results for these dimensionalities approaching those of the best algorithms from the 2009 BBOB conference. On the other hand, the performance for higher dimensionalities is very poor, especially on the group hcond. The three best-of restart strategies, compared in table 9, are all quite evenly matched with nm-special being the best overall by a small margin and nm-f-comb being the worst. The comparison of MetaMax(k) with corresponding xed restart strategies, given in table 10, shows that MetaMax(k) performs better on multimodal functions and 31
  38. 38. all separ lcond hcond multi mult2 nm-f-comb nm-h-comb nm-special nm-f-comb nm-h-comb nm-special nm-f-comb nm-h-comb nm-special nm-f-comb nm-h-comb nm-special nm-f-comb nm-h-comb nm-special nm-f-comb nm-h-comb nm-special 2D 3.03 2.91 2.95 2.55 2.51 2.46 2.54 2.35 2.45 3D 3.92 3.89 3.88 3.22 3.88 3.55 3.49 3.33 log10 API 5D 4.97 4.71 4.84 4.47 4.38 4.50 4.35 4.07 2.07 1.96 4.59 2.86 2.50 4.14 3.24 3.15 2.48 3.12 4.27 5.82 1.95 4.52 3.42 3.22 3.26 3.31 6.07 6.05 3.85 3.83 3.92 7.07 6.89 6.96 5.58 4.95 5.32 7.79 7.71 Success rate [%] 2D 3D 5D 10D 20D 92 81 62 45 18 92 76 67 45 19 92 79 63 51 24 100 100 64 44 40 100 68 66 44 40 100 84 66 55 41 100 82 78 52 6 100 84 81 56 4 100 86 80 56 10 100 100 100 68 18 100 100 100 68 20 7.20 100 10D 6.31 6.19 20D 7.71 7.61 6.04 7.49 6.02 5.99 5.68 5.95 5.80 5.87 5.65 5.40 5.01 7.62 7.55 7.57 6.24 6.13 6.02 7.07 6.80 6.83 8.16 8.19 8.06 8.04 7.99 8.02 7.60 7.47 7.46 74 76 100 41 46 16 20 85 84 18 55 86 84 72 86 84 78 41 100 56 82 10 11 41 7 7 10 51 51 20 52 20 7 16 Table 9: Nelder-Mead - results of restart strategies worse on the other function groups. It is also apparent that increasing the number of used instances for MetaMax(k) leads to a higher overall success rate and faster convergence on multi-modal problems but slower convergence on ill-conditioned functions, as apparent from the ECDF Proportion of trials graph in gure 8. 1.0 f10-14,10-D best 2009 0.8 nm-k-20 0.6 0.4 nm-k-50 0.2 0.00 1 2 3 4 5 6 7 8 9nm-k-100 log10evaluations/D Figure 8: Nelder-Mead - ECDF comparing MetaMax(k) strategies In fact, the performance of the tested MetaMax(k) strategies on ill-conditioned 32
  39. 39. functions is worse than that of the corresponding restart strategies. This is the opposite of what was observed when using the compass search algorithm and can be explained by the fact that the Nelder-Mead algorithm, unlike compass search, can handle ill-conditioned problems very quickly and with high success rate (at least for low dimensionalities). Therefore, there is no need for the MetaMax mechanism of selecting multiple instances each round, as almost any instance is capable of nding the global optimum. Selecting more than one at the same time only serves to decrease the rate of convergence. Overall, the three tested MetaMax(k) strategies perform only slightly better than the corresponding xed restart strategies and are clearly worse than the best restart strategies, such as nm-special. Table 11 shows the results of other tested MetaMax strategies. Unfortunately, measurements for all dimensionalities were not nished in time, before the deadline of this thesis, therefore table 11 contains only partial results for some strategies. For the dimensionalities where the results of all the strategies are available, it is apparent that nm-m-100 and nm-i-100 outperform nm-k-100, both in terms of success rates and API. There are no signicant dierences in performance between MetaMax variants using 100 and 50d local search algorithm instances, as well as no observable dierences between performance of MetaMax and MetaMax(∞). nm-special, MetaMax and MetaMax(∞) have better success rates on function groups separ, multi and mult2 and as a result, In comparison with the restart strategy a better overall success rate. MetaMax and MetaMax(∞) also converge faster on slower on lcond and hcond. multi and mult2, but are The overall result being that they are better than the best restart strategy ,in terms of API, for d=2 and d=3, but are worse for d=5. Unfortunately, we cannot make comparisons for higher dimensionalities, where the results for MetaMax and MetaMax(∞) are not available. However, based on the fact that the advantage in performance of MetaMax over the restart strategy is lower in d=3 than in d=2, and that the restart strategy is better in d=5, we can extrapolate that MetaMax would likely also perform worse for higher dimensionalities. Even if there was an improvement in performance, the fact remains, that the Nelder-Mead method has such a bad performance in higher dimensionalities, that it is unlikely that MetaMax could improve it to a practical level. 33
  40. 40. all separ lcond hcond multi mult2 nm-f-1000d nm-f-2000d nm-f-5000d nm-k-20 nm-k-50 nm-k-100 nm-special nm-f-1000d nm-f-2000d nm-f-5000d nm-k-20 nm-k-50 nm-k-100 nm-special nm-f-1000d nm-f-2000d nm-f-5000d nm-k-20 nm-k-50 nm-k-100 nm-special nm-f-1000d nm-f-2000d nm-f-5000d nm-k-20 nm-k-50 nm-k-100 nm-special nm-f-1000d nm-f-2000d nm-f-5000d nm-k-20 nm-k-50 nm-k-100 nm-special nm-f-1000d nm-f-2000d nm-f-5000d nm-k-20 nm-k-50 nm-k-100 nm-special 2D 3.64 3.82 4.03 3.96 3.76 3.58 3D 4.27 4.38 4.52 4.56 4.52 4.38 2.95 3.88 3.24 3.59 3.65 3.85 3.61 3.21 3.95 4.02 4.07 4.32 4.35 4.38 log10 API 5D 5.07 5.14 5.40 5.26 5.22 5.23 10D 6.31 6.33 6.52 6.54 6.40 6.45 20D 7.76 7.71 7.71 7.65 7.68 7.75 2D 81 75 71 72 81 85 4.84 6.04 7.49 92 4.59 4.56 4.74 4.86 4.93 4.98 6.02 6.01 6.08 6.05 6.14 6.24 7.27 7.04 7.07 6.91 7.03 7.30 Success rate [%] 3D 5D 10D 20D 70 61 45 14 69 61 45 17 65 56 42 18 65 60 38 17 66 61 45 17 71 62 44 15 79 63 51 24 100 67 66 65 65 66 68 84 69 68 69 84 64 63 62 62 63 64 44 46 49 46 45 43 29 40 40 40 40 31 2.46 3.55 4.50 5.68 6.83 100 84 66 55 41 2.45 3.31 4.14 5.87 8.06 100 86 80 56 100 100 100 100 100 100 100 100 100 100 100 100 100 100 100 100 100 100 68 68 69 68 68 67 10 100 100 100 82 41 10 8 3.01 3.05 3.13 3.46 3.40 3.44 1.92 1.99 2.00 2.59 2.74 2.85 1.96 5.77 6.11 6.56 6.00 5.36 5.15 4.52 4.12 4.24 4.62 3.82 3.61 3.22 3.26 3.49 3.50 3.60 3.98 4.03 4.00 2.47 2.46 2.44 3.10 3.24 3.33 2.48 6.63 6.78 6.91 6.59 6.40 6.25 6.05 4.68 4.95 5.42 4.70 4.50 3.86 3.92 4.44 4.60 4.86 4.70 4.77 4.81 3.26 3.30 3.51 3.77 3.93 4.02 3.12 7.14 7.21 7.32 7.00 6.93 6.88 6.96 5.81 5.93 6.46 5.84 5.46 5.36 5.32 5.95 5.94 6.15 6.25 6.34 6.42 5.65 5.59 5.74 5.77 5.84 5.96 5.01 7.62 7.64 7.69 7.50 7.46 7.43 7.57 6.24 6.41 6.87 7.07 6.19 6.18 6.02 8.24 8.20 8.16 8.20 8.19 8.21 7.99 7.87 7.79 7.87 7.93 8.00 7.20 7.98 8.02 8.04 7.92 7.88 7.87 8.02 7.42 7.52 7.60 7.47 7.48 7.45 7.46 84 82 80 80 84 84 54 40 23 28 52 55 78 85 84 84 84 84 84 86 80 80 77 78 80 80 24 20 18 17 21 25 41 83 83 68 68 68 83 84 77 77 76 76 77 78 16 14 11 13 14 16 18 53 53 36 52 53 55 56 52 53 52 52 54 53 2 4 6 4 4 4 9 14 18 14 12 9 10 7 7 7 7 10 8 10 8 9 10 51 50 34 18 50 51 52 7 22 18 16 18 18 20 20 Table 10: Nelder-Mead - results of MetaMax(k) and corresponding xed restart strategies 34
  41. 41. all separ lcond hcond multi mult2 nm-k-100 nm-m-100 nm-i-100 nm-k-50d nm-m-50d nm-i-50d nm-special nm-k-100 nm-m-100 nm-i-100 nm-k-50d nm-m-50d nm-i-50d nm-special nm-k-100 nm-m-100 nm-i-100 nm-k-50d nm-m-50d nm-i-50d nm-special nm-k-100 nm-m-100 nm-i-100 nm-k-50d nm-m-50d nm-i-50d nm-special nm-k-100 nm-m-100 nm-i-100 nm-k-50d nm-m-50d nm-i-50d nm-special nm-k-100 nm-m-100 nm-i-100 nm-k-50d nm-m-50d nm-i-50d nm-special 2D 3.58 2.89 2.94 3.58 3D 4.38 3.76 3.76 4.42 2.85 3.74 2.88 2.95 3.21 2.62 2.65 3.18 2.55 2.61 3.76 3.88 4.38 3.39 3.36 3.44 2.74 2.76 3.40 2.65 2.76 4.42 3.38 3.40 3.55 4.00 3.73 3.80 4.07 3.73 3.80 2.45 3.31 2.46 2.85 2.54 2.63 2.84 2.52 2.64 1.96 5.15 3.88 3.90 5.19 3.84 3.67 4.52 3.22 2.64 2.72 3.25 2.66 2.70 3.26 3.33 3.15 3.21 3.39 3.13 3.22 2.48 6.25 4.96 4.91 6.26 4.92 4.89 6.05 3.86 3.55 3.51 3.88 3.53 3.50 3.92 log10 API 5D 5.23 4.96 4.95 5.23 4.93 - 10D 6.45 6.55 - 20D 7.75 7.83 - 4.84 6.04 7.49 4.98 4.82 4.81 5.05 4.79 - 4.50 4.81 4.71 4.70 4.89 4.67 - 4.14 4.02 4.00 4.02 4.11 4.03 - 3.12 6.88 6.52 6.51 6.77 6.50 6.96 5.36 4.70 4.68 5.26 4.60 5.32 6.24 6.33 - 5.68 6.42 6.52 - 5.87 5.96 6.35 - 5.01 7.43 - 7.37 7.57 6.18 6.19 - 6.02 7.30 7.52 - 6.83 8.21 8.26 - 8.06 8.00 8.06 - 7.20 7.87 - 7.82 8.02 7.45 7.60 7.46 2D 85 96 96 85 96 98 92 100 100 100 100 100 100 100 84 100 100 86 100 100 Success rate [%] 3D 5D 10D 20D 71 62 44 15 89 69 89 69 71 62 43 13 89 69 89 79 63 51 24 68 64 43 31 100 67 100 66 68 64 44 28 100 66 100 84 66 55 41 80 78 53 4 86 80 85 80 80 78 55 2 85 80 86 - 100 86 80 100 100 100 100 100 100 100 100 100 100 100 100 100 100 100 100 100 100 100 55 82 82 55 82 90 78 84 25 72 41 12 9 16 26 26 73 86 86 82 24 100 84 100 73 100 100 - 67 56 - 10 25 18 73 41 83 85 85 83 85 100 56 84 18 55 73 74 55 76 56 Table 11: Nelder-Mead - results of MetaMax strategies 35 10 10 51 51 - 52 9 7 8 7 20 16 - 20
  42. 42. 5.3 BFGS Results of restart strategies bfgs-f-comb, bfgs-h-comb and bfgs-special are shown in table 13. Best xed and objective function stagnation based restart strategies for each dimension, which were used to make bfgs-hfcomb and bfgs-h-comb, are listed in table 12. Dimensionality d=2 d=3 d=5 d=10 d=20 Fixed Stagnation based bfgs-f-100d bfgs-f-100d bfgs-f-200d bfgs-f-1000d bfgs-f-1000d bfgs-h-2d bfgs-h-2d bfgs-h-2d bfgs-h-2d bfgs-h-2d Table 12: BFGS - best restart strategies for each dimensionality For the selected best xed restart strategies, we see an ordinary behaviour where run lengths increase with dimensionality. However, for the stagnation based restart strategies, bfgs-h-2d is apparently the best for all dimensionalities. This is quite unusual but, in hindsight, not entirely unexpected. It has to do with the way our implementation of BFGS works: At the begging of each step, the algorithm estimates gradient of the objective function by using the nite dierence method. This involves evaluating the objective function at a set of neighbouring solutions, which are very close to the current solution. The number of these neighbour solutions is always 2d one for each vector in a positive orthonormal basis of the search space. A very quick way to detect convergence is to check, if objective function values at these points are worse than function value at the current solution. As it turns out, this is precisely what bfgs-h-2d does and also the reason why it works so well. In contrast with the surprisingly good results of bfgs-h-2d, the special restart strategy, which is based on monitoring value of the norm of the estimated gradient, performs very poorly and is clearly the worst of all the tested restart strategies. The other two strategies with bfgs-h-comb bfgs-h-comb and bfgs-f-comb have a very similar performance, being slightly better. Overall, BFGS has excellent results on ill-conditioned problems, even exceeding the performance of the best algorithms from the BBOB 2009 conference for certain dimensionalities on the group hcond, which is illustrated in gure 9. However, it performs quite poorly on multimodal functions (multi and mult2). Table 14 sums up the results comparing MetaMax(k) with corresponding xed restart strategies. In terms of success rate, both types of strategies perform the same on all function groups. In terms of rate of convergence, expressed by values of API, the results are similar to those observed when using the Nelder-Mead method: MetaMax(k) strategies perform better on hcond and separ. multi and mult2, but worse on lcond, The overall performance of MetaMax(k) across all function groups is worse than that of the corresponding xed restart strategies and consequently also worse than performance of the best restart strategies. 36

×