Non-linear PLS using Genetic Programming
Dominic Searson
School of Chemical Engineering and Advanced Materials
University of Newcastle upon Tyne
An online version of a thesis submitted to the Faculty of Engineering, University of
Newcastle upon Tyne in 2002, in partial fulfilment of the requirements for the
degree of Doctor of Philosophy.
© Dominic Searson 2002-2005
Preface to online version
This version is virtually identical to the original, but the formatting is wrong so it
may look a bit odd in places. I blame Word 2003 for this. Also, the page numbers
are different to the original.
Abstract
The economic and safe operation of modern industrial process plants usually
requires that accurate models of the processes are available. Unfortunately, detailed
mathematical models of industrial process systems are often time consuming and
expensive to develop. Consequently, the use of data based models is often the only
practical alternative. The need for effective methods to build accurate data based
models with a minimum of specialist knowledge has given impetus to the research
of automatic model development methods. One method, genetic programming (GP),
which is an evolutionary computational technique for automatically learning how to
solve problems, has previously been identified as a candidate for automatic non-
linear model development. GP has also been combined with a multivariate statistical
regression method called PLS (partial least squares) in order to improve its
performance (GP-PLS). One version of this method, called GP_NPLS2, was found
to give good performance but at a computational expense deemed too high for use as
a modelling tool. In this thesis, the GP-PLS framework is developed further. A novel
architecture, called team based GP-PLS, is proposed. This method evolves teams of
co-operating sub-models in parallel in an attempt to improve modelling performance
without incurring significant additional computational expense. The performance of
the team based method is compared with the original formulations of GP-PLS on
steady state data sets from three synthetic test systems. Subsequently, a number of
other modifications are made to the GP-PLS algorithms. These include the use of a
multiple gene sub-model representation and a novel training method used to
improve the ability of the evolved models to generalise to unseen data. Finally, an
extended team method that encodes certain PLS parameters (the input projection
weights) as binary team members is presented. The extended team method allows
the optimisation of the sub-models and the projection weights simultaneously
without recourse to computationally expensive iterative methods.
Table of Contents
Nomenclature ............................................................................................................ 7
1 Introduction...................................................................................................... 11
1.1 Background.................................................................................................. 12
1.2 Data based modelling .................................................................................. 13
1.2.1 Linear regression methods .................................................................... 13
1.2.2 Automatic model development............................................................. 13
1.2.3 Combining GP with PLS (GP-PLS) ..................................................... 14
1.3 Thesis aims and outline ............................................................................... 15
2 Evolutionary computation and genetic methods........................................... 18
2.1 Introduction ................................................................................................. 19
2.1.1 What is evolutionary computation? ...................................................... 19
2.2 EC algorithms - basic structure and functionality ........................................ 20
2.2.1 Historical perspective............................................................................ 21
2.2.2 Context of EC in engineering search and optimisation ........................ 22
2.2.3 Evolvability and the choice of representation....................................... 24
2.3 Selection mechanisms.................................................................................. 25
2.3.1 Fitness proportionate selection ............................................................. 26
2.3.2 Ranking selection.................................................................................. 28
2.3.3 Tournament Selection........................................................................... 29
2.4 Notes on evolutionary computational methodology.................................... 29
2.5 Genetic algorithms....................................................................................... 30
2.5.1 Background........................................................................................... 31
2.5.2 Reproduction operators in genetic algorithms ...................................... 31
2.5.3 Genetic algorithm flowsheet................................................................. 34
2.5.4 Genetic algorithms as function optimisers............................................ 36
2.5.5 Underlying processes: the schema theorem.......................................... 37
2.5.6 Towards flexible representation in genetic algorithms......................... 38
2.6 Genetic programming.................................................................................. 39
2.6.1 Program induction: parse trees as adaptable data structures................. 40
2.6.2 Reproduction operators in genetic programming ................................. 43
2.6.3 Specifying a genetic program ............................................................... 48
2.6.4 Terminals and functions: specifying the program components ............ 49
2.6.5 Handling constants in genetic programming ........................................ 50
2.6.6 Multiple populations............................................................................. 51
2.7 Summary...................................................................................................... 51
3 Genetic programming as a modelling tool..................................................... 53
3.1 Introduction ................................................................................................. 54
3.2 Representational approaches ....................................................................... 55
3.2.1 Steady state modelling.......................................................................... 55
3.2.2 Dynamic modelling............................................................................... 57
3.2.3 Process knowledge................................................................................ 59
3.2.4 Multigene representations..................................................................... 60
3.3 Improving the numerical properties of GP.................................................. 60
3.3.1 ERC optimisation methods ................................................................... 60
3.3.2 Code growth and parsimony pressure................................................... 63
3.3.3 Multi-objective fitness functions .......................................................... 65
3.4 Summary...................................................................................................... 66
4 Non-linear PLS using genetic programming ................................................. 67
4.1 Introduction ................................................................................................. 68
4.2 Linear regression methods........................................................................... 69
4.2.1 Multiple linear regression ..................................................................... 69
4.2.2 Principal component regression (PCR)................................................. 71
4.2.3 Projection to latent structures (PLS)..................................................... 75
4.3 The NIPALS algorithm ............................................................................... 81
4.3.1 Single output linear PLS....................................................................... 83
4.3.2 Determining the number of latent variables in linear PLS ................... 83
4.4 Non-linear PLS............................................................................................ 84
4.4.1 Non-linear polynomial PLS methods.................................................... 84
4.4.2 Non-linear artificial neural network PLS methods ............................... 88
4.4.3 Projection pursuit regression................................................................. 90
4.4.4 Overfitting in non-linear modelling...................................................... 91
4.5 GP-PLS........................................................................................................ 92
4.5.1 Introduction........................................................................................... 92
4.5.2 SGP-PLS configuration ........................................................................ 93
4.5.3 Team based GP-PLS............................................................................. 97
4.5.4 TGP-PLSa configuration .................................................................... 100
4.5.5 Determining the number of latent variables in GP-PLSa algorithms . 101
4.6 Summary.................................................................................................... 102
5 Comparing team and sequential GP-PLS.................................................... 106
5.1 Introduction ............................................................................................... 107
5.2 Test System 1: simulated cooking extruder............................................... 107
5.2.1 Experimental details............................................................................ 109
5.2.2 Interpretation of experimental data..................................................... 110
5.2.3 Qualitative comparisons between SGP-PLSa and TGP-PLSa models .... 114
5.3 Test system 2: the non-linear Cherkassky function................................... 117
5.3.1 Interpretation of experimental data..................................................... 119
5.4 Test system 3: simulated pH process......................................................... 124
5.4.1 Interpretation of experimental data..................................................... 125
5.5 Use of validation data to avoid overfitting................................................ 129
5.5.1 Split-sample validation ....................................................................... 129
5.5.2 Retrospective early-stopping (RES) ................................................... 131
5.5.3 Further analysis of a GP-PLSa model................................................. 134
5.6 Analysis of computational costs................................................................ 139
5.7 Summary.................................................................................................... 140
6 Multigene GP-PLS ......................................................................................... 142
6.1 Introduction ............................................................................................... 143
6.1.1 The multigene concept........................................................................ 143
6.1.2 Multigene GP for system identification.............................................. 144
6.1.3 Multigene GP algorithm modifications .............................................. 146
6.2 Multigene GP-PLS .................................................................................... 149
6.2.1 Multigene SGP-PLS............................................................................ 150
6.2.2 Multigene TGP-PLS ........................................................................... 151
6.3 Comparison of single gene GP-PLSa with multigene GP-PLSa............... 153
6.3.1 Test system 1: simulated cooking extruder data................................. 153
6.3.2 Test system 2: Cherkassky function data............................................ 157
6.3.3 Test system 3: simulated pH process.................................................. 161
6.4 Analysis of computational costs................................................................ 164
6.5 Discussion.................................................................................................. 166
6.6 Summary.................................................................................................... 166
7 Dynamic data partitioning ............................................................................ 167
7.1 Introduction ............................................................................................... 168
7.1.1 SGP-PLS implementation of DDP...................................................... 170
7.1.2 TGP-PLS implementation of DDP ..................................................... 172
7.2 Experimental comparison of GP-PLS algorithms with and without DDP 174
7.2.1 Test system 1: simulated cooking extruder data................................. 174
7.2.2 Test system 2: Cherkassky function data............................................ 180
7.2.3 Test system 3: simulated pH process data .......................................... 185
7.3 Analysis of computational costs................................................................ 190
7.4 Discussion.................................................................................................. 192
7.5 Summary.................................................................................................... 192
8 Extended teams............................................................................................... 194
8.1 Introduction ............................................................................................... 195
8.1.1 Extended binary teams........................................................................ 196
8.1.2 Projection weight encoding................................................................. 197
8.1.3 Extended team algorithm details......................................................... 200
8.2 Experimental comparison of team based GP-PLS algorithms with and
without projection weight encoding .................................................................... 201
8.2.1 Test system 1: simulated cooking extruder data................................. 201
8.2.2 Test system 2: Cherkassky function data............................................ 206
8.2.3 Test system 3: simulated pH process data .......................................... 210
8.3 Discussion.................................................................................................. 215
8.3.1 Effects of projection weight encoding ................................................ 215
8.4 Analysis of computational costs................................................................ 217
8.5 Summary.................................................................................................... 219
9 Conclusions and further work ...................................................................... 220
9.1 Conclusions ............................................................................................... 221
9.1.1 Team based GP-PLS........................................................................... 221
9.1.2 The multigene approach to SGP-PLSa and TGP-PLSa...................... 222
9.1.3 Dynamic data partitioning................................................................... 224
9.1.4 Extended teams ................................................................................... 225
9.2 Further work .............................................................................................. 226
9.2.1 Highly multivariate systems ............................................................... 226
9.2.2 Multiple output variables .................................................................... 227
9.2.3 Final comments................................................................................... 229
References.............................................................................................................. 230
Acknowledgements ............................................................................................... 242
Nomenclature
Acronyms
The following shorthand expressions are used to designate the various regression
methods and algorithms considered in this thesis. Where appropriate, the Chapters
containing experimental work with the algorithm are also cited.
CR Continuum regression.
EBNNPLS Error based neural network PLS.
EBQPLS Error based quadratic PLS.
EBRBFPLS Error based radial basis function PLS.
GP_NPLS1 The version of SGP-PLS implemented by Hiden (1998).
GP_NPLS2 The version of SGP-PLS implemented by Hiden (1998), which
incorporates optimisation of the input projection weights.
GP-PLS An umbrella term referring to GP based PLS methods.
INNPLS Integrated neural network based PLS.
LQPLS Linear quadratic PLS.
MLR Multiple linear regression.
MSGP-PLSa The default multiple gene implementation of SGP-PLS (Chapter 6).
MSGP-PLSb A multiple gene implementation of SGP-PLS that uses dynamic data
partitioning (Chapter 7).
MTGP-PLSa The default multiple gene implementation of TGP-PLS (Chapter 6).
MTGP-PLSb A multiple gene implementation of TGP-PLS that uses dynamic data
partitioning (Chapter 7).
MTGP-PLSc A multiple gene implementation of extended TGP-PLS that uses
dynamic data partitioning and a standard reflected binary Gray
coding of the input projection weights (Chapter 8).
NNPLS Neural Net PLS.
PCA Principal component analysis.
PCR Principal component regression.
PLS Partial least squares/ Projection to latent structures.
PLS1 Linear PLS with a single output variable.
PLS2 Linear PLS with multiple output variables.
PPR Projection pursuit regression
QPLS Quadratic PLS.
RBFPLS Radial basis function network PLS.
SGP-PLS Umbrella term for sequential GP-PLS algorithms.
SGP-PLSa The default single gene implementation of SGP-PLS used in this
thesis (Chapter 5).
SGP-PLSb A single gene implementation of SGP-PLS that uses dynamic data
partitioning (Chapter 7).
SPLINE-PLS A non-linear PLS algorithm that uses piecewise polynomials.
TGP-PLS Umbrella term for team based GP-PLS algorithms.
TGP-PLSa The default single gene implementation of TGP-PLS used in this
thesis (Chapter 5).
TGP-PLSb A single gene implementation of TGP-PLS that uses dynamic data
partitioning (Chapter 7).
TGP-PLSc A single gene implementation of extended TGP-PLS that uses
dynamic data partitioning and a standard reflected binary Gray
coding of the input projection weights (Chapter 8).
Symbols
α A function parameter.
β A function parameter.
ϕj The jth regressor vector in a basis function expression.
A A subset of the training data.
bi The ith linear model parameter in MLR. Also, the univariate
regression model coefficient for the ith latent variable stage in
PLS.
{b0,i,k, b1,i,k} Model coefficients for the kth GP individual and the ith inner
model in single gene GP-PLS.
B A subset of the training data.
B An (m × p) matrix of model coefficients.
Bq A (q × p) reduced matrix of PCR model coefficients.
{c0,i, …, c2,i} The quadratic coefficients for the ith inner model in QPLS.
corr(x,y) The correlation coefficient of x and y.
cov(x,y) The covariance of x and y.
dj The jth basis function coefficient.
e An (n × 1) error vector.
E An (n × p) error matrix.
f The number of basis functions.
f(Jk) The raw fitness of the kth individual in a population.
f′(Jk) The adjusted fitness of the kth individual in a population.
gj The jth basis function.
G The number of generations in a GP run.
Gj,k The jth gene in the kth multiple gene expression.
Gj,i,k In MSGP-PLS, the jth gene in the kth individual at the ith latent
variable stage.
In MTGP-PLS, the jth gene in the ith team member of the kth
team.
K Process gain.
Jk The kth individual in a population.
Ji,k In SGP-PLSa, the kth GP individual at the ith latent variable
stage. In TGP-PLSa, the ith team member in the kth team.
m The number of input (independent) variables.
n The number of measurements in the training set.
nA The number of measurements in training subset A.
nB The number of measurements in training subset B.
N The number of individuals in a GP population.
Ngk The number of genes in the kth multiple gene individual.
Ngm The maximum number of allowed genes in a multiple gene
individual.
Nlv,k The number of members in the kth team that are used to generate
the overall model.
Np The number of symbolic members in a team.
Nt Tournament size.
Nw The number of bits used to encode a projection weight in TGP-
PLSc and MTGP-PLSc (extended binary teams).
p The number of output (response) variables.
pi An (m × 1) vector of input loadings at the ith latent variable stage.
pi,k In TGP-PLS, an (m × 1) vector of input loadings at the ith latent
variable stage as calculated in the kth team evaluation.
Pc Probability of crossover.
Phigh Probability of high level crossover.
Pi The ith GP population.
Pm Probability of mutation.
Pr Probability of direct reproduction.
q The number of principal components retained in PCA.
qi A (p × 1) vector of PLS output loadings for the ith latent variable
stage.
r The radius of a circle.
rms(.) The root mean square value of (.).
R The set of ephemeral random constants.
s The Laplace operator.
Si The ith subpopulation.
t The generation index.
ti An (n × 1) vector of input scores for the ith latent variable stage.
ti(A) An (nA × 1) vector of input scores over the subset A for the ith
latent variable stage.
T An (n × m) matrix of PCA scores.
Tk The kth GP team in a population.
Tq An (n × q) reduced matrix of PCA scores.
ui An (n × 1) vector of output scores for the ith latent variable stage.
ui(A) An (nA × 1) vector of output scores over the subset A.
ui,k In TGP-PLS, the (n × 1) vector of the ith output scores vector as
calculated in the kth team evaluation.
ûi,k In SGP-PLS, the (n × 1) vector of the prediction of the ith output
scores vector by the kth GP individual.
var(x) The variance of the elements in the vector x.
V An (m × m) matrix of PCA loadings.
Vq An (m × q) reduced matrix of PCA loadings.
wi An (m × 1) vector of input projection weights at the ith latent
variable stage.
wi,k In TGP-PLS, an (m × 1) vector of input projection weights at the
ith latent variable stage as calculated in the kth team evaluation.
Wi,k The ith binary team member in the kth team in TGP-PLSc and
MTGP-PLSc.
x An input (predictor) variable.
xj An (n × 1) vector of scaled measurements on the jth input
variable.
xt The value of the input variable at time t.
X An (n × m) input data matrix. The jth column contains scaled
measurements of the jth input variable.
X̂ An (n × m) matrix of estimated values for X.
X(A) An (nA × m) matrix of input values over the subset A.
Xi The (n × m) deflated input matrix at latent variable stage i.
y An output (response) variable.
yt The value of the output variable at time t.
y An (n × 1) vector of scaled measurements on a single output
variable.
y(A) An (nA × 1) vector of output values over the subset A.
yj An (n × 1) vector of scaled measurements on the jth output
variable.
Y An (n × p) output data matrix. The jth column contains scaled
measurements of the jth output variable.
Yi The (n × p) deflated output matrix at latent variable stage i.
Chapter 1
Introduction
1.1 Background
Industrial plants can be operated most safely and economically when the engineer
has detailed fundamental knowledge of how the component processes work. Ideally,
every physical aspect of each process is understood in detail, allowing it to be
efficiently controlled so that it performs under optimal conditions. In the real world,
however, this is rarely the case. Modern industrial processes tend to be highly
complex, involving physical and chemical interactions that are often poorly
understood at the quantitative level. Sometimes the control of these processes is
based purely on rules gleaned from experience of operating the plant. This lack of
fundamental process knowledge precludes the use of detailed mathematical models,
which are, in any event, usually too time consuming and expensive to develop.
On the other hand, whilst fundamental knowledge of process behaviour is difficult
to obtain, process data is not. Plant instrumentation and computer systems routinely
collect and store data on hundreds of variables such as flow rates, pressures,
temperatures as well as measures of product quality. This data affords the possibility
of constructing entirely empirically determined models of how the process behaves
(models of this type are usually called “black box” models and allow the outputs of
the process, e.g. product quality variables, to be predicted using the inputs of the
plant, e.g. reactant flowrates, chemical composition etc.). The drawbacks of black
box modelling are that the models do not usually have any physical interpretation
and they cannot safely extrapolate beyond the range of the data that were used to
train them.
Process data can also be used in conjunction with existing process knowledge; in
this case, only certain relationships are determined from the data and these are
combined with a physically derived mathematical model. This approach is often
called “grey box” modelling.
However, in the absence of formal physical and chemical equations, methods that
rely solely on plant data to quickly develop cost-effective and accurate models are
needed. The concept of automatic model development, i.e. methods that allow good
data based models to be built with minimum expert knowledge, is of particular
interest.
1.2 Data based modelling
1.2.1 Linear regression methods
Methods that assume a linear relationship between the process inputs and outputs are
the traditional tool of the engineer because they are fairly simple in structure and can
sometimes result in models that have some physical interpretation. These are usually
in the form of regression models, such as multiple linear regression (MLR) and,
more recently, principal component regression (PCR) and partial least squares
regression (PLS; Wold, 1975)¹. The latter two methods are developments of MLR
that have certain properties that make them useful for finding linear relationships
between output variables and large numbers of highly correlated input variables.
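As a point of reference for these regression methods (a generic sketch, not part of the original thesis; numpy is assumed, and all data here are synthetic), an MLR model y = Xb can be fitted by ordinary least squares:

```python
import numpy as np

# Illustrative MLR sketch: fit coefficients b in y = Xb by ordinary
# least squares, using synthetic, well-conditioned input data.
rng = np.random.default_rng(0)
X = rng.normal(size=(50, 3))                   # 50 measurements of 3 inputs
b_true = np.array([1.5, -2.0, 0.5])
y = X @ b_true + 0.01 * rng.normal(size=50)    # near-noiseless linear process

b_hat, *_ = np.linalg.lstsq(X, y, rcond=None)  # least squares estimate of b
y_pred = X @ b_hat                             # model predictions
```

When the columns of X are highly correlated, this estimate becomes ill-conditioned, which is precisely the situation that PCR and PLS are designed to handle.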
PLS works by projecting the input and output data onto low dimensional subspaces
and then fitting univariate regression models between the projections. It has proved
to be an effective way to develop multivariate process models from noisy, high
dimensional and correlated data (e.g. see Wise and Gallagher, 1996). A drawback
with the PLS regression method is that it assumes a linear relationship between
inputs and outputs, whereas the behaviour of industrial processes is frequently non-
linear. There are ways of extending the PLS framework to capture non-linear
relationships, however, and these are described in Chapter 4.
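To make the projection idea concrete, the single-output linear case (PLS1, detailed in Chapter 4) can be sketched as follows. This is a generic NIPALS-style illustration rather than code from the thesis; numpy is assumed:

```python
import numpy as np

def pls1(X, y, n_lv):
    """Sketch of single-output linear PLS: one latent variable per pass,
    deflating X and y before the next stage."""
    W, P, b = [], [], []
    Xi, yi = X.copy(), y.copy()
    for _ in range(n_lv):
        w = Xi.T @ yi
        w = w / np.linalg.norm(w)      # input projection weights w_i
        t = Xi @ w                     # input scores t_i = X_i w_i
        bi = (t @ yi) / (t @ t)        # univariate inner model y ~ b_i t_i
        p = Xi.T @ t / (t @ t)         # input loadings p_i
        Xi = Xi - np.outer(t, p)       # deflate: remove explained variation
        yi = yi - bi * t
        W.append(w); P.append(p); b.append(bi)
    return np.array(W).T, np.array(P).T, np.array(b), yi

# With as many latent variables as inputs, PLS1 reproduces the least
# squares fit, so a noiseless linear system is fitted exactly:
rng = np.random.default_rng(0)
X = rng.normal(size=(30, 4))
y = X @ np.array([1.0, 2.0, -1.0, 0.5])
W, P, b, resid = pls1(X, y, n_lv=4)
```

Each pass extracts one latent variable: projection weights wi, input scores ti, an inner model coefficient bi and loadings pi. The non-linear PLS methods discussed later replace the univariate inner regression with a non-linear function.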
1.2.2 Automatic model development
Methods to develop models that can effectively capture non-linear relationships in
process data with a minimum of expert knowledge have been extensively
researched. The most well known of these techniques is the artificial neural network
(ANN). This method has had a good degree of success in the process industries (e.g.
see Willis et al., 1992; Lennox, 1996). There are certain disadvantages to using
ANNs, however, not least of which is that the user must make informed decisions
about the size, topology and training method, as well as the selection of appropriate
network inputs (McKay et al., 1997). It is arguable that a meta-level optimisation
layer could be implemented to make these choices (although this may be
computationally costly), or that established rules of thumb could be used to
determine network topologies. However, Sarle (1997) points out that many of these
rules are 'nonsense' and states that "in most situations, there is no way to determine
the best number of hidden units without training several networks and estimating the
generalisation error of each".

¹ PLS is also known as "projection to latent structures" for reasons that will become
apparent in Chapter 4.
Another relatively recent technique that shows promise in the automatic
development of process models is that of genetic programming (GP; Koza, 1992).
GP is an evolutionary search method that was originally designed as a way of
automatically learning how to solve problems by applying artificial selection and
reproduction to populations of solutions that are encoded as variable length tree
structures. GP was identified as a good candidate for automatic model development
because it appeared to be able to automatically select the appropriate input variables
as well as discover the model structure and parameters simultaneously (this process
is known as symbolic regression). Whilst this is true to a certain extent, some
research at the University of Newcastle has shown that the standard form of GP (i.e.
the original form of GP using a single population of trees constructed from
arithmetic and simple non-linear functions with no use of advanced architectures or
representations) does not generally perform any better than feedforward artificial
neural networks with sigmoidal activation functions (Hiden, 1998).
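To illustrate the representation (a generic sketch, not the implementation used by Koza or Hiden; numpy is assumed and the model shown is arbitrary), a symbolic regression individual can be held as a nested tuple and evaluated column-wise over the input data:

```python
import numpy as np

# A GP parse tree as nested tuples. Terminals are input variable indices
# or numeric constants; internal nodes are arithmetic primitives.
FUNCS = {'+': np.add, '-': np.subtract, '*': np.multiply, 'tanh': np.tanh}

def evaluate(tree, X):
    """Recursively evaluate a parse tree over an (n x m) input matrix X."""
    if isinstance(tree, int):        # terminal: index of an input variable
        return X[:, tree]
    if isinstance(tree, float):      # terminal: numeric constant
        return np.full(X.shape[0], tree)
    op, *args = tree                 # internal node: (primitive, children...)
    return FUNCS[op](*(evaluate(a, X) for a in args))

# e.g. the candidate model tanh(x0 * x1) + 0.5:
model = ('+', ('tanh', ('*', 0, 1)), 0.5)
```

Because crossover and mutation act directly on these tree structures, the model form itself, and the subset of input variables it uses, are both open to evolution.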
It is also debatable whether the GP approach is any more ‘automatic’ than the ANN
approach, since a number of user decisions must be made with regard to population
size, architecture, selection method, choice of primitives etc. However, it remains to
be seen if GP is, in practice, more amenable to automatic model development.
1.2.3 Combining GP with PLS (GP-PLS)
In an attempt to improve the ability of GP to model steady state non-linear
multivariate systems, a hybrid of GP and the PLS modelling method was proposed
(GP_NPLS1; Hiden, 1998, Hiden et al., 1998). This method, in common with other
non-linear PLS methods, sequentially supplies a series of non-linear univariate
models to fit the relationships between the data projections. The GP_NPLS1 method
was found to increase the accuracy of the evolved models, in terms of the prediction
errors on unseen data, but was not found to give any better performance than an
equivalent neural network based PLS method.
A variant on this method, GP_NPLS2, was also proposed. GP_NPLS2 retains the
same basic architecture as GP_NPLS1 but it incorporates an iterative non-linear
least squares routine that optimises the data projection directions. GP_NPLS2 gave
better results than GP_NPLS1 and outperformed the equivalent neural network PLS
approach (i.e. neural network PLS with optimised projection directions). Despite
this improvement, GP_NPLS2 was deemed unacceptable for use as a modelling tool
due to the extremely high computational cost requirements of the external optimiser.
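The cost problem can be seen schematically: when the projection directions are externally optimised, every candidate inner model triggers its own optimisation over the weight vector w. The sketch below is a deliberately crude stand-in (random restarts in place of the non-linear least squares routine actually used); numpy is assumed and all names are illustrative:

```python
import numpy as np

# Schematic of externally optimised projection weights: for ONE fixed
# candidate inner model, the weights w are tuned by random restart search.
rng = np.random.default_rng(1)
X = rng.normal(size=(40, 3))
y = np.tanh(X @ np.array([0.6, -0.7, 0.4]))  # non-linear in one projection

def inner_model(t):                  # a fixed candidate inner model f(t)
    return np.tanh(t)

def sse_for_weights(w):
    w = w / np.linalg.norm(w)        # projection weights kept at unit length
    return np.sum((y - inner_model(X @ w)) ** 2)

# Inner optimisation loop: thousands of weight evaluations per candidate.
best_w, best_sse = None, np.inf
for _ in range(2000):
    w = rng.normal(size=3)
    s = sse_for_weights(w)
    if s < best_sse:
        best_w, best_sse = w / np.linalg.norm(w), s
```

Even this toy inner loop performs thousands of extra model evaluations for a single candidate, and in GP-PLS it would be repeated for every individual in every generation, which is why the accuracy gains of GP_NPLS2 came at such a high price.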
However, it was felt that the GP-PLS concept had unexplored potential and there
might be ways of modifying it so that better results could be attained without
resorting to the use of “brute force” optimisation methods, which are both time
consuming and aesthetically unappealing from an engineering standpoint. This led to
to the starting point of the work described in this thesis, the outline of which is
detailed in the next section.
1.3 Thesis aims and outline
The primary aim of this work is to demonstrate that the GP-PLS method has
potential as a viable process systems modelling tool by improving its performance
without recourse to methods that would greatly increase the computational load, such
as iterative optimisation routines. This is accomplished by testing some novel GP-
PLS architectures and evaluating training and validation methods on some simple
steady state test systems. This thesis begins, however, by introducing the field of
evolutionary computation (focussing particularly on genetic algorithms and genetic
programming) before reviewing the use of genetic programming as a modelling
(system identification) tool with industrial process applications. A brief abstract of
each of the following chapters is provided below:
Chapter 2: Evolutionary computation and genetic methods
The underlying principles of evolutionary computation are briefly introduced as well
as a more detailed discussion of the mechanisms of genetic algorithms and genetic
programming.
Chapter 3: Genetic programming as a modelling tool
A review of the use of GP as a modelling tool is provided, with an emphasis on
applications for process systems modelling. Representational and numerical issues
relevant to the model building capabilities of GP are discussed.
Chapter 4: Non-linear PLS using genetic programming
The mechanisms underlying linear PLS regression and related approaches are
discussed as well as non-linear extensions to the PLS framework, such as neural net
based PLS and the sequential GP-PLS algorithms proposed by Hiden (1998). A
novel GP-PLS architecture, team based GP-PLS, is proposed. This evolves a
population of co-operating teams of models to solve the modelling task in parallel,
unlike the original GP-PLS method in which the models are supplied sequentially by
consecutive GP runs.
Chapter 5: Comparing team and sequential GP-PLS
A comparative study of the team based GP-PLS architecture with the sequential
architecture proposed by Hiden (1998) is described. This study is based on the use
of data obtained from three synthetic test systems: a simulated cooking extruder, a
non-linear mathematical function and a simulated pH process. Some properties of
the evolved models are described and an attempt is made to improve the
generalisation performance of the evolved models by use of a split-sample
validation method.
Chapter 6: Multigene GP-PLS
The combination of the “multigene” GP method (Hinchliffe et al., 1996) with both
team based and sequential GP-PLS methods is described. This method decomposes
GP models into modular substructures in order to improve their evolvability
properties. A comparative study of the multigene GP-PLS algorithms and the single
gene algorithms on the three synthetic test systems is described.
Chapter 7: Dynamic data partitioning
A novel method for GP-PLS training is proposed and combined with the team and
sequential algorithms. This method, provisionally called dynamic data partitioning,
is intended to improve the generalisation of the evolved models by reducing model
overfitting. A comparative study of the various GP-PLS methods with and without
the dynamic data partitioning method on the three synthetic test systems is
described.
Chapter 8: Extended teams
A novel team based GP-PLS architecture whereby the data projection directions are
encoded as binary team members and evolved in parallel with the GP-PLS models is
proposed. A comparative study (on the three test systems) of the extended team
algorithm with the algorithms developed in Chapter 7 is described and some
properties of the evolved models are discussed.
Chapter 9: Conclusions and further work
A number of comments and conclusions on the development of the GP-PLS
framework in this thesis are offered. Suggestions for further work in the area are
provided.
Chapter 2
2 Evolutionary computation and genetic methods
2.1 Introduction
This thesis is concerned with the application of genetic search to the projection
based regression method of partial least squares (PLS). Genetic search methods are
members of a closely related family of procedures called evolutionary computation
(EC). The purpose of this chapter is to introduce EC, and subsequently genetic
algorithms and genetic programming by exposition of the underlying principles of
simulated evolution. Concepts that are of importance in EC (and in the work that is
discussed in the following chapters) such as evolvability and representational issues
are outlined. The chapter concludes with a discussion of various features of genetic
programming as a prelude to the use of GP in a system identification framework.
2.1.1 What is evolutionary computation?
Evolutionary computational methods are a class of iterative learning algorithms that
imitate the natural processes of biological evolution in order to solve science and
engineering problems. EC methods utilise a set of concepts and arguments that are
essentially identical to those that underpin the modern theoretical framework of
evolutionary biology. This framework is known as neo-Darwinism as it builds on
the ideas of “survival of the fittest” and cumulative selection first proposed
coherently by Darwin (1859). If one combines Darwin’s ideas with the notion of
differences in genetic encoding (the genotype) mapping to differences in physical
attributes (the phenotype) then evolution can be regarded as a statistical process
operating on complex data structures. It is then possible to view evolution as an
open-ended optimisation process that can be formalised, modelled and exploited.
The basic requirements for evolution to occur in a biological context are (e.g.
Darwin, 1859):
• There is a finite population of individuals.
• The individuals can reproduce and pass on their traits to their offspring.
• There should be a variety of traits within the population of individuals.
• The traits of the individuals should be related to their ability to survive (i.e.
the variety of the individuals should enable them to compete for the right to
be selected for reproduction).
In addition to these points there should be added the explicit requirement of
encoding of the traits:
• The salient characteristics of the individual (i.e. those characteristics which
impart upon the individual the ability to reproduce in its environment) should
be partly transmissible to its offspring via some sort of encoding system.
Furthermore, the transmission of the traits should not be error free otherwise
no new traits can develop.
It is now almost universally accepted that, in nature, it is predominantly the DNA
code that determines the variability of individuals. Thus, it is the complex
interactions between the genetic information and the processes of selection in a
finite population that give rise to the evolutionary driving force. In nature this
evolutionary pressure is not directed towards a particular goal; it is open ended,
whereas in simulated evolutionary processes (for the most part) the evolution is
directed towards solving a specific problem.
2.2 EC algorithms: basic structure and functionality
EC algorithms work by iteratively processing a population of individuals, each of
which forms a candidate solution to some problem. At the beginning of the EC
algorithm, it would not be expected that any individual would constitute a good
solution, since the initial population is randomly generated. The population is then
forced (by means of a process analogous to natural selection) to evolve with the goal
of producing better and better candidate solutions. Different EC algorithms use
different representations of candidate solutions. Representations range from real-
valued vectors in evolutionary programming (Fogel, Owens and Walsh, 1966) and
evolutionary strategies (e.g. see Schwefel, 1995) to bit strings and symbolic tree
structures in genetic algorithms (Holland, 1975) and genetic programming (Koza,
1992).
The level of performance of any particular individual can be ascribed some
numerical value; this is frequently referred to as its fitness. Assigning the fitness
involves evaluating the individual against some problem dependent objective
function (called a fitness function in EC literature) and is, for non-trivial
applications, the most time consuming part of the algorithm. The rate of replication
for any individual is determined by its fitness value and the fitness values of the
other individuals in the population. Exactly how the replication occurs is determined
by the selection scheme used to pick the individuals for replication and the genetic
operators used to perform the replications (possibly with modifications to the
individuals; analogous to mutation and sexual reproduction in biological evolution).
Those individuals that perform well, i.e. those of above average fitness, must have a
selection rate higher than those that perform relatively poorly in order for their
genetic information to successfully penetrate, and remain tenable in, the population.
The algorithm is terminated according to some pre-specified termination criteria.
This is generally dependent on the type of algorithm and the application. The most
frequently used method is to allow a pre-set number of iterations to elapse before
termination although in some situations, where there is some a priori quantitative
knowledge of the problem solution, it is possible to terminate the procedure when a
member of the population gives an acceptable solution.
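The generic loop described above can be sketched in a few lines. The following is an illustrative Python sketch (not code from the thesis); the function names, the toy objective and the greedy selection and Gaussian mutation operators are all invented for the example:

```python
import random

def evolve(init, fitness, select, reproduce, pop_size=50, generations=100):
    """Generic EC loop: random initial population, iterated selection and
    reproduction, termination after a pre-set number of generations."""
    population = [init() for _ in range(pop_size)]
    for _ in range(generations):
        scores = [fitness(ind) for ind in population]
        population = [reproduce(select(population, scores))
                      for _ in range(pop_size)]
    scores = [fitness(ind) for ind in population]
    return max(zip(scores, population))[1]  # fittest individual found

# Toy usage: maximise f(x) = -(x - 3)^2 with greedy selection and
# Gaussian mutation as the only reproduction operator.
best = evolve(
    init=lambda: random.uniform(-10, 10),
    fitness=lambda x: -(x - 3) ** 2,
    select=lambda pop, scores: max(zip(scores, pop))[1],
    reproduce=lambda x: x + random.gauss(0, 0.1),
)
```

The fitness evaluation inside the loop is, as noted above, where virtually all of the computational effort is spent in non-trivial applications.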
2.2.1 Historical perspective
The roots of EC can be traced back to the late 1950s and early 1960s with works
published in the general area of machine learning by a number of contributors e.g.
Friedberg (1958), Holland (1962). Interest in the use of evolutionary methods for
performing adaptation continued throughout the 1970s but was mostly restricted to a
relatively small number of researchers with access to suitable computer hardware,
and who published only in a narrow spectrum of journals. This situation persisted
for a number of years and it was not until the early 1990s that the previously
disparate components of (what is now termed) EC formally cohered (Bäck et al.,
1997).
The mid 1980s saw the beginning of a more widespread interest in EC methods.
Largely, this was catalysed by the availability of relatively cheap, high performance
computing to researchers in a variety of technical disciplines. Difficult optimisation
problems (i.e. those posed in noisy, uncertain and highly constrained domains), in
particular, have become popular candidates for the application of EC techniques. As
greater computing power becomes available, however, it is expected that EC
methods will increasingly be used for design purposes.
Most of the evolutionary algorithms around today can be loosely classified as
belonging to one of the following three categories: genetic algorithms (subsuming
genetic programming and classifier systems), evolutionary programming and
evolutionary strategies. These approaches are highly related but, historically, they
were developed independently (Fogel, 1997).
2.2.2 Context of EC in engineering search and optimisation
A number of search and optimisation methods have been developed for wide
ranging uses in the fields of science, engineering and economics. They are typically
applied when the solution (or solutions) to the problem being examined cannot be
readily expressed in a neat, closed analytical form. This is usually the case in the
majority of real-world problems: often the available information is not sufficient for
a simple solution to be deduced, or the mathematical analysis may be intractable.
Hence, further techniques to search for a satisfactory solution are usually required.
Calculus driven and enumerative methods form the traditional base of search and
optimisation techniques. Exhaustive enumerative methods, e.g. dynamic
programming (Bellman, 1957), directly evaluate the objective function for possible
solutions point by point. The regions of search are progressively refined and
explored (e.g. using geometrical considerations) so that the number of points
evaluated does not become too large and degenerate the procedure into a random
search. However, these techniques are not efficient and they “break down on
problems of moderate size and complexity” (Goldberg, 1989).
Calculus driven techniques assume that the space to be searched can be treated as an
analytically well-behaved surface with extrema that can be located using derivative
functions. The efficacy of such a search is highly dependent on the topography of
the optimisation surface and the initial conditions. Again, the assumptions that need
to be made about the behaviour of the search space are quite strong and, in general,
are not satisfied by the majority of real-world problems. The difficulty in solving
these problems has been the main driving force behind the development of
stochastic (including evolutionary) methods.
It is possible to employ algorithms that do not rely on the search space being
continuous and well behaved. Evolutionary algorithms are included in this class of
search methods, as is the (non-population based) class of algorithms known as
simulated annealing (Metropolis et al., 1953). This method searches points in the
space of possible solutions in a probabilistic manner. Previous candidate solutions
are perturbed according to a statistical schedule analogous to the annealing method
in metal cooling. Initially, when the “temperature” is high, perturbations to previous
solutions are accepted with high probability. As the algorithm continues, a cooling
schedule is imposed so that future perturbations are accepted with ever decreasing
probability. The mechanisms involved in a simulated annealing optimisation are
similar to those occurring in certain evolutionary algorithms, the main differences
are in the physical analogy used and the use of a population of search points in the
evolutionary case.
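The acceptance rule and cooling schedule described above can be sketched as follows. This is an illustrative Python sketch, not an implementation from the thesis; the parameter values and the one-dimensional test problem are invented for the example:

```python
import math
import random

def simulated_annealing(objective, x0, t_start=10.0, t_end=1e-3,
                        cooling=0.95, step=0.5):
    """Minimal simulated annealing sketch for one-dimensional minimisation.
    Early on (high 'temperature') worsening perturbations are accepted with
    high probability; the geometric cooling schedule makes such acceptances
    ever less likely as the search proceeds."""
    x, fx = x0, objective(x0)
    t = t_start
    while t > t_end:
        candidate = x + random.gauss(0, step)
        fc = objective(candidate)
        # Accept improvements always; accept worse moves with
        # probability exp(-delta / T), which shrinks as T is cooled.
        if fc < fx or random.random() < math.exp(-(fc - fx) / t):
            x, fx = candidate, fc
        t *= cooling
    return x
```

Note that, unlike an evolutionary algorithm, only a single search point is maintained and perturbed throughout the run.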
One of the frequently cited advantages of evolutionary algorithms over traditional
methods is that the use of a population of search points is less sensitive to the initial
conditions of the search, and that the explicit parallelism of the algorithms makes
them less likely to become trapped at local extrema in a multi-modal search space.
Another reason for their popularity is that it is not necessary to evaluate or estimate
derivatives during the search: no auxiliary information other than objective function
evaluations is required.
It is important to remember that, despite the recent excitement, evolutionary
algorithms are not voodoo. They, like any other search methods, have their
limitations. It is, in general, not possible to establish convergence proofs for
evolutionary methods (although this can be done in certain cases, e.g. Rudolph,
1994) and it is somewhat unclear (unlike calculus based techniques) under what
circumstances evolutionary algorithms perform poorly or what representation is
most appropriate for a given task. Another problem is that, as the complexity of the
target problems becomes greater, the computational demands of these algorithms
begin to outstrip the available resources. Hence, the cost of repeated experimentation
can be prohibitive. This can often limit their usefulness for certain purposes. Indeed,
the perceived high computational cost to performance ratio involved in the best
performing GP-PLS algorithm proposed by Hiden (1998) was the prime motivation
for much of the work tackled later in this thesis.
2.2.3 Evolvability and the choice of representation
In addition to the requirements outlined in Section 2.1.1, it is necessary for the
population as a whole to have a property known as evolvability, i.e. “the ability of
random variations to sometimes produce improvement” (Wagner and Altenberg,
1996). Without this “hidden” criterion, the requirements listed earlier are not
sufficient to ensure evolution. Evolvability is, in general, a complex property of the
way that the genotype (i.e. the coding of the individual as an abstract mathematical
entity) maps to the phenotype (i.e. the “physical” structure of the individual, which
ultimately dictates its behaviour). In EC designs this is often, mistakenly, considered
to be a purely representational problem, with the fitness function regarded as a self-
evident and immutable goal, rather than as a functional ingredient of the
evolutionary process (Jakobi, 1996). This means a somewhat more lateral approach
to fitness function design might be needed in EC designs than is normally required
for traditional search techniques.
Evolvability is a necessary property of a successful EC application but how does
one, from a practical standpoint, go about achieving it? Actually, up to a point, this
may not be as difficult as it sounds: common EC representations (e.g. genetic
algorithm bit strings) are common because they are structures that, empirically, have
been shown to exhibit good evolvability properties in a number of situations.
Furthermore, the form of the fitness function is, to some degree, pre-determined by
the nature of the application domain. The skill of the designer then lies in modifying
these basic components in order to further improve the evolvability of the
population, e.g. by augmenting the fitness functions with penalty functions (e.g.
Searson et al., 1998) or by modularising the representation by ensuring that, as far as
possible, functionally independent phenotypic effects are represented by
syntactically independent genotype structures (Altenberg, 1994).
So, whilst there are some general avenues of exploration open to the designer of an
EC application wishing to maximise evolvability, there are no hard and fast rules for
accomplishing this and so the designer frequently must utilise an iterative, heuristic
procedure, based on the recommendations of the available literature and experience
of similar applications. The methodology of EC designs is discussed further in
Section 2.4.
2.3 Selection mechanisms
The selection mechanism is central to the successful operation of evolutionary
algorithms. It must improve the average fitness of the next generation by giving
individuals with a high relative fitness a high probability of being selected. Then,
reproduction operators (such as mutation and crossover in the case of genetic
algorithms) can be applied to the selected individuals to create new individuals,
thereby investigating new regions of the search space. Thus, the selection
mechanism allows the exploitation of genetic material currently contained within the
population, with a view to its further improvement in future generations by means of
evolutionary reproduction operators. This is in stark contrast to traditional “hill
climbing” techniques that focus only on transforming the current best solution into a
better one, ignoring the possibilities of previous partial solutions, and leaving the
approach susceptible to being trapped in a local optimum. By allocating trials to
inferior solutions, evolutionary algorithms delay the immediate moderate payoff in
expectation of a higher future payoff. A balance is struck between the exploitation of
individuals with a higher than average fitness in the population and the exploration
of individuals that are not quite as good (but may contain genetic information that
could be useful when mutated to a slightly different form or suitably combined with
other individuals). The weighting of this balance is determined by the selection
pressure over the population.
The term “selection pressure” is frequently used in an informal manner¹ to indicate
the probability that individuals with a given fitness value have of being picked by
the selection process. This term can also be applied to a population as a whole. If it
is said that there is a high selection pressure over a population it usually means that the
selection mechanism is heavily biased towards individuals of high relative fitness,
with the advantage of greatly raising the average fitness of the next generation. If too
high a selection pressure is applied, however, this could have the undesirable effect
of causing a loss of diversity in the next generation and premature convergence of
the algorithm to an unsatisfactory solution. Conversely, too little selection pressure
and the algorithm stagnates and, in the degenerate case, becomes little better than a
random search. The choice of the level of selection pressure to exert on the
population throughout the course of an evolutionary algorithm is a major
consideration. In most applications, however, the designer does not have sufficient a
priori information to gauge the effect of a given selection scheme on the success of
the evolutionary algorithm and so must usually opt for mechanisms that have proved
successful in the past or have been recommended in the literature.
2.3.1 Fitness proportionate selection
Fitness proportionate selection (often known as roulette wheel selection) is probably
the simplest selection mechanism to implement and is the method that was originally
chosen for use with the earliest genetic algorithms by Holland (1975). It may be
stated simply as: the selection probability p(Jk) of the kth individual Jk in the current
population P(t) = { J1, J2, …, JN } at generation t is directly proportional to the
fitness value f(Jk) of the individual.
¹ Additionally, there are a number of formal measures of selection pressure (or ‘selection intensity’).
Blickle and Thiele (1995) define it as the difference between the population average fitness before
and after selection, normalised by the mean variance of the pre-selection population fitness. They use
this selection intensity measure as a means of quantitatively comparing different selection schemes.
The constant of proportionality is the inverse of the sum of the fitness values of the
individuals in the current population; it serves to normalise the sum of the individual
selection probabilities to one (Equation 2.1).

$$p(J_k) = \frac{f(J_k)}{\sum_{k=1}^{N} f(J_k)} \qquad (2.1)$$
Here it is assumed that all N fitness values are greater than zero, and that larger
fitness values correspond to better individual performance. If smaller fitness values
correspond to better individual performance (e.g. when minimising prediction errors
in data modelling) then the following scaling is often used (e.g. McKay et al., 1996).
$$f'(J_k) = \frac{1}{1 + f(J_k)} \qquad (2.2)$$
This adjusted fitness value can then be used in place of f(Jk) in Equation 2.1.
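The mechanism of Equations 2.1 and 2.2 can be sketched directly. The following Python sketch is illustrative only (the function names are not from the thesis); it assumes positive fitness values with larger meaning better:

```python
import random

def roulette_select(population, fitnesses):
    """Fitness proportionate ('roulette wheel') selection: individual k is
    picked with probability f(J_k) divided by the sum of all fitness
    values (Equation 2.1). Assumes positive fitnesses, larger = better."""
    total = sum(fitnesses)
    r = random.uniform(0, total)
    cumulative = 0.0
    for individual, f in zip(population, fitnesses):
        cumulative += f
        if r <= cumulative:
            return individual
    return population[-1]  # guard against floating point round-off

def inverted_fitness(f):
    """The scaling of Equation 2.2, used when smaller raw fitness values
    (e.g. prediction errors) correspond to better individuals."""
    return 1.0 / (1.0 + f)
```

For example, evaluating `inverted_fitness` for raw values below 0.1 yields adjusted values above 0.9, which illustrates the compaction problem discussed below.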
Blickle and Thiele (1995) point out that there are properties of fitness proportionate
selection that make it undesirable for general use as a selection mechanism in
evolutionary algorithm applications. The main problem is that it is not translation
invariant with respect to the raw fitness values. This means that as the evolutionary
algorithm progresses it is difficult to ascertain, with any certainty, the level of
selection pressure imposed on the population.
There are also problems associated with the use of the inversion function of
Equation 2.2 when the goal of the search is lower fitness values. In particular, when
the f(Jk) values are small (< 0.1), Equation 2.2 has the effect of compacting the
f′(Jk) values into the interval [0.9, 1]. The problem is then that the selection
probabilities tend to equalise as the algorithm progresses and the raw fitness values
become smaller, reducing the driving force behind the algorithm so that it is very
difficult to exploit the better individuals preferentially to the poorer ones.
Appropriate pre-scaling of the raw fitness values can, in principle, be used to remove
this problem but, in general, the use of fitness proportionate selection is fraught with
difficulties and is best avoided.
2.3.2 Ranking selection
The problems associated with the use of fitness proportionate selection can be
overcome by the use of ranking selection mechanisms (Grefenstette and Baker,
1989). Once the N raw fitness values have been calculated for each individual in the
population they are sorted so that the best individual has the rank N and the worst
the rank 1. The rank values can then be used in place of the raw fitness values in
Equation 2.1. This has the effect of imposing a selection pressure over the
population that varies in a linear manner and is independent of the absolute values of
the fitness measurements.
One problem that can occur with this method is that multiple individuals with the
same raw fitness value are ranked differently. The rank assigned to these individuals
is then an artefact of the sorting algorithm used. This could seriously bias the
selection procedure in cases where there are a relatively large number of individuals
in the population with equal fitnesses. The ranking method can, however, be
modified so that individuals with equal fitness values are given the same rank. This
is accomplished by performing the normal linear ranking procedure and then, for the
individuals with equal fitness (or for each group of individuals that exhibit equal
fitnesses), assigning the mean rank of that group to each of the individuals within
that group. For example, in a population of 10 individuals with unique raw fitness
values, the best individual would be assigned rank 10 and the worst, rank 1. If
however, the individuals with ranks 8,7 and 6 actually had equal raw fitness then
these individuals would be assigned the modified rank of (8 + 7 + 6)/3 = 7. This form of
modified ranking is the selection mechanism adopted for the work described in this
thesis.
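The modified ranking procedure can be sketched as follows; this is an illustrative Python sketch, with names invented for the example, rather than the implementation used in the thesis:

```python
from collections import defaultdict

def modified_ranks(fitnesses):
    """Linear ranking (worst = 1, best = N) in which individuals with equal
    raw fitness all receive the mean rank of their group, so that ties are
    not broken by an arbitrary sorting order. Larger fitness = better."""
    n = len(fitnesses)
    order = sorted(range(n), key=lambda i: fitnesses[i])
    raw_rank = [0] * n
    for rank, i in enumerate(order, start=1):
        raw_rank[i] = rank
    groups = defaultdict(list)          # indices sharing a fitness value
    for i, f in enumerate(fitnesses):
        groups[f].append(i)
    ranks = [0.0] * n
    for members in groups.values():
        mean = sum(raw_rank[i] for i in members) / len(members)
        for i in members:
            ranks[i] = mean
    return ranks
```

With ten individuals in which three share a raw fitness value occupying ranks 6, 7 and 8, all three receive the mean rank (8 + 7 + 6)/3 = 7, reproducing the example in the text.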
2.3.3 Tournament selection
An alternative selection method that has proved popular is tournament selection. It is
similar to ranking selection in that it also overcomes the problems associated with
fitness proportionate selection by decoupling the distribution of selection
probabilities from the absolute distribution of raw fitness values. The method is
analogous to that sometimes observed in nature where individuals directly compete
for the right to mate.
Rather than directly sorting all of the individuals in the population according to
fitness, a tournament group of size Nt is formed by randomly selecting individuals
from the population. The tournament group is then ranked according to fitness and
the best individual in the group is then selected. Tournament selection can be
regarded as a probabilistic version of ranking selection (Koza, 1992) and in the case
of Nt = 2 the two techniques are mathematically equivalent (Blickle and Thiele,
1995). Larger tournament sizes increase the selection pressure on the best
individuals in the population, e.g. in the degenerate case of Nt = N the best individual
in the population is always selected, leading to a massive loss of diversity in the next
generation of the evolutionary algorithm. In the EC literature, tournament sizes of 4-
6 are commonly reported. In some applications of evolutionary algorithms, e.g.
algorithms involving a high degree of parallelisation over a number of processing
nodes, tournament selection is preferred over ranking selection because no
centralised sorting procedure is required.
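The tournament mechanism is short enough to sketch directly. The following Python sketch is illustrative (not from the thesis) and assumes larger fitness values are better:

```python
import random

def tournament_select(population, fitnesses, tournament_size=4):
    """Tournament selection: draw Nt individuals at random (without
    replacement) and return the fittest. Nt = 2 matches linear ranking
    selection; larger Nt raises the selection pressure."""
    contenders = random.sample(range(len(population)), tournament_size)
    winner = max(contenders, key=lambda i: fitnesses[i])
    return population[winner]
```

In the degenerate case Nt = N the whole population forms the tournament group, so the best individual is always selected, as noted above.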
2.4 Notes on evolutionary computational methodology
Bäck et al. (1997), in a recent review of the status and history of EC, contend that it
can often be useful to view EC as a general framework of related concepts that can
be tailored to the user’s application, rather than a pre-defined collection of
algorithms that can be bolted on to a domain specific problem without consideration
of the issues involved. This is worth bearing in mind when attempting to describe
any particular subgroup of evolutionary algorithms: ultimately the form of the
algorithm and the problem representation that is used are not uniquely defined by the
application. An adaptive, incremental approach to the design is required as well as a
willingness to utilise heuristic and qualitative arguments in the construction of the
solution.
Barto (1990) states that whilst traditional engineering methods tend to deal with
quantities and concepts that are of low-dimensionality and natural to the engineer,
connectionist methods (e.g. artificial neural networks) tend to employ “expansive”
representations, in which the representation of the problem is apparently of a higher
dimension than the problem requires. This property is also shared by a number of
representations common in EC, such as genetic programming. The expansionist
representation is an underdetermined one and therefore researchers have a large
amount of freedom in implementing a design. There is, however, the accompanying
burden that there are no clearly defined procedures to an EC design and the
formulation used does not necessarily have the regular mathematical properties, such
as linearity, determinism, stability and convergence that traditional engineering
methods display.
2.5 Genetic algorithms
Genetic algorithms are, perhaps, the best known type of evolutionary algorithm.
They have gained a reputation for being both robust and relatively easy to
implement (Goldberg, 1989). This is borne out by the degree of use that genetic
algorithms have recently seen in a number of diverse research areas: e.g., genetic
algorithms have been used to optimise the design of plastic extruder dies where
gradient based techniques were found too inefficient (Chung and Hwang, 1997).
Moros et al. (1996) used genetic algorithms to generate initial parameter estimates
for kinetic models of a methane dehydrodimerisation process. They found that this
reduced overall computing time and increased the reliability of the model parameter
solutions. Genetic algorithms have also been applied to a number of medical
imaging problems with a good deal of success; e.g. Handels et al. (1999) report on
different methods to recognise malignant melanomas automatically by extracting
features from skin surface profiles. The genetic algorithm method performed best
with a 97.7% successful classification performance on unseen skin profiles.
In view of the fact that the focus of this thesis, genetic programming, is seen by
many as an extension of the basic genetic algorithm, fundamentally employing the
same mechanisms but with greater representational flexibility, the following sections
summarise the basic concepts of genetic algorithms and the theories behind their
efficacy.
2.5.1 Background
Although there had been interest in the modelling and simulation of population
genetics around the same time as the general field of evolutionary computation was
founded it was not until John Holland published the landmark text “Adaptation in
Natural and Artificial Systems” (Holland, 1975) that the advantages of using
genetics as a general model for adaptation in non-biological systems became
apparent to a wider audience.
In the most widely used form of the genetic algorithm, the standard binary crossover
genetic algorithm (SGA), each individual within the population consists of a string
of binary digits. This bit string is a discrete combinatorial representation of a
solution to the problem being examined, meaning that the entire search space can be
represented by the (finite) available combinations of the bits.
In the simplest case, the bit string is usually a direct binary encoding of a real valued
parameter. However, other more mechanistic representations are possible, wherein
the order of the bits represents the nature of the interactions in some entity with
modular characteristics, e.g. in genetic algorithm based classifier systems. The
following sections introduce the basic mechanisms of genetic algorithms.
2.5.2 Reproduction operators in genetic algorithms
For evolution to occur there must be cumulative selection over a number of
generations coupled with the property that small variations in the genotype
sometimes produce improvements in the individual. The selection mechanisms
(fitness proportionate selection, ranking etc.) are largely independent of the
representation of the individual, but the reproduction operators used must be
designed appropriately. The reproduction operators most often used in binary
bit-string genetic algorithms (direct reproduction, point mutation and single point
crossover) are inspired by the recombinative processes that enable adaptation in the
natural world. A number of alternative reproduction operators have been proposed
for use with binary genetic algorithms, e.g. multi-point crossover (De Jong, 1975),
but they are generally simple adaptations or hybrids of the basic single point
crossover and mutation methods and, as a rule, have not been adopted by the bulk of
GA practitioners.
In constructing a new population, the reproduction operator to be used is picked
based on the probabilities Pc (probability of crossover), Pm (probability of mutation)
and Pr (probability of direct reproduction) where Pc + Pm + Pr = 1. These are
algorithm control parameters and must be set by the user. (The rate of crossover
tends to dominate the recombination process in most applications, with direct
reproduction and mutation used as “background” operators.) The selection
mechanism is then used to select an individual (or two individuals in the case of
crossover) and the appropriate reproduction operation is performed. The parent(s)
are left in the current population and are available for reselection. The offspring are
inserted into the new population.
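The operator-picking step described above can be sketched in a few lines of Python. The rate values Pc = 0.85, Pm = 0.05 and Pr = 0.10 are illustrative assumptions, not values taken from the text:

```python
import random

# Illustrative operator rates; Pc + Pm + Pr must sum to 1
P_CROSSOVER, P_MUTATION, P_REPRODUCTION = 0.85, 0.05, 0.10

def pick_operator(rng=random):
    """Pick a reproduction operator according to the probabilities Pc, Pm, Pr."""
    r = rng.random()
    if r < P_CROSSOVER:
        return "crossover"
    elif r < P_CROSSOVER + P_MUTATION:
        return "mutation"
    return "reproduction"
```

When crossover is picked, two individuals are then selected as parents; otherwise a single individual is selected.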
2.5.2.1 Single point crossover
The single point crossover operator is analogous to the exchange of genetic
information (stored on chromosomes) that occurs during sexual reproduction in
nature. A new individual is created by recombining two complementary fragments
of the parent bit strings, thereby testing new individuals that retain characteristics of
both parents. Because the standard genetic algorithm operates over fixed length
linear vectors, the fragment sizes must be constrained so that their combination
results in an individual of the same length. This is accomplished by randomly
picking a crossover point, and applying it to both parents to create two new
offspring. Figure 2.1 depicts this process.
Figure 2.1 Single point crossover in standard binary genetic algorithms
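A minimal sketch of the single point crossover operator, using Python lists of bits as chromosomes (the function name and representation are illustrative assumptions):

```python
import random

def single_point_crossover(parent1, parent2, rng=random):
    """Exchange complementary tail fragments of two equal-length bit strings,
    producing two offspring of the same length as the parents."""
    assert len(parent1) == len(parent2)
    point = rng.randint(1, len(parent1) - 1)   # crossover point, never at the ends
    child1 = parent1[:point] + parent2[point:]
    child2 = parent2[:point] + parent1[point:]
    return child1, child2
```

Note that the total number of 1s and 0s across the two offspring always equals that of the two parents, since crossover only rearranges existing genetic material.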
2.5.2.2 Mutation
The point mutation operator is analogous to the biological random mutations that
infrequently occur on DNA molecules. It is typically applied with much lower
frequencies than the crossover operator. The mutation operator is applied to a single
parent by randomly selecting a bit and then flipping it (see Figure 2.2).
Figure 2.2 Single point mutation in standard binary genetic algorithms
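A corresponding sketch of the point mutation operator (again, the names and list representation are illustrative):

```python
import random

def point_mutation(parent, rng=random):
    """Flip one randomly chosen bit of the parent, returning a new offspring."""
    offspring = list(parent)            # copy; the parent is left unchanged
    i = rng.randrange(len(offspring))   # randomly selected bit position
    offspring[i] = 1 - offspring[i]     # flip 0 -> 1 or 1 -> 0
    return offspring
```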
2.5.2.3 Direct reproduction
Direct reproduction is carried out simply by copying the selected individual into the
next generation with no modification of its bit structure. The purpose of this operator
is to promote the propagation of successful individuals through to future generations
in such a way that they are immune to the (possibly) harmful effects of mutation and
crossover events.
An “elitist” selection scheme can also be employed to protect the best individuals in
the current population. Because direct reproduction is applied probabilistically, it is
extremely likely, but not guaranteed, that the best individuals of the population will
be carried over to the next population; elitist selection acts as a safeguard. The
simplest way to implement it is to copy the top,
say, five per cent of the current population into the new population before
embarking on the ordinary probabilistic selection/reproduction mechanisms. The 5%
elitist method is used in all runs described in this thesis.
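The elitist copy step can be sketched as follows; the helper name and the rounding convention are assumptions:

```python
def apply_elitism(population, fitnesses, elite_fraction=0.05):
    """Copy the top elite_fraction of the population (ranked by fitness) straight
    into the next generation, before probabilistic reproduction fills the rest."""
    n_elite = max(1, int(round(elite_fraction * len(population))))
    ranked = sorted(zip(fitnesses, population), key=lambda p: p[0], reverse=True)
    return [individual for _, individual in ranked[:n_elite]]
```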
2.5.3 Genetic algorithm flowsheet
The overall operation of a standard genetic algorithm can be represented by the
flowsheet in Figure 2.3. The flowsheet shows that the genetic algorithm is
essentially very simple to operate, consisting of straightforward selection and bit
string manipulation mechanisms.
The user must supply the various initialisation parameters (the first block in the
flowsheet), e.g. the population size, the encoding scheme, the termination criterion,
the reproduction operator frequencies etc. The best way to determine these factors is
by referring to existing literature describing a related problem and using the reported
values as default settings. Subsequent experimentation with these settings should
eventually yield satisfactory results, although it is usually impractical to determine
what the optimal settings are for non-trivial problems. The user must also supply a
set of functions that decodes a candidate individual, evaluates it and then returns a
numerical measure of its quality.
Note that the process shown in Figure 2.3 is a simplified version of the genetic
algorithm. It does not include provision for selection method variants such as elitist
selection. Other details of the standard algorithm have also been omitted for the sake
of clarity.
Figure 2.3 Flowsheet of a standard genetic algorithm
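The loop of Figure 2.3 can be sketched end-to-end on a toy problem. The sketch below uses the OneMax fitness function (the number of 1-bits in the string) and size-two tournament selection as illustrative choices; neither is prescribed by the text:

```python
import random

def run_ga(bits=20, pop_size=30, generations=40, pc=0.85, pm=0.10, seed=0):
    """Minimal standard GA on the OneMax problem, following the flowsheet:
    initialise, evaluate, select, apply a variation operator, repeat."""
    rng = random.Random(seed)

    def fitness(ind):
        return sum(ind)                    # OneMax: count the 1-bits

    def select(pop):
        # size-2 tournament: a simple probabilistic selection scheme
        a, b = rng.choice(pop), rng.choice(pop)
        return a if fitness(a) >= fitness(b) else b

    pop = [[rng.randint(0, 1) for _ in range(bits)] for _ in range(pop_size)]
    for _ in range(generations):
        new_pop = []
        while len(new_pop) < pop_size:
            r = rng.random()
            if r < pc:                     # crossover: two parents, two offspring
                p1, p2 = select(pop), select(pop)
                pt = rng.randint(1, bits - 1)
                new_pop += [p1[:pt] + p2[pt:], p2[:pt] + p1[pt:]]
            elif r < pc + pm:              # point mutation: flip one bit
                child = list(select(pop))
                i = rng.randrange(bits)
                child[i] = 1 - child[i]
                new_pop.append(child)
            else:                          # direct reproduction: unchanged copy
                new_pop.append(list(select(pop)))
        pop = new_pop[:pop_size]
    return max(pop, key=fitness)
```

After a few dozen generations on a problem this small, the best individual is typically at or very near the all-ones optimum.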
2.5.4 Genetic algorithms as function optimisers
Figure 2.4 shows an example of a partitioned binary bit string encoding of two real
valued parameters that could be used in a function optimisation scenario. Typically
this would be stated as “find the values of the parameters α and β that minimise (or
maximise) some objective function f(α, β) subject to certain constraints.” In Figure
2.4, Jk is
an individual in a population P(t) = {J1, J2, …, JN} of N individuals at generation t.
It is useful to clarify some of the terminology associated with genetic algorithms.
The bit string in the above example is referred to as a chromosome and the
contiguous bit sections corresponding to each parameter are referred to as genes.
Each gene can take on a number of values, called alleles. The entire string, the
genotype, can be regarded as the prescriptive structure responsible for the expressed
parameter set (Goldberg, 1989). The genotype need not always be completely
defined by the contents of one chromosome; multiple chromosomes can be used to
encode the information in a modular form, allowing restrictions on the interchange
of genetic information to be imposed during recombination.
Genetic algorithms are discrete combinatorial processors but many parameter
optimisation problems are based on continuous real valued parameters. Certain
trade-offs between precision and the size of the coding used must therefore be made.
The number of bits chosen to represent each parameter depends on the range of
admissible parameter values and the degree of precision required. Hence, prior
knowledge of the range in which the optimal values fall (and the desired precision)
is necessary when designing a binary bit string representation. If the function
optimisation requires a high degree of precision over large parameter ranges then the
length of the bit string will become commensurately large, and the size of the space
that the genetic algorithm has to search increases at an exponential rate. In the
example in Figure 2.4 there are 14 bits in total, so there are 2¹⁴ (16,384) distinct
combinations of the bits. For such an example, the number of points in the search
space is not huge and, provided the fitness function is not complex, it could be
searched successfully using non-genetic methods (e.g. an exhaustive search) in a
reasonable amount of time. However, binary bit string lengths of 200 are quite
common in engineering applications (e.g. the optimisation of 20 real valued
parameters simultaneously, each represented to 10 bit precision). The search space
in this case is astronomically large, consisting of 2²⁰⁰ (approximately 10⁶⁰) possible
combinations of bits. An exhaustive search would be infeasible in this case: if one
million points could be searched per second, it would still take on the order of 10⁴⁶
years to complete. The fact that genetic algorithms can successfully search spaces of
this size in far shorter times emphasises that genetic algorithms, although having a
number of random elements in their operation, are not random walks through the
search space.
Figure 2.4 Example of bit string in parameter encoding and evaluation
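Decoding a partitioned chromosome such as Jk can be sketched as below. The 7-bit/7-bit gene split and the parameter ranges are illustrative assumptions, not values taken from the text:

```python
def decode_gene(bits, low, high):
    """Map a sequence of bits to a real value in [low, high] by linear scaling
    of the gene's integer value."""
    integer = int("".join(str(b) for b in bits), 2)
    return low + (high - low) * integer / (2 ** len(bits) - 1)

def decode_chromosome(chromosome):
    """Split a 14-bit chromosome into two 7-bit genes (an assumed partition)
    and decode them as alpha in [0, 10] and beta in [-1, 1] (assumed ranges)."""
    alpha = decode_gene(chromosome[:7], 0.0, 10.0)
    beta = decode_gene(chromosome[7:], -1.0, 1.0)
    return alpha, beta
```

The precision trade-off discussed above is visible here: with 7 bits per gene, alpha can only take 128 distinct values across its range.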
2.5.5 Underlying processes: the schema theorem
How do the partially randomised mechanistic operations involved in genetic
algorithm processing enable a good quality solution, in the form of a particular
sequence of bits, to be obtained from the enormous number of sequences available
in a typical run?
The schema theorem (Holland, 1975) is one explanation, although often criticised,
of how genetic algorithms process information and gradually progress towards a
near-optimal solution. The basis of this theorem is that the genetic algorithm
implicitly processes large numbers of candidate solutions in parallel by means of
similarity templates (so called schemata). Schemata are notational devices that allow
structural similarities in groups of solutions to be quantified, and it is thought that
the genetic algorithm implicitly employs this structural similarity information in
approaching a high quality solution structure. A corollary of the schema theorem is
the building block hypothesis, which implies that, for a GA to work efficiently,
short bit string sections representing relatively successful partial solutions (i.e. the
“building blocks”) must be combined in order to realise a global solution.
Intuitively, and from a human perspective, this
makes sense: often solutions to problems are found by applying successful solutions
to related problems or by breaking down the problem into smaller, more manageable
problems and combining the partial solutions obtained.
Although the schema theorem and the building block hypothesis are useful as
visualisation tools in genetic algorithms, there are certain inconsistencies in the
underlying assumptions, and many of the criticisms that have been directed at the
schema theorem suggest that there may be processes at work in genetic algorithms
that have yet to be adequately explained. Thornton (1997) summarises a number of
the problems with the schema theorem and the building block hypothesis.
2.5.6 Towards flexible representation in genetic algorithms
It has long been recognised that the greatest shortcoming of the classical genetic
algorithm is its lack of representational flexibility. The straightforward coding
scheme is sufficient for parameter optimisation problems, but for the more complex
tasks of generalised machine learning it is a severe restriction. In these cases, a
solution that adapts its own structure by progressively improving on previous
structures would be highly desirable.
One attempt to broach this representation problem for learning systems was the
learning classifier system (Holland and Reitman, 1978) based on the use of IF-
THEN production rules coded as fixed length binary strings. Another example is the
use of variable length strings, the so-called “messy genetic algorithm” introduced by
Goldberg et al. (1989). Whilst these methods enjoyed some success, it was still felt
that the utility of genetic algorithms should be combined with higher order variable
length structures, capable of allowing more complex interactions, and amenable to
the learning of general tasks.
2.6 Genetic programming
A number of researchers in the 1980s pursued the application of genetic algorithms
to more complex structures: e.g., Cramer (1985) used a language consisting of loops
and increments on variables to evolve solutions to a simple symbolic regression
problem. The representation he used consisted of integer strings that could be
decoded to form structured programs. Hicklin (1986) and Fujiki and Dickinson
(1987) investigated the use of genetic reproduction operators in generating programs
in a language called LISP (LISt Processing language). LISP is appropriate for the
application of recombinative methods because groups of instructions and data are
represented in a syntactically identical way, allowing parts of programs to be spliced
into other programs in a manner resembling the bit string splicing in binary genetic
algorithms. Most importantly, this is accomplished whilst still maintaining legal
program syntax.
Genetic programming was the logical progression from the work carried out on the
application of genetic algorithms to higher order data structures. John Koza
published a series of papers in the early 1990s, e.g. Koza (1990, 1991), that
culminated in his extensively referenced text: “Genetic programming: on the
programming of computers by means of natural selection” (Koza, 1992). In it, Koza
describes a wide array of problems, from various fields, that he uses genetic
programming to solve: e.g. symbolic regression (evolving a model that best fits a set
of input-output data), robotic planning (i.e. the “artificial ant” problem: the solution
lies in evolving a program that guides an entity around a grid picking up all the
items of ‘food’ in as few manoeuvres as possible), controller design (deriving a
computer program that brings a vehicle to rest in minimal time using an “on/off”
control signal).
Due to the varied applications described, and the relative ease with which the
genetic programming algorithm can be implemented, Koza’s work was much more
accessible to the research community than the existing approaches to machine
learning. These had tended to rely heavily on formal inferences, abstract symbol
processing and impenetrable mathematical theorems and, hence, seemed to be at a
great remove from being able to solve the sorts of problems that people wanted them
to solve. Genetic programming, on the other hand, although initially only applied to
trivial problems, gave impetus to the idea that artificial intelligence could be
engineered from the ground up and set to work on scientific problems. Most of the
work in genetic programming, both theory and application based, stems from the
algorithms described in Koza’s book.
2.6.1 Program induction: parse trees as adaptable data structures
One of the main insights of Koza is that of the “pervasiveness of the problem of
program induction”, i.e. that a very large number of problems can be solved with the
use of a computer program of some description as an answer. Obtaining a suitable
program for a given problem is what most scientists, engineers, economists etc.
spend a great deal of their professional lives trying to accomplish. Generating and
subsequently adapting program code, given a measure of program fitness (however
implicitly defined), is what humans do to solve technical problems.
The idea of using genetic methods to perturb, fragment and splice programs together
to generate better programs is an appealing one. What is less appealing is the
perceived fragility of program code: most people know from experience that
chopping and changing code in an ad hoc manner is unlikely to result in a program
that actually executes without errors, much less give anything approaching the
correct answer. However, the source code that one types in and the internal
representation of code within a computer are vastly different. Most programs are
internally represented as a parse tree; a data structure that represents a hierarchical
sequence of instructions in the form of an ordered tree. This representation of a
program as an ordered tree structure strips away most of the clutter associated with
the majority of computer languages. That which remains is the functional backbone
of the program. Hence, the problem of cutting and splicing programs is vastly
simplified. Given a few necessary assumptions and constraints (these will be
described in the coming sections) the tree structure can be modified in an ad hoc
manner yet still maintain internal syntactic consistency. Of course, it is highly
unlikely that any one perturbation will result in a better program but genetic
programming, like all evolutionary techniques, uses the cumulative effect of
artificial selection to amplify the effects of the few modifications that do give
slightly better results.
As an example of a program as a tree structure, consider the following simple piece
of pseudo-code (a callable function named prog1 that accepts two real valued
arguments a and b and returns a real argument c, the value of which depends on
whether a or b is greater).
function [c] = prog1(a, b)
    if a <= b then
        c = a + b
    else
        c = a - b
    end
The same function can be represented as a rooted, ordered tree structure as depicted
in Figure 2.5. The tree consists of two types of node: terminals and functionals.
Terminal nodes are the “leaves” of the tree structure and typically represent items of
program data (program inputs or constants). Functional nodes are the branch points
within the tree; they are operators that are used to process terminal node values (and
results from branches further down the tree). In the case of Figure 2.5 the terminals
are the inputs to the program: the arguments a and b. The functional nodes are the
addition operator, +, the subtraction operator, -, and the ‘if less than or equal’
conditional operator designated by the tag IFLTE. The tree processes information as
follows: the data represented by the lowest (leaf) nodes are passed up the tree to the
node immediately above them. At this point, they are operated on by functional
nodes, e.g. the addition operator. Then the results of these calculations are passed up
to the next node and so forth until the root node is reached. The final calculation
ends here and this is usually designated as the overall program output.
The ordering of the branches generally makes a difference to the structure of the
program because of the way that some functional nodes are specified. E.g., the
IFLTE node always has four input arguments, which are processed in the following
way:
if (argument1) ≤ (argument2)
    then return (argument3) as node output
    else return (argument4) as node output
All function nodes used in GP must be explicitly defined in this manner.
Figure 2.5 Tree structure of function prog1
Although a parse tree diagram gives a clear view of the processing hierarchy of a
program, it is not amenable to direct computer manipulation. A more convenient
notation for the trees used in genetic programming is that of prefix notation
(sometimes called Polish notation). In this form of notation, which is directly
equivalent to a parse tree representation, functionals are represented by a symbol
followed by the arguments in parentheses. E.g. the familiar algebraic expression
a + b would be written as +(a b) in prefix notation. Note that the functional
arguments can also be functions themselves: e.g. the expression a - (b + c) would
become -(a +(b c)) in prefix notation. The pseudo-code function prog1
illustrated in Figure 2.5 can be written as: IFLTE(a b +(a b) -(a b)).
(The computer language LISP, originally chosen for genetic programming, uses a
variant of prefix notation, but virtually any high level language can be used if an
appropriate interpreter is available. All of the GP runs in this thesis were performed
using the MATLAB programming language to operate on ASCII coded prefix
expressions.)
It can be seen why tree structures are amenable to the problem of automatic program
induction; sub-trees can be swapped from place to place, nodes can be deleted and
replaced with other nodes (or sub-trees) because the syntax that renders a program
executable is inherent in the tree representation.
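A tree in this form can be evaluated by a short recursive interpreter. The sketch below represents nodes as nested tuples and implements only the nodes used by prog1; the representation is an illustrative choice, not the thesis's MATLAB implementation:

```python
def evaluate(node, env):
    """Recursively evaluate a parse tree. Terminals are variable names looked
    up in env, or numeric constants returned as-is. Functional nodes are
    tuples: (tag, child, child, ...)."""
    if not isinstance(node, tuple):
        return env.get(node, node)
    tag, *args = node
    if tag == "+":
        return evaluate(args[0], env) + evaluate(args[1], env)
    if tag == "-":
        return evaluate(args[0], env) - evaluate(args[1], env)
    if tag == "IFLTE":
        # four arguments, processed exactly as defined in the text
        if evaluate(args[0], env) <= evaluate(args[1], env):
            return evaluate(args[2], env)
        return evaluate(args[3], env)
    raise ValueError(f"unknown function node: {tag}")

# prog1 from the text, in prefix form IFLTE(a b +(a b) -(a b))
prog1 = ("IFLTE", "a", "b", ("+", "a", "b"), ("-", "a", "b"))
```

For a = 2, b = 5 the condition a ≤ b holds, so the + branch is taken; for a = 5, b = 2 the - branch is taken, matching the pseudo-code above.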
Details of the genetic operators used, and some of the other details needed to set up
a genetic programming experiment, are given in the following sections.
2.6.2 Reproduction operators in genetic programming
Three principal genetic reproduction operators are defined by Koza (1992) for
genetic programming: direct reproduction, mutation and crossover (although Koza
did not originally advocate the use of mutation, see Section 2.6.2.2). The concepts
behind them are very similar to the operators used for binary bit string genetic
algorithms.
2.6.2.1 GP crossover
Analogous to the method used in binary bit string genetic algorithms, GP crossover
exchanges information between two chromosomes. Unlike binary genetic
algorithms, there is no theoretical restriction on the sizes of the sections of the
chromosome being exchanged (considerations such as computer memory and
available processing time, however, mean that the practical implementation of
crossover will have an upper limit on the new tree sizes.) As an example, the
following two simple programs will be shown undergoing GP crossover. In the
context of a GP run it is assumed that these two programs are population members
that have been chosen by means of an appropriate selection mechanism.
Parent 1: Output = 3 + (a - b)
Parent 2: Output = b√(1 + c)
In this example the terminals a, b and c are input variables. The other nodes used
are the terminal constants 1 and 3 and the addition, subtraction and square root
functions. Figure 2.6 illustrates Parent 1 and Parent 2 undergoing the
crossover process. The subtrees selected (randomly) for crossover are shaded.
Figure 2.6 Example of crossover in genetic programming.
The results of this process are the two programs shown below:
Offspring 1: Output = (1+c) + (a-b)
Offspring 2: Output = b√3
In an actual GP run, these two offspring would be inserted into the new population
and subsequently evaluated to determine their fitness. The crossover mechanism is
easily implemented by computer using prefix notation, in which the crossover
subtrees are highlighted in boldface:
Parent 1 (prefix): +(3 -(a b))
Parent 2 (prefix): *(b SQRT(+(1 c)))
Offspring 1 (prefix): +(+(1 c) -(a b))
Offspring 2 (prefix): *(b SQRT(3))
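Subtree crossover can be sketched with nested Python lists standing in for parse trees ([tag, child, child, ...], terminals as plain values); the representation and helper names are illustrative assumptions:

```python
import copy
import random

def all_paths(tree, path=()):
    """Yield the index path to every node in the tree (root = empty path)."""
    yield path
    if isinstance(tree, list):
        for i, child in enumerate(tree[1:], start=1):
            yield from all_paths(child, path + (i,))

def subtree_at(tree, path):
    """Return the subtree rooted at the given index path."""
    for i in path:
        tree = tree[i]
    return tree

def replace_at(tree, path, subtree):
    """Return a copy of tree with the node at path replaced by subtree;
    the inputs are left unmodified."""
    if not path:
        return copy.deepcopy(subtree)
    new = copy.deepcopy(tree)
    node = new
    for i in path[:-1]:
        node = node[i]
    node[path[-1]] = copy.deepcopy(subtree)
    return new

def gp_crossover(parent1, parent2, rng=random):
    """Swap one randomly chosen subtree of each parent, GP-style."""
    path1 = rng.choice(list(all_paths(parent1)))
    path2 = rng.choice(list(all_paths(parent2)))
    s1, s2 = subtree_at(parent1, path1), subtree_at(parent2, path2)
    return replace_at(parent1, path1, s2), replace_at(parent2, path2, s1)
```

With Parent 1 as ["+", 3, ["-", "a", "b"]] and Parent 2 as ["*", "b", ["SQRT", ["+", 1, "c"]]], choosing the constant 3 and the subtree +(1 c) as crossover points reproduces the offspring shown above.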
The process illustrated in Figure 2.6 is the most commonly implemented crossover
operator in the GP literature. It has, however, been strongly criticised because,
although superficially similar, it operates in a fundamentally different manner to GA
crossover and, indeed, biological gene crossover. For instance, Francone et al.
(1999) state that biological crossover (and GA crossover) typically exchange genes
that are on the same position on the chromosome and that these genes have
functional similarity. This is not the case for exchanged subgroups (genes) in GP
crossover. For this reason it has been claimed that GP crossover is, in fact, no more
than a “macro-mutation” operator (Angeline, 1997). Koza, however, has responded
to those that have claimed that crossover is unnecessary (Koza, 1999). He presents
several experiments clearly illustrating that GP performs poorly without crossover.
Other statistical studies support this view, e.g. Luke and Spector (1998) and Hiden
(1998) show that GP runs generally benefit from the use of the crossover operator.²
2.6.2.2 GP Mutation
In a manner analogous to its GA counterpart, the purpose of the GP mutation
operator is to improve population diversity by generating entirely new chromosome
segments, in order to explore new regions of the search space. Like the GA mutation
operator, it is a form of asexual reproduction based on only one parent.
As an example, consider that the following program has been selected from the
existing GP population based on its fitness:
Parent: Output = (1+c) + (a-b)
Figure 2.7 demonstrates a mutation operation on this program. First, a mutation
node is randomly selected, and then the corresponding subtree (with the mutation
node as its root) is deleted. Finally, a new subtree is randomly generated (in a
manner similar to that employed when generating the initial GP population) and
inserted in the place of the deleted subtree.
For the example given, the offspring program resulting from the mutation operation
is shown below:
New subtree: Output = log10(1.2/c)
Offspring: Output = log10(1.2/c) + c + (a - b)
Again, the computer implementation of this operation is based on prefix notation,
as shown below:
² It should be noted that, for any search algorithm, there are no neighbourhood
search operators that are optimal for all problems. This is due to the implications of
the No Free Lunch (NFL) theorem (Wolpert and Macready, 1997), which states
that, averaged over all possible search problems, no search algorithm is better than
any other.
Parent (prefix): +(+(c 1) -(a b))
New subtree (prefix): log10(÷(1.2 c))
Offspring (prefix): +(+(c log10(÷(1.2 c))) -(a b))
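Subtree mutation follows the same pattern as crossover: pick a node, delete the subtree rooted there, and graft a new one in its place. A sketch on nested-list trees (the list representation and the DIV tag standing in for the divide node are illustrative assumptions):

```python
import copy
import random

def all_paths(tree, path=()):
    """Yield the index path to every node in a nested-list parse tree."""
    yield path
    if isinstance(tree, list):
        for i, child in enumerate(tree[1:], start=1):
            yield from all_paths(child, path + (i,))

def gp_mutate(parent, new_subtree, rng=random):
    """Replace a randomly chosen subtree of the parent with new_subtree
    (which, in a real run, would itself be randomly generated)."""
    path = rng.choice(list(all_paths(parent)))
    if not path:                       # mutation point is the root
        return copy.deepcopy(new_subtree)
    child = copy.deepcopy(parent)      # the parent is left unchanged
    node = child
    for i in path[:-1]:
        node = node[i]
    node[path[-1]] = copy.deepcopy(new_subtree)
    return child
```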
Note that, unlike crossover, the mutation operator introduces a subtree that contains
structures not necessarily present in the parent. In this case, the functional nodes
log10 and ÷ (divide) as well as the terminal constant 1.2 have been used. The
definition of terminal and functional sets, and how they are used in tree generation,
will be discussed further in Section 2.6.4.
Figure 2.7 Example of subtree mutation in genetic programming.
Originally, Koza maintained that crossover and direct reproduction are the only
reproduction operators that are required to complete a successful GP design (Koza,
1992). He argued that a mutation operator is necessary in the case of GAs because
crossover by itself only recombines bit string sections of the original population that
are associated with high performance, hence mutation is needed to restore the loss of
diversity that accompanies this process. Koza goes on to state that, for GP, this is
not the case, as the crossover operator combines “genes” in a functionally more
flexible way than GAs and so the equivalent loss of diversity should not occur.
While this seems to be a valid theoretical consideration, the informal consensus
among GP practitioners is that the use of mutation (at relatively low rates of about
10% or less) assists the evolution process. This is borne out by a number of
statistical studies on a variety of simple GP applications (e.g. Luke and Spector,
1998, Hiden, 1998). Hence, all the GP runs described in this thesis use the standard
sub-tree mutation operator.
2.6.3 Specifying a genetic program
Koza (1994) describes six steps necessary to define a genetic program. These are:
1) Terminal set selection: choosing the variables that are needed to solve the
problem.
2) Functional set selection: choosing the functions that are needed to operate
on the variables in order to solve the problem.
3) Fitness function specification: choosing appropriate tests of the evolved
programs.
4) Run control parameter settings: choosing parameter values for the GP run,
e.g. rates of mutation, crossover and direct reproduction.
5) Termination criterion specification: setting a condition to terminate the
run.
6) Program architecture specification: deciding how the tree (or a number of
trees in some cases) is decoded into an individual test program for evaluation
and what reproduction operators are used to alter the architecture of the
tree(s).
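One way to gather these six specification choices into a single structure is sketched below; every concrete value is an illustrative assumption, not a setting prescribed by the text:

```python
# The six GP specification steps (Koza, 1994) collected as a configuration
# structure. All values are illustrative.
gp_specification = {
    "terminal_set": ["a", "b", "c", "R"],     # step 1; R = ephemeral random constant
    "function_set": {"+": 2, "-": 2, "*": 2, "SQRT": 1},  # step 2; tag -> arity
    "fitness": "root mean squared error on the training data",  # step 3
    "run_parameters": {"pop_size": 500, "p_crossover": 0.85,    # step 4
                       "p_mutation": 0.05, "p_reproduction": 0.10},
    "termination": {"max_generations": 100, "target_fitness": 0.01},  # step 5
    "architecture": "single tree per individual",                      # step 6
}
```

Collecting the settings this way makes the sufficiency of the terminal and function sets, and the constraint that the operator rates sum to one, easy to check before a run starts.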
Some of these steps are often trivial. Determining the termination criterion, for
example, is usually a simple process, e.g. stop after a certain number of generations
have elapsed or stop when a solution of a high enough quality is found. Specifying
an appropriate fitness test suite and appropriate program components (i.e. terminal
and functional nodes), however, can be a difficult task if the designer is to ensure
that the evolved programs can solve the set problem in a desirable manner.
The remainder of the chapter highlights, briefly, a number of issues pertaining to the
use of GP in solving engineering and science problems. For a more general, and
thorough, overview of GP, see Koza (1999).
2.6.4 Terminals and functions: specifying the program components
The specification of a “toolbox” of components, the terminals and functions, which
can be subsequently manipulated by simulated genetic processes into a working
program, is the first step that calls upon the human user’s knowledge of the problem.
The user will have a good idea, at this stage, what they want the evolved programs
to accomplish and will have some degree of knowledge as to what information must
be manipulated to solve the problem.
The terminal set and the functional set must exhibit the joint property of sufficiency
(Koza, 1992), i.e. out of all the trees that can be constructed from them, there must
be at least one that is capable of expressing the actual solution to the problem. In
addition to the sufficiency requirement, it is necessary that the terminal and function
sets in GP exhibit the property of closure (Koza, 1992). This simply means that any
tree generated from these sets must be syntactically valid. Closure is attained by
ensuring that all functions and terminals return values of the same data type (e.g.
Boolean). This is usually straightforward to achieve for many engineering
applications: the terminals and constants will generally be of the scalar floating point
type, and the standard arithmetical functions will return a value of the same type as
the input arguments. However, there are a few minor exceptions: the floating point
division operator will not return a floating point value if both input arguments are
zero³; the square root operator will return a complex value if its input argument is
less than zero. Similarly, the natural logarithm operator will return a complex value
if its input argument is less than zero, or will return a value of “infinity” or
“undefined” (depending on the computing language being used) if its input is zero.
These problems can be sidestepped by taking a few liberties with the definition of
some mathematical functions in order to maintain data type consistency. For
instance, the division operator can be redefined so that it returns a zero when both
arguments are zero and the square root operator can be redefined to return the
positive root of the absolute value of its input argument. In the GP literature, this is
commonly referred to as “protecting” functions; the redefined division operator is
referred to as protected division, the redefined natural logarithm as protected natural
log and so forth.
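The protected operators described above can be sketched as follows. This is a minimal illustration in Python (the thesis itself used Matlab); the names pdiv, psqrt and plog are illustrative, and return conventions for protected operators vary between GP implementations.

```python
import math

def pdiv(a, b):
    """Protected division: returns 0 when the denominator is zero,
    so 0/0 (and any x/0) never produces NaN or halts evaluation."""
    return a / b if b != 0 else 0.0

def psqrt(a):
    """Protected square root: positive root of the absolute value,
    so negative inputs never produce a complex result."""
    return math.sqrt(abs(a))

def plog(a):
    """Protected natural log: log of the absolute value, with an
    assumed value of 0 at the origin to avoid -infinity."""
    return math.log(abs(a)) if a != 0 else 0.0
```

Because every operator now maps floating point inputs to a floating point output, any tree built from these functions satisfies the closure property.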
2.6.5 Handling constants in genetic programming
Although it is often possible to specify what inputs will be required for an evolved
program to solve a problem, it is not generally possible to know in advance what
constants are required. Exceptions to this rule are Boolean problems, modular
problems (e.g. clock arithmetic) and the like.
Engineering and scientific problems, in general, require the use of non-integer real
constant terms in their solutions, but how can one incorporate these in a terminal set
without knowing their exact values beforehand? The standard GP solution to this is
known as the ephemeral random constant (ERC) method (Koza, 1992) and its use,
although often augmented by other methods of determining constant values, is
widespread.
³ Many computing languages will simply halt program execution and return a ‘division by zero’ error
at this point. Matlab returns the value ‘NaN’ (not a number); this effectively renders the parse tree
meaningless, as all further operations on ‘NaN’ result in ‘NaN’.
The ERC method is actually very simple to implement: a special terminal R is added
to the existing terminal set and used in an identical way to the other terminals when
generating the initial population (i.e. generation 0) of random program trees. At this
point, each occurrence of R in the population can be considered a placeholder for an,
as yet, unknown constant. Then, before the tree is inserted into the initial population,
each instance of R is replaced by a randomly generated real constant from a user-
defined range, e.g. [-1, 1]. Once the initial population has been generated, the values
of the constant nodes are fixed throughout the run; it is only before insertion into
generation 0 that the placeholder R is used. However, many researchers who have
investigated the use of GP for data modelling purposes have asserted that this
method of handling constants is inadequate and numerically inefficient. Other
methods of handling constants have been suggested and these are discussed more
fully in the survey of GP for process modelling purposes in Chapter 3.
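The ERC mechanism described above can be sketched briefly. The nested-list parse tree representation below is an illustrative assumption, not the thesis' own data structure: every occurrence of the placeholder terminal 'R' in a newly generated tree is replaced by a constant drawn uniformly from a user-defined range before insertion into generation 0.

```python
import random

def instantiate_ercs(tree, lo=-1.0, hi=1.0):
    """Replace every 'R' placeholder in a nested-list parse tree with a
    random constant from [lo, hi]; once fixed, these values persist."""
    if tree == 'R':
        return random.uniform(lo, hi)
    if isinstance(tree, list):
        return [instantiate_ercs(node, lo, hi) for node in tree]
    return tree  # ordinary terminal or function name: left unchanged

# e.g. ['+', 'x1', ['*', 'R', 'x2']]  ->  ['+', 'x1', ['*', <constant>, 'x2']]
```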
2.6.6 Multiple populations
Although the basic GP formulation uses a single population of individuals, it is
possible to distribute the population by employing several sub-populations (called
“demes”) that are evolved in isolation except for the periodic exchange (migration)
of individuals from a deme to one or more other demes. This is done to prevent the
premature convergence, caused by a loss of diversity, that can occur in the single-
population algorithm. This scheme also has the advantage that it is well suited to GP
performed over parallel processing units because the fitness evaluations and
selection are performed separately on each processing unit and the only
communications between these units are the periodic transfer of migrating
individuals (Koza, 1995).
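The periodic migration step can be illustrated as follows. The ring topology and copy-then-remove migration policy below are assumptions for the sake of a concrete sketch; many variants (random topologies, replace-worst insertion) appear in the literature.

```python
import random

def migrate(demes, n_migrants=2):
    """Move n_migrants randomly chosen individuals from each deme to the
    next deme around a ring. Migrants are sampled from all demes first,
    so a deme's emigrants are always its own original members."""
    migrants = [random.sample(deme, n_migrants) for deme in demes]
    for i, group in enumerate(migrants):
        dest = demes[(i + 1) % len(demes)]
        for ind in group:
            dest.append(ind)      # insert into the neighbouring deme
            demes[i].remove(ind)  # remove from the source deme
    return demes
```

Between migration events, fitness evaluation and selection proceed independently within each deme, which is what makes the scheme attractive for parallel hardware.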
2.7 Summary
This chapter has introduced the field of evolutionary computation in an engineering
setting and has focussed primarily on the mechanisms involved in selection, and the
reproduction operators used in genetic algorithms and genetic programming. A
number of issues relating to implementation of genetic programming to solve
engineering problems have also been addressed. A discussion of the use of genetic
programming as a tool for data based modelling is presented in Chapter 3.
Chapter 3
Genetic programming as a modelling tool
Nomenclature

Acronyms

The following shorthand expressions are used to designate the various regression methods and algorithms considered in this thesis. Where appropriate, the Chapters containing experimental work with the algorithm are also cited.

CR  Continuum regression.
EBNNPLS  Error based neural network PLS.
EBQPLS  Error based quadratic PLS.
EBRBFPLS  Error based radial basis function PLS.
GP_NPLS1  The version of SGP-PLS implemented by Hiden (1998).
GP_NPLS2  The version of SGP-PLS implemented by Hiden (1998), which incorporates optimisation of the input projection weights.
GP-PLS  An umbrella term referring to GP based PLS methods.
INNPLS  Integrated neural network based PLS.
LQPLS  Linear quadratic PLS.
MLR  Multiple linear regression.
MSGP-PLSa  The default multiple gene implementation of SGP-PLS (Chapter 6).
MSGP-PLSb  A multiple gene implementation of SGP-PLS that uses dynamic data partitioning (Chapter 7).
MTGP-PLSa  The default multiple gene implementation of TGP-PLS (Chapter 6).
MTGP-PLSb  A multiple gene implementation of TGP-PLS that uses dynamic data partitioning (Chapter 7).
MTGP-PLSc  A multiple gene implementation of extended TGP-PLS that uses dynamic data partitioning and a standard reflected binary Gray coding of the input projection weights (Chapter 8).
NNPLS  Neural Net PLS.
PCA  Principal component analysis.
PCR  Principal component regression.
PLS  Partial least squares / Projection to latent structures.
PLS1  Linear PLS with a single output variable.
PLS2  Linear PLS with multiple output variables.
PPR  Projection pursuit regression.
QPLS  Quadratic PLS.
RBFPLS  Radial basis function network PLS.
SGP-PLS  Umbrella term for sequential GP-PLS algorithms.
SGP-PLSa  The default single gene implementation of SGP-PLS used in this thesis (Chapter 5).
SGP-PLSb  A single gene implementation of SGP-PLS that uses dynamic data partitioning (Chapter 7).
SPLINE-PLS  A non-linear PLS algorithm that uses piecewise polynomials.
TGP-PLS  Umbrella term for team based GP-PLS algorithms.
TGP-PLSa  The default single gene implementation of TGP-PLS used in this thesis (Chapter 5).
TGP-PLSb  A single gene implementation of TGP-PLS that uses dynamic data partitioning (Chapter 7).
TGP-PLSc  A single gene implementation of extended TGP-PLS that uses dynamic data partitioning and a standard reflected binary Gray coding of the input projection weights (Chapter 8).

Symbols

α  A function parameter.
β  A function parameter.
φj  The jth regressor vector in a basis function expression.
A  A subset of the training data.
bi  The ith linear model parameter in MLR. Also, the univariate regression model coefficient for the ith latent variable stage in PLS.
{b0,i,k, b1,i,k}  Model coefficients for the kth GP individual and the ith inner model in single gene GP-PLS.
B  A subset of the training data.
B  An (m × p) matrix of model coefficients.
Bq  A (q × p) reduced matrix of PCR model coefficients.
{c0,i, …, c2,i}  The quadratic coefficients for the ith inner model in QPLS.
corr(x,y)  The correlation coefficient of x and y.
cov(x,y)  The covariance of x and y.
dj  The jth basis function coefficient.
e  An (n × 1) error vector.
E  An (n × p) error matrix.
f  The number of basis functions.
f(Jk)  The raw fitness of the kth individual in a population.
f′(Jk)  The adjusted fitness of the kth individual in a population.
gj  The jth basis function.
G  The number of generations in a GP run.
Gj,k  The jth gene in the kth multiple gene expression.
Gj,i,k  In MSGP-PLS, the jth gene in the kth individual at the ith latent variable stage. In MTGP-PLS, the jth gene in the ith team member of the kth team.
K  Process gain.
Jk  The kth individual in a population.
Ji,k  In SGP-PLSa, the kth GP individual at the ith latent variable stage. In TGP-PLSa, the ith team member in the kth team.
m  The number of input (independent) variables.
n  The number of measurements in the training set.
nA  The number of measurements in training subset A.
nB  The number of measurements in training subset B.
N  The number of individuals in a GP population.
Ng^k  The number of genes in the kth multiple gene individual.
Ngm  The maximum number of allowed genes in a multiple gene individual.
Nlv,k  The number of members in the kth team that are used to generate the overall model.
Np  The number of symbolic members in a team.
Nt  Tournament size.
Nw  The number of bits used to encode a projection weight in TGP-PLSc and MTGP-PLSc (extended binary teams).
p  The number of output (response) variables.
pi  An (m × 1) vector of input loadings at the ith latent variable stage.
pi,k  In TGP-PLS, an (m × 1) vector of input loadings at the ith latent variable stage as calculated in the kth team evaluation.
Pc  Probability of crossover.
Phigh  Probability of high level crossover.
Pi  The ith GP population.
Pm  Probability of mutation.
Pr  Probability of direct reproduction.
q  The number of principal components retained in PCA.
qi  A (p × 1) vector of PLS output loadings for the ith latent variable stage.
r  The radius of a circle.
rms(.)  The root mean square value of (.).
R  The set of ephemeral random constants.
s  The Laplace operator.
Si  The ith subpopulation.
t  The generation index.
ti  An (n × 1) vector of input scores for the ith latent variable stage.
ti(A)  An (nA × 1) vector of input scores over the subset A for the ith latent variable stage.
T  An (n × m) matrix of PCA scores.
Tk  The kth GP team in a population.
Tq  An (n × q) reduced matrix of PCA scores.
ui  An (n × 1) vector of output scores for the ith latent variable stage.
ui(A)  An (nA × 1) vector of output scores over the subset A.
ui,k  In TGP-PLS, the (n × 1) vector of the ith output scores vector as calculated in the kth team evaluation.
ûi,k  In SGP-PLS, the (n × 1) vector of the prediction of the ith output scores vector by the kth GP individual.
var(x)  The variance of the elements in the vector x.
V  An (n × m) matrix of PCA loadings.
Vq  An (n × q) reduced matrix of PCA loadings.
wi  An (m × 1) vector of input projection weights at the ith latent variable stage.
wi,k  In TGP-PLS, an (m × 1) vector of input projection weights at the ith latent variable stage as calculated in the kth team evaluation.
Wi,k  The ith binary team member in the kth team in TGP-PLSc and MTGP-PLSc.
x  An input (predictor) variable.
xj  An (n × 1) vector of scaled measurements on the jth input variable.
xt  The value of the input variable at time t.
X  An (n × m) input data matrix. The jth column contains scaled measurements of the jth input variable.
X̂  An (n × m) matrix of estimated values for X.
X(A)  An (nA × m) matrix of input values over the subset A.
Xi  The (n × m) deflated input matrix at latent variable stage i.
y  An output (response) variable.
yt  The value of the output variable at time t.
y  An (n × 1) vector of scaled measurements on a single output variable.
y(A)  An (nA × 1) vector of output values over the subset A.
yj  An (n × 1) vector of scaled measurements on the jth output variable.
Y  An (n × p) output data matrix. The jth column contains scaled measurements of the jth output variable.
Yi  The (n × p) deflated output matrix at latent variable stage i.
Introduction

1.1 Background

Industrial plants can be operated most safely and economically when the engineer has detailed fundamental knowledge of how the component processes work. Ideally, every physical aspect of each process is understood in detail, allowing it to be efficiently controlled so that it performs in optimal conditions. In the real world, however, this is rarely the case. Modern industrial processes tend to be highly complex, involving physical and chemical interactions that are often poorly understood at the quantitative level. Sometimes the control of these processes is based purely on rules gleaned from experience of operating the plant. This lack of fundamental process knowledge precludes the use of detailed mathematical models, which are, in any event, usually too time consuming and expensive to develop.

On the other hand, whilst fundamental knowledge of process behaviour is difficult to obtain, process data is not. Plant instrumentation and computer systems routinely collect and store data on hundreds of variables such as flow rates, pressures and temperatures, as well as measures of product quality. This data affords the possibility of constructing entirely empirically determined models of how the process behaves (models of this type are usually called "black box" models and allow the outputs of the process, e.g. product quality variables, to be predicted using the inputs of the plant, e.g. reactant flowrates, chemical composition etc.). The drawbacks of black box modelling are that the models do not usually have any physical interpretation and they cannot safely extrapolate beyond the range of the data that were used to train them. Process data can also be used in conjunction with existing process knowledge; in this case only certain relationships are determined from the data and these are combined with a physically derived mathematical model. This approach is often called "grey box" modelling.
However, in the absence of formal physical and chemical equations, methods that rely solely on plant data to quickly develop cost-effective and accurate models are needed. The concept of automatic model development, i.e. methods that allow good data based models to be built with minimum expert knowledge, is of particular interest.

1.2 Data based modelling

1.2.1 Linear regression methods

Methods that assume a linear relationship between the process inputs and outputs are the traditional tool of the engineer because they are fairly simple in structure and can sometimes result in models that have some physical interpretation. These are usually in the form of regression models, such as multiple linear regression (MLR) and, more recently, principal component regression (PCR) and partial least squares regression (PLS; Wold, 1975)¹. The latter two methods are developments of MLR that have certain properties that make them useful for finding linear relationships between output variables and large numbers of highly correlated input variables. PLS works by projecting the input and output data onto low dimensional subspaces and then fitting univariate regression models between the projections. It has proved to be an effective way to develop multivariate process models from noisy, high dimensional and correlated data (e.g. see Wise and Gallagher, 1996). A drawback with the PLS regression method is that it assumes a linear relationship between inputs and outputs, whereas the behaviour of industrial processes is frequently non-linear. There are ways of extending the PLS framework to capture non-linear relationships, however, and these are described in Chapter 4.

1.2.2 Automatic model development

Methods to develop models that can effectively capture non-linear relationships in process data with a minimum of expert knowledge have been extensively researched. The most well known of these techniques is the artificial neural network (ANN). This method has had a good degree of success in the process industries (e.g. see Willis et al., 1992, Lennox, 1996).
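As a concrete aside, the projection idea behind the linear PLS method of Section 1.2.1 (project the inputs onto latent scores, fit a univariate regression between the scores and the output, deflate and repeat) can be sketched in a few lines of NumPy. The variable names (X, y, w, t, p, b) follow the Nomenclature, but the implementation details below (mean-centring, a single output variable, a fixed number of latent variables) are simplifying assumptions rather than the exact algorithm described in Chapter 4:

```python
import numpy as np

def pls1(X, y, n_lv):
    """Sketch of linear PLS1: project X onto latent scores t = X w,
    fit a univariate regression between t and y, deflate, repeat."""
    X = X - X.mean(axis=0)          # data assumed scaled; centre here
    y = y - y.mean()
    W, P, b = [], [], []
    for _ in range(n_lv):
        w = X.T @ y                 # input projection weights w_i
        w = w / np.linalg.norm(w)
        t = X @ w                   # input scores t_i
        b_i = (t @ y) / (t @ t)     # inner (univariate) model coefficient b_i
        p = X.T @ t / (t @ t)       # input loadings p_i
        X = X - np.outer(t, p)      # deflate the input matrix
        y = y - b_i * t             # deflate the output vector
        W.append(w); P.append(p); b.append(b_i)
    return np.array(W).T, np.array(P).T, np.array(b)
```

A non-linear PLS method of the kind developed in this thesis replaces the linear inner model t ↦ b·t with a non-linear function, e.g. one evolved by GP.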
¹ PLS is also known as "projection to latent structures" for reasons that will become apparent in Chapter 4.

There are certain disadvantages to using ANNs, however, not least of which is that the user must make informed decisions about the size, topology and training method, as well as the selection of appropriate network inputs (McKay et al., 1997). Of course, it is arguable that a meta-level optimisation layer for performing these choices can be implemented (although this may be computationally costly) or established rules of thumb may be used to determine network topologies, but Sarle (1997) points out that many of these rules are 'nonsense' and states that "in most situations, there is no way to determine the best number of hidden units without training several networks and estimating the generalisation error of each".

Another relatively recent technique that shows promise in the automatic development of process models is that of genetic programming (GP; Koza, 1992). GP is an evolutionary search method and it was originally designed as a way of automatically learning how to solve problems by applying artificial selection and reproduction to populations of solutions that are encoded as variable length tree structures. GP was identified as a good candidate for automatic model development because it appeared to be able to automatically select the appropriate input variables as well as discover the model structure and parameters simultaneously (this process is known as symbolic regression). Whilst this is true to a certain extent, some research at the University of Newcastle has shown that the standard form of GP (i.e. the original form of GP using a single population of trees constructed from arithmetic and simple non-linear functions with no use of advanced architectures or representations) does not generally perform any better than feedforward artificial neural networks with sigmoidal activation functions (Hiden, 1998). It is also debatable whether the GP approach is any more 'automatic' than the ANN approach, since a number of user decisions must be made with regard to population size, architecture, selection method, choice of primitives etc.
However, it remains to be seen if GP is, in practice, more amenable to automatic model development.

1.2.3 Combining GP with PLS (GP-PLS)

In an attempt to improve the ability of GP to model steady state non-linear multivariate systems, a hybrid of GP and the PLS modelling method was proposed
(GP_NPLS1; Hiden, 1998, Hiden et al., 1998). This method, in common with other non-linear PLS methods, sequentially supplies a series of non-linear univariate models to fit the relationships between the data projections. The GP_NPLS1 method was found to increase the accuracy of the evolved models, in terms of the prediction errors on unseen data, but was not found to give any better performance than an equivalent neural network based PLS method. A variant on this method, GP_NPLS2, was also proposed. GP_NPLS2 retains the same basic architecture as GP_NPLS1 but it incorporates an iterative non-linear least squares routine that optimises the data projection directions. GP_NPLS2 gave better results than GP_NPLS1 and outperformed the equivalent neural network PLS approach (i.e. neural network PLS with optimised projection directions). Despite this improvement, GP_NPLS2 was deemed unacceptable for use as a modelling tool due to the extremely high computational cost requirements of the external optimiser. However, it was felt that the GP-PLS concept had unexplored potential and there might be ways of modifying it so that better results could be attained without resorting to the use of "brute force" optimisation methods, which are both time consuming and aesthetically unappealing from an engineering standpoint. This led to the starting point of the work described in this thesis, the outline of which is detailed in the next section.

1.3 Thesis aims and outline

The primary aim of this work is to demonstrate that the GP-PLS method has potential as a viable process systems modelling tool by improving its performance without recourse to methods that would greatly increase the computational load, such as iterative optimisation routines. This is accomplished by testing some novel GP-PLS architectures and evaluating training and validation methods on some simple steady state test systems.
This thesis begins, however, by introducing the field of evolutionary computation (focussing particularly on genetic algorithms and genetic programming) before reviewing the use of genetic programming as a modelling (system identification) tool with industrial process applications. A brief abstract of each of the following Chapters is provided below:
Chapter 2: Evolutionary computation and genetic methods
The underlying principles of evolutionary computation are briefly introduced, as well as a more detailed discussion of the mechanisms of genetic algorithms and genetic programming.

Chapter 3: Genetic programming as a modelling tool
A review of the use of GP as a modelling tool is provided, with an emphasis on applications for process systems modelling. Representational and numerical issues relevant to the model building capabilities of GP are discussed.

Chapter 4: Non-linear PLS using genetic programming
The mechanisms underlying linear PLS regression and related approaches are discussed, as well as non-linear extensions to the PLS framework, such as neural net based PLS and the sequential GP-PLS algorithms proposed by Hiden (1998). A novel GP-PLS architecture, team based GP-PLS, is proposed. This evolves a population of co-operating teams of models to solve the modelling task in parallel, unlike the original GP-PLS method in which the models are supplied sequentially by consecutive GP runs.

Chapter 5: Comparing team and sequential GP-PLS
A comparative study of the team based GP-PLS architecture with the sequential architecture proposed by Hiden (1998) is described. This study is based on the use of data obtained from three synthetic test systems: a simulated cooking extruder, a non-linear mathematical function and a simulated pH process. Some properties of the evolved models are described and an attempt is made to improve the generalisation performance of the evolved models by use of a split-sample validation method.

Chapter 6: Multigene GP-PLS
The combination of the "multigene" GP method (Hinchliffe et al., 1996) with both team based and sequential GP-PLS methods is described. This method decomposes GP models into modular substructures in order to improve their evolvability properties. A comparative study of the multigene GP-PLS algorithms and the single gene algorithms on the three synthetic test systems is described.

Chapter 7: Dynamic data partitioning
A novel method for GP-PLS training is proposed and combined with the team and sequential algorithms. This method, provisionally called dynamic data partitioning, is intended to improve the generalisation of the evolved models by reducing model overfitting. A comparative study of the various GP-PLS methods with and without the dynamic data partitioning method on the three synthetic test systems is described.

Chapter 8: Extended teams
A novel team based GP-PLS architecture, whereby the data projection directions are encoded as binary team members and evolved in parallel with the GP-PLS models, is proposed. A comparative study (on the three test systems) of the extended team algorithm with the algorithms developed in Chapter 7 is described and some properties of the evolved models are discussed.

Chapter 9: Conclusions and further work
A number of comments and conclusions on the development of the GP-PLS framework in this thesis are offered. Suggestions for further work in the area are provided.
2 Evolutionary computation and genetic methods
2.1 Introduction

This thesis is concerned with the application of genetic search to the projection based regression method of partial least squares (PLS). Genetic search methods are members of a closely related family of procedures called evolutionary computation (EC). The purpose of this chapter is to introduce EC, and subsequently genetic algorithms and genetic programming, by exposition of the underlying principles of simulated evolution. Concepts that are of importance in EC (and in the work that is discussed in the following chapters), such as evolvability and representational issues, are outlined. The chapter concludes with a discussion of various features of genetic programming as a prelude to the use of GP in a system identification framework.

2.1.1 What is evolutionary computation?

Evolutionary computational methods are a class of iterative learning algorithms that imitate the natural processes of biological evolution in order to solve science and engineering problems. EC methods utilise a set of concepts and arguments that are essentially identical to those that underpin the modern theoretical framework of evolutionary biology. This framework is known as neo-Darwinism, as it builds on the ideas of "survival of the fittest" and cumulative selection first proposed coherently by Darwin (1859). If one combines Darwin's ideas with the notion of differences in genetic encoding (the genotype) mapping to differences in physical attributes (the phenotype), then evolution can be regarded as a statistical process operating on complex data structures. It is then possible to view evolution as an open-ended optimisation process that can be formalised, modelled and exploited.

The basic requirements for evolution to occur in a biological context are (e.g. Darwin, 1859):

• There is a finite population of individuals.
• The individuals can reproduce and pass on their traits to their offspring.
• There should be a variety of traits within the population of individuals.
• The traits of the individuals should be related to their ability to survive (i.e. the variety of the individuals should enable them to compete for the right to be selected for reproduction).

In addition to these points there should be added the explicit requirement of encoding of the traits:

• The salient characteristics of the individual (i.e. those characteristics which impart upon the individual the ability to reproduce in its environment) should be partly transmissible to its offspring via some sort of encoding system. Furthermore, the transmission of the traits should not be error free, otherwise no new traits can develop.

It is now almost universally accepted that, in nature, it is predominantly the DNA code that determines the variability of individuals. Thus, it is the complex interactions between the genetic information and the processes of selection in a finite population that give rise to the evolutionary driving force. In nature this evolutionary pressure is not directed towards a particular goal, it is open ended, whereas in simulated evolutionary processes (for the most part) the evolution is directed towards solving a specific problem.

2.2 EC algorithms: basic structure and functionality

EC algorithms work by iteratively processing a population of individuals, each of which forms a candidate solution to some problem. At the beginning of the EC algorithm, it would not be expected that any individual would constitute a good solution, since the initial population is randomly generated. The population is then forced (by means of a process analogous to natural selection) to evolve with the goal of producing better and better candidate solutions. Different EC algorithms use different representations of candidate solutions. Representations range from real-valued vectors in evolutionary programming (Fogel, Owens and Walsh, 1966) and evolutionary strategies (e.g.
see Schwefel, 1995) to bit strings and symbolic tree structures in genetic algorithms (Holland, 1975) and genetic programming (Koza, 1992).
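To make the tree-structured representation used by GP concrete, the following sketch encodes individuals as nested Python tuples over a small primitive set. The function and terminal sets here (`+`, `-`, `*`, two input variables and a constant) are illustrative assumptions only, not the primitives used later in the thesis:

```python
import operator
import random

# Hypothetical primitive set: function symbols mapped to (implementation, arity),
# plus a terminal set of input variable names and a numeric constant.
FUNCS = {'+': (operator.add, 2), '-': (operator.sub, 2), '*': (operator.mul, 2)}
TERMS = ['x1', 'x2', 1.0]

def random_tree(depth, rng):
    """Grow a random expression tree (a nested tuple) of at most `depth` levels."""
    if depth == 0 or rng.random() < 0.3:
        return rng.choice(TERMS)                 # leaf: variable or constant
    f = rng.choice(sorted(FUNCS))
    return (f,) + tuple(random_tree(depth - 1, rng) for _ in range(FUNCS[f][1]))

def evaluate(tree, env):
    """Recursively evaluate a tree; `env` maps variable names to values."""
    if not isinstance(tree, tuple):
        return env.get(tree, tree)               # variable lookup, or a constant
    fn, _arity = FUNCS[tree[0]]
    return fn(*(evaluate(arg, env) for arg in tree[1:]))
```

For example, `evaluate(('+', 'x1', ('*', 'x2', 'x2')), {'x1': 1.0, 'x2': 3.0})` returns 10.0. Crossover and mutation then operate on such individuals by swapping or regenerating subtrees.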
The level of performance of any particular individual can be ascribed some numerical value; this is frequently referred to as its fitness. Assigning the fitness involves evaluating the individual against some problem dependent objective function (called a fitness function in EC literature) and is, for non-trivial applications, the most time consuming part of the algorithm. The rate of replication for any individual is determined by its fitness value and the fitness values of the other individuals in the population. Exactly how the replication occurs is determined by the selection scheme used to pick the individuals for replication and the genetic operators used to perform the replications (possibly with modifications to the individuals; analogous to mutation and sexual reproduction in biological evolution). Those individuals that perform well, i.e. those of above average fitness, must have a selection rate higher than those that perform relatively poorly in order for their genetic information to successfully penetrate, and remain tenable in, the population.

The algorithm is terminated according to some pre-specified termination criteria. This is generally dependent on the type of algorithm and the application. The most frequently used method is to allow a pre-set number of iterations to elapse before termination, although in some situations, where there is some a priori quantitative knowledge of the problem solution, it is possible to terminate the procedure when a member of the population gives an acceptable solution.

2.2.1 Historical perspective

The roots of EC can be traced back to the late 1950s and early 1960s with works published in the general area of machine learning by a number of contributors, e.g. Friedberg (1958), Holland (1962).
Interest in the use of evolutionary methods for performing adaptation continued throughout the 1970s but was mostly restricted to a relatively small number of researchers with access to suitable computer hardware, and who published only in a narrow spectrum of journals. This situation persisted for a number of years and it was not until the early 1990s that the previously disparate components of (what is now termed) EC formally cohered (Bäck et al., 1997).
The mid 1980s saw the beginning of a more widespread interest in EC methods. Largely, this was catalysed by the availability of relatively cheap, high performance computing to researchers in a variety of technical disciplines. Difficult optimisation problems (i.e. those posed in noisy, uncertain and highly constrained domains), in particular, have become popular candidates for the application of EC techniques. As greater computing power becomes available, however, it is expected that EC methods will increasingly be used for design purposes.

Most of the evolutionary algorithms around today can be loosely classified as belonging to one of the following three categories: genetic algorithms (subsuming genetic programming and classifier systems), evolutionary programming and evolutionary strategies. These approaches are highly related but, historically, they were developed independently (Fogel, 1997).

2.2.2 Context of EC in engineering search and optimisation

A number of search and optimisation methods have been developed for wide ranging uses in the fields of science, engineering and economics. They are typically applied when the solution (or solutions) to the problem being examined cannot be readily expressed in a neat, closed analytical form. This is usually the case in the majority of real-world problems: often the available information is not sufficient for a simple solution to be deduced, or the mathematical analysis may be intractable. Hence, further techniques to search for a satisfactory solution are usually required.

Calculus driven and enumerative methods form the traditional base of search and optimisation techniques. Exhaustive enumerative methods, e.g. dynamic programming (Bellman, 1957), directly evaluate the objective function for possible solutions point by point. The regions of search are progressively refined and explored (e.g.
using geometrical considerations) so that the number of points evaluated does not become too large and degenerate the procedure into a random search. However, these techniques are not efficient and they “break down on problems of moderate size and complexity” (Goldberg, 1989).
Calculus driven techniques assume that the space to be searched can be treated as an analytically well-behaved surface with extrema that can be located using derivative functions. The efficacy of such a search is highly dependent on the topography of the optimisation surface and the initial conditions. Again, the assumptions that need to be made about the behaviour of the search space are quite strong and, in general, are not satisfied by the majority of real-world problems. The difficulty in solving these problems has been the main driving force behind the development of stochastic (including evolutionary) methods.

It is possible to employ algorithms that do not rely on the search space being continuous and well behaved. Evolutionary algorithms are included in this class of search methods, as is the (non-population based) class of algorithms known as simulated annealing (Metropolis et al., 1953). This method searches points in the space of possible solutions in a probabilistic manner. Previous candidate solutions are perturbed according to a statistical schedule analogous to the annealing method in metal cooling. Initially, when the "temperature" is high, perturbations to previous solutions are accepted with high probability. As the algorithm continues, a cooling schedule is imposed so that future perturbations are accepted with ever decreasing probability. The mechanisms involved in a simulated annealing optimisation are similar to those occurring in certain evolutionary algorithms; the main differences are in the physical analogy used and the use of a population of search points in the evolutionary case.
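The annealing loop just described can be sketched in a few lines. The geometric cooling schedule, step size and one-dimensional search space below are arbitrary illustrative choices, not part of the method as defined by Metropolis et al. (1953):

```python
import math
import random

def anneal(f, x0, step=1.0, T0=1.0, cooling=0.99, iters=3000, seed=0):
    """Minimise f by simulated annealing: always accept downhill moves,
    accept uphill moves with probability exp(-delta/T), cool T geometrically."""
    rng = random.Random(seed)
    x, fx = x0, f(x0)
    best, fbest = x, fx
    T = T0
    for _ in range(iters):
        cand = x + rng.uniform(-step, step)    # perturb the previous solution
        fc = f(cand)
        delta = fc - fx
        if delta <= 0 or rng.random() < math.exp(-delta / T):
            x, fx = cand, fc                   # move accepted
        if fx < fbest:
            best, fbest = x, fx                # track the best point seen
        T *= cooling                           # impose the cooling schedule
    return best, fbest
```

Early on, when T is large, even strongly uphill moves are accepted, which allows escape from local minima; as T falls the search becomes effectively greedy.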
One of the frequently cited advantages of evolutionary algorithms over traditional methods is that the use of a population of search points is less sensitive to the initial conditions of the search, and that the explicit parallelism of the algorithms makes them less likely to become trapped at local extrema in a multi-modal search space. Another reason for their popularity is that it is not necessary to evaluate or estimate derivatives during the search: no auxiliary information other than objective function evaluations is required. It is important to remember that, despite the recent excitement, evolutionary
algorithms are not voodoo. They, like any other search methods, have their limitations. It is, in general, not possible to establish convergence proofs for evolutionary methods (although this can be done in certain cases, e.g. Rudolph, 1994) and it is somewhat unclear (unlike calculus based techniques) under what circumstances evolutionary algorithms perform poorly or what representation is most appropriate for a given task. Another problem is that, as the complexity of the target problems becomes greater, the computational demands of these algorithms begin to outstrip the available resources. Hence, the cost of repeated experimentation can be prohibitive. This can often limit their usefulness for certain purposes. Indeed, the perceived high computational cost to performance ratio involved in the best performing GP-PLS algorithm proposed by Hiden (1998) was the prime motivation for much of the work tackled later in this thesis.

2.2.3 Evolvability and the choice of representation

In addition to the requirements outlined in Section 2.1.1, it is necessary for the population as a whole to have a property known as evolvability, i.e. "the ability of random variations to sometimes produce improvement" (Wagner and Altenberg, 1996). Without this "hidden" criterion, the requirements listed earlier are not sufficient to ensure evolution. Evolvability is, in general, a complex property of the way that the genotype (i.e. the coding of the individual as an abstract mathematical entity) maps to the phenotype (i.e. the "physical" structure of the individual, which ultimately dictates its behaviour). In EC designs this is often, mistakenly, considered to be a purely representational problem, with the fitness function regarded as a self-evident and immutable goal, rather than as a functional ingredient of the evolutionary process (Jakobi, 1996).
This means a somewhat more lateral approach to fitness function design might be needed in EC designs than is normally required for traditional search techniques. Evolvability is a necessary property of a successful EC application but how does one, from a practical standpoint, go about achieving it? Actually, up to a point, this may not be as difficult as it sounds: common EC representations (e.g. genetic algorithm bit strings) are common because they are structures that, empirically, have
been shown to exhibit good evolvability properties in a number of situations. Furthermore, the form of the fitness function is, to some degree, pre-determined by the nature of the application domain. The skill of the designer then lies in modifying these basic components in order to further improve the evolvability of the population, e.g. by augmenting the fitness functions with penalty functions (e.g. Searson et al., 1998) or by modularising the representation by ensuring that, as far as possible, functionally independent phenotypic effects are represented by syntactically independent genotype structures (Altenberg, 1994). So, whilst there are some general avenues of exploration open to the designer of an EC application wishing to maximise evolvability, there are no hard and fast rules for accomplishing this, and so the designer frequently must utilise an iterative, heuristic procedure based on the recommendations of the available literature and experience of similar applications. The methodology of EC designs is discussed further in Section 2.4.

2.3 Selection mechanisms

The selection mechanism is central to the successful operation of evolutionary algorithms. It must improve the average fitness of the next generation by giving individuals with a high relative fitness a high probability of being selected. Then, reproduction operators (such as mutation and crossover in the case of genetic algorithms) can be applied to the selected individuals to create new individuals, thereby investigating new regions of the search space. Thus, the selection mechanism allows the exploitation of genetic material currently contained within the population, with a view to its further improvement in future generations by means of evolutionary reproduction operators.
This is in stark contrast to traditional “hill climbing” techniques that focus only on transforming the current best solution into a better one, ignoring the possibilities of previous partial solutions and leaving the approach susceptible to being trapped in a local optimum. By allocating trials to inferior solutions, evolutionary algorithms forgo an immediate, moderate payoff in expectation of a higher future payoff. A balance is struck between the exploitation of individuals with a higher than average fitness in the population and the exploration
of individuals that are not quite as good (but may contain genetic information that could be useful when mutated to a slightly different form or suitably combined with other individuals). The weighting of this balance is determined by the selection pressure over the population. The term “selection pressure” is frequently used in an informal manner¹ to indicate the probability that individuals with a given fitness value have of being picked by the selection process. This term can also be applied to a population as a whole. If it is said that there is a high selection pressure over a population, it usually means that the selection mechanism is heavily biased towards individuals of high relative fitness, with the advantage of greatly raising the average fitness of the next generation. If too high a selection pressure is applied, however, this could have the undesirable effect of causing a loss of diversity in the next generation and premature convergence of the algorithm to an unsatisfactory solution. Conversely, with too little selection pressure the algorithm stagnates and, in the degenerate case, becomes little better than a random search. The choice of the level of selection pressure to exert on the population throughout the course of an evolutionary algorithm is a major consideration. In most applications, however, the designer does not have sufficient a priori information to gauge the effect of a given selection scheme on the success of the evolutionary algorithm and so must usually opt for mechanisms that have proved successful in the past or have been recommended in the literature.

2.3.1 Fitness proportionate selection

Fitness proportionate selection (often known as roulette wheel selection) is probably the simplest selection mechanism to implement and is the method that was originally chosen for use with the earliest genetic algorithms by Holland (1975).
It may be stated simply as: the selection probability p(J_k) of the kth individual J_k in the current population P(t) = {J_1, J_2, …, J_N} at generation t is directly proportional to the fitness value f(J_k) of the individual.

¹ Additionally, there are a number of formal measures of selection pressure (or ‘selection intensity’). Blickle and Thiele (1995) define it as the difference between the population average fitness before and after selection, normalised by the mean variance of the pre-selection population fitness. They use this selection intensity measure as a means of quantitatively comparing different selection schemes.
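Fitness proportionate selection, as just stated, can be sketched in a few lines of Python (an illustrative sketch only; the function names are mine, not from the thesis):

```python
import random

def roulette_select(population, fitnesses):
    """Pick one individual with probability proportional to its fitness.

    Assumes every fitness value is positive and that larger values mean
    better performance; the running cumulative sum plays the role of the
    normalising constant in Equation 2.1.
    """
    spin = random.uniform(0.0, sum(fitnesses))
    cumulative = 0.0
    for individual, fit in zip(population, fitnesses):
        cumulative += fit
        if spin <= cumulative:
            return individual
    return population[-1]  # guard against floating point round-off

def adjusted_fitness(f):
    """Scaling of Equation 2.2 for minimisation problems: f' = 1 / (1 + f)."""
    return 1.0 / (1.0 + f)
```

For minimisation problems (e.g. prediction error), the raw errors would first be passed through `adjusted_fitness` so that smaller errors yield larger selection probabilities.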
The constant of proportionality is the inverse of the sum of the fitness values of the individuals in the current population; it serves to normalise the sum of the individual probabilities to one (Equation 2.1).

    p(J_k) = f(J_k) / Σ_{j=1}^{N} f(J_j)        (2.1)

Here it is assumed that all N fitness values are greater than zero, and that larger fitness values correspond to better individual performance. If smaller fitness values correspond to better individual performance (e.g. when minimising prediction errors in data modelling) then the following scaling is often used (e.g. McKay et al., 1996).

    f′(J_k) = 1 / (1 + f(J_k))        (2.2)

This adjusted fitness value can then be used in place of f(J_k) in Equation 2.1. Blickle and Thiele (1995) point out that there are properties of fitness proportionate selection that make it undesirable for general use as a selection mechanism in evolutionary algorithm applications. The main problem is that it is not translation invariant with respect to the raw fitness values. This means that as the evolutionary algorithm progresses it is difficult to ascertain, with any certainty, the level of selection pressure imposed on the population. There are also problems associated with the use of the inversion function of Equation 2.2 when the goal of the search is lower fitness values. In particular, when the f(J_k) values are small (< 0.1), Equation 2.2 has the effect of compacting the f′(J_k) values into the interval [0.9, 1]. The problem is then that the selection probabilities tend to equalise as the algorithm progresses and the raw fitness values become smaller, reducing the driving force behind the algorithm so that it is very
difficult to exploit the better individuals preferentially to the poorer ones. Appropriate pre-scaling of the raw fitness values can, in principle, be used to remove this problem but, in general, the use of fitness proportionate selection is fraught with difficulties and is best avoided.

2.3.2 Ranking selection

The problems associated with the use of fitness proportionate selection can be overcome by the use of ranking selection mechanisms (Grefenstette and Baker, 1989). Once the N raw fitness values have been calculated for each individual in the population, they are sorted so that the best individual has the rank N and the worst the rank 1. The rank values can then be used in place of the raw fitness values in Equation 2.1. This has the effect of imposing a selection pressure over the population that varies in a linear manner and is independent of the absolute values of the fitness measurements. One problem that can occur with this method is that multiple individuals with the same raw fitness value are ranked differently. The rank assigned to these individuals is then an artefact of the sorting algorithm used. This could seriously bias the selection procedure in cases where there are a relatively large number of individuals in the population with equal fitnesses. The ranking method can, however, be modified so that individuals with equal fitness values are given the same rank. This is accomplished by performing the normal linear ranking procedure and then, for each group of individuals that exhibit equal fitnesses, assigning the mean rank of that group to each of the individuals within it. For example, in a population of 10 individuals with unique raw fitness values, the best individual would be assigned rank 10 and the worst, rank 1.
If, however, the individuals with ranks 8, 7 and 6 actually had equal raw fitness, then these individuals would each be assigned the modified rank of (8 + 7 + 6)/3 = 7. This form of modified ranking is the selection mechanism adopted for the work described in this thesis.
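The modified ranking scheme can be sketched as follows (a minimal illustration assuming larger fitness values are better; the function name is mine):

```python
def tied_ranks(fitnesses):
    """Rank fitnesses from 1 (worst) to N (best); tied values share the
    mean rank of their group, e.g. would-be ranks 8, 7 and 6 all become 7.
    """
    n = len(fitnesses)
    order = sorted(range(n), key=lambda i: fitnesses[i])  # worst first
    ranks = [0.0] * n
    i = 0
    while i < n:
        j = i
        # Find the run of individuals sharing this fitness value.
        while j < n and fitnesses[order[j]] == fitnesses[order[i]]:
            j += 1
        mean_rank = (i + 1 + j) / 2.0  # mean of ranks i+1 .. j
        for k in range(i, j):
            ranks[order[k]] = mean_rank
        i = j
    return ranks
```

The resulting ranks can then be used in place of the raw fitness values when computing selection probabilities.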
2.3.3 Tournament selection

An alternative selection method that has proved popular is tournament selection. It is similar to ranking selection in that it also overcomes the problems associated with fitness proportionate selection by decoupling the distribution of selection probabilities from the absolute distribution of raw fitness values. The method is analogous to that sometimes observed in nature where individuals directly compete for the right to mate. Rather than directly sorting all of the individuals in the population according to fitness, a tournament group of size N_t is formed by randomly selecting individuals from the population. The tournament group is then ranked according to fitness and the best individual in the group is selected. Tournament selection can be regarded as a probabilistic version of ranking selection (Koza, 1992) and in the case of N_t = 2 the two techniques are mathematically equivalent (Blickle and Thiele, 1995). Larger tournament sizes increase the selection pressure on the best individuals in the population: in the degenerate case of N_t = N the best individual in the population is always selected, leading to a massive loss of diversity in the next generation of the evolutionary algorithm. In the EC literature, tournament sizes of 4-6 are commonly reported. In some applications of evolutionary algorithms, e.g. algorithms involving a high degree of parallelisation over a number of processing nodes, tournament selection is preferred over ranking selection because no centralised sorting procedure is required.

2.4 Notes on evolutionary computational methodology

Bäck et al.
(1997), in a recent review of the status and history of EC, contend that it can often be useful to view EC as a general framework of related concepts that can be tailored to the user’s application, rather than a pre-defined collection of algorithms that can be bolted on to a domain specific problem without consideration of the issues involved. This is worth bearing in mind when attempting to describe any particular subgroup of evolutionary algorithms: ultimately the form of the algorithm and the problem representation that is used are not uniquely defined by the application. An adaptive, incremental approach to the design is required as well as a
willingness to utilise heuristic and qualitative arguments in the construction of the solution. Barto (1990) states that whilst traditional engineering methods tend to deal with quantities and concepts that are of low dimensionality and natural to the engineer, connectionist methods (e.g. artificial neural networks) tend to employ “expansive” representations, in which the representation of the problem is apparently of a higher dimension than the problem requires. This property is also shared by a number of representations common in EC, such as genetic programming. The expansive representation is an underdetermined one and therefore researchers have a large amount of freedom in implementing a design. There is, however, the accompanying burden that there are no clearly defined procedures for an EC design, and the formulation used does not necessarily have the regular mathematical properties, such as linearity, determinism, stability and convergence, that traditional engineering methods display.

2.5 Genetic algorithms

Genetic algorithms are, perhaps, the best known type of evolutionary algorithm. They have gained a reputation for being both robust and relatively easy to implement (Goldberg, 1989). This is borne out by the degree of use that genetic algorithms have recently seen in a number of diverse research areas: e.g., genetic algorithms have been used to optimise the design of plastic extruder dies where gradient based techniques were found to be too inefficient (Chung and Hwang, 1997). Moros et al. (1996) used genetic algorithms to generate initial parameter estimates for kinetic models of a methane dehydrodimerisation process. They found that this reduced overall computing time and increased the reliability of the model parameter solutions. Genetic algorithms have also been applied to a number of medical imaging problems with a good deal of success; e.g. Handels et al.
(1999) report on different methods to recognise malignant melanomas automatically by extracting features from skin surface profiles. The genetic algorithm method performed best with a 97.7% successful classification performance on unseen skin profiles.
In view of the fact that the focus of this thesis, genetic programming, is seen by many as an extension of the basic genetic algorithm, fundamentally employing the same mechanisms but with greater representational flexibility, the following sections summarise the basic concepts of genetic algorithms and the theories behind their efficacy.

2.5.1 Background

Although there had been interest in the modelling and simulation of population genetics around the same time as the general field of evolutionary computation was founded, it was not until John Holland published the landmark text “Adaptation in Natural and Artificial Systems” (Holland, 1975) that the advantages of using genetics as a general model for adaptation in non-biological systems became apparent to a wider audience. In the most widely used form of the genetic algorithm, the standard binary crossover genetic algorithm (SGA), each individual within the population consists of a string of binary digits. This bit string is a discrete combinatorial representation of a solution to the problem being examined, meaning that the entire search space can be represented by the (finite) available combinations of the bits. In the simplest case, the bit string is usually a direct binary encoding of a real valued parameter. However, other more mechanistic representations are possible, wherein the order of the bits represents the nature of the interactions in some entity with modular characteristics, e.g. in genetic algorithm based classifier systems. The following sections introduce the basic mechanisms of genetic algorithms.

2.5.2 Reproduction operators in genetic algorithms

For evolution to occur there must be cumulative selection over a number of generations, coupled with the property that small variations in the genotype sometimes produce improvements in the individual. The selection mechanisms (fitness proportionate selection, ranking etc.)
are largely independent of the representation of the individual, but the reproduction operators used must be
designed appropriately. The reproduction operators most often used in binary bit-string genetic algorithms (direct reproduction, point mutation and single point crossover) are inspired by the recombinative processes that enable adaptation in the natural world. A number of alternative reproduction operators have been proposed for use with binary genetic algorithms, e.g. multi-point crossover (De Jong, 1975), but they are generally simple adaptations or hybrids of the basic single point crossover and mutation methods and, as a rule, have not been adopted by the bulk of GA practitioners. In constructing a new population, the reproduction operator to be used is picked based on the probabilities P_c (probability of crossover), P_m (probability of mutation) and P_r (probability of direct reproduction), where P_c + P_m + P_r = 1. These are algorithm control parameters and must be set by the user. (The rate of crossover tends to dominate the recombination process in most applications, with direct reproduction and mutation being used as “background” operators.) The selection mechanism is then used to select an individual (or two individuals in the case of crossover) and the appropriate reproduction operation is performed. The parent(s) are left in the current population and are available for reselection. The offspring are inserted into the new population.

2.5.2.1 Single point crossover

The single point crossover operator is analogous to the exchange of genetic information (stored on chromosomes) that occurs during sexual reproduction in nature. A new individual is created by recombining two complementary fragments of the parent bit strings, thereby testing new individuals that retain characteristics of both parents. Because the standard genetic algorithm operates over fixed length linear vectors, the fragment sizes must be constrained so that their combination results in an individual of the same length.
This is accomplished by randomly picking a crossover point, and applying it to both parents to create two new offspring. Figure 2.1 depicts this process.
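The crossover operation just described can be sketched as follows (a minimal illustration over Python strings of '0'/'1' characters; the function name is mine):

```python
import random

def single_point_crossover(parent1, parent2):
    """Recombine two equal-length bit strings at one random point.

    Each offspring keeps the head of one parent and the tail of the
    other, so both offspring have the same length as the parents.
    """
    assert len(parent1) == len(parent2) >= 2
    point = random.randint(1, len(parent1) - 1)  # never at the string ends
    offspring1 = parent1[:point] + parent2[point:]
    offspring2 = parent2[:point] + parent1[point:]
    return offspring1, offspring2
```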
Figure 2.1 Single point crossover in standard binary genetic algorithms (bit-string diagram not reproduced: a crossover point is chosen at random and the head and tail segments of the two parents are exchanged to form two offspring)

2.5.2.2 Mutation

The point mutation operator is analogous to the random biological mutations that infrequently occur on DNA molecules. It is typically applied with much lower frequency than the crossover operator. The mutation operator is applied to a single parent by randomly selecting a bit and then flipping it (see Figure 2.2).

Figure 2.2 Single point mutation in standard binary genetic algorithms (bit-string diagram not reproduced: a single randomly selected bit in the parent string is inverted)

2.5.2.3 Direct reproduction

Direct reproduction is carried out simply by copying the selected individual into the next generation with no modification of its bit structure. The purpose of this operator is to promote the propagation of successful individuals through to future generations
in such a way that they are immune to the (possibly) harmful effects of mutation and crossover events. An “elitist” selection scheme can also be employed to protect the best individuals in the current population. The difference is that direct reproduction is applied probabilistically, so whilst it is extremely likely that the best individuals of the population will be carried over to the next population, there is no guarantee; elitist selection acts as a safeguard. The simplest way to implement it is to copy the top, say, five per cent of the current population into the new population before embarking on the ordinary probabilistic selection/reproduction mechanisms. The 5% elitist method is used in all runs described in this thesis.

2.5.3 Genetic algorithm flowsheet

The overall operation of a standard genetic algorithm can be represented by the flowsheet in Figure 2.3. The flowsheet shows that the genetic algorithm is essentially very simple to operate, consisting of straightforward selection and bit string manipulation mechanisms. The user must supply the various initialisation parameters (the first block in the flowsheet), e.g. the population size, the encoding scheme, the termination criterion, the reproduction operator frequencies etc. The best way to determine these factors is by referring to existing literature describing a related problem and using the reported values as default settings. Subsequent experimentation with these settings should eventually yield satisfactory results, although it is usually impractical to determine the optimal settings for non-trivial problems. The user must also supply a set of functions that decodes a candidate individual, evaluates it and then returns a numerical measure of its quality. Note that the process shown in Figure 2.3 is a simplified version of the genetic algorithm.
It does not include provision for selection method variants such as elitist selection. Other details of the standard algorithm have also been omitted for the sake of clarity.
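The 5% elitist carry-over described in Section 2.5.2.3 can be sketched as follows (an illustrative helper, not code from the thesis; larger fitness is assumed better):

```python
def elitist_survivors(population, fitnesses, fraction=0.05):
    """Return the top `fraction` of the population, to be copied directly
    into the next generation before probabilistic selection begins.

    At least one individual is always kept.
    """
    n_elite = max(1, int(len(population) * fraction))
    ranked = sorted(range(len(population)),
                    key=lambda i: fitnesses[i], reverse=True)
    return [population[i] for i in ranked[:n_elite]]
```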
Figure 2.3 Flowsheet of a standard genetic algorithm (diagram not reproduced: after initialisation of the control parameters, N individuals are randomly generated and evaluated against the fitness function f; until the termination condition is satisfied, a new population of N individuals is built by repeatedly choosing crossover, mutation or direct reproduction according to the probabilities P_c, P_m and P_r, probabilistically selecting one or two parents as appropriate, and adding the offspring to the new population)
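The generation-building loop of Figure 2.3 can be sketched end to end; this is a minimal illustration over bit strings (the default operator probabilities are placeholders, not values from the thesis, and elitist copying is omitted, as in the figure):

```python
import random

def next_generation(population, fitness_fn, select, pc=0.85, pm=0.05):
    """Build one new generation: crossover with probability pc, point
    mutation with probability pm, and direct reproduction otherwise.

    `select` is any selection mechanism taking (population, fitnesses)
    and returning one individual.
    """
    fitnesses = [fitness_fn(ind) for ind in population]
    new_pop = []
    while len(new_pop) < len(population):
        r = random.random()
        if r < pc:  # crossover: two parents give two offspring
            p1 = select(population, fitnesses)
            p2 = select(population, fitnesses)
            point = random.randint(1, len(p1) - 1)
            new_pop += [p1[:point] + p2[point:], p2[:point] + p1[point:]]
        elif r < pc + pm:  # point mutation: flip one randomly chosen bit
            p = select(population, fitnesses)
            i = random.randrange(len(p))
            new_pop.append(p[:i] + ("1" if p[i] == "0" else "0") + p[i + 1:])
        else:  # direct reproduction: copy the selected individual
            new_pop.append(select(population, fitnesses))
    return new_pop[:len(population)]  # trim if crossover overshot by one
```

Repeatedly applying `next_generation` with any of the selection mechanisms of Section 2.3 and a problem-specific fitness function gives the outer loop of the flowsheet.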
2.5.4 Genetic algorithms as function optimisers

Figure 2.4 shows an example of a partitioned binary bit string encoding of two real valued parameters that could be used in a function optimisation scenario. Typically this would be stated as “find the values of the parameters α and β that minimise (or maximise) some objective function f(α, β) subject to certain constraints.” In Figure 2.4, J_k is an individual in a population P(t) = {J_1, J_2, …, J_N} of N individuals at generation t. It is useful to clarify some of the terminology associated with genetic algorithms. The bit string in the above example is referred to as a chromosome and the contiguous bit sections corresponding to each parameter are referred to as genes. Each gene can take on a number of values, called alleles. The entire string, the genotype, can be regarded as the prescriptive structure responsible for the expressed parameter set (Goldberg, 1989). The genotype need not always be completely defined by the contents of one chromosome; multiple chromosomes can be used to encode the information in a modular form, allowing restrictions on the interchange of genetic information to be imposed during recombination. Genetic algorithms are discrete combinatorial processors, but many parameter optimisation problems are based on continuous real valued parameters. Certain trade-offs between precision and the size of the coding used must therefore be made. The number of bits chosen to represent each parameter depends on the range of admissible parameter values and the degree of precision required. Hence, prior knowledge of the range in which the optimal values fall (and the desired precision) is necessary when designing a binary bit string representation.
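The range/precision trade-off just described can be made concrete with a small decoding sketch (illustrative names; not code from the thesis):

```python
def decode_gene(bits, lo, hi):
    """Map an n-bit gene (a string of '0'/'1') to a real value in [lo, hi].

    With n bits the finest attainable resolution is (hi - lo) / (2**n - 1),
    so higher precision over a wider range costs string length.
    """
    n = len(bits)
    return lo + int(bits, 2) * (hi - lo) / (2 ** n - 1)
```

For instance, the 14-bit string of Figure 2.4 could hold two 7-bit genes, each decoded this way to one of 128 equally spaced values in its parameter range.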
If the function optimisation requires a high degree of precision over large parameter ranges then the length of the bit string becomes commensurately large, and the size of the space that the genetic algorithm has to search increases at an exponential rate. In the example in Figure 2.4 there are 14 bits in total, so there are 2^14 (16,384) distinct combinations of the bits. For such an example, the number of points in the search space is not huge and it could be successfully searched, providing the fitness function is not complex, using non-genetic methods in a reasonable amount of time, e.g. an exhaustive search. However, binary bit string lengths of 200 are quite
common in engineering applications (e.g. the optimisation of 20 real valued parameters simultaneously, each represented to 10 bit precision). The search space in this case is astronomically large, consisting of 2^200 (approximately 10^60) possible combinations of bits. An exhaustive search would be infeasible in this case: if one million points could be searched per second then it would still take around thirty billion, trillion, trillion, trillion years to complete. The fact that genetic algorithms can successfully search spaces of this size in far shorter times emphasises that genetic algorithms, although having a number of random elements in their operation, are not random walks through the search space.

Figure 2.4 Example of bit string in parameter encoding and evaluation (diagram not reproduced: a 14-bit string J_k is partitioned into two genes which a decoding function maps to the parameters α and β; the objective function evaluation then yields the fitness value f(J_k))

2.5.5 Underlying processes: the schema theorem

How do the partially randomised mechanistic operations involved in genetic algorithm processing enable a good quality solution, in the form of a particular sequence of bits, to be obtained from the enormous number of sequences available in a typical run?
The schema theorem (Holland, 1975) is one explanation, although often criticised, of how genetic algorithms process information and gradually progress towards a near-optimal solution. The basis of this theorem is that the genetic algorithm implicitly processes large numbers of candidate solutions in parallel by means of similarity templates (so called schemata). Schemata are notational devices that allow structural similarities in groups of solutions to be quantified, and it is thought that the genetic algorithm implicitly employs this structural similarity information in approaching a high quality solution structure. A corollary of the schema theorem is the building block hypothesis. This arises as a direct result of the schema theorem and implies that, for a GA to work efficiently, short bit string sections representing relatively successful partial solutions (i.e. the “building blocks”) must be combined in order to realise a global solution. Intuitively, and from a human perspective, this makes sense: often solutions to problems are found by applying successful solutions to related problems, or by breaking the problem down into smaller, more manageable problems and combining the partial solutions obtained. Although the schema theorem and the building block hypothesis are useful as visualisation tools in genetic algorithms, there are certain inconsistencies in the underlying assumptions, and many of the criticisms that have been directed at the schema theorem suggest that there may be processes at work in genetic algorithms that have yet to be adequately explained. Thornton (1997) summarises a number of the problems with the schema theorem and the building block hypothesis.

2.5.6 Towards flexible representation in genetic algorithms

It has long been recognised that the greatest shortcoming of the classical genetic algorithm is its lack of representational flexibility.
The straightforward coding scheme is sufficient for parameter optimisation problems, but for the more complex tasks of generalised machine learning it is a severe restriction. In these cases, a solution that adapts its own structure by progressively improving on previous structures would be highly desirable. One attempt to broach this representation problem for learning systems was the
learning classifier system (Holland and Reitman, 1978), based on the use of IF-THEN production rules coded as fixed length binary strings. Another example is the use of variable length strings, the so-called “messy genetic algorithm” introduced by Goldberg et al. (1989). Whilst these methods were not unsuccessful, it was still felt that the utility of genetic algorithms should be combined with higher order variable length structures, capable of allowing more complex interactions and amenable to the learning of general tasks.

2.6 Genetic programming

A number of researchers in the 1980s pursued the application of genetic algorithms to more complex structures: e.g., Cramer (1985) used a language consisting of loops and increments on variables to evolve solutions to a simple symbolic regression problem. The representation he used consisted of integer strings that could be decoded to form structured programs. Hicklin (1986) and Fujiki and Dickinson (1987) investigated the use of genetic reproduction operators in generating programs in a language called LISP (LISt Processing language). LISP is appropriate for the application of recombinative methods because groups of instructions and data are represented in a syntactically identical way, allowing parts of programs to be spliced into other programs in a manner resembling the bit string splicing in binary genetic algorithms. Most importantly, this is accomplished whilst still maintaining legal program syntax. Genetic programming was the logical progression from the work carried out on the application of genetic algorithms to higher order data structures. John Koza published a series of papers in the early 1990s, e.g. Koza (1990, 1991), that culminated in his extensively referenced text: “Genetic programming: on the programming of computers by means of natural selection” (Koza, 1992).
In it, Koza describes a wide array of problems, from various fields, that he uses genetic programming to solve: e.g. symbolic regression (evolving a model that best fits a set of input-output data), robotic planning (i.e. the “artificial ant” problem: the solution lies in evolving a program that guides an entity around a grid picking up all the items of ‘food’ in as few manoeuvres as possible), controller design (deriving a
computer program that brings a vehicle to rest in minimal time using an “on/off” control signal). Due to the varied applications described, and the relative ease with which the genetic programming algorithm can be implemented, Koza’s work was much more accessible to the research community than the existing approaches to machine learning. These had tended to rely heavily on formal inference, abstract symbol processing and impenetrable mathematical theorems and, hence, seemed far removed from being able to solve the sorts of problems that people wanted them to solve. Genetic programming, on the other hand, although initially only applied to trivial problems, gave impetus to the idea that artificial intelligence could be engineered from the ground up and set to work on scientific problems. Most of the work in genetic programming, both theory and application based, stems from the algorithms described in Koza’s book.

2.6.1 Program induction: parse trees as adaptable data structures

One of Koza’s main insights is the “pervasiveness of the problem of program induction”, i.e. that a very large number of problems can be solved with the use of a computer program of some description as an answer. Obtaining a suitable program for a given problem is what most scientists, engineers, economists etc. spend a great deal of their professional lives trying to accomplish. Generating and subsequently adapting program code, given a measure of program fitness (however implicitly defined), is what humans do to solve technical problems. The idea of using genetic methods to perturb, fragment and splice programs together to generate better programs is an appealing one.
What is less appealing is the perceived fragility of program code: most people know from experience that chopping and changing code in an ad hoc manner is unlikely to result in a program that actually executes without errors, much less give anything approaching the correct answer. However, the source code that one types in and the internal representation of code within a computer are vastly different. Most programs are internally represented as a parse tree: a data structure that represents a hierarchical
sequence of instructions in the form of an ordered tree. This representation of a program as an ordered tree structure strips away most of the clutter associated with the majority of computer languages. That which remains is the functional backbone of the program. Hence, the problem of cutting and splicing programs is vastly simplified. Given a few necessary assumptions and constraints (these will be described in the coming sections) the tree structure can be modified in an ad hoc manner yet still maintain internal syntactic consistency. Of course, it is highly unlikely that any one perturbation will result in a better program but genetic programming, like all evolutionary techniques, uses the cumulative effect of artificial selection to amplify the effects of the few modifications that do give slightly better results.

As an example of a program as a tree structure, consider the following simple piece of pseudo-code (a callable function named prog1 that accepts two real valued arguments a and b and returns a real argument c, the value of which depends on whether a or b is greater):

function [c] = prog1(a, b)
    if a <= b then
        c = a + b
    else
        c = a - b
    end

The same function can be represented as a rooted, ordered tree structure as depicted in Figure 2.5. The tree consists of two types of node: terminals and functionals. Terminal nodes are the “leaves” of the tree structure and typically represent items of program data (program inputs or constants). Functional nodes are the branch points within the tree; they are operators that are used to process terminal node values (and results from branches further down the tree). In the case of Figure 2.5 the terminals are the inputs to the program: the arguments a and b. The functional nodes are the addition operator, +, the subtraction operator, -, and the ‘IF THEN ELSE’ conditional operator designated by the tag IFLTE. The tree processes information as
follows: the data represented by the lowest (leaf) nodes are passed up the tree to the node immediately above them. At this point, they are operated on by functional nodes, e.g. the addition operator. Then the results of these calculations are passed up to the next node and so forth until the root node is reached. The final calculation ends here and this is usually designated as the overall program output. The ordering of the branches generally makes a difference to the structure of the program because of the way that some functional nodes are specified. E.g., the IFLTE node always has four input arguments, which are processed in the following way:

if (argument 1) ≤ (argument 2) then
    return (argument 3) as node output
else
    return (argument 4) as node output

All function nodes used in GP must be explicitly defined in this manner.

Figure 2.5 Tree structure of function prog1

Although a parse tree diagram gives a clear view of the processing hierarchy of a program, it is not amenable to direct computer manipulation. A more convenient notation for the trees used in genetic programming is that of prefix notation
(sometimes called Polish notation). In this form of notation, which is directly equivalent to a parse tree representation, functionals are represented by a symbol followed by the arguments in parentheses. E.g. the familiar algebraic expression a + b would be written as +(a b) in prefix notation. Note that the functional arguments can also be functions themselves: e.g. the expression a - (b + c) would become –(a +(b c)) in prefix notation. The pseudo-code function prog1 illustrated in Figure 2.5 can be written as: IFLTE(a b +(a b) –(a b)). (The computer language LISP, originally chosen for genetic programming, uses a variant of prefix notation, but virtually any high level language can be used if an appropriate interpreter is available. All of the GP runs in this thesis were performed using the MATLAB programming language to operate on ASCII coded prefix expressions.) It can be seen why tree structures are amenable to the problem of automatic program induction: sub-trees can be swapped from place to place, and nodes can be deleted and replaced with other nodes (or sub-trees), because the syntax that renders a program executable is inherent in the tree representation. Details of the genetic operators used, and some of the other details needed to set up a genetic programming experiment, are given in the following sections.

2.6.2 Reproduction operators in genetic programming

Three principal genetic reproduction operators are defined by Koza (1992) for genetic programming: direct reproduction, mutation and crossover (although Koza did not originally advocate the use of mutation, see Section 2.6.2.2). The concepts behind them are very similar to the operators used for binary bit string genetic algorithms.

2.6.2.1 GP crossover

Analogous to the method used in binary bit string genetic algorithms, GP crossover exchanges information between two chromosomes.
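To make the tree and prefix representations concrete, the following sketch stores prog1 as nested tuples and evaluates it recursively. It is written in Python rather than the MATLAB used for the runs in this thesis, and the nested-tuple encoding and node labels are illustrative assumptions, not the thesis’s actual implementation.

```python
def evaluate(node, env):
    """Recursively evaluate a prefix-form parse tree.

    Terminals are variable names (looked up in env) or numeric constants;
    functional nodes are tuples: (operator, argument, argument, ...)."""
    if isinstance(node, str):           # terminal: an input variable
        return env[node]
    if isinstance(node, (int, float)):  # terminal: a constant
        return node
    op, *args = node                    # functional node
    if op == "IFLTE":                   # if arg1 <= arg2 return arg3 else arg4
        a1, a2, a3, a4 = args
        return evaluate(a3, env) if evaluate(a1, env) <= evaluate(a2, env) \
            else evaluate(a4, env)
    if op == "+":
        return evaluate(args[0], env) + evaluate(args[1], env)
    if op == "-":
        return evaluate(args[0], env) - evaluate(args[1], env)
    raise ValueError(f"unknown functional node {op!r}")

# prog1 from the text, in prefix form: IFLTE(a b +(a b) -(a b))
prog1 = ("IFLTE", "a", "b", ("+", "a", "b"), ("-", "a", "b"))

print(evaluate(prog1, {"a": 2, "b": 5}))   # a <= b, so a + b = 7
print(evaluate(prog1, {"a": 5, "b": 2}))   # a > b,  so a - b = 3
```

Evaluation proceeds exactly as described above: leaf values flow upwards through the functional nodes until the root (here, the IFLTE node) yields the program output.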
Unlike binary genetic algorithms, there is no theoretical restriction on the sizes of the sections of the
chromosome being exchanged (considerations such as computer memory and available processing time, however, mean that the practical implementation of crossover will have an upper limit on the new tree sizes). As an example, the following two simple programs will be shown undergoing GP crossover. In the context of a GP run it is assumed that these two programs are population members that have been chosen by means of an appropriate selection mechanism.

Parent 1: Output = 3 + (a – b)
Parent 2: Output = b√(1 + c)

In this example the terminals a, b and c are input variables. The other nodes used are the terminal constants 1 and 3 and the addition, subtraction and square root functions. Figure 2.6 illustrates Parent 1 and Parent 2 undergoing the crossover process. The subtrees selected (randomly) for crossover are shaded.

Figure 2.6 Example of crossover in genetic programming.
The results of this process are the two programs shown below:

Offspring 1: Output = (1 + c) + (a – b)
Offspring 2: Output = b√3

In an actual GP run, these two offspring would be inserted into the new population and subsequently evaluated to determine their fitness. The crossover mechanism is easily implemented by computer using prefix notation, in which the crossover subtrees are highlighted in boldface:

Parent 1 (prefix): +(3 –(a b))
Parent 2 (prefix): *(b SQRT(+(1 c)))
Offspring 1 (prefix): +(+(1 c) –(a b))
Offspring 2 (prefix): *(b SQRT(3))

The process illustrated in Figure 2.6 is the most commonly implemented crossover operator in the GP literature. It has, however, been strongly criticised because, although superficially similar, it operates in a fundamentally different manner to GA crossover and, indeed, biological gene crossover. For instance, Francone et al. (1999) state that biological crossover (and GA crossover) typically exchanges genes that are at the same position on the chromosome and that these genes have functional similarity. This is not the case for exchanged subgroups (genes) in GP crossover. For this reason it has been claimed that GP crossover is, in fact, no more than a “macro-mutation” operator (Angeline, 1997). Koza, however, has responded to those that have claimed that crossover is unnecessary (Koza, 1999). He presents several experiments clearly illustrating that GP performs poorly without crossover. Other statistical studies support this view, e.g. Luke and Spector (1998) and Hiden
(1998) show that GP runs generally benefit from the use of the crossover operator.2

2 It should be noted that, for any search algorithm, there are no optimal neighbourhood search operators for all problems. This is due to the implications of the No Free Lunch (NFL) theorem (Wolpert and Macready, 1997), which states that, averaged over all possible search problems, no search algorithm is better than any other.

2.6.2.2 GP Mutation

In a manner analogous to its GA counterpart, the purpose of the GP mutation operator is to improve population diversity by generating entirely new chromosome segments, in order to explore new regions of the search space. Like the GA mutation operator, it is a form of asexual reproduction based on only one parent. As an example, consider that the following program has been selected from the existing GP population based on its fitness:

Parent: Output = (1 + c) + (a – b)

Figure 2.7 demonstrates a mutation operation on this program. First, a mutation node is randomly selected, and then the corresponding subtree (with the mutation node as its root) is deleted. Finally, a new subtree is randomly generated (in a manner similar to that employed when generating the initial GP population) and inserted in the place of the deleted subtree. For the example given, the offspring program resulting from the mutation operation is shown below:

New subtree: Output = log10(1.2/c)
Offspring: Output = (c + log10(1.2/c)) + (a – b)

Again, the computer implementation of this operation is based on prefix notation as shown below; the deleted and new substructures are once again highlighted in boldface:
Parent (prefix): +(+(c 1) –(a b))
New subtree (prefix): log10(÷(1.2 c))
Offspring (prefix): +(+(c log10(÷(1.2 c))) –(a b))

Note that, unlike crossover, the mutation operator introduces a subtree that contains structures not necessarily present in the parent. In this case, the functional nodes log10 and ÷ (divide) as well as the terminal constant 1.2 have been used. The definition of terminal and functional sets, and how they are used in tree generation, will be discussed further in Section 2.6.4.

Figure 2.7 Example of subtree mutation in genetic programming.
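The subtree crossover and subtree mutation operators described above can be sketched as follows. This is Python for illustration, not the thesis’s MATLAB implementation: trees are assumed to be nested lists with the functional name first, and the function and terminal sets are illustrative only.

```python
import random

FUNCS = {"+": 2, "-": 2, "*": 2, "SQRT": 1}   # name -> number of arguments
TERMS = ["a", "b", "c", 1, 3]                  # illustrative terminal set

def paths(tree, prefix=()):
    """Enumerate the path to every node; a path indexes into nested lists."""
    yield prefix
    if isinstance(tree, list):
        for i, child in enumerate(tree[1:], start=1):
            yield from paths(child, prefix + (i,))

def subtree_at(tree, path):
    for i in path:
        tree = tree[i]
    return tree

def splice(tree, path, new):
    """Return a copy of `tree` with the subtree at `path` replaced by `new`."""
    if not path:
        return new
    out = list(tree)
    out[path[0]] = splice(tree[path[0]], path[1:], new)
    return out

def crossover(p1, p2, rng):
    """Exchange randomly chosen subtrees between two selected parents."""
    c1 = rng.choice(list(paths(p1)))
    c2 = rng.choice(list(paths(p2)))
    return splice(p1, c1, subtree_at(p2, c2)), splice(p2, c2, subtree_at(p1, c1))

def random_tree(rng, depth=2):
    """Grow a random subtree, terminating with terminals at depth 0."""
    if depth == 0 or rng.random() < 0.3:
        return rng.choice(TERMS)
    op = rng.choice(sorted(FUNCS))
    return [op] + [random_tree(rng, depth - 1) for _ in range(FUNCS[op])]

def mutate(tree, rng):
    """Delete a randomly chosen subtree and insert a freshly grown one."""
    return splice(tree, rng.choice(list(paths(tree))), random_tree(rng))

# The crossover example from the text: +(3 -(a b)) and *(b SQRT(+(1 c)))
parent1 = ["+", 3, ["-", "a", "b"]]
parent2 = ["*", "b", ["SQRT", ["+", 1, "c"]]]
offspring1, offspring2 = crossover(parent1, parent2, random.Random(0))
```

With the subtree choices shaded in Figure 2.6, splice(parent1, (1,), ["+", 1, "c"]) reproduces Offspring 1 exactly: ["+", ["+", 1, "c"], ["-", "a", "b"]], i.e. (1 + c) + (a – b). Because splice copies each list along the path, the parents survive unmodified, matching the usual GP convention of inserting new offspring into the next population.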
Originally, Koza maintained that crossover and direct reproduction are the only reproduction operators that are required to complete a successful GP design (Koza, 1992). He argued that a mutation operator is necessary in the case of GAs because crossover by itself only recombines bit string sections of the original population that are associated with high performance; hence mutation is needed to restore the loss of diversity that accompanies this process. Koza goes on to state that, for GP, this is not the case, as the crossover operator combines “genes” in a functionally more flexible way than GAs and so the equivalent loss of diversity should not occur. While this seems to be a valid theoretical consideration, the informal consensus among GP practitioners is that the use of mutation (at relatively low rates of about 10% or less) assists the evolution process. This is borne out by a number of statistical studies on a variety of simple GP applications (e.g. Luke and Spector, 1998; Hiden, 1998). Hence, all the GP runs described in this thesis use the standard sub-tree mutation operator.

2.6.3 Specifying a genetic program

Koza (1994) describes six steps necessary to define a genetic program. These are:

1) Terminal set selection: choosing the variables that are needed to solve the problem.
2) Functional set selection: choosing the functions that are needed to operate on the variables in order to solve the problem.
3) Fitness function specification: choosing appropriate tests of the evolved programs.
4) Run control parameter settings: choosing parameter values for the GP run, e.g. rates of mutation, crossover and direct reproduction.
5) Termination criterion specification: setting a condition to terminate the run.
6) Program architecture specification: deciding how the tree (or a number of trees in some cases) is decoded into an individual test program for evaluation, and what reproduction operators are used to alter the architecture of the tree(s).

Some of these steps are often trivial. Determining the termination criterion, for example, is usually a simple process, e.g. stop after a certain number of generations have elapsed or stop when a solution of a high enough quality is found. Specifying an appropriate fitness test suite and appropriate program components (i.e. terminal and functional nodes), however, can be a difficult task if the designer is to ensure that the evolved programs can solve the set problem in a desirable manner. The remainder of the chapter highlights, briefly, a number of issues pertaining to the use of GP in solving engineering and science problems. For a more general, and thorough, overview of GP, see Koza (1999).

2.6.4 Terminals and functions: specifying the program components

The specification of a “toolbox” of components, the terminals and functions, which can be subsequently manipulated by simulated genetic processes into a working program, is the first step that calls upon the human user’s knowledge of the problem. The user will have a good idea, at this stage, what they want the evolved programs to accomplish and will have some degree of knowledge as to what information must be manipulated to solve the problem. The terminal set and the functional set must exhibit the joint property of sufficiency (Koza, 1992), i.e. out of all the trees that can be constructed from them, there must be at least one that is capable of expressing the actual solution to the problem. In addition to the sufficiency requirement, it is necessary that the terminal and function sets in GP exhibit the property of closure (Koza, 1992).
This simply means that any tree generated from these sets must be syntactically valid. Closure is attained by ensuring that all functions and terminals return values of the same data type (e.g. Boolean). This is usually straightforward to achieve for many engineering
applications: the terminals and constants will generally be of the scalar floating point type, and the standard arithmetical functions will return a value of the same type as the input arguments. However, there are a few minor exceptions: the floating point division operator will not return a floating point value if both input arguments are zero3, the square root operator will return a complex value if its input argument is less than zero. Similarly, the natural logarithm operator will return a complex value if its input argument is less than zero, or will return a value of “infinity” or “undefined” (depending on the computing language being used) if its input is zero. These problems can be sidestepped by taking a few liberties with the definition of some mathematical functions in order to maintain data type consistency. For instance, the division operator can be redefined so that it returns a zero when both arguments are zero and the square root operator can be redefined to return the positive root of the absolute value of its input argument. In the GP literature, this is commonly referred to as “protecting” functions; the redefined division operator is referred to as protected division, the redefined natural logarithm as protected natural log and so forth.

2.6.5 Handling constants in genetic programming

Although it is often possible to specify what inputs will be required for an evolved program to solve a problem, it is not generally possible to know in advance what constants are required. Exceptions to this rule are Boolean problems, modular problems (e.g. clock arithmetic) and the like. Engineering and scientific problems, in general, require the use of non-integer real constant terms in their solutions, but how can one incorporate these in a terminal set without knowing their exact values beforehand?
The standard GP solution to this is known as the ephemeral random constant (ERC) method (Koza, 1992) and its use, although often augmented by other methods of determining constant values, is widespread.

3 Many computing languages will simply halt program execution and return a ‘division by zero’ error at this point. MATLAB returns the value ‘NaN’ (not a number); this effectively renders any parse tree meaningless, as all further operations on ‘NaN’ result in ‘NaN’.
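The function “protection” described in Section 2.6.4 can be sketched as below. Python is used for illustration and the names pdiv, psqrt and plog are not from the thesis. The text defines protected division only for the 0/0 case; here any zero denominator returns zero, an assumption made so that the sketch avoids the language-dependent division-by-zero behaviour noted in footnote 3. The value returned by the protected log at zero is likewise an assumed convention.

```python
import math

def pdiv(x, y):
    """Protected division: the text redefines 0/0 := 0; here (an assumption)
    any zero denominator returns 0, so the sketch never raises or yields NaN."""
    return 0.0 if y == 0 else x / y

def psqrt(x):
    """Protected square root: positive root of the absolute value."""
    return math.sqrt(abs(x))

def plog(x):
    """Protected natural log: log of |x|, with 0 at x = 0 (assumed
    convention, chosen to keep the output a real float)."""
    return 0.0 if x == 0 else math.log(abs(x))

print(pdiv(0.0, 0.0))   # 0.0 rather than NaN or an error
print(psqrt(-4.0))      # 2.0 rather than a complex number
```

Because every protected operator maps floats to floats, any tree built from them satisfies the closure property regardless of the argument values it encounters.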
The ERC method is actually very simple to implement: a special terminal R is added to the existing terminal set and used in an identical way to the other terminals when generating the initial population (i.e. generation 0) of random program trees. At this point, each occurrence of R in the population can be considered a placeholder for an, as yet, unknown constant. Then, before the tree is inserted into the initial population, each instance of R is replaced by a randomly generated real constant from a user-defined range, e.g. [-1, 1]. Once the initial population has been generated, the values of the constant nodes are fixed throughout the run; it is only before insertion into generation 0 that the placeholder R is used. However, many researchers who have investigated the use of GP for data modelling purposes have asserted that this method of handling constants is inadequate and not numerically efficient. Other methods of handling constants have been suggested and these are discussed more fully in the survey of GP for process modelling purposes in Chapter 3.

2.6.6 Multiple populations

Although the basic GP formulation uses a single population of individuals, it is possible to distribute the population by employing several sub-populations (called “demes”) that are evolved in isolation except for the periodic exchange (migration) of individuals from a deme to one or more other demes. This is done to prevent the premature convergence, due to lack of diversity, which can occur in the single population algorithm. This scheme also has the advantage that it is suited to GP performed over parallel processing units because the fitness evaluations and selection are performed separately on each processing unit and the only communications between these units are the periodic transfer of migrating individuals (Koza, 1995).
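The ERC placeholder scheme described in Section 2.6.5 can be sketched as follows. Python and a nested-list tree representation are assumptions for illustration; the [-1, 1] range is taken from the text’s example.

```python
import random

def instantiate_ercs(tree, rng, lo=-1.0, hi=1.0):
    """Replace each occurrence of the placeholder terminal 'R' with a randomly
    generated real constant from [lo, hi]; the constants are then fixed for
    the remainder of the run."""
    if tree == "R":
        return rng.uniform(lo, hi)
    if isinstance(tree, list):                        # functional node
        return [tree[0]] + [instantiate_ercs(c, rng, lo, hi) for c in tree[1:]]
    return tree                                       # ordinary terminal

# A generation-0 tree drawn with the placeholder R still in place
template = ["+", "R", ["*", "x", "R"]]
tree = instantiate_ercs(template, random.Random(42))
```

Each instance of R is replaced independently, so the two placeholders above generally receive different constants, exactly as when two ERC nodes appear in one generation-0 tree.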
2.7 Summary

This chapter has introduced the field of evolutionary computation in an engineering setting and has focussed primarily on the mechanisms involved in selection, and the reproduction operators used in genetic algorithms and genetic programming. A number of issues relating to implementation of genetic programming to solve engineering problems have also been addressed. A discussion of the use of genetic
programming as a tool for data based modelling is presented in Chapter 3.
Chapter 3

3 Genetic programming as a modelling tool