Non-linear PLS using Genetic Programming
Dominic Searson
School of Chemical Engineering and Advanced Materials
University of Newcastle upon Tyne
An online version of a thesis submitted to the Faculty of Engineering, University of
Newcastle upon Tyne in 2002, in partial fulfilment of the requirements for the
degree of Doctor of Philosophy.
© Dominic Searson 2002-2005
Preface to online version
This version is virtually identical to the original, but the formatting is wrong so it
may look a bit odd in places. I blame Word 2003 for this. Also, the page numbers
are different to the original.
Abstract
The economic and safe operation of modern industrial process plants usually
requires that accurate models of the processes are available. Unfortunately, detailed
mathematical models of industrial process systems are often time consuming and
expensive to develop. Consequently, the use of data based models is often the only
practical alternative. The need for effective methods to build accurate data based
models with a minimum of specialist knowledge has given impetus to the research
of automatic model development methods. One method, genetic programming (GP),
which is an evolutionary computational technique for automatically learning how to
solve problems, has previously been identified as a candidate for automatic non-
linear model development. GP has also been combined with a multivariate statistical
regression method called PLS (partial least squares) in order to improve its
performance (GP-PLS). One version of this method, called GP_NPLS2, was found
to give good performance but at a computational expense deemed too high for use as
a modelling tool. In this thesis, the GP-PLS framework is developed further. A novel
architecture, called team based GP-PLS, is proposed. This method evolves teams of
co-operating sub-models in parallel in an attempt to improve modelling performance
without incurring significant additional computational expense. The performance of
the team based method is compared with the original formulations of GP-PLS on
steady state data sets from three synthetic test systems. Subsequently, a number of
other modifications are made to the GP-PLS algorithms. These include the use of a
multiple gene sub-model representation and a novel training method used to
improve the ability of the evolved models to generalise to unseen data. Finally, an
extended team method that encodes certain PLS parameters (the input projection
weights) as binary team members is presented. The extended team method allows
the optimisation of the sub-models and the projection weights simultaneously
without recourse to computationally expensive iterative methods.
Table of Contents
Nomenclature ............................................................................................................ 7
1 Introduction...................................................................................................... 11
1.1 Background.................................................................................................. 12
1.2 Data based modelling .................................................................................. 13
1.2.1 Linear regression methods .................................................................... 13
1.2.2 Automatic model development............................................................. 13
1.2.3 Combining GP with PLS (GP-PLS) ..................................................... 14
1.3 Thesis aims and outline ............................................................................... 15
2 Evolutionary computation and genetic methods........................................... 18
2.1 Introduction ................................................................................................. 19
2.1.1 What is evolutionary computation? ...................................................... 19
2.2 EC algorithms - basic structure and functionality ........................................ 20
2.2.1 Historical perspective............................................................................ 21
2.2.2 Context of EC in engineering search and optimisation ........................ 22
2.2.3 Evolvability and the choice of representation....................................... 24
2.3 Selection mechanisms.................................................................................. 25
2.3.1 Fitness proportionate selection ............................................................. 26
2.3.2 Ranking selection.................................................................................. 28
2.3.3 Tournament Selection........................................................................... 29
2.4 Notes on evolutionary computational methodology.................................... 29
2.5 Genetic algorithms....................................................................................... 30
2.5.1 Background........................................................................................... 31
2.5.2 Reproduction operators in genetic algorithms ...................................... 31
2.5.3 Genetic algorithm flowsheet................................................................. 34
2.5.4 Genetic algorithms as function optimisers............................................ 36
2.5.5 Underlying processes: the schema theorem.......................................... 37
2.5.6 Towards flexible representation in genetic algorithms......................... 38
2.6 Genetic programming.................................................................................. 39
2.6.1 Program induction: parse trees as adaptable data structures................. 40
2.6.2 Reproduction operators in genetic programming ................................. 43
2.6.3 Specifying a genetic program ............................................................... 48
2.6.4 Terminals and functions: specifying the program components ............ 49
2.6.5 Handling constants in genetic programming ........................................ 50
2.6.6 Multiple populations............................................................................. 51
2.7 Summary...................................................................................................... 51
3 Genetic programming as a modelling tool..................................................... 53
3.1 Introduction ................................................................................................. 54
3.2 Representational approaches ....................................................................... 55
3.2.1 Steady state modelling.......................................................................... 55
3.2.2 Dynamic modelling............................................................................... 57
3.2.3 Process knowledge................................................................................ 59
3.2.4 Multigene representations..................................................................... 60
3.3 Improving the numerical properties of GP.................................................. 60
3.3.1 ERC optimisation methods ................................................................... 60
3.3.2 Code growth and parsimony pressure................................................... 63
3.3.3 Multi-objective fitness functions .......................................................... 65
3.4 Summary...................................................................................................... 66
4 Non-linear PLS using genetic programming ................................................. 67
4.1 Introduction ................................................................................................. 68
4.2 Linear regression methods........................................................................... 69
4.2.1 Multiple linear regression ..................................................................... 69
4.2.2 Principal component regression (PCR)................................................. 71
4.2.3 Projection to latent structures (PLS)..................................................... 75
4.3 The NIPALS algorithm ............................................................................... 81
4.3.1 Single output linear PLS....................................................................... 83
4.3.2 Determining the number of latent variables in linear PLS ................... 83
4.4 Non-linear PLS............................................................................................ 84
4.4.1 Non-linear polynomial PLS methods.................................................... 84
4.4.2 Non-linear artificial neural network PLS methods ............................... 88
4.4.3 Projection pursuit regression................................................................. 90
4.4.4 Overfitting in non-linear modelling...................................................... 91
4.5 GP-PLS........................................................................................................ 92
4.5.1 Introduction........................................................................................... 92
4.5.2 SGP-PLS configuration ........................................................................ 93
4.5.3 Team based GP-PLS............................................................................. 97
4.5.4 TGP-PLSa configuration .................................................................... 100
4.5.5 Determining the number of latent variables in GP-PLSa algorithms . 101
4.6 Summary.................................................................................................... 102
5 Comparing team and sequential GP-PLS.................................................... 106
5.1 Introduction ............................................................................................... 107
5.2 Test System 1: simulated cooking extruder............................................... 107
5.2.1 Experimental details............................................................................ 109
5.2.2 Interpretation of experimental data..................................................... 110
5.2.3 Qualitative comparisons between SGP-PLSa and TGP-PLSa models .... 114
5.3 Test system 2: the non-linear Cherkassky function................................... 117
5.3.1 Interpretation of experimental data..................................................... 119
5.4 Test system 3: simulated pH process......................................................... 124
5.4.1 Interpretation of experimental data..................................................... 125
5.5 Use of validation data to avoid overfitting................................................ 129
5.5.1 Split-sample validation ....................................................................... 129
5.5.2 Retrospective early-stopping (RES) ................................................... 131
5.5.3 Further analysis of a GP-PLSa model................................................. 134
5.6 Analysis of computational costs................................................................ 139
5.7 Summary.................................................................................................... 140
6 Multigene GP-PLS ......................................................................................... 142
6.1 Introduction ............................................................................................... 143
6.1.1 The multigene concept........................................................................ 143
6.1.2 Multigene GP for system identification.............................................. 144
6.1.3 Multigene GP algorithm modifications .............................................. 146
6.2 Multigene GP-PLS .................................................................................... 149
6.2.1 Multigene SGP-PLS............................................................................ 150
6.2.2 Multigene TGP-PLS ........................................................................... 151
6.3 Comparison of single gene GP-PLSa with multigene GP-PLSa............... 153
6.3.1 Test system 1: simulated cooking extruder data................................. 153
6.3.2 Test system 2: Cherkassky function data............................................ 157
6.3.3 Test system 3: simulated pH process.................................................. 161
6.4 Analysis of computational costs................................................................ 164
6.5 Discussion.................................................................................................. 166
6.6 Summary.................................................................................................... 166
7 Dynamic data partitioning ............................................................................ 167
7.1 Introduction ............................................................................................... 168
7.1.1 SGP-PLS implementation of DDP...................................................... 170
7.1.2 TGP-PLS implementation of DDP ..................................................... 172
7.2 Experimental comparison of GP-PLS algorithms with and without DDP 174
7.2.1 Test system 1: simulated cooking extruder data................................. 174
7.2.2 Test system 2: Cherkassky function data............................................ 180
7.2.3 Test system 3: simulated pH process data .......................................... 185
7.3 Analysis of computational costs................................................................ 190
7.4 Discussion.................................................................................................. 192
7.5 Summary.................................................................................................... 192
8 Extended teams............................................................................................... 194
8.1 Introduction ............................................................................................... 195
8.1.1 Extended binary teams........................................................................ 196
8.1.2 Projection weight encoding................................................................. 197
8.1.3 Extended team algorithm details......................................................... 200
8.2 Experimental comparison of team based GP-PLS algorithms with and
without projection weight encoding .................................................................... 201
8.2.1 Test system 1: simulated cooking extruder data................................. 201
8.2.2 Test system 2: Cherkassky function data............................................ 206
8.2.3 Test system 3: simulated pH process data .......................................... 210
8.3 Discussion.................................................................................................. 215
8.3.1 Effects of projection weight encoding ................................................ 215
8.4 Analysis of computational costs................................................................ 217
8.5 Summary.................................................................................................... 219
9 Conclusions and further work ...................................................................... 220
9.1 Conclusions ............................................................................................... 221
9.1.1 Team based GP-PLS........................................................................... 221
9.1.2 The multigene approach to SGP-PLSa and TGP-PLSa...................... 222
9.1.3 Dynamic data partitioning................................................................... 224
9.1.4 Extended teams ................................................................................... 225
9.2 Further work .............................................................................................. 226
9.2.1 Highly multivariate systems ............................................................... 226
9.2.2 Multiple output variables .................................................................... 227
9.2.3 Final comments................................................................................... 229
References.............................................................................................................. 230
Acknowledgements ............................................................................................... 242
Nomenclature
Acronyms
The following shorthand expressions are used to designate the various regression
methods and algorithms considered in this thesis. Where appropriate, the Chapters
containing experimental work with the algorithm are also cited.
CR Continuum regression.
EBNNPLS Error based neural network PLS.
EBQPLS Error based quadratic PLS.
EBRBFPLS Error based radial basis function PLS.
GP_NPLS1 The version of SGP-PLS implemented by Hiden (1998).
GP_NPLS2 The version of SGP-PLS implemented by Hiden (1998), which
incorporates optimisation of the input projection weights.
GP-PLS An umbrella term referring to GP based PLS methods.
INNPLS Integrated neural network based PLS.
LQPLS Linear quadratic PLS.
MLR Multiple linear regression.
MSGP-PLSa The default multiple gene implementation of SGP-PLS (Chapter 6).
MSGP-PLSb A multiple gene implementation of SGP-PLS that uses dynamic data
partitioning (Chapter 7).
MTGP-PLSa The default multiple gene implementation of TGP-PLS (Chapter 6).
MTGP-PLSb A multiple gene implementation of TGP-PLS that uses dynamic data
partitioning (Chapter 7).
MTGP-PLSc A multiple gene implementation of extended TGP-PLS that uses
dynamic data partitioning and a standard reflected binary Gray
coding of the input projection weights (Chapter 8).
NNPLS Neural Net PLS.
PCA Principal component analysis.
PCR Principal component regression.
PLS Partial least squares/ Projection to latent structures.
PLS1 Linear PLS with a single output variable.
PLS2 Linear PLS with multiple output variables.
PPR Projection pursuit regression
QPLS Quadratic PLS.
RBFPLS Radial basis function network PLS.
SGP-PLS Umbrella term for sequential GP-PLS algorithms.
SGP-PLSa The default single gene implementation of SGP-PLS used in this
thesis (Chapter 5).
SGP-PLSb A single gene implementation of SGP-PLS that uses dynamic data
partitioning (Chapter 7).
SPLINE-PLS A non-linear PLS algorithm that uses piecewise polynomials.
TGP-PLS Umbrella term for team based GP-PLS algorithms.
TGP-PLSa The default single gene implementation of TGP-PLS used in this
thesis (Chapter 5).
TGP-PLSb A single gene implementation of TGP-PLS that uses dynamic data
partitioning (Chapter 7).
TGP-PLSc A single gene implementation of extended TGP-PLS that uses
dynamic data partitioning and a standard reflected binary Gray
coding of the input projection weights (Chapter 8).
Symbols
α A function parameter.
β A function parameter.
ϕj The jth regressor vector in a basis function expression.
A A subset of the training data.
bi The ith linear model parameter in MLR. Also, the univariate
regression model coefficient for the ith latent variable stage in
PLS.
{b0,i,k, b1,i,k} Model coefficients for the kth GP individual and the ith inner
model in single gene GP-PLS.
B A subset of the training data.
B An (m × p) matrix of model coefficients.
Bq A (q × p) reduced matrix of PCR model coefficients.
{c0,i, …, c2,i} The quadratic coefficients for the ith inner model in QPLS.
corr(x,y) The correlation coefficient of x and y.
cov(x,y) The covariance of x and y.
dj The jth basis function coefficient.
e An (n × 1) error vector.
E An (n × p) error matrix.
f The number of basis functions.
f(Jk) The raw fitness of the kth individual in a population.
f′(Jk) The adjusted fitness of the kth individual in a population.
gj The jth basis function.
G The number of generations in a GP run.
Gj,k The jth gene in the kth multiple gene expression.
Gj,i,k In MSGP-PLS, the jth gene in the kth individual at the ith latent
variable stage.
In MTGP-PLS, the jth gene in the ith team member of the kth
team.
K Process gain.
Jk The kth individual in a population.
Ji,k In SGP-PLSa, the kth GP individual at the ith latent variable
stage. In TGP-PLSa, the ith team member in the kth team.
m The number of input (independent) variables.
n The number of measurements in the training set.
nA The number of measurements in training subset A.
nB The number of measurements in training subset B.
N The number of individuals in a GP population.
Ngk The number of genes in the kth multiple gene individual.
Ngm The maximum number of allowed genes in a multiple gene
individual.
Nlv,k The number of members in the kth team that are used to generate
the overall model.
Np The number of symbolic members in a team.
Nt Tournament size.
Nw The number of bits used to encode a projection weight in TGP-
PLSc and MTGP-PLSc (extended binary teams).
p The number of output (response) variables.
pi An (m × 1) vector of input loadings at the ith latent variable stage.
pi,k In TGP-PLS, an (m × 1) vector of input loadings at the ith latent
variable stage as calculated in the kth team evaluation.
Pc Probability of crossover.
Phigh Probability of high level crossover.
Pi The ith GP population.
Pm Probability of mutation.
Pr Probability of direct reproduction.
q The number of principal components retained in PCA.
qi A (p × 1) vector of PLS output loadings for the ith latent variable
stage.
r The radius of a circle.
rms(.) The root mean square value of (.).
R The set of ephemeral random constants.
s The Laplace operator.
Si The ith subpopulation.
t The generation index.
ti An (n × 1) vector of input scores for the ith latent variable stage.
ti(A) An (nA × 1) vector of input scores over the subset A for the ith
latent variable stage.
T An (n × m) matrix of PCA scores.
Tk The kth GP team in a population.
Tq An (n × q) reduced matrix of PCA scores.
ui An (n × 1) vector of output scores for the ith latent variable stage.
ui(A) An (nA × 1) vector of output scores over the subset A.
ui,k In TGP-PLS, the (n × 1) vector of the ith output scores vector as
calculated in the kth team evaluation.
ûi,k In SGP-PLS, the (n × 1) vector of the prediction of the ith output
scores vector by the kth GP individual.
var(x) The variance of the elements in the vector x.
V An (m × m) matrix of PCA loadings.
Vq An (m × q) reduced matrix of PCA loadings.
wi An (m × 1) vector of input projection weights at the ith latent
variable stage.
wi,k In TGP-PLS, an (m × 1) vector of input projection weights at the
ith latent variable stage as calculated in the kth team evaluation.
Wi,k The ith binary team member in the kth team in TGP-PLSc and
MTGP-PLSc.
x An input (predictor) variable.
xj An (n × 1) vector of scaled measurements on the jth input
variable.
xt The value of the input variable at time t.
X An (n × m) input data matrix. The jth column contains scaled
measurements of the jth input variable.
X̂ An (n × m) matrix of estimated values for X.
X(A) An (nA × m) matrix of input values over the subset A.
Xi The (n × m) deflated input matrix at latent variable stage i.
y An output (response) variable.
yt The value of the output variable at time t.
y An (n × 1) vector of scaled measurements on a single output
variable.
y(A) An (nA × 1) vector of output values over the subset A.
yj An (n × 1) vector of scaled measurements on the jth output
variable.
Y An (n × p) output data matrix. The jth column contains scaled
measurements of the jth output variable.
Yi The (n × p) deflated output matrix at latent variable stage i.
Chapter 1
Introduction
1.1 Background
Industrial plants can be operated most safely and economically when the engineer
has detailed fundamental knowledge of how the component processes work. Ideally,
every physical aspect of each process is understood in detail, allowing it to be
efficiently controlled so that it performs under optimal conditions. In the real world,
however, this is rarely the case. Modern industrial processes tend to be highly
complex, involving physical and chemical interactions that are often poorly
understood at the quantitative level. Sometimes the control of these processes is
based purely on rules gleaned from experience of operating the plant. This lack of
fundamental process knowledge precludes the use of detailed mathematical models,
which are, in any event, usually too time consuming and expensive to develop.
On the other hand, whilst fundamental knowledge of process behaviour is difficult
to obtain, process data is not. Plant instrumentation and computer systems routinely
collect and store data on hundreds of variables such as flow rates, pressures,
temperatures as well as measures of product quality. This data affords the possibility
of constructing entirely empirically determined models of how the process behaves
(models of this type are usually called “black box” models and allow the outputs of
the process, e.g. product quality variables, to be predicted using the inputs of the
plant, e.g. reactant flowrates, chemical composition etc.). The drawbacks of black
box modelling are that the models do not usually have any physical interpretation
and they cannot safely extrapolate beyond the range of the data that were used to
train them.
Process data can also be used in conjunction with existing process knowledge; in
this case, only certain relationships are determined from the data and these are
combined with a physically derived mathematical model. This approach is often
called “grey box” modelling.
However, in the absence of formal physical and chemical equations, methods that
rely solely on plant data to quickly develop cost-effective and accurate models are
needed. The concept of automatic model development, i.e. methods that allow good
data based models to be built with minimum expert knowledge, is of particular
interest.
1.2 Data based modelling
1.2.1 Linear regression methods
Methods that assume a linear relationship between the process inputs and outputs are
the traditional tool of the engineer because they are fairly simple in structure and can
sometimes result in models that have some physical interpretation. These are usually
in the form of regression models, such as multiple linear regression (MLR) and,
more recently, principal component regression (PCR) and partial least squares
regression (PLS; Wold, 1975)¹. The latter two methods are developments of MLR
that have certain properties that make them useful for finding linear relationships
between output variables and large numbers of highly correlated input variables.
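As a point of reference for these regression methods (a generic sketch, not part of the original thesis; numpy is assumed, and all data here are synthetic), an MLR model y = Xb can be fitted by ordinary least squares:

```python
import numpy as np

# Illustrative MLR sketch: fit coefficients b in y = Xb by ordinary
# least squares, using synthetic, well-conditioned input data.
rng = np.random.default_rng(0)
X = rng.normal(size=(50, 3))                   # 50 measurements of 3 inputs
b_true = np.array([1.5, -2.0, 0.5])
y = X @ b_true + 0.01 * rng.normal(size=50)    # near-noiseless linear process

b_hat, *_ = np.linalg.lstsq(X, y, rcond=None)  # least squares estimate of b
y_pred = X @ b_hat                             # model predictions
```

When the columns of X are highly correlated, this estimate becomes ill-conditioned, which is precisely the situation that PCR and PLS are designed to handle.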
PLS works by projecting the input and output data onto low dimensional subspaces
and then fitting univariate regression models between the projections. It has proved
to be an effective way to develop multivariate process models from noisy, high
dimensional and correlated data (e.g. see Wise and Gallagher, 1996). A drawback
with the PLS regression method is that it assumes a linear relationship between
inputs and outputs, whereas the behaviour of industrial processes is frequently non-
linear. There are ways of extending the PLS framework to capture non-linear
relationships, however, and these are described in Chapter 4.
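To make the projection idea concrete, the single-output linear case (PLS1, detailed in Chapter 4) can be sketched as follows. This is a generic NIPALS-style illustration rather than code from the thesis; numpy is assumed:

```python
import numpy as np

def pls1(X, y, n_lv):
    """Sketch of single-output linear PLS: one latent variable per pass,
    deflating X and y before the next stage."""
    W, P, b = [], [], []
    Xi, yi = X.copy(), y.copy()
    for _ in range(n_lv):
        w = Xi.T @ yi
        w = w / np.linalg.norm(w)      # input projection weights w_i
        t = Xi @ w                     # input scores t_i = X_i w_i
        bi = (t @ yi) / (t @ t)        # univariate inner model y ~ b_i t_i
        p = Xi.T @ t / (t @ t)         # input loadings p_i
        Xi = Xi - np.outer(t, p)       # deflate: remove explained variation
        yi = yi - bi * t
        W.append(w); P.append(p); b.append(bi)
    return np.array(W).T, np.array(P).T, np.array(b), yi

# With as many latent variables as inputs, PLS1 reproduces the least
# squares fit, so a noiseless linear system is fitted exactly:
rng = np.random.default_rng(0)
X = rng.normal(size=(30, 4))
y = X @ np.array([1.0, 2.0, -1.0, 0.5])
W, P, b, resid = pls1(X, y, n_lv=4)
```

Each pass extracts one latent variable: projection weights wi, input scores ti, an inner model coefficient bi and loadings pi. The non-linear PLS methods discussed later replace the univariate inner regression with a non-linear function.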
1.2.2 Automatic model development
Methods to develop models that can effectively capture non-linear relationships in
process data with a minimum of expert knowledge have been extensively
researched. The most well known of these techniques is the artificial neural network
(ANN). This method has had a good degree of success in the process industries (e.g.
see Willis et al., 1992; Lennox, 1996). There are certain disadvantages to using
ANNs, however, not least of which is that the user must make informed decisions
about the size, topology and training method, as well as the selection of appropriate
network inputs (McKay et al., 1997). It is arguable that a meta-level optimisation
layer could be implemented to make these choices (although this may be
computationally costly), or that established rules of thumb could be used to
determine network topologies. However, Sarle (1997) points out that many of these
rules are 'nonsense' and states that "in most situations, there is no way to determine
the best number of hidden units without training several networks and estimating the
generalisation error of each".

¹ PLS is also known as "projection to latent structures" for reasons that will become
apparent in Chapter 4.
Another relatively recent technique that shows promise in the automatic
development of process models is that of genetic programming (GP; Koza, 1992).
GP is an evolutionary search method that was originally designed as a way of
automatically learning how to solve problems by applying artificial selection and
reproduction to populations of solutions that are encoded as variable length tree
structures. GP was identified as a good candidate for automatic model development
because it appeared to be able to automatically select the appropriate input variables
as well as discover the model structure and parameters simultaneously (this process
is known as symbolic regression). Whilst this is true to a certain extent, some
research at the University of Newcastle has shown that the standard form of GP (i.e.
the original form of GP using a single population of trees constructed from
arithmetic and simple non-linear functions with no use of advanced architectures or
representations) does not generally perform any better than feedforward artificial
neural networks with sigmoidal activation functions (Hiden, 1998).
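To illustrate the representation (a generic sketch, not the implementation used by Koza or Hiden; numpy is assumed and the model shown is arbitrary), a symbolic regression individual can be held as a nested tuple and evaluated column-wise over the input data:

```python
import numpy as np

# A GP parse tree as nested tuples. Terminals are input variable indices
# or numeric constants; internal nodes are arithmetic primitives.
FUNCS = {'+': np.add, '-': np.subtract, '*': np.multiply, 'tanh': np.tanh}

def evaluate(tree, X):
    """Recursively evaluate a parse tree over an (n x m) input matrix X."""
    if isinstance(tree, int):        # terminal: index of an input variable
        return X[:, tree]
    if isinstance(tree, float):      # terminal: numeric constant
        return np.full(X.shape[0], tree)
    op, *args = tree                 # internal node: (primitive, children...)
    return FUNCS[op](*(evaluate(a, X) for a in args))

# e.g. the candidate model tanh(x0 * x1) + 0.5:
model = ('+', ('tanh', ('*', 0, 1)), 0.5)
```

Because crossover and mutation act directly on these tree structures, the model form itself, and the subset of input variables it uses, are both open to evolution.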
It is also debatable whether the GP approach is any more ‘automatic’ than the ANN
approach, since a number of user decisions must be made with regard to population
size, architecture, selection method, choice of primitives etc. However, it remains to
be seen if GP is, in practice, more amenable to automatic model development.
1.2.3 Combining GP with PLS (GP-PLS)
In an attempt to improve the ability of GP to model steady state non-linear
multivariate systems, a hybrid of GP and the PLS modelling method was proposed
(GP_NPLS1; Hiden, 1998, Hiden et al., 1998). This method, in common with other
non-linear PLS methods, sequentially supplies a series of non-linear univariate
models to fit the relationships between the data projections. The GP_NPLS1 method
was found to increase the accuracy of the evolved models, in terms of the prediction
errors on unseen data, but was not found to give any better performance than an
equivalent neural network based PLS method.
A variant on this method, GP_NPLS2, was also proposed. GP_NPLS2 retains the
same basic architecture as GP_NPLS1 but it incorporates an iterative non-linear
least squares routine that optimises the data projection directions. GP_NPLS2 gave
better results than GP_NPLS1 and outperformed the equivalent neural network PLS
approach (i.e. neural network PLS with optimised projection directions). Despite
this improvement, GP_NPLS2 was deemed unacceptable for use as a modelling tool
due to the extremely high computational cost requirements of the external optimiser.
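The cost problem can be seen schematically: when the projection directions are externally optimised, every candidate inner model triggers its own optimisation over the weight vector w. The sketch below is a deliberately crude stand-in (random restarts in place of the non-linear least squares routine actually used); numpy is assumed and all names are illustrative:

```python
import numpy as np

# Schematic of externally optimised projection weights: for ONE fixed
# candidate inner model, the weights w are tuned by random restart search.
rng = np.random.default_rng(1)
X = rng.normal(size=(40, 3))
y = np.tanh(X @ np.array([0.6, -0.7, 0.4]))  # non-linear in one projection

def inner_model(t):                  # a fixed candidate inner model f(t)
    return np.tanh(t)

def sse_for_weights(w):
    w = w / np.linalg.norm(w)        # projection weights kept at unit length
    return np.sum((y - inner_model(X @ w)) ** 2)

# Inner optimisation loop: thousands of weight evaluations per candidate.
best_w, best_sse = None, np.inf
for _ in range(2000):
    w = rng.normal(size=3)
    s = sse_for_weights(w)
    if s < best_sse:
        best_w, best_sse = w / np.linalg.norm(w), s
```

Even this toy inner loop performs thousands of extra model evaluations for a single candidate, and in GP-PLS it would be repeated for every individual in every generation, which is why the accuracy gains of GP_NPLS2 came at such a high price.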
However, it was felt that the GP-PLS concept had unexplored potential and there
might be ways of modifying it so that better results could be attained without
resorting to the use of “brute force” optimisation methods, which are both time
consuming and aesthetically unappealing from an engineering standpoint. This led to
to the starting point of the work described in this thesis, the outline of which is
detailed in the next section.
1.3 Thesis aims and outline
The primary aim of this work is to demonstrate that the GP-PLS method has
potential as a viable process systems modelling tool by improving its performance
without recourse to methods that would greatly increase the computational load, such
as iterative optimisation routines. This is accomplished by testing some novel GP-
PLS architectures and evaluating training and validation methods on some simple
steady state test systems. This thesis begins, however, by introducing the field of
evolutionary computation (focussing particularly on genetic algorithms and genetic
programming) before reviewing the use of genetic programming as a modelling
(system identification) tool with industrial process applications. A brief abstract of
each of the following chapters is provided below:
Chapter 2: Evolutionary computation and genetic methods
The underlying principles of evolutionary computation are briefly introduced as well
as a more detailed discussion of the mechanisms of genetic algorithms and genetic
programming.
Chapter 3: Genetic programming as a modelling tool
A review of the use of GP as a modelling tool is provided, with an emphasis on
applications for process systems modelling. Representational and numerical issues
relevant to the model building capabilities of GP are discussed.
Chapter 4: Non-linear PLS using genetic programming
The mechanisms underlying linear PLS regression and related approaches are
discussed as well as non-linear extensions to the PLS framework, such as neural net
based PLS and the sequential GP-PLS algorithms proposed by Hiden (1998). A
novel GP-PLS architecture, team based GP-PLS, is proposed. This evolves a
population of co-operating teams of models to solve the modelling task in parallel,
unlike the original GP-PLS method in which the models are supplied sequentially by
consecutive GP runs.
Chapter 5: Comparing team and sequential GP-PLS
A comparative study of the team based GP-PLS architecture with the sequential
architecture proposed by Hiden (1998) is described. This study is based on the use
of data obtained from three synthetic test systems: a simulated cooking extruder, a
non-linear mathematical function and a simulated pH process. Some properties of
the evolved models are described and an attempt is made to improve the
generalisation performance of the evolved models by use of a split-sample
validation method.
Chapter 6: Multigene GP-PLS
The combination of the “multigene” GP method (Hinchliffe et al., 1996) with both
team based and sequential GP-PLS methods is described. This method decomposes
GP models into modular substructures in order to improve their evolvability
properties. A comparative study of the multigene GP-PLS algorithms and the single
gene algorithms on the three synthetic test systems is described.
Chapter 7: Dynamic data partitioning
A novel method for GP-PLS training is proposed and combined with the team and
sequential algorithms. This method, provisionally called dynamic data partitioning,
is intended to improve the generalisation of the evolved models by reducing model
overfitting. A comparative study of the various GP-PLS methods with and without
the dynamic data partitioning method on the three synthetic test systems is
described.
Chapter 8: Extended teams
A novel team based GP-PLS architecture whereby the data projection directions are
encoded as binary team members and evolved in parallel with the GP-PLS models is
proposed. A comparative study (on the three test systems) of the extended team
algorithm with the algorithms developed in Chapter 7 is described and some
properties of the evolved models are discussed.
Chapter 9: Conclusions and further work
A number of comments and conclusions on the development of the GP-PLS
framework in this thesis are offered. Suggestions for further work in the area are
provided.
Chapter 2
2 Evolutionary computation and genetic methods
2.1 Introduction
This thesis is concerned with the application of genetic search to the projection
based regression method of partial least squares (PLS). Genetic search methods are
members of a closely related family of procedures called evolutionary computation
(EC). The purpose of this chapter is to introduce EC, and subsequently genetic
algorithms and genetic programming by exposition of the underlying principles of
simulated evolution. Concepts that are of importance in EC (and in the work that is
discussed in the following chapters) such as evolvability and representational issues
are outlined. The chapter concludes with a discussion of various features of genetic
programming as a prelude to the use of GP in a system identification framework.
2.1.1 What is evolutionary computation?
Evolutionary computational methods are a class of iterative learning algorithms that
imitate the natural processes of biological evolution in order to solve science and
engineering problems. EC methods utilise a set of concepts and arguments that are
essentially identical to those that underpin the modern theoretical framework of
evolutionary biology. This framework is known as neo-Darwinism as it builds on
the ideas of “survival of the fittest” and cumulative selection first proposed
coherently by Darwin (1859). If one combines Darwin’s ideas with the notion of
differences in genetic encoding (the genotype) mapping to differences in physical
attributes (the phenotype) then evolution can be regarded as a statistical process
operating on complex data structures. It is then possible to view evolution as an
open-ended optimisation process that can be formalised, modelled and exploited.
The basic requirements for evolution to occur in a biological context are (e.g.
Darwin, 1859):
• There is a finite population of individuals.
• The individuals can reproduce and pass on their traits to their offspring.
• There should be a variety of traits within the population of individuals.
• The traits of the individuals should be related to their ability to survive (i.e.
the variety of the individuals should enable them to compete for the right to
be selected for reproduction).
In addition to these points there should be added the explicit requirement of
encoding of the traits:
• The salient characteristics of the individual (i.e. those characteristics which
impart upon the individual the ability to reproduce in its environment) should
be partly transmissible to its offspring via some sort of encoding system.
Furthermore, the transmission of the traits should not be error free otherwise
no new traits can develop.
It is now almost universally accepted that, in nature, it is predominantly the DNA
code that determines the variability of individuals. Thus, it is the complex
interactions between the genetic information and the processes of selection in a
finite population that give rise to the evolutionary driving force. In nature this
evolutionary pressure is not directed towards a particular goal; it is open ended,
whereas in simulated evolutionary processes (for the most part) the evolution is
directed towards solving a specific problem.
2.2 EC algorithms: basic structure and functionality
EC algorithms work by iteratively processing a population of individuals, each of
which forms a candidate solution to some problem. At the beginning of the EC
algorithm, it would not be expected that any individual would constitute a good
solution, since the initial population is randomly generated. The population is then
forced (by means of a process analogous to natural selection) to evolve with the goal
of producing better and better candidate solutions. Different EC algorithms use
different representations of candidate solutions. Representations range from real-
valued vectors in evolutionary programming (Fogel, Owens and Walsh, 1966) and
evolutionary strategies (e.g. see Schwefel, 1995) to bit strings and symbolic tree
structures in genetic algorithms (Holland, 1975) and genetic programming (Koza,
1992).
The level of performance of any particular individual can be ascribed some
numerical value; this is frequently referred to as its fitness. Assigning the fitness
involves evaluating the individual against some problem dependent objective
function (called a fitness function in EC literature) and is, for non-trivial
applications, the most time consuming part of the algorithm. The rate of replication
for any individual is determined by its fitness value and the fitness values of the
other individuals in the population. Exactly how the replication occurs is determined
by the selection scheme used to pick the individuals for replication and the genetic
operators used to perform the replications (possibly with modifications to the
individuals; analogous to mutation and sexual reproduction in biological evolution).
Those individuals that perform well, i.e. those of above average fitness, must have a
selection rate higher than those that perform relatively poorly in order for their
genetic information to successfully penetrate, and remain tenable in, the population.
The algorithm is terminated according to some pre-specified termination criteria.
This is generally dependent on the type of algorithm and the application. The most
frequently used method is to allow a pre-set number of iterations to elapse before
termination although in some situations, where there is some a priori quantitative
knowledge of the problem solution, it is possible to terminate the procedure when a
member of the population gives an acceptable solution.
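The generic loop described above can be sketched in a few lines. The following is an illustrative Python sketch (not code from the thesis); the function names, the toy objective and the greedy selection and Gaussian mutation operators are all invented for the example:

```python
import random

def evolve(init, fitness, select, reproduce, pop_size=50, generations=100):
    """Generic EC loop: random initial population, iterated selection and
    reproduction, termination after a pre-set number of generations."""
    population = [init() for _ in range(pop_size)]
    for _ in range(generations):
        scores = [fitness(ind) for ind in population]
        population = [reproduce(select(population, scores))
                      for _ in range(pop_size)]
    scores = [fitness(ind) for ind in population]
    return max(zip(scores, population))[1]  # fittest individual found

# Toy usage: maximise f(x) = -(x - 3)^2 with greedy selection and
# Gaussian mutation as the only reproduction operator.
best = evolve(
    init=lambda: random.uniform(-10, 10),
    fitness=lambda x: -(x - 3) ** 2,
    select=lambda pop, scores: max(zip(scores, pop))[1],
    reproduce=lambda x: x + random.gauss(0, 0.1),
)
```

The fitness evaluation inside the loop is, as noted above, where virtually all of the computational effort is spent in non-trivial applications.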
2.2.1 Historical perspective
The roots of EC can be traced back to the late 1950s and early 1960s with works
published in the general area of machine learning by a number of contributors e.g.
Friedberg (1958), Holland (1962). Interest in the use of evolutionary methods for
performing adaptation continued throughout the 1970s but was mostly restricted to a
relatively small number of researchers with access to suitable computer hardware,
and who published only in a narrow spectrum of journals. This situation persisted
for a number of years and it was not until the early 1990s that the previously
disparate components of (what is now termed) EC formally cohered (Bäck et al.,
1997).
The mid 1980s saw the beginning of a more widespread interest in EC methods.
Largely, this was catalysed by the availability of relatively cheap, high performance
computing to researchers in a variety of technical disciplines. Difficult optimisation
problems (i.e. those posed in noisy, uncertain and highly constrained domains), in
particular, have become popular candidates for the application of EC techniques. As
greater computing power becomes available, however, it is expected that EC
methods will increasingly be used for design purposes.
Most of the evolutionary algorithms around today can be loosely classified as
belonging to one of the following three categories: genetic algorithms (subsuming
genetic programming and classifier systems), evolutionary programming and
evolutionary strategies. These approaches are highly related but, historically, they
were developed independently (Fogel, 1997).
2.2.2 Context of EC in engineering search and optimisation
A number of search and optimisation methods have been developed for wide
ranging uses in the fields of science, engineering and economics. They are typically
applied when the solution (or solutions) to the problem being examined cannot be
readily expressed in a neat, closed analytical form. This is usually the case in the
majority of real-world problems: often the available information is not sufficient for
a simple solution to be deduced, or the mathematical analysis may be intractable.
Hence, further techniques to search for a satisfactory solution are usually required.
Calculus driven and enumerative methods form the traditional base of search and
optimisation techniques. Exhaustive enumerative methods, e.g. dynamic
programming (Bellman, 1957), directly evaluate the objective function for possible
solutions point by point. The regions of search are progressively refined and
explored (e.g. using geometrical considerations) so that the number of points
evaluated does not become too large and degenerate the procedure into a random
search. However, these techniques are not efficient and they “break down on
problems of moderate size and complexity” (Goldberg, 1989).
Calculus driven techniques assume that the space to be searched can be treated as an
analytically well-behaved surface with extrema that can be located using derivative
functions. The efficacy of such a search is highly dependent on the topography of
the optimisation surface and the initial conditions. Again, the assumptions that need
to be made about the behaviour of the search space are quite strong and, in general,
are not satisfied by the majority of real-world problems. The difficulty in solving
these problems has been the main driving force behind the development of
stochastic (including evolutionary) methods.
It is possible to employ algorithms that do not rely on the search space being
continuous and well behaved. Evolutionary algorithms are included in this class of
search methods, as is the (non-population based) class of algorithms known as
simulated annealing (Metropolis et al., 1953). This method searches points in the
space of possible solutions in a probabilistic manner. Previous candidate solutions
are perturbed according to a statistical schedule analogous to the annealing method
in metal cooling. Initially, when the “temperature” is high, perturbations to previous
solutions are accepted with high probability. As the algorithm continues, a cooling
schedule is imposed so that future perturbations are accepted with ever decreasing
probability. The mechanisms involved in a simulated annealing optimisation are
similar to those occurring in certain evolutionary algorithms, the main differences
are in the physical analogy used and the use of a population of search points in the
evolutionary case.
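The acceptance rule and cooling schedule described above can be sketched as follows. This is an illustrative Python sketch, not an implementation from the thesis; the parameter values and the one-dimensional test problem are invented for the example:

```python
import math
import random

def simulated_annealing(objective, x0, t_start=10.0, t_end=1e-3,
                        cooling=0.95, step=0.5):
    """Minimal simulated annealing sketch for one-dimensional minimisation.
    Early on (high 'temperature') worsening perturbations are accepted with
    high probability; the geometric cooling schedule makes such acceptances
    ever less likely as the search proceeds."""
    x, fx = x0, objective(x0)
    t = t_start
    while t > t_end:
        candidate = x + random.gauss(0, step)
        fc = objective(candidate)
        # Accept improvements always; accept worse moves with
        # probability exp(-delta / T), which shrinks as T is cooled.
        if fc < fx or random.random() < math.exp(-(fc - fx) / t):
            x, fx = candidate, fc
        t *= cooling
    return x
```

Note that, unlike an evolutionary algorithm, only a single search point is maintained and perturbed throughout the run.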
One of the frequently cited advantages of evolutionary algorithms over traditional
methods is that the use of a population of search points is less sensitive to the initial
conditions of the search, and that the explicit parallelism of the algorithms makes
them less likely to become trapped at local extrema in a multi-modal search space.
Another reason for their popularity is that it is not necessary to evaluate or estimate
derivatives during the search: no auxiliary information other than objective function
evaluations is required.
It is important to remember that, despite the recent excitement, evolutionary
algorithms are not voodoo. They, like any other search methods, have their
limitations. It is, in general, not possible to establish convergence proofs for
evolutionary methods (although this can be done in certain cases, e.g. Rudolph,
1994) and it is somewhat unclear (unlike calculus based techniques) under what
circumstances evolutionary algorithms perform poorly or what representation is
most appropriate for a given task. Another problem is that, as the complexity of the
target problems becomes greater, the computational demands of these algorithms
begin to outstrip the available resources. Hence, the cost of repeated experimentation
can be prohibitive. This can often limit their usefulness for certain purposes. Indeed,
the perceived high computational cost to performance ratio involved in the best
performing GP-PLS algorithm proposed by Hiden (1998) was the prime motivation
for much of the work tackled later in this thesis.
2.2.3 Evolvability and the choice of representation
In addition to the requirements outlined in Section 2.1.1, it is necessary for the
population as a whole to have a property known as evolvability, i.e. “the ability of
random variations to sometimes produce improvement” (Wagner and Altenberg,
1996). Without this “hidden” criterion, the requirements listed earlier are not
sufficient to ensure evolution. Evolvability is, in general, a complex property of the
way that the genotype (i.e. the coding of the individual as an abstract mathematical
entity) maps to the phenotype (i.e. the “physical” structure of the individual, which
ultimately dictates its behaviour). In EC designs this is often, mistakenly, considered
to be a purely representational problem, with the fitness function regarded as a self-
evident and immutable goal, rather than as a functional ingredient of the
evolutionary process (Jakobi, 1996). This means a somewhat more lateral approach
to fitness function design might be needed in EC designs than is normally required
for traditional search techniques.
Evolvability is a necessary property of a successful EC application but how does
one, from a practical standpoint, go about achieving it? Actually, up to a point, this
may not be as difficult as it sounds: common EC representations (e.g. genetic
algorithm bit strings) are common because they are structures that, empirically, have
been shown to exhibit good evolvability properties in a number of situations.
Furthermore, the form of the fitness function is, to some degree, pre-determined by
the nature of the application domain. The skill of the designer then lies in modifying
these basic components in order to further improve the evolvability of the
population, e.g. by augmenting the fitness functions with penalty functions (e.g.
Searson et al., 1998) or by modularising the representation by ensuring that, as far as
possible, functionally independent phenotypic effects are represented by
syntactically independent genotype structures (Altenberg, 1994).
So, whilst there are some general avenues of exploration open to the designer of an
EC application wishing to maximise evolvability, there are no hard and fast rules for
accomplishing this and so the designer frequently must utilise an iterative, heuristic
procedure, based on the recommendations of the available literature and experience
of similar applications. The methodology of EC designs is discussed further in
Section 2.4.
2.3 Selection mechanisms
The selection mechanism is central to the successful operation of evolutionary
algorithms. It must improve the average fitness of the next generation by giving
individuals with a high relative fitness a high probability of being selected. Then,
reproduction operators (such as mutation and crossover in the case of genetic
algorithms) can be applied to the selected individuals to create new individuals,
thereby investigating new regions of the search space. Thus, the selection
mechanism allows the exploitation of genetic material currently contained within the
population, with a view to its further improvement in future generations by means of
evolutionary reproduction operators. This is in stark contrast to traditional “hill
climbing” techniques that focus only on transforming the current best solution into a
better one, ignoring the possibilities of previous partial solutions, and leaving the
approach susceptible to being trapped in a local optimum. By allocating trials to
inferior solutions, evolutionary algorithms delay the immediate moderate payoff in
expectation of a higher future payoff. A balance is struck between the exploitation of
individuals with a higher than average fitness in the population and the exploration
of individuals that are not quite as good (but may contain genetic information that
could be useful when mutated to a slightly different form or suitably combined with
other individuals). The weighting of this balance is determined by the selection
pressure over the population.
The term “selection pressure” is frequently used in an informal manner¹ to indicate
the probability that individuals with a given fitness value have of being picked by
the selection process. This term can also be applied to a population as a whole. If it
is said that there is a high selection pressure over a population it usually means that the
selection mechanism is heavily biased towards individuals of high relative fitness,
with the advantage of greatly raising the average fitness of the next generation. If too
high a selection pressure is applied, however, this could have the undesirable effect
of causing a loss of diversity in the next generation and premature convergence of
the algorithm to an unsatisfactory solution. Conversely, too little selection pressure
and the algorithm stagnates and, in the degenerate case, becomes little better than a
random search. The choice of the level of selection pressure to exert on the
population throughout the course of an evolutionary algorithm is a major
consideration. In most applications, however, the designer does not have sufficient a
priori information to gauge the effect of a given selection scheme on the success of
the evolutionary algorithm and so must usually opt for mechanisms that have proved
successful in the past or have been recommended in the literature.
2.3.1 Fitness proportionate selection
Fitness proportionate selection (often known as roulette wheel selection) is probably
the simplest selection mechanism to implement and is the method that was originally
chosen for use with the earliest genetic algorithms by Holland (1975). It may be
stated simply as: the selection probability p(Jk) of the kth individual Jk in the current
population P(t) = { J1, J2, …, JN } at generation t is directly proportional to the
fitness value f(Jk) of the individual.
¹ Additionally, there are a number of formal measures of selection pressure (or ‘selection intensity’).
Blickle and Thiele (1995) define it as the difference between the population average fitness before
and after selection, normalised by the mean variance of the pre-selection population fitness. They use
this selection intensity measure as a means of quantitatively comparing different selection schemes.
The constant of proportionality is the inverse of the sum of the fitness values of the
individuals in the current population; it serves to normalise the sum of the individual
selection probabilities to one (Equation 2.1).

$$p(J_k) = \frac{f(J_k)}{\sum_{k=1}^{N} f(J_k)} \qquad (2.1)$$
Here it is assumed that all N fitness values are greater than zero, and that larger
fitness values correspond to better individual performance. If smaller fitness values
correspond to better individual performance (e.g. when minimising prediction errors
in data modelling) then the following scaling is often used (e.g. McKay et al., 1996).
$$f'(J_k) = \frac{1}{1 + f(J_k)} \qquad (2.2)$$
This adjusted fitness value can then be used in place of f(Jk) in Equation 2.1.
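The mechanism of Equations 2.1 and 2.2 can be sketched directly. The following Python sketch is illustrative only (the function names are not from the thesis); it assumes positive fitness values with larger meaning better:

```python
import random

def roulette_select(population, fitnesses):
    """Fitness proportionate ('roulette wheel') selection: individual k is
    picked with probability f(J_k) divided by the sum of all fitness
    values (Equation 2.1). Assumes positive fitnesses, larger = better."""
    total = sum(fitnesses)
    r = random.uniform(0, total)
    cumulative = 0.0
    for individual, f in zip(population, fitnesses):
        cumulative += f
        if r <= cumulative:
            return individual
    return population[-1]  # guard against floating point round-off

def inverted_fitness(f):
    """The scaling of Equation 2.2, used when smaller raw fitness values
    (e.g. prediction errors) correspond to better individuals."""
    return 1.0 / (1.0 + f)
```

For example, evaluating `inverted_fitness` for raw values below 0.1 yields adjusted values above 0.9, which illustrates the compaction problem discussed below.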
Blickle and Thiele (1995) point out that there are properties of fitness proportionate
selection that make it undesirable for general use as a selection mechanism in
evolutionary algorithm applications. The main problem is that it is not translation
invariant with respect to the raw fitness values. This means that as the evolutionary
algorithm progresses it is difficult to ascertain, with any certainty, the level of
selection pressure imposed on the population.
There are also problems associated with the use of the inversion function of
Equation 2.2 when the goal of the search is lower fitness values. In particular, when
the f(Jk) values are small (< 0.1), Equation 2.2 has the effect of compacting the
f′(Jk) values into the interval [0.9, 1]. The problem is then that the selection
probabilities tend to equalise as the algorithm progresses and the raw fitness values
become smaller, reducing the driving force behind the algorithm so that it is very
difficult to exploit the better individuals preferentially to the poorer ones.
Appropriate pre-scaling of the raw fitness values can, in principle, be used to remove
this problem but, in general, the use of fitness proportionate selection is fraught with
difficulties and is best avoided.
2.3.2 Ranking selection
The problems associated with the use of fitness proportionate selection can be
overcome by the use of ranking selection mechanisms (Grefenstette and Baker,
1989). Once the N raw fitness values have been calculated for each individual in the
population they are sorted so that the best individual has the rank N and the worst
the rank 1. The rank values can then be used in place of the raw fitness values in
Equation 2.1. This has the effect of imposing a selection pressure over the
population that varies in a linear manner and is independent of the absolute values of
the fitness measurements.
One problem that can occur with this method is that multiple individuals with the
same raw fitness value are ranked differently. The rank assigned to these individuals
is then an artefact of the sorting algorithm used. This could seriously bias the
selection procedure in cases where there are a relatively large number of individuals
in the population with equal fitnesses. The ranking method can, however, be
modified so that individuals with equal fitness values are given the same rank. This
is accomplished by performing the normal linear ranking procedure and then, for the
individuals with equal fitness (or for each group of individuals that exhibit equal
fitnesses), assigning the mean rank of that group to each of the individuals within
that group. For example, in a population of 10 individuals with unique raw fitness
values, the best individual would be assigned rank 10 and the worst, rank 1. If
however, the individuals with ranks 8,7 and 6 actually had equal raw fitness then
these individuals would be assigned the modified rank of (8 + 7 + 6)/3 = 7. This form of
modified ranking is the selection mechanism adopted for the work described in this
thesis.
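The modified ranking procedure can be sketched as follows; this is an illustrative Python sketch, with names invented for the example, rather than the implementation used in the thesis:

```python
from collections import defaultdict

def modified_ranks(fitnesses):
    """Linear ranking (worst = 1, best = N) in which individuals with equal
    raw fitness all receive the mean rank of their group, so that ties are
    not broken by an arbitrary sorting order. Larger fitness = better."""
    n = len(fitnesses)
    order = sorted(range(n), key=lambda i: fitnesses[i])
    raw_rank = [0] * n
    for rank, i in enumerate(order, start=1):
        raw_rank[i] = rank
    groups = defaultdict(list)          # indices sharing a fitness value
    for i, f in enumerate(fitnesses):
        groups[f].append(i)
    ranks = [0.0] * n
    for members in groups.values():
        mean = sum(raw_rank[i] for i in members) / len(members)
        for i in members:
            ranks[i] = mean
    return ranks
```

With ten individuals in which three share a raw fitness value occupying ranks 6, 7 and 8, all three receive the mean rank (8 + 7 + 6)/3 = 7, reproducing the example in the text.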
2.3.3 Tournament selection
An alternative selection method that has proved popular is tournament selection. It is
similar to ranking selection in that it also overcomes the problems associated with
fitness proportionate selection by decoupling the distribution of selection
probabilities from the absolute distribution of raw fitness values. The method is
analogous to that sometimes observed in nature where individuals directly compete
for the right to mate.
Rather than directly sorting all of the individuals in the population according to
fitness, a tournament group of size Nt is formed by randomly selecting individuals
from the population. The tournament group is then ranked according to fitness and
the best individual in the group is then selected. Tournament selection can be
regarded as a probabilistic version of ranking selection (Koza, 1992) and in the case
of Nt = 2 the two techniques are mathematically equivalent (Blickle and Thiele,
1995). Larger tournament sizes increase the selection pressure on the best
individuals in the population, e.g. in the degenerate case of Nt = N the best individual
in the population is always selected, leading to a massive loss of diversity in the next
generation of the evolutionary algorithm. In the EC literature, tournament sizes of 4-
6 are commonly reported. In some applications of evolutionary algorithms, e.g.
algorithms involving a high degree of parallelisation over a number of processing
nodes, tournament selection is preferred over ranking selection because no
centralised sorting procedure is required.
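The tournament mechanism is short enough to sketch directly. The following Python sketch is illustrative (not from the thesis) and assumes larger fitness values are better:

```python
import random

def tournament_select(population, fitnesses, tournament_size=4):
    """Tournament selection: draw Nt individuals at random (without
    replacement) and return the fittest. Nt = 2 matches linear ranking
    selection; larger Nt raises the selection pressure."""
    contenders = random.sample(range(len(population)), tournament_size)
    winner = max(contenders, key=lambda i: fitnesses[i])
    return population[winner]
```

In the degenerate case Nt = N the whole population forms the tournament group, so the best individual is always selected, as noted above.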
2.4 Notes on evolutionary computational methodology
Bäck et al. (1997), in a recent review of the status and history of EC, contend that it
can often be useful to view EC as a general framework of related concepts that can
be tailored to the user’s application, rather than a pre-defined collection of
algorithms that can be bolted on to a domain specific problem without consideration
of the issues involved. This is worth bearing in mind when attempting to describe
any particular subgroup of evolutionary algorithms: ultimately the form of the
algorithm and the problem representation that is used are not uniquely defined by the
application. An adaptive, incremental approach to the design is required as well as a
willingness to utilise heuristic and qualitative arguments in the construction of the
solution.
Barto (1990) states that whilst traditional engineering methods tend to deal with
quantities and concepts that are of low-dimensionality and natural to the engineer,
connectionist methods (e.g. artificial neural networks) tend to employ “expansive”
representations, in which the representation of the problem is apparently of a higher
dimension than the problem requires. This property is also shared by a number of
representations common in EC, such as genetic programming. The expansionist
representation is an underdetermined one and therefore researchers have a large
amount of freedom in implementing a design. There is, however, the accompanying
burden that there are no clearly defined procedures to an EC design and the
formulation used does not necessarily have the regular mathematical properties, such
as linearity, determinism, stability and convergence that traditional engineering
methods display.
2.5 Genetic algorithms
Genetic algorithms are, perhaps, the best known type of evolutionary algorithm.
They have gained a reputation for being both robust and relatively easy to
implement (Goldberg, 1989). This is borne out by the degree of use that genetic
algorithms have recently seen in a number of diverse research areas: e.g., genetic
algorithms have been used to optimise the design of plastic extruder dies where
gradient based techniques were found too inefficient (Chung and Hwang, 1997).
Moros et al. (1996) used genetic algorithms to generate initial parameter estimates
for kinetic models of a methane dehydrodimerisation process. They found that this
reduced overall computing time and increased the reliability of the model parameter
solutions. Genetic algorithms have also been applied to a number of medical
imaging problems with a good deal of success; e.g. Handels et al. (1999) report on
different methods to recognise malignant melanomas automatically by extracting
features from skin surface profiles. The genetic algorithm method performed best
with a 97.7% successful classification performance on unseen skin profiles.
In view of the fact that the focus of this thesis, genetic programming, is seen by
many as an extension of the basic genetic algorithm, fundamentally employing the
same mechanisms but with greater representational flexibility, the following sections
summarise the basic concepts of genetic algorithms and the theories behind their
efficacy.
2.5.1 Background
Although there had been interest in the modelling and simulation of population
genetics around the same time as the general field of evolutionary computation was
founded it was not until John Holland published the landmark text “Adaptation in
Natural and Artificial Systems” (Holland, 1975) that the advantages of using
genetics as a general model for adaptation in non-biological systems became
apparent to a wider audience.
In the most widely used form of the genetic algorithm, the standard binary crossover
genetic algorithm (SGA), each individual within the population consists of a string
of binary digits. This bit string is a discrete combinatorial representation of a
solution to the problem being examined, meaning that the entire search space can be
represented by the (finite) available combinations of the bits.
In the simplest case, the bit string is usually a direct binary encoding of a real valued
parameter. However, other more mechanistic representations are possible, wherein
the order of the bits represents the nature of the interactions in some entity with
modular characteristics, e.g. in genetic algorithm based classifier systems. The
following sections introduce the basic mechanisms of genetic algorithms.
2.5.2 Reproduction operators in genetic algorithms
For evolution to occur there must be cumulative selection over a number of
generations coupled with the property that small variations in the genotype
sometimes produce improvements in the individual. The selection mechanisms
(fitness proportionate selection, ranking etc.) are largely independent of the
representation of the individual, but the reproduction operators used must be
designed appropriately. The reproduction operators most often used in binary
bit-string genetic algorithms (direct reproduction, point mutation and single point
crossover) are inspired by the recombinative processes that enable adaptation in the
natural world. A number of alternative reproduction operators have been proposed
for use with binary genetic algorithms, e.g. multi-point crossover (De Jong, 1975),
but they are generally simple adaptations or hybrids of the basic single point
crossover and mutation methods and, as a rule, have not been adopted by the bulk of
GA practitioners.
In constructing a new population, the reproduction operator to be used is picked
based on the probabilities Pc (probability of crossover), Pm (probability of mutation)
and Pr (probability of direct reproduction) where Pc + Pm + Pr = 1. These are
algorithm control parameters and must be set by the user. (The rate of crossover
tends to dominate the recombination process in most applications, with direct
reproduction and mutation used as “background” operators.) The selection
mechanism is then used to select an individual (or two individuals in the case of
crossover) and the appropriate reproduction operation is performed. The parent(s)
are left in the current population and are available for reselection. The offspring are
inserted into the new population.
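The operator-picking step described above can be sketched in a few lines of Python. The rate values Pc = 0.85, Pm = 0.05 and Pr = 0.10 are illustrative assumptions, not values taken from the text:

```python
import random

# Illustrative operator rates; Pc + Pm + Pr must sum to 1
P_CROSSOVER, P_MUTATION, P_REPRODUCTION = 0.85, 0.05, 0.10

def pick_operator(rng=random):
    """Pick a reproduction operator according to the probabilities Pc, Pm, Pr."""
    r = rng.random()
    if r < P_CROSSOVER:
        return "crossover"
    elif r < P_CROSSOVER + P_MUTATION:
        return "mutation"
    return "reproduction"
```

When crossover is picked, two individuals are then selected as parents; otherwise a single individual is selected.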
2.5.2.1 Single point crossover
The single point crossover operator is analogous to the exchange of genetic
information (stored on chromosomes) that occurs during sexual reproduction in
nature. A new individual is created by recombining two complementary fragments
of the parent bit strings, thereby testing new individuals that retain characteristics of
both parents. Because the standard genetic algorithm operates over fixed length
linear vectors, the fragment sizes must be constrained so that their combination
results in an individual of the same length. This is accomplished by randomly
picking a crossover point, and applying it to both parents to create two new
offspring. Figure 2.1 depicts this process.
Figure 2.1 Single point crossover in standard binary genetic algorithms
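A minimal sketch of the single point crossover operator, using Python lists of bits as chromosomes (the function name and representation are illustrative assumptions):

```python
import random

def single_point_crossover(parent1, parent2, rng=random):
    """Exchange complementary tail fragments of two equal-length bit strings,
    producing two offspring of the same length as the parents."""
    assert len(parent1) == len(parent2)
    point = rng.randint(1, len(parent1) - 1)   # crossover point, never at the ends
    child1 = parent1[:point] + parent2[point:]
    child2 = parent2[:point] + parent1[point:]
    return child1, child2
```

Note that the total number of 1s and 0s across the two offspring always equals that of the two parents, since crossover only rearranges existing genetic material.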
2.5.2.2 Mutation
The point mutation operator is analogous to the biological random mutations that
infrequently occur on DNA molecules. It is typically applied with much lower
frequencies than the crossover operator. The mutation operator is applied to a single
parent by randomly selecting a bit and then flipping it (see Figure 2.2).
Figure 2.2 Single point mutation in standard binary genetic algorithms
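A corresponding sketch of the point mutation operator (again, the names and list representation are illustrative):

```python
import random

def point_mutation(parent, rng=random):
    """Flip one randomly chosen bit of the parent, returning a new offspring."""
    offspring = list(parent)            # copy; the parent is left unchanged
    i = rng.randrange(len(offspring))   # randomly selected bit position
    offspring[i] = 1 - offspring[i]     # flip 0 -> 1 or 1 -> 0
    return offspring
```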
2.5.2.3 Direct reproduction
Direct reproduction is carried out simply by copying the selected individual into the
next generation with no modification of its bit structure. The purpose of this operator
is to promote the propagation of successful individuals through to future generations
in such a way that they are immune to the (possibly) harmful effects of mutation and
crossover events.
An “elitist” selection scheme can also be employed to protect the best individuals in
the current population. Because direct reproduction is applied probabilistically, it is
extremely likely, but not guaranteed, that the best individuals of the population will
be carried over to the next population; elitist selection acts as a safeguard. The
simplest way to implement it is to copy the top,
say, five per cent of the current population into the new population before
embarking on the ordinary probabilistic selection/reproduction mechanisms. The 5%
elitist method is used in all runs described in this thesis.
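The elitist copy step can be sketched as follows; the helper name and the rounding convention are assumptions:

```python
def apply_elitism(population, fitnesses, elite_fraction=0.05):
    """Copy the top elite_fraction of the population (ranked by fitness) straight
    into the next generation, before probabilistic reproduction fills the rest."""
    n_elite = max(1, int(round(elite_fraction * len(population))))
    ranked = sorted(zip(fitnesses, population), key=lambda p: p[0], reverse=True)
    return [individual for _, individual in ranked[:n_elite]]
```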
2.5.3 Genetic algorithm flowsheet
The overall operation of a standard genetic algorithm can be represented by the
flowsheet in Figure 2.3. The flowsheet shows that the genetic algorithm is
essentially very simple to operate, consisting of straightforward selection and bit
string manipulation mechanisms.
The user must supply the various initialisation parameters (the first block in the
flowsheet), e.g. the population size, the encoding scheme, the termination criterion,
the reproduction operator frequencies etc. The best way to determine these factors is
by referring to existing literature describing a related problem and using the reported
values as default settings. Subsequent experimentation with these settings should
eventually yield satisfactory results, although it is usually impractical to determine
what the optimal settings are for non-trivial problems. The user must also supply a
set of functions that decodes a candidate individual, evaluates it and then returns a
numerical measure of its quality.
Note that the process shown in Figure 2.3 is a simplified version of the genetic
algorithm. It does not include provision for selection method variants such as elitist
selection. Other details of the standard algorithm have also been omitted for the sake
of clarity.
Figure 2.3 Flowsheet of a standard genetic algorithm
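The loop of Figure 2.3 can be sketched end-to-end on a toy problem. The sketch below uses the OneMax fitness function (the number of 1-bits in the string) and size-two tournament selection as illustrative choices; neither is prescribed by the text:

```python
import random

def run_ga(bits=20, pop_size=30, generations=40, pc=0.85, pm=0.10, seed=0):
    """Minimal standard GA on the OneMax problem, following the flowsheet:
    initialise, evaluate, select, apply a variation operator, repeat."""
    rng = random.Random(seed)

    def fitness(ind):
        return sum(ind)                    # OneMax: count the 1-bits

    def select(pop):
        # size-2 tournament: a simple probabilistic selection scheme
        a, b = rng.choice(pop), rng.choice(pop)
        return a if fitness(a) >= fitness(b) else b

    pop = [[rng.randint(0, 1) for _ in range(bits)] for _ in range(pop_size)]
    for _ in range(generations):
        new_pop = []
        while len(new_pop) < pop_size:
            r = rng.random()
            if r < pc:                     # crossover: two parents, two offspring
                p1, p2 = select(pop), select(pop)
                pt = rng.randint(1, bits - 1)
                new_pop += [p1[:pt] + p2[pt:], p2[:pt] + p1[pt:]]
            elif r < pc + pm:              # point mutation: flip one bit
                child = list(select(pop))
                i = rng.randrange(bits)
                child[i] = 1 - child[i]
                new_pop.append(child)
            else:                          # direct reproduction: unchanged copy
                new_pop.append(list(select(pop)))
        pop = new_pop[:pop_size]
    return max(pop, key=fitness)
```

After a few dozen generations on a problem this small, the best individual is typically at or very near the all-ones optimum.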
2.5.4 Genetic algorithms as function optimisers
Figure 2.4 shows an example of a partitioned binary bit string encoding of two real
valued parameters that could be used in a function optimisation scenario. Typically
this would be stated as “find the values of the parameters α and β that minimise (or
maximise) some objective function f(α, β) subject to certain constraints.” In Figure
2.4, Jk is
an individual in a population P(t) = {J1, J2, …, JN} of N individuals at generation t.
It is useful to clarify some of the terminology associated with genetic algorithms.
The bit string in the above example is referred to as a chromosome and the
contiguous bit sections corresponding to each parameter are referred to as genes.
Each gene can take on a number of values, called alleles. The entire string, the
genotype, can be regarded as the prescriptive structure responsible for the expressed
parameter set (Goldberg, 1989). The genotype need not always be completely
defined by the contents of one chromosome; multiple chromosomes can be used to
encode the information in a modular form, allowing restrictions on the interchange
of genetic information to be imposed during recombination.
Genetic algorithms are discrete combinatorial processors but many parameter
optimisation problems are based on continuous real valued parameters. Certain
trade-offs between precision and the size of the coding used must therefore be made.
The number of bits chosen to represent each parameter depends on the range of
admissible parameter values and the degree of precision required. Hence, prior
knowledge of the range in which the optimal values fall (and the desired precision)
is necessary when designing a binary bit string representation. If the function
optimisation requires a high degree of precision over large parameter ranges then the
length of the bit string will become commensurately large, and the size of the space
that the genetic algorithm has to search increases at an exponential rate. In the
example in Figure 2.4 there are 14 bits in total, so there are 2¹⁴ (16,384) distinct
combinations of the bits. For such an example, the number of points in the search
space is not huge and, provided the fitness function is not complex, it could be
searched successfully using non-genetic methods (e.g. an exhaustive search) in a
reasonable amount of time. However, binary bit string lengths of 200 are quite
common in engineering applications (e.g. the optimisation of 20 real valued
parameters simultaneously, each represented to 10 bit precision). The search space
in this case is astronomically large, consisting of 2²⁰⁰ (approximately 10⁶⁰) possible
combinations of bits. An exhaustive search would be infeasible in this case: if one
million points could be searched per second, it would still take on the order of 10⁴⁶
years to complete. The fact that genetic algorithms can successfully search spaces of
this size in far shorter times emphasises that genetic algorithms, although having a
number of random elements in their operation, are not random walks through the
search space.
Figure 2.4 Example of bit string in parameter encoding and evaluation
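Decoding a partitioned chromosome such as Jk can be sketched as below. The 7-bit/7-bit gene split and the parameter ranges are illustrative assumptions, not values taken from the text:

```python
def decode_gene(bits, low, high):
    """Map a sequence of bits to a real value in [low, high] by linear scaling
    of the gene's integer value."""
    integer = int("".join(str(b) for b in bits), 2)
    return low + (high - low) * integer / (2 ** len(bits) - 1)

def decode_chromosome(chromosome):
    """Split a 14-bit chromosome into two 7-bit genes (an assumed partition)
    and decode them as alpha in [0, 10] and beta in [-1, 1] (assumed ranges)."""
    alpha = decode_gene(chromosome[:7], 0.0, 10.0)
    beta = decode_gene(chromosome[7:], -1.0, 1.0)
    return alpha, beta
```

The precision trade-off discussed above is visible here: with 7 bits per gene, alpha can only take 128 distinct values across its range.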
2.5.5 Underlying processes: the schema theorem
How do the partially randomised mechanistic operations involved in genetic
algorithm processing enable a good quality solution, in the form of a particular
sequence of bits, to be obtained from the enormous number of sequences available
in a typical run?
The schema theorem (Holland, 1975) is one explanation, although often criticised,
of how genetic algorithms process information and gradually progress towards a
near-optimal solution. The basis of this theorem is that the genetic algorithm
implicitly processes large numbers of candidate solutions in parallel by means of
similarity templates (so called schemata). Schemata are notational devices that allow
structural similarities in groups of solutions to be quantified, and it is thought that
the genetic algorithm implicitly employs this structural similarity information in
approaching a high quality solution structure. A corollary of the schema theorem is
the building block hypothesis, which implies that, for a GA to work efficiently,
short bit string sections representing relatively successful partial solutions (i.e. the
“building blocks”) must be combined in order to realise a global solution.
Intuitively, and from a human perspective, this
makes sense: often solutions to problems are found by applying successful solutions
to related problems or by breaking down the problem into smaller, more manageable
problems and combining the partial solutions obtained.
Although the schema theorem and the building block hypothesis are useful as
visualisation tools in genetic algorithms, there are certain inconsistencies in the
underlying assumptions, and many of the criticisms that have been directed at the
schema theorem suggest that there may be processes at work in genetic algorithms
that have yet to be adequately explained. Thornton (1997) summarises a number of
the problems with the schema theorem and the building block hypothesis.
2.5.6 Towards flexible representation in genetic algorithms
It has long been recognised that the greatest shortcoming of the classical genetic
algorithm is its lack of representational flexibility. The straightforward coding
scheme is sufficient for parameter optimisation problems, but for the more complex
tasks of generalised machine learning it is a severe restriction. In these cases, a
solution that adapts its own structure by progressively improving on previous
structures would be highly desirable.
One attempt to broach this representation problem for learning systems was the
learning classifier system (Holland and Reitman, 1978) based on the use of IF-
THEN production rules coded as fixed length binary strings. Another example is the
use of variable length strings, the so-called “messy genetic algorithm” introduced by
Goldberg et al. (1989). Whilst these methods enjoyed some success, it was still felt
that the utility of genetic algorithms should be combined with higher order variable
length structures, capable of allowing more complex interactions, and amenable to
the learning of general tasks.
2.6 Genetic programming
A number of researchers in the 1980s pursued the application of genetic algorithms
to more complex structures: e.g., Cramer (1985) used a language consisting of loops
and increments on variables to evolve solutions to a simple symbolic regression
problem. The representation he used consisted of integer strings that could be
decoded to form structured programs. Hicklin (1986) and Fujiki and Dickinson
(1987) investigated the use of genetic reproduction operators in generating programs
in a language called LISP (LISt Processing language). LISP is appropriate for the
application of recombinative methods because groups of instructions and data are
represented in a syntactically identical way, allowing parts of programs to be spliced
into other programs in a manner resembling the bit string splicing in binary genetic
algorithms. Most importantly, this is accomplished whilst still maintaining legal
program syntax.
Genetic programming was the logical progression from the work carried out on the
application of genetic algorithms to higher order data structures. John Koza
published a series of papers in the early 1990s, e.g. Koza (1990, 1991), that
culminated in his extensively referenced text: “Genetic programming: on the
programming of computers by means of natural selection” (Koza, 1992). In it, Koza
describes a wide array of problems, from various fields, that he uses genetic
programming to solve: e.g. symbolic regression (evolving a model that best fits a set
of input-output data), robotic planning (i.e. the “artificial ant” problem: the solution
lies in evolving a program that guides an entity around a grid picking up all the
items of ‘food’ in as few manoeuvres as possible), controller design (deriving a
computer program that brings a vehicle to rest in minimal time using an “on/off”
control signal).
Due to the varied applications described, and the relative ease with which the
genetic programming algorithm can be implemented, Koza’s work was much more
accessible to the research community than the existing approaches to machine
learning. These had tended to rely heavily on formal inferences, abstract symbol
processing and impenetrable mathematical theorems and, hence, seemed to be at a
great remove from being able to solve the sorts of problems that people wanted them
to solve. Genetic programming, on the other hand, although initially only applied to
trivial problems, gave impetus to the idea that artificial intelligence could be
engineered from the ground up and set to work on scientific problems. Most of the
work in genetic programming, both theory and application based, stems from the
algorithms described in Koza’s book.
2.6.1 Program induction: parse trees as adaptable data structures
One of the main insights of Koza is that of the “pervasiveness of the problem of
program induction”, i.e. that a very large number of problems can be solved with the
use of a computer program of some description as an answer. Obtaining a suitable
program for a given problem is what most scientists, engineers, economists etc.
spend a great deal of their professional lives trying to accomplish. Generating and
subsequently adapting program code, given a measure of program fitness (however
implicitly defined), is what humans do to solve technical problems.
The idea of using genetic methods to perturb, fragment and splice programs together
to generate better programs is an appealing one. What is less appealing is the
perceived fragility of program code: most people know from experience that
chopping and changing code in an ad hoc manner is unlikely to result in a program
that actually executes without errors, much less give anything approaching the
correct answer. However, the source code that one types in and the internal
representation of code within a computer are vastly different. Most programs are
internally represented as a parse tree; a data structure that represents a hierarchical
sequence of instructions in the form of an ordered tree. This representation of a
program as an ordered tree structure strips away most of the clutter associated with
the majority of computer languages. That which remains is the functional backbone
of the program. Hence, the problem of cutting and splicing programs is vastly
simplified. Given a few necessary assumptions and constraints (these will be
described in the coming sections) the tree structure can be modified in an ad hoc
manner yet still maintain internal syntactic consistency. Of course, it is highly
unlikely that any one perturbation will result in a better program but genetic
programming, like all evolutionary techniques, uses the cumulative effect of
artificial selection to amplify the effects of the few modifications that do give
slightly better results.
As an example of a program as a tree structure, consider the following simple piece
of pseudo-code (a callable function named prog1 that accepts two real valued
arguments a and b and returns a real argument c, the value of which depends on
whether a or b is greater).
function [c] = prog1(a, b)
    if a <= b then
        c = a + b
    else
        c = a - b
    end
The same function can be represented as a rooted, ordered tree structure as depicted
in Figure 2.5. The tree consists of two types of node: terminals and functionals.
Terminal nodes are the “leaves” of the tree structure and typically represent items of
program data (program inputs or constants). Functional nodes are the branch points
within the tree; they are operators that are used to process terminal node values (and
results from branches further down the tree). In the case of Figure 2.5 the terminals
are the inputs to the program: the arguments a and b. The functional nodes are the
addition operator, +, the subtraction operator, -, and the ‘if less than or equal’
conditional operator designated by the tag IFLTE. The tree processes information as
follows: the data represented by the lowest (leaf) nodes are passed up the tree to the
node immediately above them. At this point, they are operated on by functional
nodes, e.g. the addition operator. Then the results of these calculations are passed up
to the next node and so forth until the root node is reached. The final calculation
ends here and this is usually designated as the overall program output.
The ordering of the branches generally makes a difference to the structure of the
program because of the way that some functional nodes are specified. E.g., the
IFLTE node always has four input arguments, which are processed in the following
way:
if (argument1) ≤ (argument2)
    then return (argument3) as node output
    else return (argument4) as node output
All function nodes used in GP must be explicitly defined in this manner.
Figure 2.5 Tree structure of function prog1
Although a parse tree diagram gives a clear view of the processing hierarchy of a
program, it is not amenable to direct computer manipulation. A more convenient
notation for the trees used in genetic programming is that of prefix notation
(sometimes called Polish notation). In this form of notation, which is directly
equivalent to a parse tree representation, functionals are represented by a symbol
followed by the arguments in parentheses. E.g. the familiar algebraic expression
a + b would be written as +(a b) in prefix notation. Note that the functional
arguments can also be functions themselves: e.g. the expression a - (b + c) would
become -(a +(b c)) in prefix notation. The pseudo-code function prog1
illustrated in Figure 2.5 can be written as: IFLTE(a b +(a b) -(a b)).
(The computer language LISP, originally chosen for genetic programming, uses a
variant of prefix notation, but virtually any high level language can be used if an
appropriate interpreter is available. All of the GP runs in this thesis were performed
using the MATLAB programming language to operate on ASCII coded prefix
expressions.)
It can be seen why tree structures are amenable to the problem of automatic program
induction; sub-trees can be swapped from place to place, nodes can be deleted and
replaced with other nodes (or sub-trees) because the syntax that renders a program
executable is inherent in the tree representation.
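A tree in this form can be evaluated by a short recursive interpreter. The sketch below represents nodes as nested tuples and implements only the nodes used by prog1; the representation is an illustrative choice, not the thesis's MATLAB implementation:

```python
def evaluate(node, env):
    """Recursively evaluate a parse tree. Terminals are variable names looked
    up in env, or numeric constants returned as-is. Functional nodes are
    tuples: (tag, child, child, ...)."""
    if not isinstance(node, tuple):
        return env.get(node, node)
    tag, *args = node
    if tag == "+":
        return evaluate(args[0], env) + evaluate(args[1], env)
    if tag == "-":
        return evaluate(args[0], env) - evaluate(args[1], env)
    if tag == "IFLTE":
        # four arguments, processed exactly as defined in the text
        if evaluate(args[0], env) <= evaluate(args[1], env):
            return evaluate(args[2], env)
        return evaluate(args[3], env)
    raise ValueError(f"unknown function node: {tag}")

# prog1 from the text, in prefix form IFLTE(a b +(a b) -(a b))
prog1 = ("IFLTE", "a", "b", ("+", "a", "b"), ("-", "a", "b"))
```

For a = 2, b = 5 the condition a ≤ b holds, so the + branch is taken; for a = 5, b = 2 the - branch is taken, matching the pseudo-code above.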
Details of the genetic operators used, and some of the other details needed to set up
a genetic programming experiment, are given in the following sections.
2.6.2 Reproduction operators in genetic programming
Three principal genetic reproduction operators are defined by Koza (1992) for
genetic programming: direct reproduction, mutation and crossover (although Koza
did not originally advocate the use of mutation, see Section 2.6.2.2). The concepts
behind them are very similar to the operators used for binary bit string genetic
algorithms.
2.6.2.1 GP crossover
Analogous to the method used in binary bit string genetic algorithms, GP crossover
exchanges information between two chromosomes. Unlike binary genetic
algorithms, there is no theoretical restriction on the sizes of the sections of the
chromosome being exchanged (considerations such as computer memory and
available processing time, however, mean that the practical implementation of
crossover will have an upper limit on the new tree sizes.) As an example, the
following two simple programs will be shown undergoing GP crossover. In the
context of a GP run it is assumed that these two programs are population members
that have been chosen by means of an appropriate selection mechanism.
Parent 1: Output = 3 + (a - b)
Parent 2: Output = b√(1 + c)
In this example the terminals a, b and c are input variables. The other nodes used
are the terminal constants 1 and 3 and the addition, subtraction and square root
functions. Figure 2.6 illustrates Parent 1 and Parent 2 undergoing the
crossover process. The subtrees selected (randomly) for crossover are shaded.
Figure 2.6 Example of crossover in genetic programming.
The results of this process are the two programs shown below:
Offspring 1: Output = (1+c) + (a-b)
Offspring 2: Output = b√3
In an actual GP run, these two offspring would be inserted into the new population
and subsequently evaluated to determine their fitness. The crossover mechanism is
easily implemented by computer using prefix notation, in which the crossover
subtrees are highlighted in boldface:
Parent 1 (prefix): +(3 -(a b))
Parent 2 (prefix): *(b SQRT(+(1 c)))
Offspring 1 (prefix): +(+(1 c) -(a b))
Offspring 2 (prefix): *(b SQRT(3))
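Subtree crossover can be sketched with nested Python lists standing in for parse trees ([tag, child, child, ...], terminals as plain values); the representation and helper names are illustrative assumptions:

```python
import copy
import random

def all_paths(tree, path=()):
    """Yield the index path to every node in the tree (root = empty path)."""
    yield path
    if isinstance(tree, list):
        for i, child in enumerate(tree[1:], start=1):
            yield from all_paths(child, path + (i,))

def subtree_at(tree, path):
    """Return the subtree rooted at the given index path."""
    for i in path:
        tree = tree[i]
    return tree

def replace_at(tree, path, subtree):
    """Return a copy of tree with the node at path replaced by subtree;
    the inputs are left unmodified."""
    if not path:
        return copy.deepcopy(subtree)
    new = copy.deepcopy(tree)
    node = new
    for i in path[:-1]:
        node = node[i]
    node[path[-1]] = copy.deepcopy(subtree)
    return new

def gp_crossover(parent1, parent2, rng=random):
    """Swap one randomly chosen subtree of each parent, GP-style."""
    path1 = rng.choice(list(all_paths(parent1)))
    path2 = rng.choice(list(all_paths(parent2)))
    s1, s2 = subtree_at(parent1, path1), subtree_at(parent2, path2)
    return replace_at(parent1, path1, s2), replace_at(parent2, path2, s1)
```

With Parent 1 as ["+", 3, ["-", "a", "b"]] and Parent 2 as ["*", "b", ["SQRT", ["+", 1, "c"]]], choosing the constant 3 and the subtree +(1 c) as crossover points reproduces the offspring shown above.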
The process illustrated in Figure 2.6 is the most commonly implemented crossover
operator in the GP literature. It has, however, been strongly criticised because,
although superficially similar, it operates in a fundamentally different manner to GA
crossover and, indeed, biological gene crossover. For instance, Francone et al.
(1999) state that biological crossover (and GA crossover) typically exchange genes
that are on the same position on the chromosome and that these genes have
functional similarity. This is not the case for exchanged subgroups (genes) in GP
crossover. For this reason it has been claimed that GP crossover is, in fact, no more
than a “macro-mutation” operator (Angeline, 1997). Koza, however, has responded
to those that have claimed that crossover is unnecessary (Koza, 1999). He presents
several experiments clearly illustrating that GP performs poorly without crossover.
Other statistical studies support this view, e.g. Luke and Spector (1998) and Hiden
(1998) show that GP runs generally benefit from the use of the crossover operator.²
2.6.2.2 GP Mutation
In a manner analogous to its GA counterpart, the purpose of the GP mutation
operator is to improve population diversity by generating entirely new chromosome
segments, in order to explore new regions of the search space. Like the GA mutation
operator, it is a form of asexual reproduction based on only one parent.
As an example, consider that the following program has been selected from the
existing GP population based on its fitness:
Parent: Output = (1+c) + (a-b)
Figure 2.7 demonstrates a mutation operation on this program. First, a mutation
node is randomly selected, and then the corresponding subtree (with the mutation
node as its root) is deleted. Finally, a new subtree is randomly generated (in a
manner similar to that employed when generating the initial GP population) and
inserted in the place of the deleted subtree.
For the example given, the offspring program resulting from the mutation operation
is shown below:
New subtree: Output = log10(1.2/c)
Offspring: Output = log10(1.2/c) + c + (a - b)
Again, the computer implementation of this operation is based on prefix notation,
as shown below:
² It should be noted that, for any search algorithm, there are no neighbourhood
search operators that are optimal for all problems. This is due to the implications of
the No Free Lunch (NFL) theorem (Wolpert and Macready, 1997), which states
that, averaged over all possible search problems, no search algorithm is better than
any other.
Parent (prefix): +(+(c 1) -(a b))
New subtree (prefix): log10(÷(1.2 c))
Offspring (prefix): +(+(c log10(÷(1.2 c))) -(a b))
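Subtree mutation follows the same pattern as crossover: pick a node, delete the subtree rooted there, and graft a new one in its place. A sketch on nested-list trees (the list representation and the DIV tag standing in for the divide node are illustrative assumptions):

```python
import copy
import random

def all_paths(tree, path=()):
    """Yield the index path to every node in a nested-list parse tree."""
    yield path
    if isinstance(tree, list):
        for i, child in enumerate(tree[1:], start=1):
            yield from all_paths(child, path + (i,))

def gp_mutate(parent, new_subtree, rng=random):
    """Replace a randomly chosen subtree of the parent with new_subtree
    (which, in a real run, would itself be randomly generated)."""
    path = rng.choice(list(all_paths(parent)))
    if not path:                       # mutation point is the root
        return copy.deepcopy(new_subtree)
    child = copy.deepcopy(parent)      # the parent is left unchanged
    node = child
    for i in path[:-1]:
        node = node[i]
    node[path[-1]] = copy.deepcopy(new_subtree)
    return child
```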
Note that, unlike crossover, the mutation operator introduces a subtree that contains
structures not necessarily present in the parent. In this case, the functional nodes
log10 and ÷ (divide) as well as the terminal constant 1.2 have been used. The
definition of terminal and functional sets, and how they are used in tree generation,
will be discussed further in Section 2.6.4.
Figure 2.7 Example of subtree mutation in genetic programming.
Originally, Koza maintained that crossover and direct reproduction are the only
reproduction operators that are required to complete a successful GP design (Koza,
1992). He argued that a mutation operator is necessary in the case of GAs because
crossover by itself only recombines bit string sections of the original population that
are associated with high performance, hence mutation is needed to restore the loss of
diversity that accompanies this process. Koza goes on to state that, for GP, this is
not the case, as the crossover operator combines “genes” in a functionally more
flexible way than GAs and so the equivalent loss of diversity should not occur.
While this seems to be a valid theoretical consideration, the informal consensus
among GP practitioners is that the use of mutation (at relatively low rates of about
10% or less) assists the evolution process. This is borne out by a number of
statistical studies on a variety of simple GP applications (e.g. Luke and Spector,
1998, Hiden, 1998). Hence, all the GP runs described in this thesis use the standard
sub-tree mutation operator.
2.6.3 Specifying a genetic program
Koza (1994) describes six steps necessary to define a genetic program. These are:
1) Terminal set selection: choosing the variables that are needed to solve the
problem.
2) Functional set selection: choosing the functions that are needed to operate
on the variables in order to solve the problem.
3) Fitness function specification: choosing appropriate tests of the evolved
programs.
4) Run control parameter settings: choosing parameter values for the GP run,
e.g. rates of mutation, crossover and direct reproduction.
5) Termination criterion specification: setting a condition to terminate the
run.
6) Program architecture specification: deciding how the tree (or a number of
trees in some cases) is decoded into an individual test program for evaluation
and what reproduction operators are used to alter the architecture of the
tree(s).
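One way to gather these six specification choices into a single structure is sketched below; every concrete value is an illustrative assumption, not a setting prescribed by the text:

```python
# The six GP specification steps (Koza, 1994) collected as a configuration
# structure. All values are illustrative.
gp_specification = {
    "terminal_set": ["a", "b", "c", "R"],     # step 1; R = ephemeral random constant
    "function_set": {"+": 2, "-": 2, "*": 2, "SQRT": 1},  # step 2; tag -> arity
    "fitness": "root mean squared error on the training data",  # step 3
    "run_parameters": {"pop_size": 500, "p_crossover": 0.85,    # step 4
                       "p_mutation": 0.05, "p_reproduction": 0.10},
    "termination": {"max_generations": 100, "target_fitness": 0.01},  # step 5
    "architecture": "single tree per individual",                      # step 6
}
```

Collecting the settings this way makes the sufficiency of the terminal and function sets, and the constraint that the operator rates sum to one, easy to check before a run starts.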
Some of these steps are often trivial. Determining the termination criterion, for
example, is usually a simple process, e.g. stop after a certain number of generations
have elapsed or stop when a solution of a high enough quality is found. Specifying
an appropriate fitness test suite and appropriate program components (i.e. terminal
and functional nodes), however, can be a difficult task if the designer is to ensure
that the evolved programs can solve the set problem in a desirable manner.
The remainder of the chapter highlights, briefly, a number of issues pertaining to the
use of GP in solving engineering and science problems. For a more general, and
thorough, overview of GP, see Koza (1999).
2.6.4 Terminals and functions: specifying the program components
The specification of a “toolbox” of components, the terminals and functions, which
can be subsequently manipulated by simulated genetic processes into a working
program, is the first step that calls upon the human user’s knowledge of the problem.
The user will have a good idea, at this stage, what they want the evolved programs
to accomplish and will have some degree of knowledge as to what information must
be manipulated to solve the problem.
The terminal set and the functional set must exhibit the joint property of sufficiency
(Koza, 1992), i.e. out of all the trees that can be constructed from them, there must
be at least one that is capable of expressing the actual solution to the problem. In
addition to the sufficiency requirement, it is necessary that the terminal and function
sets in GP exhibit the property of closure (Koza, 1992). This simply means that any
tree generated from these sets must be syntactically valid. Closure is attained by
ensuring that all functions and terminals return values of the same data type (e.g.
Boolean). This is usually straightforward to achieve for many engineering
applications: the terminals and constants will generally be of the scalar floating point
type, and the standard arithmetical functions will return a value of the same type as
the input arguments. However, there are a few minor exceptions: the floating point
division operator will not return a floating point value if both input arguments are
zero³; the square root operator will return a complex value if its input argument is
less than zero. Similarly, the natural logarithm operator will return a complex value
if its input argument is less than zero, or will return a value of “infinity” or
“undefined” (depending on the computing language being used) if its input is zero.
These problems can be sidestepped by taking a few liberties with the definition of
some mathematical functions in order to maintain data type consistency. For
instance, the division operator can be redefined so that it returns a zero when both
arguments are zero and the square root operator can be redefined to return the
positive root of the absolute value of its input argument. In the GP literature, this is
commonly referred to as “protecting” functions; the redefined division operator is
referred to as protected division, the redefined natural logarithm as protected natural
log and so forth.
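The protected operators described above can be sketched as follows. This is a minimal illustration in Python (the thesis itself used Matlab); the names pdiv, psqrt and plog are illustrative, and return conventions for protected operators vary between GP implementations.

```python
import math

def pdiv(a, b):
    """Protected division: returns 0 when the denominator is zero,
    so 0/0 (and any x/0) never produces NaN or halts evaluation."""
    return a / b if b != 0 else 0.0

def psqrt(a):
    """Protected square root: positive root of the absolute value,
    so negative inputs never produce a complex result."""
    return math.sqrt(abs(a))

def plog(a):
    """Protected natural log: log of the absolute value, with an
    assumed value of 0 at the origin to avoid -infinity."""
    return math.log(abs(a)) if a != 0 else 0.0
```

Because every operator now maps floating point inputs to a floating point output, any tree built from these functions satisfies the closure property.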
2.6.5 Handling constants in genetic programming
Although it is often possible to specify what inputs will be required for an evolved
program to solve a problem, it is not generally possible to know in advance what
constants are required. Exceptions to this rule are Boolean problems, modular
problems (e.g. clock arithmetic) and the like.
Engineering and scientific problems, in general, require the use of non-integer real
constant terms in their solutions, but how can one incorporate these in a terminal set
without knowing their exact values beforehand? The standard GP solution to this is
known as the ephemeral random constant (ERC) method (Koza, 1992) and its use,
although often augmented by other methods of determining constant values, is
widespread.
³ Many computing languages will simply halt program execution and return a ‘division by zero’ error
at this point. Matlab returns the value ‘NaN’ (not a number); this effectively renders the parse tree
meaningless, as all further operations on ‘NaN’ result in ‘NaN’.
The ERC method is actually very simple to implement: a special terminal R is added
to the existing terminal set and used in an identical way to the other terminals when
generating the initial population (i.e. generation 0) of random program trees. At this
point, each occurrence of R in the population can be considered a placeholder for an,
as yet, unknown constant. Then, before the tree is inserted into the initial population,
each instance of R is replaced by a randomly generated real constant from a user-
defined range, e.g. [-1, 1]. Once the initial population has been generated, the values
of the constant nodes are fixed throughout the run; it is only before insertion into
generation 0 that the placeholder R is used. However, many researchers who have
investigated the use of GP for data modelling purposes have asserted that this
method of handling constants is inadequate and numerically inefficient. Other
methods of handling constants have been suggested and these are discussed more
fully in the survey of GP for process modelling purposes in Chapter 3.
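The ERC mechanism described above can be sketched briefly. The nested-list parse tree representation below is an illustrative assumption, not the thesis' own data structure: every occurrence of the placeholder terminal 'R' in a newly generated tree is replaced by a constant drawn uniformly from a user-defined range before insertion into generation 0.

```python
import random

def instantiate_ercs(tree, lo=-1.0, hi=1.0):
    """Replace every 'R' placeholder in a nested-list parse tree with a
    random constant from [lo, hi]; once fixed, these values persist."""
    if tree == 'R':
        return random.uniform(lo, hi)
    if isinstance(tree, list):
        return [instantiate_ercs(node, lo, hi) for node in tree]
    return tree  # ordinary terminal or function name: left unchanged

# e.g. ['+', 'x1', ['*', 'R', 'x2']]  ->  ['+', 'x1', ['*', <constant>, 'x2']]
```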
2.6.6 Multiple populations
Although the basic GP formulation uses a single population of individuals, it is
possible to distribute the population by employing several sub-populations (called
“demes”) that are evolved in isolation except for the periodic exchange (migration)
of individuals from a deme to one or more other demes. This is done to prevent the
premature convergence, caused by a loss of diversity, that can occur in the single-
population algorithm. This scheme also has the advantage that it is well suited to GP
performed over parallel processing units because the fitness evaluations and
selection are performed separately on each processing unit and the only
communications between these units are the periodic transfer of migrating
individuals (Koza, 1995).
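The periodic migration step can be illustrated as follows. The ring topology and copy-then-remove migration policy below are assumptions for the sake of a concrete sketch; many variants (random topologies, replace-worst insertion) appear in the literature.

```python
import random

def migrate(demes, n_migrants=2):
    """Move n_migrants randomly chosen individuals from each deme to the
    next deme around a ring. Migrants are sampled from all demes first,
    so a deme's emigrants are always its own original members."""
    migrants = [random.sample(deme, n_migrants) for deme in demes]
    for i, group in enumerate(migrants):
        dest = demes[(i + 1) % len(demes)]
        for ind in group:
            dest.append(ind)      # insert into the neighbouring deme
            demes[i].remove(ind)  # remove from the source deme
    return demes
```

Between migration events, fitness evaluation and selection proceed independently within each deme, which is what makes the scheme attractive for parallel hardware.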
2.7 Summary
This chapter has introduced the field of evolutionary computation in an engineering
setting and has focussed primarily on the mechanisms involved in selection, and the
reproduction operators used in genetic algorithms and genetic programming. A
number of issues relating to implementation of genetic programming to solve
engineering problems have also been addressed. A discussion of the use of genetic
programming as a tool for data based modelling is presented in Chapter 3.
Chapter 3
Genetic programming as a modelling tool
Nomenclature

Acronyms

The following shorthand expressions are used to designate the various regression methods and algorithms considered in this thesis. Where appropriate, the Chapters containing experimental work with the algorithm are also cited.

CR  Continuum regression.
EBNNPLS  Error based neural network PLS.
EBQPLS  Error based quadratic PLS.
EBRBFPLS  Error based radial basis function PLS.
GP_NPLS1  The version of SGP-PLS implemented by Hiden (1998).
GP_NPLS2  The version of SGP-PLS implemented by Hiden (1998), which incorporates optimisation of the input projection weights.
GP-PLS  An umbrella term referring to GP based PLS methods.
INNPLS  Integrated neural network based PLS.
LQPLS  Linear quadratic PLS.
MLR  Multiple linear regression.
MSGP-PLSa  The default multiple gene implementation of SGP-PLS (Chapter 6).
MSGP-PLSb  A multiple gene implementation of SGP-PLS that uses dynamic data partitioning (Chapter 7).
MTGP-PLSa  The default multiple gene implementation of TGP-PLS (Chapter 6).
MTGP-PLSb  A multiple gene implementation of TGP-PLS that uses dynamic data partitioning (Chapter 7).
MTGP-PLSc  A multiple gene implementation of extended TGP-PLS that uses dynamic data partitioning and a standard reflected binary Gray coding of the input projection weights (Chapter 8).
NNPLS  Neural Net PLS.
PCA  Principal component analysis.
PCR  Principal component regression.
PLS  Partial least squares / Projection to latent structures.
PLS1  Linear PLS with a single output variable.
PLS2  Linear PLS with multiple output variables.
PPR  Projection pursuit regression.
QPLS  Quadratic PLS.
RBFPLS  Radial basis function network PLS.
SGP-PLS  Umbrella term for sequential GP-PLS algorithms.
SGP-PLSa  The default single gene implementation of SGP-PLS used in this thesis (Chapter 5).
SGP-PLSb  A single gene implementation of SGP-PLS that uses dynamic data partitioning (Chapter 7).
SPLINE-PLS  A non-linear PLS algorithm that uses piecewise polynomials.
TGP-PLS  Umbrella term for team based GP-PLS algorithms.
TGP-PLSa  The default single gene implementation of TGP-PLS used in this thesis (Chapter 5).
TGP-PLSb  A single gene implementation of TGP-PLS that uses dynamic data partitioning (Chapter 7).
TGP-PLSc  A single gene implementation of extended TGP-PLS that uses dynamic data partitioning and a standard reflected binary Gray coding of the input projection weights (Chapter 8).

Symbols

α  A function parameter.
β  A function parameter.
φj  The jth regressor vector in a basis function expression.
A  A subset of the training data.
bi  The ith linear model parameter in MLR. Also, the univariate regression model coefficient for the ith latent variable stage in PLS.
{b0,i,k, b1,i,k}  Model coefficients for the kth GP individual and the ith inner model in single gene GP-PLS.
B  A subset of the training data.
B  An (m × p) matrix of model coefficients.
Bq  A (q × p) reduced matrix of PCR model coefficients.
{c0,i, …, c2,i}  The quadratic coefficients for the ith inner model in QPLS.
corr(x,y)  The correlation coefficient of x and y.
cov(x,y)  The covariance of x and y.
dj  The jth basis function coefficient.
e  An (n × 1) error vector.
E  An (n × p) error matrix.
f  The number of basis functions.
f(Jk)  The raw fitness of the kth individual in a population.
f′(Jk)  The adjusted fitness of the kth individual in a population.
gj  The jth basis function.
G  The number of generations in a GP run.
Gj,k  The jth gene in the kth multiple gene expression.
Gj,i,k  In MSGP-PLS, the jth gene in the kth individual at the ith latent variable stage. In MTGP-PLS, the jth gene in the ith team member of the kth team.
K  Process gain.
Jk  The kth individual in a population.
Ji,k  In SGP-PLSa, the kth GP individual at the ith latent variable stage. In TGP-PLSa, the ith team member in the kth team.
m  The number of input (independent) variables.
n  The number of measurements in the training set.
nA  The number of measurements in training subset A.
nB  The number of measurements in training subset B.
N  The number of individuals in a GP population.
Ng^k  The number of genes in the kth multiple gene individual.
Ngm  The maximum number of allowed genes in a multiple gene individual.
Nlv,k  The number of members in the kth team that are used to generate the overall model.
Np  The number of symbolic members in a team.
Nt  Tournament size.
Nw  The number of bits used to encode a projection weight in TGP-PLSc and MTGP-PLSc (extended binary teams).
p  The number of output (response) variables.
pi  An (m × 1) vector of input loadings at the ith latent variable stage.
pi,k  In TGP-PLS, an (m × 1) vector of input loadings at the ith latent variable stage as calculated in the kth team evaluation.
Pc  Probability of crossover.
Phigh  Probability of high level crossover.
Pi  The ith GP population.
Pm  Probability of mutation.
Pr  Probability of direct reproduction.
q  The number of principal components retained in PCA.
qi  A (p × 1) vector of PLS output loadings for the ith latent variable stage.
r  The radius of a circle.
rms(.)  The root mean square value of (.).
R  The set of ephemeral random constants.
s  The Laplace operator.
Si  The ith subpopulation.
t  The generation index.
ti  An (n × 1) vector of input scores for the ith latent variable stage.
ti(A)  An (nA × 1) vector of input scores over the subset A for the ith latent variable stage.
T  An (n × m) matrix of PCA scores.
Tk  The kth GP team in a population.
Tq  An (n × q) reduced matrix of PCA scores.
ui  An (n × 1) vector of output scores for the ith latent variable stage.
ui(A)  An (nA × 1) vector of output scores over the subset A.
ui,k  In TGP-PLS, the (n × 1) vector of the ith output scores vector as calculated in the kth team evaluation.
ûi,k  In SGP-PLS, the (n × 1) vector of the prediction of the ith output scores vector by the kth GP individual.
var(x)  The variance of the elements in the vector x.
V  An (n × m) matrix of PCA loadings.
Vq  An (n × q) reduced matrix of PCA loadings.
wi  An (m × 1) vector of input projection weights at the ith latent variable stage.
wi,k  In TGP-PLS, an (m × 1) vector of input projection weights at the ith latent variable stage as calculated in the kth team evaluation.
Wi,k  The ith binary team member in the kth team in TGP-PLSc and MTGP-PLSc.
x  An input (predictor) variable.
xj  An (n × 1) vector of scaled measurements on the jth input variable.
xt  The value of the input variable at time t.
X  An (n × m) input data matrix. The jth column contains scaled measurements of the jth input variable.
X̂  An (n × m) matrix of estimated values for X.
X(A)  An (nA × m) matrix of input values over the subset A.
Xi  The (n × m) deflated input matrix at latent variable stage i.
y  An output (response) variable.
yt  The value of the output variable at time t.
y  An (n × 1) vector of scaled measurements on a single output variable.
y(A)  An (nA × 1) vector of output values over the subset A.
yj  An (n × 1) vector of scaled measurements on the jth output variable.
Y  An (n × p) output data matrix. The jth column contains scaled measurements of the jth output variable.
Yi  The (n × p) deflated output matrix at latent variable stage i.
Introduction

1.1 Background

Industrial plants can be operated most safely and economically when the engineer has detailed fundamental knowledge of how the component processes work. Ideally, every physical aspect of each process is understood in detail, allowing it to be efficiently controlled so that it performs in optimal conditions. In the real world, however, this is rarely the case. Modern industrial processes tend to be highly complex, involving physical and chemical interactions that are often poorly understood at the quantitative level. Sometimes the control of these processes is based purely on rules gleaned from experience of operating the plant. This lack of fundamental process knowledge precludes the use of detailed mathematical models, which are, in any event, usually too time consuming and expensive to develop.

On the other hand, whilst fundamental knowledge of process behaviour is difficult to obtain, process data is not. Plant instrumentation and computer systems routinely collect and store data on hundreds of variables such as flow rates, pressures and temperatures, as well as measures of product quality. This data affords the possibility of constructing entirely empirically determined models of how the process behaves (models of this type are usually called "black box" models and allow the outputs of the process, e.g. product quality variables, to be predicted using the inputs of the plant, e.g. reactant flowrates, chemical composition etc.). The drawbacks of black box modelling are that the models do not usually have any physical interpretation and they cannot safely extrapolate beyond the range of the data that were used to train them. Process data can also be used in conjunction with existing process knowledge; in this case only certain relationships are determined from the data and these are combined with a physically derived mathematical model. This approach is often called "grey box" modelling.
However, in the absence of formal physical and chemical equations, methods that rely solely on plant data to quickly develop cost-effective and accurate models are needed. The concept of automatic model development, i.e. methods that allow good data based models to be built with minimum expert knowledge, is of particular interest.

1.2 Data based modelling

1.2.1 Linear regression methods

Methods that assume a linear relationship between the process inputs and outputs are the traditional tool of the engineer because they are fairly simple in structure and can sometimes result in models that have some physical interpretation. These are usually in the form of regression models, such as multiple linear regression (MLR) and, more recently, principal component regression (PCR) and partial least squares regression (PLS; Wold, 1975)¹. The latter two methods are developments of MLR that have certain properties that make them useful for finding linear relationships between output variables and large numbers of highly correlated input variables. PLS works by projecting the input and output data onto low dimensional subspaces and then fitting univariate regression models between the projections. It has proved to be an effective way to develop multivariate process models from noisy, high dimensional and correlated data (e.g. see Wise and Gallagher, 1996). A drawback with the PLS regression method is that it assumes a linear relationship between inputs and outputs, whereas the behaviour of industrial processes is frequently non-linear. There are ways of extending the PLS framework to capture non-linear relationships, however, and these are described in Chapter 4.

1.2.2 Automatic model development

Methods to develop models that can effectively capture non-linear relationships in process data with a minimum of expert knowledge have been extensively researched. The most well known of these techniques is the artificial neural network (ANN). This method has had a good degree of success in the process industries (e.g. see Willis et al., 1992, Lennox, 1996).
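As a concrete aside, the projection idea behind the linear PLS method of Section 1.2.1 (project the inputs onto latent scores, fit a univariate regression between the scores and the output, deflate and repeat) can be sketched in a few lines of NumPy. The variable names (X, y, w, t, p, b) follow the Nomenclature, but the implementation details below (mean-centring, a single output variable, a fixed number of latent variables) are simplifying assumptions rather than the exact algorithm described in Chapter 4:

```python
import numpy as np

def pls1(X, y, n_lv):
    """Sketch of linear PLS1: project X onto latent scores t = X w,
    fit a univariate regression between t and y, deflate, repeat."""
    X = X - X.mean(axis=0)          # data assumed scaled; centre here
    y = y - y.mean()
    W, P, b = [], [], []
    for _ in range(n_lv):
        w = X.T @ y                 # input projection weights w_i
        w = w / np.linalg.norm(w)
        t = X @ w                   # input scores t_i
        b_i = (t @ y) / (t @ t)     # inner (univariate) model coefficient b_i
        p = X.T @ t / (t @ t)       # input loadings p_i
        X = X - np.outer(t, p)      # deflate the input matrix
        y = y - b_i * t             # deflate the output vector
        W.append(w); P.append(p); b.append(b_i)
    return np.array(W).T, np.array(P).T, np.array(b)
```

A non-linear PLS method of the kind developed in this thesis replaces the linear inner model t ↦ b·t with a non-linear function, e.g. one evolved by GP.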
¹ PLS is also known as "projection to latent structures" for reasons that will become apparent in Chapter 4.

There are certain disadvantages to using ANNs, however, not least of which is that the user must make informed decisions about the size, topology and training method, as well as the selection of appropriate network inputs (McKay et al., 1997). Of course, it is arguable that a meta-level optimisation layer for performing these choices can be implemented (although this may be computationally costly) or established rules of thumb may be used to determine network topologies, but Sarle (1997) points out that many of these rules are 'nonsense' and states that "in most situations, there is no way to determine the best number of hidden units without training several networks and estimating the generalisation error of each".

Another relatively recent technique that shows promise in the automatic development of process models is that of genetic programming (GP; Koza, 1992). GP is an evolutionary search method and it was originally designed as a way of automatically learning how to solve problems by applying artificial selection and reproduction to populations of solutions that are encoded as variable length tree structures. GP was identified as a good candidate for automatic model development because it appeared to be able to automatically select the appropriate input variables as well as discover the model structure and parameters simultaneously (this process is known as symbolic regression). Whilst this is true to a certain extent, some research at the University of Newcastle has shown that the standard form of GP (i.e. the original form of GP using a single population of trees constructed from arithmetic and simple non-linear functions with no use of advanced architectures or representations) does not generally perform any better than feedforward artificial neural networks with sigmoidal activation functions (Hiden, 1998). It is also debatable whether the GP approach is any more 'automatic' than the ANN approach, since a number of user decisions must be made with regard to population size, architecture, selection method, choice of primitives etc.
However, it remains to be seen if GP is, in practice, more amenable to automatic model development.

1.2.3 Combining GP with PLS (GP-PLS)

In an attempt to improve the ability of GP to model steady state non-linear multivariate systems, a hybrid of GP and the PLS modelling method was proposed
(GP_NPLS1; Hiden, 1998, Hiden et al., 1998). This method, in common with other non-linear PLS methods, sequentially supplies a series of non-linear univariate models to fit the relationships between the data projections. The GP_NPLS1 method was found to increase the accuracy of the evolved models, in terms of the prediction errors on unseen data, but was not found to give any better performance than an equivalent neural network based PLS method. A variant on this method, GP_NPLS2, was also proposed. GP_NPLS2 retains the same basic architecture as GP_NPLS1 but it incorporates an iterative non-linear least squares routine that optimises the data projection directions. GP_NPLS2 gave better results than GP_NPLS1 and outperformed the equivalent neural network PLS approach (i.e. neural network PLS with optimised projection directions). Despite this improvement, GP_NPLS2 was deemed unacceptable for use as a modelling tool due to the extremely high computational cost requirements of the external optimiser. However, it was felt that the GP-PLS concept had unexplored potential and there might be ways of modifying it so that better results could be attained without resorting to the use of "brute force" optimisation methods, which are both time consuming and aesthetically unappealing from an engineering standpoint. This led to the starting point of the work described in this thesis, the outline of which is detailed in the next section.

1.3 Thesis aims and outline

The primary aim of this work is to demonstrate that the GP-PLS method has potential as a viable process systems modelling tool by improving its performance without recourse to methods that would greatly increase the computational load, such as iterative optimisation routines. This is accomplished by testing some novel GP-PLS architectures and evaluating training and validation methods on some simple steady state test systems.
This thesis begins, however, by introducing the field of evolutionary computation (focussing particularly on genetic algorithms and genetic programming) before reviewing the use of genetic programming as a modelling (system identification) tool with industrial process applications. A brief abstract of each of the following Chapters is provided below:
Chapter 2: Evolutionary computation and genetic methods
The underlying principles of evolutionary computation are briefly introduced, as well as a more detailed discussion of the mechanisms of genetic algorithms and genetic programming.

Chapter 3: Genetic programming as a modelling tool
A review of the use of GP as a modelling tool is provided, with an emphasis on applications for process systems modelling. Representational and numerical issues relevant to the model building capabilities of GP are discussed.

Chapter 4: Non-linear PLS using genetic programming
The mechanisms underlying linear PLS regression and related approaches are discussed, as well as non-linear extensions to the PLS framework, such as neural net based PLS and the sequential GP-PLS algorithms proposed by Hiden (1998). A novel GP-PLS architecture, team based GP-PLS, is proposed. This evolves a population of co-operating teams of models to solve the modelling task in parallel, unlike the original GP-PLS method in which the models are supplied sequentially by consecutive GP runs.

Chapter 5: Comparing team and sequential GP-PLS
A comparative study of the team based GP-PLS architecture with the sequential architecture proposed by Hiden (1998) is described. This study is based on the use of data obtained from three synthetic test systems: a simulated cooking extruder, a non-linear mathematical function and a simulated pH process. Some properties of the evolved models are described and an attempt is made to improve the generalisation performance of the evolved models by use of a split-sample validation method.

Chapter 6: Multigene GP-PLS
The combination of the "multigene" GP method (Hinchliffe et al., 1996) with both team based and sequential GP-PLS methods is described. This method decomposes GP models into modular substructures in order to improve their evolvability properties. A comparative study of the multigene GP-PLS algorithms and the single gene algorithms on the three synthetic test systems is described.

Chapter 7: Dynamic data partitioning
A novel method for GP-PLS training is proposed and combined with the team and sequential algorithms. This method, provisionally called dynamic data partitioning, is intended to improve the generalisation of the evolved models by reducing model overfitting. A comparative study of the various GP-PLS methods with and without the dynamic data partitioning method on the three synthetic test systems is described.

Chapter 8: Extended teams
A novel team based GP-PLS architecture, whereby the data projection directions are encoded as binary team members and evolved in parallel with the GP-PLS models, is proposed. A comparative study (on the three test systems) of the extended team algorithm with the algorithms developed in Chapter 7 is described and some properties of the evolved models are discussed.

Chapter 9: Conclusions and further work
A number of comments and conclusions on the development of the GP-PLS framework in this thesis are offered. Suggestions for further work in the area are provided.
2 Evolutionary computation and genetic methods
2.1 Introduction

This thesis is concerned with the application of genetic search to the projection based regression method of partial least squares (PLS). Genetic search methods are members of a closely related family of procedures called evolutionary computation (EC). The purpose of this chapter is to introduce EC, and subsequently genetic algorithms and genetic programming, by exposition of the underlying principles of simulated evolution. Concepts that are of importance in EC (and in the work that is discussed in the following chapters), such as evolvability and representational issues, are outlined. The chapter concludes with a discussion of various features of genetic programming as a prelude to the use of GP in a system identification framework.

2.1.1 What is evolutionary computation?

Evolutionary computational methods are a class of iterative learning algorithms that imitate the natural processes of biological evolution in order to solve science and engineering problems. EC methods utilise a set of concepts and arguments that are essentially identical to those that underpin the modern theoretical framework of evolutionary biology. This framework is known as neo-Darwinism, as it builds on the ideas of "survival of the fittest" and cumulative selection first proposed coherently by Darwin (1859). If one combines Darwin's ideas with the notion of differences in genetic encoding (the genotype) mapping to differences in physical attributes (the phenotype), then evolution can be regarded as a statistical process operating on complex data structures. It is then possible to view evolution as an open-ended optimisation process that can be formalised, modelled and exploited.

The basic requirements for evolution to occur in a biological context are (e.g. Darwin, 1859):

• There is a finite population of individuals.
• The individuals can reproduce and pass on their traits to their offspring.
• There should be a variety of traits within the population of individuals.
• The traits of the individuals should be related to their ability to survive (i.e. the variety of the individuals should enable them to compete for the right to be selected for reproduction).

In addition to these points there should be added the explicit requirement of encoding of the traits:

• The salient characteristics of the individual (i.e. those characteristics which impart upon the individual the ability to reproduce in its environment) should be partly transmissible to its offspring via some sort of encoding system. Furthermore, the transmission of the traits should not be error free, otherwise no new traits can develop.

It is now almost universally accepted that, in nature, it is predominantly the DNA code that determines the variability of individuals. Thus, it is the complex interactions between the genetic information and the processes of selection in a finite population that give rise to the evolutionary driving force. In nature this evolutionary pressure is not directed towards a particular goal, it is open ended, whereas in simulated evolutionary processes (for the most part) the evolution is directed towards solving a specific problem.

2.2 EC algorithms: basic structure and functionality

EC algorithms work by iteratively processing a population of individuals, each of which forms a candidate solution to some problem. At the beginning of the EC algorithm, it would not be expected that any individual would constitute a good solution, since the initial population is randomly generated. The population is then forced (by means of a process analogous to natural selection) to evolve with the goal of producing better and better candidate solutions. Different EC algorithms use different representations of candidate solutions. Representations range from real-valued vectors in evolutionary programming (Fogel, Owens and Walsh, 1966) and evolutionary strategies (e.g.
see Schwefel, 1995) to bit strings and symbolic tree structures in genetic algorithms (Holland, 1975) and genetic programming (Koza, 1992).
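To make the tree-structured representation used by GP concrete, the following sketch encodes individuals as nested Python tuples over a small primitive set. The function and terminal sets here (`+`, `-`, `*`, two input variables and a constant) are illustrative assumptions only, not the primitives used later in the thesis:

```python
import operator
import random

# Hypothetical primitive set: function symbols mapped to (implementation, arity),
# plus a terminal set of input variable names and a numeric constant.
FUNCS = {'+': (operator.add, 2), '-': (operator.sub, 2), '*': (operator.mul, 2)}
TERMS = ['x1', 'x2', 1.0]

def random_tree(depth, rng):
    """Grow a random expression tree (a nested tuple) of at most `depth` levels."""
    if depth == 0 or rng.random() < 0.3:
        return rng.choice(TERMS)                 # leaf: variable or constant
    f = rng.choice(sorted(FUNCS))
    return (f,) + tuple(random_tree(depth - 1, rng) for _ in range(FUNCS[f][1]))

def evaluate(tree, env):
    """Recursively evaluate a tree; `env` maps variable names to values."""
    if not isinstance(tree, tuple):
        return env.get(tree, tree)               # variable lookup, or a constant
    fn, _arity = FUNCS[tree[0]]
    return fn(*(evaluate(arg, env) for arg in tree[1:]))
```

For example, `evaluate(('+', 'x1', ('*', 'x2', 'x2')), {'x1': 1.0, 'x2': 3.0})` returns 10.0. Crossover and mutation then operate on such individuals by swapping or regenerating subtrees.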
The level of performance of any particular individual can be ascribed some numerical value; this is frequently referred to as its fitness. Assigning the fitness involves evaluating the individual against some problem dependent objective function (called a fitness function in EC literature) and is, for non-trivial applications, the most time consuming part of the algorithm. The rate of replication for any individual is determined by its fitness value and the fitness values of the other individuals in the population. Exactly how the replication occurs is determined by the selection scheme used to pick the individuals for replication and the genetic operators used to perform the replications (possibly with modifications to the individuals; analogous to mutation and sexual reproduction in biological evolution). Those individuals that perform well, i.e. those of above average fitness, must have a selection rate higher than those that perform relatively poorly in order for their genetic information to successfully penetrate, and remain tenable in, the population.

The algorithm is terminated according to some pre-specified termination criteria. This is generally dependent on the type of algorithm and the application. The most frequently used method is to allow a pre-set number of iterations to elapse before termination, although in some situations, where there is some a priori quantitative knowledge of the problem solution, it is possible to terminate the procedure when a member of the population gives an acceptable solution.

2.2.1 Historical perspective

The roots of EC can be traced back to the late 1950s and early 1960s with works published in the general area of machine learning by a number of contributors, e.g. Friedberg (1958), Holland (1962).
Interest in the use of evolutionary methods for performing adaptation continued throughout the 1970s but was mostly restricted to a relatively small number of researchers with access to suitable computer hardware, and who published only in a narrow spectrum of journals. This situation persisted for a number of years and it was not until the early 1990s that the previously disparate components of (what is now termed) EC formally cohered (Bäck et al., 1997).
The mid 1980s saw the beginning of a more widespread interest in EC methods. Largely, this was catalysed by the availability of relatively cheap, high performance computing to researchers in a variety of technical disciplines. Difficult optimisation problems (i.e. those posed in noisy, uncertain and highly constrained domains), in particular, have become popular candidates for the application of EC techniques. As greater computing power becomes available, however, it is expected that EC methods will increasingly be used for design purposes.

Most of the evolutionary algorithms around today can be loosely classified as belonging to one of the following three categories: genetic algorithms (subsuming genetic programming and classifier systems), evolutionary programming and evolutionary strategies. These approaches are highly related but, historically, they were developed independently (Fogel, 1997).

2.2.2 Context of EC in engineering search and optimisation

A number of search and optimisation methods have been developed for wide ranging uses in the fields of science, engineering and economics. They are typically applied when the solution (or solutions) to the problem being examined cannot be readily expressed in a neat, closed analytical form. This is usually the case in the majority of real-world problems: often the available information is not sufficient for a simple solution to be deduced, or the mathematical analysis may be intractable. Hence, further techniques to search for a satisfactory solution are usually required.

Calculus driven and enumerative methods form the traditional base of search and optimisation techniques. Exhaustive enumerative methods, e.g. dynamic programming (Bellman, 1957), directly evaluate the objective function for possible solutions point by point. The regions of search are progressively refined and explored (e.g.
using geometrical considerations) so that the number of points evaluated does not become too large and degenerate the procedure into a random search. However, these techniques are not efficient and they “break down on problems of moderate size and complexity” (Goldberg, 1989).
Calculus driven techniques assume that the space to be searched can be treated as an analytically well-behaved surface with extrema that can be located using derivative functions. The efficacy of such a search is highly dependent on the topography of the optimisation surface and the initial conditions. Again, the assumptions that need to be made about the behaviour of the search space are quite strong and, in general, are not satisfied by the majority of real-world problems. The difficulty in solving these problems has been the main driving force behind the development of stochastic (including evolutionary) methods.

It is possible to employ algorithms that do not rely on the search space being continuous and well behaved. Evolutionary algorithms are included in this class of search methods, as is the (non-population based) class of algorithms known as simulated annealing (Metropolis et al., 1953). This method searches points in the space of possible solutions in a probabilistic manner. Previous candidate solutions are perturbed according to a statistical schedule analogous to the annealing method in metal cooling. Initially, when the "temperature" is high, perturbations to previous solutions are accepted with high probability. As the algorithm continues, a cooling schedule is imposed so that future perturbations are accepted with ever decreasing probability. The mechanisms involved in a simulated annealing optimisation are similar to those occurring in certain evolutionary algorithms; the main differences are in the physical analogy used and the use of a population of search points in the evolutionary case.
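The annealing loop just described can be sketched in a few lines. The geometric cooling schedule, step size and one-dimensional search space below are arbitrary illustrative choices, not part of the method as defined by Metropolis et al. (1953):

```python
import math
import random

def anneal(f, x0, step=1.0, T0=1.0, cooling=0.99, iters=3000, seed=0):
    """Minimise f by simulated annealing: always accept downhill moves,
    accept uphill moves with probability exp(-delta/T), cool T geometrically."""
    rng = random.Random(seed)
    x, fx = x0, f(x0)
    best, fbest = x, fx
    T = T0
    for _ in range(iters):
        cand = x + rng.uniform(-step, step)    # perturb the previous solution
        fc = f(cand)
        delta = fc - fx
        if delta <= 0 or rng.random() < math.exp(-delta / T):
            x, fx = cand, fc                   # move accepted
        if fx < fbest:
            best, fbest = x, fx                # track the best point seen
        T *= cooling                           # impose the cooling schedule
    return best, fbest
```

Early on, when T is large, even strongly uphill moves are accepted, which allows escape from local minima; as T falls the search becomes effectively greedy.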
One of the frequently cited advantages of evolutionary algorithms over traditional methods is that the use of a population of search points is less sensitive to the initial conditions of the search, and that the explicit parallelism of the algorithms makes them less likely to become trapped at local extrema in a multi-modal search space. Another reason for their popularity is that it is not necessary to evaluate or estimate derivatives during the search: no auxiliary information other than objective function evaluations is required. It is important to remember that, despite the recent excitement, evolutionary
algorithms are not voodoo. They, like any other search methods, have their limitations. It is, in general, not possible to establish convergence proofs for evolutionary methods (although this can be done in certain cases, e.g. Rudolph, 1994) and it is somewhat unclear (unlike calculus based techniques) under what circumstances evolutionary algorithms perform poorly or what representation is most appropriate for a given task. Another problem is that, as the complexity of the target problems becomes greater, the computational demands of these algorithms begin to outstrip the available resources. Hence, the cost of repeated experimentation can be prohibitive. This can often limit their usefulness for certain purposes. Indeed, the perceived high computational cost to performance ratio involved in the best performing GP-PLS algorithm proposed by Hiden (1998) was the prime motivation for much of the work tackled later in this thesis.

2.2.3 Evolvability and the choice of representation

In addition to the requirements outlined in Section 2.1.1, it is necessary for the population as a whole to have a property known as evolvability, i.e. "the ability of random variations to sometimes produce improvement" (Wagner and Altenberg, 1996). Without this "hidden" criterion, the requirements listed earlier are not sufficient to ensure evolution. Evolvability is, in general, a complex property of the way that the genotype (i.e. the coding of the individual as an abstract mathematical entity) maps to the phenotype (i.e. the "physical" structure of the individual, which ultimately dictates its behaviour). In EC designs this is often, mistakenly, considered to be a purely representational problem, with the fitness function regarded as a self-evident and immutable goal, rather than as a functional ingredient of the evolutionary process (Jakobi, 1996).
This means a somewhat more lateral approach to fitness function design might be needed in EC designs than is normally required for traditional search techniques. Evolvability is a necessary property of a successful EC application but how does one, from a practical standpoint, go about achieving it? Actually, up to a point, this may not be as difficult as it sounds: common EC representations (e.g. genetic algorithm bit strings) are common because they are structures that, empirically, have
been shown to exhibit good evolvability properties in a number of situations. Furthermore, the form of the fitness function is, to some degree, pre-determined by the nature of the application domain. The skill of the designer then lies in modifying these basic components in order to further improve the evolvability of the population, e.g. by augmenting the fitness functions with penalty functions (e.g. Searson et al., 1998) or by modularising the representation by ensuring that, as far as possible, functionally independent phenotypic effects are represented by syntactically independent genotype structures (Altenberg, 1994). So, whilst there are some general avenues of exploration open to the designer of an EC application wishing to maximise evolvability, there are no hard and fast rules for accomplishing this, and so the designer frequently must utilise an iterative, heuristic procedure based on the recommendations of the available literature and experience of similar applications. The methodology of EC designs is discussed further in Section 2.4.

2.3 Selection mechanisms

The selection mechanism is central to the successful operation of evolutionary algorithms. It must improve the average fitness of the next generation by giving individuals with a high relative fitness a high probability of being selected. Then, reproduction operators (such as mutation and crossover in the case of genetic algorithms) can be applied to the selected individuals to create new individuals, thereby investigating new regions of the search space. Thus, the selection mechanism allows the exploitation of genetic material currently contained within the population, with a view to its further improvement in future generations by means of evolutionary reproduction operators.
This is in stark contrast to traditional “hill climbing” techniques that focus only on transforming the current best solution into a better one, ignoring the possibilities of previous partial solutions and leaving the approach susceptible to being trapped in a local optimum. By allocating trials to inferior solutions, evolutionary algorithms forgo an immediate, moderate payoff in expectation of a higher future payoff. A balance is struck between the exploitation of individuals with a higher than average fitness in the population and the exploration
of individuals that are not quite as good (but may contain genetic information that could be useful when mutated to a slightly different form or suitably combined with other individuals). The weighting of this balance is determined by the selection pressure over the population. The term “selection pressure” is frequently used in an informal manner¹ to indicate the probability that individuals with a given fitness value have of being picked by the selection process. This term can also be applied to a population as a whole. If it is said that there is a high selection pressure over a population, it usually means that the selection mechanism is heavily biased towards individuals of high relative fitness, with the advantage of greatly raising the average fitness of the next generation. If too high a selection pressure is applied, however, this could have the undesirable effect of causing a loss of diversity in the next generation and premature convergence of the algorithm to an unsatisfactory solution. Conversely, with too little selection pressure the algorithm stagnates and, in the degenerate case, becomes little better than a random search. The choice of the level of selection pressure to exert on the population throughout the course of an evolutionary algorithm is a major consideration. In most applications, however, the designer does not have sufficient a priori information to gauge the effect of a given selection scheme on the success of the evolutionary algorithm and so must usually opt for mechanisms that have proved successful in the past or have been recommended in the literature.

2.3.1 Fitness proportionate selection

Fitness proportionate selection (often known as roulette wheel selection) is probably the simplest selection mechanism to implement and is the method that was originally chosen for use with the earliest genetic algorithms by Holland (1975).
It may be stated simply as: the selection probability p(J_k) of the kth individual J_k in the current population P(t) = {J_1, J_2, …, J_N} at generation t is directly proportional to the fitness value f(J_k) of the individual.

¹ Additionally, there are a number of formal measures of selection pressure (or ‘selection intensity’). Blickle and Thiele (1995) define it as the difference between the population average fitness before and after selection, normalised by the mean variance of the pre-selection population fitness. They use this selection intensity measure as a means of quantitatively comparing different selection schemes.
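Fitness proportionate selection, as just stated, can be sketched in a few lines of Python (an illustrative sketch only; the function names are mine, not from the thesis):

```python
import random

def roulette_select(population, fitnesses):
    """Pick one individual with probability proportional to its fitness.

    Assumes every fitness value is positive and that larger values mean
    better performance; the running cumulative sum plays the role of the
    normalising constant in Equation 2.1.
    """
    spin = random.uniform(0.0, sum(fitnesses))
    cumulative = 0.0
    for individual, fit in zip(population, fitnesses):
        cumulative += fit
        if spin <= cumulative:
            return individual
    return population[-1]  # guard against floating point round-off

def adjusted_fitness(f):
    """Scaling of Equation 2.2 for minimisation problems: f' = 1 / (1 + f)."""
    return 1.0 / (1.0 + f)
```

For minimisation problems (e.g. prediction error), the raw errors would first be passed through `adjusted_fitness` so that smaller errors yield larger selection probabilities.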
The constant of proportionality is the inverse of the sum of the fitness values of the individuals in the current population; it serves to normalise the sum of the individual probabilities to one (Equation 2.1).

    p(J_k) = f(J_k) / Σ_{j=1}^{N} f(J_j)        (2.1)

Here it is assumed that all N fitness values are greater than zero, and that larger fitness values correspond to better individual performance. If smaller fitness values correspond to better individual performance (e.g. when minimising prediction errors in data modelling) then the following scaling is often used (e.g. McKay et al., 1996).

    f′(J_k) = 1 / (1 + f(J_k))        (2.2)

This adjusted fitness value can then be used in place of f(J_k) in Equation 2.1. Blickle and Thiele (1995) point out that there are properties of fitness proportionate selection that make it undesirable for general use as a selection mechanism in evolutionary algorithm applications. The main problem is that it is not translation invariant with respect to the raw fitness values. This means that as the evolutionary algorithm progresses it is difficult to ascertain, with any certainty, the level of selection pressure imposed on the population. There are also problems associated with the use of the inversion function of Equation 2.2 when the goal of the search is lower fitness values. In particular, when the f(J_k) values are small (< 0.1), Equation 2.2 has the effect of compacting the f′(J_k) values into the interval [0.9, 1]. The problem is then that the selection probabilities tend to equalise as the algorithm progresses and the raw fitness values become smaller, reducing the driving force behind the algorithm so that it is very
difficult to exploit the better individuals preferentially to the poorer ones. Appropriate pre-scaling of the raw fitness values can, in principle, be used to remove this problem but, in general, the use of fitness proportionate selection is fraught with difficulties and is best avoided.

2.3.2 Ranking selection

The problems associated with the use of fitness proportionate selection can be overcome by the use of ranking selection mechanisms (Grefenstette and Baker, 1989). Once the N raw fitness values have been calculated for each individual in the population, they are sorted so that the best individual has the rank N and the worst the rank 1. The rank values can then be used in place of the raw fitness values in Equation 2.1. This has the effect of imposing a selection pressure over the population that varies in a linear manner and is independent of the absolute values of the fitness measurements. One problem that can occur with this method is that multiple individuals with the same raw fitness value are ranked differently. The rank assigned to these individuals is then an artefact of the sorting algorithm used. This could seriously bias the selection procedure in cases where there are a relatively large number of individuals in the population with equal fitnesses. The ranking method can, however, be modified so that individuals with equal fitness values are given the same rank. This is accomplished by performing the normal linear ranking procedure and then, for each group of individuals that exhibit equal fitnesses, assigning the mean rank of that group to each of the individuals within it. For example, in a population of 10 individuals with unique raw fitness values, the best individual would be assigned rank 10 and the worst, rank 1.
If, however, the individuals with ranks 8, 7 and 6 actually had equal raw fitness, then these individuals would each be assigned the modified rank of (8 + 7 + 6)/3 = 7. This form of modified ranking is the selection mechanism adopted for the work described in this thesis.
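The modified ranking scheme can be sketched as follows (a minimal illustration assuming larger fitness values are better; the function name is mine):

```python
def tied_ranks(fitnesses):
    """Rank fitnesses from 1 (worst) to N (best); tied values share the
    mean rank of their group, e.g. would-be ranks 8, 7 and 6 all become 7.
    """
    n = len(fitnesses)
    order = sorted(range(n), key=lambda i: fitnesses[i])  # worst first
    ranks = [0.0] * n
    i = 0
    while i < n:
        j = i
        # Find the run of individuals sharing this fitness value.
        while j < n and fitnesses[order[j]] == fitnesses[order[i]]:
            j += 1
        mean_rank = (i + 1 + j) / 2.0  # mean of ranks i+1 .. j
        for k in range(i, j):
            ranks[order[k]] = mean_rank
        i = j
    return ranks
```

The resulting ranks can then be used in place of the raw fitness values when computing selection probabilities.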
2.3.3 Tournament selection

An alternative selection method that has proved popular is tournament selection. It is similar to ranking selection in that it also overcomes the problems associated with fitness proportionate selection by decoupling the distribution of selection probabilities from the absolute distribution of raw fitness values. The method is analogous to that sometimes observed in nature where individuals directly compete for the right to mate. Rather than directly sorting all of the individuals in the population according to fitness, a tournament group of size N_t is formed by randomly selecting individuals from the population. The tournament group is then ranked according to fitness and the best individual in the group is selected. Tournament selection can be regarded as a probabilistic version of ranking selection (Koza, 1992) and in the case of N_t = 2 the two techniques are mathematically equivalent (Blickle and Thiele, 1995). Larger tournament sizes increase the selection pressure on the best individuals in the population: in the degenerate case of N_t = N the best individual in the population is always selected, leading to a massive loss of diversity in the next generation of the evolutionary algorithm. In the EC literature, tournament sizes of 4-6 are commonly reported. In some applications of evolutionary algorithms, e.g. algorithms involving a high degree of parallelisation over a number of processing nodes, tournament selection is preferred over ranking selection because no centralised sorting procedure is required.

2.4 Notes on evolutionary computational methodology

Bäck et al.
(1997), in a recent review of the status and history of EC, contend that it can often be useful to view EC as a general framework of related concepts that can be tailored to the user’s application, rather than a pre-defined collection of algorithms that can be bolted on to a domain specific problem without consideration of the issues involved. This is worth bearing in mind when attempting to describe any particular subgroup of evolutionary algorithms: ultimately the form of the algorithm and the problem representation that is used are not uniquely defined by the application. An adaptive, incremental approach to the design is required as well as a
willingness to utilise heuristic and qualitative arguments in the construction of the solution. Barto (1990) states that whilst traditional engineering methods tend to deal with quantities and concepts that are of low dimensionality and natural to the engineer, connectionist methods (e.g. artificial neural networks) tend to employ “expansive” representations, in which the representation of the problem is apparently of a higher dimension than the problem requires. This property is also shared by a number of representations common in EC, such as genetic programming. The expansive representation is an underdetermined one and therefore researchers have a large amount of freedom in implementing a design. There is, however, the accompanying burden that there are no clearly defined procedures for an EC design, and the formulation used does not necessarily have the regular mathematical properties, such as linearity, determinism, stability and convergence, that traditional engineering methods display.

2.5 Genetic algorithms

Genetic algorithms are, perhaps, the best known type of evolutionary algorithm. They have gained a reputation for being both robust and relatively easy to implement (Goldberg, 1989). This is borne out by the degree of use that genetic algorithms have recently seen in a number of diverse research areas: e.g., genetic algorithms have been used to optimise the design of plastic extruder dies where gradient based techniques were found to be too inefficient (Chung and Hwang, 1997). Moros et al. (1996) used genetic algorithms to generate initial parameter estimates for kinetic models of a methane dehydrodimerisation process. They found that this reduced overall computing time and increased the reliability of the model parameter solutions. Genetic algorithms have also been applied to a number of medical imaging problems with a good deal of success; e.g. Handels et al.
(1999) report on different methods to recognise malignant melanomas automatically by extracting features from skin surface profiles. The genetic algorithm method performed best with a 97.7% successful classification performance on unseen skin profiles.
In view of the fact that the focus of this thesis, genetic programming, is seen by many as an extension of the basic genetic algorithm, fundamentally employing the same mechanisms but with greater representational flexibility, the following sections summarise the basic concepts of genetic algorithms and the theories behind their efficacy.

2.5.1 Background

Although there had been interest in the modelling and simulation of population genetics around the same time as the general field of evolutionary computation was founded, it was not until John Holland published the landmark text “Adaptation in Natural and Artificial Systems” (Holland, 1975) that the advantages of using genetics as a general model for adaptation in non-biological systems became apparent to a wider audience. In the most widely used form of the genetic algorithm, the standard binary crossover genetic algorithm (SGA), each individual within the population consists of a string of binary digits. This bit string is a discrete combinatorial representation of a solution to the problem being examined, meaning that the entire search space can be represented by the (finite) available combinations of the bits. In the simplest case, the bit string is usually a direct binary encoding of a real valued parameter. However, other more mechanistic representations are possible, wherein the order of the bits represents the nature of the interactions in some entity with modular characteristics, e.g. in genetic algorithm based classifier systems. The following sections introduce the basic mechanisms of genetic algorithms.

2.5.2 Reproduction operators in genetic algorithms

For evolution to occur there must be cumulative selection over a number of generations, coupled with the property that small variations in the genotype sometimes produce improvements in the individual. The selection mechanisms (fitness proportionate selection, ranking etc.)
are largely independent of the representation of the individual, but the reproduction operators used must be
designed appropriately. The reproduction operators most often used in binary bit-string genetic algorithms (direct reproduction, point mutation and single point crossover) are inspired by the recombinative processes that enable adaptation in the natural world. A number of alternative reproduction operators have been proposed for use with binary genetic algorithms, e.g. multi-point crossover (De Jong, 1975), but they are generally simple adaptations or hybrids of the basic single point crossover and mutation methods and, as a rule, have not been adopted by the bulk of GA practitioners. In constructing a new population, the reproduction operator to be used is picked based on the probabilities P_c (probability of crossover), P_m (probability of mutation) and P_r (probability of direct reproduction), where P_c + P_m + P_r = 1. These are algorithm control parameters and must be set by the user. (The rate of crossover tends to dominate the recombination process in most applications, with direct reproduction and mutation being used as “background” operators.) The selection mechanism is then used to select an individual (or two individuals in the case of crossover) and the appropriate reproduction operation is performed. The parent(s) are left in the current population and are available for reselection. The offspring are inserted into the new population.

2.5.2.1 Single point crossover

The single point crossover operator is analogous to the exchange of genetic information (stored on chromosomes) that occurs during sexual reproduction in nature. A new individual is created by recombining two complementary fragments of the parent bit strings, thereby testing new individuals that retain characteristics of both parents. Because the standard genetic algorithm operates over fixed length linear vectors, the fragment sizes must be constrained so that their combination results in an individual of the same length.
This is accomplished by randomly picking a crossover point, and applying it to both parents to create two new offspring. Figure 2.1 depicts this process.
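The crossover operation just described can be sketched as follows (a minimal illustration over Python strings of '0'/'1' characters; the function name is mine):

```python
import random

def single_point_crossover(parent1, parent2):
    """Recombine two equal-length bit strings at one random point.

    Each offspring keeps the head of one parent and the tail of the
    other, so both offspring have the same length as the parents.
    """
    assert len(parent1) == len(parent2) >= 2
    point = random.randint(1, len(parent1) - 1)  # never at the string ends
    offspring1 = parent1[:point] + parent2[point:]
    offspring2 = parent2[:point] + parent1[point:]
    return offspring1, offspring2
```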
Figure 2.1 Single point crossover in standard binary genetic algorithms (bit-string diagram not reproduced: a crossover point is chosen at random and the head and tail segments of the two parents are exchanged to form two offspring)

2.5.2.2 Mutation

The point mutation operator is analogous to the random biological mutations that infrequently occur on DNA molecules. It is typically applied with much lower frequency than the crossover operator. The mutation operator is applied to a single parent by randomly selecting a bit and then flipping it (see Figure 2.2).

Figure 2.2 Single point mutation in standard binary genetic algorithms (bit-string diagram not reproduced: a single randomly selected bit in the parent string is inverted)

2.5.2.3 Direct reproduction

Direct reproduction is carried out simply by copying the selected individual into the next generation with no modification of its bit structure. The purpose of this operator is to promote the propagation of successful individuals through to future generations
in such a way that they are immune to the (possibly) harmful effects of mutation and crossover events. An “elitist” selection scheme can also be employed to protect the best individuals in the current population. The difference is that direct reproduction is applied probabilistically, so whilst it is extremely likely that the best individuals of the population will be carried over to the next population, there is no guarantee; elitist selection acts as a safeguard. The simplest way to implement it is to copy the top, say, five per cent of the current population into the new population before embarking on the ordinary probabilistic selection/reproduction mechanisms. The 5% elitist method is used in all runs described in this thesis.

2.5.3 Genetic algorithm flowsheet

The overall operation of a standard genetic algorithm can be represented by the flowsheet in Figure 2.3. The flowsheet shows that the genetic algorithm is essentially very simple to operate, consisting of straightforward selection and bit string manipulation mechanisms. The user must supply the various initialisation parameters (the first block in the flowsheet), e.g. the population size, the encoding scheme, the termination criterion, the reproduction operator frequencies etc. The best way to determine these factors is by referring to existing literature describing a related problem and using the reported values as default settings. Subsequent experimentation with these settings should eventually yield satisfactory results, although it is usually impractical to determine the optimal settings for non-trivial problems. The user must also supply a set of functions that decodes a candidate individual, evaluates it and then returns a numerical measure of its quality. Note that the process shown in Figure 2.3 is a simplified version of the genetic algorithm.
It does not include provision for selection method variants such as elitist selection. Other details of the standard algorithm have also been omitted for the sake of clarity.
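The 5% elitist carry-over described in Section 2.5.2.3 can be sketched as follows (an illustrative helper, not code from the thesis; larger fitness is assumed better):

```python
def elitist_survivors(population, fitnesses, fraction=0.05):
    """Return the top `fraction` of the population, to be copied directly
    into the next generation before probabilistic selection begins.

    At least one individual is always kept.
    """
    n_elite = max(1, int(len(population) * fraction))
    ranked = sorted(range(len(population)),
                    key=lambda i: fitnesses[i], reverse=True)
    return [population[i] for i in ranked[:n_elite]]
```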
Figure 2.3 Flowsheet of a standard genetic algorithm (diagram not reproduced: after initialisation of the control parameters, N individuals are randomly generated and evaluated against the fitness function f; until the termination condition is satisfied, a new population of N individuals is built by repeatedly choosing crossover, mutation or direct reproduction according to the probabilities P_c, P_m and P_r, probabilistically selecting one or two parents as appropriate, and adding the offspring to the new population)
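The generation-building loop of Figure 2.3 can be sketched end to end; this is a minimal illustration over bit strings (the default operator probabilities are placeholders, not values from the thesis, and elitist copying is omitted, as in the figure):

```python
import random

def next_generation(population, fitness_fn, select, pc=0.85, pm=0.05):
    """Build one new generation: crossover with probability pc, point
    mutation with probability pm, and direct reproduction otherwise.

    `select` is any selection mechanism taking (population, fitnesses)
    and returning one individual.
    """
    fitnesses = [fitness_fn(ind) for ind in population]
    new_pop = []
    while len(new_pop) < len(population):
        r = random.random()
        if r < pc:  # crossover: two parents give two offspring
            p1 = select(population, fitnesses)
            p2 = select(population, fitnesses)
            point = random.randint(1, len(p1) - 1)
            new_pop += [p1[:point] + p2[point:], p2[:point] + p1[point:]]
        elif r < pc + pm:  # point mutation: flip one randomly chosen bit
            p = select(population, fitnesses)
            i = random.randrange(len(p))
            new_pop.append(p[:i] + ("1" if p[i] == "0" else "0") + p[i + 1:])
        else:  # direct reproduction: copy the selected individual
            new_pop.append(select(population, fitnesses))
    return new_pop[:len(population)]  # trim if crossover overshot by one
```

Repeatedly applying `next_generation` with any of the selection mechanisms of Section 2.3 and a problem-specific fitness function gives the outer loop of the flowsheet.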
2.5.4 Genetic algorithms as function optimisers

Figure 2.4 shows an example of a partitioned binary bit string encoding of two real valued parameters that could be used in a function optimisation scenario. Typically this would be stated as “find the values of the parameters α and β that minimise (or maximise) some objective function f(α, β) subject to certain constraints.” In Figure 2.4, J_k is an individual in a population P(t) = {J_1, J_2, …, J_N} of N individuals at generation t. It is useful to clarify some of the terminology associated with genetic algorithms. The bit string in the above example is referred to as a chromosome and the contiguous bit sections corresponding to each parameter are referred to as genes. Each gene can take on a number of values, called alleles. The entire string, the genotype, can be regarded as the prescriptive structure responsible for the expressed parameter set (Goldberg, 1989). The genotype need not always be completely defined by the contents of one chromosome; multiple chromosomes can be used to encode the information in a modular form, allowing restrictions on the interchange of genetic information to be imposed during recombination. Genetic algorithms are discrete combinatorial processors, but many parameter optimisation problems are based on continuous real valued parameters. Certain trade-offs between precision and the size of the coding used must therefore be made. The number of bits chosen to represent each parameter depends on the range of admissible parameter values and the degree of precision required. Hence, prior knowledge of the range in which the optimal values fall (and the desired precision) is necessary when designing a binary bit string representation.
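The range/precision trade-off just described can be made concrete with a small decoding sketch (illustrative names; not code from the thesis):

```python
def decode_gene(bits, lo, hi):
    """Map an n-bit gene (a string of '0'/'1') to a real value in [lo, hi].

    With n bits the finest attainable resolution is (hi - lo) / (2**n - 1),
    so higher precision over a wider range costs string length.
    """
    n = len(bits)
    return lo + int(bits, 2) * (hi - lo) / (2 ** n - 1)
```

For instance, the 14-bit string of Figure 2.4 could hold two 7-bit genes, each decoded this way to one of 128 equally spaced values in its parameter range.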
If the function optimisation requires a high degree of precision over large parameter ranges then the length of the bit string becomes commensurately large, and the size of the space that the genetic algorithm has to search increases at an exponential rate. In the example in Figure 2.4 there are 14 bits in total, so there are 2^14 (16,384) distinct combinations of the bits. For such an example, the number of points in the search space is not huge and it could be successfully searched, providing the fitness function is not complex, using non-genetic methods in a reasonable amount of time, e.g. an exhaustive search. However, binary bit string lengths of 200 are quite
common in engineering applications (e.g. the optimisation of 20 real valued parameters simultaneously, each represented to 10 bit precision). The search space in this case is astronomically large, consisting of 2^200 (approximately 10^60) possible combinations of bits. An exhaustive search would be infeasible in this case: if one million points could be searched per second then it would still take around thirty billion, trillion, trillion, trillion years to complete. The fact that genetic algorithms can successfully search spaces of this size in far shorter times emphasises that genetic algorithms, although having a number of random elements in their operation, are not random walks through the search space.

Figure 2.4 Example of bit string in parameter encoding and evaluation (diagram not reproduced: a 14-bit string J_k is partitioned into two genes which a decoding function maps to the parameters α and β; the objective function evaluation then yields the fitness value f(J_k))

2.5.5 Underlying processes: the schema theorem

How do the partially randomised mechanistic operations involved in genetic algorithm processing enable a good quality solution, in the form of a particular sequence of bits, to be obtained from the enormous number of sequences available in a typical run?
The schema theorem (Holland, 1975) is one explanation, although often criticised, of how genetic algorithms process information and gradually progress towards a near-optimal solution. The basis of this theorem is that the genetic algorithm implicitly processes large numbers of candidate solutions in parallel by means of similarity templates (so called schemata). Schemata are notational devices that allow structural similarities in groups of solutions to be quantified, and it is thought that the genetic algorithm implicitly employs this structural similarity information in approaching a high quality solution structure. A corollary of the schema theorem is the building block hypothesis. This arises as a direct result of the schema theorem and implies that, for a GA to work efficiently, short bit string sections representing relatively successful partial solutions (i.e. the “building blocks”) must be combined in order to realise a global solution. Intuitively, and from a human perspective, this makes sense: often solutions to problems are found by applying successful solutions to related problems, or by breaking the problem down into smaller, more manageable problems and combining the partial solutions obtained. Although the schema theorem and the building block hypothesis are useful as visualisation tools in genetic algorithms, there are certain inconsistencies in the underlying assumptions, and many of the criticisms that have been directed at the schema theorem suggest that there may be processes at work in genetic algorithms that have yet to be adequately explained. Thornton (1997) summarises a number of the problems with the schema theorem and the building block hypothesis.

2.5.6 Towards flexible representation in genetic algorithms

It has long been recognised that the greatest shortcoming of the classical genetic algorithm is its lack of representational flexibility.
The straightforward coding scheme is sufficient for parameter optimisation problems, but for the more complex tasks of generalised machine learning it is a severe restriction. In these cases, a solution that adapts its own structure by progressively improving on previous structures would be highly desirable. One attempt to broach this representation problem for learning systems was the
learning classifier system (Holland and Reitman, 1978), based on the use of IF-THEN production rules coded as fixed length binary strings. Another example is the use of variable length strings, the so-called “messy genetic algorithm” introduced by Goldberg et al. (1989). Whilst these methods were not unsuccessful, it was still felt that the utility of genetic algorithms should be combined with higher order variable length structures, capable of allowing more complex interactions and amenable to the learning of general tasks.

2.6 Genetic programming

A number of researchers in the 1980s pursued the application of genetic algorithms to more complex structures: e.g., Cramer (1985) used a language consisting of loops and increments on variables to evolve solutions to a simple symbolic regression problem. The representation he used consisted of integer strings that could be decoded to form structured programs. Hicklin (1986) and Fujiki and Dickinson (1987) investigated the use of genetic reproduction operators in generating programs in a language called LISP (LISt Processing language). LISP is appropriate for the application of recombinative methods because groups of instructions and data are represented in a syntactically identical way, allowing parts of programs to be spliced into other programs in a manner resembling the bit string splicing in binary genetic algorithms. Most importantly, this is accomplished whilst still maintaining legal program syntax. Genetic programming was the logical progression from the work carried out on the application of genetic algorithms to higher order data structures. John Koza published a series of papers in the early 1990s, e.g. Koza (1990, 1991), that culminated in his extensively referenced text: “Genetic programming: on the programming of computers by means of natural selection” (Koza, 1992).
In it, Koza describes a wide array of problems, from various fields, that he uses genetic programming to solve: e.g. symbolic regression (evolving a model that best fits a set of input-output data), robotic planning (i.e. the “artificial ant” problem: the solution lies in evolving a program that guides an entity around a grid picking up all the items of ‘food’ in as few manoeuvres as possible), controller design (deriving a
computer program that brings a vehicle to rest in minimal time using an “on/off” control signal). Due to the varied applications described, and the relative ease with which the genetic programming algorithm can be implemented, Koza’s work was much more accessible to the research community than the existing approaches to machine learning. These had tended to rely heavily on formal inference, abstract symbol processing and impenetrable mathematical theorems and, hence, seemed far removed from being able to solve the sorts of problems that people wanted them to solve. Genetic programming, on the other hand, although initially only applied to trivial problems, gave impetus to the idea that artificial intelligence could be engineered from the ground up and set to work on scientific problems. Most of the work in genetic programming, both theory and application based, stems from the algorithms described in Koza’s book.

2.6.1 Program induction: parse trees as adaptable data structures

One of Koza’s main insights is the “pervasiveness of the problem of program induction”, i.e. that a very large number of problems can be solved with the use of a computer program of some description as an answer. Obtaining a suitable program for a given problem is what most scientists, engineers, economists etc. spend a great deal of their professional lives trying to accomplish. Generating and subsequently adapting program code, given a measure of program fitness (however implicitly defined), is what humans do to solve technical problems. The idea of using genetic methods to perturb, fragment and splice programs together to generate better programs is an appealing one.
What is less appealing is the perceived fragility of program code: most people know from experience that chopping and changing code in an ad hoc manner is unlikely to result in a program that actually executes without errors, much less give anything approaching the correct answer. However, the source code that one types in and the internal representation of code within a computer are vastly different. Most programs are internally represented as a parse tree: a data structure that represents a hierarchical
sequence of instructions in the form of an ordered tree. This representation of a program as an ordered tree structure strips away most of the clutter associated with the majority of computer languages. That which remains is the functional backbone of the program. Hence, the problem of cutting and splicing programs is vastly simplified. Given a few necessary assumptions and constraints (these will be described in the coming sections) the tree structure can be modified in an ad hoc manner yet still maintain internal syntactic consistency. Of course, it is highly unlikely that any one perturbation will result in a better program but genetic programming, like all evolutionary techniques, uses the cumulative effect of artificial selection to amplify the effects of the few modifications that do give slightly better results.

As an example of a program as a tree structure, consider the following simple piece of pseudo-code (a callable function named prog1 that accepts two real valued arguments a and b and returns a real argument c, the value of which depends on whether a or b is greater):

function [c] = prog1(a, b)
    if a <= b then
        c = a + b
    else
        c = a - b
    end

The same function can be represented as a rooted, ordered tree structure as depicted in Figure 2.5. The tree consists of two types of node: terminals and functionals. Terminal nodes are the “leaves” of the tree structure and typically represent items of program data (program inputs or constants). Functional nodes are the branch points within the tree; they are operators that are used to process terminal node values (and results from branches further down the tree). In the case of Figure 2.5 the terminals are the inputs to the program: the arguments a and b. The functional nodes are the addition operator, +, the subtraction operator, -, and the ‘IF THEN ELSE’ conditional operator designated by the tag IFLTE. The tree processes information as
follows: the data represented by the lowest (leaf) nodes are passed up the tree to the node immediately above them. At this point, they are operated on by functional nodes, e.g. the addition operator. Then the results of these calculations are passed up to the next node and so forth until the root node is reached. The final calculation ends here and this is usually designated as the overall program output. The ordering of the branches generally makes a difference to the structure of the program because of the way that some functional nodes are specified. E.g., the IFLTE node always has four input arguments, which are processed in the following way:

if (argument 1) ≤ (argument 2) then
    return (argument 3) as node output
else
    return (argument 4) as node output

All function nodes used in GP must be explicitly defined in this manner.

Figure 2.5 Tree structure of function prog1

Although a parse tree diagram gives a clear view of the processing hierarchy of a program, it is not amenable to direct computer manipulation. A more convenient notation for the trees used in genetic programming is that of prefix notation
(sometimes called Polish notation). In this form of notation, which is directly equivalent to a parse tree representation, functionals are represented by a symbol followed by the arguments in parentheses. E.g. the familiar algebraic expression a + b would be written as +(a b) in prefix notation. Note that the functional arguments can also be functions themselves: e.g. the expression a - (b + c) would become –(a +(b c)) in prefix notation. The pseudo-code function prog1 illustrated in Figure 2.5 can be written as: IFLTE(a b +(a b) –(a b)). (The computer language LISP, originally chosen for genetic programming, uses a variant of prefix notation, but virtually any high level language can be used if an appropriate interpreter is available. All of the GP runs in this thesis were performed using the MATLAB programming language to operate on ASCII coded prefix expressions.) It can be seen why tree structures are amenable to the problem of automatic program induction: sub-trees can be swapped from place to place, and nodes can be deleted and replaced with other nodes (or sub-trees), because the syntax that renders a program executable is inherent in the tree representation. Details of the genetic operators used, and some of the other details needed to set up a genetic programming experiment, are given in the following sections.

2.6.2 Reproduction operators in genetic programming

Three principal genetic reproduction operators are defined by Koza (1992) for genetic programming: direct reproduction, mutation and crossover (although Koza did not originally advocate the use of mutation, see Section 2.6.2.2). The concepts behind them are very similar to the operators used for binary bit string genetic algorithms.

2.6.2.1 GP crossover

Analogous to the method used in binary bit string genetic algorithms, GP crossover exchanges information between two chromosomes.
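To make the tree and prefix representations concrete, the following sketch stores prog1 as nested tuples and evaluates it recursively. It is written in Python rather than the MATLAB used for the runs in this thesis, and the nested-tuple encoding and node labels are illustrative assumptions, not the thesis’s actual implementation.

```python
def evaluate(node, env):
    """Recursively evaluate a prefix-form parse tree.

    Terminals are variable names (looked up in env) or numeric constants;
    functional nodes are tuples: (operator, argument, argument, ...)."""
    if isinstance(node, str):           # terminal: an input variable
        return env[node]
    if isinstance(node, (int, float)):  # terminal: a constant
        return node
    op, *args = node                    # functional node
    if op == "IFLTE":                   # if arg1 <= arg2 return arg3 else arg4
        a1, a2, a3, a4 = args
        return evaluate(a3, env) if evaluate(a1, env) <= evaluate(a2, env) \
            else evaluate(a4, env)
    if op == "+":
        return evaluate(args[0], env) + evaluate(args[1], env)
    if op == "-":
        return evaluate(args[0], env) - evaluate(args[1], env)
    raise ValueError(f"unknown functional node {op!r}")

# prog1 from the text, in prefix form: IFLTE(a b +(a b) -(a b))
prog1 = ("IFLTE", "a", "b", ("+", "a", "b"), ("-", "a", "b"))

print(evaluate(prog1, {"a": 2, "b": 5}))   # a <= b, so a + b = 7
print(evaluate(prog1, {"a": 5, "b": 2}))   # a > b,  so a - b = 3
```

Evaluation proceeds exactly as described above: leaf values flow upwards through the functional nodes until the root (here, the IFLTE node) yields the program output.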
Unlike binary genetic algorithms, there is no theoretical restriction on the sizes of the sections of the
chromosome being exchanged (considerations such as computer memory and available processing time, however, mean that the practical implementation of crossover will have an upper limit on the new tree sizes). As an example, the following two simple programs will be shown undergoing GP crossover. In the context of a GP run it is assumed that these two programs are population members that have been chosen by means of an appropriate selection mechanism.

Parent 1: Output = 3 + (a – b)
Parent 2: Output = b√(1 + c)

In this example the terminals a, b and c are input variables. The other nodes used are the terminal constants 1 and 3 and the addition, subtraction and square root functions. Figure 2.6 illustrates Parent 1 and Parent 2 undergoing the crossover process. The subtrees selected (randomly) for crossover are shaded.

Figure 2.6 Example of crossover in genetic programming.
The results of this process are the two programs shown below:

Offspring 1: Output = (1 + c) + (a – b)
Offspring 2: Output = b√3

In an actual GP run, these two offspring would be inserted into the new population and subsequently evaluated to determine their fitness. The crossover mechanism is easily implemented by computer using prefix notation, in which the crossover subtrees are highlighted in boldface:

Parent 1 (prefix): +(3 –(a b))
Parent 2 (prefix): *(b SQRT(+(1 c)))
Offspring 1 (prefix): +(+(1 c) –(a b))
Offspring 2 (prefix): *(b SQRT(3))

The process illustrated in Figure 2.6 is the most commonly implemented crossover operator in the GP literature. It has, however, been strongly criticised because, although superficially similar, it operates in a fundamentally different manner to GA crossover and, indeed, biological gene crossover. For instance, Francone et al. (1999) state that biological crossover (and GA crossover) typically exchanges genes that are at the same position on the chromosome and that these genes have functional similarity. This is not the case for exchanged subgroups (genes) in GP crossover. For this reason it has been claimed that GP crossover is, in fact, no more than a “macro-mutation” operator (Angeline, 1997). Koza, however, has responded to those that have claimed that crossover is unnecessary (Koza, 1999). He presents several experiments clearly illustrating that GP performs poorly without crossover. Other statistical studies support this view, e.g. Luke and Spector (1998) and Hiden
(1998) show that GP runs generally benefit from the use of the crossover operator.2

2 It should be noted that, for any search algorithm, there are no optimal neighbourhood search operators for all problems. This is due to the implications of the No Free Lunch (NFL) theorem (Wolpert and Macready, 1997), which states that, averaged over all possible search problems, no search algorithm is better than any other.

2.6.2.2 GP Mutation

In a manner analogous to its GA counterpart, the purpose of the GP mutation operator is to improve population diversity by generating entirely new chromosome segments, in order to explore new regions of the search space. Like the GA mutation operator, it is a form of asexual reproduction based on only one parent. As an example, consider that the following program has been selected from the existing GP population based on its fitness:

Parent: Output = (1 + c) + (a – b)

Figure 2.7 demonstrates a mutation operation on this program. First, a mutation node is randomly selected, and then the corresponding subtree (with the mutation node as its root) is deleted. Finally, a new subtree is randomly generated (in a manner similar to that employed when generating the initial GP population) and inserted in the place of the deleted subtree. For the example given, the offspring program resulting from the mutation operation is shown below:

New subtree: Output = log10(1.2/c)
Offspring: Output = (c + log10(1.2/c)) + (a – b)

Again, the computer implementation of this operation is based on prefix notation as shown below; the deleted and new substructures are once again highlighted in boldface:
Parent (prefix): +(+(c 1) –(a b))
New subtree (prefix): log10(÷(1.2 c))
Offspring (prefix): +(+(c log10(÷(1.2 c))) –(a b))

Note that, unlike crossover, the mutation operator introduces a subtree that contains structures not necessarily present in the parent. In this case, the functional nodes log10 and ÷ (divide) as well as the terminal constant 1.2 have been used. The definition of terminal and functional sets, and how they are used in tree generation, will be discussed further in Section 2.6.4.

Figure 2.7 Example of subtree mutation in genetic programming.
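The subtree crossover and subtree mutation operators described above can be sketched as follows. This is Python for illustration, not the thesis’s MATLAB implementation: trees are assumed to be nested lists with the functional name first, and the function and terminal sets are illustrative only.

```python
import random

FUNCS = {"+": 2, "-": 2, "*": 2, "SQRT": 1}   # name -> number of arguments
TERMS = ["a", "b", "c", 1, 3]                  # illustrative terminal set

def paths(tree, prefix=()):
    """Enumerate the path to every node; a path indexes into nested lists."""
    yield prefix
    if isinstance(tree, list):
        for i, child in enumerate(tree[1:], start=1):
            yield from paths(child, prefix + (i,))

def subtree_at(tree, path):
    for i in path:
        tree = tree[i]
    return tree

def splice(tree, path, new):
    """Return a copy of `tree` with the subtree at `path` replaced by `new`."""
    if not path:
        return new
    out = list(tree)
    out[path[0]] = splice(tree[path[0]], path[1:], new)
    return out

def crossover(p1, p2, rng):
    """Exchange randomly chosen subtrees between two selected parents."""
    c1 = rng.choice(list(paths(p1)))
    c2 = rng.choice(list(paths(p2)))
    return splice(p1, c1, subtree_at(p2, c2)), splice(p2, c2, subtree_at(p1, c1))

def random_tree(rng, depth=2):
    """Grow a random subtree, terminating with terminals at depth 0."""
    if depth == 0 or rng.random() < 0.3:
        return rng.choice(TERMS)
    op = rng.choice(sorted(FUNCS))
    return [op] + [random_tree(rng, depth - 1) for _ in range(FUNCS[op])]

def mutate(tree, rng):
    """Delete a randomly chosen subtree and insert a freshly grown one."""
    return splice(tree, rng.choice(list(paths(tree))), random_tree(rng))

# The crossover example from the text: +(3 -(a b)) and *(b SQRT(+(1 c)))
parent1 = ["+", 3, ["-", "a", "b"]]
parent2 = ["*", "b", ["SQRT", ["+", 1, "c"]]]
offspring1, offspring2 = crossover(parent1, parent2, random.Random(0))
```

With the subtree choices shaded in Figure 2.6, splice(parent1, (1,), ["+", 1, "c"]) reproduces Offspring 1 exactly: ["+", ["+", 1, "c"], ["-", "a", "b"]], i.e. (1 + c) + (a – b). Because splice copies each list along the path, the parents survive unmodified, matching the usual GP convention of inserting new offspring into the next population.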
Originally, Koza maintained that crossover and direct reproduction are the only reproduction operators that are required to complete a successful GP design (Koza, 1992). He argued that a mutation operator is necessary in the case of GAs because crossover by itself only recombines bit string sections of the original population that are associated with high performance; hence mutation is needed to restore the loss of diversity that accompanies this process. Koza goes on to state that, for GP, this is not the case, as the crossover operator combines “genes” in a functionally more flexible way than GAs and so the equivalent loss of diversity should not occur. While this seems to be a valid theoretical consideration, the informal consensus among GP practitioners is that the use of mutation (at relatively low rates of about 10% or less) assists the evolution process. This is borne out by a number of statistical studies on a variety of simple GP applications (e.g. Luke and Spector, 1998; Hiden, 1998). Hence, all the GP runs described in this thesis use the standard sub-tree mutation operator.

2.6.3 Specifying a genetic program

Koza (1994) describes six steps necessary to define a genetic program. These are:

1) Terminal set selection: choosing the variables that are needed to solve the problem.
2) Functional set selection: choosing the functions that are needed to operate on the variables in order to solve the problem.
3) Fitness function specification: choosing appropriate tests of the evolved programs.
4) Run control parameter settings: choosing parameter values for the GP run, e.g. rates of mutation, crossover and direct reproduction.
5) Termination criterion specification: setting a condition to terminate the run.
6) Program architecture specification: deciding how the tree (or a number of trees in some cases) is decoded into an individual test program for evaluation, and what reproduction operators are used to alter the architecture of the tree(s).

Some of these steps are often trivial. Determining the termination criterion, for example, is usually a simple process, e.g. stop after a certain number of generations have elapsed or stop when a solution of a high enough quality is found. Specifying an appropriate fitness test suite and appropriate program components (i.e. terminal and functional nodes), however, can be a difficult task if the designer is to ensure that the evolved programs can solve the set problem in a desirable manner. The remainder of the chapter highlights, briefly, a number of issues pertaining to the use of GP in solving engineering and science problems. For a more general, and thorough, overview of GP, see Koza (1999).

2.6.4 Terminals and functions: specifying the program components

The specification of a “toolbox” of components, the terminals and functions, which can be subsequently manipulated by simulated genetic processes into a working program, is the first step that calls upon the human user’s knowledge of the problem. The user will have a good idea, at this stage, what they want the evolved programs to accomplish and will have some degree of knowledge as to what information must be manipulated to solve the problem. The terminal set and the functional set must exhibit the joint property of sufficiency (Koza, 1992), i.e. out of all the trees that can be constructed from them, there must be at least one that is capable of expressing the actual solution to the problem. In addition to the sufficiency requirement, it is necessary that the terminal and function sets in GP exhibit the property of closure (Koza, 1992).
This simply means that any tree generated from these sets must be syntactically valid. Closure is attained by ensuring that all functions and terminals return values of the same data type (e.g. Boolean). This is usually straightforward to achieve for many engineering
applications: the terminals and constants will generally be of the scalar floating point type, and the standard arithmetical functions will return a value of the same type as the input arguments. However, there are a few minor exceptions: the floating point division operator will not return a floating point value if both input arguments are zero3, the square root operator will return a complex value if its input argument is less than zero. Similarly, the natural logarithm operator will return a complex value if its input argument is less than zero, or will return a value of “infinity” or “undefined” (depending on the computing language being used) if its input is zero. These problems can be sidestepped by taking a few liberties with the definition of some mathematical functions in order to maintain data type consistency. For instance, the division operator can be redefined so that it returns a zero when both arguments are zero and the square root operator can be redefined to return the positive root of the absolute value of its input argument. In the GP literature, this is commonly referred to as “protecting” functions; the redefined division operator is referred to as protected division, the redefined natural logarithm as protected natural log and so forth.

2.6.5 Handling constants in genetic programming

Although it is often possible to specify what inputs will be required for an evolved program to solve a problem, it is not generally possible to know in advance what constants are required. Exceptions to this rule are Boolean problems, modular problems (e.g. clock arithmetic) and the like. Engineering and scientific problems, in general, require the use of non-integer real constant terms in their solutions, but how can one incorporate these in a terminal set without knowing their exact values beforehand?
The standard GP solution to this is known as the ephemeral random constant (ERC) method (Koza, 1992) and its use, although often augmented by other methods of determining constant values, is widespread.

3 Many computing languages will simply halt program execution and return a ‘division by zero’ error at this point. MATLAB returns the value ‘NaN’ (not a number); this effectively renders any parse tree meaningless, as all further operations on ‘NaN’ result in ‘NaN’.
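The function “protection” described in Section 2.6.4 can be sketched as below. Python is used for illustration and the names pdiv, psqrt and plog are not from the thesis. The text defines protected division only for the 0/0 case; here any zero denominator returns zero, an assumption made so that the sketch avoids the language-dependent division-by-zero behaviour noted in footnote 3. The value returned by the protected log at zero is likewise an assumed convention.

```python
import math

def pdiv(x, y):
    """Protected division: the text redefines 0/0 := 0; here (an assumption)
    any zero denominator returns 0, so the sketch never raises or yields NaN."""
    return 0.0 if y == 0 else x / y

def psqrt(x):
    """Protected square root: positive root of the absolute value."""
    return math.sqrt(abs(x))

def plog(x):
    """Protected natural log: log of |x|, with 0 at x = 0 (assumed
    convention, chosen to keep the output a real float)."""
    return 0.0 if x == 0 else math.log(abs(x))

print(pdiv(0.0, 0.0))   # 0.0 rather than NaN or an error
print(psqrt(-4.0))      # 2.0 rather than a complex number
```

Because every protected operator maps floats to floats, any tree built from them satisfies the closure property regardless of the argument values it encounters.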
The ERC method is actually very simple to implement: a special terminal R is added to the existing terminal set and used in an identical way to the other terminals when generating the initial population (i.e. generation 0) of random program trees. At this point, each occurrence of R in the population can be considered a placeholder for an, as yet, unknown constant. Then, before the tree is inserted into the initial population, each instance of R is replaced by a randomly generated real constant from a user-defined range, e.g. [-1, 1]. Once the initial population has been generated, the values of the constant nodes are fixed throughout the run; it is only before insertion into generation 0 that the placeholder R is used. However, many researchers who have investigated the use of GP for data modelling purposes have asserted that this method of handling constants is inadequate and not numerically efficient. Other methods of handling constants have been suggested and these are discussed more fully in the survey of GP for process modelling purposes in Chapter 3.

2.6.6 Multiple populations

Although the basic GP formulation uses a single population of individuals, it is possible to distribute the population by employing several sub-populations (called “demes”) that are evolved in isolation except for the periodic exchange (migration) of individuals from a deme to one or more other demes. This is done to prevent the premature convergence, due to lack of diversity, which can occur in the single population algorithm. This scheme also has the advantage that it is suited to GP performed over parallel processing units because the fitness evaluations and selection are performed separately on each processing unit and the only communications between these units are the periodic transfer of migrating individuals (Koza, 1995).
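The ERC placeholder scheme described in Section 2.6.5 can be sketched as follows. Python and a nested-list tree representation are assumptions for illustration; the [-1, 1] range is taken from the text’s example.

```python
import random

def instantiate_ercs(tree, rng, lo=-1.0, hi=1.0):
    """Replace each occurrence of the placeholder terminal 'R' with a randomly
    generated real constant from [lo, hi]; the constants are then fixed for
    the remainder of the run."""
    if tree == "R":
        return rng.uniform(lo, hi)
    if isinstance(tree, list):                        # functional node
        return [tree[0]] + [instantiate_ercs(c, rng, lo, hi) for c in tree[1:]]
    return tree                                       # ordinary terminal

# A generation-0 tree drawn with the placeholder R still in place
template = ["+", "R", ["*", "x", "R"]]
tree = instantiate_ercs(template, random.Random(42))
```

Each instance of R is replaced independently, so the two placeholders above generally receive different constants, exactly as when two ERC nodes appear in one generation-0 tree.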
2.7 Summary

This chapter has introduced the field of evolutionary computation in an engineering setting and has focussed primarily on the mechanisms involved in selection, and the reproduction operators used in genetic algorithms and genetic programming. A number of issues relating to implementation of genetic programming to solve engineering problems have also been addressed. A discussion of the use of genetic
programming as a tool for data based modelling is presented in Chapter 3.
Chapter 3

3 Genetic programming as a modelling tool