This essay deals with the field of machine learning, an important part of computer
science. The emphasis is put on three major sub-areas: decision trees, artificial neural
networks and evolutionary computation. For each approach, the theory behind the algorithm
is explained, along with the experience I gained when examining different
implementations of the algorithms.
As part of the course TDDB55 - "Medieinformatik, projekt 1" at the University of Linköping, I
chose to look into the field of machine learning. To be more precise, I chose the
assignment “Evaluate machine learning algorithms for user modeling”. The algorithms
I have evaluated are decision tree learning, artificial neural networks and
evolutionary computation. I also mention other approaches, such as Bayesian
networks and PAC learning. As my main sources of information I have used the books
Artificial Intelligence – A Modern Approach by Stuart Russell and Peter Norvig and Machine
Learning by Tom M. Mitchell, as well as various enlightening sites on the Internet.
What is machine learning? That was the first question I faced when I started looking into
the subject. It is a fairly young science, approximately as old as computer science itself. Ever
since the realization of the very first computers, people have dreamed of teaching their
machines to reason and act like humans and other forms of intelligent life. This is
where machine learning and other closely related fields, such as Artificial Intelligence (AI),
have their origin.
Machine learning is the technique of implementing, on computers, algorithms that learn. What
then is learning? Tom M. Mitchell gives this definition:
“A computer program is said to learn from experience E with respect
to some class of tasks T and performance measure P, if its
performance at tasks in T, as measured by P, improves with
experience E.”
The task may, for instance, be to recognize the faces of different people (see Artificial Neural
Networks below) and the experience would then be the set of pictures of people used to
train the system. Measuring the performance here means checking whether the computer can
correctly determine who is in the picture. In this case the performance would partly be
evaluated by humans and fed back to the computer, but there are many examples of tasks
where the performance can be measured by the computer itself, e.g. learning to play chess,
where it is easier to define rules for measuring the success of the algorithm.
A fundamental part of learning is searching. By searching through a hypothesis space for the
hypothesis that best fits the training examples, the algorithms can simulate a modest form of
learning.
Decision tree learning
Decision tree learning is one of the most popular learning methods and has been
successfully applied to tasks as different as learning to diagnose medical cases and
assessing the credit risk of loan applications. The method is best suited for problems that
have discrete output values and instances with a fixed set of attributes, preferably with
discrete and disjoint values, at least in the basic form of the decision tree algorithm. With
modifications to the basic algorithm, decision trees can be made to handle continuous, real-
valued attributes as well as to output real-valued results, although applications with these
features are less common. Other advantages of decision trees are that they are robust to
errors and that they still function with missing attribute values. If the set of training
examples contains incorrect instances, the algorithm is still able to draw correct
conclusions from the set, and if not all attributes are represented, it will still work as long as the
missing value is not required in the tree search. A major drawback of inductive learning
through decision trees is that the algorithms are limited to the patterns present in the
training examples and therefore know nothing about cases that have not yet been
covered by examples.
Figure 1 Occupation decision tree
One possible implementation of a decision tree for trying to establish a person’s occupation.
Basically, the algorithm assembles a tree graph. The tree is then used to make
decisions, hence the name decision tree. The leaf nodes of the tree represent the classifications. The
rest of the nodes are tests on the attribute values. An important aspect of the
construction of the tree is deciding the order of the tests (the internal structure of the nodes in
the tree). The most popular philosophy on this subject dates back to the 14th century and
William of Ockham, who preferred the simplest of the hypotheses consistent with the examples,
a principle also known as Ockham’s razor.
Ockham’s razor: Prefer the simplest hypothesis that fits all the data.
In other words, this means that we are interested in asking the “most informative” questions
first or, to be more scientific, asking the questions that give us the greatest information gain.
Claude Shannon, the father of information theory, defined a metric for this information,
called entropy, in the 1940s. For a two-class problem, the entropy of each question is a value
between 0 and 1, where 1 corresponds to the most information possibly gained.
Entropy(S) of a question on an attribute = Σi −pi log2 pi, summed over
the disjoint classifications i of the examples (such as e.g.
positive and negative), where pi is the proportion of examples in S
belonging to classification i.
This way, the question with the highest entropy among the remaining questions is chosen,
until all relevant questions have been asked (questions with entropy 0 are pointless to ask, since
they do not add any information to the tree). Since the entropy of each question changes as
new examples are added to the training domain, the tree would in principle have to be
reconstructed for every new example. This is obviously not very practical. A better approach
is to wait until a notable amount of new examples (e.g. 10%) has been added to the original
set and then rebuild the tree.
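Tying back to the occupation example in Figure 1, the entropy and information-gain calculations can be sketched in a few lines of Python. The attributes "suit" and "tie" and the toy training set below are my own invention, purely for illustration:

```python
import math

def entropy(labels):
    """Shannon entropy of a list of classification labels, in bits."""
    total = len(labels)
    result = 0.0
    for label in set(labels):
        p = labels.count(label) / total
        result -= p * math.log2(p)
    return result

def information_gain(examples, attribute, label_key="label"):
    """Reduction in entropy obtained by splitting `examples` on `attribute`.

    `examples` is a list of dicts; `label_key` names the classification field.
    """
    base = entropy([e[label_key] for e in examples])
    remainder = 0.0
    for value in set(e[attribute] for e in examples):
        subset = [e[label_key] for e in examples if e[attribute] == value]
        remainder += len(subset) / len(examples) * entropy(subset)
    return base - remainder

# Toy training set: guess a person's occupation from their clothing.
data = [
    {"suit": "yes", "tie": "yes", "label": "office"},
    {"suit": "yes", "tie": "no",  "label": "office"},
    {"suit": "no",  "tie": "no",  "label": "field"},
    {"suit": "no",  "tie": "yes", "label": "field"},
]
print(information_gain(data, "suit"))  # 1.0: "suit" separates the classes perfectly
print(information_gain(data, "tie"))   # 0.0: "tie" tells us nothing here
```

A tree builder would ask about "suit" first, since that question carries all the information in this toy set, and never bother asking about "tie".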
When looking for implementations of algorithms using decision trees, I soon discovered that
most available systems were more or less associated with data mining and expert systems.
Some examples that I found were Alice from Isoft (France) and CART from Salford Systems
(USA). I examined the CART system on a Windows 98 platform and found that it had a
very nice graphical interface that showed the decision tree. The major limitation of CART was
that it only produced binary trees, but there were also many interesting parameters that could
be tuned in the tree modeling process.
A free demo of the CART system can be downloaded from Salford Systems’ website.
Artificial Neural Networks
Artificial neural networks have proved to be an efficient approach to learning real-valued
functions over both continuous and discrete-valued attributes. One of the biggest advantages
of artificial neural networks is that they are robust to noise in the training data. This ability
has contributed to successful implementations in tasks such as face and handwriting
recognition, robot control, language translation, pronunciation software, vehicle control, etc.
In a “pure” artificial neural network, all the nodes work in parallel. This requires special and
expensive hardware. Most of today’s implementations of artificial neural networks are done
on single-processor computers, by simulation of parallelism. This yields only a fraction of
the speed of a “pure” neural network.
Figure 2 Artificial Neural Network
Example of neural network for establishing identity of a human face on a picture.
The idea for Artificial Neural Networks originated from studies of the brain. Since the
brain seems to have an unprecedented ability to learn a wide range of things, it has been an
inspiring challenge to copy its characteristics. The thinking part of the brain is a vast network
made up of approximately 10^11 nerve cells, called neurons. Each neuron is connected to, on
average, ten thousand other cells. The connections are organized in layers.
Each neuron consists of a cell body, called the soma, several shorter fibers, called dendrites,
and a long output fiber called the axon. Junctions called synapses serve as the
connections between the cells.
When a signal propagates from neuron to neuron, it is first handled by the synapse, which can
either increase or decrease its electrochemical potential. Synapses have the ability to
change their characteristics over time. This ability, researchers believe, is what we refer to as
learning. The synapses then lead the signal into the cell via the dendrites. If the total potential
of the cell reaches a certain threshold, an electric pulse, also called an action potential, is
sent down the axon and finally on to the synapses. This is then repeated for each neuron
layer in the network. The last layer is the output layer.
Figure 3 Neuron – the Brain
The computer counterpart of the neuron is the unit, together with the connections it uses, called
links. Each link has a numeric weight. By updating the weights, the links come to
serve the same function as the synapses in the brain. The input function sums all the
incoming signals weighted by their associated weights. The activation function (f in figure 4) then
determines whether to send an activation signal (a in figure 4) onto the output links. There
are several ideas for different activation functions, but they all have in common that they
depend on whether the sum of the input function reaches the threshold or not.
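A single unit can be sketched in a few lines of Python. The weights in the example call are made-up numbers, chosen only to illustrate the threshold behavior:

```python
import math

def step(total, threshold=0.0):
    """Hard-threshold activation: fire only when the summed input reaches the threshold."""
    return 1 if total >= threshold else 0

def sigmoid(total):
    """A smooth alternative activation, used e.g. by backpropagation networks."""
    return 1.0 / (1.0 + math.exp(-total))

def unit_output(inputs, weights, activation=step):
    """Input function: weighted sum of the incoming signals; activation f decides the output a."""
    total = sum(w * x for w, x in zip(weights, inputs))
    return activation(total)

print(unit_output([1, 0], [0.6, -0.4]))  # 1: total 0.6 reaches the threshold
print(unit_output([0, 1], [0.6, -0.4]))  # 0: total -0.4 stays below it
```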
Figure 4 Unit – the Computer
Some units are connected to the outside environment and assigned as input or output units.
The rest of the units are called hidden units and serve as network layers between the input
and output layer. There are two major varieties of network structures, the feed-forward and
the recurrent networks. In feed-forward networks the signal only travels in one direction and
there are no loops. In a recurrent network there are no such restrictions. The recurrent
network is much more advanced and can hold memory, but it is also more vulnerable to
chaotic behavior and instability. The brain is a recurrent network. Some other examples of
successful recurrent networks are the Hopfield and the Boltzmann networks.
The simplest form of feed-forward network is the perceptron. Perceptrons do not have any
hidden units. Still, they are able to represent boolean functionality such as AND, OR and NOT.
Networks with one or more hidden layers are called multilayer networks.
The most popular method for learning in multilayer networks is called backpropagation. The
basic idea in backpropagation is to minimize the squared error between the network output
and the target values of the training examples by dividing the “blame” among the
weights of the network.
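The blame-dividing idea can be sketched in Python. The following trains a tiny network with two sigmoid hidden units on the boolean OR function (a deliberately easy target); the network shape, learning rate, epoch count and random seed are arbitrary choices of mine, not taken from any particular system:

```python
import math
import random

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def forward(w_hidden, w_out, inputs):
    """Forward pass: two sigmoid hidden units, then one sigmoid output unit."""
    hidden = [sigmoid(w[0] * inputs[0] + w[1] * inputs[1] + w[2]) for w in w_hidden]
    output = sigmoid(w_out[0] * hidden[0] + w_out[1] * hidden[1] + w_out[2])
    return hidden, output

def train(examples, epochs=5000, rate=0.5):
    random.seed(1)
    w_hidden = [[random.uniform(-1, 1) for _ in range(3)] for _ in range(2)]
    w_out = [random.uniform(-1, 1) for _ in range(3)]
    for _ in range(epochs):
        for inputs, target in examples:
            hidden, output = forward(w_hidden, w_out, inputs)
            # The output unit's share of the blame, scaled by the sigmoid slope.
            delta_out = (target - output) * output * (1 - output)
            # Each hidden unit inherits blame in proportion to its outgoing weight.
            delta_hidden = [w_out[j] * delta_out * hidden[j] * (1 - hidden[j])
                            for j in range(2)]
            for j in range(2):
                w_out[j] += rate * delta_out * hidden[j]
            w_out[2] += rate * delta_out                   # output bias
            for j in range(2):
                for k in range(2):
                    w_hidden[j][k] += rate * delta_hidden[j] * inputs[k]
                w_hidden[j][2] += rate * delta_hidden[j]   # hidden bias
    return w_hidden, w_out

OR = [([0, 0], 0), ([0, 1], 1), ([1, 0], 1), ([1, 1], 1)]
w_hidden, w_out = train(OR)
for inputs, target in OR:
    _, output = forward(w_hidden, w_out, inputs)
    print(inputs, "->", round(output))
```

After training, the network's outputs round to the OR truth table; the squared error has been driven close to zero by repeatedly distributing the blame backward through the weights.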
Overfitting is an important issue in machine learning, especially so in neural network
learning. Overfitting occurs when a network is trained too much on a small domain of training
data: it then performs very well on that data, but when new data is added it cannot generalize
correctly.
Figure 5 Perceptron
Perceptrons are single-layered feed-forward networks. They were the first approach to artificial neural networks that computer
scientists began to study, in the late 1950s.
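As an illustration, a perceptron can be trained on the boolean AND function with the classic perceptron training rule (not described above; the learning rate and epoch count below are arbitrary choices of mine):

```python
def train_perceptron(examples, epochs=100, rate=0.1):
    """Perceptron rule: on each misclassification, nudge the weights toward the target."""
    weights, bias = [0.0, 0.0], 0.0
    for _ in range(epochs):
        for inputs, target in examples:
            output = 1 if sum(w * x for w, x in zip(weights, inputs)) + bias >= 0 else 0
            error = target - output
            weights = [w + rate * error * x for w, x in zip(weights, inputs)]
            bias += rate * error
    return weights, bias

AND = [([0, 0], 0), ([0, 1], 0), ([1, 0], 0), ([1, 1], 1)]
weights, bias = train_perceptron(AND)
for inputs, target in AND:
    output = 1 if sum(w * x for w, x in zip(weights, inputs)) + bias >= 0 else 0
    print(inputs, "->", output)  # matches the AND truth table
```

Since AND is linearly separable, the perceptron convergence theorem guarantees that this loop settles on a correct set of weights.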
I first looked at Tom M. Mitchell’s implementation of face recognition using an artificial neural
network. It is made for the Unix platform and can be downloaded over the Internet (see URL
below). It requires a graphic-display program in order to view the images processed by
the system. The images used in the system were in the pgm format, and I used XV by John
Bradley to view them. The system gave an interesting insight into how artificial neural networks
can discover patterns in, for example, pictures. The main drawback of this system was that
it used tiny images, only 32x30 pixels in size, which took some time to get used to.
I also looked at a similar commercial system, ImageFinder 3.4 from Attrasoft Inc. It had a very
user-friendly interface and was easy to get started with. ImageFinder is a Java application
that can take gif or jpeg images as input and learn their characteristics in an artificial neural
network. The number of images to learn and the number of times to “practice” on them can
be decided by the user. When the network is done, the user can specify a directory in
which to search for similar images. The output is the names of the closest resembling images
and their scores, based on how closely they resemble the training examples. Unfortunately,
the free demo that I could download did not allow for any adjustment of parameters, which
would have made it even more interesting to evaluate.
Figure 6 ImageFinder 3.4
Evolutionary Computation
Genetic algorithms and genetic programming are two closely related forms of evolutionary
computation. Some authors consider the terms to be synonyms, while others choose to
speak of genetic algorithms when the hypothesis, or “gene”, is a simple bit string, and of
genetic programming when the hypothesis is more advanced, usually a symbolic expression
or programming code. Genetic algorithms have been utilized successfully, especially on
optimization problems. Since many problems can be thought of as optimization problems,
this is hardly a limitation to their usefulness.
Nature is the best known producer of robust and successful organisms. Over time,
organisms that are not well suited for their environment die off, while others that are better
suited live to reproduce. Parents form offspring, so that each new generation carries on the
earlier generations’ experience. If the environment changes slowly, species can adapt to the
changes. Occasionally, random mutations take place. Most of these result in death for the
mutated individual, but a few mutations result in new, successful species.
These facts were first revealed by Charles Darwin in his publication On the Origin of Species
by Means of Natural Selection.
Figure 7 Reproduction in genetic algorithms
In order to simulate evolution, the algorithm needs a metric to establish which selections are
better than others with respect to solving the problem at hand. This metric is called the fitness
function. The most promising individuals (those with the highest scores on the fitness function)
receive a higher reproduction likelihood. The next step is to decide where in the bit string to
make the crossover. This is usually done randomly, somewhere along the string. Then the
parts of the original bit strings are swapped to form two new strings. This is the natural
selection part of the algorithm. But if this had been the only step in the reproduction, most
algorithms would only have been able to find local optima. To solve this problem, the
algorithm incorporates another basic component of regeneration in nature: mutations. This
way an individual bit string can leave a population that is “stuck”. The chance of a mutation is
usually very low.
1. Choose individuals for reproduction based on the fitness function.
2. Choose where to make the crossover.
3. Reproduce using crossover.
4. Mutate single bits with a small random chance.
5. Repeat from step 1.
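The five steps above can be sketched in Python on the classic "one-max" toy problem, where the fitness of a bit string is simply its number of 1-bits. The population size, string length, generation count and mutation rate are arbitrary choices of mine:

```python
import random

def genetic_algorithm(fitness, length=20, population=30, generations=60,
                      mutation_rate=0.01, seed=2):
    rng = random.Random(seed)
    strings = [[rng.randint(0, 1) for _ in range(length)] for _ in range(population)]
    for _ in range(generations):
        weights = [fitness(s) for s in strings]
        offspring = []
        while len(offspring) < population:
            # 1. Choose parents with probability proportional to fitness.
            mother, father = rng.choices(strings, weights=weights, k=2)
            # 2. Choose where to make the crossover.
            point = rng.randint(1, length - 1)
            # 3. Reproduce by swapping the tails of the two bit strings.
            for child in (mother[:point] + father[point:],
                          father[:point] + mother[point:]):
                # 4. Mutate single bits with a small random chance.
                offspring.append([1 - b if rng.random() < mutation_rate else b
                                  for b in child])
        strings = offspring  # 5. Repeat with the new generation.
    return max(strings, key=fitness)

best = genetic_algorithm(lambda s: sum(s) + 1)  # +1 keeps every selection weight positive
print(sum(best))  # close to 20: nearly all bits have been driven to 1
```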
Genetic programming differs from genetic algorithms in that it strives to optimize code rather
than bit strings. Programs manipulated by a genetic programming system are usually represented by
trees corresponding to the parse trees of the programs. Just as in genetic algorithms, the
individuals produce new generations through selection, crossover and mutation. The fitness
of an individual is usually determined by executing the program on training data. Crossover is
performed by swapping randomly selected subtrees.
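The subtree-swapping crossover can be sketched in Python by representing each parse tree as a nested list [operator, child, child]. The two example expressions are made up:

```python
import random

def subtrees(tree, path=()):
    """Enumerate every subtree of a parse tree together with the path that reaches it."""
    yield path, tree
    if isinstance(tree, list):
        for i, child in enumerate(tree[1:], start=1):
            yield from subtrees(child, path + (i,))

def replace(tree, path, new):
    """Return a copy of `tree` with the subtree at `path` replaced by `new`."""
    if not path:
        return new
    copy = list(tree)
    copy[path[0]] = replace(copy[path[0]], path[1:], new)
    return copy

def crossover(a, b, rng):
    """Swap a randomly selected subtree of a with a randomly selected subtree of b."""
    path_a, sub_a = rng.choice(list(subtrees(a)))
    path_b, sub_b = rng.choice(list(subtrees(b)))
    return replace(a, path_a, sub_b), replace(b, path_b, sub_a)

rng = random.Random(3)
parent1 = ["+", "x", ["*", "y", "2"]]   # the expression x + y * 2
parent2 = ["-", ["+", "x", "1"], "y"]   # the expression (x + 1) - y
child1, child2 = crossover(parent1, parent2, rng)
print(child1)
print(child2)
```

Note that the swap conserves material: taken together, the two children contain exactly the same operators and terminals as the two parents.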
Figure 8 Crossover operation in genetic programming
I found several very interesting sites on evolutionary programming on the Web. Two of the
best ones were Java applets: one of the game Tron, and the other a site called
The GA Playground: http://www.aridolan.com/ga/gaa/gaa.html
Tron is a computer game based on the 1982 Walt Disney movie with the same name. It uses
a genetic algorithm in order to learn from previous games.
According to the people behind the program they
“... have put a genetic learning algorithm online. A “background" GA generates players by
having the computer play itself. A "foreground" GA leads the evolutionary process,
evaluating players by their performance against real people.”
It is very hard to beat the computer at Tron at present; I only succeeded in two out of
approximately 50 attempts. The on-line game Tron is a good example of a successful
utilization of genetic algorithms.
Figure 9 Tron - Computer winning rate
The evolution of the Tron program over time according to the authors.
The other explored site, the GA Playground, was similar in that it provided on-line Java
applets for evaluation of algorithms. This site provided more freedom to choose different
algorithms and parameters for different, user-selected problems.
One example of an interesting problem was the Travelling Salesman Problem, which was
implemented in three different cases (all cities on a circle, cities in Bavaria, and capitals of
the US). These examples gave good insight into how the applet worked.
Figure 10 GA Playground’s TSP solving algorithm
GA Playground has a very nice and adjustable user interface that allows for different setups. Some features require that the
program is downloaded and run as an application.
There are several other interesting approaches to machine learning than the ones mentioned
above. Probably Approximately Correct (PAC) learning is one good model for learning.
The Bayesian learning model is another. It is based on Bayes’ theorem for calculating the
posterior probability P(h|D) from the prior probability P(h), together with P(D) and P(D|h).
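As a small numeric sketch of Bayes’ theorem, with probabilities that are entirely made up for illustration:

```python
def posterior(prior, likelihood, evidence):
    """Bayes' theorem: P(h|D) = P(D|h) * P(h) / P(D)."""
    return likelihood * prior / evidence

# Hypothetical numbers: prior P(h) = 0.3 and likelihood P(D|h) = 0.8.
# P(D) follows from the law of total probability over h and not-h.
p_d = 0.8 * 0.3 + 0.2 * 0.7
print(round(posterior(0.3, 0.8, p_d), 3))  # 0.632
```

Observing D here raises the probability of the hypothesis from 0.3 to about 0.63, since D is much more likely under h than under its negation.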
A third promising model is reinforcement learning. It is closely related to dynamic
programming and is frequently used to solve optimization problems. The Q learning algorithm
is an interesting example from this category.
There are many more and since this is a fairly new science, even more are sure to come.
Throughout my personal project I have studied a variety of different algorithms that have
proved to be more or less useful for their different objectives. Some systems have been
pure genetic algorithms or pure artificial neural networks, while others have integrated
different approaches in an attempt to get the best of each algorithm. Different algorithms
have different advantages and disadvantages; for example, decision trees are better suited for
discrete-valued environments. I have found that accurate knowledge about the
characteristics of the problem, and basic knowledge about the algorithms, is essential for
finding a good algorithm for the task. Some problems are better suited for machine learning
algorithms than others. This may be because there is still a long way to go in the science of
machine learning, or because some of the expectations of machine learning are too high. For
instance, Russell and Norvig suggest in Artificial Intelligence – A Modern Approach that it might
always be worth trying a simple implementation of an artificial neural network, or even a
genetic algorithm, on a problem, just to see if it works. Our knowledge of how, for example,
neural networks work is still very limited, especially for recurrent networks.
There are other important aspects where knowledge is essential. For example, the
occurrence of overfitting is a trap that anyone dealing with machine learning should be aware
of. Even if the algorithm is perfect, the handling of the set of training data is still a dubious
matter.
Russell, Stuart & Norvig, Peter (1995). Artificial Intelligence – A Modern Approach. Prentice Hall, USA.
Mitchell, Tom M. (1997). Machine Learning. McGraw-Hill, USA.