Machine Learning
August 17, 2008
In this course, we study how a computer can automatically learn to
perform tasks that it is not explicitly programmed for. For example, given
medical information such as EKGs for a few thousand patients, a computer
can automatically learn to identify the ones with various forms of heart
disease. Of course, a highly relevant question is how accurately the
computer is then able to diagnose new patients, which is called the
generalizing ability of the synthesized model.
The course book is Machine Learning by Tom M. Mitchell, supplemented by
selected papers, for example on genetic algorithms and automatic
programming.
Using machine learning in practice requires hands-on experience, which in
this course is provided by three projects that students should carry out
individually or in cooperation with one or two other students.
At the start of the course, each student selects a machine learning
problem, which is to be processed with decision trees in the first project,
neural nets in the second one and automatic programming in the last one.
When selecting a problem, consider the following.
1. Do you have any hobby or other area of interest for which machine
learning may be useful? Can you collect data yourself or obtain data
in some other way?
2. There are hundreds of more or less ready-made data sets on the
Internet, for example
http://www.ics.uci.edu/~mlearn/MLSummary.html
3. Are there any commercial, scientific or other applications of your data
set?
Here is a brief description of the projects. One purpose of the first two
is to compare the decision tree learning software C5.0 with so-called
neural nets.
Project 1
In this project, we use C5.0, which is a commercial tool for synthesis of
decision trees and sets of IF-THEN rules. C5.0 is installed on the Linux
machine frigg.hiof.no but is also available for Microsoft Windows.
The data set that you have chosen will need to be converted to the input
format for C5.0. Note that many of the ready-made data sets are already
in this format.
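The conversion can be sketched in a few lines of Python. This is only a
minimal sketch, assuming the usual C5.0 layout of a .names file (the class
attribute first, then one "name: type." entry per attribute) and a .data
file (one comma-separated case per line, with "?" for a missing value).
The function names and the example attributes are hypothetical; check the
C5.0 documentation for the details of your own data set.

```python
def c50_names(class_name, attributes):
    """Build the text of a .names file.

    attributes maps each attribute name to either the string "continuous"
    or a list of discrete values. It must include the class attribute.
    """
    lines = [f"{class_name}."]  # first entry names the attribute holding the class
    for name, spec in attributes.items():
        kind = spec if isinstance(spec, str) else ", ".join(spec)
        lines.append(f"{name}: {kind}.")
    return "\n".join(lines) + "\n"


def c50_data(attributes, rows):
    """Build the text of a .data file from a list of dictionaries."""
    out = []
    for row in rows:
        # Missing values (None or empty string) are written as "?".
        cells = ["?" if row.get(a) in (None, "") else str(row[a])
                 for a in attributes]
        out.append(",".join(cells))
    return "\n".join(out) + "\n"


if __name__ == "__main__":
    # Hypothetical attributes for a small medical data set.
    attrs = {"age": "continuous", "sex": ["M", "F"],
             "diagnosis": ["sick", "healthy"]}
    cases = [{"age": 54, "sex": "M", "diagnosis": "sick"},
             {"age": None, "sex": "F", "diagnosis": "healthy"}]
    print(c50_names("diagnosis", attrs), end="")
    print(c50_data(attrs, cases), end="")
```

Writing the two strings to files with a common stem, for example
heart.names and heart.data, would give C5.0 its input.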
You are required to present your work on this project to the rest of the
class in a 10 to 15 minute talk scheduled between 10 and 12 on Monday,
September 15th, 2008.
Describe the problem to be solved, your data set and previous work by
others on the same problem. What applications does the problem have?
Which attributes are used and how are they converted to suitable input for
C5.0? How do you interpret the output from C5.0 for your data set? Try to
characterize the generalizing ability of the models generated by C5.0. How
sensitive is C5.0 to missing attributes or less training data? Do trees or
rules work better as models? Does boosting improve the classification?
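One common way to characterize generalizing ability is k-fold
cross-validation: the data is split into k parts, and each part in turn is
held out for testing while a model is built from the rest. The sketch below
is an illustration only. The majority-class learner is a stand-in so that
the code runs on its own; it is not C5.0, and all names in it are
hypothetical.

```python
import random
from collections import Counter


def majority_learner(train_cases):
    """Stand-in learner: always predicts the most common training label."""
    label = Counter(y for _, y in train_cases).most_common(1)[0][0]
    return lambda x: label


def cross_val_accuracy(cases, learner, k=10, seed=0):
    """Estimate accuracy on unseen cases by k-fold cross-validation.

    cases is a list of (attributes, label) pairs and learner is a function
    that takes training cases and returns a predict function.
    """
    cases = list(cases)
    random.Random(seed).shuffle(cases)
    folds = [cases[i::k] for i in range(k)]
    correct = 0
    for i, held_out in enumerate(folds):
        # Train on every fold except the one held out for testing.
        train = [c for j, fold in enumerate(folds) if j != i for c in fold]
        model = learner(train)
        correct += sum(model(x) == y for x, y in held_out)
    return correct / len(cases)


if __name__ == "__main__":
    data = [((i,), "a") for i in range(80)] + [((i,), "b") for i in range(20)]
    print(cross_val_accuracy(data, majority_learner))
```

The accuracy of such a trivial baseline is a useful reference point: a
decision tree that does not beat it has learned little. Repeating the
estimate on smaller subsets of the data gives an impression of the
sensitivity to less training data.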
Project 2
In this project, we either use the neural network toolbox in MATLAB or
neural net software in C that you write yourself using an automatic differ-
entiation library and possibly also a numerical optimization library.
Split the data set into three parts: one for training, one for validation
and one for testing.
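A random three-way split can be sketched as follows. The 60/20/20
proportions and the function name are assumptions made for illustration;
the course does not prescribe them.

```python
import random


def split_dataset(cases, train_frac=0.6, val_frac=0.2, seed=0):
    """Shuffle the cases and split them into training, validation and test sets."""
    cases = list(cases)
    random.Random(seed).shuffle(cases)  # fixed seed makes the split repeatable
    n_train = round(len(cases) * train_frac)
    n_val = round(len(cases) * val_frac)
    train = cases[:n_train]
    val = cases[n_train:n_train + n_val]
    test = cases[n_train + n_val:]  # whatever remains is the test set
    return train, val, test


if __name__ == "__main__":
    train, val, test = split_dataset(range(100))
    print(len(train), len(val), len(test))
```

Shuffling before splitting matters when the data file is sorted, for
example by class, since otherwise one of the three sets may contain only
a single class.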
How do you encode the input and output to be suitable for a neural
network? What alternative encodings are there?
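Two widely used codings are shown below as a minimal sketch: one-of-n
("one-hot") coding for a discrete attribute and min-max scaling to the
interval [0, 1] for a continuous one. The function names and example
values are hypothetical, and other codings, such as a single ordinal
number or standardization to zero mean, are equally possible.

```python
def one_hot(value, categories):
    """Code a discrete value as a vector with a single 1.0 entry."""
    return [1.0 if value == c else 0.0 for c in categories]


def min_max_scale(values):
    """Scale a list of numbers linearly so that they lie in [0, 1]."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # A constant attribute carries no information; map it to 0.0.
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


if __name__ == "__main__":
    print(one_hot("F", ["M", "F"]))
    print(min_max_scale([40, 50, 60]))
```

The minimum and maximum should be computed on the training set only and
then reused for the validation and test sets, so that no information from
the test data leaks into training.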
How are the results on the training, validation and test sets influenced
by the number of nodes in the hidden layer and the number of epochs? Try a
few different numerical optimization methods, for example gradient descent
and quasi-Newton methods.
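To make the role of epochs concrete, here is a minimal sketch of batch
gradient descent that records the training and validation error after
every epoch. It fits a single linear unit y = w*x + b rather than a
network with a hidden layer, and it is neither the MATLAB toolbox nor a
quasi-Newton method; the learning rate, epoch count and data are
assumptions chosen only so that the example runs on its own.

```python
def mse(w, b, data):
    """Mean squared error of the linear unit on a list of (x, y) pairs."""
    return sum((w * x + b - y) ** 2 for x, y in data) / len(data)


def train_gd(train, val, epochs=200, lr=0.5):
    """Batch gradient descent; returns the weights and the error history."""
    w, b = 0.0, 0.0
    history = []
    n = len(train)
    for _ in range(epochs):
        # Gradient of the mean squared error with respect to w and b.
        grad_w = sum(2 * (w * x + b - y) * x for x, y in train) / n
        grad_b = sum(2 * (w * x + b - y) for x, y in train) / n
        w -= lr * grad_w
        b -= lr * grad_b
        history.append((mse(w, b, train), mse(w, b, val)))
    return w, b, history


if __name__ == "__main__":
    train = [(x / 10, 2 * x / 10 + 1) for x in range(11)]
    val = [(0.25, 1.5), (0.75, 2.5)]
    w, b, history = train_gd(train, val)
    print(w, b, history[-1])
```

Plotting the two columns of the history against the epoch number shows
where the validation error stops improving, which is the usual signal to
stop training.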
Compare neural nets with C5.0 for your problem.
The date for presentation of this project will be determined later.
Hopefully, the collected work of the class with two different machine
learning methods and a collection of problems will illuminate the pros and
cons of these methods for various applications. It will also contribute to the
practical experience and skill that cannot be obtained by only reading the
textbook.
Project 3
This project is either to automatically generate programs for a number of
small and traditional programming tasks or to use automatic programming
for the same data set as in projects 1 and 2. Project 3 is described more
fully in its own document, to be found at
http://www-ia.hiof.no/~rolando/ML
The grading of the course is based on both the projects (65%) and a
theory exam (35%).
When the projects have started, each group will get its own supervision
time on Thursdays, whereas the Monday lectures will continue throughout
the fall semester.
Contribute to an interesting and entertaining course by asking questions
or contributing your own views during lectures and by being active in
practical problem solving!