Machine Learning


                              August 17, 2008


    In this course, we study how a computer can automatically learn to
perform tasks that it is not explicitly programmed for. For example, given
medical information such as EKGs for a few thousand patients, a computer
can automatically learn to identify the ones with various forms of heart
disease. Of course, a highly relevant question is how accurately the
computer is then able to diagnose new patients, which is called the
generalizing ability of the synthesized model.
    The course book is Machine Learning by Tom M. Mitchell, supplemented
by selected papers, for example on genetic algorithms and automatic
programming.
    Using machine learning in practice requires hands-on experience,
which in this course is provided by three projects that students carry
out individually or in cooperation with one or two other students.
    At the start of the course, each student selects a machine learning
problem, which is to be processed with decision trees in the first project,
neural nets in the second one and automatic programming in the last one.
When selecting a problem, consider the following.

  1. Do you have any hobby or other area of interest for which machine
     learning may be useful? Can you collect data yourself or obtain data
     in some other way?

  2. There are hundreds of more or less ready-made data sets on the
     internet, for example

      http://www.ics.uci.edu/~mlearn/MLSummary.html

  3. Are there any commercial, scientific or other applications of your data
     set?

   Here is a brief description of the projects. One purpose of the first
two is to compare the decision tree learning software C5.0 with so-called
neural nets.

Project 1
In this project, we use C5.0, which is a commercial tool for synthesis of
decision trees and sets of IF-THEN rules. C5.0 is installed on the Linux
machine frigg.hiof.no but is also available for Microsoft Windows.
    The data set that you have chosen will need to be converted to the input
format for C5.0. Note that many of the ready-made data sets are already
in this format.
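    As a rough illustration of such a conversion, the sketch below builds
the two text files that C5.0 reads for a tiny, made-up data set: a .names
file declaring the classes and attributes, and a .data file with one
comma-separated case per line and ? for an unknown value. The attribute
names, class labels and values are hypothetical, and the layout reflects
our understanding of the C5.0 input format, so check it against the C5.0
documentation before relying on it.

```python
# Hypothetical attributes: (name, type), where type is "continuous"
# or a list of the discrete values the attribute can take.
ATTRIBUTES = [
    ("age", "continuous"),
    ("sex", ["M", "F"]),
    ("resting_pulse", "continuous"),
]
CLASSES = ["healthy", "disease"]  # hypothetical class labels

# Made-up cases; None marks a missing attribute value.
ROWS = [
    (63, "M", 72, "disease"),
    (41, "F", None, "healthy"),
]


def names_text(classes, attributes):
    """Contents of a .names file: the classes first, then one attribute per line."""
    lines = [", ".join(classes) + "."]
    for name, kind in attributes:
        spec = kind if isinstance(kind, str) else ", ".join(kind)
        lines.append(f"{name}: {spec}.")
    return "\n".join(lines) + "\n"


def data_text(rows):
    """Contents of a .data file: one case per line, None written as '?'."""
    lines = [",".join("?" if v is None else str(v) for v in row) for row in rows]
    return "\n".join(lines) + "\n"


print(names_text(CLASSES, ATTRIBUTES))
print(data_text(ROWS))
```

    Writing the two strings to files that share a stem, for example
heart.names and heart.data (a hypothetical stem), would give C5.0 its
input under the assumptions above.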
    You are required to present your work on this project to the rest of the
class in a 10 to 15 minute talk scheduled between 10 and 12 on Monday
September 15th, 2008.
    Describe the problem to be solved, your data set and previous work
by others on the same problem. What applications does the problem have?
Which attributes are used and how are they converted to suitable input for
C5.0? How do you interpret the output from C5.0 for your data set? Try to
characterize the generalizing ability of the models generated by C5.0. How
sensitive is C5.0 to missing attributes or less training data? Are trees or
rules better as models? Does boosting improve the classification?
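    One common way to characterize generalizing ability is to measure the
error rate on cases that were not used for training. The sketch below does
this with k-fold cross-validation. The train and predict arguments are
placeholders for whatever learner is being studied; here a trivial
majority-class learner stands in for C5.0, and the data are made up.

```python
import random
from collections import Counter


def k_fold_indices(n_cases, k, seed=0):
    """Shuffle the case indices and deal them into k folds of nearly equal size."""
    indices = list(range(n_cases))
    random.Random(seed).shuffle(indices)
    return [indices[i::k] for i in range(k)]


def cross_validated_error(cases, labels, train, predict, k=5, seed=0):
    """Fraction of cases misclassified when each fold is held out once."""
    errors = 0
    for held_out in k_fold_indices(len(cases), k, seed):
        held = set(held_out)
        train_idx = [i for i in range(len(cases)) if i not in held]
        model = train([cases[i] for i in train_idx],
                      [labels[i] for i in train_idx])
        errors += sum(predict(model, cases[i]) != labels[i] for i in held_out)
    return errors / len(cases)


# Stand-in learner (an assumption for illustration): always predict
# the class that is most frequent in the training cases.
def train_majority(cases, labels):
    return Counter(labels).most_common(1)[0][0]


def predict_majority(model, case):
    return model


cases = list(range(20))
labels = ["disease"] * 5 + ["healthy"] * 15
print(cross_validated_error(cases, labels, train_majority, predict_majority))
```

    Repeating the estimate with fewer training cases, or with some
attribute values removed, gives a simple picture of how sensitive a learner
is to each.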

Project 2
In this project, we either use the neural network toolbox in MATLAB or
neural net software in C that you write yourself using an automatic
differentiation library and possibly also a numerical optimization library.
    Split the data set into three parts: one for training, one for
validation and a third one for testing.
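    A minimal sketch of such a split follows. The 60/20/20 proportions and
the fixed random seed are assumptions chosen for illustration, not course
requirements.

```python
import random


def split_data(cases, train_frac=0.6, valid_frac=0.2, seed=0):
    """Shuffle the cases and cut them into training, validation and test sets."""
    shuffled = list(cases)
    random.Random(seed).shuffle(shuffled)  # fixed seed makes the split repeatable
    n_train = int(len(shuffled) * train_frac)
    n_valid = int(len(shuffled) * valid_frac)
    train = shuffled[:n_train]
    valid = shuffled[n_train:n_train + n_valid]
    test = shuffled[n_train + n_valid:]  # whatever remains
    return train, valid, test


train, valid, test = split_data(range(100))
print(len(train), len(valid), len(test))
```

    Shuffling before cutting matters when the cases in the original file
are sorted, for example by class.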
    How do you code the input and output to be suitable for a neural
network? What alternative codings are there?
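    As one possible answer, the sketch below scales a numeric attribute to
the interval [0, 1] and codes a nominal attribute with one input per value
(1-of-n coding). The attributes and the value range are hypothetical.

```python
def one_hot(value, categories):
    """Code a nominal value as a vector with a single 1.0 (1-of-n coding)."""
    return [1.0 if value == c else 0.0 for c in categories]


def min_max_scale(x, lo, hi):
    """Map a numeric value from the range [lo, hi] to [0, 1]."""
    return (x - lo) / (hi - lo)


def encode_case(age, sex):
    # Hypothetical attributes; the age range 0 to 100 is an assumption.
    return [min_max_scale(age, 0, 100)] + one_hot(sex, ["M", "F"])


print(encode_case(50, "M"))
print(one_hot("disease", ["healthy", "disease"]))
```

    For a two-class output, an alternative is a single output unit whose
value is thresholded, for example at 0.5.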
    How is the result on the training, validation and test sets influenced by
the number of nodes in the hidden layer and the number of epochs? Try a
few different numerical optimization methods, for example gradient descent
and quasi-Newton methods.
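    To show what tracking the result per epoch can look like, the sketch
below trains a single sigmoid unit with plain batch gradient descent and
records the training and validation loss after every epoch. It is a
deliberately reduced example: there is no hidden layer and no quasi-Newton
method, and the data, learning rate and number of epochs are made up.

```python
import math
import random


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def mean_loss(w, b, data):
    """Mean cross-entropy of one sigmoid unit on (x, y) pairs with y in {0, 1}."""
    total = 0.0
    for x, y in data:
        p = sigmoid(w * x + b)
        p = min(max(p, 1e-12), 1.0 - 1e-12)  # keep log() away from zero
        total -= y * math.log(p) + (1 - y) * math.log(1.0 - p)
    return total / len(data)


def train(train_set, valid_set, epochs=200, rate=0.5):
    """Batch gradient descent; returns w, b and per-epoch (train, valid) losses."""
    w, b = 0.0, 0.0
    history = []
    n = len(train_set)
    for _ in range(epochs):
        grad_w = sum((sigmoid(w * x + b) - y) * x for x, y in train_set) / n
        grad_b = sum(sigmoid(w * x + b) - y for x, y in train_set) / n
        w -= rate * grad_w
        b -= rate * grad_b
        history.append((mean_loss(w, b, train_set), mean_loss(w, b, valid_set)))
    return w, b, history


# Synthetic one-dimensional data: class 1 tends to have larger x.
rng = random.Random(0)
data = [(rng.gauss(1.0 if y else -1.0, 1.0), y) for y in [0, 1] * 100]
train_set, valid_set = data[:150], data[150:]

w, b, history = train(train_set, valid_set)
for epoch in (0, 9, 49, 199):
    print(epoch + 1, history[epoch])
```

    With a real network, the same bookkeeping makes it possible to see at
which epoch the validation loss stops improving while the training loss
keeps falling.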
    Compare neural nets with C5.0 for your problem.
    The date for presentation of this project will be determined later.
    Hopefully, the collected work of the class with two different machine
learning methods and a collection of problems will illuminate the pros and
cons of these methods for various applications. It will also contribute to the
practical experience and skill that cannot be obtained by only reading the
textbook.

Project 3
This project is to either automatically generate programs for a number of
small and traditional programming tasks or to use automatic programming
for the same data set as in projects 1 and 2. Project 3 is described more
fully in its own document, to be found at

http://www-ia.hiof.no/~rolando/ML

    The grading of the course is based on both the projects (65%) and a
theory exam (35%).
    Once the projects have started, each group will get its own supervision
time on Thursdays, whereas the Monday lectures will continue throughout
the fall semester.
    Contribute to an interesting and entertaining course by asking questions
or offering your own views during lectures and by being active in
practical problem solving!



