BCB 444/544
Lab 10 (11/8/07)

Machine Learning

Due Monday 11/12/07 by 5 pm – email to terrible@iastate.edu

Objectives
1. Experiment with applying machine learning algorithms to biological problems.
2. Learn about how to set up a machine learning experiment.

Introduction
Machine learning combines principles from computer science, statistics, psychology, and
other disciplines to develop computer programs for specific tasks. The tasks that
machine learning programs have been developed for vary widely, from diagnosing cancer
to driving a car. In biology, machine learning approaches are very popular for problems
such as protein secondary structure prediction, gene prediction, analyzing microarray
data, and many others. Machine learning is often quite effective, especially on problems
that have a lot of data available. Molecular biology certainly has lots of data.

A note about our training and test set files:

The data set we are using in this lab comes from a set of RNA-binding proteins. Each
input example is a window of 15 amino acids from a protein sequence plus a label of 1
or 0 indicating whether the central amino acid in the window binds to RNA (1 means
binding, 0 means not binding). The training set contains an equal number of RNA-binding and non-binding
residues, which is not the natural distribution. The entire data set contains only about
20% binding residues. We are using a set with equal numbers of binding and non-
binding residues to make things a little easier. The test set is a single protein sequence,
the 50S ribosomal protein L20 from E. coli. We will use this test case to see how well
the classifiers we build perform on a protein sequence that is not in the training set.
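To make the input encoding concrete, here is a small sketch of how such 15-residue windows could be generated. The helper name, the 'X' padding character, and the toy sequence are all illustrative assumptions, not the lab's actual preprocessing:

```python
# Hypothetical sketch: build one 15-residue window per position in a protein
# sequence, padding the ends with 'X' so every residue gets a full window.
def make_windows(sequence, window=15):
    """Yield (window, center_index) pairs for each residue in the sequence."""
    half = window // 2
    padded = "X" * half + sequence + "X" * half
    for i in range(len(sequence)):
        yield padded[i:i + window], i

seq = "MARVKRGVIA"  # toy sequence, not the real L20 test protein
examples = list(make_windows(seq))
# examples[0][0] == "XXXXXXXMARVKRGV": window centered on the first residue
```

In the real data, each window would also carry the 1/0 RNA-binding label for its central residue.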

Exercises

Before we get started on the exercises, we need to learn a little about machine learning
experiments. The first concept is training and testing. In order to estimate the
performance of any classifier, we need to train the classifier on some data and then
measure performance on some other data. There are a few ways to do this. In the lab
today, we will use cross validation and a separate test set.

Go to http://en.wikipedia.org/wiki/Cross_validation and read about cross validation.

1. What is K-fold cross validation? What data is used for training? What data is
used for testing?

The most important point in training and testing is that the same data must never appear
in both the training set and the test set. Usually, we have limited data and want to use
as much of it as possible for training, which is why we run cross validation experiments.
In today's lab, we will also give ourselves the luxury of a test case that is not in the
training set on which to test our classifiers.
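The train/test separation described above can be sketched in a few lines. This is a minimal illustration of K-fold cross validation, not Weka's implementation; the function names are assumptions:

```python
# Minimal K-fold cross validation sketch: split the examples into K folds,
# train on K-1 folds, test on the held-out fold, and average the K scores.
def k_fold_indices(n, k):
    """Partition indices 0..n-1 into k nearly equal contiguous folds."""
    folds, start = [], 0
    for i in range(k):
        size = n // k + (1 if i < n % k else 0)
        folds.append(list(range(start, start + size)))
        start += size
    return folds

def cross_validate(data, labels, k, train_fn, score_fn):
    """Return the mean held-out score over k train/test splits."""
    scores = []
    for test_idx in k_fold_indices(len(data), k):
        held_out = set(test_idx)
        train_idx = [i for i in range(len(data)) if i not in held_out]
        model = train_fn([data[i] for i in train_idx],
                         [labels[i] for i in train_idx])
        scores.append(score_fn(model,
                               [data[i] for i in test_idx],
                               [labels[i] for i in test_idx]))
    return sum(scores) / k
```

The key property is visible in the loop: each example is tested exactly once, and never by a model that saw it during training.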

Algorithms

Read the sections on Naïve Bayes, J48 Decision tree, and SVM here:

http://www.d.umn.edu/~padhy005/Chapter5.html

2. What assumption is used in the Naïve Bayes classifier?

3. What criterion does the decision tree classifier use to decide which attribute to
put first in the decision tree?

4. What is the purpose of the kernel function in an SVM classifier?

5. Based on what you read, which method(s) can a human interpret? What
method(s) can a human not interpret, i.e., “black box” method(s)?

6. According to this web page, which algorithm tends to have the highest
classification accuracy?

Experiments
In this lab, we will be using the program Weka. Weka is a program that contains
implementations of many machine learning algorithms in a standard framework that
makes it easy to experiment with many methods. If you are in the computer lab in 1340
MBB, Weka is already installed on the machines. If you are working from home, you
will have to download and install Weka.

Weka is available at:

http://www.cs.waikato.ac.nz/ml/weka/

The installation instructions should be fairly easy to follow. If you have trouble, send
me an email and I may be able to help, or come into the lab and use the machines here.
The lab in MBB is open most of the time; our class is currently the only one that uses
this room.

Running Weka:

Before the step-by-step instructions, here are some final notes on what we will do with Weka.

First, Weka implements a lot of different algorithms. We will use Naïve Bayes, the J48
decision tree, and an SVM, and then you will get to choose a fourth algorithm. Each
algorithm has quite a few parameters that can be changed, and as we have seen all
semester, changing parameters can drastically change the results. That being said, we
will accept the default parameters for all of the algorithms, with only one small tweak
that will be described later.

Second, Weka allows you to run cross validation experiments or use a supplied test set
(along with some other options). We will do both in this lab.

Finally, a short primer on how to read the results that Weka gives you. A typical results
section looks like this:

[screenshot of Weka classifier output appears here in the original handout]
The output includes a lot of information, some useful, some not so useful. The top of the
output shows the information about your data set, the algorithm used, information about
the model produced after training (which is really only useful if you are using an
algorithm that is interpretable by humans), and finally the amount of time it took to build
the model. The most useful section for our purposes is at the bottom, which has the
performance statistics. For this lab, all of the results you are required to fill into the
tables can be read directly from the output as long as you know where to find them. The
table asks for accuracy, which in the Weka output is listed as “Correctly Classified
Instances.” The next entries in your results table are TPRate, FPRate, Precision, and
Recall, which can be read in the section called “Detailed Accuracy By Class.” These
values are listed for both classes (1 and 0 for RNA-binding and non-RNA-binding
respectively). I only want to see the values for class 1.

The final numbers for your results table are TP (true positive), FP (false positive), FN
(false negative), and TN (true negative). TP means we predicted RNA-binding and it
actually is RNA-binding. FP means we predicted RNA-binding and it is not actually
RNA-binding. FN means we predicted non-RNA-binding and it actually is RNA-
binding. TN means we predicted non-RNA-binding and it actually is non-RNA-binding.
Our correct predictions are TP and TN; our incorrect predictions are FP and FN. The
counts of TP, FP, FN, and TN can be found in the section called “Confusion Matrix.”
Weka prints the confusion matrix with actual classes as rows and predicted classes as
columns, so with class 1 listed first the top left number is TP, the top right is FN,
the bottom left is FP, and the bottom right is TN.
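If you want to check the numbers you copy down, every class-1 statistic in the results table can be recomputed from the four confusion-matrix counts. Here is a sketch using toy counts (not real lab results, and not Weka code; just the definitions):

```python
# Sketch: recompute the results-table statistics for class 1 (RNA-binding)
# from confusion-matrix counts. Toy illustration of the standard definitions.
def class1_stats(tp, fn, fp, tn):
    total = tp + fn + fp + tn
    return {
        "Accuracy": (tp + tn) / total,   # "Correctly Classified Instances"
        "TPRate": tp / (tp + fn),        # same as Recall for class 1
        "FPRate": fp / (fp + tn),
        "Precision": tp / (tp + fp),
        "Recall": tp / (tp + fn),
    }

stats = class1_stats(tp=30, fn=10, fp=20, tn=40)  # toy counts
# stats["Accuracy"] == 0.7, stats["Precision"] == 0.6
```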

Finally, on to running some programs. For the lab machines, you can simply double-
click on the weka.jar file on the desktop.

Click on the button that says “Explorer” to get started.

We will use the following files:

Training set

Test set

Click on the Open file button and choose the training set file.

Click on the Classify tab to get to the classification algorithms.

To choose the algorithm, click on the Choose button near the top in the classifier section.
Click on bayes, then NaiveBayes. Be sure that Cross validation is selected, and change
the number in the box from 10 to 5. Then click on the Start button to run the classifier.
Record the performance in the table below.

To run the predictions on the test case, click the circle next to Supplied test set, then click
the Set… button and choose the test file. Then click Start to build the classifier and make
predictions on our test case. Record the performance in the table below.

For our next algorithm, we will use the J48 decision tree. To choose the algorithm, click
on the Choose button near the top in the classifier section. Click on trees, then J48. Be
sure that Cross validation is selected, and make sure the number in the box is 5. Then
click on the Start button to run the classifier. Record the performance in the table below.
To run the predictions on the test case, click the circle next to Supplied test set, then click
the Set… button and choose the test file. Then click Start to build the classifier and make
predictions on our test case. Record the performance in the table below.


For our next algorithm, we will use an SVM. To choose the algorithm, click on the
Choose button near the top in the classifier section. Click on functions, then SMO. Be
sure that Cross validation is selected, and make sure the number in the box is 5. Then
click on the Start button to run the classifier. Record the performance in the table below.

To run the predictions on the test case, click the circle next to Supplied test set, then click
the Set… button and choose the test file. Then click Start to build the classifier and make
predictions on our test case. Record the performance in the table below.

Next, we will run the SVM algorithm again using a different kernel function. To change
the kernel function, click on the text box next to the Choose button at the top. This will
bring up a window showing the algorithm parameters. At the bottom of this window,
there is a line that says “useRBF.” Change the value in this box to true to use the RBF
kernel function. Click OK. Be sure that Cross validation is selected, and make sure the
number in the box is 5. Then click on the Start button to run the classifier. Record the
performance in the table below.
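For reference, the RBF (Gaussian) kernel that the useRBF option switches SMO to is k(x, y) = exp(-gamma * ||x - y||^2). A small sketch of the formula itself (the gamma value is just illustrative):

```python
import math

# Sketch of the RBF (Gaussian) kernel: k(x, y) = exp(-gamma * ||x - y||^2).
# gamma controls the kernel width; 0.1 here is an illustrative value only.
def rbf_kernel(x, y, gamma=0.1):
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-gamma * sq_dist)

# Identical points always score 1.0; similarity decays with distance.
assert rbf_kernel([1.0, 0.0], [1.0, 0.0]) == 1.0
```

Unlike the default polynomial kernel, the RBF kernel can fit class boundaries that are not linear in the original features, which is why trying both is worthwhile.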

To run the predictions on the test case, click the circle next to Supplied test set, then click
the Set… button and choose the test file. Then click Start to build the classifier and make
predictions on our test case. Record the performance in the table below.

Finally, choose a different algorithm and run both 5 fold cross validation and predictions
on the test case, and add the performance values to the tables in the blank line at the
bottom. Be sure to include the name of the algorithm you chose. You can choose any of
the algorithms available in Weka.

(Side note – some of the algorithms will not work on this data set. They will produce an
error message saying something about incompatibility. If this happens to you, simply
choose a different algorithm. Also, some algorithms run much faster than others. If the
algorithm you chose is taking much longer than our SVM runs, you may want to choose a
different algorithm.)

To find out more about the algorithms, you can select the algorithm with the Choose
button at the top, then click on the text box near the Choose button (just like we did for
the SVM when we changed to the RBF kernel). In the window that opens up, there is a
section called “About” that gives a one line description of the algorithm. Click on the
More button in this section to get more information, including a reference to a paper
describing the algorithm. Another option for finding out more about the algorithm is to
do an internet search with the name of the algorithm.
5 Fold Cross Validation Results

Algorithm       Accuracy  TPRate  FPRate  Precision  Recall  TP  FP  FN  TN
NB
J48
SVM
SVM (RBF)
(your choice)


Test Case Results

Algorithm       Accuracy  TPRate  FPRate  Precision  Recall  TP  FP  FN  TN
NB
J48
SVM
SVM (RBF)
(your choice)




7. What algorithm did the best and under what conditions?



8. Did the cross validation results indicate accurately what performance on the test
case would be?


9. Briefly describe what the algorithm you chose does.
