The document provides instructions for a machine learning lab using the Weka machine learning software. Students run several classifiers on a dataset of RNA-binding protein sequences to predict whether amino acid residues bind to RNA. The classifiers include Naive Bayes, the J48 decision tree, and support vector machines (SVMs) with linear and RBF kernels. Students record performance metrics from 5-fold cross validation and from testing on a separate protein sequence, and analyze which classifier worked best.
The presentation covers the use of Scalable Predictive Analysis in Critically Ill Patients using a Visual Open Data Analysis Platform (RapidMiner).
With the accumulation of large amounts of health-related data, predictive analytics could stimulate the transformation of reactive medicine towards Predictive, Preventive and Personalized Medicine (PPPM), ultimately affecting both cost and quality of care. However, the high dimensionality and complexity of the data involved prevent data-driven methods from translating easily into clinically relevant models. Additionally, applying cutting-edge predictive methods and data manipulation requires substantial programming skills, limiting direct exploitation by medical domain experts. This leaves a gap between potential and actual data usage. The presentation addresses this problem by focusing on an open, visual environment suited to the medical community (RapidMiner). As a showcase, a framework was developed for the meaningful use of data from critical care patients by integrating the MIMIC-II / III database into a data mining environment (RapidMiner) supporting scalable predictive analytics with visual tools (RapidMiner’s Radoop extension). Guided by the CRoss-Industry Standard Process for Data Mining (CRISP-DM), the ETL (Extract, Transform, Load) process was initiated by retrieving data from the MIMIC-II tables of interest. Using visual tools for ETL on Hadoop and predictive modeling in RapidMiner, robust processes for the automatic building, parameter optimization, and evaluation of various predictive models under different feature selection schemes can be developed. Because these processes can be easily adopted in other projects, this environment is attractive for scalable predictive analytics in health research.
Presentation at Laboratory for Computational Physiology (LCP)
Massachusetts Institute of Technology (MIT),
Building E25 room 101; December 8th 12-noon
Sven Van Poucke, MD, Anesthesiologist, Emergency Physician
Department of Anesthesiology, Intensive Care, Emergency Medicine and Pain Therapy, Ziekenhuis Oost-Limburg, Genk, Belgium
Data Mining with RapidMiner + WEKA - Clustering, by João Gabriel Lima
In this presentation, I give a practical step-by-step guide to clustering and, more importantly, to interpreting the results so they can support decision making.
At the end there is a very interesting review exercise that gives us the opportunity to apply the knowledge acquired.
jgabriel.ufpa@gmail.com
The Frequent Pattern growth (FP-growth) algorithm provides better performance than the Apriori algorithm. It is used to detect frequent itemsets in a database and has two phases: in the first, it constructs a compact prefix tree (the FP-tree); in the second, it mines that tree recursively. The recursive mining process is shown in detail in the presentation, with figures.
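For contrast, a brute-force frequent-itemset miner in Python (over hypothetical toy transactions) shows what FP-growth computes; FP-growth obtains the same itemsets without enumerating candidate combinations:

```python
from itertools import combinations

# Toy transaction database (hypothetical data, for illustration only).
transactions = [
    {"bread", "milk"},
    {"bread", "diapers", "beer", "eggs"},
    {"milk", "diapers", "beer", "cola"},
    {"bread", "milk", "diapers", "beer"},
    {"bread", "milk", "diapers", "cola"},
]
min_support = 3  # an itemset must appear in at least 3 transactions

# Brute-force enumeration of all candidate itemsets. FP-growth reaches the
# same result without candidate generation, by recursively mining an FP-tree.
items = sorted(set().union(*transactions))
frequent = {}
for size in range(1, len(items) + 1):
    for candidate in combinations(items, size):
        count = sum(1 for t in transactions if set(candidate) <= t)
        if count >= min_support:
            frequent[frozenset(candidate)] = count

print(frequent[frozenset({"diapers", "beer"})])  # -> 3
```

On realistic databases this brute-force search is exponential in the number of items, which is exactly the cost FP-growth's tree-based recursion avoids.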
The slides cover:
An Overview of RapidMiner Studio interface
Importing a dataset
Descriptive statistics and visualisation
Data modelling
Model evaluation
Data cleaning
Adding R script
In computer science, all-pairs testing or pairwise testing is a combinatorial method of software testing that, for each pair of input parameters to a system (typically, a software algorithm), tests all possible discrete combinations of those parameters.
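A minimal sketch of the idea in Python, using a greedy covering heuristic over hypothetical parameters (real pairwise tools use more sophisticated generators):

```python
from itertools import combinations, product

# Hypothetical parameters for a configuration under test.
params = {
    "os": ["linux", "windows", "macos"],
    "browser": ["firefox", "chrome"],
    "locale": ["en", "de"],
}

names = list(params)

# Every pair of parameter values that all-pairs testing must cover.
required = set()
for n1, n2 in combinations(names, 2):
    for v1, v2 in product(params[n1], params[n2]):
        required.add(((n1, v1), (n2, v2)))

def pairs_of(test):
    """All value pairs exercised by one concrete test."""
    return {((n1, test[n1]), (n2, test[n2])) for n1, n2 in combinations(names, 2)}

# Greedy covering: repeatedly add the full combination that covers the most
# still-uncovered pairs. Exhaustive testing needs 3*2*2 = 12 tests; pairwise
# coverage typically needs far fewer.
all_tests = [dict(zip(names, vs)) for vs in product(*params.values())]
suite, uncovered = [], set(required)
while uncovered:
    best = max(all_tests, key=lambda t: len(pairs_of(t) & uncovered))
    suite.append(best)
    uncovered -= pairs_of(best)

print(len(suite))  # fewer tests than the 12 exhaustive combinations
```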
Quality Assurance 2: Searching for Bugs, by Marc Miquel
In this presentation we introduce the most useful testing techniques for finding bugs: ad hoc testing, combinatorial testing, test flow diagrams, cleanroom testing, and testing trees.
These slides were prepared by Dr. Marc Miquel. All the materials used in them are referenced to their authors.
Developing a web or mobile application takes a lot of effort, but all that effort can quickly go down the drain if you load test the application improperly or skip testing altogether. Load testing is an important and necessary step in the pre-production stage.
New applications, ones that have not yet made it to the production stage, likely don’t have a performance benchmark established. You don’t typically know what to expect with a new app, which is why before you do a larger load test on any application you first do some baseline testing. This will allow you to establish some benchmarks and pick out any performance issues before you place a larger load on the app. For example, if your app crashes with just five users, you have a problem. Look to the application architects to determine if any service level agreements have been set for the application during design.
Once you have done some baseline testing you are ready to load test your application to determine its performance levels under heavier load. Here are 5 essential tips for starting load testing on an application.
Ever tried doing test-first Test-Driven Development? Ever failed? TDD is not easy to get right. Here is some practical advice on doing BDD and TDD correctly. This presentation explains why, what, and how you should test; covers the FIRST principles of tests, the connection between unit testing and the SOLID principles, writing testable code, test doubles, and the AAA pattern of unit testing; and offers some practical ideas about structuring tests.
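The AAA (Arrange-Act-Assert) structure mentioned above can be illustrated with a minimal Python `unittest` sketch (the function under test is a hypothetical example):

```python
import unittest

def word_count(text):
    """Unit under test (a hypothetical helper)."""
    return len(text.split())

class TestWordCount(unittest.TestCase):
    def test_counts_whitespace_separated_words(self):
        # Arrange: prepare the input and expectations.
        text = "to be or not to be"
        # Act: exercise the unit under test exactly once.
        result = word_count(text)
        # Assert: verify the observable outcome.
        self.assertEqual(result, 6)

# Run the suite programmatically; from a script, unittest.main() works too.
suite = unittest.defaultTestLoader.loadTestsFromTestCase(TestWordCount)
result = unittest.TextTestRunner(verbosity=0).run(suite)
```

Keeping the three phases visually distinct makes each test read as a small specification: one setup, one action, one observable expectation.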
Simulating data to gain insights into power and p-hacking, by Dorothy Bishop
Very basic introduction to simulating data to illustrate issues affecting reproducibility. Uses Excel and R, but assumes no prior knowledge of R. Please let me know of errors or things that need better explanation.
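As a flavor of such a simulation, the Python sketch below (a stdlib-only illustration, not taken from the slides) estimates false positive rates under the null hypothesis, with and without optional stopping, a common form of p-hacking:

```python
import math
import random

random.seed(1)

def p_value(a, b):
    """Two-sided p-value from a large-sample z test for a difference in
    means (normal approximation; reasonable for n around 30 or more)."""
    na, nb = len(a), len(b)
    ma, mb = sum(a) / na, sum(b) / nb
    va = sum((x - ma) ** 2 for x in a) / (na - 1)
    vb = sum((x - mb) ** 2 for x in b) / (nb - 1)
    z = (ma - mb) / math.sqrt(va / na + vb / nb)
    return 2 * (1 - 0.5 * (1 + math.erf(abs(z) / math.sqrt(2))))

n_sims, alpha = 2000, 0.05
plain = hacked = 0
for _ in range(n_sims):
    a = [random.gauss(0, 1) for _ in range(60)]
    b = [random.gauss(0, 1) for _ in range(60)]  # same population: no true effect
    if p_value(a, b) < alpha:
        plain += 1
    # "p-hacking" by optional stopping: peek after the first 30 per group as
    # well, and count a success if either look is significant.
    if p_value(a[:30], b[:30]) < alpha or p_value(a, b) < alpha:
        hacked += 1

print(plain / n_sims, hacked / n_sims)  # the hacked rate exceeds the nominal 5%
```

The plain rate hovers near the nominal alpha of 5%, while taking two looks at the data inflates the false positive rate, which is the core reproducibility point the slides illustrate.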
Approaches to unraveling a complex test problem, by Johan Hoberg
When testing a complex system you are often faced with complex test problems. Cause and effect cannot be deduced in advance, only in retrospect.
According to the Cynefin framework, the general approach to tackle complexity is probe-sense-respond. Try something, analyze the outcome, and based on that outcome, try something else. This is the basis of all my approaches to begin unraveling complex test problems. But how do I select my test scope for a specific complex test problem?
This publication helps software engineering students understand the basics of software testing. Software testing is an inevitable process in software development.
BCB 444/544
Lab 10 (11/8/07)
Machine Learning
Due Monday 11/12/07 by 5 pm – email to terrible@iastate.edu
Objectives
1. Experiment with applying machine learning algorithms to biological problems.
2. Learn about how to set up a machine learning experiment.
Introduction
Machine learning combines principles from computer science, statistics, psychology, and
other disciplines to develop computer programs for specific tasks. The tasks that
machine learning programs have been developed for vary widely, from diagnosing cancer
to driving a car. In biology, machine learning approaches are very popular for problems
such as protein secondary structure prediction, gene prediction, analyzing microarray
data, and many others. Machine learning is often quite effective, especially on problems
that have a lot of data available. Molecular biology certainly has lots of data.
A note about our training and test set files:
The data set we are using in this lab is a set of RNA-binding proteins. Our input data is
15 amino acids from the protein sequence and a label of 1 or 0 indicating whether the
central amino acid in the list of 15 binds to RNA or not (1 means binding, 0 means not
binding). The training set contains an equal number of RNA-binding and non-binding
residues, which is not the natural distribution. The entire data set contains only about
20% binding residues. We are using a set with equal numbers of binding and non-
binding residues to make things a little easier. The test set is a single protein sequence,
the 50S ribosomal protein L20 from E. coli. We will use this test case to see how well
our classifiers we build perform on a protein sequence not in the training set.
Exercises
Before we get started on the exercises, we need to learn a little about machine learning
experiments. The first concept is training and testing. In order to estimate the
performance of any classifier, we need to train the classifier on some data and then
measure performance on some other data. There are a few ways to do this. In the lab
today, we will use cross validation and a separate test set.
Go to http://en.wikipedia.org/wiki/Cross_validation and read about cross validation.
1. What is K-fold cross validation? What data is used for training? What data is
used for testing?
The most important point in training and testing is that the same data can never be used
in both the training set and the test set. Usually, we have limited data and want to use as
much as possible in training, which is why we do cross validation experiments. In our
lab today, we will give ourselves the luxury of a test case that is not in the training set to
test our classifiers on.
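For intuition, the fold splitting behind k-fold cross validation can be sketched in a few lines of Python (Weka handles this internally; the helper below is a hypothetical illustration):

```python
import random

def k_fold_splits(n_samples, k, seed=0):
    """Yield (train_idx, test_idx) pairs for k-fold cross validation:
    each fold is used exactly once for testing while the remaining
    k-1 folds form the training set."""
    idx = list(range(n_samples))
    random.Random(seed).shuffle(idx)        # shuffle once, then partition
    folds = [idx[i::k] for i in range(k)]   # k roughly equal folds
    for i in range(k):
        test = folds[i]
        train = [j for f in folds[:i] + folds[i + 1:] for j in f]
        yield train, test

splits = list(k_fold_splits(20, 5))
print(len(splits))  # -> 5 splits; every sample lands in exactly one test fold
```

Because every sample is tested exactly once and never appears in the training set of its own fold, the averaged performance over the k folds is an honest estimate while still using all the data for training at some point.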
Algorithms
Read the sections on Naïve Bayes, J48 Decision tree, and SVM here:
http://www.d.umn.edu/~padhy005/Chapter5.html
2. What assumption is used in the Naïve Bayes classifier?
3. What criterion does the decision tree classifier use to decide which attribute to
put first in the decision tree?
4. What is the purpose of the kernel function in a SVM classifier?
5. Based on what you read, which method(s) can a human interpret? What
method(s) can a human not interpret, i.e., “black box” method(s)?
6. According to this web page, which algorithm tends to have the highest
classification accuracy?
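For intuition about the Naïve Bayes independence assumption in question 2, here is a minimal Bernoulli Naive Bayes sketch in Python on hypothetical toy data (Weka's implementation differs in detail):

```python
from collections import Counter, defaultdict
from math import log

def train_nb(X, y):
    """Train a Bernoulli Naive Bayes model: under the class-conditional
    independence assumption, P(x | c) factorizes into per-feature terms."""
    classes = Counter(y)
    counts = defaultdict(Counter)  # counts[c][j] = #examples of class c with x[j] == 1
    for xi, yi in zip(X, y):
        for j, v in enumerate(xi):
            counts[yi][j] += v
    return classes, counts

def predict_nb(model, x):
    classes, counts = model
    total = sum(classes.values())
    best, best_lp = None, float("-inf")
    for c, nc in classes.items():
        lp = log(nc / total)                     # log prior
        for j, v in enumerate(x):
            p1 = (counts[c][j] + 1) / (nc + 2)   # Laplace smoothing
            lp += log(p1 if v else 1 - p1)       # independent per-feature terms
        if lp > best_lp:
            best, best_lp = c, lp
    return best

# Toy data (hypothetical): 3 binary features, labels 1/0.
X = [(1, 1, 0), (1, 0, 1), (1, 1, 1), (0, 0, 0), (0, 1, 0), (0, 0, 1)]
y = [1, 1, 1, 0, 0, 0]
model = train_nb(X, y)
print(predict_nb(model, (1, 1, 0)))  # -> 1
```

The summation of per-feature log probabilities is exactly where the independence assumption enters: the joint likelihood is treated as a product of per-feature likelihoods given the class.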
Experiments
In this lab, we will be using the program Weka. Weka is a program that contains
implementations of many machine learning algorithms in a standard framework that
makes it easy to experiment with many methods. If you are in the computer lab in 1340
MBB, Weka is already installed on the machines. If you are working from home, you
will have to download and install Weka.
Weka is available at:
http://www.cs.waikato.ac.nz/ml/weka/
The instructions should be fairly easy to follow for installing Weka on your computer. If
you have trouble, send me an email and I may be able to help you. Or come into the lab
and use the machines here. The lab in MBB is open most of the time; our class is the
only one that currently uses this room.
Running Weka:
Some final notes on what we will do with Weka before the instructions for how to do it.
First, Weka implements a lot of different algorithms. We will use Naïve Bayes, J48
decision tree, and SVM and then you will get to choose a fourth algorithm. Each
algorithm has quite a few parameters that can be changed, and as we have seen all
semester, changing parameters can drastically change the results. That being said, we
will accept the default parameters for all of the algorithms, with only one small tweak
that will be described later.
Second, Weka allows you to run cross validation experiments or use a supplied test set
(along with some other options). We will do both in this lab.
Finally, a short primer on how to read the results that Weka gives you. A typical results
section looks like this:
The output includes a lot of information, some useful, some not so useful. The top of the
output shows the information about your data set, the algorithm used, information about
the model produced after training (which is really only useful if you are using an
algorithm that is interpretable by humans), and finally the amount of time it took to build
the model. The most useful section for our purposes is at the bottom, which has the
performance statistics. For this lab, all of the results you are required to fill into the
tables can be read directly from the output as long as you know where to find them. The
table asks for accuracy, which in the Weka output is listed as “Correctly Classified
Instances.” The next entries in your results table are TPRate, FPRate, Precision, and
Recall, which can be read in the section called “Detailed Accuracy By Class.” These
values are listed for both classes (1 and 0 for RNA-binding and non-RNA-binding
respectively). I only want to see the values for class 1.
The final numbers for your results table are TP (true positive), FP (false positive), FN
(false negative), and TN (true negative). TP means we predicted RNA-binding and it
actually is RNA-binding. FP means we predicted RNA-binding and it is not actually
RNA-binding. FN means we predicted non-RNA-binding and it actually is RNA-
binding. TN means we predicted non-RNA-binding and it actually is non-RNA-binding.
Our correct predictions are TP and TN, our incorrect predictions are FP and FN. The
counts of TP, FP, FN, and TN can be found in the section called “Confusion Matrix.”
Our confusion matrix shows four numbers, the top left corner shows the number of TP
predictions, the top right number is FP, bottom left is FN, and bottom right is TN.
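Every column in the results tables can be derived directly from the four confusion matrix counts; a small Python sketch (using hypothetical counts, not lab data) shows the relationships:

```python
def metrics(tp, fp, fn, tn):
    """Derive the lab's table columns from confusion matrix counts.
    TPRate equals Recall; FPRate is the fraction of actual negatives
    incorrectly predicted as positive."""
    accuracy = (tp + tn) / (tp + fp + fn + tn)
    tp_rate = tp / (tp + fn)   # also called recall or sensitivity
    fp_rate = fp / (fp + tn)
    precision = tp / (tp + fp)
    recall = tp_rate
    return accuracy, tp_rate, fp_rate, precision, recall

# Hypothetical counts, for illustration only.
acc, tpr, fpr, prec, rec = metrics(tp=40, fp=10, fn=20, tn=130)
print(round(acc, 3), round(tpr, 3), round(fpr, 3), round(prec, 3))
```

This is also a handy sanity check: the four counts you copy from Weka's confusion matrix should reproduce the accuracy, TPRate, FPRate, precision, and recall that Weka reports for class 1.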
Finally, on to running some programs. For the lab machines, you can simply double-
click on the weka.jar file on the desktop.
Click on the button that says “Explorer” to get started.
We will use the following files:
Training set
Test set
Click on the Open file button and choose the training set file.
Click on the Classify tab to get to the classification algorithms.
To choose the algorithm, click on the Choose button near the top in the classifier section.
Click on bayes, then NaiveBayes. Be sure that Cross validation is selected, and change
the number in the box from 10 to 5. Then click on the Start button to run the classifier.
Record the performance in the table below.
To run the predictions on the test case, click the circle next to Supplied test set, then click
the Set… button and choose the test file. Then click Start to build the classifier and make
predictions on our test case. Record the performance in the table below.
For our next algorithm, we will use the J48 decision tree. To choose the algorithm, click
on the Choose button near the top in the classifier section. Click on trees, then J48. Be
sure that Cross validation is selected, and make sure the number in the box is 5. Then
click on the Start button to run the classifier. Record the performance in the table below.
To run the predictions on the test case, click the circle next to Supplied test set, then click
the Set… button and choose the test file. Then click Start to build the classifier and make
predictions on our test case. Record the performance in the table below.
For our next algorithm, we will use a SVM. To choose the algorithm, click on the
Choose button near the top in the classifier section. Click on functions, then SMO. Be
sure that Cross validation is selected, and make sure the number in the box is 5. Then
click on the Start button to run the classifier. Record the performance in the table below.
To run the predictions on the test case, click the circle next to Supplied test set, then click
the Set… button and choose the test file. Then click Start to build the classifier and make
predictions on our test case. Record the performance in the table below.
Next, we will run the SVM algorithm again using a different kernel function. To change
the kernel function, click on the text box next to the Choose button at the top. This will
bring up a window showing the algorithm parameters. At the bottom of this window,
there is a line that says “useRBF.” Change the value in this box to true to use the RBF
kernel function. Click OK. Be sure that Cross validation is selected, and make sure the
number in the box is 5. Then click on the Start button to run the classifier. Record the
performance in the table below.
To run the predictions on the test case, click the circle next to Supplied test set, then click
the Set… button and choose the test file. Then click Start to build the classifier and make
predictions on our test case. Record the performance in the table below.
Finally, choose a different algorithm and run both 5 fold cross validation and predictions
on the test case, and add the performance values to the tables in the blank line at the
bottom. Be sure to include the name of the algorithm you chose. You can choose any of
the algorithms available in Weka.
(Side note – some of the algorithms will not work on this data set. They will produce an
error message saying something about incompatibility. If this happens to you, simply
choose a different algorithm. Also, some algorithms run much faster than others. If the
algorithm you chose is taking much longer than our SVM runs, you may want to choose a
different algorithm.)
To find out more about the algorithms, you can select the algorithm with the Choose
button at the top, then click on the text box near the Choose button (just like we did for
the SVM when we changed to the RBF kernel). In the window that opens up, there is a
section called “About” that gives a one line description of the algorithm. Click on the
More button in this section to get more information, including a reference to a paper
describing the algorithm. Another option for finding out more about the algorithm is to
do an internet search with the name of the algorithm.
5 Fold Cross Validation Results

Algorithm       Accuracy  TPRate  FPRate  Precision  Recall  TP  FP  FN  TN
NB
J48
SVM
SVM (RBF)
(your choice)
Test Case Results

Algorithm       Accuracy  TPRate  FPRate  Precision  Recall  TP  FP  FN  TN
NB
J48
SVM
SVM (RBF)
(your choice)
7. What algorithm did the best and under what conditions?
8. Did the cross validation results indicate accurately what performance on the test
case would be?
9. Briefly describe what the algorithm you chose does.