Weka tutorial
Speaker:楊明翰
What is Weka?
A collection of machine learning algorithms for data mining tasks.
Weka contains tools for
• data pre-processing,
• classification, regression,
• clustering,
• association rules, and
• visualization.
Suggested version: 3.5.8
How can it help with your hw1?
• Visualization
• Data analysis
• Easy to try different classifiers
But...
If you want better performance, you still have to implement many things yourself, such as cross-validation, parameter selection, and clustering.
P.S. You are free to use any tool to complete the homework.
Explorer
Classifier
Black: built in
Red: supported, but must be downloaded by the user
Installation guide for libsvm:
http://www.cs.iastate.edu/~yasser/wlsvm/
Use Weka in your Java code
The most common components you might want to use are:
– Instances - your data
– Filter - for pre-processing the data
– Classifier/Clusterer - built on the processed data
– Evaluating - how good is the classifier/clusterer?
– Attribute selection - removing irrelevant attributes from your data
ARFF format
@relation KDDCUP
@attribute Ground-Truth {-1.0,1.0}
@attribute Image-Finding-ID numeric
@attribute Study-Finding-ID numeric
@attribute Image-ID numeric
@attribute Study-ID numeric
@attribute LeftBreast {0.0,1.0}
@attribute MLO {0.0,1.0}
@attribute X-location numeric
@attribute Y-location numeric
@attribute X-nipple-location numeric
@attribute Y-nipple-location numeric
@attribute att1 numeric
@attribute att2 numeric
…
@attribute att117 numeric
@attribute serialNumber numeric
@data
-1.0,0.0,0.0,0,150,0.0,0.0,1732.0,2380.0,1356.0,2106.0,-1.196111E-1,4.764423E-2,2.27225E-1,2.511147E-1,-6.94537E-2,-7.478557E-2,5.444844E-1,8.050464E-1,4.708327E-2,1.310514E0,-1.871811E-1,-4.098435E-1,-2.669971E-1,2.50289E-1,-2.438625E-1,8.022098E-2,8.098504E-1,9.880441E-2,3.374689E-4,-6.384426E-1,1.108627E0,1.043443E0,-1.612419E0,-5.633943E-1,-4.357306E-1,-4.572176E-1,8.236916E-2,5.218327E-1,1.922271E-1,4.565068E-1,-8.969028E-1,-4.403602E-1,1.41807E-1,-2.252249E-1,2.34936E-1,6.527024E-1,-5.750284E-1,-5.676962E-1,-5.344064E-1,-1.513411E-1,7.280352E-1,7.21983E-1,6.978422E-1,5.667439E-1,3.273161E-3,-6.958107E-2,7.912039E-1,1.659563E0,1.192391E0,1.173782E0,1.145927E0,1.645195E0,-5.52926E-1,-1.424765E-1,-1.416166E-1,-1.396449E-1,-1.374919E-1,-5.500465E-1,-3.0028E-2,2.788235E-1,1.178261E0,2.937468E-1,3.483202E-1,3.941773E-1,4.250069E-1,3.226059E-1,2.569432E-1,5.522287E-1,1.811639E0,1.844379E0,1.188755E0,1.86738E0,-1.05269E0,1.434895E-2,5.235738E-3,-4.779273E-3,-9.884836E-2,-9.526174E-1,-3.106309E-1,1.434759E0,1.486669E0,3.402836E-1,5.323643E-1,-3.38767E-1,-3.644332E-1,7.650664E-3,3.811143E-2,5.595391E-2,-3.589534E-1,-6.765502E-1,-6.669187E-1,-6.591878E-1,-2.893004E-1,1.048242E0,-7.317548E-1,-1.985699E-1,4.513422E-1,1.06145E0,4.777854E-1,1.267896E0,1.350758E0,1.337705E0,1.385917E0,1.091785E0,1.289325E0,5.511991E-1,-8.125907E-1,1.050196E0,-4.338815E-1,-4.664211E-1,6.203229E-1,-6.020947E-1,5.299978E-1,2.989034E-1,-7.676021E-2,1.5216E-1,-3.001498E-1,0
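The cross-validation example later in these slides refers to attributes by position: the Remove filter's range option counts from 1, while Instance.value(int) counts from 0. The following is a minimal sketch for listing attribute positions from an ARFF header. It is not Weka's own ARFF reader; ArffHeaderSketch and attributeNames are hypothetical names, and the code assumes unquoted attribute names without spaces, as in the header above.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Locale;

class ArffHeaderSketch {
    /** Collects attribute names, in order, from the header lines up to @data. */
    static List<String> attributeNames(List<String> lines) {
        List<String> names = new ArrayList<>();
        for (String line : lines) {
            String trimmed = line.trim();
            String lower = trimmed.toLowerCase(Locale.ROOT);
            if (lower.startsWith("@data")) {
                break; // the header ends where the data section starts
            }
            if (lower.startsWith("@attribute")) {
                // "@attribute <name> <type>": the name is the second token
                String[] parts = trimmed.split("\\s+", 3);
                if (parts.length >= 2) {
                    names.add(parts[1]);
                }
            }
        }
        return names;
    }

    public static void main(String[] args) {
        // First few header lines from the slide, used as sample input
        List<String> header = Arrays.asList(
            "@relation KDDCUP",
            "@attribute Ground-Truth {-1.0,1.0}",
            "@attribute Image-Finding-ID numeric",
            "@attribute Study-Finding-ID numeric",
            "@attribute Image-ID numeric",
            "@attribute Study-ID numeric",
            "@data");
        List<String> names = attributeNames(header);
        for (int i = 0; i < names.size(); i++) {
            System.out.println(i + " (0-based) / " + (i + 1) + " (1-based): " + names.get(i));
        }
    }
}
```

With the full header, Study-ID sits at 0-based index 4 and serialNumber at 1-based position 129, which is how the later example addresses them.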
Instances
import weka.core.Instances;
import java.io.BufferedReader;
import java.io.FileReader;
...
Instances data = new Instances(
    new BufferedReader(new FileReader("/some/where/data.arff")));
// The class index indicates the target attribute used for classification;
// here it is set to the last attribute.
data.setClassIndex(data.numAttributes() - 1);
Filters
import weka.core.Instances;
import weka.filters.Filter;
import weka.filters.unsupervised.attribute.Remove;
...
String[] options = new String[2];
options[0] = "-R"; // "range"
options[1] = "1"; // first attribute
Remove remove = new Remove(); // new instance of filter
remove.setOptions(options); // set options
remove.setInputFormat(data); // inform filter about dataset AFTER setting options
Instances newData = Filter.useFilter(data, remove); // apply filter
Classifier
import weka.classifiers.functions.LibSVM;
...
String[] options = weka.core.Utils.splitOptions(
    "-S 0 -K 2 -D 3 -G 0.0 -R 0.0 -N 0.5 -M 40.0 -C 1.0 -E 0.0010 -P 0.1 -B");
LibSVM classifier = new LibSVM(); // new instance of the LibSVM wrapper
classifier.setOptions(options); // set the options
classifier.buildClassifier(data); // build classifier
Classifying instances
Instances unlabeled = … // load from somewhere
…
for (int i = 0; i < unlabeled.numInstances(); i++) {
    Instance ins = unlabeled.instance(i);
    // predicted label, returned as the index of the class value
    double clsLabel = classifier.classifyInstance(ins);
    // probability for each category
    double[] prob_array = classifier.distributionForInstance(ins);
}
Example: Weka + libsvm + 5-fold CV
import java.io.BufferedReader;
import java.io.FileOutputStream;
import java.io.FileReader;
import java.io.PrintWriter;
import java.util.Random;
import weka.classifiers.functions.LibSVM;
import weka.classifiers.meta.FilteredClassifier;
import weka.core.Instances;
import weka.core.Utils;
import weka.filters.unsupervised.attribute.Remove;

public class CvExample { // class name chosen here only so that the code compiles
    public static void main(String[] args) throws Exception {
        PrintWriter pw_score = new PrintWriter(new FileOutputStream("c:\\temp\\score.txt"));
        PrintWriter pw_label = new PrintWriter(new FileOutputStream("c:\\temp\\label.txt"));
        PrintWriter pw_pid = new PrintWriter(new FileOutputStream("c:\\temp\\pid.txt"));
        Instances data = new Instances(
            new BufferedReader(
                new FileReader("C:\\temp\\TrainSet_sn.arff")));
        Remove remove = new Remove(); // new instance of filter
        // remove attributes 2-11 and 129 (1-based): IDs, view flags, locations, serialNumber
        remove.setOptions(Utils.splitOptions("-R 2-11,129"));
        remove.setInputFormat(data); // inform filter about dataset AFTER setting options
        int seed = 2; // the seed for randomizing the data
        int folds = 5; // the number of folds to generate, >= 2
        data.setClassIndex(0); // first attribute is the ground truth
        Random rand = new Random(seed); // create seeded number generator
        Instances randData = new Instances(data); // create copy of original data
        randData.randomize(rand); // randomize data with number generator
        for (int n = 0; n < folds; n++) {
            Instances train = randData.trainCV(folds, n);
            Instances test = randData.testCV(folds, n);
            System.out.println("Fold " + n + ": train " + train.numInstances()
                + ", test " + test.numInstances());
            String[] options = Utils.splitOptions(
                "-S 0 -K 2 -D 3 -G 0.0 -R 0.0 -N 0.5 -M 40.0 -C 1.0 -E 0.0010 -P 0.1 -B");
            LibSVM classifier = new LibSVM();
            classifier.setOptions(options);
            // FilteredClassifier applies the Remove filter to training and test data alike
            FilteredClassifier fc = new FilteredClassifier();
            fc.setFilter(remove);
            fc.setClassifier(classifier);
            fc.buildClassifier(train);
            for (int i = 0; i < test.numInstances(); i++) {
                double[] tmp = fc.distributionForInstance(test.instance(i));
                // tmp[0]: probability of negative
                // tmp[1]: probability of positive
                pw_label.println(test.instance(i).attribute(0)
                    .value((int) test.instance(i).value(0))); // ground truth
                pw_score.println(tmp[1]); // predicted value
                pw_pid.println((int) test.instance(i).value(4)); // Study-ID (0-based index 4)
            }
        }
        // close the writers so that buffered output is actually written to disk
        pw_score.close();
        pw_label.close();
        pw_pid.close();
    }
}
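To make the fold bookkeeping in the loop above concrete, here is a minimal, Weka-free sketch: a seeded shuffle of row indices followed by a split into k contiguous test blocks, with the remaining rows used for training. It illustrates the general idea only and is not claimed to reproduce the exact split that trainCV/testCV produce; KFoldSketch and its methods are hypothetical names.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Random;

class KFoldSketch {
    /** Returns the row indices 0..n-1 in a seeded random order. */
    static List<Integer> shuffledIndices(int n, long seed) {
        List<Integer> order = new ArrayList<>();
        for (int i = 0; i < n; i++) {
            order.add(i);
        }
        Collections.shuffle(order, new Random(seed));
        return order;
    }

    /** Start position of fold k; fold k covers [foldStart(k), foldStart(k + 1)). */
    static int foldStart(int n, int folds, int k) {
        return (int) ((long) k * n / folds);
    }

    /** Test rows of fold k: one contiguous block of the shuffled order. */
    static List<Integer> testIndices(List<Integer> order, int folds, int k) {
        int n = order.size();
        return new ArrayList<>(order.subList(foldStart(n, folds, k), foldStart(n, folds, k + 1)));
    }

    /** Training rows of fold k: everything outside the test block. */
    static List<Integer> trainIndices(List<Integer> order, int folds, int k) {
        int n = order.size();
        List<Integer> train = new ArrayList<>(order.subList(0, foldStart(n, folds, k)));
        train.addAll(order.subList(foldStart(n, folds, k + 1), n));
        return train;
    }

    public static void main(String[] args) {
        int folds = 5;
        List<Integer> order = shuffledIndices(12, 2L);
        for (int k = 0; k < folds; k++) {
            System.out.println("Fold " + k + ": train " + trainIndices(order, folds, k)
                + ", test " + testIndices(order, folds, k));
        }
    }
}
```

Every row lands in exactly one test block, so each point is scored once by a model that never saw it during training.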
FROC
Algorithm:
1. Load “predicted score”, “ground truth”, and “patient id” for each point.
2. Initialize:
   Detected_patients = [ ], x = 0, y = 0
   Sort the rows in descending order
   (priority: “predicted score” > “ground truth” > “patient id”).
3. For each row,
   If ground truth is negative, x += 1
   Else // a positive point
      If patient is not in Detected_patients, // a new positive patient
         y += 1 and add patient_id to Detected_patients
      Else // patient was found before
         do nothing
4. Normalize
   x => 0 ~ average false alarms per image, i.e. x is divided by the total number of images
   y => 0 ~ 1, i.e. y is divided by the number of patients
5. Calculate the area under the curve.
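The steps above can be sketched in plain Java as follows. This is an illustration under stated assumptions, not the course's ROC tool: FrocSketch, Row, curve, and area are hypothetical names; the caller supplies the image and patient counts used for normalization; the curve is assumed to start at (0, 0); and the area is taken over the whole curve with the trapezoidal rule, since the slides do not say how the area is restricted to a false-alarm range.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

class FrocSketch {
    /** One candidate point: predicted score, ground truth, patient ID. */
    static final class Row {
        final double score;
        final boolean positive;
        final int patientId;

        Row(double score, boolean positive, int patientId) {
            this.score = score;
            this.positive = positive;
            this.patientId = patientId;
        }
    }

    /** Returns the normalized curve as {x, y} pairs: the origin plus one point per row. */
    static List<double[]> curve(List<Row> rows, int totalImages, int totalPatients) {
        List<Row> sorted = new ArrayList<>(rows);
        // Step 2: descending by score, then ground truth, then patient ID
        sorted.sort((a, b) -> {
            int c = Double.compare(b.score, a.score);
            if (c != 0) return c;
            c = Boolean.compare(b.positive, a.positive);
            if (c != 0) return c;
            return Integer.compare(b.patientId, a.patientId);
        });
        Set<Integer> detectedPatients = new HashSet<>();
        int x = 0; // false alarms so far
        int y = 0; // distinct positive patients found so far
        List<double[]> points = new ArrayList<>();
        points.add(new double[] {0.0, 0.0});
        for (Row row : sorted) {
            if (!row.positive) {
                x++; // Step 3: a negative point is a false alarm
            } else if (detectedPatients.add(row.patientId)) {
                y++; // a positive point from a patient not seen before
            }
            // Step 4: normalize both axes
            points.add(new double[] {(double) x / totalImages, (double) y / totalPatients});
        }
        return points;
    }

    /** Step 5: trapezoidal area under the whole curve. */
    static double area(List<double[]> points) {
        double sum = 0.0;
        for (int i = 1; i < points.size(); i++) {
            double[] a = points.get(i - 1);
            double[] b = points.get(i);
            sum += (b[0] - a[0]) * (a[1] + b[1]) / 2.0;
        }
        return sum;
    }

    public static void main(String[] args) {
        List<Row> rows = new ArrayList<>();
        rows.add(new Row(0.9, true, 1));
        rows.add(new Row(0.8, false, 2));
        rows.add(new Row(0.7, true, 1)); // same patient again: y does not grow
        rows.add(new Row(0.6, true, 3));
        rows.add(new Row(0.5, false, 3));
        List<double[]> points = curve(rows, 2, 2);
        System.out.println("area = " + area(points)); // prints "area = 0.75"
    }
}
```

Using a set for Detected_patients keeps the "patient already found" check constant-time, which matters when there are many candidate points per patient.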
FROC tools - Java
java -cp bin mslab.kddcup2008.roc.ROC score.txt label.txt pid.txt
score.txt: predicted score for each point, i.e. the probability of being positive
label.txt: ground truth for each point
pid.txt: patient ID for each point
FROC tools - Matlab
• Matlab function
– [Pd_patient_wise,FA_per_image,AUC] = get_ROC_KDD(p,Y,PID,fa_low,fa_high)
• Pd_patient_wise
– The y location of each point on the curve.
• FA_per_image
– The x location of each point on the curve.
• AUC
• p – Predicted label
• Y – Ground truth
• PID – Patient ID
– plot(FA_per_image,Pd_patient_wise);
FROC curve example
Result of the example above:
• AUC = 0.0782
Measurements by Points:
• TP = 237
• FN = 386
• FP = 108
• TN = 101563
• precision = 0.6870
• recall = 0.3804
• FScore = 0.4897
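The point-wise figures above follow from the counts by the usual definitions: precision = TP / (TP + FP), recall = TP / (TP + FN), and F-score = 2 · precision · recall / (precision + recall). A small sketch that recomputes them from the counts on the slide (PointMetrics is a hypothetical class name, not part of the course tools):

```java
import java.util.Locale;

class PointMetrics {
    /** Fraction of predicted positives that are truly positive. */
    static double precision(long tp, long fp) {
        return (double) tp / (tp + fp);
    }

    /** Fraction of true positives that were predicted positive. */
    static double recall(long tp, long fn) {
        return (double) tp / (tp + fn);
    }

    /** Harmonic mean of precision and recall. */
    static double fScore(long tp, long fp, long fn) {
        double p = precision(tp, fp);
        double r = recall(tp, fn);
        return 2.0 * p * r / (p + r);
    }

    public static void main(String[] args) {
        long tp = 237, fn = 386, fp = 108; // counts from the slide
        System.out.println(String.format(Locale.ROOT, "precision = %.4f", precision(tp, fp))); // precision = 0.6870
        System.out.println(String.format(Locale.ROOT, "recall = %.4f", recall(tp, fn)));       // recall = 0.3804
        System.out.println(String.format(Locale.ROOT, "FScore = %.4f", fScore(tp, fp, fn)));   // FScore = 0.4897
    }
}
```

TN does not enter any of the three measures, which is why they stay informative even though negatives outnumber positives by far in this data.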
References:
Use Weka in your Java code
Generating cross-validation folds
Download:
Example code
Java ROC code
Matlab ROC code

saihw1_weka_tutorial.pptx - Machine Discovery and Social Network ...
