Weka tutorial
Speaker: 楊明翰

saihw1_weka_tutorial.pptx - Machine Discovery and Social Network ...


  1. Weka tutorial
     Speaker: 楊明翰
  2. What is Weka?
     A collection of machine learning algorithms for data mining tasks.
     Weka contains tools for
     • data pre-processing,
     • classification and regression,
     • clustering,
     • association rules, and
     • visualization.
     Suggested version: 3.5.8
  3. How can it help with your hw1?
     • Visualization
     • Data analysis
     • Easy to try different classifiers
     But if you want better performance, you still have to implement many things yourself, such as cross-validation, parameter selection, and clustering.
     P.S. You are free to use anything to complete the homework.
  4. Explorer
  5. Classifier
     Black: built in
     Red: supported, but must be downloaded by the user
     Installation guide for libsvm: http://www.cs.iastate.edu/~yasser/wlsvm/
  6. Use Weka in your Java code
     The most common components you might want to use are
     – Instances - your data
     – Filter - for pre-processing the data
     – Classifier/Clusterer - built on the processed data
     – Evaluating - how good is the classifier/clusterer?
     – Attribute selection - removing irrelevant attributes from your data
  7. ARFF format
     @relation KDDCUP
     @attribute Ground-Truth {-1.0,1.0}
     @attribute Image-Finding-ID numeric
     @attribute Study-Finding-ID numeric
     @attribute Image-ID numeric
     @attribute Study-ID numeric
     @attribute LeftBreast {0.0,1.0}
     @attribute MLO {0.0,1.0}
     @attribute X-location numeric
     @attribute Y-location numeric
     @attribute X-nipple-location numeric
     @attribute Y-nipple-location numeric
     @attribute att1 numeric
     @attribute att2 numeric
     …
     @attribute att117 numeric
     @attribute serialNumber numeric
     @data
     -1.0,0.0,0.0,0,150,0.0,0.0,1732.0,2380.0,1356.0,2106.0,-1.196111E-1,4.764423E-2,2.27225E-1,2.511147E-1,-6.94537E-2,-7.478557E-2,5.444844E-1,8.050464E-1,4.708327E-2,1.310514E0,-1.871811E-1,-4.098435E-1,-2.669971E-1,2.50289E-1,-2.438625E-1,8.022098E-2,8.098504E-1,9.880441E-2,3.374689E-4,-6.384426E-1,1.108627E0,1.043443E0,-1.612419E0,-5.633943E-1,-4.357306E-1,-4.572176E-1,8.236916E-2,5.218327E-1,1.922271E-1,4.565068E-1,-8.969028E-1,-4.403602E-1,1.41807E-1,-2.252249E-1,2.34936E-1,6.527024E-1,-5.750284E-1,-5.676962E-1,-5.344064E-1,-1.513411E-1,7.280352E-1,7.21983E-1,6.978422E-1,5.667439E-1,3.273161E-3,-6.958107E-2,7.912039E-1,1.659563E0,1.192391E0,1.173782E0,1.145927E0,1.645195E0,-5.52926E-1,-1.424765E-1,-1.416166E-1,-1.396449E-1,-1.374919E-1,-5.500465E-1,-3.0028E-2,2.788235E-1,1.178261E0,2.937468E-1,3.483202E-1,3.941773E-1,4.250069E-1,3.226059E-1,2.569432E-1,5.522287E-1,1.811639E0,1.844379E0,1.188755E0,1.86738E0,-1.05269E0,1.434895E-2,5.235738E-3,-4.779273E-3,-9.884836E-2,-9.526174E-1,-3.106309E-1,1.434759E0,1.486669E0,3.402836E-1,5.323643E-1,-3.38767E-1,-3.644332E-1,7.650664E-3,3.811143E-2,5.595391E-2,-3.589534E-1,-6.765502E-1,-6.669187E-1,-6.591878E-1,-2.893004E-1,1.048242E0,-7.317548E-1,-1.985699E-1,4.513422E-1,1.06145E0,4.777854E-1,1.267896E0,1.350758E0,1.337705E0,1.385917E0,1.091785E0,1.289325E0,5.511991E-1,-8.125907E-1,1.050196E0,-4.338815E-1,-4.664211E-1,6.203229E-1,-6.020947E-1,5.299978E-1,2.989034E-1,-7.676021E-2,1.5216E-1,-3.001498E-1,0
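To make the ARFF layout concrete, here is a minimal plain-Java sketch that lists the attribute names declared before the `@data` line. It is not Weka's parser (in practice `new Instances(reader)` does this work); the class `ArffHeaderSketch` and its method are hypothetical names, and the sketch assumes one declaration per line and attribute names without spaces.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical helper, not part of Weka: lists attribute names from an ARFF header. */
final class ArffHeaderSketch {

    /** Returns attribute names in declaration order, stopping at the @data line. */
    static List<String> attributeNames(String arff) {
        List<String> names = new ArrayList<>();
        for (String rawLine : arff.split("\\R")) {
            String line = rawLine.trim();
            String lower = line.toLowerCase();
            if (lower.startsWith("@data")) {
                break; // everything after @data is instance rows
            }
            if (lower.startsWith("@attribute")) {
                // "@attribute <name> <type>": the name is the second whitespace-separated token
                String[] tokens = line.split("\\s+", 3);
                if (tokens.length >= 2) {
                    names.add(tokens[1]);
                }
            }
        }
        return names;
    }

    public static void main(String[] args) {
        String arff = String.join("\n",
            "@relation KDDCUP",
            "@attribute Ground-Truth {-1.0,1.0}",
            "@attribute Image-Finding-ID numeric",
            "@attribute Study-ID numeric",
            "@data",
            "-1.0,0.0,150");
        List<String> names = attributeNames(arff);
        // In this data set the class attribute (Ground-Truth) is declared first.
        System.out.println(names + " classIndex=" + names.indexOf("Ground-Truth"));
    }
}
```

The position of an attribute in this list is the zero-based index that methods such as `setClassIndex` expect.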
  8. Instances
     import weka.core.Instances;
     import java.io.BufferedReader;
     import java.io.FileReader;
     ...
     Instances data = new Instances(
         new BufferedReader(
             new FileReader("/some/where/data.arff")));
     // Set the class attribute. The class index indicates the target attribute used for classification.
     data.setClassIndex(data.numAttributes() - 1);
  9. Filters
     import weka.core.Instances;
     import weka.filters.Filter;
     import weka.filters.unsupervised.attribute.Remove;
     ...
     String[] options = new String[2];
     options[0] = "-R";                                   // "range"
     options[1] = "1";                                    // first attribute
     Remove remove = new Remove();                        // new instance of filter
     remove.setOptions(options);                          // set options
     remove.setInputFormat(data);                         // inform filter about dataset AFTER setting options
     Instances newData = Filter.useFilter(data, remove);  // apply filter
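The `-R` option counts attributes from 1, while Java-side calls such as `setClassIndex` count from 0, which is easy to mix up. The sketch below expands a range string such as the `2-11,129` used in the cross-validation example into zero-based indices. `RangeSketch` is a hypothetical helper, not Weka's own range handling, and it assumes only plain numbers and `a-b` spans (no keywords such as `first` or `last`).

```java
import java.util.SortedSet;
import java.util.TreeSet;

/** Hypothetical helper, not Weka's Range class: expands a 1-based range list. */
final class RangeSketch {

    /** Converts e.g. "2-11,129" into zero-based attribute indices. */
    static SortedSet<Integer> zeroBasedIndices(String range) {
        SortedSet<Integer> indices = new TreeSet<>();
        for (String part : range.split(",")) {
            String[] bounds = part.trim().split("-");
            int first = Integer.parseInt(bounds[0].trim());
            int last = bounds.length > 1 ? Integer.parseInt(bounds[1].trim()) : first;
            for (int oneBased = first; oneBased <= last; oneBased++) {
                indices.add(oneBased - 1); // option strings count from 1, Java code from 0
            }
        }
        return indices;
    }

    public static void main(String[] args) {
        System.out.println(zeroBasedIndices("1"));        // the range used on this slide
        System.out.println(zeroBasedIndices("2-11,129")); // a multi-part range
    }
}
```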
  10. Classifier
      import weka.classifiers.functions.LibSVM;
      ...
      String[] options = weka.core.Utils.splitOptions(
          "-S 0 -K 2 -D 3 -G 0.0 -R 0.0 -N 0.5 -M 40.0 -C 1.0 -E 0.0010 -P 0.1 -B");
      LibSVM classifier = new LibSVM();    // new instance of the SVM classifier
      classifier.setOptions(options);      // set the options
      classifier.buildClassifier(data);    // build classifier
  11. Classifying instances
      Instances unlabeled = … // load from somewhere
      …
      for (int i = 0; i < unlabeled.numInstances(); i++) {
          Instance ins = unlabeled.instance(i);
          double clsLabel = classifier.classifyInstance(ins);             // get the predicted label
          double[] prob_array = classifier.distributionForInstance(ins);  // get the probability for each category
      }
  12. Example: Weka + libsvm + 5-fold CV
      public static void main(String[] args) throws Exception {
          PrintWriter pw_score = new PrintWriter(new FileOutputStream("c:\\temp\\score.txt"));
          PrintWriter pw_label = new PrintWriter(new FileOutputStream("c:\\temp\\label.txt"));
          PrintWriter pw_pid = new PrintWriter(new FileOutputStream("c:\\temp\\pid.txt"));
          Instances data = new Instances(
              new BufferedReader(
                  new FileReader("C:\\temp\\TrainSet_sn.arff")));
          Remove remove = new Remove();                                    // new instance of filter
          remove.setOptions(weka.core.Utils.splitOptions("-R 2-11,129"));  // set options
          remove.setInputFormat(data);                                     // inform filter about dataset AFTER setting options
          int seed = 2;                                                    // the seed for randomizing the data
          int folds = 5;                                                   // the number of folds to generate, >=2
          data.setClassIndex(0);                                           // first attribute is the ground truth
          Instances randData;
          Random rand = new Random(seed);                                  // create seeded number generator
          randData = new Instances(data);                                  // create copy of original data
          randData.randomize(rand);                                        // randomize data with number generator
  13. Example (continued)
          for (int n = 0; n < folds; n++) {
              Instances train = randData.trainCV(folds, n);
              Instances test = randData.testCV(folds, n);
              System.out.println("Fold " + n + " train " + train.numInstances() + " test " + test.numInstances());
              String[] options = weka.core.Utils.splitOptions(
                  "-S 0 -K 2 -D 3 -G 0.0 -R 0.0 -N 0.5 -M 40.0 -C 1.0 -E 0.0010 -P 0.1 -B");
              LibSVM classifier = new LibSVM();
              classifier.setOptions(options);
              FilteredClassifier fc = new FilteredClassifier();
              fc.setFilter(remove);
              fc.setClassifier(classifier);
              fc.buildClassifier(train);
              for (int i = 0; i < test.numInstances(); i++) {
                  double[] tmp = fc.distributionForInstance(test.instance(i));
                  // tmp[0]: probability of negative
                  // tmp[1]: probability of positive
                  pw_label.println(test.instance(i).attribute(0).value((int) test.instance(i).value(0))); // ground truth
                  pw_score.println(tmp[1]);                         // predicted value
                  pw_pid.println((int) test.instance(i).value(4));  // Study-ID
              }
          }
          // Close the writers so that buffered output is written to disk.
          pw_score.close();
          pw_label.close();
          pw_pid.close();
      }
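For each fold `n`, `trainCV(folds, n)` and `testCV(folds, n)` return complementary parts of the shuffled data. The plain-Java sketch below illustrates that idea on row indices only. It is one common way to cut contiguous folds, written for illustration rather than copied from Weka's source, and `KFoldSketch` and its methods are hypothetical names.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Random;

/** Hypothetical illustration of k-fold splitting on row indices (not Weka code). */
final class KFoldSketch {

    /** Row indices of test fold n out of k; folds are contiguous blocks of the shuffled order. */
    static List<Integer> testFold(List<Integer> shuffled, int k, int n) {
        int size = shuffled.size();
        int base = size / k;   // minimum fold size
        int extra = size % k;  // the first `extra` folds get one more row
        int start = n * base + Math.min(n, extra);
        int length = base + (n < extra ? 1 : 0);
        return new ArrayList<>(shuffled.subList(start, start + length));
    }

    /** All rows that are not in test fold n. Assumes the row indices are unique. */
    static List<Integer> trainFold(List<Integer> shuffled, int k, int n) {
        List<Integer> train = new ArrayList<>(shuffled);
        train.removeAll(testFold(shuffled, k, n));
        return train;
    }

    public static void main(String[] args) {
        List<Integer> rows = new ArrayList<>();
        for (int i = 0; i < 12; i++) {
            rows.add(i);
        }
        Collections.shuffle(rows, new Random(2)); // seeded, so the split is repeatable
        int folds = 5;
        for (int n = 0; n < folds; n++) {
            System.out.println("Fold " + n
                + " train " + trainFold(rows, folds, n).size()
                + " test " + testFold(rows, folds, n).size());
        }
    }
}
```

Every row lands in exactly one test fold, so each instance receives exactly one out-of-fold score, which is what the score, label, and patient-ID files collect.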
  14. FROC
      Algorithm:
      1. Load "predicted score", "ground truth", and "patient id".
      2. Initialize: Detected_patients = [ ]
         Sort the rows in descending order (priority: "predicted score" > "ground truth" > "patient id").
      3. For each row:
           If ground truth is negative, x += 1
           Else  // got a positive point
             If patient is not in Detected_patients,  // got a new positive patient
               y += 1 and add patient_id to Detected_patients
             Else  // patient was found before
               do nothing
      4. Normalize:
         x => 0 ~ average false alarms per image, i.e. x is divided by the total number of images
         y => 0 ~ 1, i.e. y is divided by the number of patients
      5. Calculate the area under the curve.
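The steps above can be sketched in plain Java as follows. This is an illustrative reading of the slide, not the course's ROC tool: `FrocSketch`, `Curve`, and the parameter names are hypothetical, a label greater than 0 is assumed to mean positive, the caller is assumed to supply the image and patient counts, and the area is integrated over the whole curve with no limit on the false-alarm range.

```java
import java.util.Arrays;
import java.util.Comparator;
import java.util.HashSet;
import java.util.Set;

/** Hypothetical sketch of the FROC steps listed on the slide. */
final class FrocSketch {

    /** Curve points: x = false alarms per image, y = fraction of patients detected. */
    static final class Curve {
        final double[] x;
        final double[] y;

        Curve(double[] x, double[] y) {
            this.x = x;
            this.y = y;
        }

        /** Area under the curve; every step is either horizontal or vertical. */
        double area() {
            double sum = 0.0;
            for (int k = 1; k < x.length; k++) {
                sum += (x[k] - x[k - 1]) * (y[k] + y[k - 1]) / 2.0;
            }
            return sum;
        }
    }

    static Curve curve(double[] score, int[] truth, int[] pid, int numImages, int numPatients) {
        int n = score.length;
        Integer[] order = new Integer[n];
        for (int i = 0; i < n; i++) {
            order[i] = i;
        }
        // Step 2: sort descending by score, then ground truth, then patient id.
        Arrays.sort(order, Comparator
            .comparingDouble((Integer i) -> score[i])
            .thenComparingInt(i -> truth[i])
            .thenComparingInt(i -> pid[i])
            .reversed());

        Set<Integer> detectedPatients = new HashSet<>();
        double[] x = new double[n + 1]; // index 0 is the origin (0, 0)
        double[] y = new double[n + 1];
        int falseAlarms = 0;
        int hits = 0;
        // Step 3: walk down the ranking, counting false alarms and newly detected patients.
        for (int r = 0; r < n; r++) {
            int i = order[r];
            if (truth[i] <= 0) {
                falseAlarms++;
            } else if (detectedPatients.add(pid[i])) {
                hits++; // first positive point seen for this patient
            }
            // Step 4: normalize both axes.
            x[r + 1] = (double) falseAlarms / numImages;
            y[r + 1] = (double) hits / numPatients;
        }
        return new Curve(x, y);
    }

    public static void main(String[] args) {
        double[] score = {0.9, 0.8, 0.7, 0.6, 0.5};
        int[] truth = {1, -1, 1, 1, -1};
        int[] pid = {10, 20, 10, 30, 20};
        Curve c = curve(score, truth, pid, 2, 2);
        System.out.println("area = " + c.area()); // Step 5
    }
}
```

Because a second positive point from an already detected patient leaves `y` unchanged, the curve rewards finding new patients rather than many points in the same patient.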
  15. FROC tools - Java
      java -cp bin mslab.kddcup2008.roc.ROC score.txt label.txt pid.txt
      score.txt: predicted score for each point, i.e. the probability of being positive
      label.txt: ground truth for each point
      pid.txt: patient ID for each point
  16. FROC tools - Matlab
      • Matlab function
        – [Pd_patient_wise,FA_per_image,AUC] = get_ROC_KDD(p,Y,PID,fa_low,fa_high)
      • Pd_patient_wise – the y location of each point on the curve
      • FA_per_image – the x location of each point on the curve
      • AUC
      • p – predicted label
      • Y – ground truth
      • PID – patient ID
        – plot(FA_per_image,Pd_patient_wise);
  17. FROC curve example
  18. The result of the above example:
      • AUC = 0.0782
      Measurements by points:
      • TP = 237
      • FN = 386
      • FP = 108
      • TN = 101563
      • precision = 0.6870
      • recall = 0.3804
      • FScore = 0.4897
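The point-wise precision, recall, and F-score follow from the TP, FN, and FP counts using the standard definitions, with the F-score taken as the harmonic mean of precision and recall. A minimal sketch, with the hypothetical class name `PointMetricsSketch`:

```java
import java.util.Locale;

/** Hypothetical helper: point-wise metrics from confusion counts. */
final class PointMetricsSketch {

    static double precision(long tp, long fp) {
        return (double) tp / (tp + fp);
    }

    static double recall(long tp, long fn) {
        return (double) tp / (tp + fn);
    }

    /** Harmonic mean of precision and recall. */
    static double fScore(long tp, long fp, long fn) {
        double p = precision(tp, fp);
        double r = recall(tp, fn);
        return 2.0 * p * r / (p + r);
    }

    public static void main(String[] args) {
        long tp = 237, fn = 386, fp = 108; // counts from the slide
        System.out.println(String.format(Locale.ROOT,
            "precision = %.4f, recall = %.4f, FScore = %.4f",
            precision(tp, fp), recall(tp, fn), fScore(tp, fp, fn)));
    }
}
```

TN does not enter any of the three formulas, which is why they stay informative even though negatives outnumber positives by a wide margin here.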
  19. Reference:
      • Use Weka in your Java code
      • Generating cross-validation folds
      Download:
      • Example code
      • Java ROC code
      • Matlab ROC code