
saihw1_weka_tutorial.pptx - Machine Discovery and Social Network ...


Transcript

  • 1. Weka tutorial
    Speaker: 楊明翰
  • 2. What is Weka?
    A collection of machine learning algorithms for data mining tasks
    Weka contains tools for
    data pre-processing,
    classification, regression,
    clustering,
    association rules, and
    visualization.
    Suggested version: 3.5.8
  • 3. How can it help with your hw1?
    Visualization
    Data analysis
    Easy to try different classifiers
    But...
    If you want better performance, you still have to implement many things yourself, such as cross-validation, parameter selection, and clustering.
    P.S. You are free to use anything to complete the homework.
  • 4. Explorer
  • 5. Classifier
    Black: built in
    Red: supported, but must be downloaded by the user
    Installation guide for libsvm:
    http://www.cs.iastate.edu/~yasser/wlsvm/
  • 6. Use Weka in your Java code
    The most common components you might want to use are:
    Instances - your data
    Filter - for pre-processing the data
    Classifier/Clusterer - built on the processed data
    Evaluating - how good is the classifier/clusterer?
    Attribute selection - removing irrelevant attributes from your data
  • 7. Arff format
    @relation KDDCUP
    @attribute Ground-Truth {-1.0,1.0}
    @attribute Image-Finding-ID numeric
    @attribute Study-Finding-ID numeric
    @attribute Image-ID numeric
    @attribute Study-ID numeric
    @attribute LeftBreast {0.0,1.0}
    @attribute MLO {0.0,1.0}
    @attribute X-location numeric
    @attribute Y-location numeric
    @attribute X-nipple-location numeric
    @attribute Y-nipple-location numeric
    @attribute att1 numeric
    @attribute att2 numeric

    @attribute att117 numeric
    @attribute serialNumber numeric
    @data
    -1.0,0.0,0.0,0,150,0.0,0.0,1732.0,2380.0,1356.0,2106.0,-1.196111E-1,4.764423E-2,2.27225E-1,2.511147E-1,-6.94537E-2,-7.478557E-2,5.444844E-1,8.050464E-1,4.708327E-2,1.310514E0,-1.871811E-1,-4.098435E-1,-2.669971E-1,2.50289E-1,-2.438625E-1,8.022098E-2,8.098504E-1,9.880441E-2,3.374689E-4,-6.384426E-1,1.108627E0,1.043443E0,-1.612419E0,-5.633943E-1,-4.357306E-1,-4.572176E-1,8.236916E-2,5.218327E-1,1.922271E-1,4.565068E-1,-8.969028E-1,-4.403602E-1,1.41807E-1,-2.252249E-1,2.34936E-1,6.527024E-1,-5.750284E-1,-5.676962E-1,-5.344064E-1,-1.513411E-1,7.280352E-1,7.21983E-1,6.978422E-1,5.667439E-1,3.273161E-3,-6.958107E-2,7.912039E-1,1.659563E0,1.192391E0,1.173782E0,1.145927E0,1.645195E0,-5.52926E-1,-1.424765E-1,-1.416166E-1,-1.396449E-1,-1.374919E-1,-5.500465E-1,-3.0028E-2,2.788235E-1,1.178261E0,2.937468E-1,3.483202E-1,3.941773E-1,4.250069E-1,3.226059E-1,2.569432E-1,5.522287E-1,1.811639E0,1.844379E0,1.188755E0,1.86738E0,-1.05269E0,1.434895E-2,5.235738E-3,-4.779273E-3,-9.884836E-2,-9.526174E-1,-3.106309E-1,1.434759E0,1.486669E0,3.402836E-1,5.323643E-1,-3.38767E-1,-3.644332E-1,7.650664E-3,3.811143E-2,5.595391E-2,-3.589534E-1,-6.765502E-1,-6.669187E-1,-6.591878E-1,-2.893004E-1,1.048242E0,-7.317548E-1,-1.985699E-1,4.513422E-1,1.06145E0,4.777854E-1,1.267896E0,1.350758E0,1.337705E0,1.385917E0,1.091785E0,1.289325E0,5.511991E-1,-8.125907E-1,1.050196E0,-4.338815E-1,-4.664211E-1,6.203229E-1,-6.020947E-1,5.299978E-1,2.989034E-1,-7.676021E-2,1.5216E-1,-3.001498E-1,0
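To make the layout of the header concrete, here is a minimal plain-Java sketch that collects the attribute names from the `@attribute` lines and stops at `@data`. It is only an illustration: in practice `weka.core.Instances` does the parsing. The class and method names (`ArffHeaderSketch`, `attributeNames`) are hypothetical, and the sketch assumes unquoted attribute names without spaces, as in the header above.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

final class ArffHeaderSketch {
    /** Returns the names declared by @attribute lines, in declaration order. */
    static List<String> attributeNames(List<String> lines) {
        List<String> names = new ArrayList<>();
        for (String raw : lines) {
            String line = raw.trim();
            String lower = line.toLowerCase();
            if (lower.startsWith("@data")) {
                break; // everything after @data is comma-separated rows
            }
            if (lower.startsWith("@attribute")) {
                // "@attribute <name> <type>": the name is the second token
                String[] parts = line.split("\\s+", 3);
                if (parts.length >= 2) {
                    names.add(parts[1]);
                }
            }
        }
        return names;
    }

    public static void main(String[] args) {
        List<String> header = Arrays.asList(
                "@relation KDDCUP",
                "@attribute Ground-Truth {-1.0,1.0}",
                "@attribute Image-Finding-ID numeric",
                "@data");
        System.out.println(attributeNames(header));
    }
}
```

Each data row then supplies one comma-separated value per declared attribute, in the same order.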
  • 8. Instances
    import weka.core.Instances;
    import java.io.BufferedReader;
    import java.io.FileReader;
    ...
    Instances data = new Instances( new BufferedReader( new FileReader("/some/where/data.arff")));
    // setting class attribute
    data.setClassIndex(data.numAttributes() - 1);
    // The class index indicates the target attribute used for classification.
  • 9. filters
    import weka.core.Instances;
    import weka.filters.Filter;
    import weka.filters.unsupervised.attribute.Remove;
    ...
    String[] options = new String[2];
    options[0] = "-R"; // "range"
    options[1] = "1"; // first attribute
    Remove remove = new Remove(); // new instance of filter
    remove.setOptions(options); // set options
    remove.setInputFormat(data); // inform filter about dataset AFTER setting options
    Instances newData = Filter.useFilter(data, remove); // apply filter
  • 10. classifier
    import weka.classifiers.functions.LibSVM;
    ...
    String[] options = weka.core.Utils.splitOptions("-S 0 -K 2 -D 3 -G 0.0 -R 0.0 -N 0.5 -M 40.0 -C 1.0 -E 0.0010 -P 0.1 -B");
    LibSVM classifier = new LibSVM(); // new instance of the LibSVM classifier
    classifier.setOptions(options); // set the options
    classifier.buildClassifier(data); // build classifier
  • 11. Classifying instances
    Instances unlabeled = …; // load from somewhere

    for (int i = 0; i < unlabeled.numInstances(); i++) {
        Instance ins = unlabeled.instance(i);
        double clsLabel = classifier.classifyInstance(ins); // get the predicted label
        double[] prob_array = classifier.distributionForInstance(ins);
        // get the probability of each category
    }
  • 12. Example: Weka + libsvm + 5-fold CV
    import java.io.BufferedReader;
    import java.io.FileOutputStream;
    import java.io.FileReader;
    import java.io.PrintWriter;
    import java.util.Random;
    import weka.classifiers.functions.LibSVM;
    import weka.classifiers.meta.FilteredClassifier;
    import weka.core.Instances;
    import weka.filters.unsupervised.attribute.Remove;
    ...
    public static void main(String[] args) throws Exception {
        PrintWriter pw_score = new PrintWriter(new FileOutputStream("c:\\temp\\score.txt"));
        PrintWriter pw_label = new PrintWriter(new FileOutputStream("c:\\temp\\label.txt"));
        PrintWriter pw_pid = new PrintWriter(new FileOutputStream("c:\\temp\\pid.txt"));
        Instances data = new Instances(
            new BufferedReader(
                new FileReader("C:\\temp\\trainSet_sn.arff")));
        Remove remove = new Remove(); // new instance of filter
        remove.setOptions(weka.core.Utils.splitOptions("-R 2-11,129")); // drop the ID/location attributes and serialNumber
        remove.setInputFormat(data); // inform filter about dataset AFTER setting options
        int seed = 2; // the seed for randomizing the data
        int folds = 5; // the number of folds to generate, >= 2
        data.setClassIndex(0); // first attribute is the ground truth
        Instances randData;
        Random rand = new Random(seed); // create seeded number generator
        randData = new Instances(data); // create copy of original data
        randData.randomize(rand); // randomize data with number generator
  • 13. for (int n = 0; n < folds; n++) {
        Instances train = randData.trainCV(folds, n);
        Instances test = randData.testCV(folds, n);
        System.out.println("Fold " + n + ": train " + train.numInstances() + ", test " + test.numInstances());
        String[] options = weka.core.Utils.splitOptions("-S 0 -K 2 -D 3 -G 0.0 -R 0.0 -N 0.5 -M 40.0 -C 1.0 -E 0.0010 -P 0.1 -B");
        LibSVM classifier = new LibSVM();
        classifier.setOptions(options);
        // FilteredClassifier applies the Remove filter to both training and test data
        FilteredClassifier fc = new FilteredClassifier();
        fc.setFilter(remove);
        fc.setClassifier(classifier);
        fc.buildClassifier(train);
        for (int i = 0; i < test.numInstances(); i++) {
            double[] tmp = fc.distributionForInstance(test.instance(i));
            // tmp[0]: probability of negative
            // tmp[1]: probability of positive
            pw_label.println(test.instance(i).attribute(0).value((int) test.instance(i).value(0))); // ground truth
            pw_score.println(tmp[1]); // predicted value
            pw_pid.println((int) test.instance(i).value(4)); // Study-ID
        }
    }
    // close the writers so that all output is flushed to disk
    pw_score.close();
    pw_label.close();
    pw_pid.close();
    }
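Since cross-validation is one of the things you may end up implementing yourself, the following plain-Java sketch shows one way to cut shuffled data into folds of contiguous rows. It illustrates the idea behind the `trainCV`/`testCV` calls above and is not Weka's code, so its row assignment need not match Weka's exactly. The names `KFoldSketch`, `foldStart`, `testIndices` and `trainIndices` are hypothetical.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Random;

final class KFoldSketch {
    /** First row position of test fold n; the first (size % folds) folds get one extra row. */
    static int foldStart(int size, int folds, int n) {
        return n * (size / folds) + Math.min(n, size % folds);
    }

    /** Row positions held out for testing in fold n (0-based). */
    static List<Integer> testIndices(int size, int folds, int n) {
        List<Integer> idx = new ArrayList<>();
        for (int i = foldStart(size, folds, n); i < foldStart(size, folds, n + 1); i++) {
            idx.add(i);
        }
        return idx;
    }

    /** Row positions used for training in fold n: everything outside the test block. */
    static List<Integer> trainIndices(int size, int folds, int n) {
        int from = foldStart(size, folds, n);
        int to = foldStart(size, folds, n + 1);
        List<Integer> idx = new ArrayList<>();
        for (int i = 0; i < size; i++) {
            if (i < from || i >= to) {
                idx.add(i);
            }
        }
        return idx;
    }

    public static void main(String[] args) {
        int size = 12, folds = 5;
        List<Integer> rows = new ArrayList<>();
        for (int i = 0; i < size; i++) {
            rows.add(i);
        }
        Collections.shuffle(rows, new Random(2)); // seeded shuffle, so the split is repeatable
        for (int n = 0; n < folds; n++) {
            List<Integer> test = new ArrayList<>();
            for (int pos : testIndices(size, folds, n)) {
                test.add(rows.get(pos));
            }
            System.out.println("Fold " + n + ": test rows " + test);
        }
    }
}
```

Every row lands in exactly one test fold, so each point receives exactly one out-of-fold prediction.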
  • 14. FROC
    Algorithm:
    Load "predicted score", "ground truth", and "patient ID" for every point.
    Initialize:
        x = 0, y = 0, Detected_patients = [ ]
    Sort the rows in descending order, with priority "predicted score" > "ground truth" > "patient ID".
    For each row:
        If the ground truth is negative: x += 1
        Else: // a positive point
            If the patient is not in Detected_patients: // a new positive patient
                y += 1 and add patient_id to Detected_patients
            Else: // the patient was already found
                do nothing
        Record the point (x, y).
    Normalize:
        x => 0 ~ average false alarms per image, i.e. x is divided by the total number of images
        y => 0 ~ 1, i.e. y is divided by the number of patients
    Calculate the area under the curve.
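The steps above can be sketched in plain Java as follows. This is an illustration under stated assumptions, not the course's ROC tool: the names `FrocSketch`, `Row`, `curve` and `area` are hypothetical, `totalPatients` is assumed to be the number of patients with at least one positive point, and the area is taken over the whole curve (no restriction to a false-alarm range).

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

final class FrocSketch {
    /** One candidate point: predicted score, ground truth, patient ID. */
    static final class Row {
        final double score;
        final boolean positive;
        final int patientId;

        Row(double score, boolean positive, int patientId) {
            this.score = score;
            this.positive = positive;
            this.patientId = patientId;
        }
    }

    /** Returns one {falseAlarmsPerImage, detectedPatientFraction} point per row. */
    static List<double[]> curve(List<Row> rows, int totalImages, int totalPatients) {
        List<Row> sorted = new ArrayList<>(rows);
        // descending by score, then ground truth, then patient ID
        sorted.sort((a, b) -> {
            int c = Double.compare(b.score, a.score);
            if (c != 0) return c;
            c = Boolean.compare(b.positive, a.positive);
            if (c != 0) return c;
            return Integer.compare(b.patientId, a.patientId);
        });
        Set<Integer> detected = new HashSet<>();
        int x = 0, y = 0;
        List<double[]> points = new ArrayList<>();
        for (Row r : sorted) {
            if (!r.positive) {
                x++; // a false alarm
            } else if (detected.add(r.patientId)) {
                y++; // first positive point for this patient
            }
            points.add(new double[] {(double) x / totalImages, (double) y / totalPatients});
        }
        return points;
    }

    /** Area under the staircase curve: each step to the right adds width * current height. */
    static double area(List<double[]> points) {
        double area = 0, prevX = 0, prevY = 0;
        for (double[] p : points) {
            area += (p[0] - prevX) * prevY;
            prevX = p[0];
            prevY = p[1];
        }
        return area;
    }

    public static void main(String[] args) {
        List<Row> rows = new ArrayList<>();
        rows.add(new Row(0.9, true, 1));
        rows.add(new Row(0.8, false, 2));
        rows.add(new Row(0.6, true, 3));
        System.out.println("AUC = " + area(curve(rows, 2, 2)));
    }
}
```

Because every row moves the curve either right or up but never both, the rectangle sum used here equals the trapezoidal area.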
  • 15. FROC tools: Java
    java -cp bin mslab.kddcup2008.roc.ROC score.txt label.txt pid.txt
    score.txt: predicted score for each point, i.e. the probability of being positive
    label.txt: ground truth for each point
    pid.txt: patient ID for each point
  • 16. FROC tools: Matlab
    Matlab function:
    [Pd_patient_wise,FA_per_image,AUC] = get_ROC_KDD(p,Y,PID,fa_low,fa_high)
    Outputs:
    Pd_patient_wise – the y location of each point on the curve
    FA_per_image – the x location of each point on the curve
    AUC – the area under the curve
    Inputs:
    p – predicted label
    Y – ground truth
    PID – patient ID
    Plotting the curve:
    plot(FA_per_image,Pd_patient_wise);
  • 17. FROC curve example
  • 18. Result of the above example:
    AUC = 0.0782
    Measurements by Points:
    TP = 237
    FN = 386
    FP = 108
    TN = 101563
    precision = 0.6870
    recall = 0.3804
    FScore = 0.4897
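The point-wise measurements follow directly from the counts. A small check in plain Java, using the standard definitions of precision, recall and F-score (the class name `PointMetrics` is hypothetical):

```java
import java.util.Locale;

final class PointMetrics {
    /** Fraction of predicted positives that are truly positive. */
    static double precision(long tp, long fp) {
        return (double) tp / (tp + fp);
    }

    /** Fraction of true positives that were found. */
    static double recall(long tp, long fn) {
        return (double) tp / (tp + fn);
    }

    /** Harmonic mean of precision and recall. */
    static double fScore(double precision, double recall) {
        return 2 * precision * recall / (precision + recall);
    }

    public static void main(String[] args) {
        double p = precision(237, 108);
        double r = recall(237, 386);
        System.out.println(String.format(Locale.ROOT, "precision = %.4f", p));        // precision = 0.6870
        System.out.println(String.format(Locale.ROOT, "recall = %.4f", r));           // recall = 0.3804
        System.out.println(String.format(Locale.ROOT, "FScore = %.4f", fScore(p, r))); // FScore = 0.4897
    }
}
```

TN does not enter any of the three measures, which is why they stay informative even though negatives outnumber positives by far in this data set.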
  • 19. Reference:
    Use Weka in your Java code
    Generating cross-validation folds
    Download:
    Example code
    Java ROC code
    Matlab ROC code