The University of Poonch
Data Mining
BS(CS) 6th Semester
Contact Lenses
• Weka stands for Waikato Environment for Knowledge Analysis.
• Weka contains tools for data pre-processing, classification, regression and clustering.
• Weka is a collection of machine learning algorithms for data mining tasks.
From the Windows desktop:
• Click Start, choose All Programs, then choose Weka 3-7 to start Weka.
• The first interface window then appears.
• Explorer is used for pre-processing, attribute selection, learning and visualization.
• When we select Explorer, the environment that opens is:
• Now I click on Open file to open a data file from the folder where the data files are stored.
• Then I select my dataset, "CONTACT LENSES".
• Every instance consists of a number of attributes.
• First we choose a filter.
• There are two kinds of filter:
  • Supervised
  • Unsupervised
• We then selected the unsupervised filter.
• Under the unsupervised filter there are two options:
  • Instance
  • Attribute
• We selected attribute.
• There are many attribute filters, but we chose NominalToBinary.
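The idea behind a nominal-to-binary conversion can be sketched in a few lines of Python. This is only an illustration of the one-attribute-per-value encoding, not Weka's own implementation; the function name `nominal_to_binary` and the sample `age` values are assumptions made for the example.

```python
def nominal_to_binary(values, categories):
    """Return one 0/1 column per category for each nominal value."""
    return [[1 if value == category else 0 for category in categories]
            for value in values]

# Hypothetical sample of a nominal attribute (illustrative values only).
age_categories = ["young", "pre-presbyopic", "presbyopic"]
age_column = ["young", "presbyopic", "pre-presbyopic", "young"]

encoded = nominal_to_binary(age_column, age_categories)
for value, row in zip(age_column, encoded):
    print(value, row)
```

Each nominal value becomes a row with a single 1 in the position of its category, so a learner that expects numeric input can still use the attribute.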
• First there is a simple classifier, ZeroR.
• It determines the most common class,
• or the mean (in the case of a numeric class).
• It tests how well the class can be predicted without considering the other attributes.
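A rough sketch of what a ZeroR-style baseline does for a nominal class is given here. It is not Weka's code; the function name `zero_r` is hypothetical, and the class counts (5 soft, 4 hard, 15 none) are assumed from the row sums of the confusion matrix discussed later in these slides.

```python
from collections import Counter

def zero_r(train_labels):
    """Return the most frequent class label, ignoring all other attributes."""
    most_common_label, _count = Counter(train_labels).most_common(1)[0]
    return most_common_label

# Assumed class distribution of the 24 instances.
labels = ["soft"] * 5 + ["hard"] * 4 + ["none"] * 15

prediction = zero_r(labels)
# Accuracy obtained by always answering with the majority class.
baseline_accuracy = sum(label == prediction for label in labels) / len(labels)
print(prediction, round(baseline_accuracy, 3))
```

Any useful classifier should beat this baseline accuracy, which is why ZeroR is a convenient first check.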
Use training set:
• The classifier is evaluated on how well it predicts the class of the instances it was trained on.
Supplied test set:
• The classifier is evaluated on how well it predicts the class of a set of instances loaded from a file. Clicking the Set... button brings up a dialog allowing you to choose the file to test on.
Percentage split:
• The classifier is evaluated on how well it
predicts a certain percentage of the data which
is held out for testing. The amount of data held
out depends on the value entered in the % field.
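A percentage split can be sketched as follows. The function name `percentage_split`, the seed, and the value 66 are assumptions chosen for illustration; the rounding here may differ from what Weka does internally.

```python
import random

def percentage_split(instances, train_percent, seed=1):
    """Shuffle a copy of the data and cut it into training and test parts."""
    shuffled = instances[:]
    random.Random(seed).shuffle(shuffled)
    n_train = round(len(shuffled) * train_percent / 100)
    return shuffled[:n_train], shuffled[n_train:]

instances = list(range(24))  # stand-ins for 24 data instances
train, test = percentage_split(instances, train_percent=66)
print(len(train), len(test))
```

The classifier is built on `train` only and then scored on `test`, so the reported accuracy reflects instances it has not seen.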
Cross-validation (CV):
• The classifier is evaluated by cross-validation, using the number of folds entered in the Folds text field.
• Having 10 folds means that 90% of the full data is used for training (and 10% for testing) in each fold.
• Cross-validation produces a fair estimate of test performance.
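The way folds divide the data can be sketched as below. This is a simplified illustration with no shuffling or stratification, which a real tool may apply; the function name `k_fold_indices` and the figure of 100 instances are assumptions.

```python
def k_fold_indices(n_instances, n_folds):
    """Yield (train_indices, test_indices) pairs, one pair per fold."""
    folds = [list(range(i, n_instances, n_folds)) for i in range(n_folds)]
    for test_idx in folds:
        held_out = set(test_idx)
        train_idx = [i for i in range(n_instances) if i not in held_out]
        yield train_idx, test_idx

splits = list(k_fold_indices(100, 10))
for train_idx, test_idx in splits[:2]:
    print(len(train_idx), len(test_idx))
```

Every instance is used for testing exactly once and for training in all the other folds, which is what makes the averaged result a fair estimate.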
• In our run, choosing Supplied test set gave the same results as choosing Use training set.
• The True Positive (TP) rate is the proportion of examples which were classified as class x, among all examples which truly have class x, i.e. how much of the class was captured. It is equivalent to Recall. In the confusion matrix, this is the diagonal element divided by the sum over the relevant row, i.e. 4/(4+0+1)=0.8 for class soft, 1/(0+1+3)=0.25 for class hard and 12/(1+2+12)=0.8 for class none in our example.
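The row-based calculation can be checked with a short Python sketch. The matrix below is reconstructed from the row figures quoted in the text, assuming rows are actual classes and columns are predicted classes in the order soft, hard, none; the function name `tp_rate` is hypothetical.

```python
classes = ["soft", "hard", "none"]
# Rows: actual class, columns: predicted class (assumed layout).
confusion = [
    [4, 0, 1],    # actual soft
    [0, 1, 3],    # actual hard
    [1, 2, 12],   # actual none
]

def tp_rate(matrix, i):
    """Diagonal element divided by the sum of row i."""
    return matrix[i][i] / sum(matrix[i])

tp_rates = {name: tp_rate(confusion, i) for i, name in enumerate(classes)}
print(tp_rates)
```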
• The False Positive (FP) rate is the proportion of examples which were classified as class x but belong to a different class, among all examples which are not of class x. In the matrix, this is the column sum of class x minus the diagonal element, divided by the row sums of all other classes, i.e. 1/(4+15)=0.053 for class soft and 2/(5+15)=0.1 for class hard.
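As a hedged check of the definition, the FP rate can be computed from the same 3x3 confusion matrix. The matrix is reconstructed from the figures quoted in these slides, assuming rows are actual classes and columns are predicted classes (soft, hard, none); `fp_rate` is a hypothetical helper name.

```python
classes = ["soft", "hard", "none"]
# Rows: actual class, columns: predicted class (assumed layout).
confusion = [
    [4, 0, 1],
    [0, 1, 3],
    [1, 2, 12],
]

def fp_rate(matrix, i):
    """(Column sum minus diagonal) divided by the row sums of the other classes."""
    column_sum = sum(row[i] for row in matrix)
    total = sum(sum(row) for row in matrix)
    false_positives = column_sum - matrix[i][i]
    actual_negatives = total - sum(matrix[i])
    return false_positives / actual_negatives

fp_rates = {name: fp_rate(confusion, i) for i, name in enumerate(classes)}
print({name: round(rate, 3) for name, rate in fp_rates.items()})
```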
• Precision is the proportion of the examples which truly have class x among all those which were classified as class x. In the matrix, this is the diagonal element divided by the sum over the relevant column, i.e. 4/(4+0+1)=0.8 for class soft, 1/(0+1+2)=0.333 for class hard and 12/(1+3+12)=0.75 for class none.
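The column-based calculation can be sketched in the same way. The matrix is again a reconstruction from the numbers in these slides, with rows assumed to be actual classes and columns predicted classes (soft, hard, none); `precision` is a hypothetical helper name.

```python
classes = ["soft", "hard", "none"]
# Rows: actual class, columns: predicted class (assumed layout).
confusion = [
    [4, 0, 1],
    [0, 1, 3],
    [1, 2, 12],
]

def precision(matrix, i):
    """Diagonal element divided by the sum of column i."""
    column_sum = sum(row[i] for row in matrix)
    return matrix[i][i] / column_sum

precisions = {name: precision(confusion, i) for i, name in enumerate(classes)}
print({name: round(value, 3) for name, value in precisions.items()})
```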
F-Measure = 2*Precision*Recall / (Precision + Recall)
• A combined measure for precision and recall: (2*0.8*0.8)/(0.8+0.8)=0.8 for class soft, (2*0.333*0.25)/(0.333+0.25)=0.286 for class hard and (2*0.75*0.8)/(0.75+0.8)=0.774 for class none.
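The combined measure is easy to verify numerically. In this sketch the precision and recall inputs are the per-class values from the example (with the recall of class hard taken as 0.25); the function name `f_measure` is hypothetical.

```python
def f_measure(precision, recall):
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)

# (precision, recall) per class, taken from the worked example.
per_class = {
    "soft": (0.8, 0.8),
    "hard": (1 / 3, 0.25),
    "none": (0.75, 0.8),
}

f_scores = {name: f_measure(p, r) for name, (p, r) in per_class.items()}
print({name: round(score, 3) for name, score in f_scores.items()})
```

Because it is a harmonic mean, the score stays low whenever either precision or recall is low, as for class hard.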
• Accuracy is measured by the area under the ROC curve. An area of 1 represents a perfect test; an area of 0.5 represents a worthless test. A rough guide for classifying the accuracy of a diagnostic test is the traditional academic point system: 0.90-1 = excellent (A)
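One way to see why 1 is perfect and 0.5 is worthless is to compute the area as the chance that a randomly chosen positive example is scored above a randomly chosen negative one. The sketch below uses that pairwise view; the function name `roc_auc` and the labels and scores are made-up illustrations, not output from Weka.

```python
def roc_auc(labels, scores):
    """Area under the ROC curve via pairwise comparison of positives and negatives."""
    positives = [s for label, s in zip(labels, scores) if label == 1]
    negatives = [s for label, s in zip(labels, scores) if label == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in positives for n in negatives)
    return wins / (len(positives) * len(negatives))

labels = [0, 0, 1, 1]
perfect_auc = roc_auc(labels, [0.1, 0.2, 0.8, 0.9])    # positives always ranked higher
worthless_auc = roc_auc(labels, [0.5, 0.5, 0.5, 0.5])  # scores carry no information
print(perfect_auc, worthless_auc)
```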
Recall:
• The proportion of all relevant documents that are actually retrieved by the query. It is equivalent to the TP rate.
• I can change the number of folds in cross-validation.
• If I change the folds from 10 to 5, it means that 80% of the data is used for training (and 20% for testing) in each fold.
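The relationship between the number of folds and the training share is simple arithmetic, sketched here with a hypothetical helper `train_fraction`: with k folds, one fold is held out and the remaining k-1 folds are used for training.

```python
def train_fraction(n_folds):
    """Share of the data used for training in each fold of k-fold cross-validation."""
    return (n_folds - 1) / n_folds

fractions = {k: train_fraction(k) for k in (10, 5)}
for k, fraction in fractions.items():
    print(f"{k} folds: {fraction:.0%} training, {1 - fraction:.0%} testing")
```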
