
# WEKA: Credibility: Evaluating What's Been Learned


**Slide 1: Credibility: Evaluating What's Been Learned**
**Slide 2: Training and Testing**

- We measure the success of a classification procedure by its error rate (or the equivalent success rate).
- Measuring the success rate on the training set is highly optimistic; the error rate on the training set is called the resubstitution error.
- For an honest estimate we use a separate test set, which must be independent of the training set.
- Sometimes a third, independent validation set is used to tune the classification technique.
- Holding out part of the data for testing (so it is not used for training) is called the holdout procedure.
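The holdout procedure can be illustrated with a minimal Python sketch (the function name and the one-third split are illustrative choices, not part of WEKA):

```python
import random

def holdout_split(instances, test_fraction=0.33, seed=42):
    """Holdout procedure: shuffle, then reserve part of the data for testing."""
    rng = random.Random(seed)
    shuffled = instances[:]                      # copy; leave the original untouched
    rng.shuffle(shuffled)
    n_test = round(len(shuffled) * test_fraction)
    return shuffled[:-n_test], shuffled[-n_test:]

data = list(range(100))                          # stand-in for 100 labelled instances
train_set, test_set = holdout_split(data)
print(len(train_set), len(test_set))             # 67 33
```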
**Slide 3: Predicting Performance**

- Expected success rate = 100 - error rate (when both are percentages).
- What we really want is the true success rate, not just the observed one.
- Suppose the observed success rate is f = s/N, where s is the number of successes out of N instances.
- For large N, f follows a normal distribution.
- We can then predict an interval for the true success rate p at a chosen confidence level.
- For example, if f = 75% over N = 1000 instances, then p lies in [73.2%, 76.7%] with 80% confidence.
**Slide 4: Predicting Performance**

- From basic statistics, f has mean p and variance p(1 - p)/N.
- To use the standard normal distribution we transform f to have mean 0 and standard deviation 1: (f - p) / sqrt(p(1 - p)/N).
- Given a confidence level c%, we look up the corresponding z value.
- The test is two-tailed: since the total area under the normal curve counts as 100%, the area left outside the interval is 100 - c.
**Slide 5: Predicting Performance**

- After the algebraic manipulation, the bounds on the true success rate are:

  p = ( f + z²/2N ± z · sqrt( f/N - f²/N + z²/4N² ) ) / ( 1 + z²/N )

- Here:
  - p = true success rate
  - f = observed success rate
  - N = number of instances
  - z = number of standard deviations corresponding to confidence c (from a normal-distribution table, using the 100 - c tail area)
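The interval can be checked numerically with a short Python sketch (z = 1.28 for 80% confidence is read from a normal table; the function name is illustrative):

```python
import math

def confidence_interval(f, n, z):
    """Interval for the true success rate p, given observed rate f on n instances."""
    centre = f + z * z / (2 * n)
    spread = z * math.sqrt(f / n - f * f / n + z * z / (4 * n * n))
    denom = 1 + z * z / n
    return (centre - spread) / denom, (centre + spread) / denom

# Slide example: f = 75% on N = 1000 instances, 80% confidence (z = 1.28)
lo, hi = confidence_interval(0.75, 1000, 1.28)
print(round(lo * 100, 1), round(hi * 100, 1))    # 73.2 76.7
```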
**Slide 6: Cross-Validation**

- We use cross-validation when the amount of data is small and we still need independent training and test sets.
- It is important that each class is represented in its actual proportions in both the training and test sets: this is called stratification.
- A standard technique is stratified 10-fold cross-validation, where the instances are divided into 10 folds.
- We run 10 iterations, each time using a different single fold for testing and the remaining 9 folds for training, and average the 10 error estimates.
- Problem: computationally intensive.
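Stratification can be sketched in a few lines of Python: deal each class's instances round-robin so every fold keeps the overall class proportions (a simplified sketch, not WEKA's implementation):

```python
import random
from collections import defaultdict

def stratified_folds(instances, labels, k=10, seed=0):
    """Deal the instances of each class round-robin into k folds so that
    every fold preserves the overall class proportions."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)
    folds = [[] for _ in range(k)]
    position = 0
    for label, indices in by_class.items():
        rng.shuffle(indices)
        for idx in indices:
            folds[position % k].append(idx)
            position += 1
    return folds

labels = ["yes"] * 70 + ["no"] * 30
folds = stratified_folds(list(range(100)), labels)
print([len(f) for f in folds])        # ten folds of 10 instances, each with 7 yes / 3 no
```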
**Slide 7: Other Estimates**

- Leave-one-out, steps:
  - One instance is left out for testing and the rest are used for training.
  - This is repeated for every instance and the errors are averaged.
- Advantage: the largest possible training sets are used.
- Disadvantages: computationally intensive, and it cannot be stratified.
**Slide 8: Other Estimates**

- 0.632 bootstrap:
  - A dataset of n instances is sampled n times, with replacement, to give another dataset of n instances.
  - There will be some repeated instances in the second set; the instances never drawn form the test set.
  - The error is estimated as:

    e = 0.632 × (error on test instances) + 0.368 × (error on training instances)
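A minimal Python sketch of both steps, sampling with replacement and combining the two error rates (the example error rates are made-up numbers for illustration):

```python
import random

def bootstrap_sample(instances, seed=1):
    """Sample n times with replacement; instances never drawn form the test set."""
    rng = random.Random(seed)
    n = len(instances)
    train = [instances[rng.randrange(n)] for _ in range(n)]
    chosen = set(train)
    test = [x for x in instances if x not in chosen]
    return train, test

def bootstrap_error(err_test, err_train):
    """0.632 bootstrap estimate combining test-set and training-set error."""
    return 0.632 * err_test + 0.368 * err_train

data = list(range(20))
train, held_out = bootstrap_sample(data)
print(len(train), len(held_out))      # 20 sampled instances; the rest form the test set
print(bootstrap_error(0.30, 0.10))    # roughly 0.2264
```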
**Slide 9: Comparing Data Mining Methods**

- So far we have dealt with performance prediction; now we look at how to compare algorithms, to see which one does better.
- We can't directly compare error rates, because they may have been calculated on different data sets.
- To compare algorithms we need a statistical test.
- We use Student's t-test, which tells us whether the mean errors of two algorithms differ at a given confidence level.
**Slide 10: Comparing Data Mining Methods**

- We use the paired t-test, a slight modification of Student's t-test.
- Paired t-test (assuming effectively unlimited data):
  - Draw k data sets.
  - Run cross-validation with each technique to get the respective outcomes x1, x2, ..., xk and y1, y2, ..., yk.
  - Let mx be the mean of the x values, and my the mean of the y values.
  - Let di = xi - yi.
  - The t-statistic is then t = m_d / sqrt(σ_d²/k), where m_d is the mean and σ_d² the variance of the differences di.
**Slide 11: Comparing Data Mining Methods**

- The value of k gives the degrees of freedom (k - 1), which lets us look up the critical value z for a chosen confidence level.
- If t <= -z or t >= z, the two means differ significantly, and we reject the null hypothesis that they are the same.
- Otherwise the observed difference is consistent with chance, and we cannot reject the null hypothesis.
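The t-statistic reduces to a short computation; the cross-validation error rates below are made-up numbers purely for illustration:

```python
import math

def paired_t_statistic(xs, ys):
    """t = mean(d) / sqrt(var(d)/k), with d_i = x_i - y_i and sample variance."""
    k = len(xs)
    d = [x - y for x, y in zip(xs, ys)]
    mean_d = sum(d) / k
    var_d = sum((di - mean_d) ** 2 for di in d) / (k - 1)
    return mean_d / math.sqrt(var_d / k)

# Hypothetical error rates of two schemes on the same k = 5 data sets
x = [0.12, 0.15, 0.11, 0.14, 0.13]
y = [0.14, 0.16, 0.14, 0.15, 0.16]
t = paired_t_statistic(x, y)
print(round(t, 2))    # compare |t| against the critical value for k - 1 degrees of freedom
```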
**Slide 12: Predicting Probabilities**

- So far we considered schemes whose predictions are simply correct or incorrect; evaluating them this way uses the 0-1 loss function.
- Now we deal with measuring success for algorithms that output a probability distribution, e.g. Naïve Bayes.
**Slide 13: Predicting Probabilities**

- Quadratic loss function:
  - For a single instance there are k possible outcomes (classes).
  - Predicted probability vector: p1, p2, ..., pk.
  - Actual outcome vector: a1, a2, ..., ak (1 for the class that actually occurred, 0 for all others).
  - We want to minimize the quadratic loss, given by the sum over classes of (pj - aj)².
  - The minimum (in expectation) is achieved when the probability vector is the true probability vector.
**Slide 14: Predicting Probabilities**

- Informational loss function:
  - Given by -log2(pj), where j is the class that actually occurred.
  - The minimum is again reached at the true probabilities.
- Differences between quadratic and informational loss:
  - Quadratic loss takes all class probabilities into account; informational loss depends only on the probability assigned to the actual class.
  - Quadratic loss is bounded (its maximum is 2); informational loss is unbounded and can grow to infinity.
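Both loss functions are one-liners; a sketch with a made-up three-class prediction:

```python
import math

def quadratic_loss(p, actual):
    """Sum over classes of (p_j - a_j)^2, where a is the 0/1 actual-outcome vector."""
    a = [1 if j == actual else 0 for j in range(len(p))]
    return sum((pj - aj) ** 2 for pj, aj in zip(p, a))

def informational_loss(p, actual):
    """-log2 of the probability assigned to the class that actually occurred."""
    return -math.log2(p[actual])

p = [0.7, 0.2, 0.1]               # predicted distribution over three classes
print(quadratic_loss(p, 0))       # (0.7-1)^2 + 0.2^2 + 0.1^2 = 0.14
print(informational_loss(p, 0))   # -log2(0.7), roughly 0.515
```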
**Slide 15: Counting the Cost**

- Different outcomes may have different costs.
- For example, in a loan decision the cost of lending to a defaulter is far greater than the lost-business cost of refusing a loan to a non-defaulter.
- For a two-class prediction the possible outcomes are:

| | Predicted yes | Predicted no |
| --- | --- | --- |
| Actual yes | true positive (TP) | false negative (FN) |
| Actual no | false positive (FP) | true negative (TN) |
**Slide 16: Counting the Cost**

- True positive rate: TP / (TP + FN)
- False positive rate: FP / (FP + TN)
- Overall success rate: number of correct classifications / total number of classifications
- Error rate = 1 - success rate
- In the multiclass case we have a confusion matrix (one for the actual predictor and one for a random predictor).
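These rates are direct ratios over the 2×2 confusion matrix; a sketch with made-up counts:

```python
def rates(tp, fn, fp, tn):
    """True/false positive rates and overall success rate from a 2x2 confusion matrix."""
    tp_rate = tp / (tp + fn)
    fp_rate = fp / (fp + tn)
    success = (tp + tn) / (tp + fn + fp + tn)
    return tp_rate, fp_rate, success

tp_rate, fp_rate, success = rates(tp=40, fn=10, fp=5, tn=45)
print(tp_rate, fp_rate, success, 1 - success)   # 0.8 0.1 0.85 and error rate 0.15
```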
**Slide 17: Counting the Cost**

- Consider the actual and the random (chance) outcomes of a three-class problem.
- The diagonal of the confusion matrix represents the successful cases.
- Kappa statistic = (D_observed - D_chance) / (D_perfect - D_chance), where D_observed is the number of correct predictions, D_chance the number expected correct by chance, and D_perfect the total.
- In this example, kappa = (140 - 82) / (200 - 82) = 49.2%.
- Kappa measures the agreement between predicted and observed categorizations of a dataset, correcting for agreement that occurs by chance.
- It does not take cost into account.
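The slide's kappa arithmetic can be reproduced directly (the three counts are the ones given above):

```python
def kappa(observed_correct, chance_correct, total):
    """Kappa: agreement beyond chance, divided by the maximum possible agreement beyond chance."""
    return (observed_correct - chance_correct) / (total - chance_correct)

# Slide example: 140 correct out of 200, with 82 expected correct by chance
print(round(kappa(140, 82, 200) * 100, 1))   # 49.2
```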
**Slide 18: Classification with Costs**

- Example cost matrices (a 0/1 cost matrix just counts the number of errors).
- Performance is then measured by the average cost per prediction, which we try to minimize.
- Expected cost of predicting a class: the dot product of the vector of class probabilities with the appropriate column of the cost matrix.
**Slide 19: Classification with Costs**

- Steps to take cost into account at prediction time:
  - First use a learning method that outputs a probability vector (such as Naïve Bayes).
  - Multiply the probability vector by each column of the cost matrix in turn to get the expected cost for each class.
  - Predict the class with the minimum expected cost (or the maximum, if the matrix encodes benefits instead of costs).
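These steps can be sketched as follows; the loan-style cost matrix is a hypothetical example, with rows for the actual class and columns for the predicted class:

```python
def cheapest_class(probabilities, cost_matrix):
    """Expected cost of predicting a class = dot product of the class-probability
    vector with that class's column of the cost matrix; pick the minimum."""
    n_classes = len(probabilities)
    costs = [
        sum(probabilities[actual] * cost_matrix[actual][predicted]
            for actual in range(n_classes))
        for predicted in range(n_classes)
    ]
    return min(range(n_classes), key=lambda c: costs[c]), costs

cost = [[0, 1],    # actual "good" customer: refusing them costs 1
        [10, 0]]   # actual defaulter: lending to them costs 10
choice, costs = cheapest_class([0.8, 0.2], cost)
print(choice, costs)   # class 1 ("refuse") wins despite "good" being more probable
```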
**Slide 20: Cost-Sensitive Learning**

- So far we included the cost factor only during evaluation.
- We can also incorporate costs into the learning phase of a method.
- One way is to change the proportion of instances in the training set to reflect costs.
- For example, we can replicate the instances of a particular class so that the learned model makes fewer errors on that class.
**Slide 21: Lift Charts**

- In practice, costs are rarely known.
- In marketing terminology the increase in response rate is referred to as the lift factor.
- We compare probable scenarios to make decisions; a lift chart allows a visual comparison.
- Example: a promotional mail-out to 1,000,000 households.
  - Mail to all: 0.1% respond (1000 responses).
  - A data mining tool identifies a subset of 100,000 households of which 0.4% respond (400 responses).
  - That is a lift of 4.
**Slide 22: Lift Charts**

- Steps to calculate the lift factor:
  - Decide on a sample size.
  - Arrange the data in decreasing order of the predicted probability of the class of interest (the positive class, on which the lift factor is based).
  - Calculate:
    - Sample success proportion = number of positive instances in the sample / sample size
    - Lift factor = sample success proportion / overall data success proportion
  - Calculate the lift factor for different sample sizes to get a lift chart.
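The mail-out example from the previous slide plugs straight into this definition:

```python
def lift_factor(sample_positives, sample_size, data_positives, data_size):
    """Lift = success proportion in the chosen sample / success proportion overall."""
    return (sample_positives / sample_size) / (data_positives / data_size)

# Slide example: 400 responses from a selected 100,000 households,
# versus 1000 responses from all 1,000,000 households
print(round(lift_factor(400, 100_000, 1000, 1_000_000), 6))   # 4.0
```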
**Slide 23: Lift Charts**

- A hypothetical lift chart (figure).
**Slide 24: Lift Charts**

- In a lift chart we want to stay toward the upper left corner.
- The diagonal line is the curve for random samples, without using the sorted data.
- Any good selection keeps the lift curve above the diagonal.
**Slide 25: ROC Curves**

- ROC stands for receiver operating characteristic.
- Differences from lift charts:
  - The Y axis shows the percentage of true positives.
  - The X axis shows the percentage of false positives.
- A raw ROC curve is jagged; it can be smoothed out by cross-validation.
**Slide 26: ROC Curves**

- A sample ROC curve (figure).
**Slide 27: ROC Curves**

- Ways to generate an ROC curve from cross-validation (refer to the previous diagram).
- First way:
  - Get the probability estimates from the different folds of the data.
  - Sort the data in decreasing order of the probability of the "yes" class.
  - Select a point on the X axis (a number of "no" instances) and find the corresponding number of "yes" instances in each fold.
  - Average the number of "yes" instances across the folds and plot the point.
**Slide 28: ROC Curves**

- Second way:
  - Get the probability estimates from the different folds of the data.
  - Sort each fold's data in decreasing order of the probability of the "yes" class.
  - Plot an ROC curve for each fold individually.
  - Average the ROC curves.
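The per-fold curve construction can be sketched as follows: sort by decreasing probability of "yes" and sweep down the ranking, counting true and false positives (the scores below are toy values, not WEKA output):

```python
def roc_points(scores_and_labels):
    """Sweep the instances sorted by decreasing predicted probability of "yes",
    emitting (false-positive rate, true-positive rate) after each instance."""
    ranked = sorted(scores_and_labels, key=lambda pair: -pair[0])
    n_pos = sum(1 for _, label in ranked if label == "yes")
    n_neg = len(ranked) - n_pos
    tp = fp = 0
    points = [(0.0, 0.0)]
    for _, label in ranked:
        if label == "yes":
            tp += 1
        else:
            fp += 1
        points.append((fp / n_neg, tp / n_pos))
    return points

data = [(0.9, "yes"), (0.8, "yes"), (0.7, "no"), (0.6, "yes"), (0.4, "no")]
for fp_rate, tp_rate in roc_points(data):
    print(fp_rate, tp_rate)
```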
**Slide 29: ROC Curves**

- ROC curves for two schemes (figure).
**Slide 30: ROC Curves**

- Reading the previous ROC curves:
  - For a small, focused sample, use method A.
  - For a large sample, use method B.
  - In between, choose between A and B with appropriate probabilities.
**Slide 31: Recall-Precision Curves**

- For a search query:
  - Recall = number of relevant documents retrieved / total number of relevant documents
  - Precision = number of relevant documents retrieved / total number of documents retrieved
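With documents represented as sets of ids, both measures are one-line ratios (the document ids are made up):

```python
def recall_precision(retrieved, relevant):
    """Recall and precision for a search query, from sets of document ids."""
    hits = len(retrieved & relevant)
    return hits / len(relevant), hits / len(retrieved)

retrieved = {1, 2, 3, 4, 5}       # documents the query returned
relevant = {2, 4, 6, 8}           # documents that are actually relevant
recall, precision = recall_precision(retrieved, relevant)
print(recall, precision)          # 0.5 0.4
```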
**Slide 32: A Summary**

- Different measures are used to evaluate the trade-off between false positives and false negatives.
**Slide 33: Cost Curves**

- Cost curves plot expected costs directly.
- Example for the case of uniform costs, i.e. plain error (figure).
**Slide 34: Cost Curves**

- Example with non-uniform costs (figure).
**Slide 35: Cost Curves**

- C[+|-] is the cost of predicting + when the instance is -.
- C[-|+] is the cost of predicting - when the instance is +.
**Slide 36: Minimum Description Length Principle**

- The description length is defined as: the space required to describe a theory plus the space required to describe the theory's mistakes.
- Here theory = the classifier, and mistakes = its errors on the training data.
- We try to minimize the description length.
- The MDL theory is the one that compresses the data the most: to compress a data set, we generate a model and then store the model together with its mistakes.
- We need to compute:
  - the size of the model, and
  - the space needed to encode the errors.
**Slide 37: Minimum Description Length Principle**

- The second part is easy: just use the informational loss function.
- For the first part we need a method of encoding the model.
- L[T] = "length" of the theory.
- L[E|T] = length of the training set encoded with respect to the theory.
- We minimize L[T] + L[E|T].
**Slide 38: Minimum Description Length Principle**

- MDL and clustering:
  - Description length of the theory: the bits needed to encode the clusters, e.g. the cluster centers.
  - Description length of the data given the theory: encode each instance's cluster membership and its position relative to the cluster, e.g. the distance to its cluster center.
  - This works if the coding scheme uses less code space for small numbers than for large ones.