
- 1. Credibility: Evaluating What’s Been Learned<br />
- 2. Training and Testing<br />We measure the success of a classification procedure by its error rate (or, equivalently, its success rate)<br />Measuring the success rate on the training set is highly optimistic<br />The error rate on the training set is called the resubstitution error<br />We use a separate test set for estimating the true error rate<br />The test set should be independent of the training set<br />Sometimes we also use a validation set to tune the classification technique<br />When we hold out part of the training set for testing (that part is then not used for training), the process is called the holdout procedure<br />
- 3. Predicting performance<br />Expected success rate = 100 – error rate (when both are percentages)<br />We want the true success rate<br />Calculating the true success rate:<br />Suppose the observed success rate is f = s/n, where s is the number of successes out of a total of n instances<br />For large n, f follows a normal distribution<br />We then predict the true success rate (p) for a chosen confidence level<br />For example, if f = 75%, then p lies in [73.2%, 76.7%] with 80% confidence<br />
- 4. Predicting performance<br />From statistics we know that the mean of f is p and its variance is p(1-p)/n<br />To use the standard normal distribution we transform f to have mean 0 and standard deviation 1<br />So suppose our confidence is c% and we want to calculate p<br />We use the two-tailed property of the normal distribution<br />Since the total area under the normal curve is taken as 100%, the area we leave out in the two tails is 100 - c<br />
- 5. Predicting performance<br />Finally, after all the manipulations, we have the true success rate as:<br />p = ( f + z²/2N ± z·√(f/N − f²/N + z²/4N²) ) / (1 + z²/N)<br />Here,<br /> p -> true success rate<br /> f -> observed success rate<br /> N -> number of instances<br /> z -> factor derived from a normal distribution table using the 100-c measure<br />
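The interval above can be checked numerically. Below is a minimal Python sketch (not part of the original slides); it assumes N = 1000, which reproduces the slides' example of f = 75% at 80% confidence (z ≈ 1.28):

```python
import math

def wilson_interval(f, n, z):
    """Confidence interval for the true success rate p, given the
    observed success rate f over n instances and the z-score for the
    desired confidence level (e.g. z = 1.28 for 80% confidence)."""
    z2 = z * z
    center = f + z2 / (2 * n)
    spread = z * math.sqrt(f / n - f * f / n + z2 / (4 * n * n))
    denom = 1 + z2 / n
    return (center - spread) / denom, (center + spread) / denom

# f = 75%, N = 1000, 80% confidence -> roughly [73.2%, 76.7%]
lo, hi = wilson_interval(0.75, 1000, 1.28)
```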
- 6. Cross validation<br />We use cross-validation when the amount of data is small and we need independent training and test sets from it<br />It is important that each class is represented in its actual proportions in the training and test sets: stratification<br />An important cross-validation technique is stratified 10-fold cross-validation, where the instance set is divided into 10 folds<br />We run 10 iterations, each taking a different single fold for testing and the remaining 9 folds for training, and average the errors of the 10 iterations<br />Problem: computationally intensive<br />
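The stratified fold assignment described above can be sketched as follows (a minimal illustration, not from the slides; the helper name is my own). Dealing each class's instances round-robin keeps the class proportions roughly equal across folds:

```python
from collections import defaultdict

def stratified_folds(labels, k=10):
    """Assign each instance index to one of k folds so that every
    class appears in each fold in roughly its overall proportion."""
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)
    folds = [[] for _ in range(k)]
    for indices in by_class.values():
        # deal this class's indices round-robin over the folds
        for pos, idx in enumerate(indices):
            folds[pos % k].append(idx)
    return folds
```

With 20 instances of class 'a' and 10 of class 'b', each of the 10 folds ends up with two 'a' instances and one 'b' instance.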
- 7. Other estimates<br />Leave-one-out: steps<br />One instance is left out for testing and the rest are used for training<br />This is repeated for every instance and the errors are averaged<br />Leave-one-out: advantage<br />We use larger training sets<br />Leave-one-out: disadvantages<br />Computationally intensive<br />Cannot be stratified<br />
- 8. Other estimates<br />0.632 bootstrap<br />A dataset of n instances is sampled n times, with replacement, to give another dataset of n instances<br />There will be some repeated instances in the second set; instances never drawn form the test set<br />Here the error is defined as:<br />e = 0.632 × (error on test instances) + 0.368 × (error on training instances)<br />
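The bootstrap split and the combined error estimate can be sketched in a few lines of Python (an illustration under my own function names, not code from the slides):

```python
import random

def bootstrap_split(n, seed=0):
    """Sample n indices with replacement for training; the indices
    never drawn form the test set (on average ~36.8% of the data)."""
    rng = random.Random(seed)
    train = [rng.randrange(n) for _ in range(n)]
    drawn = set(train)
    test = [i for i in range(n) if i not in drawn]
    return train, test

def bootstrap_error(test_err, train_err):
    """Combined 0.632 bootstrap error estimate."""
    return 0.632 * test_err + 0.368 * train_err
```

For example, a test-set error of 0.3 and a training-set error of 0.1 combine to 0.632 × 0.3 + 0.368 × 0.1 = 0.2264.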
- 9. Comparing data mining methods<br />Till now we were dealing with performance prediction<br />Now we will look at methods to compare algorithms, to see which one did better<br />We can’t directly use the error rate to decide which algorithm is better, as the error rates might have been calculated on different data sets<br />So to compare algorithms we need statistical tests<br />We use Student’s t-test to do this. The test helps us figure out whether the mean errors of two algorithms differ at a given confidence level<br />
- 10. Comparing data mining methods<br />We will use the paired t-test, which is a slight modification of Student’s t-test<br />Paired t-test<br />Suppose we have unlimited data; do the following:<br />Draw k data sets from the unlimited data we have<br />Use cross-validation with each technique to get the respective outcomes: x1, x2, x3, ..., xk and y1, y2, y3, ..., yk<br />mx = mean of the x values, and similarly my<br />di = xi – yi<br />Using the t-statistic: t = md·√k / σd, where md is the mean of the di and σd their standard deviation<br />
- 11. Comparing data mining methods<br />Based on the value of k we get the degrees of freedom, which enables us to look up a z for a particular confidence value<br />If t <= -z or t >= z then the two means differ significantly<br />Otherwise we cannot reject the null hypothesis that the means do not differ<br />
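The t-statistic from the previous slide can be computed directly. A minimal sketch (my own helper, using the sample standard deviation with k − 1 degrees of freedom):

```python
import math

def paired_t(xs, ys):
    """t-statistic for paired samples: t = md * sqrt(k) / sd, where
    d_i = x_i - y_i, md is the mean of the d_i, and sd their sample
    standard deviation (k - 1 in the denominator)."""
    k = len(xs)
    d = [x - y for x, y in zip(xs, ys)]
    md = sum(d) / k
    var = sum((di - md) ** 2 for di in d) / (k - 1)
    return md * math.sqrt(k) / math.sqrt(var)
```

The resulting t is then compared against the threshold from a t-table at k − 1 degrees of freedom.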
- 12. Predicting Probabilities<br />Till now we were considering schemes whose prediction on an instance is either correct or incorrect. This is evaluated with the 0–1 loss function<br />Now we will deal with measuring the success of algorithms that output a probability distribution, e.g. Naïve Bayes<br />
- 13. Predicting Probabilities<br />Quadratic loss function:<br />For a single instance there are k outcomes or classes<br />Probability vector: p1, p2, ..., pk<br />The actual outcome vector is: a1, a2, a3, ..., ak (where the actual outcome is 1 and the rest are all 0)<br />We have to minimize the quadratic loss function, given by: Σj (pj – aj)²<br />The minimum is achieved when the probability vector is the true probability vector<br />
- 14. Predicting Probabilities<br />Informational loss function:<br />Given by: –log(pi), where i is the actual class<br />The minimum is again reached at the true probabilities<br />Differences between quadratic loss and informational loss:<br />While quadratic loss takes all class probabilities into consideration, informational loss is based only on the probability assigned to the actual class<br />While quadratic loss is bounded, with a maximum of 2, informational loss is unbounded, as it can grow to infinity<br />
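Both loss functions are short enough to write out (a sketch with my own function names; the informational loss uses log base 2, as is conventional when measuring in bits):

```python
import math

def quadratic_loss(p, a):
    """Sum over classes of (p_j - a_j)^2; bounded above by 2."""
    return sum((pj - aj) ** 2 for pj, aj in zip(p, a))

def informational_loss(p, a):
    """-log2 of the probability assigned to the actual class
    (the single j with a_j = 1); unbounded as p -> 0."""
    actual = a.index(1)
    return -math.log2(p[actual])
```

For p = [0.5, 0.25, 0.25] and actual class 0: quadratic loss is 0.25 + 0.0625 + 0.0625 = 0.375, informational loss is −log2(0.5) = 1 bit.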
- 15. Counting the cost<br />Different outcomes might have different costs<br />For example, in a loan decision, the cost of lending to a defaulter is far greater than the lost-business cost of refusing a loan to a non-defaulter<br />Suppose we have a two-class prediction. The outcomes can be: true positive, false negative, false positive, and true negative<br />
- 16. Counting the cost<br />True positive rate: TP/(TP+FN)<br />False positive rate: FP/(FP+TN)<br />Overall success rate: number of correct classifications / total number of classifications<br />Error rate = 1 – success rate<br />In the multiclass case we have a confusion matrix (an actual one and a random one):<br />
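The two-class rates above can be sketched as (an illustration, not from the slides):

```python
def rates(tp, fn, fp, tn):
    """True-positive rate, false-positive rate, and overall success
    rate from the four outcome counts of a two-class problem."""
    tp_rate = tp / (tp + fn)
    fp_rate = fp / (fp + tn)
    success = (tp + tn) / (tp + fn + fp + tn)
    return tp_rate, fp_rate, success
```

For example, with TP=40, FN=10, FP=20, TN=30: TP rate = 40/50 = 0.8, FP rate = 20/50 = 0.4, success rate = 70/100 = 0.7.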
- 17. Counting the cost<br />These are the actual and the random outcomes of a three-class problem<br />The diagonal represents the successful cases<br />Kappa statistic = (D-observed − D-chance) / (D-perfect − D-chance), where D-chance is the number of agreements a random predictor would produce<br />Here the kappa statistic = (140 – 82)/(200 – 82) = 49.2%<br />Kappa measures the agreement between the predicted and observed categorizations of a dataset, while correcting for agreement that occurs by chance<br />It does not take cost into account<br />
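The kappa computation can be sketched as follows. The original confusion matrices were images and are not in the scrape, so the matrix below is a hypothetical three-class example chosen so that the diagonal sums to 140, chance agreement to 82, and the total to 200, matching the slide's arithmetic:

```python
def kappa(confusion):
    """Kappa statistic from a square confusion matrix
    (rows = actual class, columns = predicted class)."""
    n = len(confusion)
    total = sum(sum(row) for row in confusion)
    observed = sum(confusion[i][i] for i in range(n))
    # chance agreement: for each class, (row total * column total) / total
    expected = sum(
        sum(confusion[i]) * sum(row[i] for row in confusion)
        for i in range(n)
    ) / total
    return (observed - expected) / (total - expected)

# hypothetical matrix reproducing the slide's numbers:
# diagonal 88+40+12 = 140, chance agreement 82, total 200
m = [[88, 10, 2],
     [14, 40, 6],
     [18, 10, 12]]
k = kappa(m)  # (140 - 82) / (200 - 82), about 0.492
```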
- 18. Classification with costs<br />Example cost matrices (which just give us the cost of each kind of error):<br />Performance is measured by the average cost per prediction<br />We try to minimize the costs<br />Expected cost: the dot product of the vector of class probabilities and the appropriate column of the cost matrix<br />
- 19. Classification with costs<br />Steps to take cost into consideration while testing:<br />First use a learning method to get the probability vector (like Naïve Bayes)<br />Now multiply the probability vector with each column of the cost matrix, one by one, to get the expected cost for each class/column<br />Select the class with the minimum cost (or the maximum, if the matrix encodes benefits)<br />
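The steps above can be sketched as (my own helper name; the cost matrix is indexed as cost[actual][predicted]):

```python
def min_cost_class(probs, cost_matrix):
    """Pick the class whose expected cost is smallest, where the
    expected cost of predicting class j is the dot product of the
    class-probability vector with column j of the cost matrix."""
    n = len(probs)
    expected = [
        sum(probs[actual] * cost_matrix[actual][predicted]
            for actual in range(n))
        for predicted in range(n)
    ]
    return min(range(n), key=expected.__getitem__)
```

For example, with probabilities [0.2, 0.8] and cost matrix [[0, 1], [10, 0]], predicting class 0 costs 0.8 × 10 = 8 in expectation while predicting class 1 costs 0.2 × 1 = 0.2, so class 1 is chosen.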
- 20. Cost sensitive learning<br />Till now we included the cost factor during evaluation<br />We can also incorporate costs into the learning phase of a method<br />We can change the ratio of instances in the training set to take care of costs<br />For example, we can replicate the instances of a particular class so that our learning method gives us a model with fewer errors on that class<br />
- 21. Lift Charts<br />In practice, costs are rarely known<br />In marketing terminology the increase in response rate is referred to as the lift factor<br />We compare probable scenarios to make decisions<br />A lift chart allows visual comparison<br />Example: promotional mail-out to 1,000,000 households<br />Mail to all: 0.1% respond (1000)<br />Some data mining tool identifies a subset of 100,000 of which 0.4% respond (400)<br />A lift of 4<br />
- 22. Lift Charts<br />Steps to calculate the lift factor:<br />We decide on a sample size<br />We arrange our data in decreasing order of the predicted probability of a class (the one we base our lift factor on: the positive class)<br />We calculate:<br />Sample success proportion = number of positive instances / sample size<br />Lift factor = sample success proportion / data success proportion<br />We calculate the lift factor for different sample sizes to get a lift chart<br />
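The steps above reduce to a short computation (a sketch with my own function name; `ranked` holds 1/0 outcomes sorted by decreasing predicted probability of the positive class):

```python
def lift_factor(ranked, sample_size, total_positives, total_size):
    """Lift = (positive rate within the top sample_size ranked
    instances) / (positive rate over the whole dataset)."""
    sample_rate = sum(ranked[:sample_size]) / sample_size
    data_rate = total_positives / total_size
    return sample_rate / data_rate
```

For example, if 3 of the top 5 ranked instances are positive while only 10 of all 100 instances are, the lift at sample size 5 is 0.6 / 0.1 = 6.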
- 23. Lift Charts<br />A hypothetical lift chart<br />
- 24. Lift Charts<br />In the lift chart we would like to stay towards the upper left corner<br />The diagonal line is the curve for random samples, without using sorted data<br />Any good selection will keep the lift curve above the diagonal<br />
- 25. ROC Curves<br />ROC stands for receiver operating characteristic<br />Differences from lift charts:<br />The Y axis shows the percentage of true positives<br />The X axis shows the percentage of false positives in the sample<br />An ROC curve is jagged<br />It can be smoothed out by cross-validation<br />
- 26. ROC Curves<br />A ROC curve<br />
- 27. ROC Curves<br />Ways to generate ROC curves<br />(Consider the previous diagram for reference)<br />First way:<br />Get the probability distribution over different folds of the data<br />Sort the data in decreasing order of the probability of the yes class<br />Select a point on the X-axis, and for that number of no instances, get the number of yes instances for each probability distribution<br />Average the number of yes instances over all the folds and plot it<br />
- 28. ROC Curves<br />Second way:<br />Get the probability distribution over different folds of the data<br />Sort the data in decreasing order of the probability of the yes class<br />Select a point on the X-axis, and for that number of no instances, get the number of yes instances for each probability distribution<br />Plot an ROC curve for each fold individually<br />Average all the ROC curves<br />
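Generating the points of a single (jagged) ROC curve from one ranking can be sketched as follows (an illustration under my own function name; it assumes both classes are present in the ranking):

```python
def roc_points(ranked_labels):
    """(FP rate, TP rate) after each cutoff when instances are ranked
    by decreasing predicted probability of 'yes'; labels are 1 = yes,
    0 = no. Walking down the ranking, a yes moves the curve up and a
    no moves it right."""
    pos = sum(ranked_labels)
    neg = len(ranked_labels) - pos
    tp = fp = 0
    points = [(0.0, 0.0)]
    for label in ranked_labels:
        if label:
            tp += 1
        else:
            fp += 1
        points.append((fp / neg, tp / pos))
    return points
```

Averaging such curves across folds (vertically, at fixed FP rates) gives the smoothed curve described above.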
- 29. ROC Curves<br />ROC curves for two schemes <br />
- 30. ROC Curves<br />In the previous ROC curves:<br />For a small, focused sample, use method A<br />For a large one, use method B<br />In between, choose between A and B with appropriate probabilities<br />
- 31. Recall – precision curves<br />In the case of a search query:<br />Recall = number of documents retrieved that are relevant / total number of documents that are relevant<br />Precision = number of documents retrieved that are relevant / total number of documents that are retrieved<br />
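The two definitions can be sketched directly over sets of document ids (an illustration, not from the slides):

```python
def recall_precision(retrieved, relevant):
    """Recall and precision for a search query, given the sets of
    retrieved and relevant document ids."""
    hits = len(retrieved & relevant)  # relevant documents retrieved
    return hits / len(relevant), hits / len(retrieved)
```

For example, retrieving documents {1, 2, 3, 4} when {2, 4, 6} are relevant gives recall 2/3 and precision 2/4.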
- 32. A summary<br /> Different measures used to evaluate the false positive versus the false negative tradeoff<br />
- 33. Cost curves<br />Cost curves plot expected costs directly<br />Example for case with uniform costs (i.e. error):<br />
- 34. Cost curves<br />Example with costs:<br />
- 35. Cost curves<br />C[+|-] is the cost of predicting + when the instance is –<br />C[-|+] is the cost of predicting - when the instance is +<br />
- 36. Minimum Description Length Principle<br />The description length is defined as:<br />space required to describe a theory + space required to describe the theory’s mistakes<br />Theory = classifier, and mistakes = errors on the training data<br />We try to minimize the description length<br />The MDL theory is the one that compresses the data the most, i.e. to compress a data set we generate a model and then store the model and its mistakes<br />We need to compute:<br />The size of the model<br />The space needed to encode the errors<br />
- 37. Minimum Description Length Principle<br />The second part is easy: just use the informational loss function<br />For the first part we need a method to encode the model<br />L[T] = “length” of the theory<br />L[E|T] = length of the training set encoded with respect to the theory<br />
- 38. Minimum Description Length Principle<br />MDL and clustering<br />Description length of theory: bits needed to encode the clusters. E.g. cluster centers<br />Description length of data given theory: encode cluster membership and position relative to cluster. E.g. distance to cluster centers<br />Works if coding scheme uses less code space for small numbers than for large ones<br />
- 39. Visit more self-help tutorials<br />Pick a tutorial of your choice and browse through it at your own pace.<br />The tutorials section is free, self-guiding and does not involve any additional support.<br />Visit us at www.dataminingtools.net<br />
