2. CONTENT
● Basic Concepts
● Decision Tree Induction
● Naïve Bayesian Classification
● Accuracy and Error Measures
● Evaluating the Accuracy of a Classifier:
○ Holdout & Random Subsampling
○ Cross Validation
○ Bootstrap
59. Regression Analyses
• Regression: a technique for predicting some
variables from the known values of others
• The process of predicting variable Y using
variable X
60. Regression
⮚ Uses a variable (x) to predict some outcome
variable (y)
⮚ Tells you how values in y change as a function
of changes in values of x
61. Correlation and Regression
⮚ Correlation describes the strength of a linear
relationship between two variables
⮚ Linear means “straight line”
⮚ Regression tells us how to draw the straight line
described by the correlation
62. Regression
⮚ Calculates the “best-fit” line for a given set of data
⮚ The regression line makes the sum of the squared
residuals smaller than for any other line
⮚ Regression minimizes the residuals
63. By using the least squares method (a procedure
that minimizes the vertical deviations of the plotted
points from a straight line), we are able to construct
a best-fitting straight line through the scatter
diagram points and then formulate a regression
equation of the form:
ŷ = a + bX
where a is the y-intercept and b is the slope of the line.
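The least-squares coefficients can be computed directly from the data; a minimal sketch in plain Python (the function name and the (a, b) return order are illustrative choices, not from the slides):

```python
def least_squares(xs, ys):
    """Fit y = a + b*x by minimizing the sum of squared vertical deviations."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # b = sum((x - x̄)(y - ȳ)) / sum((x - x̄)²), then a = ȳ - b·x̄
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    b = sxy / sxx
    a = mean_y - b * mean_x
    return a, b
```

Feeding in the (hours studied, final grade) pairs from the next slides would recover an intercept near 59.95 and a slope near 3.17.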
67. Regressing grades on hours
Predicted final grade in class =
59.95 + 3.17*(number of hours you study per week)
68. Predict the final grade of…
■ Someone who studies for 12 hours
■ Final grade = 59.95 + (3.17*12)
■ Final grade = 97.99
■ Someone who studies for 1 hour:
■ Final grade = 59.95 + (3.17*1)
■ Final grade = 63.12
69. Exercise
A sample of 6 persons was selected; the values
of their age (x variable) and their weight (y
variable) are shown in the following table. Find
the regression equation and the predicted
weight when the age is 8.5 years.
74. We create a regression line by plotting two
estimated values for y against their x components,
then extending the line to the right and left.
80. EVALUATION OF CLASSIFIERS
● A few tools have been designed to evaluate the performance of a classifier.
● A confusion matrix is a specific table layout that allows visualization of the performance of a
classifier.
● Table shows the confusion matrix for a two-class classifier.
● True positives (TP) are the number of positive instances the classifier correctly identified as
positive.
● False positives (FP) are the number of instances the classifier identified as positive
but that are in reality negative.
● True negatives (TN) are the number of negative instances the classifier correctly identified
as negative.
● False negatives (FN) are the number of instances classified as negative but in reality are
positive.
● In a two-class classification, a preset threshold may be used to separate positives from
negatives.
● TP and TN are the correct guesses. A good classifier should have large TP and TN and
small (ideally zero) numbers for FP and FN.
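The four counts above can be tallied directly from actual and predicted labels; a minimal plain-Python sketch, assuming the classes are encoded as 1 (positive) and 0 (negative):

```python
def confusion_matrix(actual, predicted):
    """Tally TP, FP, TN, FN for a two-class problem (1 = positive, 0 = negative)."""
    tp = sum(1 for a, p in zip(actual, predicted) if a == 1 and p == 1)
    fp = sum(1 for a, p in zip(actual, predicted) if a == 0 and p == 1)
    tn = sum(1 for a, p in zip(actual, predicted) if a == 0 and p == 0)
    fn = sum(1 for a, p in zip(actual, predicted) if a == 1 and p == 0)
    return tp, fp, tn, fn
```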
83. The accuracy (or the overall success rate) is a metric defining the rate at which a model has classified
the records correctly. It is defined as the sum of TP and TN divided by the total number of instances, as
shown in the equation:
Accuracy = (TP + TN) / (TP + TN + FP + FN)
A good model should have a high accuracy score, but having a high accuracy score alone does not
guarantee the model is well established. The following measures can be introduced to better evaluate the
performance of a classifier.
84. The true positive rate (TPR) shows what percent of positive instances the
classifier correctly identified, as shown in the equation:
TPR = TP / (TP + FN)
85. The false negative rate (FNR) shows what percent of positives the classifier
marked as negatives. It is also known as the miss rate or type II error rate and is
shown in the equation:
FNR = FN / (TP + FN)
Note that the sum of TPR and FNR is 1.
86. A well-performing model should have a high TPR that is ideally 1 and a low FPR
and FNR that are ideally 0.
In reality, it's rare to have TPR = 1, FPR = 0, and FNR = 0, but these measures
are useful to compare the performance of multiple models that are designed for
solving the same problem.
87. Precision and recall are accuracy metrics used by the information retrieval
community, but they can be used to characterize classifiers in general.
Precision is the percentage of instances marked positive that really are positive,
as shown in the equation:
Precision = TP / (TP + FP)
88. Recall is the percentage of positive instances that were correctly identified.
Recall is equivalent to the TPR:
Recall = TP / (TP + FN) = TPR
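The measures defined above can be collected into one small helper; a minimal plain-Python sketch (the function name and dictionary keys are illustrative):

```python
def classifier_metrics(tp, fp, tn, fn):
    """Compute the evaluation measures from the confusion-matrix counts."""
    total = tp + fp + tn + fn
    return {
        "accuracy": (tp + tn) / total,   # overall success rate
        "tpr": tp / (tp + fn),           # true positive rate (= recall)
        "fnr": fn / (tp + fn),           # miss rate; TPR + FNR = 1
        "fpr": fp / (fp + tn),           # false positive rate
        "precision": tp / (tp + fp),
    }
```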
90. ROC CURVE
The ROC curve is a common tool used to evaluate classifiers.
The abbreviation stands for receiver operating characteristic, a term used in signal
detection to characterize the trade-off between hit rate and false-alarm rate over a
noisy channel.
A ROC curve evaluates the performance of a classifier based on the TPR and FPR,
regardless of other factors such as class distribution and error costs.
The vertical axis is the True Positive Rate (TPR), and the horizontal axis is the
False Positive Rate (FPR).
92. AREA UNDER CURVE
● Related to the ROC curve is the area under the curve (AUC).
● The AUC is calculated by measuring the area under the
ROC curve.
● Higher AUC scores mean the classifier performs better.
● The score can range from 0.5 (for the diagonal line
TPR=FPR) to 1.0 (with ROC passing through the top-left
corner).
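As a sketch of how the AUC can be computed from a classifier's raw scores, the following plain-Python function sweeps the decision threshold to build the ROC points and integrates with the trapezoidal rule (the function name is illustrative, and tied scores are handled naively here):

```python
def roc_auc(labels, scores):
    """Build ROC points by sweeping the threshold, then integrate (trapezoid)."""
    pairs = sorted(zip(scores, labels), reverse=True)  # highest score first
    p = sum(labels)              # number of positive instances
    n = len(labels) - p          # number of negative instances
    tp = fp = 0
    points = [(0.0, 0.0)]        # (FPR, TPR) pairs, starting at the origin
    for _, label in pairs:
        if label == 1:
            tp += 1
        else:
            fp += 1
        points.append((fp / n, tp / p))
    # Area under the piecewise-linear ROC curve.
    auc = 0.0
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        auc += (x1 - x0) * (y0 + y1) / 2.0
    return auc
```

A classifier that ranks every positive above every negative scores 1.0; one that ranks them all below scores 0.0.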
93. Holdout Method and Random Subsampling
The holdout method is what we have alluded to so far in our discussions about
accuracy. In this method, the given data are randomly partitioned into two independent
sets, a training set and a test set. Typically, two-thirds of the data are allocated to the
training set, and the remaining one-third is allocated to the test set. The training set is
used to derive the model. The model’s accuracy is then estimated with the test set. The
estimate is pessimistic because only a portion of the initial data is used to derive the
model.
Random subsampling is a variation of the holdout method in which the holdout
method is repeated k times. The overall accuracy estimate is taken as the average of
the accuracies obtained from each iteration.
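The holdout method and random subsampling described above can be sketched in plain Python (the function names, the train_and_eval callback, and the per-iteration seeding are illustrative assumptions):

```python
import random

def holdout_split(data, train_frac=2/3, seed=None):
    """Randomly partition data into independent training and test sets."""
    rng = random.Random(seed)
    shuffled = data[:]           # copy so the caller's list is untouched
    rng.shuffle(shuffled)
    cut = int(len(shuffled) * train_frac)
    return shuffled[:cut], shuffled[cut:]

def random_subsampling(data, train_and_eval, k=10, train_frac=2/3):
    """Repeat the holdout method k times and average the accuracies."""
    accuracies = []
    for i in range(k):
        train, test = holdout_split(data, train_frac, seed=i)
        accuracies.append(train_and_eval(train, test))
    return sum(accuracies) / k
```

Here train_and_eval stands in for whatever procedure derives a model from the training set and returns its accuracy on the test set.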