2. 4.1 Introduction
• Prediction can be thought of as classifying an attribute value into one of a set of possible classes. It is often viewed as forecasting a continuous value, while classification forecasts a discrete value.
• All classification techniques assume some knowledge of the data. Training data consists of sample input data as well as the classification assignment for each data tuple. Given a database $D$ of tuples and a set of classes $C$, the classification problem is to define a mapping $f: D \to C$ where each tuple is assigned to one class.
• The problem is implemented in two phases:
• Create a specific model by evaluating the training data.
• Apply the model to classify tuples from the target database.
• There are three basic methods used to solve the classification problem: 1) specifying boundaries; 2) using probability distributions; 3) using posterior probabilities.
• A major issue associated with classification is overfitting. If the classification model fits the data exactly, it may not be applicable to a broader population.
• Statistical algorithms are based directly on the use of statistical information. Distance-based algorithms use a similarity or distance measure to perform the classification. Decision tree and neural network (NN) algorithms use those structures to perform the classification. Rule-based classification algorithms generate if-then rules to perform the classification.
3. Measuring Performance and Accuracy
• Classification accuracy is usually calculated by determining the percentage of tuples placed in the correct class.
• Given a specific class, $C_j$, and a database tuple, $t_i$, the tuple may or may not be assigned to that class, while its actual membership may or may not be in that class. This gives four quadrants (see the sketch after this list):
• True positive (TP): $t_i$ predicted to be in $C_j$ and actually is in it.
• False positive (FP): $t_i$ predicted to be in $C_j$ but actually is not in it.
• True negative (TN): $t_i$ not predicted to be in $C_j$ and actually is not in it.
• False negative (FN): $t_i$ not predicted to be in $C_j$ but actually is in it.
• An OC (operating characteristic) curve or ROC (receiver operating characteristic) curve shows the relationship between false positives and true positives. The horizontal axis shows the percentage of false positives and the vertical axis the percentage of true positives for a database sample.
• A confusion matrix illustrates the accuracy of the solution to a classification problem. Given $m$ classes, a confusion matrix is an $m \times m$ matrix where entry $c_{i,j}$ indicates the number of tuples from $D$ that were assigned to class $C_i$ but where the correct class is $C_j$.
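A minimal sketch of how the four quadrant counts and a confusion matrix might be computed; the label lists below are invented for illustration:

```python
# Counting TP/FP/TN/FN for one class and building a confusion matrix.
actual    = ["A", "A", "B", "B", "A", "B"]
predicted = ["A", "B", "B", "A", "A", "B"]

def quadrants(actual, predicted, cls):
    tp = sum(a == cls and p == cls for a, p in zip(actual, predicted))
    fp = sum(a != cls and p == cls for a, p in zip(actual, predicted))
    tn = sum(a != cls and p != cls for a, p in zip(actual, predicted))
    fn = sum(a == cls and p != cls for a, p in zip(actual, predicted))
    return tp, fp, tn, fn

print(quadrants(actual, predicted, "A"))   # (2, 1, 2, 1)

# m x m confusion matrix: entry [i][j] counts tuples assigned to class i
# whose correct class is j.
classes = ["A", "B"]
matrix = [[sum(p == ci and a == cj for a, p in zip(actual, predicted))
           for cj in classes] for ci in classes]
print(matrix)                              # [[2, 1], [1, 2]]
```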
4. 4.2 Statistical Methods: Regression
• Regression used for classification deals with estimation (prediction) of an output (class) value based on input values from the database. It takes a set of data and fits the data to a formula. Classification can be performed using two different approaches: 1) Division: the data are divided into regions based on class; 2) Prediction: formulas are generated to predict the output class value.
• The prediction is an estimate rather than the actual output value. This technique does not work well with nonnumeric data.
• In cases with noisy, erroneous data and outliers, the observable data may be described as

$$y = c_0 + c_1 x_1 + \dots + c_n x_n + \epsilon$$

where $\epsilon$ is a random error with a mean of 0. The method of least squares is used to minimize the squared error: we take partial derivatives with respect to the coefficients and set them equal to zero. This approach finds the least squares estimates $c_0, c_1, \dots, c_n$ for the coefficients, so that the squared error is minimized over the set of observable values.
• We can estimate the accuracy of the fit of a linear regression model to the actual data using a mean squared error function (see the sketch below).
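A minimal sketch of a least-squares fit and its MSE, assuming NumPy is available; the data points are invented:

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 3.9, 6.2, 8.1, 9.8])          # roughly y = 2x

# Design matrix with a column of ones for the intercept c0.
X = np.column_stack([np.ones_like(x), x])
coeffs, *_ = np.linalg.lstsq(X, y, rcond=None)    # minimizes squared error
c0, c1 = coeffs

y_hat = c0 + c1 * x
mse = np.mean((y - y_hat) ** 2)                   # accuracy of the fit
print(f"c0={c0:.3f}, c1={c1:.3f}, MSE={mse:.4f}")
```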
• A commonly used regression technique is logistic regression. Logistic regression fits data to a curve such as

$$p = \frac{e^{(c_0 + c_1 x_1)}}{1 + e^{(c_0 + c_1 x_1)}}$$

• It produces values between 0 and 1, which can be interpreted as the probability of class membership. Taking the logarithm of the odds yields the logit form:

$$\log_e\left(\frac{p}{1 - p}\right) = c_0 + c_1 x_1$$

• Here $p$ is the probability of being in the class and $1 - p$ is the probability that it is not. The fitting process chooses values for $c_0$ and $c_1$ that maximize the probability of observing the given values.
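A small sketch of the logistic curve and its log-odds form; the coefficients c0 and c1 are made-up values rather than fitted ones:

```python
import math

c0, c1 = -4.0, 2.0                     # illustrative, not fitted

def p_of(x1):
    e = math.exp(c0 + c1 * x1)
    return e / (1 + e)                 # always between 0 and 1

for x1 in [0.0, 1.0, 2.0, 3.0, 4.0]:
    p = p_of(x1)
    logit = math.log(p / (1 - p))      # recovers c0 + c1 * x1
    print(f"x1={x1}: p={p:.3f}, log-odds={logit:.2f}")
```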
5. Bayesian Classification
• Assuming that the contributions of all attributes are independent and that each contributes equally to the classification problem, a classification scheme called naive Bayes can be used.
• Training data can be used to determine the prior and conditional probabilities $P(C_j)$ and $P(x_i \mid C_j)$, as well as $P(x_i)$. From these values, Bayes theorem allows us to estimate the posterior probabilities $P(C_j \mid x_i)$ and $P(C_j \mid t_i)$.
• This must be done for all attributes and all values:

$$P(t_i \mid C_j) = \prod_{k=1}^{p} P(x_{ik} \mid C_j)$$

• To calculate $P(t_i)$, we estimate the likelihoods for $t_i$ in each class and add these values.
• The posterior probability $P(C_j \mid t_i)$ is then found for each class. The class with the highest probability is the one chosen for the tuple (see the sketch after this list).
• Only one scan of the training data is needed, and the technique can handle missing values. In simple relationships this technique often yields good results.
• The technique does not handle continuous data; dividing such attributes into ranges can be used to solve this problem. Attributes usually are not independent, so we can use a subset of the attributes by ignoring those that are dependent.
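A minimal naive Bayes sketch over categorical attributes; the tiny training set is invented, and smoothing for unseen values is omitted for brevity:

```python
from collections import Counter, defaultdict

# (attribute tuple, class) training pairs -- invented data
train = [(("sunny", "hot"), "no"), (("sunny", "mild"), "no"),
         (("rainy", "mild"), "yes"), (("rainy", "hot"), "yes"),
         (("sunny", "mild"), "yes")]

class_counts = Counter(c for _, c in train)
cond = defaultdict(Counter)             # cond[(k, class)][value] counts
for attrs, c in train:
    for k, v in enumerate(attrs):
        cond[(k, c)][v] += 1

def classify(t):
    best, best_p = None, -1.0
    for c, n in class_counts.items():
        p = n / len(train)              # prior P(Cj)
        for k, v in enumerate(t):       # product of P(x_ik | Cj)
            p *= cond[(k, c)][v] / n
        if p > best_p:
            best, best_p = c, p
    return best                         # class with highest posterior

print(classify(("rainy", "mild")))      # -> 'yes'
```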
6. 4.3 Distance-based Algorithms
• Similarity (or distance) measures may be used to identify the alikeness of different items in the database. The difficulty lies in how the similarity measures are defined and applied to the items in the database. Since most measures assume numeric (often discrete) data types, a mapping from the attribute domain to a subset of the integers may be used for abstract data types.
• A simple approach assumes that each class $C_i$ is represented by its center or centroid. A new item is placed in the class whose centroid yields the largest similarity value.
• The K nearest neighbors (KNN) classification scheme requires not only the training data, but also the desired classification for each item in it. When a classification is made for a new item, its distance to each item in the training set must be determined. Only the K closest entries are considered. The new item is then placed in the class that contains the most items from this set of K closest items (see the sketch below).
• The KNN technique is extremely sensitive to the value of K. A rule of thumb is $K \le \sqrt{\text{number of training items}}$.
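A minimal KNN sketch using Euclidean distance and a majority vote among the K closest training items; the training points are invented:

```python
import math
from collections import Counter

# ((x, y) point, class) training pairs -- invented data
train = [((1.0, 1.0), "A"), ((1.2, 0.8), "A"),
         ((4.0, 4.0), "B"), ((4.2, 3.9), "B"), ((3.8, 4.1), "B")]

def knn_classify(item, k=3):
    # sort training items by distance to the new item
    nearest = sorted(train, key=lambda p: math.dist(item, p[0]))[:k]
    votes = Counter(cls for _, cls in nearest)
    return votes.most_common(1)[0][0]   # majority class among K closest

print(knn_classify((3.5, 3.5)))         # -> 'B'
```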
8. 4.4 Decision Tree-based Algorithms
Solving the classification problem using decision trees is a two-step process:
• Decision tree induction: construct a DT using the training data.
• For each $t_i \in D$, apply the DT to determine its class.
Attributes in the database schema that are used to label nodes in the tree and around which the divisions take place are called the splitting attributes. The predicates by which the arcs in the tree are labeled are called the splitting predicates. The major factors in the performance of the DT building algorithm are the size of the training set and how the best splitting attribute is chosen. The algorithm continues adding nodes and arcs to the tree recursively until some stopping criterion is reached (which can be determined in different ways).
• Advantages: easy to use; rules are easy to interpret and understand; scale well for large databases (the tree size is independent of the database size).
• Disadvantages: do not easily handle continuous data (attribute domains must be divided into categories, i.e., rectangular regions, in order to be handled); handling missing data is difficult; overfitting may occur (overcome via pruning); correlations among attributes are ignored by the DT process.
9. Issues Faced by DT Algorithms
• Choosing splitting attributes. Using the initial training data, the "best" splitting attribute is chosen first. Algorithms differ in how they determine the best attribute and its best predicates to use for splitting. The choice of attribute involves not only an examination of the data in the training set but also the informed input of domain experts.
• Ordering of splitting attributes. The order in which the attributes are chosen is also important.
• Splits (number of splits to take). If the domain is continuous or has a large number of values, the number of splits to use is not easily determined.
• Tree structure. A balanced, shorter tree with the fewest levels is desirable. Multi-way branching or binary trees (which tend to be deeper) can be used.
• Stopping criteria. The creation of the tree stops when the training data are perfectly classified. Stopping earlier may be used to prevent overfitting. Alternatively, more levels than needed may be created in a tree if it is known that there are data distributions not represented in the training data.
• Training data. The training data and the tree induction algorithm determine the tree shape. If the training data set is too small, the generated tree might not be specific enough to work properly with the more general data. If the training data set is too large, the created tree may overfit.
• Pruning. The DT building algorithms may initially build the tree and then prune it for more effective classification. Pruning is a modification of the tree that removes redundant comparisons or subtrees, aiming to achieve better performance.
10. Comparing Decision Trees
The time and space complexity of DT algorithms depends on the size of the training data, $q$; the number of attributes, $h$; and the shape of the resulting tree. This gives a time complexity to build a tree of $O(hq \log q)$. The time to classify a database of size $n$ is based on the height of the tree and is $O(n \log q)$.
11. ID3 Algorithm
• This technique for building a decision tree attempts to minimize the expected number of comparisons. It chooses the splitting attribute with the highest information gain first.
• Entropy is used to measure the amount of uncertainty, surprise, or randomness in a set of data. Given probabilities of states $p_1, p_2, \dots, p_s$ where $\sum_{i=1}^{s} p_i = 1$, entropy is defined as

$$H(p_1, p_2, \dots, p_s) = \sum_{i=1}^{s} p_i \log(1/p_i)$$

• Gain is defined as the difference between how much information is needed to make a correct classification before the split and how much information is needed after the split. The ID3 algorithm calculates the gain of a particular split by the following formula (see the sketch after this list):

$$\mathrm{Gain}(D, S) = H(D) - \sum_{i=1}^{s} P(D_i) H(D_i)$$

• The ID3 approach favors attributes with many divisions and thus may lead to overfitting. In the extreme, an attribute that has a unique value for each tuple in the training set would be the best, because there would be only one tuple (and thus one class) for each division.
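A small sketch of the entropy and gain formulas above; the class labels and the split into subsets are invented:

```python
import math

def entropy(labels):
    n = len(labels)
    probs = [labels.count(c) / n for c in set(labels)]
    return sum(p * math.log2(1 / p) for p in probs)

def gain(D, subsets):
    # Gain(D, S) = H(D) - sum over subsets D_i of P(D_i) * H(D_i)
    return entropy(D) - sum(len(Di) / len(D) * entropy(Di)
                            for Di in subsets)

D = ["yes", "yes", "no", "no", "yes", "no"]
split = [["yes", "yes", "yes"], ["no", "no", "no"]]   # a perfect split
print(entropy(D))       # 1.0 bit of uncertainty before splitting
print(gain(D, split))   # 1.0: all uncertainty removed by the split
```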
12. Entropy
a) $\log(1/p)$ shows the amount of surprise as the probability $p$ ranges from 0 to 1.
b) $p \log(1/p)$ shows the expected information based on probability $p$ of an event.
c) $p \log(1/p) + (1 - p) \log(1/(1 - p))$ shows the value of entropy. To measure the information associated with a division, we add the information associated with both events, while taking into account the probability that each occurs.
13. C4.5, C5.0 and CART
• In C4.5, splitting is based on GainRatio as opposed to Gain, which ensures a larger than average information gain:

$$\mathrm{GainRatio}(D, S) = \frac{\mathrm{Gain}(D, S)}{H\left(\frac{|D_1|}{|D|}, \dots, \frac{|D_s|}{|D|}\right)}$$

• C5.0 is based on boosting. Boosting is an approach to combining different classifiers. It does not always help when the training data contains a lot of noise. Boosting works by creating multiple training sets from one training set, so multiple classifiers are actually constructed. Each classifier is assigned a vote, voting is performed, and the target tuple is assigned to the class with the most votes.
• Classification and regression trees (CART) is a technique that generates a binary decision tree. Entropy is used as a measure to choose the best splitting attribute and criterion; however, only two children are created. At each step, an exhaustive search determines the best split, defined by:

$$\Phi(s/t) = 2 P_L P_R \sum_{j=1}^{m} \left| P(C_j \mid t_L) - P(C_j \mid t_R) \right|$$

• This formula is evaluated at the current node, $t$, for each possible splitting attribute and criterion, $s$. Here $P_L$ and $P_R$ are the probabilities that a tuple will be on the left or right side of the tree, and $P(C_j \mid t_L)$ or $P(C_j \mid t_R)$ is the probability that a tuple is in class $C_j$ and in the left or right subtree. CART requires that an ordering of the attributes be used, and it also contains a pruning strategy.
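A small sketch of evaluating $\Phi$ for one candidate binary split, interpreting $P(C_j \mid t_L)$ as the class distribution within the left branch; the counts are made up:

```python
def phi(left_counts, right_counts):
    n_left, n_right = sum(left_counts.values()), sum(right_counts.values())
    n = n_left + n_right
    p_l, p_r = n_left / n, n_right / n          # P_L and P_R
    classes = set(left_counts) | set(right_counts)
    s = sum(abs(left_counts.get(c, 0) / n_left -
                right_counts.get(c, 0) / n_right) for c in classes)
    return 2 * p_l * p_r * s

# 6 tuples go left (5 of class A, 1 of B), 4 go right (all class B)
print(phi({"A": 5, "B": 1}, {"B": 4}))          # 0.8
```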
14. Pruning
• There are two primary pruning strategies: 1) subtree replacement: a subtree is replaced by a leaf node. This results in an error rate close to that of the original tree. It works from the bottom of the tree up to the root. 2) subtree raising: replaces a subtree by its most used subtree. Here a subtree is raised from its current location to a node higher up in the tree. We must determine the increase in error rate for this replacement.
15. Scalable DT Techniques
• SPRINT (Scalable PaRallelizable Induction of decision Trees). A gini index is used to find the best split. Here gini for a database $D$ is defined as

$$\mathrm{gini}(D) = 1 - \sum p_j^2$$

where $p_j$ is the frequency of class $C_j$ in $D$. The goodness of a split of $D$ into subsets $D_1$ and $D_2$ is defined by

$$\mathrm{gini}_{\mathrm{split}}(D) = \frac{n_1}{n}\,\mathrm{gini}(D_1) + \frac{n_2}{n}\,\mathrm{gini}(D_2)$$

The split with the best gini value is chosen (see the sketch below).
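A small sketch of these gini formulas; the class label lists are invented:

```python
def gini(labels):
    n = len(labels)
    return 1 - sum((labels.count(c) / n) ** 2 for c in set(labels))

def gini_split(d1, d2):
    n = len(d1) + len(d2)
    return len(d1) / n * gini(d1) + len(d2) / n * gini(d2)

D1 = ["A", "A", "A", "B"]
D2 = ["B", "B"]
print(gini(D1 + D2))       # 0.5 before splitting
print(gini_split(D1, D2))  # 0.25: a lower value is a better split
```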
• The RainForest approach allows a choice of split attribute without needing a training set. For each node of a DT, a table called the attribute-value class (AVC) label group is used. The table summarizes, for an attribute, the count of entries per class for each attribute-value grouping. Thus, the AVC table summarizes the information needed to determine splitting attributes.
16. 4.5 Neural Network-based Algorithms
Solving a classification problem using NNs involves several steps:
• Determine the number of output nodes, which attributes should be used as input, the number of hidden layers, the weights (labels), and the functions to be used. Certain attribute values from the tuple are input into the directed graph at the corresponding source nodes. There often is one sink node for each class.
• For each tuple in the training set, propagate it through the network and evaluate the output prediction. The projected classification made by the graph can be compared with the actual classification. If the prediction is accurate, we adjust the labels to ensure that this prediction has a higher output weight the next time. If the prediction is not correct, we adjust the weights to provide a lower output value for this class.
• Propagate each tuple through the network and make the appropriate classification. The output value that is generated indicates the probability that the corresponding input tuple belongs to that class. The tuple will then be assigned to the class with the highest probability of membership.
Advantages: 1) NNs are more robust (especially in noisy environments) than DTs because of the weights; 2) the NN improves its performance by learning, which may continue even after the training set has been applied; 3) the use of NNs can be parallelized for better performance; 4) there is a low error rate and thus a high degree of accuracy once the appropriate training has been performed.
Disadvantages: 1) NNs are difficult to understand; 2) generating rules from NNs is not straightforward; 3) input attribute values must be numeric; 4) testing and verification are difficult; 5) overfitting may occur; 6) the learning phase may fail to converge, in which case the result is an estimate (not optimal).
17. NN Propagation and Error
• Given a tuple of values input to the NN, $X = \langle x_1, \dots, x_h \rangle$, one value is applied at each node in the input layer. Then the summation and activation functions are applied at each node, with an output value created for each output arc from that node. These values are sent to the subsequent nodes until a tuple of output values, $Y = \langle y_1, \dots, y_m \rangle$, is produced from the nodes in the output layer.
• Propagation occurs by applying the activation function at each node, which then places the output value on the arc to be sent as input to the next node. During the classification process only propagation occurs. However, when learning is used, after the output of the classification occurs, a comparison to the known classification is used to determine how to change the weights.
• A gradient descent technique for modifying the weights can be used to minimize the MSE. Assuming that the output from node $i$ is $y_i$ but should be $d_i$, the error produced from a node in any layer can be found by $(y_i - d_i)$, and the mean squared error (MSE) at that node by $(y_i - d_i)^2 / 2$. Thus the total MSE over all $m$ output nodes in the NN is:

$$\mathrm{MSE} = \sum_{i=1}^{m} \frac{(y_i - d_i)^2}{m}$$
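A minimal sketch of propagation through one layer followed by the MSE computation above; the weights, inputs, and desired outputs are made-up numbers, and a sigmoid activation is assumed:

```python
import math

def sigmoid(s):
    return 1 / (1 + math.exp(-s))

x = [0.5, 0.9]                        # input tuple
W = [[0.4, -0.2], [0.1, 0.6]]         # W[i][k]: arc from input k to node i

# summation then activation at each node
y = [sigmoid(sum(w * xk for w, xk in zip(row, x))) for row in W]

d = [1.0, 0.0]                        # desired outputs
mse = sum((yi - di) ** 2 for yi, di in zip(y, d)) / len(y)
print(y, mse)
```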
18. Supervised Learning in NN
• In the simplest case, learning progresses from the output layer backward to the input layer. The objective of a learning technique is to change the weights based on the output obtained for a specific input tuple. Weights are changed based on the changes that were made in the weights in subsequent arcs. This backward learning process is called backpropagation.
• With the batch or offline approach, the weights are changed after all tuples in the training set are applied and a total MSE is found. With the incremental or online approach, the weights are changed after each tuple in the training set is applied. The incremental technique is usually preferred because it requires less space and may actually examine more potential solutions.
• Suppose for a given node, $j$, the input weights are represented as a tuple $\langle w_{1j}, \dots, w_{kj} \rangle$, while the input and output values are $\langle x_{1j}, \dots, x_{kj} \rangle$ and $y_j$, respectively. The change in weights using the Hebb rule is represented by $\Delta w_{ij} = c\, x_{ij}\, y_j$. Here $c$ is a constant often called the learning rate. A rule of thumb is $c = 1 / (\text{number of entries in the training set})$.
• The delta rule examines not only the output value $y_j$ but also the desired value $d_j$ for output. In this case the change in weight is found by the rule $\Delta w_{ij} = c\, x_{ij}\, (d_j - y_j)$. The nice feature of the delta rule is that it minimizes the error $(d_j - y_j)$ at each node (see the sketch below).
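A sketch of a single delta-rule update for one node; all values are made up:

```python
x = [0.5, 0.9, 0.2]            # inputs x_1j .. x_kj to node j
w = [0.1, -0.3, 0.4]           # current weights w_1j .. w_kj
c = 0.25                       # learning rate
y_j, d_j = 0.42, 1.0           # actual and desired output

# delta rule: dw_ij = c * x_ij * (d_j - y_j)
w = [wi + c * xi * (d_j - y_j) for wi, xi in zip(w, x)]
print(w)
```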
19. Gradient Descent
• Here $c$ is referred to as the learning parameter. It typically is found in the range (0,1), although it may be larger. This value determines how fast the algorithm learns.
• We are trying to minimize the error at the output nodes, while output errors are being propagated backward through the network.
• The learning in the gradient descent technique is based on using the following value for delta at the output layer:

$$\Delta w_{ji} = -c \frac{\partial E}{\partial w_{ji}} = -c\, \frac{\partial E}{\partial y_i}\, \frac{\partial y_i}{\partial S_i}\, \frac{\partial S_i}{\partial w_{ji}}$$

• Here the weight $w_{ji}$ is on an arc coming into node $i$ from node $j$.
• The new adjusted weights become $w_{ji} = w_{ji} + \Delta w_{ji}$.
• Assuming a sigmoidal activation function for the output layer:

$$\Delta w_{ji} = c\, (d_i - y_i)\, y_i (1 - y_i)\, y_j$$
20. Gradient Descent in the Hidden Layer
• For node $j$ in the hidden layer, the change in the weights for arcs coming into it is:

$$\Delta w_{kj} = -c \frac{\partial E}{\partial w_{kj}} = -c \sum_{m} \frac{\partial E}{\partial y_m}\, \frac{\partial y_m}{\partial S_m}\, \frac{\partial S_m}{\partial y_j}\, \frac{\partial y_j}{\partial S_j}\, \frac{\partial S_j}{\partial w_{kj}}$$

• Here the variable $m$ ranges over all output nodes with arcs from $j$.
• Assuming a hyperbolic tangent activation function for the hidden layer (see the sketch after this list):

$$\Delta w_{kj} = c\, y_k\, \frac{1 - y_j^2}{2} \sum_{m} (d_m - y_m)\, w_{jm}\, y_m (1 - y_m)$$

• Another common formula for the change in weight is

$$\Delta w_{ji}(t + 1) = -c \frac{\partial E}{\partial w_{ji}} + \alpha\, \Delta w_{ji}(t)$$

• Here $\alpha$ is called the momentum and is used to prevent oscillation problems.
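A sketch of one backpropagation step for a tiny 2-input, 2-hidden-node (tanh), 1-output (sigmoid) network, applying the output-layer and hidden-layer formulas above; all numbers are made up and no momentum term is used:

```python
import math

c = 0.5                                    # learning rate
x = [1.0, 0.5]                             # input tuple
w_hid = [[0.2, -0.1], [0.4, 0.3]]          # w_hid[j][k]: input k -> hidden j
w_out = [0.6, -0.4]                        # hidden j -> the single output node
d = 1.0                                    # desired output

# forward pass: tanh in the hidden layer, sigmoid at the output
y_hid = [math.tanh(sum(w * xk for w, xk in zip(row, x))) for row in w_hid]
s_out = sum(w * yj for w, yj in zip(w_out, y_hid))
y_out = 1 / (1 + math.exp(-s_out))

# output layer: dw = c (d - y) y (1 - y) y_j   (sigmoid formula above)
delta_out = (d - y_out) * y_out * (1 - y_out)

# hidden layer: dw_kj = c y_k (1 - y_j^2)/2 sum_m (d_m - y_m) y_m (1 - y_m) w_jm,
# computed with the old output weights before they are updated
grads_hid = [(1 - y_hid[j] ** 2) / 2 * delta_out * w_out[j]
             for j in range(len(w_hid))]

# apply both weight updates
w_out = [w + c * delta_out * yj for w, yj in zip(w_out, y_hid)]
w_hid = [[w + c * g * xk for w, xk in zip(row, x)]
         for row, g in zip(w_hid, grads_hid)]

print(y_out, w_out, w_hid)
```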
21. Perceptrons
• The simplest NN is called a perceptron. A perceptron is a single neuron with multiple inputs and one output. A step or any other (e.g., sigmoidal) activation function can be used.
• A simple perceptron can be used to classify into two classes. An activation function output value of 1 would be used to classify into one class, while a value of 0 would place the item in the other class (see the sketch after this list).
• A simple feedforward neural network of perceptrons is called a multilayer perceptron (MLP). The neurons are placed in layers with outputs always flowing toward the output layer.
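A minimal perceptron sketch with a step activation; the weights and threshold are made-up values:

```python
def perceptron(inputs, weights, threshold=0.0):
    s = sum(w * x for w, x in zip(weights, inputs))
    return 1 if s > threshold else 0   # 1 -> one class, 0 -> the other

w = [0.5, -0.6, 0.2]
print(perceptron([1.0, 0.2, 0.4], w))  # 1
print(perceptron([0.1, 1.0, 0.0], w))  # 0
```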
22. MLP (Multilayer Perceptron)
• An MLP needs no more than two hidden layers. Kolmogorov's theorem states that a mapping between two sets of numbers can be performed using a NN with only one hidden layer. Given $n$ attributes, with the NN having one input node for each attribute, the hidden layer should have $2n + 1$ nodes, each with input from each of the input nodes. The output layer has one node for each desired output value.
23. 4.6 Rule-Based Algorithms
• One way to perform classification is to generate if-then rules that cover all cases. A classification rule, $r = \langle a, c \rangle$, consists of the if or antecedent part, $a$, and the then or consequent portion, $c$. The antecedent contains a predicate that can be evaluated as true or false against each tuple in the database (and in the training data).
• A DT can always be used to generate rules, one for each leaf node in the decision tree. All rules with the same consequent can be combined by ORing the antecedents of the simpler rules.
There are some differences:
• The tree has an implied order in which the splitting is performed.
• A tree is created by looking at all classes. When generating rules, only one class must be examined at a time.
24. 4.6.2 Generating Rules from a NN
• While the source NN may still be used for classification, the derived rules can be used to verify or interpret the network. The problem is that the rules do not explicitly exist: they are buried in the structure of the graph itself. In addition, if learning is still occurring, the rules themselves are dynamic.
• The rules generated tend both to be more concise and to have a lower error rate than rules used with DTs.
• The basic idea of the RX algorithm is to cluster output node activation values (with the associated hidden nodes and inputs); cluster hidden node activation values; generate rules that describe the output values in terms of the hidden activation values; generate rules that describe the hidden output values in terms of the inputs; and combine the two sets of rules.
• A major problem with rule extraction is the potential size of the rule set. For example, if a node has $n$ inputs, each with 5 values, there are $5^n$ different input combinations to this one node alone. To overcome this problem, and that of having continuous ranges of output values from nodes, the output values for both the hidden and output layers are first discretized. This is accomplished by clustering the values and dividing continuous values into disjoint ranges.
25. Generating Rules Without a DT or NN
• These techniques are sometimes called covering algorithms because they attempt to generate rules that exactly cover a specific class. They generate the best rule possible by optimizing the desired classification probability. Usually the best attribute-value pair is chosen, as opposed to the best attribute as with the tree-based algorithms.
• The 1R approach generates a simple set of rules that are equivalent to a DT with only one level. The basic idea is to choose the best attribute to perform the classification based on the training data, where "best" is defined by counting the number of errors (see the sketch after this list). 1R can handle missing data by adding an additional attribute value of "missing". As with ID3, it tends to choose attributes with a large number of values, leading to overfitting.
• Another approach to generating rules without first having a DT is called PRISM. PRISM generates rules for each class by looking at the training data and adding rules that completely describe all tuples in that class. Its accuracy on the training data is 100 percent. The algorithm refers to attribute-value pairs.
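A minimal 1R sketch that picks the majority class per attribute value and keeps the attribute with the fewest errors; the tiny weather-style training set is invented:

```python
from collections import Counter, defaultdict

rows = [({"outlook": "sunny", "windy": "no"},  "play"),
        ({"outlook": "sunny", "windy": "yes"}, "play"),
        ({"outlook": "rainy", "windy": "no"},  "stay"),
        ({"outlook": "rainy", "windy": "yes"}, "stay")]

def one_r(rows):
    best = None
    for attr in rows[0][0]:
        by_value = defaultdict(Counter)
        for attrs, cls in rows:
            by_value[attrs[attr]][cls] += 1
        # majority class per value; errors = tuples not in the majority
        rules = {v: cnt.most_common(1)[0][0] for v, cnt in by_value.items()}
        errors = sum(sum(cnt.values()) - max(cnt.values())
                     for cnt in by_value.values())
        if best is None or errors < best[2]:
            best = (attr, rules, errors)
    return best

print(one_r(rows))  # ('outlook', {'sunny': 'play', 'rainy': 'stay'}, 0)
```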
26. Combining Techniques
• Multiple independent approaches can be applied to a classification problem, each yielding its own class prediction. The results of these individual techniques can then be combined. Along with boosting, two other basic techniques can be used to combine classifiers:
• One approach assumes that there are $n$ independent classifiers and that each generates the posterior probability $P_k(C_j \mid t_i)$ for each class. The values are combined with a weighted linear combination $\sum_{k=1}^{n} w_k\, P_k(C_j \mid t_i)$ (see the sketch after this list).
• Another technique is to choose the classifier that has the best accuracy in a database sample. This is referred to as dynamic classifier selection (DCS).
• Another variation is simple voting: assign the tuple to the class to which a majority of the classifiers have assigned it.
• Adaptive classifier combination (ACC) is a further technique. Given a tuple to classify, the neighborhood around it is first determined, then the tuples in that neighborhood are classified by each classifier, and finally the accuracy for each class is measured. By examining the accuracy across all classifiers for each class, the tuple is placed in the class that has the highest local accuracy. In effect, the class chosen is that to which most of its neighbors are accurately classified, independent of classifier.
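A small sketch of the weighted linear combination and simple voting; the posterior values and weights are made up:

```python
# posteriors[k][Cj]: P_k(Cj | t_i) reported by classifier k
posteriors = [{"A": 0.7, "B": 0.3},
              {"A": 0.4, "B": 0.6},
              {"A": 0.8, "B": 0.2}]
weights = [0.5, 0.2, 0.3]                # one weight per classifier

# weighted linear combination: sum_k w_k * P_k(Cj | t_i)
combined = {c: sum(w * p[c] for w, p in zip(weights, posteriors))
            for c in ("A", "B")}
print(max(combined, key=combined.get))   # -> 'A'

# simple voting: each classifier votes for its most probable class
votes = [max(p, key=p.get) for p in posteriors]
print(max(set(votes), key=votes.count))  # -> 'A'
```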
27. Combination of Multiple Classifiers in DCS
Any shapes that are darkened indicate an incorrect classification. DCS looks at the local accuracy of each classifier: a) 7 tuples in the neighborhood are correctly classified; b) only 6 are correctly classified. Thus X will be classified according to the first classifier.
28. Summary
• No one classification technique is always superior to the others.
• The regression approaches force the data to fit a predefined model. A problem arises when a linear model is chosen for nonlinear data.
• The KNN technique requires only that the data be such that distances can be calculated. It can therefore be applied even to nonnumeric data. Outliers are handled by looking only at the K nearest neighbors.
• Bayesian classification assumes that the data attributes are independent with discrete values.
• Decision tree techniques are easy to understand, but they may lead to overfitting. To avoid this, pruning techniques may be needed.
• ID3 is applicable only to categorical data. C4.5 and C5.0 allow the use of continuous data and improved techniques for splitting. CART creates binary trees and thus may result in very deep trees.
• All algorithms are $O(n)$ to classify the $n$ items in the dataset.