Chapter 4
Classification
4.1 Introduction
• Prediction can be thought of as classifying an attribute value into one of a set of possible classes. Prediction is often viewed as forecasting a continuous value, while classification forecasts a discrete value.
• All classification techniques assume some knowledge of the data. Training data consist of sample input data as well as the classification assignment for each data tuple. Given a database D of tuples and a set of classes C, the classification problem is to define a mapping f: D → C where each tuple is assigned to one class.
• The problem is implemented in two phases:
• Create a specific model by evaluating the training data.
• Apply the model to classify tuples from the target database.
• There are three basic methods used to solve the classification problem: 1) specifying boundaries; 2) using probability distributions; 3) using posterior probabilities.
• A major issue associated with classification is overfitting. If the classification model fits the training data exactly, it may not be applicable to a broader population.
• Statistical algorithms are based directly on the use of statistical information. Distance-based algorithms use similarity or distance measures to perform the classification. Decision tree and NN approaches use those structures. Rule-based classification algorithms generate if-then rules to perform the classification.
Measuring Performance and Accuracy
• Classification accuracy is usually calculated by determining the percentage of tuples placed in the correct class.
• Given a specific class C_j and a database tuple t_i, the tuple may or may not be assigned to that class, while its actual membership may or may not be in that class. This gives four quadrants:
• True positive (TP): t_i predicted to be in C_j and is actually in it.
• False positive (FP): t_i predicted to be in C_j but is not actually in it.
• True negative (TN): t_i not predicted to be in C_j and is not actually in it.
• False negative (FN): t_i not predicted to be in C_j but is actually in it.
• An OC (operating characteristic) curve or ROC (receiver operating characteristic) curve shows the relationship between false positives and true positives. The horizontal axis shows the percentage of false positives and the vertical axis the percentage of true positives for a database sample.
• A confusion matrix illustrates the accuracy of the solution to a classification problem. Given m classes, a confusion matrix is an m × m matrix where entry c_{i,j} indicates the number of tuples from D that were assigned to class C_j but whose correct class is C_i.
4.2 Statistical Methods. Regression
• Regression used for classification deals with estimation (prediction) of an output (class) value based on input values from the database. It takes a set of data and fits the data to a formula. Classification can be performed using two different approaches: 1) Division: the data are divided into regions based on class; 2) Prediction: formulas are generated to predict the output class value.
• The prediction is an estimate rather than the actual output value. This technique does not work well with nonnumeric data.
• In cases with noisy, erroneous data or outliers, the observable data may be described as y = c0 + c1 x1 + ⋯ + cn xn + ε, where ε is a random error with a mean of 0. The method of least squares is used to minimize the squared error: we take partial derivatives with respect to the coefficients and set them equal to zero. This approach finds least squares estimates c0, c1, ⋯, cn for the coefficients so that the squared error is minimized over the set of observable values.
• We can estimate the accuracy of the fit of a linear regression model to the actual data using a mean squared error function.
• A commonly used regression technique is called logistic regression. Logistic regression fits data to a curve such as:
p = e^(c0 + c1 x1) / (1 + e^(c0 + c1 x1))
• It produces values between 0 and 1 that can be interpreted as the probability of class membership. Taking the logarithm gives the logit (log-odds) form:
log_e(p / (1 − p)) = c0 + c1 x1
• Here p is the probability of being in the class and 1 − p is the probability that it is not. The process chooses values for c0 and c1 that maximize the probability of observing the given values.
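The following is a minimal Python sketch of one way to fit the logistic curve above: maximizing the likelihood of observed 0/1 class labels with simple gradient ascent. The one-attribute data, learning rate, and epoch count are illustrative assumptions, not part of the original text.

```python
import math

def fit_logistic(xs, ys, lr=0.1, epochs=2000):
    """Fit p = e^(c0 + c1*x) / (1 + e^(c0 + c1*x)) by maximizing the
    log-likelihood of the observed 0/1 labels with gradient ascent."""
    c0, c1 = 0.0, 0.0
    for _ in range(epochs):
        for x, y in zip(xs, ys):
            p = 1.0 / (1.0 + math.exp(-(c0 + c1 * x)))
            # Gradient of the log-likelihood for one observation.
            c0 += lr * (y - p)
            c1 += lr * (y - p) * x
    return c0, c1

xs = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
ys = [0,   0,   0,   1,   1,   1  ]   # class membership (1 = in class)
c0, c1 = fit_logistic(xs, ys)
p = 1.0 / (1.0 + math.exp(-(c0 + c1 * 3.5)))
print(p)  # estimated probability that x = 3.5 belongs to the class
```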
Bayesian Classification
• Assuming that the contributions of all attributes are independent and that each contributes equally to the classification problem, the naive Bayes classification scheme can be used.
• Training data can be used to determine the prior and conditional probabilities P(C_j) and P(x_i | C_j), as well as P(x_i). From these values Bayes theorem allows us to estimate the posterior probabilities P(C_j | x_i) and P(C_j | t_i).
• This must be done for all attributes and all values:
P(t_i | C_j) = ∏_{k=1}^{p} P(x_ik | C_j)
• To calculate P(t_i) we estimate the likelihoods for t_i in each class and add these values.
• The posterior probability P(C_j | t_i) is then found for each class. The class with the highest probability is the one chosen for the tuple.
• Only one scan of the training data is needed, and the technique can handle missing values. In simple relationships this technique often yields good results.
• The technique does not handle continuous data; dividing continuous values into ranges could be used to solve this problem. Attributes usually are not independent, so we can use a subset of attributes by ignoring those that are dependent.
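A small Python sketch of the naive Bayes scheme described above, assuming categorical attributes and no smoothing of zero counts; the toy attribute values and class labels are invented for illustration.

```python
from collections import Counter, defaultdict

def train_naive_bayes(tuples, labels):
    """Estimate P(Cj) and P(x_ik | Cj) from categorical training data.
    Each tuple is a sequence of attribute values."""
    n = len(labels)
    priors = {c: cnt / n for c, cnt in Counter(labels).items()}
    cond = defaultdict(lambda: defaultdict(Counter))   # cond[c][attr][value]
    for t, c in zip(tuples, labels):
        for i, v in enumerate(t):
            cond[c][i][v] += 1
    return priors, cond, Counter(labels)

def classify(t, priors, cond, class_counts):
    """Pick the class with the highest P(Cj) * prod_k P(x_ik | Cj)."""
    best, best_score = None, -1.0
    for c, prior in priors.items():
        score = prior
        for i, v in enumerate(t):
            score *= cond[c][i][v] / class_counts[c]   # 0 if value never seen in class
        if score > best_score:
            best, best_score = c, score
    return best

tuples = [("short", "light"), ("tall", "heavy"), ("tall", "light"), ("short", "light")]
labels = ["A", "B", "B", "A"]
model = train_naive_bayes(tuples, labels)
print(classify(("tall", "light"), *model))   # "B"
```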
4.3 Distance-based Algorithms
• Similarity (or distance) measures may be used to identify the alikeness of different items in the database. The difficulty lies in how the similarity measures are defined and applied to the items in the database. Since most measures assume numeric (often discrete) data types, a mapping from the attribute domain to a subset of the integers may be used for abstract data types.
• A simple approach assumes that each class c_i is represented by its center or centroid. The new item is placed in the class whose centroid has the largest similarity value.
• The K nearest neighbors (KNN) classification scheme requires not only the training data, but also the desired classification of each item in it. When a classification is made for a new item, its distance to each item in the training set must be determined. Only the K closest entries are considered. The new item is then placed in the class that contains the most items from this set of K closest items.
• The KNN technique is extremely sensitive to the value of K. A rule of thumb is K ≤ √(number of training items).
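A minimal KNN sketch in Python, assuming numeric feature vectors and Euclidean distance (any other similarity measure could be substituted); the training tuples and the choice of K are illustrative.

```python
import math
from collections import Counter

def knn_classify(new_item, training, k):
    """training is a list of (feature_vector, class_label) pairs.
    The new item is assigned to the class held by most of its
    K nearest (Euclidean) neighbors."""
    def dist(a, b):
        return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))
    neighbors = sorted(training, key=lambda tc: dist(new_item, tc[0]))[:k]
    votes = Counter(label for _, label in neighbors)
    return votes.most_common(1)[0][0]

training = [((1.0, 1.0), "A"), ((1.2, 0.8), "A"),
            ((5.0, 5.0), "B"), ((5.5, 4.5), "B"), ((4.8, 5.2), "B")]
print(knn_classify((4.9, 5.1), training, k=3))   # "B"
```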
Centroid-based vs KNN
4.4 Decision Tree-based Algorithms
Solving the classification problem using decision trees is a two-step process:
• Decision tree induction: construct a DT using the training data.
• For each t_i ∈ D, apply the DT to determine its class.
Attributes in the database schema that are used to label nodes in the tree and around which the divisions take place are called the splitting attributes. The predicates by which the arcs in the tree are labeled are called the splitting predicates. The major factors in the performance of the DT building algorithm are the size of the training set and how the best splitting attribute is chosen. The algorithm continues adding nodes and arcs to the tree recursively until some stopping criterion is reached (which can be determined in different ways).
• Advantages: easy to use; rules are easy to interpret and understand; scale well for large databases (the tree size is independent of the database size).
• Disadvantages: do not easily handle continuous data (attribute domains must be divided into categories, i.e. rectangular regions, in order to be handled); handling missing data is difficult; overfitting may occur (overcome via pruning); correlations among attributes are ignored by the DT process.
Issues Faced by DT Algorithms
• Choosing splitting attributes. Using the initial training data, the "best" splitting attribute is chosen first. Algorithms differ in how they determine the best attribute and its best predicates to use for splitting. The choice of attribute involves not only an examination of the data in the training set but also the informed input of domain experts.
• Ordering of splitting attributes. The order in which the attributes are chosen is also important.
• Splits (number of splits to take). If the domain is continuous or has a large number of values, the number of splits to use is not easily determined.
• Tree structure. A balanced, shorter tree with the fewest levels is desirable. Multiway branching or binary trees (which tend to be deeper) can be used.
• Stopping criteria. The creation of the tree stops when the training data are perfectly classified. Stopping earlier may be used to prevent overfitting. More levels than needed would be created in a tree if it is known that there are data distributions not represented in the training data.
• Training data. The training data and the tree induction algorithm determine the tree shape. If the training data set is too small, then the generated tree might not be specific enough to work properly with the more general data. If the training data set is too large, then the created tree may overfit.
• Pruning. The DT building algorithms may initially build the tree and then prune it for more effective classification. Pruning is a modification of the tree achieved by removing redundant comparisons or subtrees, aiming for better performance.
Comparing Decision Trees
The time and space complexity of DT algorithms depends on the size of the training data q, the number of attributes h, and the shape of the resulting tree. This gives a time complexity to build a tree of O(h q log q). The time to classify a database of size n is based on the height of the tree and is O(n log q).
ID3 Algorithm
• The technique for building a decision tree attempts to minimize the expected number of comparisons. It chooses the splitting attribute with the highest information gain first.
• Entropy is used to measure the amount of uncertainty, surprise, or randomness in a set of data. Given probabilities of states p1, p2, ⋯, ps where Σ_{i=1}^{s} p_i = 1, entropy is defined as
H(p1, p2, ⋯, ps) = Σ_{i=1}^{s} p_i log(1/p_i)
• Gain is defined as the difference between how much information is needed to make a correct classification before the split and how much information is needed after the split. The ID3 algorithm calculates the gain of a particular split by the following formula:
Gain(D, S) = H(D) − Σ_{i=1}^{s} P(D_i) H(D_i)
• The ID3 approach favors attributes with many divisions and thus may lead to overfitting. In the extreme, an attribute that has a unique value for each tuple in the training set would be the best, because there would be only one tuple (and thus one class) for each division.
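A short Python sketch of the entropy and gain computations above, assuming base-2 logarithms (the slides leave the base unspecified) and a toy binary split.

```python
import math

def entropy(probabilities):
    """H(p1, ..., ps) = sum_i p_i * log2(1 / p_i)."""
    return sum(p * math.log2(1.0 / p) for p in probabilities if p > 0)

def class_probs(labels):
    n = len(labels)
    return [labels.count(c) / n for c in set(labels)]

def information_gain(labels, partitions):
    """Gain(D, S) = H(D) - sum_i P(D_i) * H(D_i), where partitions are the
    subsets of labels produced by splitting on attribute S."""
    n = len(labels)
    after = sum(len(part) / n * entropy(class_probs(part)) for part in partitions)
    return entropy(class_probs(labels)) - after

# Toy split: 10 tuples divided by a binary attribute into two subsets.
labels = ["yes"] * 6 + ["no"] * 4
partitions = [["yes"] * 5 + ["no"] * 1, ["yes"] * 1 + ["no"] * 3]
print(information_gain(labels, partitions))
```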
Entropy
a) log(1/p) shows the amount of surprise as the probability p ranges from 0 to 1.
b) p log(1/p) shows the expected information based on probability p of an event.
c) p log(1/p) + (1 − p) log(1/(1 − p)) shows the value of entropy. To measure the information associated with a division, we add the information associated with both events, while taking into account the probability that each occurs.
C4.5, C5.0 and CART
• In C4.5 splitting is based on GainRatio, as opposed to Gain, which ensures a larger-than-average information gain:
GainRatio(D, S) = Gain(D, S) / H(|D1|/|D|, ⋯, |Ds|/|D|)
• C5.0 is based on boosting. Boosting is an approach to combining different classifiers. It does not always help when the training data contain a lot of noise. Boosting works by creating multiple training sets from one training set, so multiple classifiers are actually constructed. Each classifier is assigned a vote, voting is performed, and the target tuple is assigned to the class with the largest number of votes.
• Classification and regression trees (CART) is a technique that generates a binary decision tree. Entropy is used as a measure to choose the best splitting attribute and criterion; however, only two children are created. At each step, an exhaustive search determines the best split, defined by:
Φ(s|t) = 2 P_L P_R Σ_{j=1}^{m} |P(C_j | t_L) − P(C_j | t_R)|
• This formula is evaluated at the current node t for each possible splitting attribute and criterion s. Here P_L and P_R are the probabilities that a tuple will be on the left or right side of the tree. P(C_j | t_L) or P(C_j | t_R) is the probability that a tuple is in class C_j and in the left or right subtree. CART forces an ordering of the attributes to be used, and it also contains a pruning strategy.
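As an illustration of the CART split measure, here is a small Python sketch that evaluates Φ(s|t) for one candidate binary split, reading P(C_j | t_L) and P(C_j | t_R) as the class proportions within the left and right sides of the split; the example split is invented.

```python
def cart_split_measure(left_labels, right_labels, classes):
    """Phi(s|t) = 2 * P_L * P_R * sum_j |P(C_j | t_L) - P(C_j | t_R)|
    for one candidate binary split of the tuples at node t."""
    n_left, n_right = len(left_labels), len(right_labels)
    n = n_left + n_right
    p_left, p_right = n_left / n, n_right / n
    total = 0.0
    for c in classes:
        total += abs(left_labels.count(c) / n_left - right_labels.count(c) / n_right)
    return 2.0 * p_left * p_right * total

# Candidate split of 10 training tuples at the current node.
left  = ["yes"] * 5 + ["no"] * 1
right = ["yes"] * 1 + ["no"] * 3
print(cart_split_measure(left, right, ["yes", "no"]))
```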
Pruning
• There are two primary pruning strategies: 1) subtree replacement: a subtree is replaced by a leaf node. This results in an error rate close to that of the original tree. It works from the bottom of the tree up to the root; 2) subtree raising: a subtree is replaced by its most used subtree. Here a subtree is raised from its current location to a node higher up in the tree. We must determine the increase in error rate for this replacement.
Scalable DT Techniques
• SPRINT (Scalable PaRallelizable INduction of decision Trees). A gini index is used to find the best split. Here gini for a database D is defined as
gini(D) = 1 − Σ_j p_j²
where p_j is the frequency of class C_j in D. The goodness of a split of D into subsets D1 and D2 is defined by
gini_split(D) = (n1/n) gini(D1) + (n2/n) gini(D2)
The split with the best (lowest) gini_split value is chosen (see the sketch after this list).
• The RainForest approach allows the split attribute to be chosen without needing the entire training set at each node. For each node of a DT, a table called the attribute-value, class label group (AVC group) is used. The table summarizes, for an attribute, the count of entries per class for each attribute-value grouping. Thus, the AVC table summarizes the information needed to determine splitting attributes.
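A minimal Python sketch of the gini index and the goodness-of-split measure used by SPRINT, with an invented candidate split; a lower gini_split value indicates a better split.

```python
def gini(labels):
    """gini(D) = 1 - sum_j p_j^2, where p_j is the frequency of class C_j in D."""
    n = len(labels)
    return 1.0 - sum((labels.count(c) / n) ** 2 for c in set(labels))

def gini_split(d1_labels, d2_labels):
    """Goodness of splitting D into D1 and D2: (n1/n)*gini(D1) + (n2/n)*gini(D2)."""
    n1, n2 = len(d1_labels), len(d2_labels)
    n = n1 + n2
    return n1 / n * gini(d1_labels) + n2 / n * gini(d2_labels)

d1 = ["yes"] * 5 + ["no"] * 1
d2 = ["yes"] * 1 + ["no"] * 3
print(gini_split(d1, d2))
```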
4.5 Neural Network-based Algorithms
Solving a classification problem using NNs involves several steps:
• Determine the number of output nodes, which attributes should be used as input, the number of hidden layers, the weights (labels), and the functions to be used. Certain attribute values from the tuple are input into the directed graph at the corresponding source nodes. There often is one sink node for each class.
• For each tuple in the training set, propagate it through the network and evaluate the output prediction. The projected classification made by the graph can be compared with the actual classification. If the prediction is accurate, we adjust the labels to ensure that this prediction has a higher output weight the next time. If the prediction is not correct, we adjust the weights to provide a lower output value for this class.
• Propagate each tuple through the network and make the appropriate classification. The output value that is generated indicates the probability that the corresponding input tuple belongs to that class. The tuple will then be assigned to the class with the highest probability of membership.
Advantages: 1) NNs are more robust (especially in noisy environments) than DTs because of the weights; 2) the NN improves its performance by learning, which may continue even after the training set has been applied; 3) the use of NNs can be parallelized for better performance; 4) there is a low error rate and thus a high degree of accuracy once the appropriate training has been performed.
Disadvantages: 1) NNs are difficult to understand; 2) generating rules from NNs is not straightforward; 3) input attribute values must be numeric; 4) testing and verification are difficult; 5) overfitting may occur; 6) the learning phase may fail to converge, in which case the result is an estimate (not optimal).
NN Propagation and Error
• A tuple of values X = (x1, ⋯, xh) is input to the NN, one value at each node in the input layer. Then the summation and activation functions are applied at each node, with an output value created for each output arc from that node. These values are sent to the subsequent nodes until a tuple of output values Y = (y1, ⋯, ym) is produced from the nodes in the output layer.
• Propagation occurs by applying the activation function at each node, which then places the output value on the arc to be sent as input to the next node. During the classification process only propagation occurs. However, when learning is used, after the output of the classification occurs, a comparison to the known classification is used to determine how to change the weights.
• A gradient descent technique can be used to modify the weights so as to minimize the MSE. Assuming that the output from node i is y_i but should be d_i, the error produced by a node in any layer can be found by y_i − d_i, and the mean squared error for that node is (y_i − d_i)²/2. Thus the total MSE over all m output nodes in the NN is:
MSE = Σ_{i=1}^{m} (y_i − d_i)² / m
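The following Python sketch illustrates propagation and the MSE measure: a summation followed by a sigmoidal activation at each node of a layer, applied layer by layer. The network shape, weights, and input tuple are arbitrary illustrative values.

```python
import math

def sigmoid(s):
    return 1.0 / (1.0 + math.exp(-s))

def propagate_layer(inputs, weights):
    """Apply the summation and activation functions at each node of one layer.
    weights[j][i] is the weight on the arc from input i to node j."""
    return [sigmoid(sum(w * x for w, x in zip(node_weights, inputs)))
            for node_weights in weights]

def mse(outputs, desired):
    """Total mean squared error over the m output nodes."""
    m = len(outputs)
    return sum((y - d) ** 2 for y, d in zip(outputs, desired)) / m

x = [0.5, 0.2, 0.9]                              # input tuple X = (x1, ..., xh)
hidden_w = [[0.1, 0.4, -0.2], [0.3, -0.5, 0.8]]  # 2 hidden nodes, 3 inputs each
output_w = [[0.7, -0.3]]                         # 1 output node, 2 inputs
y_hidden = propagate_layer(x, hidden_w)
y = propagate_layer(y_hidden, output_w)          # output tuple Y = (y1, ..., ym)
print(y, mse(y, [1.0]))
```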
Supervised Learning in NN
• In the simplest case learning progresses from the output layer backward to the input layer. The objective of a learning technique is to change the weights based on the output obtained for a specific input tuple. Weights are changed based on the changes that were made to the weights in subsequent arcs. This backward learning process is called backpropagation.
• With the batch or offline approach, the weights are changed after all tuples in the training set are applied and a total MSE is found. With the incremental or online approach, the weights are changed after each tuple in the training set is applied. The incremental technique is usually preferred because it requires less space and may actually examine more potential solutions.
• Suppose for a given node j the input weights are represented as a tuple (w_1j, ⋯, w_kj), while the input and output values are (x_1j, ⋯, x_kj) and y_j, respectively. The change in weights using the Hebb rule is Δw_ij = c x_ij y_j. Here c is a constant often called the learning rate. A rule of thumb is c = 1 / (number of entries in the training set).
• The delta rule examines not only the output value y_j but also the desired value d_j for the output. In this case the change in weight is found by the rule Δw_ij = c x_ij (d_j − y_j). The nice feature of the delta rule is that it minimizes the error d_j − y_j at each node.
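A small Python sketch of the Hebb and delta rule weight updates for a single node, using the rule-of-thumb learning rate from the slide; the weights, inputs, and outputs are illustrative numbers.

```python
def hebb_update(weights, inputs, output, c):
    """Hebb rule: delta w_ij = c * x_ij * y_j."""
    return [w + c * x * output for w, x in zip(weights, inputs)]

def delta_update(weights, inputs, output, desired, c):
    """Delta rule: delta w_ij = c * x_ij * (d_j - y_j)."""
    return [w + c * x * (desired - output) for w, x in zip(weights, inputs)]

weights = [0.2, -0.1, 0.4]
inputs = [1.0, 0.5, 0.3]
n_training = 100
c = 1.0 / n_training          # rule-of-thumb learning rate
print(hebb_update(weights, inputs, output=0.8, c=c))
print(delta_update(weights, inputs, output=0.8, desired=1.0, c=c))
```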
Gradient Descent
• Here η is referred to as the learning parameter. It typically lies in the range (0, 1), although it may be larger. This value determines how fast the algorithm learns.
• We are trying to minimize the error at the output nodes, while output errors are being propagated backward through the network.
• The learning in the gradient descent technique is based on using the following value for the weight change at the output layer:
Δw_ji = −η ∂E/∂w_ji = −η (∂E/∂y_i)(∂y_i/∂S_i)(∂S_i/∂w_ji)
• Here the weight w_ji is on the arc coming into node i from node j.
• The new adjusted weights become w_ji = w_ji + Δw_ji.
• Assuming a sigmoidal activation function for the output layer,
Δw_ji = η (d_i − y_i) y_j (1 − y_i) y_i
Gradient Descent in the Hidden Layer
• For node j in the hidden layer, the change in the weights for arcs coming into it is:
Δw_kj = −η ∂E/∂w_kj = −η Σ_m (∂E/∂y_m)(∂y_m/∂S_m)(∂S_m/∂y_j)(∂y_j/∂S_j)(∂S_j/∂w_kj)
• Here the variable m ranges over all output nodes with arcs from j.
• Assuming a hyperbolic tangent activation function for the hidden layer:
Δw_kj = η y_k ((1 − y_j²)/2) Σ_m (d_m − y_m) w_jm y_m (1 − y_m)
• Another common formula for the change in weight is
Δw_ji(t + 1) = −η ∂E/∂w_ji + α Δw_ji(t)
• Here α is called the momentum and is used to prevent oscillation problems.
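A minimal Python sketch of the two weight-change formulas above: one for an output node with a sigmoidal activation and one for a hidden node with a hyperbolic tangent activation. The η value and the sample activations are illustrative.

```python
def output_weight_delta(eta, d_i, y_i, y_j):
    """Output layer, sigmoidal activation:
    delta w_ji = eta * (d_i - y_i) * y_j * (1 - y_i) * y_i."""
    return eta * (d_i - y_i) * y_j * (1 - y_i) * y_i

def hidden_weight_delta(eta, y_k, y_j, output_info):
    """Hidden layer, hyperbolic tangent activation:
    delta w_kj = eta * y_k * (1 - y_j^2)/2 * sum_m (d_m - y_m)*w_jm*y_m*(1 - y_m).
    output_info lists (d_m, y_m, w_jm) for every output node m fed by node j."""
    back = sum((d_m - y_m) * w_jm * y_m * (1 - y_m) for d_m, y_m, w_jm in output_info)
    return eta * y_k * (1 - y_j ** 2) / 2 * back

eta = 0.1
print(output_weight_delta(eta, d_i=1.0, y_i=0.7, y_j=0.5))
print(hidden_weight_delta(eta, y_k=0.9, y_j=0.4,
                          output_info=[(1.0, 0.7, 0.6), (0.0, 0.2, -0.3)]))
```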
Perceptrons
• The simplest NN is called a perceptron. A perceptron is a single neuron with multiple inputs and one output. A step or any other (e.g., sigmoidal) activation function can be used.
• A simple perceptron can be used to classify into two classes. An activation function output value of 1 would be used to classify into one class, while a value of 0 would place the tuple in the other class.
• A simple feedforward neural network of perceptrons is called a multilayer perceptron (MLP). The neurons are placed in layers with outputs always flowing toward the output layer.
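A tiny Python sketch of a single perceptron with a step activation function classifying into two classes; the weights, threshold, and input tuples are illustrative.

```python
def perceptron_classify(inputs, weights, threshold=0.0):
    """Single perceptron with a step activation function: output 1 places the
    tuple in one class, output 0 places it in the other."""
    s = sum(w * x for w, x in zip(weights, inputs))
    return 1 if s > threshold else 0

weights = [0.6, -0.4]
print(perceptron_classify([1.0, 0.5], weights))   # 1 (first class)
print(perceptron_classify([0.1, 0.9], weights))   # 0 (second class)
```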
MLP (Multilayer Perceptron)
• An MLP needs no more than two hidden layers. Kolmogorov's theorem states that a mapping between two sets of numbers can be performed using a NN with only one hidden layer. Given n attributes, the NN has one input node for each attribute; the hidden layer should have 2n + 1 nodes, each with input from each of the input nodes. The output layer has one node for each desired output value.
4.6 Rule-Based Algorithms
• One way to perform classification is to generate if-then rules that cover all cases. A classification rule, r = ⟨a, c⟩, consists of the if or antecedent part, a, and the then or consequent part, c. The antecedent contains a predicate that can be evaluated as true or false against each tuple in the database (and in the training data).
• A DT can always be used to generate rules: one rule for each leaf node in the decision tree. All rules with the same consequent could be combined by ORing the antecedents of the simpler rules (see the sketch below).
There are some differences:
• The tree has an implied order in which the splitting is performed.
• A tree is created based on looking at all classes. When generating rules, only one class must be examined at a time.
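As a sketch of how rules can be read off a decision tree, the following Python example walks a small, hand-built tree (the weather-style attributes and classes are purely illustrative) and emits one if-then rule per leaf.

```python
# A tiny decision tree as nested tuples:
# (splitting_attribute, {splitting_predicate_value: subtree_or_class_label}).
tree = ("outlook", {
    "sunny":    ("humidity", {"high": "no", "normal": "yes"}),
    "overcast": "yes",
    "rain":     ("wind", {"strong": "no", "weak": "yes"}),
})

def tree_to_rules(node, antecedent=()):
    """Generate one if-then rule per leaf; rules with the same consequent
    could later be combined by ORing their antecedents."""
    if isinstance(node, str):                      # leaf: class label
        return [(antecedent, node)]
    attribute, branches = node
    rules = []
    for value, subtree in branches.items():
        rules.extend(tree_to_rules(subtree, antecedent + ((attribute, value),)))
    return rules

for antecedent, consequent in tree_to_rules(tree):
    conds = " AND ".join(f"{a} = {v}" for a, v in antecedent)
    print(f"IF {conds} THEN class = {consequent}")
```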
4.6.2 Generating Rules from a NN
• While the source NN may still be used for classification, the derived rules can be used to verify or interpret the network. The problem is that the rules do not explicitly exist; they are buried in the structure of the graph itself. In addition, if learning is still occurring, the rules themselves are dynamic.
• The rules generated tend both to be more concise and to have a lower error rate than rules used with DTs.
• The basic idea of the RX algorithm is to cluster output node activation values (with the associated hidden nodes and inputs); cluster hidden node activation values; generate rules that describe the output values in terms of the hidden activation values; generate rules that describe the hidden output values in terms of the inputs; and combine the two sets of rules.
• A major problem with rule extraction is the potential size of the rules. For example, if a node has n inputs each having 5 values, there are 5^n different input combinations for this one node alone. To overcome this problem, and that of having continuous ranges of output values from nodes, the output values for both the hidden and output layers are first discretized. This is accomplished by clustering the values and dividing continuous values into disjoint ranges.
Generating Rules Without a DT or NN
• These techniques are sometimes called covering algorithms because they attempt to generate rules that exactly cover a specific class. They generate the best rule possible by optimizing the desired classification probability. Usually the best attribute-value pair is chosen, as opposed to the best attribute as with the tree-based algorithms.
• The 1R approach generates a simple set of rules that are equivalent to a DT with only one level. The basic idea is to choose the best attribute to perform the classification based on the training data. The best is defined here by counting the number of errors. 1R can handle missing data by adding an additional attribute value of "missing". As with ID3, it tends to choose attributes with a large number of values, leading to overfitting.
• Another approach to generating rules without first having a DT is called PRISM. PRISM generates rules for each class by looking at the training data and adding rules that completely describe all tuples in that class. Its accuracy on the training data is 100 percent. The algorithm works with attribute-value pairs.
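A compact Python sketch of the 1R idea: build a one-level rule set for each attribute and keep the attribute whose rules make the fewest errors on the training data. The toy tuples and attribute names are illustrative.

```python
from collections import Counter, defaultdict

def one_r(tuples, labels, attribute_names):
    """1R: for each attribute build a rule set (value -> majority class)
    and keep the attribute whose rules make the fewest errors."""
    best = None
    for a, name in enumerate(attribute_names):
        by_value = defaultdict(Counter)
        for t, c in zip(tuples, labels):
            by_value[t[a]][c] += 1
        rules = {v: counts.most_common(1)[0][0] for v, counts in by_value.items()}
        errors = sum(sum(counts.values()) - counts.most_common(1)[0][1]
                     for counts in by_value.values())
        if best is None or errors < best[2]:
            best = (name, rules, errors)
    return best

tuples = [("sunny", "hot"), ("sunny", "mild"), ("rain", "mild"), ("rain", "hot")]
labels = ["no", "no", "yes", "yes"]
print(one_r(tuples, labels, ["outlook", "temperature"]))
# -> ('outlook', {'sunny': 'no', 'rain': 'yes'}, 0)
```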
Combining Techniques
• Multiple independent approaches can be applied to a classification problem, each yielding its own class prediction. The results of these individual techniques can then be combined. Along with boosting, two other basic techniques can be used to combine classifiers:
• One approach assumes that there are n independent classifiers and that each generates the posterior probability P_k(C_j | t_i) for each class. The values are combined with a weighted linear combination:
Σ_{k=1}^{n} w_k P_k(C_j | t_i)
• Another technique is to choose the classifier that has the best accuracy in a database sample. This is referred to as dynamic classifier selection (DCS).
• Another variation is simple voting: assign the tuple to the class to which a majority of the classifiers have assigned it.
• Adaptive classifier combination (ACC) technique: given a tuple to classify, the neighborhood around it is first determined, then the tuples in that neighborhood are classified by each classifier, and finally the accuracy for each class is measured. By examining the accuracy across all classifiers for each class, the tuple is placed in the class that has the highest local accuracy. In effect, the class chosen is the one to which most of its neighbors are accurately classified, independent of classifier.
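A small Python sketch of two of the combination schemes above: a weighted linear combination of posterior probabilities from n classifiers, and simple voting. The posterior values, weights, and class labels are invented for illustration.

```python
from collections import Counter

def combine_posteriors(posteriors, weights):
    """Weighted linear combination of n independent classifiers:
    score(C_j) = sum_k w_k * P_k(C_j | t_i); the tuple goes to the class
    with the highest combined score."""
    classes = posteriors[0].keys()
    scores = {c: sum(w * p[c] for w, p in zip(weights, posteriors)) for c in classes}
    return max(scores, key=scores.get), scores

def majority_vote(predictions):
    """Simple voting: assign the tuple to the class most classifiers chose."""
    return Counter(predictions).most_common(1)[0][0]

# Posterior probabilities P_k(C_j | t_i) from three classifiers for one tuple.
posteriors = [{"A": 0.7, "B": 0.3}, {"A": 0.4, "B": 0.6}, {"A": 0.55, "B": 0.45}]
weights = [0.5, 0.3, 0.2]
print(combine_posteriors(posteriors, weights))   # ('A', {...})
print(majority_vote(["A", "B", "A"]))            # 'A'
```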
Combination of Multiple Classifiers in DCS
Any shapes that are darkened indicate an incorrect classification. DCS looks at the local accuracy of each classifier: a) 7 tuples in the neighborhood are correctly classified; b) only 6 are correctly classified. Thus X will be classified according to the first classifier.
Summary
• No one classification technique is always superior to the others.
• The regression approaches force the data to fit a predefined model. A problem arises when a linear model is chosen for nonlinear data.
• The KNN technique requires only that the data be such that distances can be calculated. It can therefore be applied even to nonnumeric data. Outliers are handled by looking only at the K nearest neighbors.
• Bayesian classification assumes that the data attributes are independent with discrete values.
• Decision tree techniques are easy to understand, but they may lead to overfitting. To avoid this, pruning techniques may be needed.
• ID3 is applicable only to categorical data. C4.5 and C5.0 allow the use of continuous data and improved techniques for splitting. CART creates binary trees and thus may result in very deep trees.
• All these algorithms are O(n) to classify the n items in the dataset.
References:
Dunham, Margaret H. "Data Mining: Introductory and Advanced Topics". Pearson Education, Inc., 2003.
More Related Content

What's hot

Data reduction
Data reductionData reduction
Data reductionkalavathisugan
ย 
Data mining primitives
Data mining primitivesData mining primitives
Data mining primitiveslavanya marichamy
ย 
Feature Engineering in Machine Learning
Feature Engineering in Machine LearningFeature Engineering in Machine Learning
Feature Engineering in Machine LearningKnoldus Inc.
ย 
K means Clustering Algorithm
K means Clustering AlgorithmK means Clustering Algorithm
K means Clustering AlgorithmKasun Ranga Wijeweera
ย 
Genetic algorithms in Data Mining
Genetic algorithms in Data MiningGenetic algorithms in Data Mining
Genetic algorithms in Data MiningAtul Khanna
ย 
Neural Networks in Data Mining - โ€œAn Overviewโ€
Neural Networks  in Data Mining -   โ€œAn Overviewโ€Neural Networks  in Data Mining -   โ€œAn Overviewโ€
Neural Networks in Data Mining - โ€œAn Overviewโ€Dr.(Mrs).Gethsiyal Augasta
ย 
Data Mining: clustering and analysis
Data Mining: clustering and analysisData Mining: clustering and analysis
Data Mining: clustering and analysisDataminingTools Inc
ย 
Association rule mining
Association rule miningAssociation rule mining
Association rule miningAcad
ย 
Density Based Clustering
Density Based ClusteringDensity Based Clustering
Density Based ClusteringSSA KPI
ย 
2.2 decision tree
2.2 decision tree2.2 decision tree
2.2 decision treeKrish_ver2
ย 
lazy learners and other classication methods
lazy learners and other classication methodslazy learners and other classication methods
lazy learners and other classication methodsrajshreemuthiah
ย 
data generalization and summarization
data generalization and summarization data generalization and summarization
data generalization and summarization janani thirupathi
ย 
Design and analysis of algorithms
Design and analysis of algorithmsDesign and analysis of algorithms
Design and analysis of algorithmsDr Geetha Mohan
ย 
Types of clustering and different types of clustering algorithms
Types of clustering and different types of clustering algorithmsTypes of clustering and different types of clustering algorithms
Types of clustering and different types of clustering algorithmsPrashanth Guntal
ย 
2.5 backpropagation
2.5 backpropagation2.5 backpropagation
2.5 backpropagationKrish_ver2
ย 
3.5 model based clustering
3.5 model based clustering3.5 model based clustering
3.5 model based clusteringKrish_ver2
ย 
Classification techniques in data mining
Classification techniques in data miningClassification techniques in data mining
Classification techniques in data miningKamal Acharya
ย 
Data mining technique (decision tree)
Data mining technique (decision tree)Data mining technique (decision tree)
Data mining technique (decision tree)Shweta Ghate
ย 

What's hot (20)

Data reduction
Data reductionData reduction
Data reduction
ย 
Data Preprocessing
Data PreprocessingData Preprocessing
Data Preprocessing
ย 
Data mining primitives
Data mining primitivesData mining primitives
Data mining primitives
ย 
Dbscan algorithom
Dbscan algorithomDbscan algorithom
Dbscan algorithom
ย 
Feature Engineering in Machine Learning
Feature Engineering in Machine LearningFeature Engineering in Machine Learning
Feature Engineering in Machine Learning
ย 
K means Clustering Algorithm
K means Clustering AlgorithmK means Clustering Algorithm
K means Clustering Algorithm
ย 
Genetic algorithms in Data Mining
Genetic algorithms in Data MiningGenetic algorithms in Data Mining
Genetic algorithms in Data Mining
ย 
Neural Networks in Data Mining - โ€œAn Overviewโ€
Neural Networks  in Data Mining -   โ€œAn Overviewโ€Neural Networks  in Data Mining -   โ€œAn Overviewโ€
Neural Networks in Data Mining - โ€œAn Overviewโ€
ย 
Data Mining: clustering and analysis
Data Mining: clustering and analysisData Mining: clustering and analysis
Data Mining: clustering and analysis
ย 
Association rule mining
Association rule miningAssociation rule mining
Association rule mining
ย 
Density Based Clustering
Density Based ClusteringDensity Based Clustering
Density Based Clustering
ย 
2.2 decision tree
2.2 decision tree2.2 decision tree
2.2 decision tree
ย 
lazy learners and other classication methods
lazy learners and other classication methodslazy learners and other classication methods
lazy learners and other classication methods
ย 
data generalization and summarization
data generalization and summarization data generalization and summarization
data generalization and summarization
ย 
Design and analysis of algorithms
Design and analysis of algorithmsDesign and analysis of algorithms
Design and analysis of algorithms
ย 
Types of clustering and different types of clustering algorithms
Types of clustering and different types of clustering algorithmsTypes of clustering and different types of clustering algorithms
Types of clustering and different types of clustering algorithms
ย 
2.5 backpropagation
2.5 backpropagation2.5 backpropagation
2.5 backpropagation
ย 
3.5 model based clustering
3.5 model based clustering3.5 model based clustering
3.5 model based clustering
ย 
Classification techniques in data mining
Classification techniques in data miningClassification techniques in data mining
Classification techniques in data mining
ย 
Data mining technique (decision tree)
Data mining technique (decision tree)Data mining technique (decision tree)
Data mining technique (decision tree)
ย 

Similar to 04 Classification in Data Mining

UNIT 3: Data Warehousing and Data Mining
UNIT 3: Data Warehousing and Data MiningUNIT 3: Data Warehousing and Data Mining
UNIT 3: Data Warehousing and Data MiningNandakumar P
ย 
Supervised and unsupervised learning
Supervised and unsupervised learningSupervised and unsupervised learning
Supervised and unsupervised learningAmAn Singh
ย 
ML SFCSE.pptx
ML SFCSE.pptxML SFCSE.pptx
ML SFCSE.pptxNIKHILGR3
ย 
Lecture 9 - Decision Trees and Ensemble Methods, a lecture in subject module ...
Lecture 9 - Decision Trees and Ensemble Methods, a lecture in subject module ...Lecture 9 - Decision Trees and Ensemble Methods, a lecture in subject module ...
Lecture 9 - Decision Trees and Ensemble Methods, a lecture in subject module ...Maninda Edirisooriya
ย 
03 Data Mining Techniques
03 Data Mining Techniques03 Data Mining Techniques
03 Data Mining TechniquesValerii Klymchuk
ย 
Predictive analytics
Predictive analyticsPredictive analytics
Predictive analyticsDinakar nk
ย 
Singular Value Decomposition (SVD).pptx
Singular Value Decomposition (SVD).pptxSingular Value Decomposition (SVD).pptx
Singular Value Decomposition (SVD).pptxrajalakshmi5921
ย 
EDAB Module 5 Singular Value Decomposition (SVD).pptx
EDAB Module 5 Singular Value Decomposition (SVD).pptxEDAB Module 5 Singular Value Decomposition (SVD).pptx
EDAB Module 5 Singular Value Decomposition (SVD).pptxrajalakshmi5921
ย 
Data mining techniques unit iv
Data mining techniques unit ivData mining techniques unit iv
Data mining techniques unit ivmalathieswaran29
ย 
Dimensionality Reduction.pptx
Dimensionality Reduction.pptxDimensionality Reduction.pptx
Dimensionality Reduction.pptxPriyadharshiniG41
ย 
dimension reduction.ppt
dimension reduction.pptdimension reduction.ppt
dimension reduction.pptDeadpool120050
ย 
introduction to Statistical Theory.pptx
 introduction to Statistical Theory.pptx introduction to Statistical Theory.pptx
introduction to Statistical Theory.pptxDr.Shweta
ย 
Predict Backorder on a supply chain data for an Organization
Predict Backorder on a supply chain data for an OrganizationPredict Backorder on a supply chain data for an Organization
Predict Backorder on a supply chain data for an OrganizationPiyush Srivastava
ย 
Decision tree for data mining and computer
Decision tree for data mining and computerDecision tree for data mining and computer
Decision tree for data mining and computertttiba
ย 
Data discretization
Data discretizationData discretization
Data discretizationHadi M.Abachi
ย 
Machine Learning
Machine LearningMachine Learning
Machine LearningGirish Khanzode
ย 
Unit 3 โ€“ AIML.pptx
Unit 3 โ€“ AIML.pptxUnit 3 โ€“ AIML.pptx
Unit 3 โ€“ AIML.pptxhiblooms
ย 
Using Tree algorithms on machine learning
Using Tree algorithms on machine learningUsing Tree algorithms on machine learning
Using Tree algorithms on machine learningRajasekhar364622
ย 
Machine Learning Unit-5 Decesion Trees & Random Forest.pdf
Machine Learning Unit-5 Decesion Trees & Random Forest.pdfMachine Learning Unit-5 Decesion Trees & Random Forest.pdf
Machine Learning Unit-5 Decesion Trees & Random Forest.pdfAdityaSoraut
ย 

Similar to 04 Classification in Data Mining (20)

UNIT 3: Data Warehousing and Data Mining
UNIT 3: Data Warehousing and Data MiningUNIT 3: Data Warehousing and Data Mining
UNIT 3: Data Warehousing and Data Mining
ย 
Supervised and unsupervised learning
Supervised and unsupervised learningSupervised and unsupervised learning
Supervised and unsupervised learning
ย 
ML SFCSE.pptx
ML SFCSE.pptxML SFCSE.pptx
ML SFCSE.pptx
ย 
Lecture 9 - Decision Trees and Ensemble Methods, a lecture in subject module ...
Lecture 9 - Decision Trees and Ensemble Methods, a lecture in subject module ...Lecture 9 - Decision Trees and Ensemble Methods, a lecture in subject module ...
Lecture 9 - Decision Trees and Ensemble Methods, a lecture in subject module ...
ย 
03 Data Mining Techniques
03 Data Mining Techniques03 Data Mining Techniques
03 Data Mining Techniques
ย 
Predictive analytics
Predictive analyticsPredictive analytics
Predictive analytics
ย 
Singular Value Decomposition (SVD).pptx
Singular Value Decomposition (SVD).pptxSingular Value Decomposition (SVD).pptx
Singular Value Decomposition (SVD).pptx
ย 
EDAB Module 5 Singular Value Decomposition (SVD).pptx
EDAB Module 5 Singular Value Decomposition (SVD).pptxEDAB Module 5 Singular Value Decomposition (SVD).pptx
EDAB Module 5 Singular Value Decomposition (SVD).pptx
ย 
Data mining techniques unit iv
Data mining techniques unit ivData mining techniques unit iv
Data mining techniques unit iv
ย 
Dimensionality Reduction.pptx
Dimensionality Reduction.pptxDimensionality Reduction.pptx
Dimensionality Reduction.pptx
ย 
Random Forest Decision Tree.pptx
Random Forest Decision Tree.pptxRandom Forest Decision Tree.pptx
Random Forest Decision Tree.pptx
ย 
dimension reduction.ppt
dimension reduction.pptdimension reduction.ppt
dimension reduction.ppt
ย 
introduction to Statistical Theory.pptx
 introduction to Statistical Theory.pptx introduction to Statistical Theory.pptx
introduction to Statistical Theory.pptx
ย 
Predict Backorder on a supply chain data for an Organization
Predict Backorder on a supply chain data for an OrganizationPredict Backorder on a supply chain data for an Organization
Predict Backorder on a supply chain data for an Organization
ย 
Decision tree for data mining and computer
Decision tree for data mining and computerDecision tree for data mining and computer
Decision tree for data mining and computer
ย 
Data discretization
Data discretizationData discretization
Data discretization
ย 
Machine Learning
Machine LearningMachine Learning
Machine Learning
ย 
Unit 3 โ€“ AIML.pptx
Unit 3 โ€“ AIML.pptxUnit 3 โ€“ AIML.pptx
Unit 3 โ€“ AIML.pptx
ย 
Using Tree algorithms on machine learning
Using Tree algorithms on machine learningUsing Tree algorithms on machine learning
Using Tree algorithms on machine learning
ย 
Machine Learning Unit-5 Decesion Trees & Random Forest.pdf
Machine Learning Unit-5 Decesion Trees & Random Forest.pdfMachine Learning Unit-5 Decesion Trees & Random Forest.pdf
Machine Learning Unit-5 Decesion Trees & Random Forest.pdf
ย 

More from Valerii Klymchuk

Sample presentation slides template
Sample presentation slides templateSample presentation slides template
Sample presentation slides templateValerii Klymchuk
ย 
01 Introduction to Data Mining
01 Introduction to Data Mining01 Introduction to Data Mining
01 Introduction to Data MiningValerii Klymchuk
ย 
02 Related Concepts
02 Related Concepts02 Related Concepts
02 Related ConceptsValerii Klymchuk
ย 
03 Data Representation
03 Data Representation03 Data Representation
03 Data RepresentationValerii Klymchuk
ย 
05 Scalar Visualization
05 Scalar Visualization05 Scalar Visualization
05 Scalar VisualizationValerii Klymchuk
ย 
06 Vector Visualization
06 Vector Visualization06 Vector Visualization
06 Vector VisualizationValerii Klymchuk
ย 
07 Tensor Visualization
07 Tensor Visualization07 Tensor Visualization
07 Tensor VisualizationValerii Klymchuk
ย 
Crime Analysis based on Historical and Transportation Data
Crime Analysis based on Historical and Transportation DataCrime Analysis based on Historical and Transportation Data
Crime Analysis based on Historical and Transportation DataValerii Klymchuk
ย 
Artificial Intelligence for Automated Decision Support Project
Artificial Intelligence for Automated Decision Support ProjectArtificial Intelligence for Automated Decision Support Project
Artificial Intelligence for Automated Decision Support ProjectValerii Klymchuk
ย 
Data Warehouse Project
Data Warehouse ProjectData Warehouse Project
Data Warehouse ProjectValerii Klymchuk
ย 

More from Valerii Klymchuk (12)

Sample presentation slides template
Sample presentation slides templateSample presentation slides template
Sample presentation slides template
ย 
Toronto Capstone
Toronto CapstoneToronto Capstone
Toronto Capstone
ย 
01 Introduction to Data Mining
01 Introduction to Data Mining01 Introduction to Data Mining
01 Introduction to Data Mining
ย 
02 Related Concepts
02 Related Concepts02 Related Concepts
02 Related Concepts
ย 
03 Data Representation
03 Data Representation03 Data Representation
03 Data Representation
ย 
05 Scalar Visualization
05 Scalar Visualization05 Scalar Visualization
05 Scalar Visualization
ย 
06 Vector Visualization
06 Vector Visualization06 Vector Visualization
06 Vector Visualization
ย 
07 Tensor Visualization
07 Tensor Visualization07 Tensor Visualization
07 Tensor Visualization
ย 
Crime Analysis based on Historical and Transportation Data
Crime Analysis based on Historical and Transportation DataCrime Analysis based on Historical and Transportation Data
Crime Analysis based on Historical and Transportation Data
ย 
Artificial Intelligence for Automated Decision Support Project
Artificial Intelligence for Automated Decision Support ProjectArtificial Intelligence for Automated Decision Support Project
Artificial Intelligence for Automated Decision Support Project
ย 
Data Warehouse Project
Data Warehouse ProjectData Warehouse Project
Data Warehouse Project
ย 
Database Project
Database ProjectDatabase Project
Database Project
ย 

Recently uploaded

PKS-TGC-1084-630 - Stage 1 Proposal.pptx
PKS-TGC-1084-630 - Stage 1 Proposal.pptxPKS-TGC-1084-630 - Stage 1 Proposal.pptx
PKS-TGC-1084-630 - Stage 1 Proposal.pptxPramod Kumar Srivastava
ย 
20240419 - Measurecamp Amsterdam - SAM.pdf
20240419 - Measurecamp Amsterdam - SAM.pdf20240419 - Measurecamp Amsterdam - SAM.pdf
20240419 - Measurecamp Amsterdam - SAM.pdfHuman37
ย 
Saket, (-DELHI )+91-9654467111-(=)CHEAP Call Girls in Escorts Service Saket C...
Saket, (-DELHI )+91-9654467111-(=)CHEAP Call Girls in Escorts Service Saket C...Saket, (-DELHI )+91-9654467111-(=)CHEAP Call Girls in Escorts Service Saket C...
Saket, (-DELHI )+91-9654467111-(=)CHEAP Call Girls in Escorts Service Saket C...Sapana Sha
ย 
RS 9000 Call In girls Dwarka Mor (DELHI)โ‡›9711147426๐Ÿ”Delhi
RS 9000 Call In girls Dwarka Mor (DELHI)โ‡›9711147426๐Ÿ”DelhiRS 9000 Call In girls Dwarka Mor (DELHI)โ‡›9711147426๐Ÿ”Delhi
RS 9000 Call In girls Dwarka Mor (DELHI)โ‡›9711147426๐Ÿ”Delhijennyeacort
ย 
Effects of Smartphone Addiction on the Academic Performances of Grades 9 to 1...
Effects of Smartphone Addiction on the Academic Performances of Grades 9 to 1...Effects of Smartphone Addiction on the Academic Performances of Grades 9 to 1...
Effects of Smartphone Addiction on the Academic Performances of Grades 9 to 1...limedy534
ย 
Call Girls In Dwarka 9654467111 Escorts Service
Call Girls In Dwarka 9654467111 Escorts ServiceCall Girls In Dwarka 9654467111 Escorts Service
Call Girls In Dwarka 9654467111 Escorts ServiceSapana Sha
ย 
ๅŠž็†ๅญฆไฝ่ฏ็บฝ็บฆๅคงๅญฆๆฏ•ไธš่ฏ(NYUๆฏ•ไธš่ฏไนฆ๏ผ‰ๅŽŸ็‰ˆไธ€ๆฏ”ไธ€
ๅŠž็†ๅญฆไฝ่ฏ็บฝ็บฆๅคงๅญฆๆฏ•ไธš่ฏ(NYUๆฏ•ไธš่ฏไนฆ๏ผ‰ๅŽŸ็‰ˆไธ€ๆฏ”ไธ€ๅŠž็†ๅญฆไฝ่ฏ็บฝ็บฆๅคงๅญฆๆฏ•ไธš่ฏ(NYUๆฏ•ไธš่ฏไนฆ๏ผ‰ๅŽŸ็‰ˆไธ€ๆฏ”ไธ€
ๅŠž็†ๅญฆไฝ่ฏ็บฝ็บฆๅคงๅญฆๆฏ•ไธš่ฏ(NYUๆฏ•ไธš่ฏไนฆ๏ผ‰ๅŽŸ็‰ˆไธ€ๆฏ”ไธ€fhwihughh
ย 
From idea to production in a day โ€“ Leveraging Azure ML and Streamlit to build...
From idea to production in a day โ€“ Leveraging Azure ML and Streamlit to build...From idea to production in a day โ€“ Leveraging Azure ML and Streamlit to build...
From idea to production in a day โ€“ Leveraging Azure ML and Streamlit to build...Florian Roscheck
ย 
Customer Service Analytics - Make Sense of All Your Data.pptx
Customer Service Analytics - Make Sense of All Your Data.pptxCustomer Service Analytics - Make Sense of All Your Data.pptx
Customer Service Analytics - Make Sense of All Your Data.pptxEmmanuel Dauda
ย 
9654467111 Call Girls In Munirka Hotel And Home Service
9654467111 Call Girls In Munirka Hotel And Home Service9654467111 Call Girls In Munirka Hotel And Home Service
9654467111 Call Girls In Munirka Hotel And Home ServiceSapana Sha
ย 
Kantar AI Summit- Under Embargo till Wednesday, 24th April 2024, 4 PM, IST.pdf
Kantar AI Summit- Under Embargo till Wednesday, 24th April 2024, 4 PM, IST.pdfKantar AI Summit- Under Embargo till Wednesday, 24th April 2024, 4 PM, IST.pdf
Kantar AI Summit- Under Embargo till Wednesday, 24th April 2024, 4 PM, IST.pdfSocial Samosa
ย 
ๅŽŸ็‰ˆ1:1ๅฎšๅˆถๅ—ๅๅญ—ๆ˜Ÿๅคงๅญฆๆฏ•ไธš่ฏ๏ผˆSCUๆฏ•ไธš่ฏ๏ผ‰#ๆ–‡ๅ‡ญๆˆ็ปฉๅ•#็œŸๅฎž็•™ไฟกๅญฆๅŽ†่ฎค่ฏๆฐธไน…ๅญ˜ๆกฃ
ๅŽŸ็‰ˆ1:1ๅฎšๅˆถๅ—ๅๅญ—ๆ˜Ÿๅคงๅญฆๆฏ•ไธš่ฏ๏ผˆSCUๆฏ•ไธš่ฏ๏ผ‰#ๆ–‡ๅ‡ญๆˆ็ปฉๅ•#็œŸๅฎž็•™ไฟกๅญฆๅŽ†่ฎค่ฏๆฐธไน…ๅญ˜ๆกฃๅŽŸ็‰ˆ1:1ๅฎšๅˆถๅ—ๅๅญ—ๆ˜Ÿๅคงๅญฆๆฏ•ไธš่ฏ๏ผˆSCUๆฏ•ไธš่ฏ๏ผ‰#ๆ–‡ๅ‡ญๆˆ็ปฉๅ•#็œŸๅฎž็•™ไฟกๅญฆๅŽ†่ฎค่ฏๆฐธไน…ๅญ˜ๆกฃ
ๅŽŸ็‰ˆ1:1ๅฎšๅˆถๅ—ๅๅญ—ๆ˜Ÿๅคงๅญฆๆฏ•ไธš่ฏ๏ผˆSCUๆฏ•ไธš่ฏ๏ผ‰#ๆ–‡ๅ‡ญๆˆ็ปฉๅ•#็œŸๅฎž็•™ไฟกๅญฆๅŽ†่ฎค่ฏๆฐธไน…ๅญ˜ๆกฃ208367051
ย 
Identifying Appropriate Test Statistics Involving Population Mean
Identifying Appropriate Test Statistics Involving Population MeanIdentifying Appropriate Test Statistics Involving Population Mean
Identifying Appropriate Test Statistics Involving Population MeanMYRABACSAFRA2
ย 
Consent & Privacy Signals on Google *Pixels* - MeasureCamp Amsterdam 2024
Consent & Privacy Signals on Google *Pixels* - MeasureCamp Amsterdam 2024Consent & Privacy Signals on Google *Pixels* - MeasureCamp Amsterdam 2024
Consent & Privacy Signals on Google *Pixels* - MeasureCamp Amsterdam 2024thyngster
ย 
Multiple time frame trading analysis -brianshannon.pdf
Multiple time frame trading analysis -brianshannon.pdfMultiple time frame trading analysis -brianshannon.pdf
Multiple time frame trading analysis -brianshannon.pdfchwongval
ย 
Predicting Salary Using Data Science: A Comprehensive Analysis.pdf
Predicting Salary Using Data Science: A Comprehensive Analysis.pdfPredicting Salary Using Data Science: A Comprehensive Analysis.pdf
Predicting Salary Using Data Science: A Comprehensive Analysis.pdfBoston Institute of Analytics
ย 
2006_GasProcessing_HB (1).pdf HYDROCARBON PROCESSING
2006_GasProcessing_HB (1).pdf HYDROCARBON PROCESSING2006_GasProcessing_HB (1).pdf HYDROCARBON PROCESSING
2006_GasProcessing_HB (1).pdf HYDROCARBON PROCESSINGmarianagonzalez07
ย 
Easter Eggs From Star Wars and in cars 1 and 2
Easter Eggs From Star Wars and in cars 1 and 2Easter Eggs From Star Wars and in cars 1 and 2
Easter Eggs From Star Wars and in cars 1 and 217djon017
ย 
NLP Data Science Project Presentation:Predicting Heart Disease with NLP Data ...
NLP Data Science Project Presentation:Predicting Heart Disease with NLP Data ...NLP Data Science Project Presentation:Predicting Heart Disease with NLP Data ...
NLP Data Science Project Presentation:Predicting Heart Disease with NLP Data ...Boston Institute of Analytics
ย 

Recently uploaded (20)

PKS-TGC-1084-630 - Stage 1 Proposal.pptx
PKS-TGC-1084-630 - Stage 1 Proposal.pptxPKS-TGC-1084-630 - Stage 1 Proposal.pptx
PKS-TGC-1084-630 - Stage 1 Proposal.pptx
ย 
20240419 - Measurecamp Amsterdam - SAM.pdf
20240419 - Measurecamp Amsterdam - SAM.pdf20240419 - Measurecamp Amsterdam - SAM.pdf
20240419 - Measurecamp Amsterdam - SAM.pdf
ย 
Saket, (-DELHI )+91-9654467111-(=)CHEAP Call Girls in Escorts Service Saket C...
Saket, (-DELHI )+91-9654467111-(=)CHEAP Call Girls in Escorts Service Saket C...Saket, (-DELHI )+91-9654467111-(=)CHEAP Call Girls in Escorts Service Saket C...
Saket, (-DELHI )+91-9654467111-(=)CHEAP Call Girls in Escorts Service Saket C...
ย 
RS 9000 Call In girls Dwarka Mor (DELHI)โ‡›9711147426๐Ÿ”Delhi
RS 9000 Call In girls Dwarka Mor (DELHI)โ‡›9711147426๐Ÿ”DelhiRS 9000 Call In girls Dwarka Mor (DELHI)โ‡›9711147426๐Ÿ”Delhi
RS 9000 Call In girls Dwarka Mor (DELHI)โ‡›9711147426๐Ÿ”Delhi
ย 
Effects of Smartphone Addiction on the Academic Performances of Grades 9 to 1...
Effects of Smartphone Addiction on the Academic Performances of Grades 9 to 1...Effects of Smartphone Addiction on the Academic Performances of Grades 9 to 1...
Effects of Smartphone Addiction on the Academic Performances of Grades 9 to 1...
ย 
Call Girls In Dwarka 9654467111 Escorts Service
Call Girls In Dwarka 9654467111 Escorts ServiceCall Girls In Dwarka 9654467111 Escorts Service
Call Girls In Dwarka 9654467111 Escorts Service
ย 
ๅŠž็†ๅญฆไฝ่ฏ็บฝ็บฆๅคงๅญฆๆฏ•ไธš่ฏ(NYUๆฏ•ไธš่ฏไนฆ๏ผ‰ๅŽŸ็‰ˆไธ€ๆฏ”ไธ€
ๅŠž็†ๅญฆไฝ่ฏ็บฝ็บฆๅคงๅญฆๆฏ•ไธš่ฏ(NYUๆฏ•ไธš่ฏไนฆ๏ผ‰ๅŽŸ็‰ˆไธ€ๆฏ”ไธ€ๅŠž็†ๅญฆไฝ่ฏ็บฝ็บฆๅคงๅญฆๆฏ•ไธš่ฏ(NYUๆฏ•ไธš่ฏไนฆ๏ผ‰ๅŽŸ็‰ˆไธ€ๆฏ”ไธ€
ๅŠž็†ๅญฆไฝ่ฏ็บฝ็บฆๅคงๅญฆๆฏ•ไธš่ฏ(NYUๆฏ•ไธš่ฏไนฆ๏ผ‰ๅŽŸ็‰ˆไธ€ๆฏ”ไธ€
ย 
From idea to production in a day โ€“ Leveraging Azure ML and Streamlit to build...
From idea to production in a day โ€“ Leveraging Azure ML and Streamlit to build...From idea to production in a day โ€“ Leveraging Azure ML and Streamlit to build...
From idea to production in a day โ€“ Leveraging Azure ML and Streamlit to build...
ย 
Customer Service Analytics - Make Sense of All Your Data.pptx
Customer Service Analytics - Make Sense of All Your Data.pptxCustomer Service Analytics - Make Sense of All Your Data.pptx
Customer Service Analytics - Make Sense of All Your Data.pptx
ย 
9654467111 Call Girls In Munirka Hotel And Home Service
9654467111 Call Girls In Munirka Hotel And Home Service9654467111 Call Girls In Munirka Hotel And Home Service
9654467111 Call Girls In Munirka Hotel And Home Service
ย 
Kantar AI Summit- Under Embargo till Wednesday, 24th April 2024, 4 PM, IST.pdf
Kantar AI Summit- Under Embargo till Wednesday, 24th April 2024, 4 PM, IST.pdfKantar AI Summit- Under Embargo till Wednesday, 24th April 2024, 4 PM, IST.pdf
Kantar AI Summit- Under Embargo till Wednesday, 24th April 2024, 4 PM, IST.pdf
ย 
ๅŽŸ็‰ˆ1:1ๅฎšๅˆถๅ—ๅๅญ—ๆ˜Ÿๅคงๅญฆๆฏ•ไธš่ฏ๏ผˆSCUๆฏ•ไธš่ฏ๏ผ‰#ๆ–‡ๅ‡ญๆˆ็ปฉๅ•#็œŸๅฎž็•™ไฟกๅญฆๅŽ†่ฎค่ฏๆฐธไน…ๅญ˜ๆกฃ
ๅŽŸ็‰ˆ1:1ๅฎšๅˆถๅ—ๅๅญ—ๆ˜Ÿๅคงๅญฆๆฏ•ไธš่ฏ๏ผˆSCUๆฏ•ไธš่ฏ๏ผ‰#ๆ–‡ๅ‡ญๆˆ็ปฉๅ•#็œŸๅฎž็•™ไฟกๅญฆๅŽ†่ฎค่ฏๆฐธไน…ๅญ˜ๆกฃๅŽŸ็‰ˆ1:1ๅฎšๅˆถๅ—ๅๅญ—ๆ˜Ÿๅคงๅญฆๆฏ•ไธš่ฏ๏ผˆSCUๆฏ•ไธš่ฏ๏ผ‰#ๆ–‡ๅ‡ญๆˆ็ปฉๅ•#็œŸๅฎž็•™ไฟกๅญฆๅŽ†่ฎค่ฏๆฐธไน…ๅญ˜ๆกฃ
ๅŽŸ็‰ˆ1:1ๅฎšๅˆถๅ—ๅๅญ—ๆ˜Ÿๅคงๅญฆๆฏ•ไธš่ฏ๏ผˆSCUๆฏ•ไธš่ฏ๏ผ‰#ๆ–‡ๅ‡ญๆˆ็ปฉๅ•#็œŸๅฎž็•™ไฟกๅญฆๅŽ†่ฎค่ฏๆฐธไน…ๅญ˜ๆกฃ
ย 
Identifying Appropriate Test Statistics Involving Population Mean
Identifying Appropriate Test Statistics Involving Population MeanIdentifying Appropriate Test Statistics Involving Population Mean
Identifying Appropriate Test Statistics Involving Population Mean
ย 
Consent & Privacy Signals on Google *Pixels* - MeasureCamp Amsterdam 2024
Consent & Privacy Signals on Google *Pixels* - MeasureCamp Amsterdam 2024Consent & Privacy Signals on Google *Pixels* - MeasureCamp Amsterdam 2024
Consent & Privacy Signals on Google *Pixels* - MeasureCamp Amsterdam 2024
ย 
Multiple time frame trading analysis -brianshannon.pdf
Multiple time frame trading analysis -brianshannon.pdfMultiple time frame trading analysis -brianshannon.pdf
Multiple time frame trading analysis -brianshannon.pdf
ย 
Predicting Salary Using Data Science: A Comprehensive Analysis.pdf
Predicting Salary Using Data Science: A Comprehensive Analysis.pdfPredicting Salary Using Data Science: A Comprehensive Analysis.pdf
Predicting Salary Using Data Science: A Comprehensive Analysis.pdf
ย 
2006_GasProcessing_HB (1).pdf HYDROCARBON PROCESSING
2006_GasProcessing_HB (1).pdf HYDROCARBON PROCESSING2006_GasProcessing_HB (1).pdf HYDROCARBON PROCESSING
2006_GasProcessing_HB (1).pdf HYDROCARBON PROCESSING
ย 
Easter Eggs From Star Wars and in cars 1 and 2
Easter Eggs From Star Wars and in cars 1 and 2Easter Eggs From Star Wars and in cars 1 and 2
Easter Eggs From Star Wars and in cars 1 and 2
ย 
Deep Generative Learning for All - The Gen AI Hype (Spring 2024)
Deep Generative Learning for All - The Gen AI Hype (Spring 2024)Deep Generative Learning for All - The Gen AI Hype (Spring 2024)
Deep Generative Learning for All - The Gen AI Hype (Spring 2024)
ย 
NLP Data Science Project Presentation:Predicting Heart Disease with NLP Data ...
NLP Data Science Project Presentation:Predicting Heart Disease with NLP Data ...NLP Data Science Project Presentation:Predicting Heart Disease with NLP Data ...
NLP Data Science Project Presentation:Predicting Heart Disease with NLP Data ...
ย 

04 Classification in Data Mining

  • 2. 4.1 Introduction โ€ข Prediction can be thought of as classifying an attribute value into one of set of possible classes. It is often viewed as forecasting a continuous value, while classification forecasts a discrete value. โ€ข All classification techniques assume some knowledge of the data. Training data consists of sample input data as well as the classification assignment for each data tuple. Given a database ๐ท of tuples and a set of classes ๐ถ, the classification problem is to define a mapping ๐‘“: ๐ท โ†’ ๐ถ where each tuple is assigned to one class. โ€ข The problem is implemented in two phases: โ€ข Create a specific model by evaluating the training data. โ€ข Apply the model to classifying tuples from the target database. โ€ข There are three basic methods used to solve the classification problem: 1) specifying boundaries; 2) using probability distributions; 3) using posterior probabilities. โ€ข A major issue associated with classification is overfitting. If the classification model fits the data exactly, it may not be applicable to a broader population. โ€ข Statistical algorithms are based directly on the use of statistical information. Distance-based algorithms use similarity or distance measure to perform the classification. Decision trees and NN use those structures. Rule based classification algorithms generate if-then rules to perform classification.
  • 3. Measuring Performance and Accuracy โ€ข Classification accuracy is usually calculated by determining the percentage of tuples placed in the correct class. โ€ข Given a specific class and a database tuple may or may not be assigned to that class while its actual membership may or may not be in that class. This gives us four quadrants: โ€ข True positive (TP): ๐‘ก๐‘– predicted to be in ๐ถ๐‘— and is actually in it. โ€ข False positive (FP): ๐‘ก๐‘– predicted to be in ๐ถ๐‘— but is not actually in it. โ€ข True negative (TN): ๐‘ก๐‘– not predicted to be in ๐ถ๐‘— and is not actually in it. โ€ข False negative (FN): ๐‘ก๐‘– not predicted to be in ๐ถ๐‘— but is actually in it. โ€ข An OC (operating characteristic) curve or ROC (receiver operating characteristic) curve shows the relationship between false positives and true positives. The horizontal axis has the percentage of false positives and the vertical axis has the percentage of true positives for a database sample. โ€ข A confusion matrix illustrates the accuracy of the solution to a classification problem. Given ๐‘š classes, a confusion matrix is an ๐‘š ร— ๐‘š matrix where entry ๐‘๐‘–,๐‘— indicates the number of tuples from ๐ท that were assigned to class ๐ถ๐‘— but where the correct class is ๐ถ๐‘–.
  • 4. 4.2 Statistical Methods. Regression โ€ข Regression used for classification deals with estimation (prediction) of an output (class) value based on input values from the database. It takes a set of data and fits the data to a formula. Classification can be performed using two different approaches: 1) Division: The data are divided into regions based on class; 2) Prediction: Formulas are generated to predict the output class value. โ€ข The prediction is an estimate rather than the actual output value. This technique does not work well with nonnumeric data. โ€ข In cases with noisy, erroneous data, outliers, the observable data may be described as โˆถ ๐‘ฆ = ๐‘0 + ๐‘1 ๐‘ฅ1 + โ‹ฏ + ๐‘ ๐‘› ๐‘ฅ ๐‘› + ๐œ–, where ๐œ– is a random error with a mean of 0. A method of least squares is used to minimize the least squared error. We first take partial derivatives with respect to coefficients and set them equal to zero. This approach finds least square estimates ๐‘0, ๐‘1, โ‹ฏ ๐‘ ๐‘› for the coefficients so that the squared error is minimized for the set of observable values. โ€ข We can estimate the accuracy of the fit of a linear regression model to the actual data using a mean squared error function. โ€ข A commonly used regression technique is called logistic regression. Logistic regression fits data to a curve such as: ๐‘ = ๐‘’(๐‘0+๐‘1 ๐‘ฅ1) 1 + ๐‘’(๐‘0+๐‘1 ๐‘ฅ1) โ€ข It produces values between 0 and 1 and can be interpreted as probability of class membership. The logarithm is applied to obtain the logistic function: log ๐‘’ ๐‘ 1 โˆ’ ๐‘ = ๐‘0 + ๐‘1 ๐‘ฅ1 โ€ข Here ๐‘ is the probability of being in the class and 1 โˆ’ ๐‘ is the probability that it is not. The process chooses values for ๐‘0 and ๐‘1 that maximize the probability of observing the given values.
  • 5. Bayesian Classification
  • Assuming that the contributions of all attributes are independent and that each contributes equally to the classification problem, a classification scheme called naive Bayes can be used.
  • Training data can be used to determine the prior and conditional probabilities $P(C_j)$ and $P(x_i \mid C_j)$, as well as $P(x_i)$. From these values Bayes theorem allows us to estimate the posterior probabilities $P(C_j \mid x_i)$ and $P(C_j \mid t_i)$.
  • This must be done for all attributes and all values: $P(t_i \mid C_j) = \prod_{k=1}^{p} P(x_{ik} \mid C_j)$.
  • To calculate $P(t_i)$ we estimate the likelihoods for $t_i$ in each class and add these values.
  • The posterior probability $P(C_j \mid t_i)$ is then found for each class. The class with the highest probability is the one chosen for the tuple.
  • Only one scan of the training data is needed, and the technique can handle missing values. In simple relationships this technique often yields good results.
  • The technique does not handle continuous data. Dividing continuous attributes into ranges could be used to solve this problem. Attributes usually are not independent, so we can use a subset of attributes by ignoring those that are dependent.
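  • A minimal naive Bayes sketch with made-up categorical training tuples (no smoothing for attribute values never seen with a class):
```python
from collections import Counter, defaultdict

# Illustrative only: each entry is (tuple of attribute values, class label).
training = [
    (("sunny", "hot"), "no"), (("sunny", "mild"), "no"),
    (("rain", "mild"), "yes"), (("rain", "cool"), "yes"), (("overcast", "hot"), "yes"),
]

priors = Counter(label for _, label in training)          # counts used for P(Cj)
cond = defaultdict(Counter)                               # counts used for P(x_ik | Cj)
for values, label in training:
    for k, v in enumerate(values):
        cond[(label, k)][v] += 1

def posterior_scores(values):
    """Unnormalized P(Cj | ti) per class: P(Cj) * prod_k P(x_ik | Cj)."""
    scores = {}
    for label, n_c in priors.items():
        p = n_c / len(training)
        for k, v in enumerate(values):
            p *= cond[(label, k)][v] / n_c                # zero if the value never occurs with Cj
        scores[label] = p
    return scores

scores = posterior_scores(("rain", "hot"))
print(max(scores, key=scores.get))                        # class with the highest posterior
```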
  • 6. 4.3 Distance-based Algorithms
  • Similarity (or distance) measures may be used to identify the alikeness of different items in the database. The difficulty lies in how the similarity measures are defined and applied to the items in the database. Since most measures assume numeric (often discrete) data types, a mapping from the attribute domain to a subset of the integers may be used for abstract data types.
  • A simple approach assumes that each class $C_i$ is represented by its center or centroid. The new item is placed in the class whose centroid gives the largest similarity value.
  • The K nearest neighbors (KNN) classification scheme requires not only the training data, but also the desired classification for each item in it. When a classification is made for a new item, its distance to each item in the training set must be determined. Only the K closest entries are considered. The new item is then placed in the class that contains the most items from this set of K closest items.
  • The KNN technique is extremely sensitive to the value of K. A rule of thumb is $K \le \sqrt{\text{number of training items}}$.
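  • A minimal KNN sketch under Euclidean distance, assuming made-up two-dimensional training items and K = 3:
```python
import numpy as np

# Illustrative only: training items with known classes.
train_X = np.array([[1.0, 1.0], [1.2, 0.8], [5.0, 5.0], [5.2, 4.9], [4.8, 5.3]])
train_y = np.array(["A", "A", "B", "B", "B"])

def knn_classify(item, K=3):
    dists = np.linalg.norm(train_X - item, axis=1)        # distance to every training item
    nearest = train_y[np.argsort(dists)[:K]]              # classes of the K closest items
    labels, counts = np.unique(nearest, return_counts=True)
    return labels[np.argmax(counts)]                      # majority class among the K

print(knn_classify(np.array([4.9, 5.1])))                 # expected: "B"
```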
  • 8. 4.4 Decision Tree-based Algorithms
  • Solving the classification problem using decision trees is a 2-step process:
  • Decision tree induction: construct a DT using the training data.
  • For each $t_i \in D$, apply the DT to determine its class.
  • Attributes in the database schema that are used to label nodes in the tree and around which the divisions take place are called the splitting attributes. The predicates by which the arcs in the tree are labeled are called the splitting predicates. The major factors in the performance of the DT building algorithm are the size of the training set and how the best splitting attribute is chosen. The algorithm continues adding nodes and arcs to the tree recursively until some stopping criterion is reached (which can be determined in different ways).
  • Advantages: easy to use; rules are easy to interpret and understand; scale well for large databases (the tree size is independent of the database size).
  • Disadvantages: do not easily handle continuous data (attribute domains must be divided into categories, i.e., rectangular regions, in order to be handled); handling missing data is difficult; overfitting may occur (overcome via pruning); correlations among attributes are ignored by the DT process.
  • 9. Issues Faced by DT Algorithms
  • Choosing splitting attributes. Using the initial training data, the "best" splitting attribute is chosen first. Algorithms differ in how they determine the best attribute and its best predicates to use for splitting. The choice of attribute involves not only an examination of the data in the training set but also the informed input of domain experts.
  • Ordering of splitting attributes. The order in which the attributes are chosen is also important.
  • Splits (number of splits to take). If the domain is continuous or has a large number of values, the number of splits to use is not easily determined.
  • Tree structure. A balanced, shorter tree with the fewest levels is desirable. Multi-way branching or binary trees (which tend to be deeper) can be used.
  • Stopping criteria. The creation of the tree stops when the training data are perfectly classified. Stopping earlier may be used to prevent overfitting; if it is known that there are data distributions not represented in the training data, more levels than needed would otherwise be created.
  • Training data. The training data and the tree induction algorithm determine the tree shape. If the training data set is too small, the generated tree might not be specific enough to work properly with the more general data. If the training data set is too large, the created tree may overfit.
  • Pruning. The DT building algorithms may initially build the tree and then prune it for more effective classification. Pruning is a modification of the tree that removes redundant comparisons or subtrees, aiming at better performance.
  • 10. Comparing Decision Trees
  • The time and space complexity of DT algorithms depends on the size of the training data q, the number of attributes h, and the shape of the resulting tree. This gives a time complexity to build a tree of $O(hq \log q)$. The time to classify a database of size n is based on the height of the tree and is $O(n \log q)$.
  • 11. ID3 Algorithm
  • This technique for building a decision tree attempts to minimize the expected number of comparisons. It chooses the splitting attribute with the highest information gain first.
  • Entropy is used to measure the amount of uncertainty, surprise, or randomness in a set of data. Given probabilities of states $p_1, p_2, \dots, p_s$ where $\sum_{i=1}^{s} p_i = 1$, entropy is defined as $H(p_1, p_2, \dots, p_s) = \sum_{i=1}^{s} p_i \log(1/p_i)$.
  • Gain is defined as the difference between how much information is needed to make a correct classification before the split and how much information is needed after the split. The ID3 algorithm calculates the gain of a particular split by the formula $\mathrm{Gain}(D, S) = H(D) - \sum_{i=1}^{s} P(D_i) H(D_i)$.
  • The ID3 approach favors attributes with many divisions and thus may lead to overfitting. In the extreme, an attribute that has a unique value for each tuple in the training set would appear best, because there would be only one tuple (and thus one class) for each division.
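  • A minimal sketch of these entropy and gain calculations (the class labels and the candidate split below are invented for illustration):
```python
import math
from collections import Counter

def entropy(labels):
    """H(p1, ..., ps) = sum_i p_i * log2(1 / p_i) over the class frequencies in `labels`."""
    n = len(labels)
    return sum((c / n) * math.log2(n / c) for c in Counter(labels).values())

def gain(labels, split):
    """Gain(D, S) = H(D) - sum_i P(D_i) * H(D_i), where `split` partitions the same tuples."""
    n = len(labels)
    return entropy(labels) - sum(len(part) / n * entropy(part) for part in split)

labels = ["yes", "yes", "yes", "no", "no", "no"]               # class column of D (made up)
split_by_attr = [["yes", "yes", "no"], ["yes", "no", "no"]]    # D split on one candidate attribute
print(entropy(labels), gain(labels, split_by_attr))
```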
  • 12. Entropy
  • a) $\log(1/p)$ shows the amount of surprise as the probability $p$ ranges from 0 to 1.
  • b) $p \log(1/p)$ shows the expected information based on probability $p$ of an event.
  • c) $p \log(1/p) + (1-p) \log(1/(1-p))$ shows the value of entropy.
  • To measure the information associated with a division, we add the information associated with both events, taking into account the probability that each occurs.
  • 13. C4.5, C5.0 and CART
  • In C4.5 splitting is based on GainRatio, as opposed to Gain, which ensures a larger than average information gain: $\mathrm{GainRatio}(D, S) = \frac{\mathrm{Gain}(D, S)}{H\!\left(\frac{|D_1|}{|D|}, \dots, \frac{|D_s|}{|D|}\right)}$.
  • C5.0 is based on boosting. Boosting is an approach to combining different classifiers. It does not always help when the training data contains a lot of noise. Boosting works by creating multiple training sets from one training set, so multiple classifiers are actually constructed. Each classifier is assigned a vote, voting is performed, and the target tuple is assigned to the class with the largest number of votes.
  • Classification and regression trees (CART) is a technique that generates a binary decision tree. Entropy is used as a measure to choose the best splitting attribute and criterion; however, only 2 children are created. At each step, an exhaustive search determines the best split, defined by $\Phi(s/t) = 2 P_L P_R \sum_{j=1}^{m} \left| P(C_j \mid t_L) - P(C_j \mid t_R) \right|$.
  • This formula is evaluated at the current node $t$ for each possible splitting attribute and criterion $s$. Here $P_L$ and $P_R$ are the probabilities that a tuple will be on the left or right side of the tree, and $P(C_j \mid t_L)$ or $P(C_j \mid t_R)$ is the probability that a tuple is in class $C_j$ and in the left or right subtree. CART requires that an ordering of the attributes be used, and it also contains a pruning strategy.
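  • A minimal sketch of the CART split measure $\Phi(s/t)$: given the class labels sent to the left and right children by a candidate binary split (labels below are made up), the split with the larger value is preferred.
```python
from collections import Counter

def cart_phi(left, right):
    """Phi(s/t) = 2 * P_L * P_R * sum_j |P(Cj|t_L) - P(Cj|t_R)| for one candidate split."""
    n = len(left) + len(right)
    p_l, p_r = len(left) / n, len(right) / n
    classes = set(left) | set(right)
    diff = sum(abs(Counter(left)[c] / len(left) - Counter(right)[c] / len(right))
               for c in classes)
    return 2 * p_l * p_r * diff

# Evaluate two candidate splits of the same made-up training tuples.
print(cart_phi(["yes", "yes", "yes"], ["no", "no"]))   # clean split  -> larger Phi
print(cart_phi(["yes", "no", "yes"], ["no", "yes"]))   # mixed split  -> smaller Phi
```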
  • 14. Pruning
  • There are two primary pruning strategies: 1) subtree replacement: a subtree is replaced by a leaf node; this results in an error rate close to that of the original tree, and it works from the bottom of the tree up to the root; 2) subtree raising: replaces a subtree by its most used subtree, so a subtree is raised from its current location to a node higher up in the tree. We must determine the increase in error rate for this replacement.
  • 15. Scalable DT Techniques
  • SPRINT (Scalable PaRallelizable Induction of decision Trees). A gini index is used to find the best split. Here gini for a database D is defined as $\mathrm{gini}(D) = 1 - \sum_j p_j^2$, where $p_j$ is the frequency of class $C_j$ in $D$. The goodness of a split of $D$ into subsets $D_1$ and $D_2$ is defined by $\mathrm{gini}_{\mathrm{split}}(D) = \frac{n_1}{n}\,\mathrm{gini}(D_1) + \frac{n_2}{n}\,\mathrm{gini}(D_2)$. The split with the best gini value is chosen.
  • The RainForest approach allows a choice of split attribute without needing the entire training set. For each node of a DT, a table called the attribute-value class (AVC) label group is used. The table summarizes, for an attribute, the count of entries per class for each attribute-value grouping. Thus, the AVC table summarizes the information needed to determine splitting attributes.
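  • A minimal sketch of the gini calculations used to score a candidate split (class labels are made up):
```python
from collections import Counter

def gini(labels):
    """gini(D) = 1 - sum_j p_j^2, where p_j is the frequency of class Cj in D."""
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())

def gini_split(d1, d2):
    """gini_split(D) = (n1/n)*gini(D1) + (n2/n)*gini(D2); the split with the lowest value wins."""
    n = len(d1) + len(d2)
    return len(d1) / n * gini(d1) + len(d2) / n * gini(d2)

print(gini_split(["yes", "yes", "yes"], ["no", "no", "yes"]))   # made-up class labels
```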
  • 16. 4.5 Neural Network-based Algorithms
  • Solving a classification problem using NNs involves several steps:
  • Determine the number of output nodes, what attributes should be used as input, the number of hidden layers, the weights (labels), and the functions to be used. Certain attribute values from the tuple are input into the directed graph at the corresponding source nodes. There often is one sink node for each class.
  • For each tuple in the training set, propagate it through the network and evaluate the output prediction. The projected classification made by the graph can be compared with the actual classification. If the prediction is accurate, we adjust the weights (labels) to ensure that this prediction has a higher output value the next time. If the prediction is not correct, we adjust the weights to provide a lower output value for this class.
  • Propagate each tuple through the network and make the appropriate classification. The output value that is generated indicates the probability that the corresponding input tuple belongs to that class. The tuple is then assigned to the class with the highest probability of membership.
  • Advantages: 1) NNs are more robust (especially in noisy environments) than DTs because of the weights; 2) the NN improves its performance by learning, and this may continue even after the training set has been applied; 3) the use of NNs can be parallelized for better performance; 4) there is a low error rate and thus a high degree of accuracy once the appropriate training has been performed.
  • Disadvantages: 1) NNs are difficult to understand; 2) generating rules from NNs is not straightforward; 3) input attribute values must be numeric; 4) testing and verification are difficult; 5) overfitting may occur; 6) the learning phase may fail to converge, in which case the result is an estimate (not optimal).
  • 17. NN Propagation and Error
  • Given a tuple of values input to the NN, $X = \langle x_1, \dots, x_h \rangle$, one value is fed to each node in the input layer. The summation and activation functions are then applied at each node, with an output value created for each output arc from that node. These values are sent to the subsequent nodes until a tuple of output values $Y = \langle y_1, \dots, y_m \rangle$ is produced from the nodes in the output layer.
  • Propagation occurs by applying the activation function at each node, which then places the output value on the arc to be sent as input to the next node. During the classification process only propagation occurs. However, when learning is used, after the output of the classification occurs, a comparison to the known classification is used to determine how to change the weights.
  • A gradient descent technique for modifying the weights can be used to minimize the MSE. Assuming that the output from node $i$ is $y_i$ but should be $d_i$, the error produced from a node in any layer is $(y_i - d_i)$, and the mean squared error (MSE) at that node is $(y_i - d_i)^2 / 2$. Thus the total MSE over all $m$ output nodes in the NN is $\mathrm{MSE} = \sum_{i=1}^{m} \frac{(y_i - d_i)^2}{m}$.
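  • A minimal sketch of propagation through one hidden layer and of the output-layer MSE; the weights, input tuple, and the choice of a sigmoid activation are illustrative assumptions.
```python
import numpy as np

def sigmoid(s):
    return 1.0 / (1.0 + np.exp(-s))

x = np.array([0.5, 0.9])                      # input tuple X = <x1, ..., xh> (made up)
W_hidden = np.array([[0.2, -0.4, 0.1],        # weights on arcs input -> hidden
                     [0.7,  0.3, -0.2]])
W_output = np.array([[0.5, -0.3],             # weights on arcs hidden -> output
                     [0.1,  0.8],
                     [-0.6, 0.4]])

h = sigmoid(x @ W_hidden)                     # summation then activation at each hidden node
y = sigmoid(h @ W_output)                     # output tuple Y = <y1, ..., ym>

d = np.array([1.0, 0.0])                      # desired (known) classification
mse = np.mean((y - d) ** 2)                   # sum_i (yi - di)^2 / m
print(y, mse)
```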
  • 18. Supervised Learning in NN
  • In the simplest case, learning progresses from the output layer backward to the input layer. The objective of a learning technique is to change the weights based on the output obtained for a specific input tuple. Weights are changed based on the changes that were made in the weights in subsequent arcs. This backward learning process is called backpropagation.
  • With the batch or offline approach, the weights are changed after all tuples in the training set are applied and a total MSE is found. With the incremental or online approach, the weights are changed after each tuple in the training set is applied. The incremental technique is usually preferred because it requires less space and may actually examine more potential solutions.
  • Suppose for a given node $j$ the input weights are represented as a tuple $\langle w_{1j}, \dots, w_{kj} \rangle$, while the input and output values are $\langle x_{1j}, \dots, x_{kj} \rangle$ and $y_j$, respectively. The change in weights using the Hebb rule is $\Delta w_{ij} = c\, x_{ij}\, y_j$. Here $c$ is a constant often called the learning rate. A rule of thumb is $c = \frac{1}{\#\text{entries in training set}}$.
  • The delta rule examines not only the output value $y_j$ but also the desired value $d_j$ for the output. In this case the change in weight is found by the rule $\Delta w_{ij} = c\, x_{ij}\, (d_j - y_j)$. The nice feature of the delta rule is that it minimizes the error $(d_j - y_j)$ at each node.
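  • A minimal sketch of the Hebb and delta updates for a single node j; the inputs, current weights, learning rate, and the identity activation are illustrative assumptions.
```python
import numpy as np

x_j = np.array([0.5, 1.0, -0.3])      # inputs x1j, ..., xkj into node j (made up)
w_j = np.array([0.1, -0.2, 0.4])      # current weights w1j, ..., wkj
c = 0.1                               # learning rate; rule of thumb: 1 / (#entries in training set)

y_j = float(x_j @ w_j)                # output of node j (identity activation for simplicity)
d_j = 1.0                             # desired output

w_hebb  = w_j + c * x_j * y_j           # Hebb rule:  delta w_ij = c * x_ij * y_j
w_delta = w_j + c * x_j * (d_j - y_j)   # Delta rule: delta w_ij = c * x_ij * (d_j - y_j)
print(w_hebb, w_delta)
```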
  • 19. Gradient Descent
  • Here $\eta$ is referred to as the learning parameter. It typically is in the range $(0, 1)$, although it may be larger. This value determines how fast the algorithm learns.
  • We are trying to minimize the error at the output nodes, while output errors are being propagated backward through the network.
  • The learning in the gradient descent technique is based on using the following value for the weight change at the output layer: $\Delta w_{ji} = -\eta \frac{\partial E}{\partial w_{ji}} = -\eta \frac{\partial E}{\partial y_i} \frac{\partial y_i}{\partial S_i} \frac{\partial S_i}{\partial w_{ji}}$, where the weight $w_{ji}$ is on the arc coming into node $i$ from node $j$.
  • The adjusted weights then become $w_{ji} = w_{ji} + \Delta w_{ji}$.
  • Assuming a sigmoidal activation function for the output layer: $\Delta w_{ji} = \eta\, (d_i - y_i)\, y_j\, (1 - y_i)\, y_i$.
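  • A minimal sketch of the output-layer update under the sigmoidal assumption above; the hidden-node outputs, weights, and desired output are made up.
```python
import numpy as np

eta = 0.5                              # learning parameter, typically in (0, 1)
y_hidden = np.array([0.6, 0.2, 0.9])   # outputs y_j of hidden nodes feeding output node i (made up)
w_ji = np.array([0.3, -0.1, 0.7])      # weights on arcs from each hidden node j into output node i
y_i = 1.0 / (1.0 + np.exp(-float(y_hidden @ w_ji)))   # sigmoid output of node i
d_i = 1.0                              # desired output for node i

delta_w = eta * (d_i - y_i) * y_hidden * (1.0 - y_i) * y_i   # eta*(d_i - y_i)*y_j*(1 - y_i)*y_i
w_ji = w_ji + delta_w                  # adjusted weights
print(w_ji)
```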
  • 20. Gradient Descent in the Hidden Layer
  • For node $j$ in the hidden layer, the change in the weights for arcs coming into it is: $\Delta w_{kj} = -\eta \frac{\partial E}{\partial w_{kj}} = -\eta \sum_{m} \frac{\partial E}{\partial y_m} \frac{\partial y_m}{\partial S_m} \frac{\partial S_m}{\partial y_j} \frac{\partial y_j}{\partial S_j} \frac{\partial S_j}{\partial w_{kj}}$.
  • Here the variable $m$ ranges over all output nodes with arcs from $j$.
  • Assuming a hyperbolic tangent activation function for the hidden layer: $\Delta w_{kj} = \eta\, y_k \frac{(1 - y_j^2)}{2} \sum_{m} (d_m - y_m)\, w_{jm}\, y_m (1 - y_m)$.
  • Another common formula for the change in weight is $\Delta w_{ji}(t+1) = -\eta \frac{\partial E}{\partial w_{ji}} + \alpha\, \Delta w_{ji}(t)$.
  • Here $\alpha$ is called the momentum and is used to prevent oscillation problems.
  • 21. Perceptrons
  • The simplest NN is called a perceptron. A perceptron is a single neuron with multiple inputs and one output. A step or any other (e.g., sigmoidal) activation function can be used.
  • A simple perceptron can be used to classify into two classes. An activation function output value of 1 would be used to place the tuple in one class, while a value of 0 would place it in the other class.
  • A simple feed-forward neural network of perceptrons is called a multilayer perceptron (MLP). The neurons are placed in layers with outputs always flowing toward the output layer.
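  • A minimal sketch of a single perceptron with a step activation, classifying into two classes; the weights and bias are made up.
```python
import numpy as np

weights = np.array([0.6, -0.4])
bias = -0.1

def perceptron(inputs):
    s = float(np.dot(weights, inputs)) + bias   # summation function
    return 1 if s > 0 else 0                    # step activation: 1 -> one class, 0 -> the other

print(perceptron(np.array([1.0, 0.5])), perceptron(np.array([0.2, 0.9])))
```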
  • 22. MLP (Multilayer Perceptron)
  • An MLP needs no more than 2 hidden layers. Kolmogorov's theorem states that a mapping between two sets of numbers can be performed using a NN with only one hidden layer. Given n attributes, with the NN having one input node for each attribute, the hidden layer should have 2n + 1 nodes, each with input from each of the input nodes. The output layer has one node for each desired output value.
  • 23. 4.6 Rule-Based Algorithms
  • One way to perform classification is to generate if-then rules that cover all cases. A classification rule, $r = \langle a, c \rangle$, consists of the if or antecedent part, $a$, and the then or consequent portion, $c$. The antecedent contains a predicate that can be evaluated as true or false against each tuple in the database (and in the training data).
  • A DT can always be used to generate rules: one rule for each leaf node in the decision tree. All rules with the same consequent could be combined by ORing the antecedents of the simpler rules. There are some differences:
  • The tree has an implied order in which the splitting is performed.
  • A tree is created based on looking at all classes. When generating rules, only one class must be examined at a time.
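  • As a small illustration (the attributes and rules are invented), classification rules can be represented as antecedent/consequent pairs and applied in order:
```python
# Each rule r = <a, c>: the antecedent is a predicate over a tuple (a dict of attribute
# values), and the consequent is the class assigned when the predicate is true.
rules = [
    (lambda t: t.get("outlook") == "sunny" and t.get("humidity") == "high", "no"),
    (lambda t: t.get("outlook") == "overcast", "yes"),
    (lambda t: t.get("outlook") == "rain", "yes"),
]

def classify(t):
    for antecedent, consequent in rules:    # the first rule whose antecedent is true fires
        if antecedent(t):
            return consequent
    return None

print(classify({"outlook": "sunny", "humidity": "high"}))   # -> "no"
```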
  • 24. 4.6.2 Generating Rules from a NN
  • While the source NN may still be used for classification, the derived rules can be used to verify or interpret the network. The problem is that the rules do not explicitly exist: they are buried in the structure of the graph itself. In addition, if learning is still occurring, the rules themselves are dynamic.
  • The rules generated tend both to be more concise and to have a lower error rate than rules used with DTs.
  • The basic idea of the RX algorithm is to: cluster the output node activation values (with the associated hidden nodes and inputs); cluster the hidden node activation values; generate rules that describe the output values in terms of the hidden activation values; generate rules that describe the hidden values in terms of the inputs; and combine the two sets of rules.
  • A major problem with rule extraction is the potential size of the rule set. For example, if a node has n inputs each having 5 values, there are 5^n different input combinations to this one node alone. To overcome this problem, and that of having continuous ranges of output values from nodes, the output values for both the hidden and output layers are first discretized. This is accomplished by clustering the values and dividing continuous values into disjoint ranges.
  • 25. Generating Rules Without a DT or NN
  • These techniques are sometimes called covering algorithms because they attempt to generate rules that exactly cover a specific class. They generate the best rule possible by optimizing the desired classification probability. Usually the best attribute-value pair is chosen, as opposed to the best attribute as in the tree-based algorithms.
  • The 1R approach generates a simple set of rules that are equivalent to a DT with only one level. The basic idea is to choose the best attribute to perform the classification based on the training data. "Best" is defined here by counting the number of errors. 1R can handle missing data by adding an additional attribute value of "missing". As with ID3, it tends to choose attributes with a large number of values, leading to overfitting.
  • Another approach to generating rules without first having a DT is called PRISM. PRISM generates rules for each class by looking at the training data and adding rules that completely describe all tuples in that class. Its accuracy on the training data is 100 percent. The algorithm refers to attribute-value pairs.
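  • A minimal sketch of the 1R idea with made-up categorical training tuples: for each attribute, build one rule per attribute value (predicting that value's majority class) and keep the attribute whose rules make the fewest errors.
```python
from collections import Counter, defaultdict

# Illustrative only: each entry is ({attribute: value, ...}, class label).
training = [
    ({"outlook": "sunny", "windy": "no"}, "no"),
    ({"outlook": "sunny", "windy": "yes"}, "no"),
    ({"outlook": "rain",  "windy": "no"}, "yes"),
    ({"outlook": "rain",  "windy": "yes"}, "no"),
    ({"outlook": "overcast", "windy": "no"}, "yes"),
]

def one_r(training):
    best = None
    for attr in training[0][0]:
        by_value = defaultdict(Counter)
        for t, cls in training:
            by_value[t[attr]][cls] += 1                     # class counts per attribute value
        rules = {v: counts.most_common(1)[0][0] for v, counts in by_value.items()}
        errors = sum(cls != rules[t[attr]] for t, cls in training)
        if best is None or errors < best[2]:
            best = (attr, rules, errors)
    return best

print(one_r(training))    # chosen attribute, its value -> class rules, and its error count
```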
  • 26. Combining Techniques
  • Multiple independent approaches can be applied to a classification problem, each yielding its own class prediction. The results of these individual techniques can then be combined. Along with boosting, two other basic techniques can be used to combine classifiers:
  • One approach assumes that there are n independent classifiers and that each generates the posterior probability $P_k(C_j \mid t_i)$ for each class. The values are combined with a weighted linear combination $\sum_{k=1}^{n} w_k P_k(C_j \mid t_i)$.
  • Another technique is to choose the classifier that has the best accuracy in a database sample. This is referred to as dynamic classifier selection (DCS).
  • Another variation is simple voting: assign the tuple to the class to which a majority of the classifiers have assigned it.
  • Adaptive classifier combination (ACC) technique: given a tuple to classify, the neighborhood around it is first determined, then the tuples in that neighborhood are classified by each classifier, and finally the accuracy for each class is measured. By examining the accuracy across all classifiers for each class, the tuple is placed in the class that has the highest local accuracy. In effect, the class chosen is the one to which most of its neighbors are accurately classified, independent of classifier.
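  • A minimal sketch of two of these combination schemes, assuming three made-up classifiers, their posterior probabilities for two classes, and arbitrary weights:
```python
import numpy as np

posteriors = np.array([[0.7, 0.3],     # classifier 1: P_1(C1|ti), P_1(C2|ti)
                       [0.4, 0.6],     # classifier 2
                       [0.8, 0.2]])    # classifier 3
weights = np.array([0.5, 0.2, 0.3])    # one weight w_k per classifier

combined = weights @ posteriors        # sum_k w_k * P_k(Cj|ti) for each class Cj
print("weighted choice:", int(np.argmax(combined)))

votes = np.argmax(posteriors, axis=1)  # each classifier votes for its most probable class
print("majority vote:", int(np.bincount(votes).argmax()))
```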
  • 27. Combination of Multiple Classifiers in DCS
  • (Figure: the neighborhood of a tuple X under two classifiers; darkened shapes indicate an incorrect classification.) DCS looks at the local accuracy of each classifier: a) 7 tuples in the neighborhood are correctly classified; b) only 6 are correctly classified. Thus X will be classified according to the first classifier.
  • 28. Summary
  • No one classification technique is always superior to the others.
  • The regression approaches force the data to fit a predefined model. A problem arises when a linear model is chosen for nonlinear data.
  • The KNN technique requires only that distances can be calculated between data items, so it can be applied even to nonnumeric data. Outliers are handled by looking only at the K nearest neighbors.
  • Bayesian classification assumes that the data attributes are independent and have discrete values.
  • Decision tree techniques are easy to understand, but they may lead to overfitting. To avoid this, pruning techniques may be needed.
  • ID3 is applicable only to categorical data. C4.5 and C5.0 allow the use of continuous data and improved techniques for splitting. CART creates binary trees and thus may result in very deep trees.
  • All of the algorithms are O(n) to classify the n items in the dataset.
  • 29. References: Dunham, Margaret H. "Data Mining: Introductory and Advanced Topics". Pearson Education, Inc., 2003.