Chapter 4
Classification
4.1 Introduction
• Prediction can be thought of as classifying an attribute value into one of a set of possible classes. It is often viewed as forecasting a continuous value, while classification forecasts a discrete value.
• All classification techniques assume some knowledge of the data. Training data consist of sample input data as well as the classification assignment for each data tuple. Given a database D of tuples and a set of classes C, the classification problem is to define a mapping f: D → C where each tuple is assigned to one class.
• The problem is implemented in two phases:
• Create a specific model by evaluating the training data.
• Apply the model to classify tuples from the target database.
• There are three basic methods used to solve the classification problem: 1) specifying boundaries; 2) using probability distributions; 3) using posterior probabilities.
• A major issue associated with classification is overfitting. If the classification model fits the data exactly, it may not be applicable to a broader population.
• Statistical algorithms are based directly on the use of statistical information. Distance-based algorithms use a similarity or distance measure to perform the classification. Decision tree (DT) and neural network (NN) algorithms use those structures. Rule-based classification algorithms generate if-then rules to perform the classification.
Measuring Performance and Accuracy
• Classification accuracy is usually calculated by determining the percentage of tuples placed in the correct class.
• Given a specific class C_j and a database tuple t_i, the tuple may or may not be assigned to that class, while its actual membership may or may not be in that class. This gives us four quadrants:
• True positive (TP): t_i predicted to be in C_j and is actually in it.
• False positive (FP): t_i predicted to be in C_j but is not actually in it.
• True negative (TN): t_i not predicted to be in C_j and is not actually in it.
• False negative (FN): t_i not predicted to be in C_j but is actually in it.
• An OC (operating characteristic) curve or ROC (receiver operating characteristic) curve shows the relationship between false positives and true positives. The horizontal axis has the percentage of false positives and the vertical axis has the percentage of true positives for a database sample.
• A confusion matrix illustrates the accuracy of the solution to a classification problem. Given m classes, a confusion matrix is an m × m matrix where entry c_{i,j} indicates the number of tuples from D that were assigned to class C_j but whose correct class is C_i.
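As an illustration, here is a minimal sketch (in Python, not part of the original slides) of computing the four quadrant counts and a confusion matrix from hypothetical lists of actual and predicted class labels; rows of the matrix correspond to the correct class C_i and columns to the assigned class C_j, as defined above.

```python
def confusion_counts(actual, predicted, positive_class):
    """Count TP, FP, TN, FN for one class C_j treated as the 'positive' class."""
    tp = sum(1 for a, p in zip(actual, predicted) if p == positive_class and a == positive_class)
    fp = sum(1 for a, p in zip(actual, predicted) if p == positive_class and a != positive_class)
    tn = sum(1 for a, p in zip(actual, predicted) if p != positive_class and a != positive_class)
    fn = sum(1 for a, p in zip(actual, predicted) if p != positive_class and a == positive_class)
    return tp, fp, tn, fn

def confusion_matrix(actual, predicted, classes):
    """Entry [i][j] = number of tuples whose correct class is classes[i]
    but which were assigned to classes[j]."""
    index = {c: k for k, c in enumerate(classes)}
    matrix = [[0] * len(classes) for _ in classes]
    for a, p in zip(actual, predicted):
        matrix[index[a]][index[p]] += 1
    return matrix

# Hypothetical actual vs. predicted class assignments for five tuples
actual    = ["tall", "short", "tall", "medium", "short"]
predicted = ["tall", "tall",  "tall", "medium", "short"]
print(confusion_counts(actual, predicted, "tall"))                      # (2, 1, 2, 0)
for row in confusion_matrix(actual, predicted, ["short", "medium", "tall"]):
    print(row)
```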
4.2 Statistical Methods: Regression
• Regression used for classification deals with estimation (prediction) of an output (class) value based on input values from the database. It takes a set of data and fits the data to a formula. Classification can be performed using two different approaches: 1) Division: the data are divided into regions based on class; 2) Prediction: formulas are generated to predict the output class value.
• The prediction is an estimate rather than the actual output value. This technique does not work well with nonnumeric data.
• In cases with noisy, erroneous data or outliers, the observable data may be described as y = c_0 + c_1 x_1 + ⋯ + c_n x_n + ε, where ε is a random error with a mean of 0. A method of least squares is used to minimize the squared error: partial derivatives are taken with respect to the coefficients and set equal to zero. This approach finds least squares estimates c_0, c_1, ⋯, c_n for the coefficients so that the squared error is minimized for the set of observable values.
• We can estimate the accuracy of the fit of a linear regression model to the actual data using a mean squared error function.
• A commonly used regression technique is called logistic regression. Logistic regression fits data to a curve such as:
p = e^(c_0 + c_1 x_1) / (1 + e^(c_0 + c_1 x_1))
• It produces values between 0 and 1 and can be interpreted as the probability of class membership. The logarithm is applied to obtain the logistic function:
log_e(p / (1 − p)) = c_0 + c_1 x_1
• Here p is the probability of being in the class and 1 − p is the probability that it is not. The process chooses values for c_0 and c_1 that maximize the probability of observing the given values.
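The following minimal sketch illustrates the logistic curve and the log-odds identity above. The coefficient fit uses plain gradient ascent on the log-likelihood over a tiny hypothetical one-attribute data set; the slides do not prescribe a particular fitting procedure, so this is only one possible approach.

```python
import math

def logistic_p(c0, c1, x1):
    """p = e^(c0 + c1*x1) / (1 + e^(c0 + c1*x1)): probability of class membership."""
    z = c0 + c1 * x1
    return math.exp(z) / (1.0 + math.exp(z))

def fit_logistic(xs, ys, rate=0.1, epochs=2000):
    """Choose c0, c1 that (approximately) maximize the likelihood of the observed
    0/1 class labels, via gradient ascent on the log-likelihood."""
    c0 = c1 = 0.0
    for _ in range(epochs):
        for x, y in zip(xs, ys):
            p = logistic_p(c0, c1, x)
            c0 += rate * (y - p)        # partial derivative of log-likelihood w.r.t. c0
            c1 += rate * (y - p) * x    # partial derivative of log-likelihood w.r.t. c1
    return c0, c1

# Hypothetical one-attribute, two-class training data
xs = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
ys = [0, 0, 0, 1, 1, 1]
c0, c1 = fit_logistic(xs, ys)
p = logistic_p(c0, c1, 3.5)
# log(p / (1 - p)) equals c0 + c1*x1, the logistic (log-odds) form above
print(round(p, 3), round(math.log(p / (1 - p)), 3), round(c0 + c1 * 3.5, 3))
```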
Bayesian Classification
• Assuming that the contributions of all attributes are independent and that each contributes equally to the classification problem, a classification scheme called naive Bayes can be used.
• Training data can be used to determine the prior and conditional probabilities P(C_j) and P(x_i | C_j), as well as P(x_i). From these values Bayes theorem allows us to estimate the posterior probabilities P(C_j | x_i) and P(C_j | t_i).
• This must be done for all attributes and all values:
P(t_i | C_j) = ∏_{k=1}^{p} P(x_ik | C_j)
• To calculate P(t_i) we estimate the likelihoods for t_i in each class and add these values.
• The posterior probability P(C_j | t_i) is then found for each class. The class with the highest probability is the one chosen for the tuple (see the sketch below).
• Only one scan of the training data is needed, and the technique can handle missing values. In simple relationships it often yields good results.
• The technique does not handle continuous data; dividing such attributes into ranges can be used to solve this problem. Attributes usually are not independent, so we can use a subset of attributes by ignoring those that are dependent.
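A minimal naive Bayes sketch over hypothetical categorical training data, estimating P(C_j) and P(x_ik | C_j) by counting and choosing the class with the highest product, as described above (no smoothing of zero counts is applied):

```python
from collections import Counter, defaultdict

def train_naive_bayes(tuples, labels):
    """Estimate P(Cj) and P(x_ik | Cj) from categorical training data by counting."""
    n = len(labels)
    class_counts = Counter(labels)
    priors = {c: cnt / n for c, cnt in class_counts.items()}
    cond = defaultdict(lambda: defaultdict(Counter))   # cond[class][attr_index][value] -> count
    for t, c in zip(tuples, labels):
        for i, v in enumerate(t):
            cond[c][i][v] += 1
    return priors, cond, class_counts

def naive_bayes_classify(t, priors, cond, class_counts):
    """Pick the class with the highest P(Cj) * prod_k P(x_ik | Cj)."""
    best_class, best_score = None, -1.0
    for c, prior in priors.items():
        score = prior
        for i, v in enumerate(t):
            score *= cond[c][i][v] / class_counts[c]   # P(x_ik | Cj), zero if value unseen
        if score > best_score:
            best_class, best_score = c, score
    return best_class

# Hypothetical categorical training tuples (height, weight) and their classes
tuples = [("short", "light"), ("tall", "heavy"), ("tall", "light"), ("short", "light")]
labels = ["child", "adult", "adult", "child"]
priors, cond, counts = train_naive_bayes(tuples, labels)
print(naive_bayes_classify(("tall", "light"), priors, cond, counts))   # "adult"
```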
4.3 Distance-based Algorithms
• Similarity (or distance) measures may be used to identify the alikeness of different items in the database. The difficulty lies in how the similarity measures are defined and applied to the items in the database. Since most measures assume numeric (often discrete) data types, a mapping from the attribute domain to a subset of the integers may be used for abstract data types.
• A simple approach assumes that each class c_i is represented by its center, or centroid. The new item is placed in the class to whose centroid it has the largest similarity value.
• The K nearest neighbors (KNN) classification scheme requires not only training data, but also the desired classification for each item in it. When a classification is made for a new item, its distance to each item in the training set must be determined. Only the K closest entries are considered. The new item is then placed in the class that contains the most items from this set of K closest items (see the sketch below).
• The KNN technique is extremely sensitive to the value of K. A rule of thumb is that
K ≤ √(number of training items)
Centroid-based vs KNN
4.4 Decision Tree-based Algorithms
Solving the classification problem using decision trees is a two-step process:
• Decision tree induction: construct a DT using the training data.
• For each t_i ∈ D, apply the DT to determine its class (a small sketch of this step follows below).
Attributes in the database schema that are used to label nodes in the tree and around which the divisions take place are called the splitting attributes. The predicates by which the arcs in the tree are labeled are called the splitting predicates. The major factors in the performance of the DT building algorithm are the size of the training set and how the best splitting attribute is chosen. The algorithm continues adding nodes and arcs to the tree recursively until some stopping criterion is reached (which can be determined in different ways).
• Advantages: easy to use; rules are easy to interpret and understand; they scale well for large databases (the tree size is independent of the database size).
• Disadvantages: they do not easily handle continuous data (attribute domains must be divided into categories, i.e., rectangular regions, in order to be handled); handling missing data is difficult; overfitting may occur (overcome via pruning); correlations among attributes are ignored by the DT process.
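A minimal sketch of step 2 (applying an already-induced DT to a tuple), assuming the tree is represented as nested dictionaries keyed by splitting attribute and splitting-predicate value; the tree and tuple shown are hypothetical.

```python
def classify_with_tree(tree, tuple_dict):
    """Walk a decision tree represented as nested dicts: an internal node is
    {splitting_attribute: {splitting_predicate_value: subtree, ...}}; a leaf is a class label."""
    while isinstance(tree, dict):
        attribute = next(iter(tree))            # splitting attribute labeling this node
        branches = tree[attribute]
        tree = branches[tuple_dict[attribute]]  # follow the arc whose predicate matches the tuple
    return tree

# Hypothetical tree: split on "height" first, then on "gender"
tree = {"height": {"short": "class_A",
                   "medium": {"gender": {"f": "class_B", "m": "class_C"}},
                   "tall": "class_C"}}
print(classify_with_tree(tree, {"height": "medium", "gender": "f"}))   # class_B
```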
Issues Faced by DT Algorithms
• Choosing splitting attributes. Using the initial training data, the "best" splitting attribute is chosen first. Algorithms differ in how they determine the best attribute and its best predicates to use for splitting. The choice of attribute involves not only an examination of the data in the training set but also the informed input of domain experts.
• Ordering of splitting attributes. The order in which the attributes are chosen is also important.
• Splits (number of splits to take). If the domain is continuous or has a large number of values, the number of splits to use is not easily determined.
• Tree structure. A balanced, shorter tree with the fewest levels is desirable. Multi-way branching or binary trees (which tend to be deeper) can be used.
• Stopping criteria. The creation of the tree stops when the training data are perfectly classified. Stopping earlier may be used to prevent overfitting: if it is known that there are data distributions not represented in the training data, a tree grown to fit the training data perfectly would have more levels than needed.
• Training data. The training data and the tree induction algorithm determine the tree shape. If the training data set is too small, then the generated tree might not be specific enough to work properly with the more general data. If the training data set is too large, then the created tree may overfit.
• Pruning. The DT building algorithms may initially build the tree and then prune it for more effective classification. Pruning is a modification of the tree by removing redundant comparisons or sub-trees, aiming to achieve better performance.
Comparing Decision Trees
The time and space complexity of DT algorithms depends on the size of the training data q, the number of attributes h, and the shape of the resulting tree. This gives a time complexity to build a tree of O(hq log q). The time to classify a database of size n is based on the height of the tree and is O(n log q).
ID3 Algorithm
• The technique for building a decision tree attempts to minimize the expected number of comparisons. It chooses splitting attributes with the highest information gain first.
• Entropy is used to measure the amount of uncertainty, surprise, or randomness in a set of data. Given probabilities of states p_1, p_2, ⋯, p_s where Σ_{i=1}^{s} p_i = 1, entropy is defined as
H(p_1, p_2, ⋯, p_s) = Σ_{i=1}^{s} p_i log(1/p_i)
• Gain is defined as the difference between how much information is needed to make a correct classification before the split versus how much information is needed after the split. The ID3 algorithm calculates the gain of a particular split by the following formula (both entropy and gain are sketched below):
Gain(D, S) = H(D) − Σ_{i=1}^{s} P(D_i) H(D_i)
• The ID3 approach favors attributes with many divisions and thus may lead to overfitting. In the extreme, an attribute that has a unique value for each tuple in the training set would be the best, because there would be only one tuple (and thus one class) for each division.
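A minimal sketch of the entropy and gain formulas above, assuming base-2 logarithms (the slides leave the base unspecified) and a hypothetical 14-tuple, two-class data set split into two subsets by some attribute.

```python
import math
from collections import Counter

def entropy(labels):
    """H = sum_i p_i * log2(1/p_i) over the classes present in `labels`."""
    n = len(labels)
    return sum((c / n) * math.log2(n / c) for c in Counter(labels).values())

def information_gain(labels, split_groups):
    """Gain(D, S) = H(D) - sum_i P(D_i) * H(D_i), where split_groups partitions `labels`."""
    n = len(labels)
    return entropy(labels) - sum(len(g) / n * entropy(g) for g in split_groups)

# Hypothetical set D: 14 tuples, 9 of class "yes" and 5 of class "no",
# split by some attribute into two subsets D_1 and D_2
labels = ["yes"] * 9 + ["no"] * 5
split = [["yes"] * 6 + ["no"] * 2, ["yes"] * 3 + ["no"] * 3]
print(round(entropy(labels), 3))              # about 0.940
print(round(information_gain(labels, split), 3))
```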
Entropy
a) log(1/p) shows the amount of surprise as the probability p ranges from 0 to 1.
b) p log(1/p) shows the expected information based on the probability p of an event.
c) p log(1/p) + (1 − p) log(1/(1 − p)) shows the value of entropy. To measure the information associated with a division, we add the information associated with both events, while taking into account the probability that each occurs.
C4.5, C5.0 and CART
• In C4.5 splitting is based on GainRatio as opposed to Gain, which ensures a larger than average information gain (see the sketch below):
GainRatio(D, S) = Gain(D, S) / H(|D_1|/|D|, ⋯, |D_s|/|D|)
• C5.0 is based on boosting. Boosting is an approach to combining different classifiers. It does not always help when the training data contain a lot of noise. Boosting works by creating multiple training sets from one training set; thus, multiple classifiers are actually constructed. Each classifier is assigned a vote, voting is performed, and the target tuple is assigned to the class with the most votes.
• Classification and regression trees (CART) is a technique that generates a binary decision tree. Entropy is used as a measure to choose the best splitting attribute and criterion; however, only 2 children are created. At each step, an exhaustive search determines the best split, defined by:
Φ(s|t) = 2 P_L P_R Σ_{j=1}^{m} |P(C_j | t_L) − P(C_j | t_R)|
• This formula is evaluated at the current node t and for each possible splitting attribute and criterion s. Here P_L and P_R are the probabilities that a tuple in t will be on the left or right side of the tree, and P(C_j | t_L) or P(C_j | t_R) is the probability that a tuple is in class C_j and in the left or right sub-tree. CART forces an ordering of the attributes to be used, and it also contains a pruning strategy.
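A minimal sketch of the GainRatio and Φ(s|t) formulas above; the gain value, subset sizes, and per-class probabilities are hypothetical inputs rather than values computed from real data.

```python
import math

def gain_ratio(gain, subset_sizes):
    """GainRatio(D, S) = Gain(D, S) / H(|D_1|/|D|, ..., |D_s|/|D|)."""
    n = sum(subset_sizes)
    split_entropy = sum((s / n) * math.log2(n / s) for s in subset_sizes if s)
    return gain / split_entropy

def cart_phi(p_left, p_right, class_probs_left, class_probs_right):
    """Phi(s|t) = 2 * P_L * P_R * sum_j |P(Cj|t_L) - P(Cj|t_R)| for one candidate binary split."""
    return 2 * p_left * p_right * sum(abs(l - r)
                                      for l, r in zip(class_probs_left, class_probs_right))

print(round(gain_ratio(0.048, [8, 6]), 3))            # hypothetical gain and subset sizes
# Hypothetical split sending 60% of tuples left, with per-class probabilities on each side
print(round(cart_phi(0.6, 0.4, [0.8, 0.2], [0.1, 0.9]), 3))   # 2*0.6*0.4*(0.7+0.7) = 0.672
```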
Pruning
• There are two primary pruning strategies: 1) subtree replacement: a subtree is replaced by a leaf node. This results in an error rate close to that of the original tree. It works from the bottom of the tree up to the root; 2) subtree raising: replaces a subtree by its most used subtree. Here a subtree is raised from its current location to a node higher up in the tree. We must determine the increase in error rate for this replacement.
Scalable DT Techniques
• SPRINT (Scalable PaRallelizable Induction of decision Trees). A gini index is used to find the best split. Here gini for a database D is defined as
gini(D) = 1 − Σ_j p_j²
where p_j is the frequency of class C_j in D. The goodness of a split of D into subsets D_1 and D_2 is defined by
gini_split(D) = (n_1/n) gini(D_1) + (n_2/n) gini(D_2)
The split with the best (lowest) gini_split value is chosen (see the sketch below).
• The RainForest approach allows a choice of split attribute without needing a training set. For each node of a DT, a table called the attribute-value class (AVC) label group is used. The table summarizes, for an attribute, the count of entries per class or attribute-value grouping. Thus, the AVC table summarizes the information needed to determine splitting attributes.
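A minimal sketch of the gini index and the goodness-of-split measure above, using hypothetical class counts; the candidate split with the lowest gini_split value would be chosen.

```python
def gini(class_counts):
    """gini(D) = 1 - sum_j p_j^2, with p_j the relative frequency of class Cj in D."""
    n = sum(class_counts)
    return 1.0 - sum((c / n) ** 2 for c in class_counts)

def gini_split(counts_d1, counts_d2):
    """Goodness of a split of D into D1 and D2: (n1/n)*gini(D1) + (n2/n)*gini(D2)."""
    n1, n2 = sum(counts_d1), sum(counts_d2)
    n = n1 + n2
    return (n1 / n) * gini(counts_d1) + (n2 / n) * gini(counts_d2)

# Hypothetical database of 14 tuples (9 of one class, 5 of the other) and one candidate split
print(round(gini([9, 5]), 3))                 # gini of the full database
print(round(gini_split([6, 2], [3, 3]), 3))   # lower values indicate better splits
```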
4.5 Neural Network-based Algorithms
Solving a classification problem using NNs involves several steps:
• Determine the number of output nodes, which attributes should be used as input, the number of hidden layers, and the weights (labels) and functions to be used. Certain attribute values from the tuple are input into the directed graph at the corresponding source nodes. There often is one sink node for each class.
• For each tuple in the training set, propagate it through the network and evaluate the output prediction. The projected classification made by the graph can be compared with the actual classification. If the prediction is accurate, we adjust the labels to ensure that this prediction has a higher output weight the next time. If the prediction is not correct, we adjust the weights to provide a lower output value for this class.
• Propagate each tuple through the network and make the appropriate classification. The output value that is generated indicates the probability that the corresponding input tuple belongs to that class. The tuple will then be assigned to the class with the highest probability of membership.
Advantages: 1) NNs are more robust (especially in noisy environments) than DTs because of the weights; 2) the NN improves its performance by learning, which may continue even after the training set has been applied; 3) the use of NNs can be parallelized for better performance; 4) there is a low error rate and thus a high degree of accuracy once the appropriate training has been performed.
Disadvantages: 1) NNs are difficult to understand; 2) generating rules from NNs is not straightforward; 3) input attribute values must be numeric; 4) testing and verification are difficult; 5) overfitting may occur; 6) the learning phase may fail to converge, in which case the result is an estimate (not optimal).
NN Propagation and Error
• A tuple of values X = (x_1, ⋯, x_h) is input to the NN, one value at each node in the input layer. Then the summation and activation functions are applied at each node, with an output value created for each output arc from that node. These values are sent to the subsequent nodes until a tuple of output values Y = (y_1, ⋯, y_m) is produced from the nodes in the output layer.
• Propagation occurs by applying the activation function at each node, which then places the output value on the arc to be sent as input to the next node. During the classification process only propagation occurs. However, when learning is used, after the output of the classification occurs a comparison to the known classification is used to determine how to change the weights.
• A gradient descent technique for modifying the weights can be used to minimize the MSE. Assuming that the output from node i is y_i but should be d_i, the error produced by a node in any layer is y_i − d_i, and the squared error at the node is (y_i − d_i)² / 2. The total mean squared error (MSE) over all m output nodes in the NN is:
MSE = Σ_{i=1}^{m} (y_i − d_i)² / m
Supervised Learning in NN
• In the simplest case learning progresses from the output layer backward to the input layer. The objective of a learning technique is to change the weights based on the output obtained for a specific input tuple. Weights are changed based on the changes that were made in the weights in subsequent arcs. This backward learning process is called backpropagation.
• With the batch or offline approach, the weights are changed after all tuples in the training set are applied and a total MSE is found. With the incremental or online approach, the weights are changed after each tuple in the training set is applied. The incremental technique is usually preferred because it requires less space and may actually examine more potential solutions.
• Suppose for a given node j the input weights are represented as a tuple (w_1j, ⋯, w_kj), while the input and output values are (x_1j, ⋯, x_kj) and y_j, respectively. The change in weights using the Hebb rule is Δw_ij = c x_ij y_j. Here c is a constant often called the learning rate. A rule of thumb is that c = 1 / (#entries in training set).
• The delta rule examines not only the output value y_j but also the desired value d_j for the output. In this case the change in weight is found by the rule Δw_ij = c x_ij (d_j − y_j). The nice feature of the delta rule is that it minimizes the error d_j − y_j at each node.
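A minimal sketch of the Hebb and delta rule updates above for a single node with hypothetical weights, inputs, and output; the learning rate c is chosen arbitrarily here rather than by the 1/#entries rule of thumb.

```python
def hebb_update(weights, inputs, output, c=0.1):
    """Hebb rule: w_ij <- w_ij + c * x_ij * y_j (no desired value is used)."""
    return [w + c * x * output for w, x in zip(weights, inputs)]

def delta_rule_update(weights, inputs, desired, output, c=0.1):
    """Delta rule: w_ij <- w_ij + c * x_ij * (d_j - y_j)."""
    return [w + c * x * (desired - output) for w, x in zip(weights, inputs)]

weights = [0.2, -0.4, 0.1]      # hypothetical input weights for one node
inputs  = [1.0, 0.5, -1.0]      # hypothetical input values x_ij
print(hebb_update(weights, inputs, output=0.3))
print(delta_rule_update(weights, inputs, desired=1.0, output=0.3))
```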
Gradient Descent
• The learning in the gradient descent technique is based on using the following value for the weight change at the output layer:
Δw_ji = −η ∂E/∂w_ji = −η (∂E/∂y_i)(∂y_i/∂S_i)(∂S_i/∂w_ji)
• Here the weights w_ji are on the arc coming into node i from node j, and η is referred to as the learning parameter. It typically is found in the range (0, 1), although it may be larger. This value determines how fast the algorithm learns.
• We are trying to minimize the error at the output nodes, while output errors are being propagated backward through the network.
• The new adjusted weights become w_ji = w_ji + Δw_ji.
• Assuming a sigmoidal activation function for the output layer,
Δw_ji = η (d_i − y_i) y_j (1 − y_i) y_i
Gradient Descent in the Hidden Layer
• For node j in the hidden layer, the change in the weights for arcs coming into it is:
Δw_kj = −η ∂E/∂w_kj = −η [Σ_m (∂E/∂y_m)(∂y_m/∂S_m)(∂S_m/∂y_j)] (∂y_j/∂S_j)(∂S_j/∂w_kj)
• Here the variable m ranges over all output nodes with arcs from j.
• Assuming a hyperbolic tangent activation function for the hidden layer:
Δw_kj = η y_k ((1 − y_j²)/2) Σ_m (d_m − y_m) w_jm y_m (1 − y_m)
• Another common formula for the change in weight is
Δw_ji(t + 1) = −η ∂E/∂w_ji + α Δw_ji(t)
• Here α is called the momentum and is used to prevent oscillation problems.
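A minimal backpropagation sketch for one hypothetical 2-2-1 network, following the two delta formulas above: a sigmoidal output layer, and a hidden layer whose activation is taken here as tanh(S/2) so that its derivative matches the (1 − y_j²)/2 factor on the slide (an assumption on my part); momentum is omitted.

```python
import math

def sigmoid(s):
    return 1.0 / (1.0 + math.exp(-s))

def train_step(x, d, hidden_w, output_w, eta=0.5):
    """One incremental backpropagation step for a single hidden layer."""
    # Forward propagation: hidden nodes use tanh(S/2), output nodes use a sigmoid.
    y_hidden = [math.tanh(sum(w * xi for w, xi in zip(node_w, x)) / 2.0) for node_w in hidden_w]
    y_out = [sigmoid(sum(w * h for w, h in zip(node_w, y_hidden))) for node_w in output_w]

    # Output-layer deltas: (d_i - y_i) * y_i * (1 - y_i)   (sigmoidal activation)
    out_delta = [(d[i] - y_out[i]) * y_out[i] * (1.0 - y_out[i]) for i in range(len(y_out))]

    # Hidden layer: delta_w_kj = eta * x_k * (1 - y_j^2)/2 * sum_m (d_m - y_m) y_m (1 - y_m) w_jm
    for j, node_w in enumerate(hidden_w):
        back = sum(out_delta[m] * output_w[m][j] for m in range(len(y_out)))
        for k in range(len(x)):
            node_w[k] += eta * x[k] * (1.0 - y_hidden[j] ** 2) / 2.0 * back

    # Output layer: delta_w_ji = eta * (d_i - y_i) * y_i * (1 - y_i) * y_j
    for i, node_w in enumerate(output_w):
        for j in range(len(y_hidden)):
            node_w[j] += eta * out_delta[i] * y_hidden[j]
    return y_out

# Hypothetical 2-2-1 network trained repeatedly on a single tuple with desired output 1.0
hidden_w = [[0.1, 0.2], [-0.1, 0.3]]
output_w = [[0.4, -0.4]]
for _ in range(1000):
    y = train_step([1.0, 0.0], [1.0], hidden_w, output_w)
print(round(y[0], 3))   # the output value moves toward the desired value
```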
Perceptrons
• The simplest NN is called a perceptron. A perceptron is a single neuron with multiple inputs and one output. A step or any other (e.g., sigmoidal) activation function can be used.
• A simple perceptron can be used to classify into two classes. An activation function output value of 1 places the tuple in one class, while a value of 0 places it in the other class (see the sketch below).
• A simple feedforward neural network of perceptrons is called a multilayer perceptron (MLP). The neurons are placed in layers with outputs always flowing toward the output layer.
MLP (Multilayer Perceptron)
• An MLP needs no more than 2 hidden layers. Kolmogorov's theorem states that a mapping between two sets of numbers can be performed using a NN with only one hidden layer. Given n attributes, the NN has one input node for each attribute; the hidden layer should have 2n + 1 nodes, each with input from each of the input nodes. The output layer has one node for each desired output value.
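A minimal sketch of a single perceptron with a step activation classifying into two classes, trained with a delta-style update (an assumption; the slides do not fix a training rule for the perceptron) on a hypothetical linearly separable data set.

```python
def perceptron_output(weights, bias, x):
    """A single perceptron: weighted sum followed by a step activation function.
    Output 1 places the tuple in one class, output 0 in the other."""
    s = bias + sum(w * xi for w, xi in zip(weights, x))
    return 1 if s > 0 else 0

def perceptron_train(data, epochs=20, c=0.1):
    """Train with a delta-style update, w <- w + c * (d - y) * x, over labeled tuples."""
    weights, bias = [0.0] * len(data[0][0]), 0.0
    for _ in range(epochs):
        for x, d in data:
            y = perceptron_output(weights, bias, x)
            weights = [w + c * (d - y) * xi for w, xi in zip(weights, x)]
            bias += c * (d - y)
    return weights, bias

# Hypothetical linearly separable two-class training set
data = [((1.0, 1.0), 0), ((1.5, 0.5), 0), ((4.0, 5.0), 1), ((5.0, 4.0), 1)]
w, b = perceptron_train(data)
print(perceptron_output(w, b, (4.5, 4.5)), perceptron_output(w, b, (1.0, 0.8)))   # 1 0
```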
4.6 Rule-Based Algorithms
• One way to perform classification is to generate if-then rules that cover all cases. A classification rule r = (a, c) consists of the if or antecedent part, a, and the then or consequent portion, c. The antecedent contains a predicate that can be evaluated as true or false against each tuple in the database (and in the training data).
• A DT can always be used to generate rules, one for each leaf node in the decision tree. All rules with the same consequent can be combined by ORing the antecedents of the simpler rules.
There are some differences:
• The tree has an implied order in which the splitting is performed.
• A tree is created based on looking at all classes. When generating rules, only one class must be examined at a time.
4.6.2 Generating Rules from a NN
• While the source NN may still be used for classification, the derived rules can be used to verify or interpret the network. The problem is that the rules do not explicitly exist; they are buried in the structure of the graph itself. In addition, if learning is still occurring, the rules themselves are dynamic.
• The rules generated tend both to be more concise and to have a lower error rate than rules used with DTs.
• The basic idea of the RX algorithm is to cluster output node activation values (with the associated hidden nodes and input); cluster hidden node activation values; generate rules that describe the output values in terms of the hidden activation values; generate rules that describe the hidden output values in terms of the inputs; and combine the two sets of rules.
• A major problem with rule extraction is the potential size of these rules. For example, if you have a node with n inputs each having 5 values, there are 5^n different input combinations to this one node alone. To overcome this problem, and that of having continuous ranges of output values from nodes, the output values for both the hidden and output layers are first discretized. This is accomplished by clustering the values and dividing continuous values into disjoint ranges.
Generating Rules Without a DT or NN
• These techniques are sometimes called covering algorithms because they attempt to generate rules that exactly cover a specific class. They generate the best rule possible by optimizing the desired classification probability. Usually the best attribute-value pair is chosen, as opposed to the best attribute as with the tree-based algorithms.
• The 1R approach generates a simple set of rules that are equivalent to a DT with only one level. The basic idea is to choose the best attribute to perform the classification based on the training data, where "best" is defined by counting the number of errors (a sketch follows below). 1R can handle missing data by adding an additional attribute value of "missing". As with ID3, it tends to choose attributes with a large number of values, leading to overfitting.
• Another approach to generating rules without first having a DT is called PRISM. PRISM generates rules for each class by looking at the training data and adding rules that completely describe all tuples in that class. Its accuracy on the training data is 100 percent. The algorithm refers to attribute-value pairs.
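A minimal 1R sketch over a hypothetical categorical data set: for each attribute, each value is mapped to its majority class, the resulting errors are counted, and the attribute with the fewest errors is kept (missing-value handling is omitted).

```python
from collections import Counter, defaultdict

def one_r(tuples, labels):
    """1R: for each attribute, map each of its values to the majority class among the
    tuples having that value, count the errors made, and keep the attribute whose
    rule set makes the fewest errors."""
    best_attr, best_rules, best_errors = None, None, len(labels) + 1
    for attr in range(len(tuples[0])):
        value_counts = defaultdict(Counter)            # attribute value -> Counter of classes
        for t, c in zip(tuples, labels):
            value_counts[t[attr]][c] += 1
        rules = {v: cnt.most_common(1)[0][0] for v, cnt in value_counts.items()}
        errors = sum(sum(cnt.values()) - cnt[rules[v]] for v, cnt in value_counts.items())
        if errors < best_errors:
            best_attr, best_rules, best_errors = attr, rules, errors
    return best_attr, best_rules, best_errors

# Hypothetical training data: attributes (outlook, windy), class = play?
tuples = [("sunny", "no"), ("sunny", "yes"), ("rain", "no"), ("rain", "yes"), ("overcast", "no")]
labels = ["no", "no", "yes", "no", "yes"]
print(one_r(tuples, labels))   # chosen attribute index, its value->class rules, error count
```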
Combining Techniques
• Multiple independent approaches can be applied to a classification problem, each yielding its own class prediction. The results of these individual techniques can then be combined. Along with boosting, two other basic techniques can be used to combine classifiers:
• One approach assumes that there are n independent classifiers and that each generates the posterior probability P_k(C_j | t_i) for each class. The values are combined with a weighted linear combination (sketched below):
Σ_{k=1}^{n} w_k P_k(C_j | t_i)
• Another technique is to choose the classifier that has the best accuracy in a database sample. This is referred to as dynamic classifier selection (DCS).
• Another variation is simple voting: assign the tuple to the class to which a majority of the classifiers have assigned it.
• Adaptive classifier combination (ACC) technique: given a tuple to classify, the neighborhood around it is first determined, then the tuples in that neighborhood are classified by each classifier, and finally the accuracy for each class is measured. By examining the accuracy across all classifiers for each class, the tuple is placed in the class that has the highest local accuracy. In effect, the class chosen is the one to which most of its neighbors are accurately classified, independent of classifier.
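A minimal sketch of the weighted linear combination and simple voting schemes above, using hypothetical posterior probabilities and weights from three classifiers.

```python
from collections import Counter

def weighted_combination(posteriors, weights):
    """Score each class as sum_k w_k * P_k(Cj | t_i) over the n classifiers and pick the best.
    `posteriors` is a list of dicts, one per classifier, mapping class -> posterior probability."""
    classes = posteriors[0].keys()
    scores = {c: sum(w * p[c] for w, p in zip(weights, posteriors)) for c in classes}
    return max(scores, key=scores.get), scores

def simple_voting(predictions):
    """Assign the tuple to the class chosen by the majority of the classifiers."""
    return Counter(predictions).most_common(1)[0][0]

# Hypothetical posteriors from three classifiers for two classes, and classifier weights
posteriors = [{"A": 0.6, "B": 0.4}, {"A": 0.2, "B": 0.8}, {"A": 0.7, "B": 0.3}]
weights = [0.5, 0.2, 0.3]
print(weighted_combination(posteriors, weights))
print(simple_voting(["A", "B", "A"]))
```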
Combination of Multiple Classifiers in DCS
Any shapes that are darkened indicate an incorrect classification. DCS looks at the local accuracy of each classifier: a) 7 tuples in the neighborhood are correctly classified; b) only 6 are correctly classified. Thus X will be classified according to the first classifier.
Summary
• No one classification technique is always superior to the others.
• The regression approaches force the data to fit a predefined model. A problem arises when a linear model is chosen for nonlinear data.
• The KNN technique requires only that the data be such that distances can be calculated. It can therefore be applied even to nonnumeric data. Outliers are handled by looking only at the K nearest neighbors.
• Bayesian classification assumes that the data attributes are independent with discrete values.
• Decision tree techniques are easy to understand, but they may lead to overfitting. To avoid this, pruning techniques may be needed.
• ID3 is applicable only to categorical data. C4.5 and C5.0 allow the use of continuous data and improved techniques for splitting. CART creates binary trees and thus may result in very deep trees.
• All of the algorithms are O(n) to classify the n items in the dataset.
References:
Dunham, Margaret H. "Data Mining: Introductory and Advanced Topics". Pearson Education, Inc., 2003.

More Related Content

What's hot

Decision tree induction \ Decision Tree Algorithm with Example| Data science
Decision tree induction \ Decision Tree Algorithm with Example| Data scienceDecision tree induction \ Decision Tree Algorithm with Example| Data science
Decision tree induction \ Decision Tree Algorithm with Example| Data scienceMaryamRehman6
Β 
Data mining :Concepts and Techniques Chapter 2, data
Data mining :Concepts and Techniques Chapter 2, dataData mining :Concepts and Techniques Chapter 2, data
Data mining :Concepts and Techniques Chapter 2, dataSalah Amean
Β 
Distributed Database System
Distributed Database SystemDistributed Database System
Distributed Database SystemSulemang
Β 
2.2 decision tree
2.2 decision tree2.2 decision tree
2.2 decision treeKrish_ver2
Β 
Data Reduction
Data ReductionData Reduction
Data ReductionRajan Shah
Β 
NoSql Data Management
NoSql Data ManagementNoSql Data Management
NoSql Data Managementsameerfaizan
Β 
Data cube computation
Data cube computationData cube computation
Data cube computationRashmi Sheikh
Β 
web mining
web miningweb mining
web miningArpit Verma
Β 
Distributed Database Management System
Distributed Database Management SystemDistributed Database Management System
Distributed Database Management SystemAAKANKSHA JAIN
Β 
Classification
ClassificationClassification
ClassificationCloudxLab
Β 
Support Vector Machines ( SVM )
Support Vector Machines ( SVM ) Support Vector Machines ( SVM )
Support Vector Machines ( SVM ) Mohammad Junaid Khan
Β 
Data preprocessing in Data Mining
Data preprocessing in Data MiningData preprocessing in Data Mining
Data preprocessing in Data MiningDHIVYADEVAKI
Β 
Mining Frequent Patterns, Association and Correlations
Mining Frequent Patterns, Association and CorrelationsMining Frequent Patterns, Association and Correlations
Mining Frequent Patterns, Association and CorrelationsJustin Cletus
Β 
Association rule mining
Association rule miningAssociation rule mining
Association rule miningAcad
Β 
Introduction to Data Mining
Introduction to Data MiningIntroduction to Data Mining
Introduction to Data MiningDataminingTools Inc
Β 
Decision trees in Machine Learning
Decision trees in Machine Learning Decision trees in Machine Learning
Decision trees in Machine Learning Mohammad Junaid Khan
Β 

What's hot (20)

Decision tree induction \ Decision Tree Algorithm with Example| Data science
Decision tree induction \ Decision Tree Algorithm with Example| Data scienceDecision tree induction \ Decision Tree Algorithm with Example| Data science
Decision tree induction \ Decision Tree Algorithm with Example| Data science
Β 
Data mining :Concepts and Techniques Chapter 2, data
Data mining :Concepts and Techniques Chapter 2, dataData mining :Concepts and Techniques Chapter 2, data
Data mining :Concepts and Techniques Chapter 2, data
Β 
Distributed Database System
Distributed Database SystemDistributed Database System
Distributed Database System
Β 
Dbscan algorithom
Dbscan algorithomDbscan algorithom
Dbscan algorithom
Β 
2.2 decision tree
2.2 decision tree2.2 decision tree
2.2 decision tree
Β 
Data Reduction
Data ReductionData Reduction
Data Reduction
Β 
NoSql Data Management
NoSql Data ManagementNoSql Data Management
NoSql Data Management
Β 
Data cube computation
Data cube computationData cube computation
Data cube computation
Β 
web mining
web miningweb mining
web mining
Β 
Distributed Database Management System
Distributed Database Management SystemDistributed Database Management System
Distributed Database Management System
Β 
Clustering in Data Mining
Clustering in Data MiningClustering in Data Mining
Clustering in Data Mining
Β 
Classification
ClassificationClassification
Classification
Β 
Support Vector Machines ( SVM )
Support Vector Machines ( SVM ) Support Vector Machines ( SVM )
Support Vector Machines ( SVM )
Β 
Data preprocessing in Data Mining
Data preprocessing in Data MiningData preprocessing in Data Mining
Data preprocessing in Data Mining
Β 
Mining Frequent Patterns, Association and Correlations
Mining Frequent Patterns, Association and CorrelationsMining Frequent Patterns, Association and Correlations
Mining Frequent Patterns, Association and Correlations
Β 
Hierarchical Clustering
Hierarchical ClusteringHierarchical Clustering
Hierarchical Clustering
Β 
Association rule mining
Association rule miningAssociation rule mining
Association rule mining
Β 
Cluster analysis
Cluster analysisCluster analysis
Cluster analysis
Β 
Introduction to Data Mining
Introduction to Data MiningIntroduction to Data Mining
Introduction to Data Mining
Β 
Decision trees in Machine Learning
Decision trees in Machine Learning Decision trees in Machine Learning
Decision trees in Machine Learning
Β 

Similar to 04 Classification in Data Mining

UNIT 3: Data Warehousing and Data Mining
UNIT 3: Data Warehousing and Data MiningUNIT 3: Data Warehousing and Data Mining
UNIT 3: Data Warehousing and Data MiningNandakumar P
Β 
Supervised and unsupervised learning
Supervised and unsupervised learningSupervised and unsupervised learning
Supervised and unsupervised learningAmAn Singh
Β 
ML SFCSE.pptx
ML SFCSE.pptxML SFCSE.pptx
ML SFCSE.pptxNIKHILGR3
Β 
Lecture 9 - Decision Trees and Ensemble Methods, a lecture in subject module ...
Lecture 9 - Decision Trees and Ensemble Methods, a lecture in subject module ...Lecture 9 - Decision Trees and Ensemble Methods, a lecture in subject module ...
Lecture 9 - Decision Trees and Ensemble Methods, a lecture in subject module ...Maninda Edirisooriya
Β 
03 Data Mining Techniques
03 Data Mining Techniques03 Data Mining Techniques
03 Data Mining TechniquesValerii Klymchuk
Β 
Predictive analytics
Predictive analyticsPredictive analytics
Predictive analyticsDinakar nk
Β 
Singular Value Decomposition (SVD).pptx
Singular Value Decomposition (SVD).pptxSingular Value Decomposition (SVD).pptx
Singular Value Decomposition (SVD).pptxrajalakshmi5921
Β 
EDAB Module 5 Singular Value Decomposition (SVD).pptx
EDAB Module 5 Singular Value Decomposition (SVD).pptxEDAB Module 5 Singular Value Decomposition (SVD).pptx
EDAB Module 5 Singular Value Decomposition (SVD).pptxrajalakshmi5921
Β 
Data mining techniques unit iv
Data mining techniques unit ivData mining techniques unit iv
Data mining techniques unit ivmalathieswaran29
Β 
Dimensionality Reduction.pptx
Dimensionality Reduction.pptxDimensionality Reduction.pptx
Dimensionality Reduction.pptxPriyadharshiniG41
Β 
dimension reduction.ppt
dimension reduction.pptdimension reduction.ppt
dimension reduction.pptDeadpool120050
Β 
introduction to Statistical Theory.pptx
 introduction to Statistical Theory.pptx introduction to Statistical Theory.pptx
introduction to Statistical Theory.pptxDr.Shweta
Β 
Predict Backorder on a supply chain data for an Organization
Predict Backorder on a supply chain data for an OrganizationPredict Backorder on a supply chain data for an Organization
Predict Backorder on a supply chain data for an OrganizationPiyush Srivastava
Β 
Decision tree for data mining and computer
Decision tree for data mining and computerDecision tree for data mining and computer
Decision tree for data mining and computertttiba
Β 
Data discretization
Data discretizationData discretization
Data discretizationHadi M.Abachi
Β 
Unit 3 – AIML.pptx
Unit 3 – AIML.pptxUnit 3 – AIML.pptx
Unit 3 – AIML.pptxhiblooms
Β 
Using Tree algorithms on machine learning
Using Tree algorithms on machine learningUsing Tree algorithms on machine learning
Using Tree algorithms on machine learningRajasekhar364622
Β 
Machine Learning Unit-5 Decesion Trees & Random Forest.pdf
Machine Learning Unit-5 Decesion Trees & Random Forest.pdfMachine Learning Unit-5 Decesion Trees & Random Forest.pdf
Machine Learning Unit-5 Decesion Trees & Random Forest.pdfAdityaSoraut
Β 

Similar to 04 Classification in Data Mining (20)

UNIT 3: Data Warehousing and Data Mining
UNIT 3: Data Warehousing and Data MiningUNIT 3: Data Warehousing and Data Mining
UNIT 3: Data Warehousing and Data Mining
Β 
Supervised and unsupervised learning
Supervised and unsupervised learningSupervised and unsupervised learning
Supervised and unsupervised learning
Β 
ML SFCSE.pptx
ML SFCSE.pptxML SFCSE.pptx
ML SFCSE.pptx
Β 
Lecture 9 - Decision Trees and Ensemble Methods, a lecture in subject module ...
Lecture 9 - Decision Trees and Ensemble Methods, a lecture in subject module ...Lecture 9 - Decision Trees and Ensemble Methods, a lecture in subject module ...
Lecture 9 - Decision Trees and Ensemble Methods, a lecture in subject module ...
Β 
03 Data Mining Techniques
03 Data Mining Techniques03 Data Mining Techniques
03 Data Mining Techniques
Β 
Predictive analytics
Predictive analyticsPredictive analytics
Predictive analytics
Β 
Singular Value Decomposition (SVD).pptx
Singular Value Decomposition (SVD).pptxSingular Value Decomposition (SVD).pptx
Singular Value Decomposition (SVD).pptx
Β 
EDAB Module 5 Singular Value Decomposition (SVD).pptx
EDAB Module 5 Singular Value Decomposition (SVD).pptxEDAB Module 5 Singular Value Decomposition (SVD).pptx
EDAB Module 5 Singular Value Decomposition (SVD).pptx
Β 
Data mining techniques unit iv
Data mining techniques unit ivData mining techniques unit iv
Data mining techniques unit iv
Β 
Dimensionality Reduction.pptx
Dimensionality Reduction.pptxDimensionality Reduction.pptx
Dimensionality Reduction.pptx
Β 
Random Forest Decision Tree.pptx
Random Forest Decision Tree.pptxRandom Forest Decision Tree.pptx
Random Forest Decision Tree.pptx
Β 
dimension reduction.ppt
dimension reduction.pptdimension reduction.ppt
dimension reduction.ppt
Β 
introduction to Statistical Theory.pptx
 introduction to Statistical Theory.pptx introduction to Statistical Theory.pptx
introduction to Statistical Theory.pptx
Β 
Predict Backorder on a supply chain data for an Organization
Predict Backorder on a supply chain data for an OrganizationPredict Backorder on a supply chain data for an Organization
Predict Backorder on a supply chain data for an Organization
Β 
Decision tree for data mining and computer
Decision tree for data mining and computerDecision tree for data mining and computer
Decision tree for data mining and computer
Β 
Data discretization
Data discretizationData discretization
Data discretization
Β 
Machine Learning
Machine LearningMachine Learning
Machine Learning
Β 
Unit 3 – AIML.pptx
Unit 3 – AIML.pptxUnit 3 – AIML.pptx
Unit 3 – AIML.pptx
Β 
Using Tree algorithms on machine learning
Using Tree algorithms on machine learningUsing Tree algorithms on machine learning
Using Tree algorithms on machine learning
Β 
Machine Learning Unit-5 Decesion Trees & Random Forest.pdf
Machine Learning Unit-5 Decesion Trees & Random Forest.pdfMachine Learning Unit-5 Decesion Trees & Random Forest.pdf
Machine Learning Unit-5 Decesion Trees & Random Forest.pdf
Β 

More from Valerii Klymchuk

Sample presentation slides template
Sample presentation slides templateSample presentation slides template
Sample presentation slides templateValerii Klymchuk
Β 
01 Introduction to Data Mining
01 Introduction to Data Mining01 Introduction to Data Mining
01 Introduction to Data MiningValerii Klymchuk
Β 
03 Data Representation
03 Data Representation03 Data Representation
03 Data RepresentationValerii Klymchuk
Β 
05 Scalar Visualization
05 Scalar Visualization05 Scalar Visualization
05 Scalar VisualizationValerii Klymchuk
Β 
06 Vector Visualization
06 Vector Visualization06 Vector Visualization
06 Vector VisualizationValerii Klymchuk
Β 
07 Tensor Visualization
07 Tensor Visualization07 Tensor Visualization
07 Tensor VisualizationValerii Klymchuk
Β 
Crime Analysis based on Historical and Transportation Data
Crime Analysis based on Historical and Transportation DataCrime Analysis based on Historical and Transportation Data
Crime Analysis based on Historical and Transportation DataValerii Klymchuk
Β 
Artificial Intelligence for Automated Decision Support Project
Artificial Intelligence for Automated Decision Support ProjectArtificial Intelligence for Automated Decision Support Project
Artificial Intelligence for Automated Decision Support ProjectValerii Klymchuk
Β 
Data Warehouse Project
Data Warehouse ProjectData Warehouse Project
Data Warehouse ProjectValerii Klymchuk
Β 

More from Valerii Klymchuk (12)

Sample presentation slides template
Sample presentation slides templateSample presentation slides template
Sample presentation slides template
Β 
Toronto Capstone
Toronto CapstoneToronto Capstone
Toronto Capstone
Β 
01 Introduction to Data Mining
01 Introduction to Data Mining01 Introduction to Data Mining
01 Introduction to Data Mining
Β 
02 Related Concepts
02 Related Concepts02 Related Concepts
02 Related Concepts
Β 
03 Data Representation
03 Data Representation03 Data Representation
03 Data Representation
Β 
05 Scalar Visualization
05 Scalar Visualization05 Scalar Visualization
05 Scalar Visualization
Β 
06 Vector Visualization
06 Vector Visualization06 Vector Visualization
06 Vector Visualization
Β 
07 Tensor Visualization
07 Tensor Visualization07 Tensor Visualization
07 Tensor Visualization
Β 
Crime Analysis based on Historical and Transportation Data
Crime Analysis based on Historical and Transportation DataCrime Analysis based on Historical and Transportation Data
Crime Analysis based on Historical and Transportation Data
Β 
Artificial Intelligence for Automated Decision Support Project
Artificial Intelligence for Automated Decision Support ProjectArtificial Intelligence for Automated Decision Support Project
Artificial Intelligence for Automated Decision Support Project
Β 
Data Warehouse Project
Data Warehouse ProjectData Warehouse Project
Data Warehouse Project
Β 
Database Project
Database ProjectDatabase Project
Database Project
Β 

Recently uploaded

Data-Analysis for Chicago Crime Data 2023
Data-Analysis for Chicago Crime Data  2023Data-Analysis for Chicago Crime Data  2023
Data-Analysis for Chicago Crime Data 2023ymrp368
Β 
Delhi Call Girls Punjabi Bagh 9711199171 β˜Žβœ”πŸ‘Œβœ” Whatsapp Hard And Sexy Vip Call
Delhi Call Girls Punjabi Bagh 9711199171 β˜Žβœ”πŸ‘Œβœ” Whatsapp Hard And Sexy Vip CallDelhi Call Girls Punjabi Bagh 9711199171 β˜Žβœ”πŸ‘Œβœ” Whatsapp Hard And Sexy Vip Call
Delhi Call Girls Punjabi Bagh 9711199171 β˜Žβœ”πŸ‘Œβœ” Whatsapp Hard And Sexy Vip Callshivangimorya083
Β 
April 2024 - Crypto Market Report's Analysis
April 2024 - Crypto Market Report's AnalysisApril 2024 - Crypto Market Report's Analysis
April 2024 - Crypto Market Report's Analysismanisha194592
Β 
Low Rate Call Girls Bhilai Anika 8250192130 Independent Escort Service Bhilai
Low Rate Call Girls Bhilai Anika 8250192130 Independent Escort Service BhilaiLow Rate Call Girls Bhilai Anika 8250192130 Independent Escort Service Bhilai
Low Rate Call Girls Bhilai Anika 8250192130 Independent Escort Service BhilaiSuhani Kapoor
Β 
RA-11058_IRR-COMPRESS Do 198 series of 1998
RA-11058_IRR-COMPRESS Do 198 series of 1998RA-11058_IRR-COMPRESS Do 198 series of 1998
RA-11058_IRR-COMPRESS Do 198 series of 1998YohFuh
Β 
VIP Call Girls Service Miyapur Hyderabad Call +91-8250192130
VIP Call Girls Service Miyapur Hyderabad Call +91-8250192130VIP Call Girls Service Miyapur Hyderabad Call +91-8250192130
VIP Call Girls Service Miyapur Hyderabad Call +91-8250192130Suhani Kapoor
Β 
Industrialised data - the key to AI success.pdf
Industrialised data - the key to AI success.pdfIndustrialised data - the key to AI success.pdf
Industrialised data - the key to AI success.pdfLars Albertsson
Β 
Edukaciniai dropshipping via API with DroFx
Edukaciniai dropshipping via API with DroFxEdukaciniai dropshipping via API with DroFx
Edukaciniai dropshipping via API with DroFxolyaivanovalion
Β 
Generative AI on Enterprise Cloud with NiFi and Milvus
Generative AI on Enterprise Cloud with NiFi and MilvusGenerative AI on Enterprise Cloud with NiFi and Milvus
Generative AI on Enterprise Cloud with NiFi and MilvusTimothy Spann
Β 
Brighton SEO | April 2024 | Data Storytelling
Brighton SEO | April 2024 | Data StorytellingBrighton SEO | April 2024 | Data Storytelling
Brighton SEO | April 2024 | Data StorytellingNeil Barnes
Β 
VIP Call Girls in Amravati Aarohi 8250192130 Independent Escort Service Amravati
VIP Call Girls in Amravati Aarohi 8250192130 Independent Escort Service AmravatiVIP Call Girls in Amravati Aarohi 8250192130 Independent Escort Service Amravati
VIP Call Girls in Amravati Aarohi 8250192130 Independent Escort Service AmravatiSuhani Kapoor
Β 
B2 Creative Industry Response Evaluation.docx
B2 Creative Industry Response Evaluation.docxB2 Creative Industry Response Evaluation.docx
B2 Creative Industry Response Evaluation.docxStephen266013
Β 
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...apidays
Β 
Week-01-2.ppt BBB human Computer interaction
Week-01-2.ppt BBB human Computer interactionWeek-01-2.ppt BBB human Computer interaction
Week-01-2.ppt BBB human Computer interactionfulawalesam
Β 
Invezz.com - Grow your wealth with trading signals
Invezz.com - Grow your wealth with trading signalsInvezz.com - Grow your wealth with trading signals
Invezz.com - Grow your wealth with trading signalsInvezz1
Β 
꧁❀ Greater Noida Call Girls Delhi ❀꧂ 9711199171 ☎️ Hard And Sexy Vip Call
꧁❀ Greater Noida Call Girls Delhi ❀꧂ 9711199171 ☎️ Hard And Sexy Vip Call꧁❀ Greater Noida Call Girls Delhi ❀꧂ 9711199171 ☎️ Hard And Sexy Vip Call
꧁❀ Greater Noida Call Girls Delhi ❀꧂ 9711199171 ☎️ Hard And Sexy Vip Callshivangimorya083
Β 
Ravak dropshipping via API with DroFx.pptx
Ravak dropshipping via API with DroFx.pptxRavak dropshipping via API with DroFx.pptx
Ravak dropshipping via API with DroFx.pptxolyaivanovalion
Β 
Call me @ 9892124323 Cheap Rate Call Girls in Vashi with Real Photo 100% Secure
Call me @ 9892124323  Cheap Rate Call Girls in Vashi with Real Photo 100% SecureCall me @ 9892124323  Cheap Rate Call Girls in Vashi with Real Photo 100% Secure
Call me @ 9892124323 Cheap Rate Call Girls in Vashi with Real Photo 100% SecurePooja Nehwal
Β 

Recently uploaded (20)

Data-Analysis for Chicago Crime Data 2023
Data-Analysis for Chicago Crime Data  2023Data-Analysis for Chicago Crime Data  2023
Data-Analysis for Chicago Crime Data 2023
Β 
Delhi Call Girls Punjabi Bagh 9711199171 β˜Žβœ”πŸ‘Œβœ” Whatsapp Hard And Sexy Vip Call
Delhi Call Girls Punjabi Bagh 9711199171 β˜Žβœ”πŸ‘Œβœ” Whatsapp Hard And Sexy Vip CallDelhi Call Girls Punjabi Bagh 9711199171 β˜Žβœ”πŸ‘Œβœ” Whatsapp Hard And Sexy Vip Call
Delhi Call Girls Punjabi Bagh 9711199171 β˜Žβœ”πŸ‘Œβœ” Whatsapp Hard And Sexy Vip Call
Β 
April 2024 - Crypto Market Report's Analysis
April 2024 - Crypto Market Report's AnalysisApril 2024 - Crypto Market Report's Analysis
April 2024 - Crypto Market Report's Analysis
Β 
Low Rate Call Girls Bhilai Anika 8250192130 Independent Escort Service Bhilai
Low Rate Call Girls Bhilai Anika 8250192130 Independent Escort Service BhilaiLow Rate Call Girls Bhilai Anika 8250192130 Independent Escort Service Bhilai
Low Rate Call Girls Bhilai Anika 8250192130 Independent Escort Service Bhilai
Β 
RA-11058_IRR-COMPRESS Do 198 series of 1998
RA-11058_IRR-COMPRESS Do 198 series of 1998RA-11058_IRR-COMPRESS Do 198 series of 1998
RA-11058_IRR-COMPRESS Do 198 series of 1998
Β 
VIP Call Girls Service Miyapur Hyderabad Call +91-8250192130
VIP Call Girls Service Miyapur Hyderabad Call +91-8250192130VIP Call Girls Service Miyapur Hyderabad Call +91-8250192130
VIP Call Girls Service Miyapur Hyderabad Call +91-8250192130
Β 
Industrialised data - the key to AI success.pdf
Industrialised data - the key to AI success.pdfIndustrialised data - the key to AI success.pdf
Industrialised data - the key to AI success.pdf
Β 
Edukaciniai dropshipping via API with DroFx
Edukaciniai dropshipping via API with DroFxEdukaciniai dropshipping via API with DroFx
Edukaciniai dropshipping via API with DroFx
Β 
Generative AI on Enterprise Cloud with NiFi and Milvus
Generative AI on Enterprise Cloud with NiFi and MilvusGenerative AI on Enterprise Cloud with NiFi and Milvus
Generative AI on Enterprise Cloud with NiFi and Milvus
Β 
Brighton SEO | April 2024 | Data Storytelling
Brighton SEO | April 2024 | Data StorytellingBrighton SEO | April 2024 | Data Storytelling
Brighton SEO | April 2024 | Data Storytelling
Β 
VIP Call Girls Service Charbagh { Lucknow Call Girls Service 9548273370 } Boo...
VIP Call Girls Service Charbagh { Lucknow Call Girls Service 9548273370 } Boo...VIP Call Girls Service Charbagh { Lucknow Call Girls Service 9548273370 } Boo...
VIP Call Girls Service Charbagh { Lucknow Call Girls Service 9548273370 } Boo...
Β 
VIP Call Girls in Amravati Aarohi 8250192130 Independent Escort Service Amravati
VIP Call Girls in Amravati Aarohi 8250192130 Independent Escort Service AmravatiVIP Call Girls in Amravati Aarohi 8250192130 Independent Escort Service Amravati
VIP Call Girls in Amravati Aarohi 8250192130 Independent Escort Service Amravati
Β 
꧁❀ Aerocity Call Girls Service Aerocity Delhi ❀꧂ 9999965857 ☎️ Hard And Sexy ...
꧁❀ Aerocity Call Girls Service Aerocity Delhi ❀꧂ 9999965857 ☎️ Hard And Sexy ...꧁❀ Aerocity Call Girls Service Aerocity Delhi ❀꧂ 9999965857 ☎️ Hard And Sexy ...
꧁❀ Aerocity Call Girls Service Aerocity Delhi ❀꧂ 9999965857 ☎️ Hard And Sexy ...
Β 
B2 Creative Industry Response Evaluation.docx
B2 Creative Industry Response Evaluation.docxB2 Creative Industry Response Evaluation.docx
B2 Creative Industry Response Evaluation.docx
Β 
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
Β 
Week-01-2.ppt BBB human Computer interaction
Week-01-2.ppt BBB human Computer interactionWeek-01-2.ppt BBB human Computer interaction
Week-01-2.ppt BBB human Computer interaction
Β 
Invezz.com - Grow your wealth with trading signals
Invezz.com - Grow your wealth with trading signalsInvezz.com - Grow your wealth with trading signals
Invezz.com - Grow your wealth with trading signals
Β 
꧁❀ Greater Noida Call Girls Delhi ❀꧂ 9711199171 ☎️ Hard And Sexy Vip Call
꧁❀ Greater Noida Call Girls Delhi ❀꧂ 9711199171 ☎️ Hard And Sexy Vip Call꧁❀ Greater Noida Call Girls Delhi ❀꧂ 9711199171 ☎️ Hard And Sexy Vip Call
꧁❀ Greater Noida Call Girls Delhi ❀꧂ 9711199171 ☎️ Hard And Sexy Vip Call
Β 
Ravak dropshipping via API with DroFx.pptx
Ravak dropshipping via API with DroFx.pptxRavak dropshipping via API with DroFx.pptx
Ravak dropshipping via API with DroFx.pptx
Β 
Call me @ 9892124323 Cheap Rate Call Girls in Vashi with Real Photo 100% Secure
Call me @ 9892124323  Cheap Rate Call Girls in Vashi with Real Photo 100% SecureCall me @ 9892124323  Cheap Rate Call Girls in Vashi with Real Photo 100% Secure
Call me @ 9892124323 Cheap Rate Call Girls in Vashi with Real Photo 100% Secure
Β 

04 Classification in Data Mining

  • 2. 4.1 Introduction β€’ Prediction can be thought of as classifying an attribute value into one of set of possible classes. It is often viewed as forecasting a continuous value, while classification forecasts a discrete value. β€’ All classification techniques assume some knowledge of the data. Training data consists of sample input data as well as the classification assignment for each data tuple. Given a database 𝐷 of tuples and a set of classes 𝐢, the classification problem is to define a mapping 𝑓: 𝐷 β†’ 𝐢 where each tuple is assigned to one class. β€’ The problem is implemented in two phases: β€’ Create a specific model by evaluating the training data. β€’ Apply the model to classifying tuples from the target database. β€’ There are three basic methods used to solve the classification problem: 1) specifying boundaries; 2) using probability distributions; 3) using posterior probabilities. β€’ A major issue associated with classification is overfitting. If the classification model fits the data exactly, it may not be applicable to a broader population. β€’ Statistical algorithms are based directly on the use of statistical information. Distance-based algorithms use similarity or distance measure to perform the classification. Decision trees and NN use those structures. Rule based classification algorithms generate if-then rules to perform classification.
  • 3. Measuring Performance and Accuracy β€’ Classification accuracy is usually calculated by determining the percentage of tuples placed in the correct class. β€’ Given a specific class and a database tuple may or may not be assigned to that class while its actual membership may or may not be in that class. This gives us four quadrants: β€’ True positive (TP): 𝑑𝑖 predicted to be in 𝐢𝑗 and is actually in it. β€’ False positive (FP): 𝑑𝑖 predicted to be in 𝐢𝑗 but is not actually in it. β€’ True negative (TN): 𝑑𝑖 not predicted to be in 𝐢𝑗 and is not actually in it. β€’ False negative (FN): 𝑑𝑖 not predicted to be in 𝐢𝑗 but is actually in it. β€’ An OC (operating characteristic) curve or ROC (receiver operating characteristic) curve shows the relationship between false positives and true positives. The horizontal axis has the percentage of false positives and the vertical axis has the percentage of true positives for a database sample. β€’ A confusion matrix illustrates the accuracy of the solution to a classification problem. Given π‘š classes, a confusion matrix is an π‘š Γ— π‘š matrix where entry 𝑐𝑖,𝑗 indicates the number of tuples from 𝐷 that were assigned to class 𝐢𝑗 but where the correct class is 𝐢𝑖.
  • 4. 4.2 Statistical Methods. Regression β€’ Regression used for classification deals with estimation (prediction) of an output (class) value based on input values from the database. It takes a set of data and fits the data to a formula. Classification can be performed using two different approaches: 1) Division: The data are divided into regions based on class; 2) Prediction: Formulas are generated to predict the output class value. β€’ The prediction is an estimate rather than the actual output value. This technique does not work well with nonnumeric data. β€’ In cases with noisy, erroneous data, outliers, the observable data may be described as ∢ 𝑦 = 𝑐0 + 𝑐1 π‘₯1 + β‹― + 𝑐 𝑛 π‘₯ 𝑛 + πœ–, where πœ– is a random error with a mean of 0. A method of least squares is used to minimize the least squared error. We first take partial derivatives with respect to coefficients and set them equal to zero. This approach finds least square estimates 𝑐0, 𝑐1, β‹― 𝑐 𝑛 for the coefficients so that the squared error is minimized for the set of observable values. β€’ We can estimate the accuracy of the fit of a linear regression model to the actual data using a mean squared error function. β€’ A commonly used regression technique is called logistic regression. Logistic regression fits data to a curve such as: 𝑝 = 𝑒(𝑐0+𝑐1 π‘₯1) 1 + 𝑒(𝑐0+𝑐1 π‘₯1) β€’ It produces values between 0 and 1 and can be interpreted as probability of class membership. The logarithm is applied to obtain the logistic function: log 𝑒 𝑝 1 βˆ’ 𝑝 = 𝑐0 + 𝑐1 π‘₯1 β€’ Here 𝑝 is the probability of being in the class and 1 βˆ’ 𝑝 is the probability that it is not. The process chooses values for 𝑐0 and 𝑐1 that maximize the probability of observing the given values.
  • 5. Bayesian Classification β€’ Assuming that the contribution by all attributes are independent and that each contributes equally to the classification problem, a classification scheme naive Bayes can be used. β€’ Training data can be used to determine prior and conditional probabilities 𝑃 𝐢𝑗 and 𝑃(π‘₯𝑖|𝐢𝑗), as well as 𝑃 π‘₯𝑖 . From these values Bayes theorem allows us to estimate the posterior probability 𝑃 𝐢𝑗 π‘₯𝑖 and 𝑃(𝐢𝑗|𝑑𝑖). β€’ This must be done for all attributes and all values 𝑃 𝑑𝑖 𝐢𝑗 = π‘˜=1 𝑝 𝑃(π‘₯π‘–π‘˜|𝐢𝑗) β€’ To calculate 𝑃(𝑑𝑖) we estimate the likelihoods for 𝑑𝑖 in each class and add these values. β€’ The posterior probability 𝑃(𝐢𝑗|𝑑𝑖) is then found for each class. The class with the highest probability is the one chosen for the tuple. β€’ Only one scan of training data is needed, it can handle missing values. In simple relationships this technique often yields good results. β€’ The technique does not handle continuous data. Diving into ranges could be used to solve this problem. Attributes usually are not independent, so we can use a subset by ignoring those that are dependent.
  • 6. 4.3 Distance-based Algorithms β€’ Similarity (or distance) measures may be used to identify the alikeness of different items in the database. The difficulty lies in how the similarity measures are defined and applied to the items in the database. Since most measures assume numeric (often discrete) data types, a mapping from the attribute domain to a subset of integers may be used for abstract data types. β€’ Simple approach assumes that each class 𝑐𝑖 is represented by its center or centroid – center for its class. The new item is placed in the class with the largest similarity value. β€’ K nearest neighbors (KNN) classification scheme requires not only training data, but also desired classification for each item in it. When a classification is made for a new item, its distance to each item in the training set must be determined. Only the K closest entries are considered. The new item is then placed in the class that contains the most items from this set of K closest items. β€’ KNN technique is extremely sensitive to the value of K. A rule of thumb is that 𝐾 ≀ π‘›π‘’π‘šπ‘π‘’π‘Ÿ π‘œπ‘“ π‘‘π‘Ÿπ‘Žπ‘–π‘›π‘–π‘›π‘” π‘–π‘‘π‘’π‘šπ‘ 
  • 8. Solving the classification problem using decision trees is a 2-step process: β€’ Decision tree induction: Construct a DT using training data. β€’ For each 𝑑𝑖 ∈ 𝐷, apply the DT to determine its class. Attributes in the database schema that are used to label nodes in the tree and around which the divisions takes place are called the splitting attributes. The predicates by which the arcs in the tree are labeled are called the splitting predicates. The major factors in the performance of the DT building algorithm are: the size of the training set and how the best splitting attribute is chosen. Algorithm continues adding nodes and arcs to the tree recursively until some stopping criteria is reached (can be determined differently). β€’ Advantages: easy to use, rules are easy to interpret and understand, scale well for large databases (the tree size is independent of the database size). β€’ Disadvantages: do not easily handle continuous data (attribute domains must be divided into categories (rectangular regions) in order to be handled, handling missing data is difficult, overfitting may occur (overcome via pruning), correlations among attributes are ignored by the DT process. 4.4 Decision Tree-based Algorithms
  • 9. β€’ Choosing splitting attributes. Using the initial training data, the β€œbest” splitting attribute is chosen first. Algorithms differ in how they determine the best attribute and its best predicates to use for splitting. The choice of attribute involves not only an examination of the data in the training set but also the informed input of domain experts. β€’ Ordering of splitting attributes. The order in which the attributes are chosen is also important. β€’ Splits (number of splits to take). If the domain is continuous or has a large number of values, the number of splits to use is not easily determined. β€’ Tree structure. A balanced shorter tree with the fewest levels is desirable. Multi-way branching or binary trees (tend to be deeper) can be used. β€’ Stopping criteria. The creating of the tree stops when the training data are perfectly classified. Stopping earlier may be used to prevent overfitting. More levels than needed would be created in a tree if it is known that there are data distributions not represented in the training data. β€’ Training data. The training data and the tree induction algorithm determine the tree shape. If training data set is too small, then the generated tree might not be specific enough to work properly with the more general data. If the training data set is too large, then the created tree may overfit. β€’ Pruning. The DT building algorithms may initially build the tree and then prune it for more effective classification. Pruning is a modification of the tree by removing redundant comparisons or sub-trees aiming to achieve better performance. Issues Faced by DT Algorithms
  • 10. Comparing Decision Trees The time and space complexity of DT algorithms depends on the size of the training data π‘ž; the number of attributes β„Ž; and the shape of the resulting tree. This gives a time complexity to build a tree of 𝑂(β„Žπ‘ž log π‘ž). The time to classify a database of size 𝑛 is based on the height of the tree and is 𝑂 𝑛 log π‘ž .
  • 11. ID3 Algorithm β€’ The technique to building a decision tree attempts to minimize the expected number of comparisons. It choses splitting attributes with the highest information gain first. β€’ Entropy is used to measure the amount of uncertainty or surprise or randomness in a set of data. Given probabilities of states 𝑝1, 𝑝2, β‹― , 𝑝𝑠 where 𝑖=1 𝑠 𝑝𝑖 = 1, entropy is defied as 𝐻 𝑝1, 𝑝2, β‹― , 𝑝𝑠 = 𝑖=1 𝑠 𝑝𝑖 log 1 𝑝𝑖 β€’ Gain is defined as the difference between how much information is needed to make a correct classification before the split versus how much information is needed after the split. The ID3 algorithm calculates the gain of a particular split by the following formula: Gain 𝐷, 𝑆 = 𝐻 𝐷 βˆ’ 𝑖=1 𝑠 𝑃(𝐷𝑖)𝐻(𝐷𝑖) β€’ ID3 approach favors attributes with many divisions and thus may lead to overfitting In the extreme, an attribute that has a unique value for each tuple in the training set would be the best because there would be only one tuple (and thus one class) for each division.
  • 12. Entropy a) log 1 𝑝 shows the amount of surprise as the probability 𝑝 ranges from 0 to 1. b) 𝑝 log 1 𝑝 shows the expected information based on probability 𝑝 of an event. c) 𝑝 log 1 𝑝 + (1 βˆ’ 𝑝) log 1 (1 βˆ’ 𝑝) shows the value of entropy. To measure the information associated with a division, we add information associated with both events, while taking into account the probability that each occurs.
  • 13. C4.5, C5.0 and CART β€’ In C4.5 splitting is based on GainRatio as opposed to Gain, which ensures a larger than average information gain πΊπ‘Žπ‘–π‘› π‘…π‘Žπ‘‘π‘–π‘œ 𝐷, 𝑆 = Gain(𝐷, 𝑆) H 𝐷1 𝐷 , β‹― , 𝐷𝑠 𝐷 β€’ C5.0 is based on boosting. Boosting is an approach to combining different classifiers. It does not always help when the training data contains a lot of noise. Boosting works by creating multiple training sets from one training set. Thus, multiple classifiers are actually constructed. Each classifier is assigned a vote, voting is performed, and the target tuple is assigned to the class with the most number of votes. β€’ Classification and regression trees (CART) is a technique that generates a binary decision tree. Entropy is used as a measure to choose the best splitting attribute and criterion, however, only 2 children are created. At each step, an exhaustive search determines the best split defined by: Ξ¦ 𝑠 𝑑 = 2𝑃𝐿 𝑃𝑅 𝑗=1 π‘š 𝑃 𝐢𝑗|𝑑 𝐿 βˆ’ 𝑃 𝐢𝑗|𝑑 𝑅 . β€’ This formula is evaluated at the current node 𝑑, and for each possible splitting attribute and criterion 𝑠 . Here 𝐿 and 𝑅 are the probability that a tuple 𝑑 will be on the left or right side of the tree. 𝑃 𝐢𝑗|𝑑 𝐿 or 𝑃 𝐢𝑗|𝑑 𝑅 is the probability that a tuple is in this class 𝐢𝑗 and in the left or right sub-tree. CART forces that an ordering of the attributes must be used, and it also contains a pruning strategy.
  • 14. β€’ There are two primary pruning strategies: 1) subtree replacement: a subtree is replaced by a leaf node. This results in an error rate close to that of the original tree. It works from the bottom of the tree up to the root; 2) subtree raising: replaces a sub-tree by its most used subtree. Here a subtree is raised from its current location to a node higher up in the tree. We must determine the increase in error rate for this replacement. Pruning
  • 15. Scalable DT Techniques β€’ SPRINT (Scalable PaRallelizable Induction of decision Trees). A gini index is used to find the best split. Here gini for a database 𝐷 is defined as gini 𝐷 = 1 βˆ’ 𝑝𝑗 2 , where 𝑝𝑗 is the frequency of class 𝐢𝑗 in 𝐷. The goodness of a split of 𝐷 into subsets 𝐷1and 𝐷2 is defined by 𝑔𝑖𝑛𝑖 𝑠𝑝𝑙𝑖𝑑 𝐷 = 𝑛1 𝑛 gini(𝐷1) + 𝑛2 𝑛 gini(𝐷2) The split with the best gini value is chosen. β€’ The RainForest approach allows a choice of split attribute without needing a training set. For each node of a DT, a table called the attribute-value class (AVC) label group is used. The table summarizes for an attribute the count of entries per class or attribute value grouping. Thus, the AVC table summarizes the information needed to determine splitting attributes.
  • 16. 4.5 Neural Network-based Algorithms Solving a classification problem using NNs involves several steps: β€’ Determine the number of output nodes, what attributes should be used as input, the number of hidden layers, the weights (labels) and functions to be used. Certain attribute values from the tuple are input into the directed graph at the corresponding source nodes. There often is one sink node for each class. β€’ For each tuple in the training set, propagate it though the network and evaluate the output prediction. The projected classification made by the graph can be compared with the actual classification. If the prediction is accurate, we adjust the labels to ensure that this prediction has a higher output weight the next time. If the prediction is not correct, we adjust the weights to provide a lower output value for this class. β€’ Propagate each tuple through the network and make the appropriate classification. The output value that is generated indicates the probability that the corresponding input tuple belongs to that class. The tuple will then be assigned to the class with the highest probability of membership. Advantages: 1) NNs are more robust (especially in noisy environments) than DTs because of the weights; 2) the NN improves its performance by learning. This may continue even after the training set has been applied; 3) the use of NNs can be parallelized for better performance; 4) there is a low error rate and thus a high degree of accuracy once the appropriate training has been performed. Disadvantages: 1) NNs are difficult to understand; 2) Generating rules from NNs is not straightforward; 3) input attribute values must be numeric; 4) testing, verification; 5) overfitting may occur; 6) the learning phase may fail to converge, the result is an estimate (not optimal).
  • 17. NN Propagation and Error β€’ Given a tuple of values input to the NN, 𝑋 = π‘₯1, β‹― , π‘₯β„Ž , one at each node in the input layer. Then the summation and activation functions are applied at each node, with an output value created for each output arc from that node. These values are sent to the subsequent nodes until a tuple of output values π‘Œ = 𝑦1, β‹― , 𝑦 π‘š is produced from the nodes in the output layer. β€’ Propagation occurs by applying the activation function at each node, which then places the output value on the arc to be sent as input to the next node. During classification process only propagation occurs. However, when learning is used after the output of the classification occurs, a comparison to the known classification is used to determine how to change the weights. β€’ A gradient descent technique in modifying the weights can be used to minimize MSE. Assuming that the output from node 𝑖 is 𝑦𝑖, but should be 𝑑𝑖, the error produced from a node in any layer can be found by 𝑦𝑖 βˆ’ 𝑑𝑖 . The mean squared error (MSE) is found by (𝑦𝑖 βˆ’ 𝑑𝑖)2 2. Thus the total MSE error over all m output nodes in the NN is: 𝑀𝑆𝐸 = 𝑖=1 π‘š (𝑦𝑖 βˆ’ 𝑑𝑖)2 π‘š
  • 18. Supervised Learning in NN β€’ In the simplest case learning progresses from the output layer backward to the input layer. The objective of a learning technique is to change the weights based on the output obtained for a specific input tuple. Weight are changed based on the changes that were made in weights in subsequent arcs. This backward learning process is called backpropagation. β€’ With the batch or offline approach, the weights are changed after all tuples in the training set are applied and a total MSE is found. With the incremental or online approach, the weights are changed after each tuple in the training set is applied. The incremental technique is usually preferred because it requires less space and may actually examine more potential solutions. β€’ Suppose for a given node, 𝑗 , the input weights are represented as a tuple 𝑀1𝑗, β‹― , 𝑀 π‘˜π‘— , while the input and output values are π‘₯1𝑗, β‹― , π‘₯ π‘˜π‘— and 𝑦𝑗, respectively. The change in weights using Hebb rule is represented by Δ𝑀𝑖𝑗 = 𝑐π‘₯𝑖𝑗 𝑦𝑗. Here 𝑐 is a constant often called the learning rate. A rule of thumb is that c = 1 #π‘’π‘›π‘‘π‘Ÿπ‘–π‘’π‘  𝑖𝑛 π‘‘π‘Ÿπ‘Žπ‘–π‘›π‘–π‘›π‘” 𝑠𝑒𝑑 β€’ Delta rule examines not only the output value 𝑦𝑗 but also the desired value 𝑑𝑗 for output. In this case the change in weight is found by the rule: Δ𝑀𝑖𝑗 = 𝑐π‘₯𝑖𝑗 𝑑𝑗 βˆ’ 𝑦𝑗 . The nice feature of the delta rule is that is minimizes the error 𝑑𝑗 βˆ’ 𝑦𝑗 at each node.
  • 19. Gradient Descent β€’ Here πœ‚ is referred to as the learning parameter. It typically is found in range (0,1), although it may be larger. This value determines how fast the algorithm learns. β€’ We are trying to minimize the error at the output nodes, while output errors are being propagated backward through the network. β€’ The learning in the gradient descent technique is based on using the following value for delta at the output layer Δ𝑀𝑗𝑖 = βˆ’πœ‚ πœ•πΈ πœ•π‘€π‘—π‘– = βˆ’πœ‚ πœ•πΈ πœ•π‘¦π‘– πœ•π‘¦π‘– πœ•π‘†π‘– πœ•π‘†π‘– πœ•π‘€π‘—π‘– β€’ here the weights 𝑀𝑗𝑖 are at one arc coming into 𝑖 from 𝑗. β€’ So that new adjusted weights become 𝑀𝑗𝑖 = 𝑀𝑗𝑖 + Δ𝑀𝑗𝑖 β€’ Assuming sigmoidal activation function for the output layer Δ𝑀𝑗𝑖 = πœ‚ 𝑑𝑖 βˆ’ 𝑦𝑖 𝑦𝑗 1 βˆ’ 𝑦𝑖 𝑦𝑖
  • 20. Gradient Descent in the Hidden Layer β€’ For node j in the hidden layer the change in the weights for arcs coming into it: Δ𝑀 π‘˜π‘— = βˆ’πœ‚ πœ•πΈ πœ•π‘€ π‘˜π‘— = π‘š πœ•πΈ πœ•π‘¦ π‘š πœ•π‘¦ π‘š πœ•π‘† π‘š πœ•π‘† π‘š πœ•π‘¦π‘— πœ•π‘¦π‘— πœ•π‘†π‘— πœ•π‘†π‘— πœ•π‘€ π‘˜π‘— β€’ Here the variable m ranges over all output nodes with arcs from 𝑗 . β€’ Assuming hyperbolic tangent activation function for the hidden layer: Δ𝑀 π‘˜π‘— = πœ‚π‘¦ π‘˜ 1 βˆ’ 𝑦𝑗 2 2 π‘š (𝑑 π‘š βˆ’ 𝑦 π‘š)π‘€π‘—π‘š 𝑦 π‘š(1 βˆ’ 𝑦 π‘š) β€’ Another common formula for the change in weight is Δ𝑀𝑗𝑖 𝑑 + 1 = βˆ’πœ‚ πœ•πΈ πœ•π‘€π‘—π‘– + 𝛼Δ𝑀𝑗𝑖(𝑑) β€’ Here is called a momentum and is used to prevent oscillation problems.
  • 21. Perceptrons β€’ The simplest NN is called a perceptron. A perceptrone is a single neuron with multiple inputs and one output. Step or any other (e.g., sigmoidal) activation function can be used. β€’ A simple perceptrone can be used to classify into two classes. Activation function output value of 1 would be used to classify into one class, while value of 0 would be used to pass in the other class. β€’ A simple feed forward neural network of perceptrons is called a multilayer perceptron (MLP). The neurons are placed in layers with outputs always flowing toward the output layer.
  • 22. β€’ MLP needs no more than 2 hidden layers. Kolmogorov’s theorem states, that a mapping between two sets of numbers can be performed using a NN with only one hidden layer. Given 𝑛 attributes, NN having one input node for each attribute, the hidden layer should have 2𝑛 + 1 nodes, each with input from each of the input nodes. The output layer has one node for each desired output value. MLP (Multilayer Perceptron)
  • 23. 4.6 Rule-Based Algorithms β€’ One way to perform classification is to generate if-then rules that cover all cases. A classification rule, π‘Ÿ = π‘Ž, 𝑐 , consists of the if or antecedent, π‘Ž part, and the then 𝑐 or consequent portion . The antecedent contains a predicate that can be evaluated as true or false against each tuple in the database (and in the training data). β€’ A DT can always be used to generate rules for each leaf node in the decision tree. All rules with the same consequent could be combined together by Oring the antecedents of the simpler rules. There are some differences: β€’ The tree has an implied order in which the splitting is performed. β€’ A tree is created based on looking at all classes. When generating rules, only one class must be examined at a time.
  • 24. 4.6.2 Generating Rules from a NN β€’ While the source NN may still be used for classification, the derived rules can be used to verify or interpret the network. The problem is that the rules do not explicitly exist. They are buried in the structure of the graph itself. In addition, if learning is still occurring, the rules themselves are dynamic. β€’ The rules generated tend both to be more concise and to have a lower error rate than rules used with DTs. β€’ The basic idea of the RX algorithm is to cluster output node activation values (with the associated hidden nodes and input); cluster hidden node activation values; generate rules that describe the output values in terms of the hidden activation values; generate rules that describe hidden output values in terms of inputs; combine two sets of rules. β€’ A major problem with rule extraction is the potential size that these rules should be. For example, if you have a node with n inputs each having 5 values, there are 5n different input combinations to this one node alone. To overcome this problem and that of having continuous ranges of output values from nodes, the output values for both the hidden and output layers are first discretized. This is accomplished by clustering the values and dividing continuous values into disjoint ranges.
  • 25. Generating Rules Without a DT or NN β€’ These techniques are sometimes called covering algorithms because they attempt to generate rules exactly cover a specific class. They generate the best rule possible by optimizing the desired classification probability. Usually the best attribute-value pair is chosen, as opposed to the best attribute with the tree- based algorithms. β€’ 1R approach generates a simple set of rules that are equivalent to a DT with only one level. The basic idea is to choose the best attribute to perform the classification based on the training data. The best is defined here by counting the number of errors. 1R can handle missing data by adding an additional attribute value of missing. As with ID3, it tends to chose attributes with a large number of values leading to overfitting. β€’ Another approach to generating rules without first having a DT is called PRISM. PRISM generates rules fro each class by looking at the training data and adding rules that completely describe all tuples in that class. Its accuracy is 100 percent. The algorithm refers to attribute-value pairs.
  • 26. Combining Techniques β€’ Multiple independent approaches can be applied to a classification problem, each yielding its own class prediction. The results of these individual techniques can then be combined. Along with boosting two other basic techniques can be used to combine classifiers: β€’ One approach assumes that there are n independent classifiers and that each generates the posterior probability π‘ƒπ‘˜(𝐢𝑗|𝑑𝑖) for each class. The values are combined with a weighted linear combination π‘˜=1 𝑛 𝑀 π‘˜ π‘ƒπ‘˜(𝐢𝑗|𝑑𝑖) β€’ Another technique is to choose the classifier that has the best accuracy in a database sample. This is referred to as a dynamic classifier selection (DCS). β€’ Another variation is simple voting: assign the tuple to the class to which a majority of the classifiers have assigned it. β€’ Adaptive classifier combination (ACC) technique. Given a tuple to classify, the neighborhood around it is first determined, then the tuples in that neighborhood are classified by each classifier, and finally the accuracy for each class is measured. By examining the accuracy across all classifiers for each class, the tuple is placed in the class that has the highest local accuracy. In effect, the class chosen is that to which most of its neighbors are accurately classified independent of classifier.
  • 27. Combination of Multiple Classifiers in DCS Any shapes that are darkened indicate an incorrect classification. DCS looks at local accuracy of each classifier: a) 7 tuples in the neighborhood are correctly classified; b) only 6 are correctly classified. Thus X will be classified according with the first classifier.
  • 28. Summary β€’ No one classification technique is always superior to the others. β€’ The regression approaches force the data to fit a predefined model. The problem arises when a linear model is chosen for non linear data. β€’ The KNN technique requires only that the data be such, that distances can be calculated. This can then be applied even to nonnumeric data. Outliers are handled by looking only at the K nearest neighbors. β€’ Bayesian classification assumes that the data attributes are independent with discrete values. β€’ Decision tree techniques are easy to understand, but they may lead to overfitting. To avoid this, pruning techniques may be needed. β€’ ID3 is applicable only to categorical data. C4.5 and C5 allow the use of continuous data and improved techniques for splitting. CART creates binary trees and thus may result in very deep trees. β€’ All algorithms are 𝑂(𝑛) to classify the 𝑛 items in the dataset.
  • 29. References: Dunham, Margaret H. β€œData Mining: Introductory and Advanced Topics”. Pearson Education, Inc., 2003.