This document covers classification and prediction. Classification predicts categorical class labels by building a model from a training set with known labels; prediction models continuous-valued functions to estimate unknown or missing values. Typical applications include credit approval, target marketing, medical diagnosis, and treatment effectiveness analysis. Classification is a two-step process: a learning step that builds a model describing a set of predetermined classes, and a classification step that applies the model to new data, with accuracy estimated by comparing predictions on an independent test set against known labels. The document also discusses issues in preparing data and comparing methods, then surveys techniques (decision tree induction, Bayesian classification, backpropagation, association-based classification, k-nearest neighbors, and others), followed by prediction via regression and accuracy estimation.
2. What is Classification & Prediction
Classification:
- predicts categorical class labels
- classifies data (constructs a model) based on the training set and the values (class labels) in a classifying attribute, and uses it in classifying new data
Prediction:
- models continuous-valued functions
- predicts unknown or missing values
Applications:
- credit approval
- target marketing
- medical diagnosis
- treatment effectiveness analysis
Lecture-32 - What is classification? What is prediction?
3. Classification: A Two-Step Process
Learning step: describing a set of predetermined classes
- Each tuple/sample is assumed to belong to a predefined class, as determined by the class label attribute
- The set of tuples used for model construction is the training set
- The model is represented as classification rules, decision trees, or mathematical formulae
Classification step: classifying future or unknown objects
- Estimate the accuracy of the model: the known label of each test sample is compared with the classified result from the model
- The accuracy rate is the percentage of test set samples that are correctly classified by the model
- The test set is independent of the training set; otherwise over-fitting will occur
4. Classification Process: Model Construction
Training data:

NAME  RANK            YEARS  TENURED
Mike  Assistant Prof  3      no
Mary  Assistant Prof  7      yes
Bill  Professor       2      yes
Jim   Associate Prof  7      yes
Dave  Assistant Prof  6      no
Anne  Associate Prof  3      no

A classification algorithm learns a classifier (model) from this data, e.g.:
IF rank = 'professor' OR years > 6 THEN tenured = 'yes'
5. Classification Process: Use the Model in Prediction
Testing data:

NAME     RANK            YEARS  TENURED
Tom      Assistant Prof  2      no
Merlisa  Associate Prof  7      no
George   Professor       5      yes
Joseph   Assistant Prof  7      yes

Unseen data: (Jeff, Professor, 4). Tenured?
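The two-step process is easy to see in miniature. Below is a small Python sketch (not from the original slides) that hard-codes the rule learned on slide 4, estimates its accuracy on the test set, and classifies the unseen tuple; all names are illustrative.

```python
# Stand-in for the classifier learned from the training data:
# IF rank = 'professor' OR years > 6 THEN tenured = 'yes'
def classify(rank, years):
    return "yes" if rank == "Professor" or years > 6 else "no"

testing_data = [
    ("Tom", "Assistant Prof", 2, "no"),
    ("Merlisa", "Associate Prof", 7, "no"),   # the rule misclassifies this one
    ("George", "Professor", 5, "yes"),
    ("Joseph", "Assistant Prof", 7, "yes"),
]

# Step 2a: estimate accuracy on the independent test set
correct = sum(classify(rank, years) == label
              for _, rank, years, label in testing_data)
print(f"accuracy = {correct}/{len(testing_data)}")  # 3/4

# Step 2b: classify unseen data
print(classify("Professor", 4))  # Jeff -> 'yes'
```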
6. Supervised vs. Unsupervised Learning
Supervised learning (classification)
- Supervision: the training data (observations, measurements, etc.) are accompanied by labels indicating the class of the observations
- New data is classified based on the training set
Unsupervised learning (clustering)
- The class labels of the training data are unknown
- Given a set of measurements, observations, etc., the aim is to establish the existence of classes or clusters in the data
8. Issues Regarding Classification and Prediction: Preparing the Data
Data cleaning
- Preprocess data in order to reduce noise and handle missing values
Relevance analysis (feature selection)
- Remove irrelevant or redundant attributes
Data transformation
- Generalize and/or normalize data
Lecture-33 - Issues regarding classification and prediction
9. Issues Regarding Classification and Prediction: Comparing Classification Methods
Accuracy
Speed and scalability
- time to construct the model
- time to use the model
Robustness
- handling noise and missing values
Scalability
- efficiency in disk-resident databases
Interpretability
- understanding and insight provided by the model
- decision tree size
- compactness of classification rules
10. Lecture-34: Classification by Decision Tree Induction
11. Classification by Decision Tree Induction
Decision tree
- A flow-chart-like tree structure
- An internal node denotes a test on an attribute
- A branch represents an outcome of the test
- Leaf nodes represent class labels or class distributions
Decision tree generation consists of two phases
- Tree construction: at the start, all the training examples are at the root; partition the examples recursively based on selected attributes
- Tree pruning: identify and remove branches that reflect noise or outliers
Use of a decision tree: classifying an unknown sample
- Test the attribute values of the sample against the decision tree
12. Training Dataset

age    income  student  credit_rating  buys_computer
<=30   high    no       fair           no
<=30   high    no       excellent      no
31…40  high    no       fair           yes
>40    medium  no       fair           yes
>40    low     yes      fair           yes
>40    low     yes      excellent      no
31…40  low     yes      excellent      yes
<=30   medium  no       fair           no
<=30   low     yes      fair           yes
>40    medium  yes      fair           yes
<=30   medium  yes      excellent      yes
31…40  medium  no       excellent      yes
31…40  high    yes      fair           yes
>40    medium  no       excellent      no

This follows an example from Quinlan's ID3.
13. Output: A Decision Tree for "buys_computer"

age?
- <=30 -> student?
  - no -> no
  - yes -> yes
- 31…40 -> yes
- >40 -> credit rating?
  - excellent -> no
  - fair -> yes
14. Algorithm for Decision Tree Induction
Basic algorithm (a greedy algorithm; see the sketch below)
- The tree is constructed in a top-down recursive divide-and-conquer manner
- At the start, all the training examples are at the root
- Attributes are categorical (if continuous-valued, they are discretized in advance)
- Examples are partitioned recursively based on selected attributes
- Test attributes are selected on the basis of a heuristic or statistical measure (e.g., information gain)
Conditions for stopping partitioning
- All samples for a given node belong to the same class
- There are no remaining attributes for further partitioning (majority voting is employed for classifying the leaf)
- There are no samples left
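The following compact Python sketch shows one way this greedy algorithm can be implemented for categorical attributes, using entropy-based information gain as the selection measure. It is illustrative code, not from the slides, and all names are my own.

```python
import math
from collections import Counter

def entropy(labels):
    """I(p, n) generalized to any number of classes."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def build_tree(rows, labels, attrs):
    """rows: list of dicts mapping attribute name -> categorical value."""
    if len(set(labels)) == 1:           # all samples belong to the same class
        return labels[0]
    if not attrs:                       # no attributes left: majority voting
        return Counter(labels).most_common(1)[0][0]

    def gain(a):
        remainder = 0.0
        for v in set(r[a] for r in rows):
            subset = [l for r, l in zip(rows, labels) if r[a] == v]
            remainder += len(subset) / len(rows) * entropy(subset)
        return entropy(labels) - remainder

    best = max(attrs, key=gain)         # greedy choice: highest information gain
    children = {}
    for v in set(r[best] for r in rows):
        sub = [(r, l) for r, l in zip(rows, labels) if r[best] == v]
        children[v] = build_tree([r for r, _ in sub], [l for _, l in sub],
                                 [a for a in attrs if a != best])
    return (best, children)
```

Run on the 14-tuple training set of slide 12, this selects age at the root and reproduces the tree of slide 13.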
15. Attribute Selection Measures
Information gain (ID3/C4.5)
- All attributes are assumed to be categorical
- Can be modified for continuous-valued attributes
Gini index (IBM IntelligentMiner)
- All attributes are assumed continuous-valued
- Assume there exist several possible split values for each attribute
- May need other tools, such as clustering, to get the possible split values
- Can be modified for categorical attributes
16. Information Gain (ID3/C4.5)
Select the attribute with the highest information gain.
Assume there are two classes, P and N, and let the set of examples S contain p elements of class P and n elements of class N. The amount of information needed to decide whether an arbitrary example in S belongs to P or N is defined as

$I(p, n) = -\frac{p}{p+n}\log_2\frac{p}{p+n} - \frac{n}{p+n}\log_2\frac{n}{p+n}$
17. Information Gain in Decision Tree Induction
Assume that using attribute A, a set S will be partitioned into sets {S_1, S_2, ..., S_v}. If S_i contains p_i examples of P and n_i examples of N, the entropy, or the expected information needed to classify objects in all subtrees S_i, is

$E(A) = \sum_{i=1}^{v} \frac{p_i + n_i}{p + n} \, I(p_i, n_i)$

The encoding information that would be gained by branching on A is

$Gain(A) = I(p, n) - E(A)$
18. Attribute Selection by Information Gain Computation
Class P: buys_computer = "yes"; Class N: buys_computer = "no"
I(p, n) = I(9, 5) = 0.940
Compute the entropy for age:

age    p_i  n_i  I(p_i, n_i)
<=30   2    3    0.971
31…40  4    0    0
>40    3    2    0.971

$E(age) = \frac{5}{14} I(2,3) + \frac{4}{14} I(4,0) + \frac{5}{14} I(3,2) = 0.694$

Hence $Gain(age) = I(p, n) - E(age) = 0.246$

Similarly: Gain(income) = 0.029, Gain(student) = 0.151, Gain(credit_rating) = 0.048
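These numbers are easy to check. A short, self-contained Python verification (illustrative, not part of the slides):

```python
import math

def I(p, n):
    """Expected information I(p, n) for a two-class set."""
    return -sum(c / (p + n) * math.log2(c / (p + n)) for c in (p, n) if c)

# yes/no counts of buys_computer within each value of 'age' (14 tuples total)
age_partition = {"<=30": (2, 3), "31…40": (4, 0), ">40": (3, 2)}

E_age = sum((p + n) / 14 * I(p, n) for p, n in age_partition.values())
print(round(I(9, 5), 3))          # 0.94
print(round(E_age, 3))            # 0.694
print(round(I(9, 5) - E_age, 3))  # ~0.247 (0.246 from the rounded 0.940 - 0.694)
```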
19. Gini Index (IBM IntelligentMiner)
If a data set T contains examples from n classes, the gini index gini(T) is defined as

$gini(T) = 1 - \sum_{j=1}^{n} p_j^2$

where p_j is the relative frequency of class j in T.
If a data set T is split into two subsets T_1 and T_2 with sizes N_1 and N_2 respectively, the gini index of the split data is defined as

$gini_{split}(T) = \frac{N_1}{N} gini(T_1) + \frac{N_2}{N} gini(T_2)$

The attribute providing the smallest gini_split(T) is chosen to split the node (this requires enumerating all possible splitting points for each attribute).
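Both measures are a few lines of code. A minimal Python sketch of the gini computations (illustrative, not from the slides):

```python
from collections import Counter

def gini(labels):
    """gini(T) = 1 - sum over classes of the squared relative frequency."""
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())

def gini_split(left, right):
    """Weighted gini index of a binary split of T into (T1, T2)."""
    n = len(left) + len(right)
    return len(left) / n * gini(left) + len(right) / n * gini(right)

# Example: the 9 'yes' / 5 'no' class distribution from slide 18
labels = ["yes"] * 9 + ["no"] * 5
print(round(gini(labels), 3))                        # 0.459
print(round(gini_split(labels[:7], labels[7:]), 3))  # one candidate split: 0.204
```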
20. Extracting Classification Rules from Trees
Represent the knowledge in the form of IF-THEN rules
- One rule is created for each path from the root to a leaf
- Each attribute-value pair along a path forms a conjunction
- The leaf node holds the class prediction
- Rules are easier for humans to understand
Example (one rule per path of the tree on slide 13)
- IF age = "<=30" AND student = "no" THEN buys_computer = "no"
- IF age = "<=30" AND student = "yes" THEN buys_computer = "yes"
- IF age = "31…40" THEN buys_computer = "yes"
- IF age = ">40" AND credit_rating = "excellent" THEN buys_computer = "no"
- IF age = ">40" AND credit_rating = "fair" THEN buys_computer = "yes"
21. Avoiding Overfitting in Classification
The generated tree may overfit the training data
- Too many branches, some of which may reflect anomalies due to noise or outliers
- The result is poor accuracy for unseen samples
Two approaches to avoid overfitting
- Prepruning: halt tree construction early; do not split a node if this would result in the goodness measure falling below a threshold (it is difficult to choose an appropriate threshold)
- Postpruning: remove branches from a "fully grown" tree to get a sequence of progressively pruned trees; use a set of data different from the training data to decide which is the "best pruned tree"
22. Approaches to Determine the Final Tree Size
- Separate training and testing sets
- Use cross-validation, e.g., 10-fold cross-validation
- Use all the data for training, and apply a statistical test (e.g., chi-square) to estimate whether expanding or pruning a node may improve the entire distribution
- Use the minimum description length (MDL) principle: halt growth of the tree when the encoding is minimized
23. Enhancements to Basic Decision Tree Induction
Allow for continuous-valued attributes
- Dynamically define new discrete-valued attributes that partition the continuous attribute values into a discrete set of intervals
Handle missing attribute values
- Assign the most common value of the attribute
- Assign a probability to each of the possible values
Attribute construction
- Create new attributes based on existing ones that are sparsely represented
- This reduces fragmentation, repetition, and replication
24. Classification in Large Databases
- Classification is a classical problem extensively studied by statisticians and machine learning researchers
- Scalability: classifying data sets with millions of examples and hundreds of attributes with reasonable speed
Why decision tree induction in data mining?
- relatively faster learning speed than other classification methods
- convertible to simple and easy-to-understand classification rules
- can use SQL queries for accessing databases
- comparable classification accuracy with other methods
25. Scalable Decision Tree Induction Methods in Data Mining Studies
SLIQ (EDBT'96, Mehta et al.)
- builds an index for each attribute; only the class list and the current attribute list reside in memory
SPRINT (VLDB'96, J. Shafer et al.)
- constructs an attribute-list data structure
PUBLIC (VLDB'98, Rastogi & Shim)
- integrates tree splitting and tree pruning: stops growing the tree earlier
RainForest (VLDB'98, Gehrke, Ramakrishnan & Ganti)
- separates the scalability aspects from the criteria that determine the quality of the tree
- builds an AVC-list (attribute, value, class label)
27. Bayesian Classification
- Statistical classifiers
- Based on Bayes' theorem
- Naïve Bayesian classification
- Class conditional independence
- Bayesian belief networks
Lecture-35 - Bayesian Classification
28. Bayesian Classification
Probabilistic learning
- Calculate explicit probabilities for hypotheses; among the most practical approaches to certain types of learning problems
Incremental
- Each training example can incrementally increase or decrease the probability that a hypothesis is correct; prior knowledge can be combined with observed data
Probabilistic prediction
- Predict multiple hypotheses, weighted by their probabilities
Standard
- Even when Bayesian methods are computationally intractable, they can provide a standard of optimal decision making against which other methods can be measured
29. Bayes' Theorem
Let X be a data tuple and H be a hypothesis such that X belongs to a specific class C. The posterior probability of the hypothesis H given X, P(H|X), follows Bayes' theorem:

$P(H|X) = \frac{P(X|H) \, P(H)}{P(X)}$
30. Naïve Bayes Classifier
Let D be a training set of tuples and their associated class labels, with X = (x1, x2, ..., xn) and m classes C1, C2, ..., Cm.
Bayes' theorem: P(Ci|X) = P(X|Ci) · P(Ci) / P(X)
Naïve Bayes predicts that X belongs to class Ci if and only if
P(Ci|X) > P(Cj|X) for 1 <= j <= m, j != i
31. Naïve Bayes Classifier
- P(X) is constant for all classes, so only P(X|Ci) · P(Ci) needs to be maximized
- If the class priors are not known, they are often assumed equal: P(C1) = P(C2) = ... = P(Cm)
- Otherwise, P(Ci) = |Ci,D| / |D|, the fraction of training tuples in D that belong to class Ci
32. Naïve Bayesian Classification
Naïve assumption: attribute independence
P(x1, ..., xk | C) = P(x1|C) · ... · P(xk|C)
If an attribute is categorical:
- P(xi|C) is estimated as the relative frequency of samples having value xi as the i-th attribute in class C
If an attribute is continuous:
- P(xi|C) is estimated through a Gaussian density function
Computationally easy in both cases
33. Play-tennis example: estimating P(xi|C)

Outlook   Temperature  Humidity  Windy  Class
sunny     hot          high      false  N
sunny     hot          high      true   N
overcast  hot          high      false  P
rain      mild         high      false  P
rain      cool         normal    false  P
rain      cool         normal    true   N
overcast  cool         normal    true   P
sunny     mild         high      false  N
sunny     cool         normal    false  P
rain      mild         normal    false  P
sunny     mild         normal    true   P
overcast  mild         high      true   P
overcast  hot          normal    false  P
rain      mild         high      true   N

P(p) = 9/14, P(n) = 5/14

outlook:     P(sunny|p) = 2/9,    P(sunny|n) = 3/5
             P(overcast|p) = 4/9, P(overcast|n) = 0
             P(rain|p) = 3/9,     P(rain|n) = 2/5
temperature: P(hot|p) = 2/9,      P(hot|n) = 2/5
             P(mild|p) = 4/9,     P(mild|n) = 2/5
             P(cool|p) = 3/9,     P(cool|n) = 1/5
humidity:    P(high|p) = 3/9,     P(high|n) = 4/5
             P(normal|p) = 6/9,   P(normal|n) = 2/5
windy:       P(true|p) = 3/9,     P(true|n) = 3/5
             P(false|p) = 6/9,    P(false|n) = 2/5
34. Play-tennis example: classifying X
An unseen sample X = <rain, hot, high, false>
P(X|p) · P(p) = P(rain|p) · P(hot|p) · P(high|p) · P(false|p) · P(p) = 3/9 · 2/9 · 3/9 · 6/9 · 9/14 = 0.010582
P(X|n) · P(n) = P(rain|n) · P(hot|n) · P(high|n) · P(false|n) · P(n) = 2/5 · 2/5 · 4/5 · 2/5 · 5/14 = 0.018286
Sample X is classified in class n (don't play)
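The whole play-tennis computation fits in a few lines of Python. The sketch below (illustrative, not from the slides) estimates the per-class conditional probabilities from the table on slide 33 and reproduces the two scores:

```python
data = [  # (outlook, temperature, humidity, windy, class)
    ("sunny", "hot", "high", "false", "N"), ("sunny", "hot", "high", "true", "N"),
    ("overcast", "hot", "high", "false", "P"), ("rain", "mild", "high", "false", "P"),
    ("rain", "cool", "normal", "false", "P"), ("rain", "cool", "normal", "true", "N"),
    ("overcast", "cool", "normal", "true", "P"), ("sunny", "mild", "high", "false", "N"),
    ("sunny", "cool", "normal", "false", "P"), ("rain", "mild", "normal", "false", "P"),
    ("sunny", "mild", "normal", "true", "P"), ("overcast", "mild", "high", "true", "P"),
    ("overcast", "hot", "normal", "false", "P"), ("rain", "mild", "high", "true", "N"),
]

def score(x, c):
    """P(X|c) * P(c) under the attribute-independence assumption."""
    rows = [r for r in data if r[-1] == c]
    s = len(rows) / len(data)                      # prior P(c)
    for i, v in enumerate(x):                      # times each P(x_i | c)
        s *= sum(1 for r in rows if r[i] == v) / len(rows)
    return s

x = ("rain", "hot", "high", "false")
print(round(score(x, "P"), 6))  # 0.010582
print(round(score(x, "N"), 6))  # 0.018286 -> classified as N (don't play)
```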
35. How Effective Are Bayesian Classifiers?
The class-conditional independence assumption
- makes the computation possible
- yields optimal classifiers when it is satisfied
- but is seldom satisfied in practice, as attributes (variables) are often correlated
Attempts to overcome this limitation
- Bayesian networks, which combine Bayesian reasoning with causal relationships between attributes
- Decision trees, which reason on one attribute at a time, considering the most important attributes first
36. Bayesian Belief Networks (I)
[Figure: a six-node belief network over FamilyHistory (FH), Smoker (S), LungCancer (LC), Emphysema, PositiveXRay, and Dyspnea; FamilyHistory and Smoker are the parents of LungCancer.]
The conditional probability table for the variable LungCancer:

      (FH, S)  (FH, ~S)  (~FH, S)  (~FH, ~S)
LC    0.8      0.5       0.7       0.1
~LC   0.2      0.5       0.3       0.9
37. Bayesian Belief Networks
- A Bayesian belief network allows a subset of the variables to be conditionally independent
- It is a graphical model of causal relationships
- Several cases of learning Bayesian belief networks:
  - Given both the network structure and all the variables: easy
  - Given the network structure but only some of the variables
  - When the network structure is not known in advance
39. Classification by Backpropagation
- A neural network learning algorithm
- Pioneered by psychologists and neurobiologists
- Neural network: a set of connected input/output units in which each connection has a weight associated with it
Lecture-36 - Classification by Backpropagation
40. Neural Networks
Advantages
- prediction accuracy is generally high
- robust: works when training examples contain errors
- output may be discrete, real-valued, or a vector of several discrete or real-valued attributes
- fast evaluation of the learned target function
Limitations
- long training time
- difficult to understand the learned function (weights)
- not easy to incorporate domain knowledge
41. A Neuron
The n-dimensional input vector x is mapped into the variable y by means of a scalar product and a nonlinear function mapping:

$y = f\Big(\sum_{i=0}^{n} w_i x_i - \mu_k\Big)$

[Figure: input vector x = (x0, ..., xn) and weight vector w = (w0, ..., wn) feed a weighted sum, which passes through the activation function f (with bias µk) to produce the output y.]
42. Network Training
The ultimate objective of training
- obtain a set of weights that makes almost all the tuples in the training data classified correctly
Steps
- Initialize weights with random values
- Feed the input tuples into the network one by one
- For each unit:
  - compute the net input to the unit as a linear combination of all the inputs to the unit
  - compute the output value using the activation function
  - compute the error
  - update the weights and the bias
43. Multi-Layer Perceptron
[Figure: the input vector x_i feeds the input nodes, which connect through weights w_ij to hidden nodes and then to output nodes, producing the output vector.]

Net input to unit j: $I_j = \sum_i w_{ij} O_i + \theta_j$
Output of unit j (sigmoid activation): $O_j = \frac{1}{1 + e^{-I_j}}$
Error at an output unit j: $Err_j = O_j (1 - O_j)(T_j - O_j)$
Error at a hidden unit j: $Err_j = O_j (1 - O_j) \sum_k Err_k \, w_{jk}$
Weight update: $w_{ij} = w_{ij} + (l) \, Err_j \, O_i$
Bias update: $\theta_j = \theta_j + (l) \, Err_j$

where (l) is the learning rate and T_j is the true output.
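A runnable Python sketch of one training step that follows these update rules for a network with a single hidden layer (illustrative code; the function and variable names are my own):

```python
import math

def sigmoid(v):
    return 1.0 / (1.0 + math.exp(-v))

def train_step(x, target, W1, b1, W2, b2, lr=0.5):
    """One backpropagation step for a single-hidden-layer network.
    W1[j][i]: weight input i -> hidden j; W2[k][j]: weight hidden j -> output k."""
    # Forward pass: I_j = sum_i w_ij * O_i + theta_j, O_j = 1 / (1 + e^-I_j)
    h = [sigmoid(sum(W1[j][i] * x[i] for i in range(len(x))) + b1[j])
         for j in range(len(W1))]
    o = [sigmoid(sum(W2[k][j] * h[j] for j in range(len(h))) + b2[k])
         for k in range(len(W2))]
    # Backward pass: Err_k = O_k(1-O_k)(T_k-O_k) at the outputs,
    # Err_j = O_j(1-O_j) * sum_k Err_k * w_jk at the hidden units
    err_o = [o[k] * (1 - o[k]) * (target[k] - o[k]) for k in range(len(o))]
    err_h = [h[j] * (1 - h[j]) * sum(err_o[k] * W2[k][j] for k in range(len(o)))
             for j in range(len(h))]
    # Updates: w_ij += l * Err_j * O_i and theta_j += l * Err_j
    for k in range(len(W2)):
        for j in range(len(h)):
            W2[k][j] += lr * err_o[k] * h[j]
        b2[k] += lr * err_o[k]
    for j in range(len(W1)):
        for i in range(len(x)):
            W1[j][i] += lr * err_h[j] * x[i]
        b1[j] += lr * err_h[j]
    return o
```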
45. Association-Based Classification
Several methods for association-based classification
- ARCS: quantitative association mining and clustering of association rules (Lent et al. '97)
  - beats C4.5 in (mainly) scalability and also in accuracy
- Associative classification (Liu et al. '98)
  - mines high-support, high-confidence rules of the form "cond_set => y", where y is a class label
Lecture-37 - Classification based on concepts from association rule mining
46. Association-Based Classification
CAEP: classification by aggregating emerging patterns (Dong et al. '99)
- Emerging patterns (EPs): itemsets whose support increases significantly from one class to another
- Mine EPs based on minimum support and growth rate
48. Other Classification Methods
- k-nearest neighbor classifier
- case-based reasoning
- genetic algorithms
- rough set approach
- fuzzy set approaches
Lecture-38 - Other Classification Methods
49. Instance-Based Methods
Instance-based learning
- Store training examples and delay the processing ("lazy evaluation") until a new instance must be classified
Approaches
- k-nearest neighbor: instances represented as points in a Euclidean space
- Locally weighted regression: constructs a local approximation
- Case-based reasoning: uses symbolic representations and knowledge-based inference
50. The k-Nearest Neighbor Algorithm
- All instances correspond to points in the n-dimensional space
- The nearest neighbors are defined in terms of Euclidean distance
- The target function can be discrete- or real-valued
- For discrete-valued targets, k-NN returns the most common value among the k training examples nearest to x_q
- Voronoi diagram: the decision surface induced by 1-NN for a typical set of training examples
51. Discussion on the k-NN Algorithm
The k-NN algorithm for continuous-valued target functions
- calculate the mean value of the k nearest neighbors
Distance-weighted nearest neighbor algorithm
- weight the contribution of each of the k neighbors according to their distance to the query point x_q, giving greater weight to closer neighbors:

$w \equiv \frac{1}{d(x_q, x_i)^2}$

- similarly for real-valued target functions
Robust to noisy data by averaging the k nearest neighbors
Curse of dimensionality: the distance between neighbors can be dominated by irrelevant attributes
- to overcome it, stretch the axes or eliminate the least relevant attributes
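A small Python sketch of both variants, plain and distance-weighted k-NN (illustrative, not from the slides; the example points are made up):

```python
import math
from collections import Counter

def knn_classify(query, examples, k=3, weighted=False):
    """k-NN for a discrete-valued target; examples are (point, label) pairs.
    With weighted=True each neighbor votes with weight w = 1 / d(x_q, x_i)^2."""
    dist = lambda a, b: math.sqrt(sum((u - v) ** 2 for u, v in zip(a, b)))
    neighbors = sorted(examples, key=lambda e: dist(query, e[0]))[:k]
    votes = Counter()
    for point, label in neighbors:
        d = dist(query, point)
        votes[label] += 1.0 / (d * d) if weighted and d > 0 else 1.0
    return votes.most_common(1)[0][0]

examples = [((1.0, 1.0), "+"), ((1.2, 0.8), "+"),
            ((4.0, 4.2), "-"), ((4.5, 4.0), "-")]
print(knn_classify((1.1, 1.0), examples, k=3))                 # '+'
print(knn_classify((3.0, 3.0), examples, k=3, weighted=True))  # '-'
```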
52. Case-Based Reasoning
- Also uses lazy evaluation and the analysis of similar instances
- Difference: instances are not "points in a Euclidean space"
- Example: the water faucet problem in CADET (Sycara et al. '92)
Methodology
- Instances represented by rich symbolic descriptions (e.g., function graphs)
- Multiple retrieved cases may be combined
- Tight coupling between case retrieval, knowledge-based reasoning, and problem solving
Research issues
- Indexing based on syntactic similarity measures and, on failure, backtracking and adapting to additional cases
53. Lazy vs. Eager Learning
- Instance-based learning: lazy evaluation
- Decision-tree and Bayesian classification: eager evaluation
Key differences
- A lazy method may consider the query instance x_q when deciding how to generalize beyond the training data D
- An eager method cannot, since it has already chosen its global approximation before seeing the query
Efficiency
- Lazy methods need less time for training but more time for predicting
Accuracy
- A lazy method effectively uses a richer hypothesis space, since it uses many local linear functions to form its implicit global approximation to the target function
- An eager method must commit to a single hypothesis that covers the entire instance space
54. Genetic Algorithms
- GA: based on an analogy to biological evolution
- Each rule is represented by a string of bits
- An initial population is created consisting of randomly generated rules
  - e.g., IF A1 AND NOT A2 THEN C2 can be encoded as 100
- Based on the notion of survival of the fittest, a new population is formed consisting of the fittest rules and their offspring
- The fitness of a rule is measured by its classification accuracy on a set of training examples
- Offspring are generated by crossover and mutation
55. Rough Set Approach
- Rough sets are used to approximately or "roughly" define equivalence classes
- A rough set for a given class C is approximated by two sets: a lower approximation (certain to be in C) and an upper approximation (cannot be described as not belonging to C)
- Finding the minimal subsets (reducts) of attributes for feature reduction is NP-hard, but a discernibility matrix can be used to reduce the computational intensity
56. Fuzzy Set Approaches
- Fuzzy logic uses truth values between 0.0 and 1.0 to represent the degree of membership (such as with a fuzzy membership graph)
- Attribute values are converted to fuzzy values
  - e.g., income is mapped into the discrete categories {low, medium, high}, with fuzzy values calculated for each
- For a given new sample, more than one fuzzy value may apply
- Each applicable rule contributes a vote for membership in the categories
- Typically, the truth values for each predicted category are summed
58. What Is Prediction?
Prediction is similar to classification
- First, construct a model
- Second, use the model to predict unknown values
- The major method for prediction is regression: linear and multiple regression, and non-linear regression
Prediction is different from classification
- Classification predicts categorical class labels
- Prediction models continuous-valued functions
Lecture-39 - Prediction
59. Predictive Modeling in Databases
Predictive modeling: predict data values or construct generalized linear models based on the database data
- One can only predict value ranges or category distributions
Method outline
- Minimal generalization
- Attribute relevance analysis
- Generalized linear model construction
- Prediction
Determine the major factors which influence the prediction
- Data relevance analysis: uncertainty measurement, entropy analysis, expert judgement, etc.
Multi-level prediction: drill-down and roll-up analysis
60. Regression Analysis and Log-Linear Models in Prediction
Linear regression: Y = α + β X
- The two parameters α and β specify the line and are to be estimated using the data at hand
- Apply the least squares criterion to the known values Y1, Y2, ..., X1, X2, ...
Multiple regression: Y = b0 + b1 X1 + b2 X2
- Many nonlinear functions can be transformed into the above
Log-linear models
- The multi-way table of joint probabilities is approximated by a product of lower-order tables
- Probability: $p(a, b, c, d) = \alpha_{ab} \, \beta_{ac} \, \chi_{ad} \, \delta_{bcd}$
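For simple linear regression the least squares estimates have a closed form, β = Σ(xi − x̄)(yi − ȳ) / Σ(xi − x̄)² and α = ȳ − β x̄. A minimal Python sketch (illustrative, not from the slides):

```python
def fit_linear(xs, ys):
    """Least-squares estimates for Y = alpha + beta * X."""
    n = len(xs)
    x_bar, y_bar = sum(xs) / n, sum(ys) / n
    beta = (sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
            / sum((x - x_bar) ** 2 for x in xs))
    alpha = y_bar - beta * x_bar
    return alpha, beta

# Example: points lying near the line Y = 1 + 2X
xs, ys = [1, 2, 3, 4], [3.1, 4.9, 7.2, 8.8]
alpha, beta = fit_linear(xs, ys)
print(round(alpha, 2), round(beta, 2))  # roughly 1 and 2
```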
61. Locally Weighted Regression
- Construct an explicit approximation to f over a local region surrounding the query instance x_q
- Locally weighted linear regression: the target function f is approximated near x_q using the linear function

$\hat{f}(x) = w_0 + w_1 a_1(x) + \dots + w_n a_n(x)$

- Minimize the squared error over the k nearest neighbors of x_q, with a distance-decreasing weight K:

$E(x_q) \equiv \frac{1}{2} \sum_{x \,\in\, k\text{-NN}(x_q)} (f(x) - \hat{f}(x))^2 \, K(d(x_q, x))$

- The gradient descent training rule:

$\Delta w_j \equiv \eta \sum_{x \,\in\, k\text{-NN}(x_q)} K(d(x_q, x)) \, (f(x) - \hat{f}(x)) \, a_j(x)$

- In most cases, the target function is approximated by a constant, linear, or quadratic function
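A gradient-descent sketch of locally weighted linear regression for one-dimensional inputs (illustrative; the kernel K and all names are my own choices, and a closed-form weighted least squares fit would work equally well):

```python
def lwr_predict(xq, xs, ys, k=3, eta=0.01, steps=2000):
    """Locally weighted linear regression for 1-D inputs.
    Fits f_hat(x) = w0 + w1*x on the k nearest neighbors of xq, weighting
    each neighbor's squared error by a kernel K that decreases with distance."""
    K = lambda d: 1.0 / (1.0 + d * d)           # one possible distance kernel
    nearest = sorted(zip(xs, ys), key=lambda p: abs(p[0] - xq))[:k]
    w0, w1 = 0.0, 0.0
    for _ in range(steps):                      # gradient descent on E(xq)
        dw0 = dw1 = 0.0
        for x, y in nearest:
            err = y - (w0 + w1 * x)             # f(x) - f_hat(x)
            kd = K(abs(xq - x))
            dw0 += eta * kd * err               # a_0(x) = 1
            dw1 += eta * kd * err * x           # a_1(x) = x
        w0, w1 = w0 + dw0, w1 + dw1
    return w0 + w1 * xq

xs = [0, 1, 2, 3, 4, 5]
ys = [0.1, 1.1, 3.9, 9.2, 16.1, 24.8]           # roughly y = x^2
print(lwr_predict(2.5, xs, ys))                 # local linear estimate near x = 2.5
```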
63. Classification Accuracy: Estimating Error Rates
Partition: training-and-testing
- use two independent data sets: a training set and a test set
- suitable for data sets with a large number of samples
Cross-validation
- divide the data set into k subsamples
- use k-1 subsamples as training data and one subsample as test data: k-fold cross-validation
- suitable for data sets of moderate size
Bootstrapping (leave-one-out)
- for small data sets
Lecture-40 - Classification accuracy
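A minimal sketch of generating the k folds (illustrative; real uses would shuffle the data first):

```python
def k_fold_splits(data, k=10):
    """Yield (train, test) pairs for k-fold cross-validation."""
    folds = [data[i::k] for i in range(k)]      # k roughly equal subsamples
    for i in range(k):
        test = folds[i]
        train = [x for j, fold in enumerate(folds) if j != i for x in fold]
        yield train, test

data = list(range(20))
for train, test in k_fold_splits(data, k=5):
    print(len(train), len(test))                # 16 4, five times
```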
64. Boosting and Bagging
- Boosting increases classification accuracy
- Applicable to decision trees or Bayesian classifiers
- Learn a series of classifiers, where each classifier in the series pays more attention to the examples misclassified by its predecessor
- Boosting requires only linear time and constant space
65. Boosting Technique: Algorithm
- Assign every example an equal weight 1/N
- For t = 1, 2, ..., T:
  - obtain a hypothesis (classifier) h(t) under the weights w(t)
  - calculate the error of h(t) and re-weight the examples based on the error
  - normalize w(t+1) to sum to 1
- Output a weighted sum of all the hypotheses, with each hypothesis weighted according to its accuracy on the training set
Lecture-40 - Classification accuracy
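One concrete instantiation of this scheme is AdaBoost. The Python sketch below (illustrative, not the slides' algorithm verbatim) assumes labels in {+1, -1} and a caller-supplied learn(examples, labels, weights) that returns a classifier h(x):

```python
import math

def boost(examples, labels, learn, T=10):
    """AdaBoost-style boosting; `learn` is a hypothetical weak-learner callback."""
    N = len(examples)
    w = [1.0 / N] * N                      # equal initial weights 1/N
    hypotheses = []
    for _ in range(T):
        h = learn(examples, labels, w)     # obtain h(t) under w(t)
        err = sum(wi for wi, x, y in zip(w, examples, labels) if h(x) != y)
        if err >= 0.5:                     # no better than chance: stop
            break
        alpha = 0.5 * math.log((1 - err) / max(err, 1e-10))  # hypothesis weight
        hypotheses.append((alpha, h))
        # re-weight: boost misclassified examples, then normalize to sum to 1
        w = [wi * math.exp(alpha if h(x) != y else -alpha)
             for wi, x, y in zip(w, examples, labels)]
        total = sum(w)
        w = [wi / total for wi in w]

    def classify(x):                       # weighted vote of all hypotheses
        s = sum(a * h(x) for a, h in hypotheses)
        return 1 if s >= 0 else -1
    return classify
```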