CoSc 6211 JJU4
k-Nearest Neighbor
• KNN is a simple algorithm that stores all
available cases and classifies new cases based
on a similarity measure
• Lazy learning algorithm
• Non-parametric learning algorithm
– because it doesn’t assume anything about the
underlying data.
KNN-classification
• Lazy approach to classification
• Uses all the training set to perform
classification
• Uses distances between training and test
records
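The lazy, distance-based classification described above can be sketched in a few lines of Python. This is an illustrative sketch, not from the course materials; the function name `knn_classify` and the toy data are my own.

```python
import math
from collections import Counter

def knn_classify(train, query, k=3):
    """Classify `query` by majority vote among its k nearest
    training records, using Euclidean distance.
    `train` is a list of (features, label) pairs."""
    ranked = sorted((math.dist(x, query), label) for x, label in train)
    votes = Counter(label for _, label in ranked[:k])
    return votes.most_common(1)[0][0]

# Toy data, made up for illustration
train = [((1, 1), "A"), ((2, 1), "A"), ((8, 8), "B"), ((9, 9), "B")]
print(knn_classify(train, (1.5, 1.2), k=3))  # "A"
```

Note there is no training phase: all the work happens at classification time, which is what makes the algorithm "lazy".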
KNN example
• The test sample (green circle) should be classified either to the first class of blue squares or to the second class of red triangles.
• If k = 3 it is assigned to the second class because there are 2 triangles and only 1 square inside the inner circle.
• If k = 5 it is assigned to the first class (3 squares vs. 2 triangles inside the outer circle).
Classification Example
• An unknown instance is classified based on the class of its nearest instance(s)
Similarity Measure
• Euclidean distance (square root of the sum of squared differences):
  d_euclidean(x, y) = sqrt( Σi (xi − yi)² )
• Manhattan distance (sum of absolute differences):
  d_cityblock(x, y) = Σi |xi − yi|
where x = x1, x2, . . . , xm, and y = y1, y2, . . . , ym represent
the m attribute values of two records.
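As a quick check of the two measures, here is a minimal Python sketch (the function names are my own):

```python
import math

def euclidean(x, y):
    # square root of the sum of squared differences
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

def manhattan(x, y):
    # sum of absolute differences (city-block distance)
    return sum(abs(a - b) for a, b in zip(x, y))

x, y = (10, 2), (2, 7)            # records 1 and 8 from the later example
print(round(euclidean(x, y), 1))  # 9.4
print(manhattan(x, y))            # 13
```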
KNN
• Scaling issues
– Attributes may have to be scaled to prevent
distance measures from being dominated by one
of the attributes
– Example:
• height of a person may vary from 1.5m to 1.8m
• weight of a person may vary from 90lb to 300lb
• income of a person may vary from $10K to $1M
Scaling…
• Attribute normalization if scales are different
• For continuous variables, min–max normalization or Z-score
standardization may be used:
• Min–max normalization:
  X* = (X − min(X)) / range(X)
     = (X − min(X)) / (max(X) − min(X))
• Z-score standardization:
  X* = (X − mean(X)) / SD(X)
• For categorical variables:
  different(xi, yi) = 0 if xi = yi, 1 otherwise
  Example: if x = male and y = female, then distance(x, y) = 1
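The two normalizations and the categorical distance can be written directly in Python; the constants below (min 10, max 60, mean 45, SD 15) are the values assumed in Exercise 2 later in this section:

```python
def min_max(x, x_min, x_max):
    # min-max normalization: scales x into [0, 1]
    return (x - x_min) / (x_max - x_min)

def z_score(x, mean, sd):
    # Z-score standardization: mean 0, SD 1
    return (x - mean) / sd

def categorical_distance(a, b):
    # categorical attributes: 0 if equal, 1 otherwise
    return 0 if a == b else 1

print(min_max(20, 10, 60))                     # 0.2
print(round(z_score(20, 45, 15), 2))           # -1.67
print(categorical_distance("male", "female"))  # 1
```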
Example of K-NN Classification
Training data:
ID  Att1  Att2  Class
1   10    2     Yes
2   4     4     No
3   1     9     Yes
4   3     10    Yes
5   4     6     No
6   8     8     No
7   1     8     Yes

Test data:
ID  Att1  Att2  Class
8   2     7     ?
9   7     7     ?
10  1     11    ?
Example of K-NN Classification from reference book:
Tan, Steinbach, Kumar
Example of K-NN Classification

Euclidean distances from each training record to test records 8, 9 and 10:
ID  ED to 8  ED to 9  ED to 10  Class
1   9.4      5.8      13        Yes
2   3.6      4.2      7.6       No
3   2.2      6.3      2         Yes
4   3.2      5        2.2       Yes
5   2.2      3.2      5.8       No
6   6.1      1.4      7.6       No
7   1.4      6.1      3         Yes
1) Computing the Euclidean distances:
ED(1,8) = sqrt((10-2)² + (2-7)²) = sqrt(89) = 9.4
ED(2,8) = sqrt((4-2)² + (4-7)²) = sqrt(13) = 3.6
.
.
ED(7,8) = sqrt((1-2)² + (8-7)²) = sqrt(2) = 1.4142… ≈ 1.4

Classification using K = 3:
3-NN(8) = {7, 3, 5} = {Yes, Yes, No} → Yes
3-NN(9) = ?
3-NN(10) = ?
Repeat using K = 5 and K = 7.

2) Homework: use the Manhattan distance to classify the test records above.
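The worked example can be checked with a short script. This is my own sketch; the data comes from the training and test tables above, and `math.dist` gives the Euclidean distance.

```python
import math
from collections import Counter

# Training and test records from the tables above
train = {1: ((10, 2), "Yes"), 2: ((4, 4), "No"), 3: ((1, 9), "Yes"),
         4: ((3, 10), "Yes"), 5: ((4, 6), "No"), 6: ((8, 8), "No"),
         7: ((1, 8), "Yes")}
test = {8: (2, 7), 9: (7, 7), 10: (1, 11)}

def knn(test_id, k=3, dist=math.dist):
    # rank training IDs by distance to the test record, then majority-vote
    ranked = sorted(train, key=lambda i: dist(train[i][0], test[test_id]))
    votes = Counter(train[i][1] for i in ranked[:k])
    return votes.most_common(1)[0][0]

print(knn(8, k=3))  # Yes -- the 3 nearest neighbours are 7, 3, 5
```

Calling `knn(9)` and `knn(10)`, changing `k` to 5 or 7, or passing a city-block function as `dist` answers the remaining questions and the homework.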
**Exercise 1
Record  Age  Gender
A       50   Male
B       20   Male
C       50   Female
Table 1: Variable values for Age and Gender for the given data set
a) Which is more similar to A: B or C?
b) Give comments
**Exercise 2
Record  Age  Agemmn  AgeZs  Gender
A       50   ?       ?      Male
B       20   ?       ?      Male
C       50   ?       ?      Female
Table 2: Variable values for Age and Gender for the given data.
Agemmn = min–max normalization; AgeZs = Z-score standardization.
Assume that for the variable Age the minimum is 10, the range is 50, the mean is 45,
and the standard deviation is 15.
a) Compute and fill in the table above: Agemmn and AgeZs using min–max
normalization and Z-score standardization respectively.
b) Which is more similar to A: B or C, using Agemmn and AgeZs? First
compute the Euclidean distance for both Agemmn and AgeZs.
c) Compare your answer with Table 1, question a.
d) Give comments.
Rule-based classifier
Based on the reference book (Tan, Steinbach, Kumar) and partly on the textbook (J. Han)
Rule-based classifier
• Classify records by using a collection of “if…then…”
rules
• Rule: (Condition) → y
– where
• Condition is a conjunction of attribute tests
• y is the class label
– LHS: rule antecedent or condition
– RHS: rule consequent
– Examples of classification rules:
• (Blood Type = Warm) ∧ (Lay Eggs = Yes) → Birds
• (Taxable Income < 50K) ∧ (Refund = Yes) → Evade = No
Rule-based classifier example
Name Blood Type Give Birth Can Fly Live in Water Class
human warm yes no no mammals
python cold no no no reptiles
salmon cold no no yes fishes
whale warm yes no yes mammals
frog cold no no sometimes amphibians
komodo cold no no no reptiles
bat warm yes yes no mammals
pigeon warm no yes no birds
cat warm yes no no mammals
leopard shark cold yes no yes fishes
turtle cold no no sometimes reptiles
penguin warm no no sometimes birds
porcupine warm yes no no mammals
eel cold no no yes fishes
salamander cold no no sometimes amphibians
gila monster cold no no no reptiles
platypus warm no no no mammals
owl warm no yes no birds
dolphin warm yes no yes mammals
eagle warm no yes no birds
R1: (Give Birth = no) ∧ (Can Fly = yes) → Birds
R2: (Give Birth = no) ∧ (Live in Water = yes) → Fishes
R3: (Give Birth = yes) ∧ (Blood Type = warm) → Mammals
R4: (Give Birth = no) ∧ (Can Fly = no) → Reptiles
R5: (Live in Water = sometimes) → Amphibians
Application of Rule-Based Classifier
• A rule r covers an instance x if the attributes of
the instance satisfy the condition of the rule
R1: (Give Birth = no) ∧ (Can Fly = yes) → Birds
R2: (Give Birth = no) ∧ (Live in Water = yes) → Fishes
R3: (Give Birth = yes) ∧ (Blood Type = warm) → Mammals
R4: (Give Birth = no) ∧ (Can Fly = no) → Reptiles
R5: (Live in Water = sometimes) → Amphibians
Name Blood Type Give Birth Can Fly Live in Water Class
hawk warm no yes no ?
grizzly bear warm yes no no ?
The rule R1 covers the hawk => Birds
The rule R3 covers the grizzly bear => Mammals
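Rules R1–R5 can be encoded as condition/label pairs and applied in order, so that the first rule that covers a record fires. The first-match strategy and the dict-based record encoding are my own illustrative choices:

```python
rules = [
    ("R1", lambda r: r["give_birth"] == "no" and r["can_fly"] == "yes", "Birds"),
    ("R2", lambda r: r["give_birth"] == "no" and r["live_in_water"] == "yes", "Fishes"),
    ("R3", lambda r: r["give_birth"] == "yes" and r["blood_type"] == "warm", "Mammals"),
    ("R4", lambda r: r["give_birth"] == "no" and r["can_fly"] == "no", "Reptiles"),
    ("R5", lambda r: r["live_in_water"] == "sometimes", "Amphibians"),
]

def classify(record):
    # return the class of the first rule that covers the record
    for name, covers, label in rules:
        if covers(record):
            return label
    return None  # no rule covers the record

hawk = {"blood_type": "warm", "give_birth": "no",
        "can_fly": "yes", "live_in_water": "no"}
grizzly = {"blood_type": "warm", "give_birth": "yes",
           "can_fly": "no", "live_in_water": "no"}
print(classify(hawk))     # Birds   (covered by R1)
print(classify(grizzly))  # Mammals (covered by R3)
```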
Rule Coverage and Accuracy
• Coverage of a rule:
– Fraction of records
that satisfy the
antecedent of a rule
• Accuracy of a rule:
– Fraction of records
that satisfy the
antecedent that also
satisfy the
consequent of a rule
Tid  Refund  Marital Status  Taxable Income  Class
1    Yes     Single          125K            No
2    No      Married         100K            No
3    No      Single          70K             No
4    Yes     Married         120K            No
5    No      Divorced        95K             Yes
6    No      Married         60K             No
7    Yes     Divorced        220K            No
8    No      Single          85K             Yes
9    No      Married         75K             No
10   No      Single          90K             Yes
ncovers = # of tuples covered by R
ncorrect = # of tuples correctly classified by R
coverage(R) = ncovers /|D| /* D: training data set */
accuracy(R) = ncorrect / ncovers
R1: (Status = Single) → No
Coverage = 40%, Accuracy = 50%
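The two measures can be verified on the table above with a short script (my own sketch; records encoded as tuples):

```python
# (Tid, Refund, Marital Status, Taxable Income, Class) from the table above
data = [
    (1, "Yes", "Single", "125K", "No"),   (2, "No", "Married", "100K", "No"),
    (3, "No", "Single", "70K", "No"),     (4, "Yes", "Married", "120K", "No"),
    (5, "No", "Divorced", "95K", "Yes"),  (6, "No", "Married", "60K", "No"),
    (7, "Yes", "Divorced", "220K", "No"), (8, "No", "Single", "85K", "Yes"),
    (9, "No", "Married", "75K", "No"),    (10, "No", "Single", "90K", "Yes"),
]

# R1: (Status = Single) -> No
covered = [r for r in data if r[2] == "Single"]   # satisfy the antecedent
correct = [r for r in covered if r[4] == "No"]    # ...and the consequent

coverage = len(covered) / len(data)     # 4 / 10 = 0.4
accuracy = len(correct) / len(covered)  # 2 / 4  = 0.5
print(coverage, accuracy)  # 0.4 0.5
```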
Rule Extraction from a Decision Tree
Rules are easier to understand
than large trees
One rule is created for each path
from the root to a leaf
Each attribute-value pair along a
path forms a conjunction: the leaf
holds the class prediction
[Decision tree: the root tests age; the <=30 branch tests student (no → no, yes → yes); the 31..40 branch predicts yes; the >40 branch tests credit_rating (excellent → no, fair → yes)]
Example: Rule extraction from our buys_computer decision-tree
IF age = young AND student = no THEN buys_computer = no
IF age = young AND student = yes THEN buys_computer = yes
IF age = mid-age THEN buys_computer = yes
IF age = old AND credit_rating = excellent THEN buys_computer = no
IF age = old AND credit_rating = fair THEN buys_computer = yes
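The five extracted rules translate directly into mutually exclusive branches. This sketch (the function name is my own) applies them in order:

```python
def buys_computer(age, student=None, credit_rating=None):
    # one branch per rule extracted from the decision tree
    if age == "young":
        return "yes" if student == "yes" else "no"
    if age == "mid-age":
        return "yes"
    if age == "old":
        return "no" if credit_rating == "excellent" else "yes"
    return None  # age value not covered by any rule

print(buys_computer("young", student="yes"))       # yes
print(buys_computer("old", credit_rating="fair"))  # yes
```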
Ensemble Methods:
Increasing the Accuracy
• Ensemble methods
– Use a combination of models to
increase accuracy
– Combine a series of k learned
models, M1, M2, …, Mk, with the aim
of creating an improved model M*
• Popular ensemble methods
– Bagging: averaging the prediction
over a collection of classifiers
– Boosting: weighted vote with a
collection of classifiers
– Ensemble: combining a set of
heterogeneous classifiers
Bagging: Bootstrap Aggregation
Analogy: Diagnosis based on multiple doctors’ majority vote
• Training
– Given a set D of d tuples, at each iteration i, a training set Di of d tuples is sampled
with replacement from D (i.e., bootstrap)
– A classifier model Mi is learned for each training set Di
• Classification: classify an unknown sample X
– Each classifier Mi returns its class prediction
– The bagged classifier M* counts the votes and assigns the class with the most votes
to X
• Prediction: can be applied to the prediction of continuous values by taking the average
value of each prediction for a given test tuple
• Accuracy
– Often significantly better than a single classifier derived from D
– For noise data: not considerably worse, more robust
– Proved improved accuracy in prediction
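The bootstrap-and-vote procedure above can be sketched in plain Python. The base learner here is a 1-nearest-neighbour model, which is my own choice for illustration; any classifier would do:

```python
import math
import random
from collections import Counter

def one_nn(sample):
    # base learner: 1-nearest-neighbour on the bootstrap sample
    def model(x):
        return min(sample, key=lambda r: math.dist(r[0], x))[1]
    return model

def bagging_predict(data, x, k=25, learn=one_nn, seed=0):
    # train k models, each on d tuples sampled with replacement (bootstrap),
    # then assign the class with the most votes
    rng = random.Random(seed)
    votes = []
    for _ in range(k):
        bootstrap = [rng.choice(data) for _ in data]
        votes.append(learn(bootstrap)(x))
    return Counter(votes).most_common(1)[0][0]

data = [((1, 1), "A"), ((1, 2), "A"), ((8, 8), "B"), ((9, 8), "B")]
print(bagging_predict(data, (1.5, 1.5)))  # "A"
```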
Boosting
• Analogy: Consult several doctors, based on a combination of
weighted diagnoses—weight assigned based on the previous
diagnosis accuracy
• How does boosting work?
– Weights are assigned to each training tuple
– A series of k classifiers is iteratively learned
– After a classifier Mi is learned, the weights are updated to allow the
subsequent classifier, Mi+1, to pay more attention to the training tuples
that were misclassified by Mi
– The final M* combines the votes of each individual classifier, where the
weight of each classifier's vote is a function of its accuracy
• Boosting algorithm can be extended for numeric prediction
• Comparing with bagging: Boosting tends to have greater accuracy,
but it also risks overfitting the model to misclassified data
Adaboost (Freund and Schapire, 1997)
• Given a set of d class-labeled tuples, (X1, y1), …, (Xd, yd)
• Initially, all the weights of tuples are set the same (1/d)
• Generate k classifiers in k rounds. At round i,
– Tuples from D are sampled (with replacement) to form a training set Di of the
same size
– Each tuple’s chance of being selected is based on its weight
– A classification model Mi is derived from Di
– Its error rate is calculated using Di as a test set
– If a tuple is misclassified, its weight is increased; otherwise it is decreased
• Error rate: err(Xj) is the misclassification error of tuple Xj. Classifier Mi's error rate is
the sum of the weights of the misclassified tuples:
  error(Mi) = Σj wj · err(Xj)
• The weight of classifier Mi's vote is
  log( (1 − error(Mi)) / error(Mi) )
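The error rate and vote weight, plus the weight update described above, can be sketched as follows. The function names are mine; the update rule shown (multiply correctly classified tuples by error/(1 − error), then renormalize) is one common AdaBoost formulation and is an assumption, since the slide gives only the direction of the change:

```python
import math

def classifier_error(weights, misclassified):
    # error(Mi) = sum of the weights of the misclassified tuples
    return sum(w for w, m in zip(weights, misclassified) if m)

def vote_weight(error):
    # weight of classifier Mi's vote = log((1 - error) / error)
    return math.log((1 - error) / error)

def update_weights(weights, misclassified, error):
    # shrink the weights of correctly classified tuples by error/(1 - error),
    # then renormalize; relatively, misclassified tuples gain weight
    shrunk = [w if m else w * error / (1 - error)
              for w, m in zip(weights, misclassified)]
    total = sum(shrunk)
    return [w / total for w in shrunk]

w = [0.25] * 4                      # d = 4 tuples, initial weight 1/d each
miss = [False, False, False, True]  # round i: tuple 4 was misclassified
err = classifier_error(w, miss)     # 0.25
print(round(vote_weight(err), 3))   # 1.099 (= log 3)
print(update_weights(w, miss, err))
```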
Random Forest (Breiman 2001)
• Random Forest:
– Each classifier in the ensemble is a decision tree classifier and is generated
using a random selection of attributes at each node to determine the split
– During classification, each tree votes and the most popular class is returned
• Two Methods to construct Random Forest:
– Forest-RI (random input selection): Randomly select, at each node, F
attributes as candidates for the split at the node. The CART methodology is
used to grow the trees to maximum size
– Forest-RC (random linear combinations): Creates new attributes (or
features) that are a linear combination of the existing attributes (reduces
the correlation between individual classifiers)
• Comparable in accuracy to Adaboost, but more robust to errors and outliers
• Insensitive to the number of attributes selected for consideration at each split,
and faster than bagging or boosting
Improving Classification Accuracy of Class-Imbalanced Data Sets
• Class-imbalance problem: Rare positive example but numerous
negative ones, e.g., medical diagnosis, fraud, oil-spill, fault, etc.
• Traditional methods assume a balanced distribution of classes and
equal error costs: not suitable for class-imbalanced data
• Typical methods for imbalanced data in 2-class classification:
– Oversampling: re-sampling of data from positive class
– Under-sampling: randomly eliminate tuples from negative class
– Threshold-moving: moves the decision threshold, t, so that the
rare class tuples are easier to classify, and hence, less chance of
costly false negative errors
– Ensemble techniques: Ensemble multiple classifiers introduced
above
• Still difficult for class imbalance problem on multiclass tasks
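Of the methods above, oversampling is the simplest to sketch: duplicate minority-class tuples (sampled with replacement) until the classes are balanced. The function below is my own illustration:

```python
import random

def oversample(data, minority_label, target_count, seed=0):
    # re-sample minority-class tuples with replacement until the
    # minority class has target_count tuples
    rng = random.Random(seed)
    minority = [r for r in data if r[-1] == minority_label]
    extra = [rng.choice(minority)
             for _ in range(target_count - len(minority))]
    return data + extra

# Toy imbalanced set: 1 positive, 4 negatives
data = [((1, 2), "pos"), ((5, 5), "neg"), ((6, 5), "neg"),
        ((5, 6), "neg"), ((7, 7), "neg")]
balanced = oversample(data, "pos", 4)
print(sum(1 for r in balanced if r[-1] == "pos"))  # 4
```

Under-sampling is the mirror image: randomly delete negative-class tuples instead of duplicating positive ones.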
• Next :
• Classification: ANN and Evaluation models
4. Clustering
Editor's Notes
#23 So far – learning methods that learn a single hypothesis, chosen from a hypothesis space, that is used to make predictions.
Ensemble learning selects a collection (ensemble) of hypotheses and combines their predictions.
Example 1 – generate 100 different decision trees from the same or different training set and
have them vote on the best classification for a new example.
Key motivation: reduce the error rate. The hope is that it becomes much more unlikely that the ensemble will misclassify an example.
#27 Random forest – an ensemble learning method for classification, regression and
other tasks that constructs a large number of decision trees at training time and outputs
the class that is the mode of the individual trees' classifications (or the mean of their predictions).
Random forests correct the tendency of decision trees to overfit their training set.
They average multiple deep decision trees trained on different parts of the same training set,
with the aim of reducing the variance.
This comes at the cost of a slight increase in bias and a loss of interpretability, but
generally improves the performance of the final model significantly. (Pham et al., 2017)