CoSc 6211 JJU 1
Classification techniques
• Decision Tree methods
• Naïve Bayes Methods
• Nearest Neighbor methods
• Rule-based methods
• Ensemble Methods
• Support Vector Machines
Classification:
Instance-Based Classifier
k-Nearest Neighbor
• KNN is a simple algorithm that stores all
available cases and classifies new cases based
on a similarity measure
• Lazy learning algorithm
• Non-parametric learning algorithm
– because it doesn’t assume anything about the
underlying data.
KNN-classification
• Lazy approach to classification
• Uses the entire training set to perform
classification
• Uses distances between training and test
records
KNN example
• The test sample (green circle)
should be classified either to
the first class of blue squares
or to the second class of red
triangles.
If k = 3 it is assigned to the
second class because there
are 2 triangles and only 1
square inside the inner circle.
If k = 5 it is assigned to the
first class (3 squares vs. 2
triangles inside the outer
circle).
Classification Example
• Unknown instance is classified based on the nearest instance class
Similarity Measure
• Euclidean distance (square root of the sum of squared differences)
• Manhattan (city-block) distance (sum of absolute differences)
d_cityblock(x, y) = Σi |xi − yi|
where x = (x1, x2, . . . , xm) and y = (y1, y2, . . . , ym) represent
the m attribute values of two records.
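As a quick illustration (not from the slides), the two distance measures can be written in a few lines of Python; the function names are my own. The example values are records 1 (Att1=10, Att2=2) and 8 (Att1=2, Att2=7) from the K-NN example that follows:

```python
import math

def euclidean(x, y):
    # square root of the sum of squared differences
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

def manhattan(x, y):
    # city-block distance: sum of absolute differences
    return sum(abs(a - b) for a, b in zip(x, y))

print(round(euclidean((10, 2), (2, 7)), 1))  # 9.4
print(manhattan((10, 2), (2, 7)))            # 13
```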
KNN
• Scaling issues
– Attributes may have to be scaled to prevent
distance measures from being dominated by one
of the attributes
– Example:
• height of a person may vary from 1.5m to 1.8m
• weight of a person may vary from 90lb to 300lb
• income of a person may vary from $10K to $1M
Scaling…
• Attribute normalization if scales are different
• For continuous variables, min–max normalization or Z-score
standardization may be used:
• Min–max normalization:
X* = (X − min(X)) / Range(X)
   = (X − min(X)) / (max(X) − min(X))
• Z-score standardization:
X* = (X − mean(X)) / SD(X)
• For categorical variables:
different(xi, yi) = 0 if xi = yi, 1 otherwise
Example: x=male and y=female then distance(x, y)=1
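A minimal sketch of the three measures above, using the Exercise 2 assumptions (min = 10, range = 50, so max = 60; mean = 45, SD = 15); the helper names are mine:

```python
def min_max(x, lo, hi):
    # X* = (X - min(X)) / (max(X) - min(X)), maps values into [0, 1]
    return (x - lo) / (hi - lo)

def z_score(x, mean, sd):
    # X* = (X - mean(X)) / SD(X)
    return (x - mean) / sd

def cat_distance(a, b):
    # categorical attributes: distance 0 if equal, 1 otherwise
    return 0 if a == b else 1

print(min_max(50, 10, 60))            # 0.8
print(round(z_score(20, 45, 15), 2))  # -1.67
print(cat_distance("male", "female")) # 1
```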
Example of K-NN Classification

Training data:
ID  Att1  Att2  Class
1   10    2     Yes
2   4     4     No
3   1     9     Yes
4   3     10    Yes
5   4     6     No
6   8     8     No
7   1     8     Yes

Test data:
ID  Att1  Att2  Class
8   2     7     ?
9   7     7     ?
10  1     11    ?

Example of K-NN Classification from the reference book:
Tan, Steinbach, Kumar
Example of K-NN Classification

1) Computing the Euclidean distance from each training record to each test record:
ED(1,8) = sqrt((10-2)^2 + (2-7)^2) = sqrt(89) = 9.4
ED(2,8) = sqrt((4-2)^2 + (4-7)^2) = sqrt(13) = 3.6
...
ED(7,8) = sqrt((1-2)^2 + (8-7)^2) = sqrt(2) = 1.4142... ≈ 1.4

ID  ED to 8  ED to 9  ED to 10  Class
1   9.4      5.8      13        Yes
2   3.6      4.2      7.6       No
3   2.2      6.3      2         Yes
4   3.2      5        2.2       Yes
5   2.2      3.2      5.8       No
6   6.1      1.4      7.6       No
7   1.4      6.1      3         Yes

Classification using K = 3:
3-NN(8) = {7, 3, 5} = {Yes, Yes, No} => Yes
3-NN(9) = ?
3-NN(10) = ?
Also classify using K = 5 and K = 7.

2) Homework: repeat the classification above using the Manhattan distance.
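The distance-and-vote procedure above can be sketched in Python using the slide's training table; the function name and dictionary layout are my own, and the remaining test records are left as the exercise asks:

```python
import math

# Training records from the slide: ID -> ((Att1, Att2), Class)
train = {1: ((10, 2), "Yes"), 2: ((4, 4), "No"), 3: ((1, 9), "Yes"),
         4: ((3, 10), "Yes"), 5: ((4, 6), "No"), 6: ((8, 8), "No"),
         7: ((1, 8), "Yes")}

def knn(test_point, k):
    # rank training records by Euclidean distance, majority-vote the top k
    dist = lambda p, q: math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))
    ranked = sorted(train.values(), key=lambda rec: dist(rec[0], test_point))
    labels = [label for _, label in ranked[:k]]
    return max(set(labels), key=labels.count)

print(knn((2, 7), 3))   # record 8 -> Yes, matching 3-NN(8) above
print(knn((7, 7), 3))   # record 9  (slide exercise)
print(knn((1, 11), 3))  # record 10 (slide exercise)
```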
**Exercise 1

Record  Age  Gender
A       50   Male
B       20   Male
C       50   Female

Table 1: Variable values for Age and Gender for the given data set

a) Which of B or C is more similar to A?
b) Give comments.
**Exercise 2

Record  Age  Agemmn  AgeZs  Gender
A       50   ?       ?      Male
B       20   ?       ?      Male
C       50   ?       ?      Female

Table 2: Variable values for Age and Gender for the given data.
Agemmn = min–max normalization; AgeZs = Z-score standardization.
Assume that for the variable Age the minimum is 10, the range is 50, the mean is 45,
and the standard deviation is 15.

a) Compute and fill in the table above: Agemmn and AgeZs using min–max
normalization and Z-score standardization, respectively.
b) Which of B or C is more similar to A, using Agemmn and using AgeZs? First
compute the Euclidean distance for both Agemmn and AgeZs.
c) Compare your answer with Table 1, question a).
d) Give comments.
Next:
• Rule-based methods
Rule-based classifier
Based on the reference book (Tan, Steinbach, Kumar)
and partly on the textbook (J. Han)
Rule-based classifier
• Classify records by using a collection of “if…then…”
rules
• Rule: (Condition) → y
– where
• Condition is a conjunction of attribute tests
• y is the class label
– LHS: rule antecedent or condition
– RHS: rule consequent
– Examples of classification rules:
• (Blood Type = Warm) ∧ (Lay Eggs = Yes) → Birds
• (Taxable Income < 50K) ∧ (Refund = Yes) → Evade = No
Rule-based classifier example
Name Blood Type Give Birth Can Fly Live in Water Class
human warm yes no no mammals
python cold no no no reptiles
salmon cold no no yes fishes
whale warm yes no yes mammals
frog cold no no sometimes amphibians
komodo cold no no no reptiles
bat warm yes yes no mammals
pigeon warm no yes no birds
cat warm yes no no mammals
leopard shark cold yes no yes fishes
turtle cold no no sometimes reptiles
penguin warm no no sometimes birds
porcupine warm yes no no mammals
eel cold no no yes fishes
salamander cold no no sometimes amphibians
gila monster cold no no no reptiles
platypus warm no no no mammals
owl warm no yes no birds
dolphin warm yes no yes mammals
eagle warm no yes no birds
R1: (Give Birth = no) ∧ (Can Fly = yes) → Birds
R2: (Give Birth = no) ∧ (Live in Water = yes) → Fishes
R3: (Give Birth = yes) ∧ (Blood Type = warm) → Mammals
R4: (Give Birth = no) ∧ (Can Fly = no) → Reptiles
R5: (Live in Water = sometimes) → Amphibians
Application of Rule-Based Classifier
• A rule r covers an instance x if the attributes of
the instance satisfy the condition of the rule
R1: (Give Birth = no) ∧ (Can Fly = yes) → Birds
R2: (Give Birth = no) ∧ (Live in Water = yes) → Fishes
R3: (Give Birth = yes) ∧ (Blood Type = warm) → Mammals
R4: (Give Birth = no) ∧ (Can Fly = no) → Reptiles
R5: (Live in Water = sometimes) → Amphibians
Name Blood Type Give Birth Can Fly Live in Water Class
hawk warm no yes no ?
grizzly bear warm yes no no ?
The rule R1 covers a hawk => Bird
The rule R3 covers the grizzly bear => Mammal
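One way to sketch "a rule covers an instance" in Python, using the slide's rules R1–R5 and the two query records (the data layout and function name are my own; rules fire in order, first match wins):

```python
# Each rule: (list of (attribute, value) conditions, predicted class)
rules = [
    ([("give_birth", "no"),  ("can_fly", "yes")],       "birds"),     # R1
    ([("give_birth", "no"),  ("live_in_water", "yes")], "fishes"),    # R2
    ([("give_birth", "yes"), ("blood_type", "warm")],   "mammals"),   # R3
    ([("give_birth", "no"),  ("can_fly", "no")],        "reptiles"),  # R4
    ([("live_in_water", "sometimes")],                  "amphibians"),# R5
]

def classify(record):
    # a rule covers the record if every condition is satisfied
    for conditions, label in rules:
        if all(record.get(attr) == val for attr, val in conditions):
            return label
    return None  # no rule covers the record

hawk = {"blood_type": "warm", "give_birth": "no",
        "can_fly": "yes", "live_in_water": "no"}
bear = {"blood_type": "warm", "give_birth": "yes",
        "can_fly": "no", "live_in_water": "no"}
print(classify(hawk))  # birds   (covered by R1)
print(classify(bear))  # mammals (covered by R3)
```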
Rule Coverage and Accuracy
• Coverage of a rule:
– Fraction of records
that satisfy the
antecedent of a rule
• Accuracy of a rule:
– Fraction of records
that satisfy the
antecedent that also
satisfy the
consequent of a rule
Tid  Refund  Marital Status  Taxable Income  Class
1    Yes     Single          125K            No
2    No      Married         100K            No
3    No      Single          70K             No
4    Yes     Married         120K            No
5    No      Divorced        95K             Yes
6    No      Married         60K             No
7    Yes     Divorced        220K            No
8    No      Single          85K             Yes
9    No      Married         75K             No
10   No      Single          90K             Yes
ncovers = # of tuples covered by R
ncorrect = # of tuples correctly classified by R
coverage(R) = ncovers /|D| /* D: training data set */
accuracy(R) = ncorrect / ncovers
R1: (Status = Single) → No
Coverage = 40%, Accuracy = 50%
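The coverage and accuracy formulas can be checked directly against the slide's ten training tuples; this sketch (list layout is my own) encodes each tuple as (Refund, Marital Status, Class):

```python
# The ten training tuples from the slide
D = [("Yes", "Single", "No"), ("No", "Married", "No"), ("No", "Single", "No"),
     ("Yes", "Married", "No"), ("No", "Divorced", "Yes"), ("No", "Married", "No"),
     ("Yes", "Divorced", "No"), ("No", "Single", "Yes"), ("No", "Married", "No"),
     ("No", "Single", "Yes")]

# Rule R1: (Status = Single) -> No
covered = [t for t in D if t[1] == "Single"]   # tuples satisfying the antecedent
correct = [t for t in covered if t[2] == "No"] # ... that also satisfy the consequent

coverage = len(covered) / len(D)        # n_covers / |D|
accuracy = len(correct) / len(covered)  # n_correct / n_covers
print(coverage, accuracy)  # 0.4 0.5
```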
Rule Extraction from a Decision Tree
 Rules are easier to understand
than large trees
 One rule is created for each path
from the root to a leaf
 Each attribute-value pair along a
path forms a conjunction: the leaf
holds the class prediction
Decision tree (figure): the root tests age?
  age <= 30  → student?        (no → no; yes → yes)
  age 31..40 → yes
  age > 40   → credit rating?  (excellent → no; fair → yes)
Example: Rule extraction from our buys_computer decision-tree
IF age = young AND student = no THEN buys_computer = no
IF age = young AND student = yes THEN buys_computer = yes
IF age = mid-age THEN buys_computer = yes
IF age = old AND credit_rating = excellent THEN buys_computer = no
IF age = old AND credit_rating = fair THEN buys_computer = yes
• Ensemble Methods: Increasing the Accuracy
Ensemble Methods:
Increasing the Accuracy
• Ensemble methods
– Use a combination of models to
increase accuracy
– Combine a series of k learned
models, M1, M2, …, Mk, with the aim
of creating an improved model M*
• Popular ensemble methods
– Bagging: averaging the prediction
over a collection of classifiers
– Boosting: weighted vote with a
collection of classifiers
– Ensemble: combining a set of
heterogeneous classifiers
Bagging: Bootstrap Aggregation
Analogy: Diagnosis based on multiple doctors’ majority vote
• Training
– Given a set D of d tuples, at each iteration i, a training set Di of d tuples is sampled
with replacement from D (i.e., bootstrap)
– A classifier model Mi is learned for each training set Di
• Classification: classify an unknown sample X
– Each classifier Mi returns its class prediction
– The bagged classifier M* counts the votes and assigns the class with the most votes
to X
• Prediction: can be applied to the prediction of continuous values by taking the average
value of each prediction for a given test tuple
• Accuracy
– Often significantly better than a single classifier derived from D
– For noisy data: not considerably worse; more robust
– Proven to improve prediction accuracy
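The train-on-bootstrap-samples, majority-vote loop can be sketched generically; everything here (function names, the toy majority-class base learner, the tiny data set) is illustrative rather than any standard API:

```python
import random
from collections import Counter

def bagging_predict(D, learn, x, k=10, rng=random.Random(0)):
    """Learn k models on bootstrap samples of D; majority-vote on x."""
    votes = []
    for _ in range(k):
        # bootstrap: sample |D| tuples from D with replacement
        Di = [rng.choice(D) for _ in D]
        Mi = learn(Di)       # learn one base classifier Mi
        votes.append(Mi(x))  # collect its class prediction
    return Counter(votes).most_common(1)[0][0]

# Toy base learner: always predicts the majority class of its sample
def majority_learner(Di):
    label = Counter(y for _, y in Di).most_common(1)[0][0]
    return lambda x, label=label: label

D = [((i,), "A") for i in range(9)] + [((9,), "B")]
print(bagging_predict(D, majority_learner, (0,)))  # A
```

For continuous targets, the same loop would average the k predictions instead of voting, as the slide notes.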
Boosting
• Analogy: Consult several doctors, based on a combination of
weighted diagnoses—weight assigned based on the previous
diagnosis accuracy
• How does boosting work?
– Weights are assigned to each training tuple
– A series of k classifiers is iteratively learned
– After a classifier Mi is learned, the weights are updated to allow the
subsequent classifier, Mi+1, to pay more attention to the training tuples
that were misclassified by Mi
– The final M* combines the votes of each individual classifier, where the
weight of each classifier's vote is a function of its accuracy
• Boosting algorithm can be extended for numeric prediction
• Comparing with bagging: Boosting tends to have greater accuracy,
but it also risks overfitting the model to misclassified data
Adaboost (Freund and Schapire, 1997)
• Given a set of d class-labeled tuples, (X1, y1), …, (Xd, yd)
• Initially, all the weights of tuples are set the same (1/d)
• Generate k classifiers in k rounds. At round i,
– Tuples from D are sampled (with replacement) to form a training set Di of the
same size
– Each tuple’s chance of being selected is based on its weight
– A classification model Mi is derived from Di
– Its error rate is calculated using Di as a test set
– If a tuple is misclassified, its weight is increased, o.w. it is decreased
• Error rate: err(Xj) is the misclassification error of tuple Xj (1 if Xj is
misclassified, 0 otherwise). Classifier Mi's error rate is the sum of the
weights of the misclassified tuples:
error(Mi) = Σj wj × err(Xj)
• The weight of classifier Mi's vote is
log((1 − error(Mi)) / error(Mi))
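The error-rate and vote-weight formulas, plus the usual Adaboost weight update (multiply correctly classified tuples' weights by error/(1 − error), then renormalize), can be sketched numerically; function names are mine and the four-tuple example is made up for illustration:

```python
import math

def model_error(weights, misclassified):
    # error(Mi) = sum of the weights of the misclassified tuples
    return sum(w for w, miss in zip(weights, misclassified) if miss)

def vote_weight(error):
    # weight of Mi's vote: log((1 - error(Mi)) / error(Mi))
    return math.log((1 - error) / error)

def update_weights(weights, misclassified, error):
    # shrink weights of correctly classified tuples by error/(1 - error),
    # then renormalize; misclassified tuples gain relative weight
    new = [w * (error / (1 - error)) if not miss else w
           for w, miss in zip(weights, misclassified)]
    s = sum(new)
    return [w / s for w in new]

w = [0.25] * 4                      # d = 4 tuples, initial weight 1/d each
miss = [False, False, False, True]  # suppose M1 misclassifies one tuple
e = model_error(w, miss)            # 0.25
print(round(vote_weight(e), 3))     # log(3) = 1.099
print(update_weights(w, miss, e))   # misclassified tuple now carries weight 0.5
```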
Random Forest (Breiman 2001)
• Random Forest:
– Each classifier in the ensemble is a decision tree classifier and is generated
using a random selection of attributes at each node to determine the split
– During classification, each tree votes and the most popular class is returned
• Two Methods to construct Random Forest:
– Forest-RI (random input selection): Randomly select, at each node, F
attributes as candidates for the split at the node. The CART methodology is
used to grow the trees to maximum size
– Forest-RC (random linear combinations): Creates new attributes (or
features) that are a linear combination of the existing attributes (reduces
the correlation between individual classifiers)
• Comparable in accuracy to Adaboost, but more robust to errors and outliers
• Insensitive to the number of attributes selected for consideration at each split,
and faster than bagging or boosting
Improving Classification Accuracy of Class-Imbalanced Data Sets
• Class-imbalance problem: Rare positive example but numerous
negative ones, e.g., medical diagnosis, fraud, oil-spill, fault, etc.
• Traditional methods assume a balanced distribution of classes and
equal error costs: not suitable for class-imbalanced data
• Typical methods for imbalanced data in 2-class classification:
– Oversampling: re-sample data from the positive class
– Under-sampling: randomly eliminate tuples from the negative class
– Threshold-moving: moves the decision threshold, t, so that the
rare class tuples are easier to classify, and hence, less chance of
costly false negative errors
– Ensemble techniques: Ensemble multiple classifiers introduced
above
• The class-imbalance problem remains difficult on multiclass tasks
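A naive sketch of the first two methods, combining oversampling of the rare positive class with under-sampling of the negative class to equal counts (the function, labels, and toy data are all illustrative assumptions):

```python
import random
from collections import Counter

def rebalance(D, rng=random.Random(0)):
    """Oversample the rare "pos" class and under-sample the frequent
    "neg" class so both end up with the same number of tuples."""
    pos = [t for t in D if t[-1] == "pos"]
    neg = [t for t in D if t[-1] == "neg"]
    target = (len(pos) + len(neg)) // 2
    # oversampling: duplicate randomly chosen positive tuples
    pos_up = pos + [rng.choice(pos) for _ in range(target - len(pos))]
    # under-sampling: keep only a random subset of negative tuples
    neg_down = rng.sample(neg, target)
    return pos_up + neg_down

D = [(i, "neg") for i in range(18)] + [(i, "pos") for i in range(2)]
balanced = rebalance(D)
print(Counter(label for _, label in balanced))  # 10 pos, 10 neg
```

Note that duplicating positives can encourage overfitting to those tuples, which is one reason threshold-moving and ensembles are listed as alternatives.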
Next:
• Classification: ANN and evaluation models
• 4. Clustering
Editor's Notes

  • #23 So far we have seen learning methods that learn a single hypothesis, chosen from a hypothesis space, and use it to make predictions. Ensemble learning instead selects a collection (ensemble) of hypotheses and combines their predictions. Example: generate 100 different decision trees from the same or different training sets and have them vote on the best classification for a new example. Key motivation: reduce the error rate, in the hope that the ensemble is much less likely to misclassify an example.
  • #27 A random forest is an ensemble learning method for classification, regression, and other tasks that constructs a large number of decision trees at training time and outputs the class that is the mode of the individual trees' classifications (or the mean of their predictions). Random forests correct decision trees' tendency to overfit their training set: averaging several deep decision trees, trained on different parts of the same training set, reduces variance at the cost of a small increase in bias and some loss of interpretability, but generally improves the performance of the final model considerably (Pham et al., 2017).