Khalid Elshafie (abolkog@dblab.cbnu.ac.kr), Database / Bioinformatics Lab, Chungbuk National University, Korea. Classification: Basic Concepts. December 12, 2009
Outline
Introduction
Introduction (1/4) Classification: definition. Given a collection of records (the training set), where each record contains a set of attributes and one of the attributes is the class, find a model for the class attribute as a function of the values of the other attributes. Goal: previously unseen records should be assigned a class as accurately as possible. A test set is used to determine the accuracy of the model. (Diagram: an input attribute set is fed to the classification model, which outputs a class label.)
Introduction (2/4) Classification is a two-step process: 1. Learning step: training data are analyzed by a classification algorithm and a model (classifier) is learned. 2. Classification step: test data are used to estimate the accuracy of the learned model. Usually the given data set is divided into training and test sets.
Introduction (3/4) Examples of classification: predicting tumor cells as benign or malignant; classifying credit card transactions as legitimate or fraudulent; classifying secondary structures of protein as alpha-helix, beta-sheet, or random coil; categorizing news stories as finance, weather, entertainment, sports, etc.
Introduction (4/4) Classification techniques: decision tree-based methods, rule-based methods, neural networks, naïve Bayes and Bayesian belief networks, support vector machines.
General Approach to Solving a Classification Problem
General Approach To Solving a Classification Problem (1/2) General approach for building a classification model.
General Approach To Solving a Classification Problem (2/2) Performance evaluation. Evaluating the performance of a classification model is based on the counts of test records correctly and incorrectly predicted by the model, tabulated in a confusion matrix. Although a confusion matrix provides the information needed to determine how well a classification model performs, summarizing this information with a single number makes it more convenient to compare the performance of different models. Confusion matrix for a 2-class problem.
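For the 2-class confusion matrix, write TP and TN for the counts of correctly predicted positive and negative records, and FP and FN for the incorrect ones. The usual single-number summaries built from it are:

Accuracy = (TP + TN) / (TP + TN + FP + FN)
Error rate = 1 - Accuracy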
Decision Tree Induction
Decision Tree Induction (1/15) What is a decision tree? A decision tree is a flowchart-like tree structure: each internal node (non-leaf node) denotes a test on an attribute, each branch represents an outcome of the test, and each leaf node (terminal node) holds a class label. (Example tree: the root node tests Refund with branches Yes and No; an internal node tests MarSt with branches Married and Single, Divorced; another internal node tests TaxInc with branches < 80K and > 80K; the leaf nodes are labeled YES and NO.)
Decision Tree Induction (2/15) How to build a decision tree? Let Dt be the set of training records that reach a node t. General procedure: if Dt contains records that all belong to the same class yt, then t is a leaf node labeled as yt; if Dt is an empty set, then t is a leaf node labeled with the default class yd; if Dt contains records that belong to more than one class, use an attribute test to split the data into smaller subsets, and recursively apply the procedure to each subset. A sketch of this procedure appears below.
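As a concrete reference for this procedure, here is a minimal Python sketch (essentially Hunt's algorithm). The record and attribute representations, and the naive placeholder for choose_split, are assumptions for illustration; a real implementation would plug in an attribute selection measure such as the information gain introduced later.

from collections import Counter

def build_tree(records, attributes, default_class, choose_split=None):
    """Sketch of the general tree-growing procedure.

    records: list of (attribute_dict, class_label) pairs.
    attributes: attribute names still available for splitting.
    choose_split: attribute selection measure; this placeholder just
    picks the first available attribute.
    """
    if choose_split is None:
        choose_split = lambda recs, attrs: attrs[0]  # naive placeholder
    if not records:                                  # empty D_t: default class
        return ("leaf", default_class)
    labels = [label for _, label in records]
    if len(set(labels)) == 1:                        # pure node: class y_t
        return ("leaf", labels[0])
    majority = Counter(labels).most_common(1)[0][0]
    if not attributes:                               # no tests left: majority class
        return ("leaf", majority)
    best = choose_split(records, attributes)
    rest = [a for a in attributes if a != best]
    children = {}
    for value in {rec[best] for rec, _ in records}:  # one branch per outcome
        subset = [(r, l) for r, l in records if r[best] == value]
        children[value] = build_tree(subset, rest, majority, choose_split)
    return ("node", best, children)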
Decision Tree Induction (3/15) How to build a decision tree? Tree induction uses a greedy strategy: split the records based on an attribute test that optimizes a certain criterion. Tree induction issues: how to split the records (how to specify the attribute test condition, and how to determine the best split), and when to stop splitting.
Decision Tree Induction (4/15) How to specify the test condition? It depends on the attribute type (nominal, ordinal, or continuous) and on the number of ways to split (2-way split or multi-way split).
Decision Tree Induction (5/15) Splitting based on nominal attributes. Multi-way split: use as many partitions as distinct values (e.g., CarType splits into Family, Sports, and Luxury). Binary split: divide the values into two subsets (e.g., {Sports, Luxury} vs. {Family}, or {Family, Luxury} vs. {Sports}).
Decision Tree Induction (6/15) Splitting based on ordinal attributes. Multi-way split: use as many partitions as distinct values (e.g., Size splits into Small, Medium, and Large). Binary split: divide the values into two subsets, as long as the grouping does not violate the order property of the attribute (e.g., {Small, Medium} vs. {Large}, or {Small} vs. {Medium, Large}).
Decision Tree Induction (7/15) Splitting based on continuous attributes. Multi-way split: must consider all possible tests for continuous values; one approach is discretization. Binary split: the test condition can be expressed as a comparison test, (A < v) or (A >= v), as in the sketch below.
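For the binary-split case, the candidate cut points v are typically the midpoints between consecutive distinct values of the attribute in the training data; a small sketch with hypothetical income values:

def candidate_thresholds(values):
    """Candidate cut points v for a binary test A < v on a continuous
    attribute: midpoints between consecutive distinct sorted values."""
    xs = sorted(set(values))
    return [(a + b) / 2.0 for a, b in zip(xs, xs[1:])]

print(candidate_thresholds([60, 70, 75, 85, 90, 95]))  # hypothetical incomes (in K)
# -> [65.0, 72.5, 80.0, 87.5, 92.5]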
Decision Tree Induction (8/15) How to determine the best split? Use an attribute selection measure: a heuristic for selecting the splitting criterion that best separates a given data set. Two common measures are information gain and gain ratio.
Decision Tree Induction (9/15) Information gain. Used by the ID3 algorithm as its attribute selection measure: select the attribute with the highest information gain. Expected information (entropy) needed to classify a tuple in D: Info(D) = -sum_{i=1}^{m} pi log2(pi), where pi is the probability that a tuple in D belongs to class Ci. Information needed (after using A to split D into v partitions) to classify D: InfoA(D) = sum_{j=1}^{v} (|Dj| / |D|) x Info(Dj). Information gained by branching on attribute A: Gain(A) = Info(D) - InfoA(D).
Decision Tree Induction (10/15) Information gain, worked example: 14 records, with class "Yes" = 9 records and class "No" = 5 records. Info(D) = -(9/14) log2(9/14) - (5/14) log2(5/14) = 0.940 bits. Similarly, Info_age(D) = 0.694 bits over the three age partitions, giving Gain(age) = 0.940 - 0.694 = 0.246; a code check follows below.
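A short Python check of these numbers, assuming the per-class counts above and the standard partition of this example's age attribute (youth: 2 Yes / 3 No; middle-aged: 4 Yes; senior: 3 Yes / 2 No):

from math import log2

def info(counts):
    """Entropy Info(D) of a node with the given per-class counts."""
    total = sum(counts)
    return -sum(c / total * log2(c / total) for c in counts if c)

info_d = info([9, 5])                        # 0.940 bits
partitions = [[2, 3], [4, 0], [3, 2]]        # youth, middle-aged, senior
n = sum(sum(p) for p in partitions)
info_age = sum(sum(p) / n * info(p) for p in partitions)
print(round(info_d, 3), round(info_age, 3), round(info_d - info_age, 3))
# -> 0.94 0.694 0.247 (the slide's 0.246 is 0.940 - 0.694 with rounded intermediates)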
Decision Tree Induction (11/15) Information gain. (Example tree after splitting on age: the root tests age? with branches youth, middle-aged, and senior; the middle-aged branch is a pure partition and becomes a leaf labeled Yes.)
Decision Tree Induction (12/15) Gain ratio. The information gain measure is biased towards attributes with a large number of values. C4.5 (a successor of ID3) uses the gain ratio to overcome this problem, normalizing the information gain by the split information: SplitInfoA(D) = -sum_{j=1}^{v} (|Dj| / |D|) log2(|Dj| / |D|), and GainRatio(A) = Gain(A) / SplitInfoA(D).
For attribute income (which partitions the 14 records into 4 low, 6 medium, and 4 high):
SplitInfo_income(D) = -(4/14) log2(4/14) - (6/14) log2(6/14) - (4/14) log2(4/14) = 1.557.
With Gain(Income) = 0.029, GainRatio(Income) = 0.029 / 1.557 = 0.019.
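The corresponding check in Python, assuming income splits the 14 records into partitions of sizes 4, 6, and 4:

from math import log2

def split_info(sizes):
    """SplitInfo_A(D) for a split of D into partitions of the given sizes."""
    total = sum(sizes)
    return -sum(s / total * log2(s / total) for s in sizes if s)

si = split_info([4, 6, 4])                   # 1.557 bits
print(round(si, 3), round(0.029 / si, 3))    # -> 1.557 0.019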
Decision Tree Induction (14/15) Comparing attribute selection measures. Information gain is biased towards multi-valued attributes. Gain ratio tends to prefer unbalanced splits in which one partition is much smaller than the others.
Decision Tree Induction (15/15) Decision tree induction advantages: inexpensive to construct; easy to interpret for small-sized trees; extremely fast at classifying unknown records. Disadvantages: the resulting tree can be suboptimal (e.g., it may overfit the training data).
Model Overfitting
Model Overfitting (1/5) Types of errors committed by a classification model: training error, the number of misclassification errors committed on the training records, and generalization error, the expected error of the model on previously unseen records. A good model must have low training error as well as low generalization error: a model that fits the training data too well can have a poorer generalization error than a model with a higher training error.
Model Overfitting (2/5) Reasons for overfitting: the presence of noise in the dataset.
Model Overfitting (3/5) Reasons for overfitting: lack of representative samples in the training data.
Model Overfitting (4/5) Handling overfitting: pre-pruning (early stopping rule). Stop the algorithm before it grows a fully-grown tree. Typical stopping conditions for a node: stop if all instances belong to the same class, or if all the attribute values are the same. More restrictive conditions: stop if the number of instances is less than some user-specified threshold; stop if the class distribution of the instances is independent of the available features (e.g., using a chi-square test); stop if expanding the current node does not improve impurity measures (e.g., Gini or information gain). A sketch of such thresholds appears below.
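As one concrete illustration, scikit-learn's DecisionTreeClassifier exposes several of these stopping conditions as hyperparameters; the values below are illustrative assumptions, not recommendations:

from sklearn.tree import DecisionTreeClassifier

# Each argument realizes one of the pre-pruning conditions above:
pre_pruned = DecisionTreeClassifier(
    max_depth=5,                # stop growing below a fixed depth
    min_samples_split=20,       # stop if a node has too few instances
    min_impurity_decrease=0.01  # stop if a split barely improves impurity
)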
Model Overfitting (5/5) Handling overfitting: post-pruning. Grow the decision tree to its entirety, then trim the nodes of the tree in a bottom-up fashion: if the generalization error improves after trimming, replace a sub-tree by a leaf node whose class label is determined from the majority class of the instances in the sub-tree. In practice, post-pruning is preferable, since early stopping can stop too early.
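A minimal sketch of post-pruning using scikit-learn's cost-complexity pruning, one common post-pruning scheme; selecting ccp_alpha on a validation set (X_val, y_val, assumed to exist alongside X_train, y_train) stands in for the generalization-error check described above:

from sklearn.tree import DecisionTreeClassifier

# Grow the tree fully, enumerate candidate pruning levels,
# then keep the level that scores best on held-out data.
path = DecisionTreeClassifier(random_state=0).cost_complexity_pruning_path(X_train, y_train)
best = max(
    (DecisionTreeClassifier(random_state=0, ccp_alpha=a).fit(X_train, y_train)
     for a in path.ccp_alphas),
    key=lambda tree: tree.score(X_val, y_val),
)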
Performance Evaluation
Performance Evaluation (1/3) Holdout method. Partition the available examples into two independent data sets, e.g., a training set (2/3) and a test set (1/3); the training set is used to develop one tree and the test set to check its accuracy. Used for data sets with a large number of samples. (Diagram: the available examples are divided randomly, e.g., 70% / 30%, into a training set and a test set.)
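A sketch of the holdout method with scikit-learn, assuming a feature matrix X and labels y:

from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Hold out 1/3 of the data for testing, train on the remaining 2/3.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=1/3, random_state=0)
clf = DecisionTreeClassifier().fit(X_train, y_train)
print("holdout accuracy:", clf.score(X_test, y_test))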
Performance Evaluation (2/3) Cross-validation. Divide the data set into k subsamples; use k-1 subsamples as training data and the remaining subsample as test data, repeating so that each subsample serves as the test set once (k-fold cross-validation). Used for data sets of moderate size. 10-fold cross-validation, in which each fold holds out 10% of the examples and trains on the other 90% (developing 10 different trees), is the standard and most popular technique for estimating a classifier's accuracy.
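A sketch of 10-fold cross-validation with scikit-learn, under the same assumed X and y:

from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

# Train 10 trees, each tested on the 10% of examples it never saw.
scores = cross_val_score(DecisionTreeClassifier(), X, y, cv=10)
print("estimated accuracy:", scores.mean())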
Performance Evaluation (3/3) Bootstrapping. Based on sampling with replacement: the initial dataset is sampled N times (N being the total number of samples in the dataset) with replacement, to form another set of N samples for training. Since some samples in this new set will be repeated, some samples from the initial dataset will not appear in the training set; these left-out samples form the test set. Used for small datasets.
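A sketch of one bootstrap round in NumPy, again assuming arrays X and y; the never-drawn ("out-of-bag") samples form the test set:

import numpy as np

rng = np.random.default_rng(0)
N = len(X)
train_idx = rng.integers(0, N, size=N)            # N draws with replacement
test_idx = np.setdiff1d(np.arange(N), train_idx)  # samples never drawn
X_train, y_train = X[train_idx], y[train_idx]
X_test, y_test = X[test_idx], y[test_idx]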
Summary
Summary: applying the model to test data. Start from the root of the tree and, at each node, follow the branch that matches the test record's attribute value: Refund (Yes / No), then MarSt (Married / Single, Divorced), then TaxInc (< 80K / > 80K), until a leaf node (NO or YES) is reached. For the example test record, the traversal ends at a NO leaf, so Cheat is assigned to "No". A sketch of this traversal follows below.
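A minimal sketch of this traversal, encoding the example tree with the ("node"/"leaf") tuples produced by the build_tree sketch earlier; the test record is hypothetical:

# The example tree: Refund -> MarSt -> TaxInc, with NO/YES leaves.
tree = ("node", "Refund", {
    "Yes": ("leaf", "NO"),
    "No": ("node", "MarSt", {
        "Married": ("leaf", "NO"),
        "Single, Divorced": ("node", "TaxInc", {
            "< 80K": ("leaf", "NO"),
            "> 80K": ("leaf", "YES"),
        }),
    }),
})

def classify(tree, record):
    """Walk from the root, following the branch that matches each attribute."""
    while tree[0] == "node":
        _, attribute, children = tree
        tree = children[record[attribute]]
    return tree[1]

print(classify(tree, {"Refund": "No", "MarSt": "Married"}))  # -> NO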
Summary. Classification is one of the most important techniques in data mining, with many applications in the real world. The decision tree is a powerful classification technique. Strengths: decision trees are easy to understand and fast at classifying records. Weaknesses: they can suffer from overfitting, and a large tree can cause memory-handling issues. Overfitting is handled by pruning, and accuracy is estimated with the evaluation methods above (holdout, cross-validation, bootstrapping).
Thank you! Any comments & questions?