Chapter 4 Classification

Chapter 4 (Classification) from the book Introduction to Data Mining.

Transcript

  • 1. Khalid Elshafie.
    abolkog@dblab.cbnu.ac.kr
    Database / Bioinformatics Lab
    Chungbuk National University, Korea
    Classification: Basic Concepts
    December 12, 2009
  • 2. Outline
  • 3. Introduction
  • 4. Introduction (1/4)
    Classification: Definition
    Given a collection of records (the training set).
    Each record contains a set of attributes; one of the attributes is the class.
    Find a model for the class attribute as a function of the values of the other attributes.
    Goal: previously unseen records should be assigned a class as accurately as possible.
    A test set is used to determine the accuracy of the model.
    (Figure: a classification model maps an input attribute set to an output class label.)
  • 5. Introduction(2/4)
    Classification is a two-step process:
    1. Learning step: the training data are analyzed by a classification algorithm and a model (classifier) is learned.
    2. Classification step: the test data are used to estimate the accuracy of the learned classifier.
    Usually the given data set is divided into training and test sets; a minimal sketch of this workflow follows below.
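    As an illustration of the two-step process (not from the original slides), here is a minimal scikit-learn sketch; the Iris dataset merely stands in for any labeled record set.

    from sklearn.datasets import load_iris
    from sklearn.model_selection import train_test_split
    from sklearn.tree import DecisionTreeClassifier
    from sklearn.metrics import accuracy_score

    X, y = load_iris(return_X_y=True)            # attribute set X, class labels y
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=1/3, random_state=0)     # hold out one third as the test set

    clf = DecisionTreeClassifier(random_state=0)
    clf.fit(X_train, y_train)                    # step 1: learn the classifier from the training data
    y_pred = clf.predict(X_test)                 # step 2: classify the test records
    print("estimated accuracy:", accuracy_score(y_test, y_pred))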
  • 6. Introduction (3/4)
    Examples of Classification:
    Predicting tumor cells as benign or malignant
    Classifying credit card transactions as legitimate or fraudulent
    Classifying secondary structures of protein as alpha-helix, beta-sheet, or random coil
    Categorizing news stories as finance, weather, entertainment, sports, etc.
  • 7. Introduction (4/4)
    Classification Techniques:
    Decision Tree-Based Methods.
    Rule-Based Methods.
    Neural Networks.
    Naïve Bayes and Bayesian Belief Networks.
    Support Vector Machines.
  • 8. General Approach to Solving a Classification Problem
  • 9. General Approach To Solving a Classification Problem (1/2)
    General Approach for building a classification model.
  • 10. General Approach To Solving a Classification Problem (2/2)
    Performance evaluation.
    Evaluating the performance of a classification model is based on the counts of test records correctly and incorrectly predicted by the model.
    Although a confusion matrix provides the information needed to determine how well a classification model performs, summarizing this information with a single number (such as accuracy) makes it more convenient to compare the performance of different models.
    Confusion matrix for a 2-class problem
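    A small sketch of how a 2-class confusion matrix is summarized into a single number; the counts are invented for illustration, not taken from the slides.

    # Hypothetical 2-class confusion matrix: rows = actual class, columns = predicted class.
    conf = [[50, 10],   # actual positive: 50 predicted correctly, 10 incorrectly
            [5,  35]]   # actual negative: 5 predicted incorrectly, 35 correctly

    correct = conf[0][0] + conf[1][1]
    total = sum(sum(row) for row in conf)
    accuracy = correct / total            # single-number summary of the matrix
    print("accuracy:", accuracy, "error rate:", 1 - accuracy)   # 0.85 and 0.15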
  • 11. Decision Tree Induction
  • 12. Decision Tree Induction (1/15)
    What is a decision tree?
    A decision tree is a flowchart-like tree structure.
    Each internal node (non-leaf node) denotes a test on an attribute, each branch represents an outcome of the test, and each leaf node (terminal node) holds a class label.
    (Example tree: root node Refund; Refund = Yes → leaf NO, Refund = No → internal node MarSt; MarSt = Married → leaf NO, MarSt = Single or Divorced → internal node TaxInc; TaxInc < 80K → leaf NO, TaxInc > 80K → leaf YES.)
  • 13. Decision Tree Induction (2/15)
    How to build a decision tree?
    Let Dt be the set of training records that reach a node t
    General Procedure:
    If Dt contains records that belong to the same class yt, then t is a leaf node labeled as yt.
    If Dt is an empty set, then t is a leaf node labeled by the default class, yd.
    If Dt contains records that belong to more than one class, use an attribute test to split the data into smaller subsets. Recursively apply the procedure to each subset (a minimal sketch of this procedure follows below).
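    A compact Python sketch of the general recursive procedure above (my own simplification, not the book's pseudocode); records are (attribute-dict, class-label) pairs, and the attribute-test selection is deliberately naive, taking the first attribute whose values actually split the records.

    from collections import Counter

    def grow_tree(records, attributes, default_class):
        if not records:                                    # D_t is empty
            return default_class                           # leaf labeled with the default class y_d
        classes = [label for _, label in records]
        if len(set(classes)) == 1:                         # all records belong to one class y_t
            return classes[0]                              # leaf labeled y_t
        majority = Counter(classes).most_common(1)[0][0]
        for attr in attributes:                            # choose an attribute test condition
            groups = {}
            for rec in records:
                groups.setdefault(rec[0][attr], []).append(rec)
            if len(groups) > 1:                            # the test splits D_t into smaller subsets
                remaining = [a for a in attributes if a != attr]
                return {attr: {value: grow_tree(subset, remaining, majority)
                               for value, subset in groups.items()}}
        return majority                                    # no attribute splits the data: majority leaf

    # Example: grow_tree([({"Refund": "Yes"}, "No"), ({"Refund": "No"}, "Yes")], ["Refund"], "No")
    # returns {"Refund": {"Yes": "No", "No": "Yes"}}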
  • 14. Decision Tree Induction (3/15)
    How to build a decision tree?
    Tree induction:
    Greedy strategy.
    Split the records based on an attribute test that optimizes a certain criterion.
    Tree induction issues:
    Determine how to split the records:
    How to specify the attribute test condition?
    How to determine the best split?
    Determine when to stop splitting.
  • 15. Decision Tree Induction (4/15)
    How to specify test condition?
    Depends on attribute types
    Nominal.
    Ordinal.
    Continuous.
    Depends on number of ways to split.
    2-way split.
    Multi-way split.
  • 16. Decision Tree Induction (5/15)
    Splitting based on nominal attributes.
    Multi-way split
    Use as many partitions as there are distinct values.
    Binary split.
    Divides the values into two subsets.
    (Example: CarType split multi-way into Family, Sports, Luxury, or binary into {Sports, Luxury} vs. {Family} or {Family, Luxury} vs. {Sports}.)
  • 17. Decision Tree Induction (6/15)
    Splitting based on ordinal attributes.
    Multi-way split
    Use as many partitions as there are distinct values.
    Binary split.
    Divides the values into two subsets, as long as the split does not violate the order property of the attribute.
    (Example: Size split multi-way into Small, Medium, Large, or binary into {Small, Medium} vs. {Large} or {Medium, Large} vs. {Small}.)
  • 18. Decision Tree Induction (7/15)
    Splitting based on continuous attributes.
    Multi-way split
    Must consider all possible split points for the continuous values.
    One approach is discretization.
    Binary split.
    The test condition can be expressed as a comparison test,
    (A < v) or (A ≥ v); candidate thresholds are sketched below.
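    A small sketch of the binary-split idea for a continuous attribute (the values are made up, not the slide's table): candidate thresholds v are usually taken midway between consecutive sorted values, and each (A < v) versus (A ≥ v) test is then scored with an impurity measure.

    values = sorted([60, 70, 75, 85, 90, 95, 100, 120, 125, 220])   # e.g. annual income in K (illustrative)
    candidates = [(a + b) / 2 for a, b in zip(values, values[1:])]  # one threshold per adjacent pair
    print(candidates)   # each v defines the test (A < v) vs. (A >= v)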
  • 19. Decision Tree Induction (8/15)
    How to determine the best split?
    Attribute Selection Measure.
    A heuristic for selecting the splitting criterion that best separates a given data set.
    Information gain.
    Gain Ratio.
  • 20. Decision Tree Induction (9/15)
    Information Gain.
    Used by ID3 algorithm as its attribute selection measure.
    Select the attribute with the highest information gain.
    Expected information (entropy) needed to classify a tuple in D: Info(D) = - Σ_i p_i log2(p_i), where p_i is the proportion of tuples in D belonging to class i.
    Information needed (after using A to split D into v partitions) to classify D: Info_A(D) = Σ_j (|D_j| / |D|) × Info(D_j).
    Information gained by branching on attribute A: Gain(A) = Info(D) - Info_A(D).
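    The three quantities above, written out as a short Python sketch (log base 2 gives information in bits); class_counts is the class distribution of D, and partitions is the list of class distributions of the subsets D_j produced by splitting on an attribute.

    from math import log2

    def info(class_counts):
        """Expected information (entropy) needed to classify a tuple in D."""
        total = sum(class_counts)
        return -sum(c / total * log2(c / total) for c in class_counts if c)

    def info_after_split(partitions):
        """Information needed to classify D after splitting it into the given partitions."""
        total = sum(sum(p) for p in partitions)
        return sum(sum(p) / total * info(p) for p in partitions)

    def gain(class_counts, partitions):
        """Information gained by branching on the attribute that produced the partitions."""
        return info(class_counts) - info_after_split(partitions)

    print(round(info([9, 5]), 3))   # 0.940 bits for the 14-record example on the next slide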
  • 21. Decision Tree Induction (10/15)
    Information Gain.
    14 records: class “Yes” = 9 records, class “No” = 5 records, so Info(D) = -(9/14) log2(9/14) - (5/14) log2(5/14) ≈ 0.940 bits.
    Similarly, the information needed after splitting on each candidate attribute is computed, and the attribute with the highest gain is chosen.
  • 22. Decision Tree Induction (11/15)
    Information Gain.
    (Figure: the tree after splitting on age, with branches youth, middle-aged, and senior; the middle-aged branch is a pure “Yes” leaf.)
  • 23. Decision Tree Induction (12/15)
    Gain ratio.
    The information gain measure is biased towards attributes with a large number of values.
    C4.5 (a successor of ID3) uses the gain ratio to overcome this problem (a normalization of information gain): GainRatio(A) = Gain(A) / SplitInfo_A(D).
    Example: for attribute income, Gain(income) = 0.029 and SplitInfo(income) = 0.926, therefore GainRatio(income) = 0.029 / 0.926 = 0.031.
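    A matching sketch of the gain-ratio normalization; partition_sizes are the sizes |D_j| of the subsets produced by the split, and the last line reproduces the arithmetic quoted on the slide.

    from math import log2

    def split_info(partition_sizes):
        """Potential information generated by splitting D into the given partitions."""
        total = sum(partition_sizes)
        return -sum(n / total * log2(n / total) for n in partition_sizes if n)

    def gain_ratio(gain_value, partition_sizes):
        return gain_value / split_info(partition_sizes)

    print(round(0.029 / 0.926, 3))   # GainRatio(income) = 0.031, as on the slide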
  • 27. Decision Tree Induction (14/15)
    Comparing attribute selection measures
    Information gain:
    biased towards multi-valued attributes.
    Gain ratio:
    tends to prefer unbalanced splits in which one partition is much smaller than the others.
  • 28. Decision Tree Induction (15/15)
    Decision Tree Induction
    Advantages:
    Inexpensive to construct.
    Easy to interpret for small-sized trees.
    Extremely fast at classifying unknown records
    Disadvantages:
    The decision tree could be suboptimal (i.e., it may overfit the training data).
  • 29. Model Overfitting
  • 30. Model Overfitting (1/5)
    Model Overfitting:
    Types of errors committed by a classification model:
    Training errors.
    The number of misclassification errors committed on the training records.
    Generalization error.
    The expected error of the model on previously unseen records.
    A good model must have low training error as well as low generalization error.
    A model that fits the training data too well can have a poorer generalization error than a model with a higher training error.
  • 31. Model Overfitting (2/5)
    Reasons of overfitting
    The presence of noise in the dataset.
  • 33. Model Overfitting(3/5)
    Reasons of overfitting
    Lack of Representative Samples.
  • 34. Model Overfitting(4/5)
    Handling overfitting
    Pre-Pruning (Early Stopping Rule)
    Stop the algorithm before it becomes a fully-grown tree
    Typical stopping conditions for a node:
    Stop if all instances belong to the same class
    Stop if all the attribute values are the same
    More restrictive conditions:
    Stop if number of instances is less than some user-specified threshold
    Stop if the class distribution of the instances is independent of the available features (e.g., using the χ² test)
    Stop if expanding the current node does not improve impurity measures (e.g., Gini or information gain).
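    In practice, stopping rules like those above correspond to pre-pruning parameters of library tree learners; a hedged scikit-learn illustration (the parameter values and dataset are arbitrary stand-ins):

    from sklearn.datasets import load_iris
    from sklearn.tree import DecisionTreeClassifier

    X, y = load_iris(return_X_y=True)
    pre_pruned = DecisionTreeClassifier(
        min_samples_split=10,         # stop if a node holds fewer than 10 instances
        min_impurity_decrease=0.01,   # stop if a split barely improves the impurity measure
        random_state=0,
    ).fit(X, y)
    print("leaves in the pre-pruned tree:", pre_pruned.get_n_leaves())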
  • 35. Model Overfitting(5/5)
    Handling overfitting
    Post-pruning
    Grow the decision tree to its entirety.
    Trim the nodes of the decision tree in a bottom-up fashion.
    If the generalization error improves after trimming, replace the sub-tree with a leaf node.
    The class label of the leaf node is determined from the majority class of instances in the sub-tree.
    In practice, post-pruning is preferable since early pruning can “stop too early”.
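    For comparison, a scikit-learn sketch of post-pruning; note that scikit-learn implements cost-complexity pruning rather than the subtree-replacement rule described above, but the workflow is the same: grow the full tree, then prune and keep the version that generalizes best on held-out data.

    from sklearn.datasets import load_iris
    from sklearn.model_selection import train_test_split
    from sklearn.tree import DecisionTreeClassifier

    X, y = load_iris(return_X_y=True)
    X_tr, X_val, y_tr, y_val = train_test_split(X, y, test_size=0.3, random_state=0)

    full = DecisionTreeClassifier(random_state=0).fit(X_tr, y_tr)      # grow the tree to its entirety
    alphas = full.cost_complexity_pruning_path(X_tr, y_tr).ccp_alphas  # candidate pruning strengths
    pruned = max(
        (DecisionTreeClassifier(random_state=0, ccp_alpha=a).fit(X_tr, y_tr) for a in alphas),
        key=lambda tree: tree.score(X_val, y_val),                     # held-out estimate of generalization
    )
    print("leaves before / after pruning:", full.get_n_leaves(), pruned.get_n_leaves())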
  • 36. Performance Evaluation
  • 37. Performance Evaluation(1/3)
    Holdout Method
    Partition: Training-and-testing
    use two independent data sets, e.g., training set (2/3), test set (1/3)
    used for data sets with a large number of samples
    (Figure: the available examples are divided randomly into a training set, used to develop one tree, and a test set, e.g. 30%, used to check accuracy.)
  • 38. Performance Evaluation(2/3)
    Cross-Validation
    divide the data set into k subsamples
    use k-1 subsamples as training data and one subsample as test data (k-fold cross-validation)
    used for data sets of moderate size
    10-fold cross-validation is the standard and most popular technique for estimating a classifier's accuracy; a sketch follows below.
    (Figure: in each fold of 10-fold cross-validation, 90% of the available examples form the training set and 10% the test set; the training sets are used to develop 10 different trees and the test sets to check accuracy.)
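    A minimal 10-fold cross-validation sketch with scikit-learn (the Iris data again stands in for the data set being evaluated); each record is used nine times for training and once for testing, and the ten fold accuracies are averaged.

    from sklearn.datasets import load_iris
    from sklearn.model_selection import cross_val_score
    from sklearn.tree import DecisionTreeClassifier

    X, y = load_iris(return_X_y=True)
    scores = cross_val_score(DecisionTreeClassifier(random_state=0), X, y, cv=10)  # 10 folds
    print("estimated accuracy: %.3f (+/- %.3f)" % (scores.mean(), scores.std()))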
  • 39. Performance Evaluation(3/3)
    Bootstrapping
    Based on sampling with replacement.
    The initial dataset is sampled N times (N = the total number of samples in the dataset), with replacement, to form another set of N samples for training.
    Since some samples in this new set are repeated, some samples from the initial dataset will not appear in the training set; these samples form the test set.
    Used for small datasets.
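    A sketch of one bootstrap round (the dataset size N is an arbitrary illustration): N indices are drawn with replacement to form the training set, and the records never drawn form the test set.

    import numpy as np

    rng = np.random.default_rng(0)
    N = 20                                                       # illustrative number of samples
    train_idx = rng.choice(np.arange(N), size=N, replace=True)   # sample N times with replacement
    test_idx = np.setdiff1d(np.arange(N), train_idx)             # samples that were never drawn
    print(len(np.unique(train_idx)), "distinct training records,", len(test_idx), "test records")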
  • 40. Summary
  • 41. Summary
    Apply the model to the test data, starting from the root of the tree.
    (Decision tree: Refund = Yes → leaf NO; Refund = No → MarSt. MarSt = Married → leaf NO; MarSt = Single or Divorced → TaxInc. TaxInc < 80K → leaf NO; TaxInc > 80K → leaf YES.)
  • 46. Summary
    Apply the model to the test data: following the branches from the root leads to a leaf node, so assign Cheat = “No”.
  • 48. Summary
    Classification is one of the most important techniques in data mining.
    It has many applications in the real world.
    Decision trees:
    A powerful classification technique.
    Decision trees are easy to understand.
    Strengths:
    Easy to understand, fast at classifying records.
    Weaknesses:
    Suffer from overfitting.
    Large trees can cause memory-handling issues.
    Handling overfitting:
    Pruning.
    Evaluation methods
  • 49. Thank you !
    Any Comments & Questions?