Data mining classification-2009-v0 Presentation Transcript

  • 1. Data Mining Classification Prithwis Mukerjee, Ph.D.
  • 2. Classification
    • Definition
      • The separation or ordering of objects (or things) into classes
    • A Priori Classification
      • When the classification is done before you have looked at the data
    • Post Priori (a posteriori) Classification
      • When the classification is done after you have looked at the data
  • 3. General approach
    • You decide on the classes without looking at the data
      • For example : High risk, medium risk, low risk classes
    • You “train” the system
      • Take a small set of objects – the training set
        • Each object has a set of attributes
      • Classify the objects in this small (“training”) set into the three classes, without looking at the attributes
        • You will need human expertise here, to classify objects
      • Now find a set of rules, based on the attributes, such that the system reproduces the classification you made without looking at the attributes
    • Use these rules to classify the full set of objects (a minimal sketch of this workflow follows this slide)
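A minimal sketch of the train-then-classify workflow described on this slide, using scikit-learn's DecisionTreeClassifier. The attributes, values and risk labels below are invented for illustration only; they do not come from the slides.

```python
from sklearn.tree import DecisionTreeClassifier

# Training set: each row holds one object's attributes (here: age, income, debt ratio)
X_train = [[25, 30000, 0.40],
           [45, 80000, 0.10],
           [35, 50000, 0.25]]
# Classes assigned by a human expert, without reference to any attribute-based rule
y_train = ["high risk", "low risk", "medium risk"]

# "Train" the system: find rules over the attributes that reproduce the expert's classes
clf = DecisionTreeClassifier(criterion="entropy")
clf.fit(X_train, y_train)

# Use the learned rules to classify the rest of the objects
print(clf.predict([[30, 40000, 0.35]]))
```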
  • 4. If we have this data ... (a table of animals, each with attributes such as POUCH, FEATHERS and FLIES, and each labelled Bird, Mammal or Marsupial)
  • 5. We need to build a decision tree like this : first test POUCH ? (YES → Marsupial) ; if NO, test FEATHERS ? (YES → Bird, NO → Mammal)
  • 6. Question is ...
    • Why did we ignore the other attributes ?
      • FLIES ?, for example
    • Why did we use the attribute called POUCH first ?
      • And then we used the attribute called FEATHERS
    • A rigorous classification process should tell us
      • If there are many attributes to look at, which are the important ones ?
      • In which order should we look at the attributes ?
    • So that the classification arrived at closely matches the classification done on the training set
  • 7. Decision Tree : Tree Induction Algorithm
    • Step 1 : Place all members into one node
      • If all members belong to the same class
        • Stop : there is nothing to be done
    • Step 2 : Else
      • Choose one attribute and based on its value split the node into two nodes
      • For each of the two nodes
        • If all members belong to the same class
          • Stop
        • Else : Recursively go to Step 1
    • Big question : How do you choose which attribute to split a node on ?
      • Information Theory
      • GINI Index
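The induction loop on this slide can be sketched in a few lines of Python. This is only an illustration, not the author's code: the names info and build_tree are mine, the split attribute is chosen by the information measure introduced on the following slides, and the sketch makes one branch per attribute value rather than strictly two nodes.

```python
from collections import Counter
from math import log2

def info(rows, class_key="class"):
    """Information content of the class distribution in `rows`: sum of -p * log2(p)."""
    counts = Counter(row[class_key] for row in rows)
    n = len(rows)
    return -sum((c / n) * log2(c / n) for c in counts.values())

def build_tree(rows, attributes, class_key="class"):
    classes = [row[class_key] for row in rows]
    # Step 1: if all members belong to the same class (or no attributes remain), stop
    if len(set(classes)) == 1 or not attributes:
        return Counter(classes).most_common(1)[0][0]
    # Step 2: choose the attribute whose split leaves the least information behind
    def remaining(attr):
        groups = {}
        for row in rows:
            groups.setdefault(row[attr], []).append(row)
        return sum(len(g) / len(rows) * info(g, class_key) for g in groups.values())
    best = min(attributes, key=remaining)   # least remaining information = largest gain
    groups = {}
    for row in rows:
        groups.setdefault(row[best], []).append(row)
    rest = [a for a in attributes if a != best]
    # Recurse into each of the resulting nodes (back to Step 1 for every branch)
    return {"split on": best,
            "branches": {v: build_tree(g, rest, class_key) for v, g in groups.items()}}
```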
  • 8. Information Theory : Recapitulate
    • Information Content I
      • Of an event E
      • That has n possible outcomes
      • Where outcome i happens with probability pᵢ
      • Is defined as I = Σᵢ ( −pᵢ log₂ pᵢ )
    • Example :
      • Event E A has two possible outcomes
        • P₁ = 1, P₂ = 0 : Outcome 1 is a certainty
        • I = 0 because there is NO information in the outcome
      • Event E B has two possible outcomes
        • P₁ = 0.5, P₂ = 0.5 : Both outcomes are equally likely
        • I = −0.5 log₂(0.5) − 0.5 log₂(0.5) = 1
        • This is the maximum possible information for an event with two outcomes
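The definition above is easy to check numerically. A small sketch (the helper name information is mine):

```python
from math import log2

def information(probs):
    """I = sum of -p * log2(p) over all outcomes; outcomes with p = 0 contribute nothing."""
    return -sum(p * log2(p) for p in probs if p > 0)

print(information([1.0, 0.0]))   # event E_A, outcome 1 certain   -> 0.0 bits
print(information([0.5, 0.5]))   # event E_B, equally likely      -> 1.0 bit (the maximum)
```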
  • 9. Information in the roll of a dice
    • Fair dice
      • All numbers 1 – 6 are equally probable ( pᵢ = 1/6 )
      • I = 6 × (−1/6) log₂(1/6) = 2.585
    • Loaded Dice Case 1
      • P₆ = 0.5 ; P₁ = P₂ = P₃ = P₄ = P₅ = 0.1
      • I = 5 × (−0.1) log₂(0.1) − 0.5 × log₂(0.5) = 2.16
    • Loaded Dice Case 2
      • P₆ = 0.75 ; P₁ = P₂ = P₃ = P₄ = P₅ = 0.05
      • I = 5 × (−0.05) log₂(0.05) − 0.75 × log₂(0.75) = 1.39
    • Point to note ...
      • We can change the information in the roll of the dice by changing the probabilities of the various outcomes !
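The three dice values quoted above can be reproduced the same way (again only a sketch):

```python
from math import log2

def information(probs):
    return -sum(p * log2(p) for p in probs if p > 0)

print(information([1/6] * 6))             # fair dice           -> 2.585
print(information([0.1] * 5 + [0.5]))     # loaded dice, case 1 -> 2.161
print(information([0.05] * 5 + [0.75]))   # loaded dice, case 2 -> 1.391
```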
  • 10. How do we change the information ?
    • In a dice
      • We make mechanical modifications so that the probability of each outcome changes
        • This is highly illegal
    • In a set of individuals
      • We regroup the individuals into the classes so that the probability of each class changes
        • This is highly permitted in our algorithm
  • 11. Consider the following scenario ... ( a set S of 10 members, each with attributes HOME, GENDER and MARRIED, and each belonging to class A, B or C )
    • Probability of each outcome ( or class )
      • P(A) = 3/10 , P(B) = 3/10 , P(C) = 4/10
    • Total Information Content of Set S
      • −(3/10) log₂(3/10) − (3/10) log₂(3/10) − (4/10) log₂(4/10) = 1.57
  • 12. Suppose we split this set on HOME
    • I₁ : Information in set S₁ (one of the two 5-member subsets)
      • −(2/5) log₂(2/5) − (1/5) log₂(1/5) − (2/5) log₂(2/5) = 1.52
    • I₂ : Information in set S₂ (the other 5 members)
      • −(1/5) log₂(1/5) − (2/5) log₂(2/5) − (2/5) log₂(2/5) = 1.52
    • Total Information in S₁ and S₂ (weighted by subset size, 5/10 each)
      • 0.5 I₁ + 0.5 I₂ = 0.5 × 1.52 + 0.5 × 1.52 = 1.52
    Class probabilities : P₁(A) = 2/5, P₁(B) = 1/5, P₁(C) = 2/5 ; P₂(A) = 1/5, P₂(B) = 2/5, P₂(C) = 2/5
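A quick numerical check of these figures (a sketch; the helper name information is mine). The difference between the two totals, about 0.05, is the contribution attributed to HOME on the next slide.

```python
from math import log2

def information(probs):
    return -sum(p * log2(p) for p in probs if p > 0)

whole = information([3/10, 3/10, 4/10])   # set S before any split       -> 1.571
s1    = information([2/5, 1/5, 2/5])      # subset S1 (5 members)        -> 1.522
s2    = information([1/5, 2/5, 2/5])      # subset S2 (5 members)        -> 1.522
after = (5/10) * s1 + (5/10) * s2         # weighted by subset sizes     -> 1.522
print(round(whole - after, 3))            # information carried by HOME  -> ~0.05
```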
  • 13. Impact of the HOME attribute
    • Within each of the subsets S₁ and S₂ the value of HOME is the same
    • In the full set S the value of HOME varies, so knowing it is of some significance
    • How significant is the HOME attribute ?
    • Splitting on HOME reduces the information still needed to classify a member
      • FROM : 1.57 (in the full set S)
      • TO : 1.52 (the weighted total over S₁ and S₂)
    • So the HOME attribute contributes 0.05 to the overall information content
      • In other words, the HOME attribute reduces uncertainty by only 0.05
  • 14. Let us go back to the original set S ..
    • Probability of each outcome ( or class )
      • P(A) = 3/10 , P(B) = 3/10 , P(C) = 4/10
    • Total Information Content of Set S
      • −(3/10) log₂(3/10) − (3/10) log₂(3/10) − (4/10) log₂(4/10) = 1.57
  • 15. This time we split on GENDER
    • I₁ : Information in set S₁ (the 7 males)
      • −(3/7) log₂(3/7) − (4/7) log₂(4/7) = 0.985
    • I₂ : Information in set S₂ (the 3 females)
      • = 0, since every member of S₂ belongs to class B
    • Total Information in S₁ and S₂
      • (7/10) I₁ + (3/10) I₂ = 7/10 × 0.985 + 3/10 × 0 = 0.69
    Class probabilities : P₁(A) = 3/7, P₁(B) = 0/7, P₁(C) = 4/7 ; P₂(A) = 0/3, P₂(B) = 3/3, P₂(C) = 0/3
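The same check for the GENDER split (a sketch). The difference, about 0.88, is the gain attributed to GENDER on the next slide.

```python
from math import log2

def information(probs):
    return -sum(p * log2(p) for p in probs if p > 0)

whole = information([3/10, 3/10, 4/10])   # set S                        -> 1.571
s1    = information([3/7, 4/7])           # males (7 members)            -> 0.985
s2    = information([1.0])                # females, all of class B      -> 0.0
after = (7/10) * s1 + (3/10) * s2         # weighted total               -> 0.690
print(round(whole - after, 2))            # information carried by GENDER -> 0.88
```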
  • 16. Impact of the GENDER attribute
    • Within each of the subsets S₁ and S₂ the value of GENDER is the same
    • In the full set S the value of GENDER varies, so knowing it is of some significance
    • How significant is the GENDER attribute ?
    • Splitting on GENDER reduces the information still needed to classify a member
      • FROM : 1.57 (in the full set S)
      • TO : 0.69 (the weighted total over S₁ and S₂)
    • So the GENDER attribute contributes 0.88 to the overall information content
      • In other words, the GENDER attribute reduces uncertainty by 0.88, far more than HOME does
  • 17. If we were to do this for all attributes ...
    • We would observe that GENDER is the best candidate for the split
  • 18. And the first part of our tree would be : GENDER ? (Female → Class B ; Male → what next ?)
  • 19. Remove the GENDER attribute (and the Class B members it has already classified) and continue
    • Probability of each outcome ( or class )
      • P(A) = 3/7 , P(C) = 4/7
    • Total Information Content of the remaining 7-member set
      • −(3/7) log₂(3/7) − (4/7) log₂(4/7) = 0.985 (the same value computed for S₁ on slide 15; checked numerically in the sketch after slide 21)
  • 20. We split this set on HOME ...
    • I₁ : Information in set S₁ (4 members)
      • −(2/4) log₂(2/4) − (2/4) log₂(2/4) = 1.00
    • I₂ : Information in set S₂ (3 members)
      • −(1/3) log₂(1/3) − (2/3) log₂(2/3) = 0.92
    • Total Information in S₁ and S₂
      • (4/7) I₁ + (3/7) I₂ = 4/7 × 1.00 + 3/7 × 0.92 = 0.96
    Class probabilities : P₁(A) = 2/4, P₁(C) = 2/4 ; P₂(A) = 1/3, P₂(C) = 2/3 ; Gain = 0.985 − 0.96 ≈ 0.02
  • 21. But if we were to split on MARRIED
    • I₁ : Information in set S₁
      • = 0.0, since all 4 members of S₁ belong to class A
    • I₂ : Information in set S₂
      • = 0.0, since all 3 members of S₂ belong to class C
    • Total Information in S₁ and S₂
      • = 0.0
    Class probabilities : P₁(A) = 4/4, P₁(C) = 0/4 ; P₂(A) = 0/3, P₂(C) = 3/3 ; Gain = 0.985 − 0 = 0.985
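The second-level numbers from slides 19 to 21 can be verified the same way (a sketch); the remaining information of the 7-member set is 0.985, the same expression already evaluated on slide 15.

```python
from math import log2

def information(probs):
    return -sum(p * log2(p) for p in probs if p > 0)

parent  = information([3/7, 4/7])                                            # -> 0.985
home    = (4/7) * information([2/4, 2/4]) + (3/7) * information([1/3, 2/3])  # -> 0.965
married = (4/7) * information([1.0]) + (3/7) * information([1.0])            # -> 0.0
print(round(parent - home, 2))      # gain from HOME    -> 0.02
print(round(parent - married, 2))   # gain from MARRIED -> 0.99 (0.985, the maximum possible)
```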
  • 22. Two things have happened
    • With MARRIED
      • We have hit the upper limit of information gain
      • No other attribute can do any better than this
    • In the TWO subsets
      • All members belong to the same class
        • Either A or C
    • Hence we STOP here and observe ...
  • 23. That our DECISION TREE looks like : GENDER ? (Female → Class B ; Male → MARRIED ?), with the MARRIED test separating the remaining males into Class A and Class C
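The finished tree can be written out as two nested tests. One caveat: the flattened slide diagram does not say unambiguously which MARRIED branch leads to Class A and which to Class C, so the mapping used below (married → A) is an assumption, flagged again in the comment.

```python
def classify(gender, married):
    """Decision tree from the slides: test GENDER first, then MARRIED for the males.
    Assumption: MARRIED = YES -> Class A, NO -> Class C (the slide image is ambiguous)."""
    if gender == "female":
        return "B"                      # all 3 females in the training set were class B
    return "A" if married else "C"

print(classify("male", True))     # -> "A" under the assumption above
print(classify("female", False))  # -> "B"
```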