Data mining classification-2009-v0: Presentation Transcript

  • Data Mining Classification Prithwis Mukerjee, Ph.D.
  • Classification
    • Definition
      • The separation or ordering of objects (or things) into classes
    • A Priori Classification
      • When the classification is done before you have looked at the data
    • A Posteriori Classification
      • When the classification is done after you have looked at the data
  • General approach
    • You decide on the classes without looking at the data
      • For example : High risk, medium risk, low risk classes
    • You “train” the system
      • Take a small set of objects – the training set
        • Each object has a set of attributes
      • Classify the objects in this small (“training”) set into the three classes, without looking at the attributes
        • You will need human expertise here, to classify objects
      • Now find a set of rules, based on the attributes, such that the system arrives at the same classification that you made without looking at the attributes
    • Use these rules to classify the full set of objects
  • If we have this data ...
  • We need to build a decision tree like this: Pouch? with YES → Marsupial and NO → Feathers?, where Feathers? has YES → Bird and NO → Mammal
  • Question is ...
    • Why did we ignore two attributes?
      • Flies?
      • Feathers?
    • Why did we use the attribute called POUCH first?
      • And then the attribute called FEATHERS?
    • A rigorous classification process should tell us
      • If there are many attributes to be looked at, which are the important ones?
      • In which order should we look at the attributes?
    • So that the classification arrived at is very similar to the classification done with the training set
  • Decision Tree: Tree Induction Algorithm
    • Step 1: Place all members into one node
      • If all members belong to the same class
        • Stop: there is nothing to be done
    • Step 2: Else
      • Choose one attribute and, based on its value, split the node into two nodes
      • For each of the two nodes
        • If all members belong to the same class
          • Stop
        • Else: apply the procedure recursively, going back to Step 1 for that node
    • Big question: How do you choose which attribute to split a node on? (a code sketch of the procedure follows this list)
      • Information Theory
      • GINI Index
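The two-step procedure above maps naturally onto a short recursive function. The sketch below is an illustrative Python rendering, not the author's code: the names (build_tree, info), the row layout (one dict of attribute values per member), and the use of information gain to pick the split attribute (developed on the following slides) are all assumptions made for the example.

    from math import log2
    from collections import Counter

    def info(labels):
        """Information content of a list of class labels: sum over classes of -p * log2(p)."""
        n = len(labels)
        return -sum((c / n) * log2(c / n) for c in Counter(labels).values())

    def build_tree(rows, labels, attributes):
        """rows: list of dicts {attribute name: value}; labels: the class of each row."""
        # Step 1: if all members belong to the same class (or no attributes remain), stop.
        if len(set(labels)) <= 1 or not attributes:
            return Counter(labels).most_common(1)[0][0]   # leaf: the (majority) class

        # Step 2: choose the attribute whose split yields the largest information gain.
        def gain(attr):
            split_info = 0.0
            for value in {r[attr] for r in rows}:
                subset = [lab for r, lab in zip(rows, labels) if r[attr] == value]
                split_info += len(subset) / len(labels) * info(subset)
            return info(labels) - split_info

        best = max(attributes, key=gain)
        node = {"split_on": best, "branches": {}}
        for value in {r[best] for r in rows}:
            sub_rows = [r for r in rows if r[best] == value]
            sub_labels = [lab for r, lab in zip(rows, labels) if r[best] == value]
            # Recurse: go back to Step 1 for each child node.
            node["branches"][value] = build_tree(sub_rows, sub_labels,
                                                 [a for a in attributes if a != best])
        return node

Swapping the info function for a Gini-impurity function would give the GINI-index variant mentioned above; the recursion itself is unchanged.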
  • Information Theory: Recapitulate
    • Information Content I
      • Of an event E
      • That has n possible outcomes
      • Where outcome i happens with probability p_i
      • Is defined as I = Σ_i ( -p_i log2 p_i )
    • Example:
      • Event E_A has two possible outcomes
        • P1 = 1, P2 = 0: Outcome 1 is a certainty
        • I = 0 because there is NO information in the outcome
      • Event E_B has two possible outcomes
        • P1 = 0.5, P2 = 0.5: Both outcomes are equally likely
        • I = -0.5 log2(0.5) - 0.5 log2(0.5) = 1
        • This is the maximum information possible for an event with two outcomes (the short check below reproduces both cases)
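As a quick numerical check of these two examples, the definition can be written out directly; this is an illustrative Python snippet, with the helper name info chosen arbitrarily:

    from math import log2

    def info(probs):
        # I = sum over outcomes of -p_i * log2(p_i); outcomes with p = 0 contribute nothing
        return -sum(p * log2(p) for p in probs if p > 0)

    print(info([1.0, 0.0]))   # event E_A: one outcome is certain  -> 0.0
    print(info([0.5, 0.5]))   # event E_B: equally likely outcomes -> 1.0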
  • Information in the roll of a dice
    • Fair dice
      • All numbers 1 – 6 equally probable (p_i = 1/6)
      • I = 6 x (-1/6) log2(1/6) = 2.585
    • Loaded Dice Case 1
      • P6 = 0.5 ; P1 = P2 = P3 = P4 = P5 = 0.1
      • I = 5 x (-0.1) log2(0.1) - 0.5 x log2(0.5) = 2.16
    • Loaded Dice Case 2
      • P6 = 0.75 ; P1 = P2 = P3 = P4 = P5 = 0.05
      • I = 5 x (-0.05) log2(0.05) - 0.75 x log2(0.75) = 1.39
    • Point to note ...
      • We can change the information in the roll of the dice by changing the probabilities of the various outcomes! (the sketch below reproduces these three values)
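The same illustrative helper reproduces the three dice values; the probabilities are taken straight from the slide:

    from math import log2

    def info(probs):
        return -sum(p * log2(p) for p in probs if p > 0)

    print(round(info([1/6] * 6), 3))             # fair dice            -> 2.585
    print(round(info([0.5] + [0.1] * 5), 2))     # loaded, P6 = 0.5     -> 2.16
    print(round(info([0.75] + [0.05] * 5), 2))   # loaded, P6 = 0.75    -> 1.39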
  • How do we change the information?
    • In a dice
      • We make mechanical modifications so that the probability of each outcome changes
        • This is highly illegal
    • In a set of individuals
      • We regroup the individuals into the classes so that the probability of each class changes
        • This is highly permitted in our algorithm
  • Consider the following scenario ...
    • Probability of each outcome (or class)
      • P(A) = 3/10, P(B) = 3/10, P(C) = 4/10
    • Total Information Content of Set S
      • -(3/10) log2(3/10) - (3/10) log2(3/10) - (4/10) log2(4/10) = 1.57
  • Suppose we split this set on HOME
    • I1: Information in set S1
      • -(2/5) log2(2/5) - (1/5) log2(1/5) - (2/5) log2(2/5) = 1.52
    • I2: Information in set S2
      • -(1/5) log2(1/5) - (2/5) log2(2/5) - (2/5) log2(2/5) = 1.52
    • Total Information in S1 and S2
      • 0.5 I1 + 0.5 I2 = 0.5 x 1.52 + 0.5 x 1.52 = 1.52
    Set S1: P(A) = 2/5, P(B) = 1/5, P(C) = 2/5. Set S2: P(A) = 1/5, P(B) = 2/5, P(C) = 2/5
  • Impact of HOME attribute
    • Within each of the sets S1 and S2, the attribute HOME is the same for all members
    • But in set S the attribute HOME is not the same, so it is of some significance
    • What is the significance of the HOME attribute?
    • By adding the HOME attribute we have increased the information content
      • FROM: 1.52
      • TO: 1.57
    • So the HOME attribute adds 0.05 to the overall information content
      • Or: the HOME attribute reduces uncertainty by 0.05, i.e. its information gain is 0.05 (the short sketch below reproduces this)
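A short illustrative check of the HOME split, built only from the class proportions quoted on these slides (the info helper is the same arbitrary name as before; none of this is code from the presentation):

    from math import log2

    def info(probs):
        return -sum(p * log2(p) for p in probs if p > 0)

    I_S     = info([3/10, 3/10, 4/10])                                    # full set S    -> 1.57
    I_split = 0.5 * info([2/5, 1/5, 2/5]) + 0.5 * info([1/5, 2/5, 2/5])   # split on HOME -> 1.52
    print(round(I_S, 2), round(I_split, 2), round(I_S - I_split, 2))      # gain of HOME  ~  0.05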
  • Let us go back to the original set S ..
    • Probability of each outcome (or class)
      • P(A) = 3/10, P(B) = 3/10, P(C) = 4/10
    • Total Information Content of Set S
      • -(3/10) log2(3/10) - (3/10) log2(3/10) - (4/10) log2(4/10) = 1.57
  • This time we split on GENDER
    • I1: Information in set S1
      • -(3/7) log2(3/7) - (4/7) log2(4/7) = 0.985
    • I2: Information in set S2
      • = 0
    • Total Information in S1 and S2
      • (7/10) I1 + (3/10) I2 = (7/10) x 0.985 + (3/10) x 0 = 0.69
    Set S1: P(A) = 3/7, P(B) = 0/7, P(C) = 4/7. Set S2: P(A) = 0/3, P(B) = 3/3, P(C) = 0/3
  • Impact of GENDER attribute
    • Within each of the sets S1 and S2, the attribute GENDER is the same for all members
    • But in set S the attribute GENDER is not the same, so it is of some significance
    • What is the significance of the GENDER attribute?
    • By adding the GENDER attribute we have increased the information content
      • FROM: 0.69
      • TO: 1.57
    • So the GENDER attribute adds 0.88 to the overall information content
      • Or: the GENDER attribute reduces uncertainty by 0.88, i.e. its information gain is 0.88
  • If we were to do this for all attributes ...
    • We would observe that GENDER is the best candidate for the split (the comparison sketch below checks this for HOME and GENDER)
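To see why GENDER wins, here is an illustrative comparison of the two gains computed so far, again built only from the class mixes shown on the slides; MARRIED and any remaining attributes would be evaluated in exactly the same way:

    from math import log2

    def info(probs):
        return -sum(p * log2(p) for p in probs if p > 0)

    I_S = info([3/10, 3/10, 4/10])                                                     # 1.57

    gain_home   = I_S - (0.5 * info([2/5, 1/5, 2/5]) + 0.5 * info([1/5, 2/5, 2/5]))    # ~0.05
    gain_gender = I_S - ((7/10) * info([3/7, 4/7]) + (3/10) * info([1.0]))             # ~0.88
    print(round(gain_home, 2), round(gain_gender, 2))   # GENDER has the larger gain, so split on it first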
  • And the first part of our tree would be ... Gender: Female → Class B; Male → What next?
  • Remove GENDER and Class B and continue
    • Probability of each outcome (or class)
      • P(A) = 3/7, P(C) = 4/7
    • Total Information Content of Set S
      • -(3/7) log2(3/7) - (4/7) log2(4/7) = 0.985
  • We split this set on HOME ...
    • I1: Information in set S1
      • -(2/4) log2(2/4) - (2/4) log2(2/4) = 1.00
    • I2: Information in set S2
      • -(1/3) log2(1/3) - (2/3) log2(2/3) = 0.918
    • Total Information in S1 and S2
      • (4/7) I1 + (3/7) I2 = (4/7) x 1.00 + (3/7) x 0.918 = 0.965
    Set S1: P(A) = 2/4, P(C) = 2/4. Set S2: P(A) = 1/3, P(C) = 2/3. Gain = 0.985 - 0.965 = 0.02
  • But if we were to split on MARRIED
    • I1: Information in set S1
      • = 0.0
    • I2: Information in set S2
      • = 0.0
    • Total Information in S1 and S2
      • = 0.0
    Set S1: P(A) = 4/4, P(C) = 0/4. Set S2: P(A) = 0/3, P(C) = 3/3. Gain = 0.985 - 0 = 0.985 (the sketch below checks both candidate splits)
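An illustrative check of both second-level candidates on the remaining seven members, using the proportions from the last two slides:

    from math import log2

    def info(probs):
        return -sum(p * log2(p) for p in probs if p > 0)

    I_S = info([3/7, 4/7])                                                       # remaining set -> 0.985

    gain_home    = I_S - ((4/7) * info([2/4, 2/4]) + (3/7) * info([1/3, 2/3]))   # ~0.02
    gain_married = I_S - ((4/7) * info([1.0]) + (3/7) * info([1.0]))             # = 0.985, the maximum possible
    print(round(gain_home, 2), round(gain_married, 3))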
  • Two things have happened
    • With MARRIED
      • We have hit the upper limit of information gain
      • No other attribute can do any better than this
    • In the TWO subsets
      • All members belong to the same class
        • Either A or C
    • Hence we STOP here and observe ...
  • That our DECISION TREE looks like: Gender: Female → Class B; Male → Married?, whose two branches separate the remaining members into Class A and Class C