Big picture of data mining
1. Decision Trees in the Big Picture
• Classification (vs. Rule Pattern Discovery)
• Supervised Learning (vs. Unsupervised)
• Inductive
• Generation (vs. Discrimination)
2. Example
age          income  veteran  college_educated  support_hillary
youth        low     no       no                no
youth        low     yes      no                no
middle_aged  low     no       no                yes
senior       low     no       no                yes
senior       medium  no       yes               no
senior       medium  yes      no                yes
middle_aged  medium  no       yes               no
youth        low     no       yes               no
youth        low     no       yes               no
senior       high    no       yes               yes
youth        low     no       no                no
middle_aged  high    no       yes               no
middle_aged  medium  yes      yes               yes
senior       high    no       yes               no
3. Example
The same training set. The rightmost column, support_hillary, holds the
class labels.
5. Example
Classifying an unseen tuple with the finished tree:

age          income  veteran  college_educated  support_hillary
middle_aged  medium  no       no                yes (predicted)

[Decision-tree diagram: the root tests age (youth / middle_aged / senior);
lower nodes test income and college_educated; each leaf holds a class label.]

Inner nodes are ATTRIBUTES
Branches are attribute VALUES
Leaves are class-label VALUES
7. Example
Induced Rules:
  The youth do not support Hillary.
  All who are middle-aged and low-income support Hillary.
  Seniors support Hillary.

Nested IF-THEN:
  IF age == youth
    THEN support_hillary = no
  ELSE IF age == middle_aged & income == low
    THEN support_hillary = yes
  ELSE IF age == senior
    THEN support_hillary = yes
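The nested IF-THEN above translates directly into code. A minimal sketch (the function name is illustrative; note the induced rules do not cover every attribute combination, so unmatched cases fall through to None):

```python
def support_hillary(age, income):
    """Apply the induced rules; returns 'yes', 'no', or None when
    no rule matches (e.g. middle_aged with medium income)."""
    if age == "youth":
        return "no"
    elif age == "middle_aged" and income == "low":
        return "yes"
    elif age == "senior":
        return "yes"
    return None  # uncovered case

print(support_hillary("youth", "low"))        # no
print(support_hillary("middle_aged", "low"))  # yes
```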
8. How do you construct one?
1. Select an attribute to place at the root node and make one branch for
   each possible value.

Splitting the entire training set (14 tuples) on age:
  youth: 5 tuples   middle_aged: 4 tuples   senior: 5 tuples
9. How do you construct one?
2. For each branch, recursively process the remaining training examples by
   choosing an attribute to split them on. The chosen attribute cannot be
   one already used in an ancestor node. If at any time all the training
   examples have the same class, stop processing that part of the tree.
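The two steps above can be sketched as a recursive procedure. This is an illustrative implementation, not the exact algorithm behind the slides: it uses information gain (introduced later in the deck) to pick each split and a majority vote when attributes run out. Names like build and classify are my own.

```python
import math
from collections import Counter

ATTRS = ["age", "income", "veteran", "college_educated"]
DATA = [  # (age, income, veteran, college_educated, support_hillary)
    ("youth", "low", "no", "no", "no"),
    ("youth", "low", "yes", "no", "no"),
    ("middle_aged", "low", "no", "no", "yes"),
    ("senior", "low", "no", "no", "yes"),
    ("senior", "medium", "no", "yes", "no"),
    ("senior", "medium", "yes", "no", "yes"),
    ("middle_aged", "medium", "no", "yes", "no"),
    ("youth", "low", "no", "yes", "no"),
    ("youth", "low", "no", "yes", "no"),
    ("senior", "high", "no", "yes", "yes"),
    ("youth", "low", "no", "no", "no"),
    ("middle_aged", "high", "no", "yes", "no"),
    ("middle_aged", "medium", "yes", "yes", "yes"),
    ("senior", "high", "no", "yes", "no"),
]

def entropy(rows):
    counts = Counter(r[-1] for r in rows)
    total = len(rows)
    return -sum(c / total * math.log2(c / total) for c in counts.values())

def build(rows, attrs):
    labels = [r[-1] for r in rows]
    if len(set(labels)) == 1:        # all same class: make a leaf
        return labels[0]
    if not attrs:                    # no attributes left: majority vote
        return Counter(labels).most_common(1)[0][0]
    def gain(a):                     # information gain of splitting on a
        i = ATTRS.index(a)
        parts = {}
        for r in rows:
            parts.setdefault(r[i], []).append(r)
        return entropy(rows) - sum(
            len(p) / len(rows) * entropy(p) for p in parts.values())
    best = max(attrs, key=gain)
    i = ATTRS.index(best)
    node = {"attr": best, "branches": {}}
    for value in sorted(set(r[i] for r in rows)):
        subset = [r for r in rows if r[i] == value]
        node["branches"][value] = build(subset, [a for a in attrs if a != best])
    return node

def classify(node, tuple_):
    while isinstance(node, dict):
        node = node["branches"][tuple_[ATTRS.index(node["attr"])]]
    return node

tree = build(DATA, ATTRS)
print(tree["attr"])  # age has the highest gain at the root
```

Two training tuples (senior, high, no, yes) carry conflicting class labels, so no tree can separate them; the majority-vote fallback handles that branch.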
10. How do you construct one?
age=youth subset:

age    income  veteran  college_educated  support_hillary
youth  low     no       no                no
youth  low     yes      no                no
youth  low     no       yes               no
youth  low     no       yes               no
youth  low     no       no                no

Every youth tuple has class no, so the youth branch becomes a leaf: no.
11. How do you construct one?
age=middle_aged subset:

age          income  veteran  college_educated  supports_hillary
middle_aged  low     no       no                yes
middle_aged  medium  no       yes               no
middle_aged  high    no       yes               no
middle_aged  medium  yes      yes               yes

The classes are mixed, so the middle_aged branch is split on veteran.
16.
[Tree so far: age → youth: no; middle_aged → veteran (yes: yes,
no: college_educated (yes: no, no: yes)); senior: still open.]

age=senior subset:

age     income  veteran  college_educated  supports_hillary
senior  low     no       no                yes
senior  medium  no       yes               no
senior  medium  yes      no                yes
senior  high    no       yes               yes
senior  high    no       yes               no
17.
The senior tuples have mixed classes, so the senior branch is split on
college_educated.
18.
age=senior, college_educated=yes subset, split on income
(branches: low, medium, high):

age     income  veteran  college_educated  supports_hillary
senior  medium  no       yes               no
senior  high    no       yes               yes
senior  high    no       yes               no
19.
There are no low-income college-educated seniors, so the low branch receives
no training tuples.
20.
"Majority Vote": the empty low branch is labeled with the majority class of
the parent subset, no.
21.
age=senior, college_educated=yes, income=medium subset:

age     income  veteran  college_educated  supports_hillary
senior  medium  no       yes               no
22.
The single medium-income tuple has class no, so the medium branch becomes a
leaf: no.
23.
age=senior, college_educated=yes, income=high subset:

age     income  veteran  college_educated  supports_hillary
senior  high    no       yes               yes
senior  high    no       yes               no
24.
The high-income tuples have mixed classes, so the branch is split on the
only remaining attribute, veteran (branches: yes, no).
25.
"Majority Vote" split: neither high-income tuple is a veteran, so the
veteran=yes branch is empty, and the two veteran=no tuples split evenly
between yes and no. Both leaves are left undetermined (??? ???).
26.
age=senior, college_educated=no subset:

age     income  veteran  college_educated  supports_hillary
senior  low     no       no                yes
senior  medium  yes      no                yes
27.
Both tuples have class yes, so the college_educated=no branch becomes a
leaf: yes.
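The finished tree from the walkthrough can be written down directly as a nested structure. A sketch in Python (the tuple encoding and the classify helper are my own; the two veteran leaves the slides leave as ??? are represented as None):

```python
# Each inner node is (attribute, {value: subtree}); leaves are class labels.
TREE = ("age", {
    "youth": "no",
    "middle_aged": ("veteran", {
        "yes": "yes",
        "no": ("college_educated", {"yes": "no", "no": "yes"}),
    }),
    "senior": ("college_educated", {
        "yes": ("income", {
            "low": "no",            # empty branch, labeled by majority vote
            "medium": "no",
            "high": ("veteran", {"yes": None, "no": None}),  # ??? in slides
        }),
        "no": "yes",
    }),
})

ATTRS = ["age", "income", "veteran", "college_educated"]

def classify(node, tuple_):
    # Walk the tree; tuple_ = (age, income, veteran, college_educated)
    while isinstance(node, tuple):
        attr, branches = node
        node = branches[tuple_[ATTRS.index(attr)]]
    return node

print(classify(TREE, ("senior", "low", "no", "no")))  # yes
```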
29. Cost to grow?
n = number of attributes
D = training set of tuples
O( n * |D| * log|D| )

30. Cost to grow?
O( n * |D| * log|D| )
  n * |D|  : amount of work at each tree level
  log|D|   : max height of the tree
31. How do we minimize the cost?
• Constructing an optimal decision tree is NP-complete
  (shown by Hyafil and Rivest)
• Most common approach is "greedy"
• Need a heuristic to pick the "best" attribute to split on
• The "best" attribute results in the "purest" split
  (pure = all tuples belong to the same class)
37. Information Gain
• Ross Quinlan's ID3 (Iterative Dichotomiser 3) uses information gain as
  its heuristic.
• The heuristic is based on Claude Shannon's information theory.
39. Calculate Entropy for D
Entropy(D) = -Σ pi log2(pi), summed over i = 1, …, m

D    = training set                        |D| = 14
m    = number of classes                   m = 2
Ci   = distinct class                      C1 = yes, C2 = no
Ci,D = set of tuples in D of class Ci      |C1,D| = 5, |C2,D| = 9
pi   = probability a random tuple in D     p1 = 5/14, p2 = 9/14
       belongs to class Ci = |Ci,D|/|D|
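Plugging the numbers from the training set into the formula, a minimal sketch (p1 = 5/14 for yes, p2 = 9/14 for no):

```python
import math

def entropy(probabilities):
    # Entropy(D) = -sum(p_i * log2(p_i)) over the classes
    return -sum(p * math.log2(p) for p in probabilities if p > 0)

# 5 of 14 tuples support Hillary (yes), 9 do not (no)
H = entropy([5 / 14, 9 / 14])
print(round(H, 3))  # 0.94
```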
41. Entropy for D split by A
EntropyA(D) = Σ (|Dj|/|D|) * Entropy(Dj), summed over j = 1, …, v

A  = attribute to split D on          e.g. age
v  = number of distinct values of A   e.g. youth, middle_aged, senior (v = 3)
Dj = subset of D where A = j          e.g. all tuples where age = youth
43. Information Gain
Gain(A) = Entropy(D) - EntropyA(D)

Entropy(D):  entropy of the full set of tuples D
EntropyA(D): entropy after D is split on attribute A
Choose the A with the highest Gain, i.e. the split that most decreases
entropy.
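Putting the three formulas together for the running example, a sketch that computes Gain for each attribute of the training table (names are illustrative):

```python
import math
from collections import Counter

ATTRS = ["age", "income", "veteran", "college_educated"]
DATA = [  # (age, income, veteran, college_educated, support_hillary)
    ("youth", "low", "no", "no", "no"),
    ("youth", "low", "yes", "no", "no"),
    ("middle_aged", "low", "no", "no", "yes"),
    ("senior", "low", "no", "no", "yes"),
    ("senior", "medium", "no", "yes", "no"),
    ("senior", "medium", "yes", "no", "yes"),
    ("middle_aged", "medium", "no", "yes", "no"),
    ("youth", "low", "no", "yes", "no"),
    ("youth", "low", "no", "yes", "no"),
    ("senior", "high", "no", "yes", "yes"),
    ("youth", "low", "no", "no", "no"),
    ("middle_aged", "high", "no", "yes", "no"),
    ("middle_aged", "medium", "yes", "yes", "yes"),
    ("senior", "high", "no", "yes", "no"),
]

def entropy(rows):
    total = len(rows)
    return -sum(c / total * math.log2(c / total)
                for c in Counter(r[-1] for r in rows).values())

def gain(rows, attr):
    """Gain(A) = Entropy(D) - EntropyA(D)."""
    i = ATTRS.index(attr)
    parts = {}
    for r in rows:
        parts.setdefault(r[i], []).append(r)
    return entropy(rows) - sum(
        len(p) / len(rows) * entropy(p) for p in parts.values())

for a in ATTRS:
    print(a, round(gain(DATA, a), 3))
```

For this table, age comes out highest (about 0.31), which matches choosing age at the root.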
46.
ss           age          income  veteran  college_educated  support_hillary
215-98-9343  youth        low     no       no                no
238-34-3493  youth        low     yes      no                no
234-28-2434  middle_aged  low     no       no                yes
243-24-2343  senior       low     no       no                yes
634-35-2345  senior       medium  no       yes               no
553-32-2323  senior       medium  yes      no                yes
554-23-4324  middle_aged  medium  no       yes               no
523-43-2343  youth        low     no       yes               no
553-23-1223  youth        low     no       yes               no
344-23-2321  senior       high    no       yes               yes
212-23-1232  youth        low     no       no                no
112-12-4521  middle_aged  high    no       yes               no
423-13-3425  middle_aged  medium  yes      yes               yes
423-53-4817  senior       high    no       yes               no

Added a social security number attribute. Splitting on ss yields 14 pure
single-tuple partitions, so Information Gain ranks it highest even though
it is useless for prediction.
49. Gain Ratio
• C4.5, a successor of ID3, uses this heuristic.
• It attempts to overcome Information Gain's bias in favor of attributes
  with a large number of values.
52. Gini Index
• CART uses this heuristic.
• Binary splits.
• Not biased toward multi-valued attributes the way Info Gain is.

Multiway split:  age → youth | middle_aged | senior
Binary split:    age → {youth, middle_aged} | {senior}
53. Gini Index
For the attribute age the possible subsets are:
{youth, middle_aged, senior}, {youth, middle_aged}, {youth, senior},
{middle_aged, senior}, {youth}, {middle_aged}, {senior} and {}.
We exclude the full set and the empty set, so we have to examine
2^v - 2 subsets.

54. Gini Index
Calculate the Gini index on each remaining subset, i.e. on each candidate
binary split.
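That enumeration can be sketched in code: list the 2^v - 2 candidate subsets of age's values and compute the weighted Gini index of each binary split. The class counts per age value are taken from the training table; this is an illustrative sketch, not CART itself.

```python
from itertools import combinations

# Class counts per age value in the training set: (yes, no)
COUNTS = {"youth": (0, 5), "middle_aged": (2, 2), "senior": (3, 2)}

def gini(yes, no):
    """Gini = 1 - sum(p_i^2) for a two-class partition."""
    total = yes + no
    return 1 - (yes / total) ** 2 - (no / total) ** 2 if total else 0.0

def gini_split(subset):
    """Weighted Gini index of the binary split: subset vs. complement."""
    sides = (set(subset), set(COUNTS) - set(subset))
    total = sum(y + n for y, n in COUNTS.values())
    score = 0.0
    for side in sides:
        yes = sum(COUNTS[v][0] for v in side)
        no = sum(COUNTS[v][1] for v in side)
        score += (yes + no) / total * gini(yes, no)
    return score

values = sorted(COUNTS)
# 2^v - 2 = 6 candidate subsets (full set and empty set excluded)
subsets = [c for r in range(1, len(values)) for c in combinations(values, r)]
for s in subsets:
    print(set(s), round(gini_split(s), 3))
```

Each subset and its complement describe the same split, which is why only v = 3 distinct binary splits appear among the 6 subsets.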
56. Miscellaneous thoughts
• Widely applicable to data exploration, classification, and scoring tasks
• Generates understandable rules
• Better at predicting discrete outcomes than continuous ones ("lumpy")
• Error-prone when the number of training examples for a class is small
• Most business cases try to predict a few broad categories