Decision tree
Goal of Classification Algorithm
• Build models with good generalization capability, i.e., models that
accurately predict the class labels of previously unseen records.
• Classification Algorithm
• Naïve Bayes Classifier
• Decision Tree
• Rule-based classifiers
• Neural Network
• Support Vector Machine
Why decision tree?
• Decision trees are powerful and popular tools for
classification and prediction.
• Decision trees represent rules, which can be understood
by humans and used in knowledge systems such as
databases.
Predicting potential loan default
Predicting potential loan default (Credit)
Predicting potential loan default (Income)
Predicting potential loan default (Term)
Predicting potential loan default
(Personal Info)
Intelligent Application
Classifier Review
Input Predicted class
 Decision tree is a classifier in the form of a tree structure
 Decision tree maps out all possible decision paths in the
form of a tree.
– Root node: has no incoming edges and zero or more
outgoing edges.
– Internal node (Decision node): specifies a test on a single
attribute
– Leaf node: indicates the value of the target attribute
– Branches (Arc/edge): split on one attribute
 Decision trees classify instances or examples by starting at
the root of the tree and moving through it until a leaf node is
reached, making a locally optimal decision at each step.
Definition
Decision Tree
What does decision tree represents?
What does decision tree represents?
Scoring a loan application
Decision Tree Classification Task
Decision
Tree
Test Data
Training Data
Learn Decision tree from data ?
Decision Tree Learning Problem
Quality metric: Classification Error
• Error measures the fraction of mistakes:
Error = (# incorrect predictions) / (# total predictions)
• Best possible value: 0.0
• Worst possible value: 1.0
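As a minimal illustration (not from the slides), this metric can be computed by comparing a model's predictions against the true labels; the label values below are made up for the example.

```python
def classification_error(y_true, y_pred):
    """Fraction of mistakes: # incorrect predictions / # total predictions."""
    mistakes = sum(1 for t, p in zip(y_true, y_pred) if t != p)
    return mistakes / len(y_true)

# Toy example: 2 mistakes out of 5 predictions -> error = 0.4
print(classification_error(["yes", "no", "no", "yes", "yes"],
                           ["yes", "yes", "no", "no", "yes"]))
```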
Find the tree with lowest classification
error
How do we find the best tree
•The exponentially large number of possible decision
trees makes finding the best tree hard.
Decision tree
• A decision tree represents the learned target function
• Each internal node tests an attribute
• Each branch corresponds to an attribute value
• Each leaf node assigns a classification
• Can be represented by a
logical formula
(1) Which attribute to
start with? (root)
(2) Which node
to proceed to next?
(3) When to stop / come
to a conclusion?
Tree Induction
•Greedy strategy.
• Split the records based on an attribute test that
optimizes certain criterion.
Greedy Algorithm
Step 1: Start with an empty tree
Greedy Algorithm
Step 2: Split on a feature
Feature split explained
Step 3: Making predictions
Step 4: Recursion
Greedy Decision Tree Algorithm
Step 1: Start with an empty tree
Step 2: Select a feature to split the data.
For each split of the tree:
Step 3: If there is nothing more to do, make
predictions.
Step 4: Otherwise, go to Step 2 &
continue (recurse) on this split.
Problem 1: Feature split
selection
Problem 2:
Stopping condition
Recursion
Design Issues of Decision Tree Induction
•Issues
• How to Classify a leaf node
• Assign the majority class
• If leaf is empty, assign the default class – the class that has the
highest popularity.
• Determine how to split the records
• How to specify the attribute test condition?
• How to determine the best split?
• Determine when to stop splitting
• Every attribute has already been included along this path
through the tree.
• Stop splitting if all the records belong to the same class or have
identical attribute values
• Stop when each leaf node has uncertainty below some
threshold.
Decision Tree learning
Start with the data
Assume N = 40, 3 features
Starts with all data
Compact visual notation: Root node
Decision Stump: Single Level Tree
Visual Notation: Intermediate
Node
Making Prediction with Decision Stump
How do we learn decision stump
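One way to learn a decision stump, sketched below under the assumption that records are dicts of categorical features, is to try every feature, predict the majority class in each resulting branch, and keep the feature with the lowest classification error; the loan data at the end is purely hypothetical.

```python
from collections import Counter

def stump_error(data, labels, feature):
    """Classification error of a one-level tree that splits on `feature`
    and predicts the majority class in each branch."""
    groups = {}
    for row, y in zip(data, labels):
        groups.setdefault(row[feature], []).append(y)
    mistakes = sum(len(ys) - Counter(ys).most_common(1)[0][1]
                   for ys in groups.values())
    return mistakes / len(labels)

def learn_stump(data, labels):
    """Pick the single feature whose split gives the lowest error."""
    return min(data[0].keys(), key=lambda f: stump_error(data, labels, f))

# Hypothetical loan records: 'credit' separates safe from risky perfectly here.
data = [{"credit": "excellent", "term": "3y"}, {"credit": "poor", "term": "5y"},
        {"credit": "fair", "term": "3y"},      {"credit": "poor", "term": "3y"}]
labels = ["safe", "risky", "safe", "risky"]
print(learn_stump(data, labels))   # -> 'credit'
```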
Algorithms
•Many Algorithms:
• Hunt’s Algorithm (one of the earliest)
• ID3 (Iterative Dichotomiser)
• C4.5
• CART (Classification And Regression Tree)
• SLIQ, SPRINT
General Structure of Hunt’s Algorithm
• Basis of many existing decision tree algorithms.
• Let Dt be the set of training records that reach a
node t
• General Procedure:
• If Dt contains records that belong to the same class
yt, then t is a leaf node labeled as yt
• If Dt contains records with the same attribute
values, then t is a leaf node labeled with the
majority class yt
• If Dt is an empty set, then t is a leaf node labeled
by the default class, yd
• If Dt contains records that belong to more than
one class, use an attribute test to split the data
into smaller subsets.
• Recursively apply the procedure to each
subset.
Tid  Refund  Marital Status  Taxable Income  Cheat
1    Yes     Single          125K            No
2    No      Married         100K            No
3    No      Single          70K             No
4    Yes     Married         120K            No
5    No      Divorced        95K             Yes
6    No      Married         60K             No
7    Yes     Divorced        220K            No
8    No      Single          85K             Yes
9    No      Married         75K             No
10   No      Single          90K             Yes
[Figure: a node t receives the record set Dt; the attribute test (?) at t is still to be chosen.]
Hunt’s Algorithm
[Figure: Hunt’s algorithm applied step by step to the training data above —
(1) a single leaf labeled Don’t Cheat;
(2) split on Refund: Yes → Don’t Cheat, No → ?;
(3) split the No branch on Marital Status: Married → Don’t Cheat, Single/Divorced → ?;
(4) split the Single/Divorced branch on Taxable Income: < 80K → Don’t Cheat, >= 80K → Cheat.
The training table is shown alongside, once in its original order and once sorted by Refund.]
Hunt’s Algorithm
•Empty node (none of the training records have this
combination of attribute values)
• The node is declared a leaf node with the same class label as
the majority class of training records associated with
its parent node.
•Non-empty node
• Same class
• Identical attribute values (except for the class label)
• The node is declared a leaf node with the same class label as the
majority class of training records associated with this node.
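Putting the cases above together, a minimal recursive sketch of Hunt's procedure could look like the following; the record format (a dict of attribute values per record, with labels in a parallel list) and the naive choice of the first remaining attribute are illustrative assumptions, not part of the original algorithm description.

```python
from collections import Counter

def hunt(records, labels, attributes, default):
    """Return a class label (leaf) or (attribute, {value: subtree}) for a node."""
    if not records:                              # empty node: default class
        return default
    majority = Counter(labels).most_common(1)[0][0]
    if len(set(labels)) == 1:                    # all records share one class
        return labels[0]
    if not attributes or all(r == records[0] for r in records):
        return majority                          # identical attribute values
    attr = attributes[0]                         # attribute test; a real learner
                                                 # would pick the best attribute
    tree = {}
    for value in set(r[attr] for r in records):
        sub_recs = [r for r, y in zip(records, labels) if r[attr] == value]
        sub_lbls = [y for r, y in zip(records, labels) if r[attr] == value]
        tree[value] = hunt(sub_recs, sub_lbls,
                           [a for a in attributes if a != attr], majority)
    return (attr, tree)

# Tiny subset of the Refund / Marital Status example above.
records = [{"Refund": "Yes", "Marital": "Single"},
           {"Refund": "No",  "Marital": "Married"},
           {"Refund": "No",  "Marital": "Single"}]
labels = ["No", "No", "Yes"]
print(hunt(records, labels, ["Refund", "Marital"], default="No"))
```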
Iterative Dichotomiser (ID3)
• Dichotomisation means dividing into two sharply
different categories.
[Figure: example tree — Outlook at the root; Sunny → Humidity (High → No, Normal → Yes);
Overcast → Yes; Rain → Wind (Strong → No, Weak → Yes).]
Principled Criterion
•Selection of an attribute to test at each node:
choose the most useful attribute for classifying
examples.
•Information gain
• Measures how well a given attribute separates the
training examples according to their target
classification.
• This measure is used to select among the candidate
attributes at each step while growing the tree.
• Gain is a measure of how much we can reduce
uncertainty (its value lies between 0 and 1)
How to Specify Test Condition?
•Depends on attribute types
• Binary
• Nominal
• Ordinal
• Continuous
•Depends on number of ways to split
• 2-way split
• Multi-way split
Splitting Based on Nominal Attributes
• Binary split: The test condition for a binary attribute generates
two potential outcomes
Body Temp
{Warm-blooded} {Cold-blooded}
Splitting Based on Nominal Attributes
• Multi-way split: Use as many partitions as distinct values.
• Binary split: Divides values into two subsets.
Need to find optimal partitioning.
CarType
Family
Sports
Luxury
CarType
{Family,
Luxury} {Sports}
CarType
{Sports,
Luxury} {Family}
OR
Note: CART produces only binary splits by considering all 2^(k−1) − 1
ways of creating a binary partition of k attribute values.
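As a small aside (an illustration, not from the slides), the 2^(k−1) − 1 candidate binary partitions of a k-valued nominal attribute can be enumerated by fixing one value on the left side and choosing any subset of the remaining values to join it:

```python
from itertools import combinations

def binary_partitions(values):
    """All 2**(k-1) - 1 ways to split k nominal values into two non-empty groups."""
    values = list(values)
    anchor, rest = values[0], values[1:]
    parts = []
    for r in range(len(rest) + 1):
        for combo in combinations(rest, r):
            left = {anchor, *combo}
            right = set(values) - left
            if right:                      # skip the split with an empty side
                parts.append((left, right))
    return parts

print(binary_partitions(["Family", "Sports", "Luxury"]))
# 2**(3-1) - 1 = 3 partitions, e.g. ({'Family'}, {'Sports', 'Luxury'})
```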
Splitting Based on Ordinal Attributes
• Multi-way split: Use as many partitions as distinct values.
• Binary split: Divides values into two subsets – respects the order
(values are grouped as long as the grouping does not violate the order
property of the attribute values). Need to find the optimal partitioning.
Size
Small
Medium
Large
Size
{Medium,
Large,
Extra Large} {Small}
Size
{Small,
Medium} {Large, Extra Large}
OR
Size
{Small,
Large} {Medium,
Extra Large}
(Note: this last grouping violates the order property, so it is not a valid ordinal split.)
Splitting Based on Continuous Attributes
•Different ways of handling
•Discretization to form an ordinal categorical
attribute
• Static – discretize once at the beginning
• Dynamic – ranges can be found by equal interval bucketing,
equal frequency bucketing (percentiles), or clustering.
•Binary Decision: (A < v) or (A ≥ v)
• consider all possible splits and find the best cut
• can be more compute intensive
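A rough sketch of this binary-decision approach (an assumption-laden illustration that uses classification error rather than an entropy-based impurity): sort the observed values, take midpoints between consecutive distinct values as candidate cut points, and keep the threshold with the lowest weighted error.

```python
def best_threshold(values, labels):
    """Try midpoints between sorted distinct values; return (threshold, error)."""
    pairs = sorted(zip(values, labels))
    xs = sorted(set(values))
    best = (None, float("inf"))
    for lo, hi in zip(xs, xs[1:]):
        t = (lo + hi) / 2                    # candidate cut: A < t vs. A >= t
        left = [y for x, y in pairs if x < t]
        right = [y for x, y in pairs if x >= t]
        err = sum(len(side) - max(side.count(c) for c in set(side))
                  for side in (left, right)) / len(labels)
        if err < best[1]:
            best = (t, err)
    return best

# Hypothetical ages with a clean cut: everyone under ~34 is 'risky' here.
ages = [22, 25, 30, 38, 41, 55, 60]
labels = ["risky", "risky", "risky", "safe", "safe", "safe", "safe"]
print(best_threshold(ages, labels))   # -> (34.0, 0.0)
```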
Splitting Based on Continuous Attributes
Threshold Split
Splitting Based on Continuous Attributes
•Threshold Split in 1-D
Splitting Based on Continuous Attributes
•Visualizing the threshold split
Splitting Based on Continuous Attributes
Split on Age >= 38
Splitting Based on Continuous Attributes
•Split on Income >= $60K
Splitting Based on Continuous Attributes
•Each split partitions the 2D space
Splitting Based on Continuous Attributes
How to determine the Best Split
Before Splitting: 10 records of class 0,
10 records of class 1
Which test condition is the best?
• Class distribution of the records before and after splitting
How to determine the Best Split
• Greedy approach:
• Nodes with homogeneous class distribution are preferred
• Need a measure of node impurity: the smaller the degree of impurity, the
more skewed the class distribution.
• Ideas?
• Entropy and Information gain
Non-homogeneous,
High degree of impurity
Homogeneous,
Low degree of impurity
Entropy
• A measure of
• Uncertainty
• (Im)Purity
• Information content
• Given a collection S:
Entropy(S) = − p⊕ log₂ p⊕ − p⊖ log₂ p⊖
where p⊕ is the proportion of positive examples in S
and p⊖ is the proportion of negative examples in S.
• The lower the entropy, the less uniform the distribution and the
purer the node.
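A direct transcription of this formula (a sketch; log base 2, with the usual convention that 0 · log 0 = 0):

```python
import math

def entropy(pos, neg):
    """Entropy of a two-class collection with `pos` positive and `neg` negative examples."""
    total = pos + neg
    result = 0.0
    for count in (pos, neg):
        if count:                         # treat 0 * log(0) as 0
            p = count / total
            result -= p * math.log2(p)
    return result

print(entropy(9, 5))    # ~0.940  (the 14-example collection used later in the slides)
print(entropy(7, 7))    # 1.0  -> maximally impure
print(entropy(14, 0))   # 0.0  -> pure node
```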
Information Gain
• Gain tells us how much information would be gained by branching on A.
• Information gain is simply the expected reduction in entropy caused by
partitioning the examples according to the selected attribute.
• Information gain, Gain(S, A), of an attribute A is defined as
Gain(S, A) = Entropy(S) − Σ_{v ∈ Values(A)} (|S_v| / |S|) · Entropy(S_v)
where Values(A) is the set of all possible values for attribute A, and S_v is the
subset of S for which attribute A has value v.
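The same definition in code (an illustrative sketch, assuming records are dicts and labels are a parallel list); the toy data reproduces the Wind example computed later in the slides:

```python
from collections import Counter
import math

def entropy_of(labels):
    """Entropy of a list of class labels (any number of classes)."""
    total = len(labels)
    return -sum((c / total) * math.log2(c / total) for c in Counter(labels).values())

def information_gain(records, labels, attribute):
    """Gain(S, A) = Entropy(S) - sum over v of |S_v|/|S| * Entropy(S_v)."""
    total = len(labels)
    remainder = 0.0
    for value in set(r[attribute] for r in records):
        subset = [y for r, y in zip(records, labels) if r[attribute] == value]
        remainder += len(subset) / total * entropy_of(subset)
    return entropy_of(labels) - remainder

# Wind example: S = [9+, 5-]; Weak -> [6+, 2-], Strong -> [3+, 3-]
records = [{"wind": "Weak"}] * 8 + [{"wind": "Strong"}] * 6
labels = ["+"] * 6 + ["-"] * 2 + ["+"] * 3 + ["-"] * 3
print(round(information_gain(records, labels, "wind"), 3))   # -> 0.048
```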
Simple Greedy Decision Tree Learning
When do we stop ?
Stopping Condition
1. All data agrees on y
Stopping Condition 2: Already split on
all features
Training Example
Example
• D is a collection of 14 examples, 9 positive and 5 negative
Entropy([9+, 5−]) = −(9/14) log₂(9/14) − (5/14) log₂(5/14) = 0.940
Entropy is 0 if all members of D belong to the same class. Entropy is 1
when the collection contains an equal number of positive and negative
examples.
Entropy
1. The entropy is 0 if the outcome is
‘certain’.
2. The entropy is maximum if we
have no knowledge of the system
(i.e., any outcome is equally likely).
 S is a sample of training examples.
 p⊕ is the proportion of positive examples in S.
 p⊖ is the proportion of negative examples in S.
 Entropy measures the impurity of S:
Entropy(S) = − p⊕ log₂ p⊕ − p⊖ log₂ p⊖
[Figure: entropy of a 2-class problem as a function of the
proportion of one of the two groups.]
Examples
• Before partitioning, the entropy is
• Info(10/20, 10/20) = − 10/20 log(10/20) − 10/20 log(10/20) = 1
• Using the “where” attribute, divide into 2 subsets
• Entropy of the first set: Info(home) = − 6/12 log(6/12) − 6/12 log(6/12) = 1
• Entropy of the second set: Info(away) = − 4/8 log(4/8) − 4/8 log(4/8) = 1
• Expected entropy after partitioning
• 12/20 * Info(home) + 8/20 * Info(away) = 1
Example
• Using the “when” attribute, divide into 3 subsets
• Entropy of the first set: Info(5pm) = − 1/4 log(1/4) − 3/4 log(3/4)
• Entropy of the second set: Info(7pm) = − 9/12 log(9/12) − 3/12 log(3/12)
• Entropy of the third set: Info(9pm) = − 0/4 log(0/4) − 4/4 log(4/4) = 0
• Expected entropy after partitioning
• 4/20 * Info(1/4, 3/4) + 12/20 * Info(9/12, 3/12) + 4/20 * Info(0/4, 4/4) = 0.65
• Information gain: 1 − 0.65 = 0.35
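These numbers can be checked directly (a small verification sketch; the "where"/"when" attributes and the 20-game counts are taken from the example above):

```python
import math

def info(*counts):
    """Entropy of a distribution given raw counts (0 log 0 treated as 0)."""
    total = sum(counts)
    return -sum(c / total * math.log2(c / total) for c in counts if c)

before = info(10, 10)                                       # 1.0
after_where = 12/20 * info(6, 6) + 8/20 * info(4, 4)        # 1.0 -> gain 0.0
after_when = 4/20 * info(1, 3) + 12/20 * info(9, 3) + 4/20 * info(0, 4)
print(round(before - after_where, 2))   # 0.0
print(round(before - after_when, 2))    # 0.35
```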
Training Example
Weak wind factor on decision
Strong wind factor on decision
Example
• The information gain due to sorting the original 14 examples by the
attribute Wind may then be calculated as
Values(Wind) = {Weak, Strong}
S = [9+, 5−]
S_Weak ← [6+, 2−]
S_Strong ← [3+, 3−]
Gain(S, Wind) = Entropy(S) − Σ_{v ∈ {Weak, Strong}} (|S_v| / |S|) · Entropy(S_v)
= Entropy(S) − (8/14) Entropy(S_Weak) − (6/14) Entropy(S_Strong)
Continue
= 0.940 − (8/14) ∗ 0.811 − (6/14) ∗ 1.00
= 0.048
Continue
• The Information Gain for all four attributes is:
Gain(S, Outlook) = 0.246
Gain(S, Humidity) = 0.151
Gain(S, Wind) = 0.048
Gain(S, Temperature) = 0.029
Intermediate Resulting Tree
Overcast outlook on decision
• The decision will always be Yes if the outlook is Overcast.
Intermediate Resulting Tree
Intermediate Resulting Tree
[Figure: Outlook at the root.
Sunny → {D1, D2, D8, D9, D11} [2+, 3−], attribute still to be selected (?);
Overcast → {D3, D7, D12, D13} [4+, 0−] → Yes;
Rain → {D4, D5, D6, D10, D14} [3+, 2−], attribute still to be selected (?).
Which attribute to select at the remaining nodes?]
Sunny outlook on decision
• Here, there are 5 instances with a sunny outlook; the decision is
No for 3/5 of them and Yes for 2/5.
Cont…
Gain(S_sunny, Temperature) = 0.570
Gain(S_sunny, Wind) = 0.019
Gain(S_sunny, Humidity) = 0.970
Humidity is selected because it produces the highest gain if the outlook
is Sunny.
At this point, the decision will always be No if humidity is
High.
On the other hand, the decision will always be Yes if humidity
is Normal.
Intermediate Resulting Tree
Rain outlook on decision
Cont…
Gain(S_rain, Temperature), Gain(S_rain, Humidity) and Gain(S_rain, Wind)
are computed in the same way;
Wind produces the highest gain if the outlook is Rain.
The decision will always be Yes if the wind is Weak and the outlook is Rain.
The decision will always be No if the wind is Strong and the outlook
is Rain.
Final Tree
Information Gain: Limitation
• Problematic: attributes with a large number of values (extreme
case: an ID attribute)
• Subsets are more likely to be pure if there is a large number of
values
• Information gain is biased towards choosing attributes with a large
number of values
Gain Ratio
• A modification of the information gain that reduces its bias.
• The gain ratio measure penalizes attributes such as customer ID by
incorporating a term called split information.
• Split information is sensitive to how broadly and uniformly the
attribute splits the data.
C4.5
• C4.5, a successor of ID3, uses an extension to information gain known
as gain ratio.
• It overcomes the bias problem.
• It applies a kind of normalization to information gain using a split
information value.
SplitInfo_A(D) = − Σ_{j=1}^{v} (|D_j| / |D|) log₂(|D_j| / |D|)
SplitInfo_A(D) is the entropy of D with respect to the values of attribute
A.
GainRatio(A) = Gain(A) / SplitInfo_A(D)
The attribute with the maximum gain ratio is selected as the splitting
attribute.
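A sketch of the gain ratio computation (illustrative code, same assumed data format as the earlier information-gain sketch); the final line reproduces the Outlook row of the table below:

```python
from collections import Counter
import math

def info(counts):
    """Entropy of a distribution given a list of counts."""
    total = sum(counts)
    return -sum(c / total * math.log2(c / total) for c in counts if c)

def gain_ratio(records, labels, attribute):
    """GainRatio(A) = Gain(A) / SplitInfo(A); SplitInfo is the entropy of the split sizes."""
    total = len(labels)
    subsets = {}
    for r, y in zip(records, labels):
        subsets.setdefault(r[attribute], []).append(y)
    remainder = sum(len(s) / total * info(list(Counter(s).values()))
                    for s in subsets.values())
    gain = info(list(Counter(labels).values())) - remainder
    split_info = info([len(s) for s in subsets.values()])
    return gain / split_info

# Outlook column of the 14-day weather data: Sunny [2+,3-], Overcast [4+,0-], Rain [3+,2-]
records = [{"outlook": "Sunny"}] * 5 + [{"outlook": "Overcast"}] * 4 + [{"outlook": "Rain"}] * 5
labels = ["yes"] * 2 + ["no"] * 3 + ["yes"] * 4 + ["yes"] * 3 + ["no"] * 2
print(round(gain_ratio(records, labels, "outlook"), 3))   # -> 0.156
```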
Gain ratios for weather data
Attribute     Info    Gain                    Split info              Gain ratio
Outlook       0.693   0.940 − 0.693 = 0.247   info([5,4,5]) = 1.577   0.247 / 1.577 = 0.156
Temperature   0.911   0.940 − 0.911 = 0.029   info([4,6,4]) = 1.362   0.029 / 1.362 = 0.021
Humidity      0.788   0.940 − 0.788 = 0.152   info([7,7]) = 1.000     0.152 / 1.000 = 0.152
Windy         0.892   0.940 − 0.892 = 0.048   info([8,6]) = 0.985     0.048 / 0.985 = 0.049