A detailed discussion of tree algorithms: Gini, Information Gain, and Chi-square for categorical target variables, and Reduction in Variance for continuous target variables. Let me know if anything is required. Happy to help. Enjoy machine learning! #bobrupakroy
2. Tree Algorithms:
For Categorical target variable
1. Gini is the most widely used splitting criterion.
It gives the probability that two items chosen at random from the same
population are in the same class.
For a pure population, this probability is 1.
Split A, node 1: reds = 2, blue = 0
prop. of reds = 1, prop. of blue = 0
Gini = 1^2 + 0^2 = 1

Split A, node 2: reds = 7, blue = 10
prop. of reds = 7/17 = .41, prop. of blue = 10/17 = .58
Gini = .41^2 + .58^2 = .50
3. Tree Algorithms:
For Categorical target variable
Split B, node 1: reds = 10, blue = 2
prop. of reds = 10/12 = .83, prop. of blue = 2/12 = .17
Gini = .83^2 + .17^2 = .72

Split B, node 2: reds = 2, blue = 10
prop. of reds = .17, prop. of blue = .83
Gini = .17^2 + .83^2 = .72
Gini score for split A = (1 * 2/19) + (.50 * 17/19) = .55
Gini score for split B = (.72 * 12/24) + (.72 * 12/24) = .72
The higher the Gini score, the purer the split, so the split with the higher
Gini score (here, split B) will be chosen by the Gini method. Gini is the
default criterion for decision trees.
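To make the arithmetic concrete, here is a minimal Python sketch of the Gini purity score, assuming per-node class counts as in the examples above; `gini` and `split_score` are illustrative helper names, not a library API.

```python
def gini(counts):
    """Sum of squared class proportions: 1 for a perfectly pure node."""
    total = sum(counts)
    return sum((c / total) ** 2 for c in counts)

def split_score(nodes):
    """Size-weighted average of the node Gini values."""
    total = sum(sum(node) for node in nodes)
    return sum(gini(node) * sum(node) / total for node in nodes)

# Split A: node 1 = (2 reds, 0 blues), node 2 = (7 reds, 10 blues)
print(round(split_score([(2, 0), (7, 10)]), 2))   # 0.57 (the slide's .55 uses rounded proportions)
# Split B: node 1 = (10 reds, 2 blues), node 2 = (2 reds, 10 blues)
print(round(split_score([(10, 2), (2, 10)]), 2))  # 0.72
```

Split B scores higher, matching the choice made by the Gini method above.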
4. Tree Algorithms:
Categorical target variable
2. Information Gain
Before applying information gain, let's understand what a logarithm is.
What is log(10,000)? It is 4, because
10,000 = 10 x 10 x 10 x 10 = 10^4
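As a quick sanity check of the logarithm idea, a small illustrative snippet:

```python
import math

print(math.log10(10_000))  # 4.0, since 10,000 = 10^4
print(math.log2(8))        # 3.0, since 8 = 2^3; entropy uses base-2 logs
```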
Split A, node 1: reds = 8, blue = 4
prop. of reds = 8/12 = .67, prop. of blue = 4/12 = .33
Entropy of node 1 = -1 * (.67 * log2(.67) + .33 * log2(.33)) = .92

Split A, node 2: reds = 4, blue = 8
prop. of reds = .33, prop. of blue = .67
Entropy of node 2 = -1 * (.33 * log2(.33) + .67 * log2(.67)) = .92
5. Tree Algorithms:
Categorical target variable
We can repeat the same for split B.
The entropy of a split is the size-weighted average of its nodes' entropies,
so the entropy of split A is .92; assume the entropy of split B = .81.
Then we compute the information gain for each split:
Information gain = Entropy(parent node) - Entropy(split)
The parent node here holds 12 reds and 12 blues, a 50/50 mix, so its entropy is 1.
Information gain for A = 1 - .92 = .08; information gain for B = 1 - .81 = .19.
The lower the entropy of a split, the purer its nodes, so information gain
will choose the split with the higher gain: split B.
Entropy is a measure of how disorganized a system is.
Entropy ranges from 0 to 1:
a pure node has an entropy of 0, while a maximally impure (50/50) node has an entropy of 1.
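A minimal sketch of entropy and information gain, assuming per-node class counts as in the slides; the helper names are illustrative.

```python
import math

def entropy(counts):
    """-sum(p * log2(p)) over the class proportions of one node."""
    total = sum(counts)
    return -sum((c / total) * math.log2(c / total) for c in counts if c)

def information_gain(parent, nodes):
    """Parent entropy minus the size-weighted entropy of the child nodes."""
    total = sum(sum(n) for n in nodes)
    split_entropy = sum(entropy(n) * sum(n) / total for n in nodes)
    return entropy(parent) - split_entropy

# Split A from above: parent node = 12 reds + 12 blues
print(round(information_gain((12, 12), [(8, 4), (4, 8)]), 2))  # 0.08
```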
6. Tree Algorithms:
Categorical target variable
3. Chi-square: a test of statistical significance developed by Karl Pearson.
Chi-square = sqrt((Actual - Expected)^2 / Expected), computed for each class
in each child node, where Expected is the count the node would hold if it
simply followed the parent's class distribution; the per-node scores are
summed to score the split.
Again, the split with the highest chi-square score will be selected.
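A minimal sketch of this chi-square split score, assuming a 50/50 parent so the expected count for each class is half the node size; helper names are illustrative.

```python
import math

def chi_square_node(counts):
    """sqrt((actual - expected)^2 / expected), summed over the classes."""
    expected = sum(counts) / len(counts)  # assumes a 50/50 parent
    return sum(math.sqrt((c - expected) ** 2 / expected) for c in counts)

def chi_square_split(nodes):
    """Split score: total chi-square over all child nodes."""
    return sum(chi_square_node(n) for n in nodes)

# Reusing split B from the Gini slides: (10 reds, 2 blues) and (2 reds, 10 blues)
print(round(chi_square_split([(10, 2), (2, 10)]), 2))  # 6.53
```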
7. Tree Algorithms:
Continuous target variable
4. Reduction in Variance
Variance measures how far each number in a set is from the mean. In
simple words, variance is the fact or quality of being different, divergent,
or inconsistent.
A low variance means most values
are close to the mean.
A high variance means most values
are far from the mean.
Variance = sum((X - mean)^2) / n, where X is each value, mean is the
average of the values, and n is the number of values.
So the reduction in variance split criterion is specially designed for
target variables with a continuous/numeric data type.
A pure node has a variance of 0, and, as before, the split with the highest
score, i.e. the largest reduction in variance, will be selected.
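A minimal sketch of reduction in variance, assuming a split given as lists of continuous target values per child node; the data and helper names are illustrative.

```python
def variance(values):
    """Average squared distance from the mean: sum((x - mean)^2) / n."""
    mean = sum(values) / len(values)
    return sum((x - mean) ** 2 for x in values) / len(values)

def variance_reduction(parent, nodes):
    """Parent variance minus the size-weighted variance of the children."""
    weighted = sum(variance(n) * len(n) / len(parent) for n in nodes)
    return variance(parent) - weighted

# Hypothetical continuous target, split into two tight clusters
parent = [10, 12, 11, 30, 31, 29]
print(round(variance_reduction(parent, [[10, 12, 11], [30, 31, 29]]), 2))  # 90.25
```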
8. Overfitting & Tree Pruning
A fully grown tree tends to overfit the data. Overfitting occurs when a
statistical model describes random error or noise rather than the underlying
relationship, and it generally happens when a model is excessively complex.
A model that overfits will have poor predictive power.
9. Overfitting & Tree Pruning
Pruning: the process of eliminating unstable nodes to create a simpler,
more robust tree. In other words, it reduces the size of a decision tree
by removing sections that provide little predictive power. Pruning reduces
the complexity of the final model and hence improves predictive accuracy
by reducing overfitting.
Pruning Algorithms:
CART: prunes the tree by imposing a complexity penalty based on the
number of leaves in the tree (see the sketch below).
C5: assumes a higher rate of error than what is seen on the training
data; the smaller the node, the greater the increase over the observed
error. When a child node's error estimate is higher than its parent's,
the tree is pruned.
Still, it is advisable to study the tree in detail: any node that looks
unstable should be pruned.
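For reference, scikit-learn exposes CART-style cost-complexity pruning through the `ccp_alpha` parameter; a brief sketch on a toy dataset, chosen only for illustration:

```python
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)

# ccp_alpha is the per-leaf complexity penalty mentioned above;
# a positive value trades training fit for a smaller, more robust tree.
full_tree = DecisionTreeClassifier(random_state=0).fit(X, y)
pruned_tree = DecisionTreeClassifier(ccp_alpha=0.02, random_state=0).fit(X, y)

print(full_tree.get_n_leaves(), pruned_tree.get_n_leaves())
```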
10. Applications of Techniques
1) The Classification & Regression Trees (CART) algorithm uses the Gini
method to create binary splits. It is the most commonly used decision tree
algorithm.
2) Chi-square Automatic Interaction Detector (CHAID) detects statistical
relationships between variables. It uses the chi-square test to produce
multi-way splits.
3) The Gini method is used in sociology and other noisy domains.
4) Reduction in variance and F-test algorithms are used in regression trees
(see the sketch below).
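As a rough mapping of these criteria onto scikit-learn (which implements CART only; CHAID is not included), with parameter names as in recent scikit-learn versions:

```python
from sklearn.tree import DecisionTreeClassifier, DecisionTreeRegressor

gini_tree = DecisionTreeClassifier(criterion="gini")        # Gini, the default
entropy_tree = DecisionTreeClassifier(criterion="entropy")  # information gain
reg_tree = DecisionTreeRegressor(criterion="squared_error") # minimizes node variance,
                                                            # i.e. reduction in variance
```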
11. Summary
For a binary/categorical target, use CART.
For noisy data, use CART, i.e. Gini.
If you want trees with multiple splits at each level, use CHAID.
For a numeric/continuous target variable, use the F-test or
Reduction in Variance.
12. Next
We will learn the requirements that make a good decision tree.