A detailed discussion of the decision tree regressor and classifier, with a look at finding the right algorithm to split.
Let me know if anything is required. Ping me at google #bobrupakroy
2. What is a Decision Tree?
Decision trees are a type of supervised machine learning model or, in
simple words, a branching method where the data is repeatedly split
according to certain parameters.
3. Binary target variable
If the target variable for a decision tree is a binary we will use Binary
Decision tree
Target variable: 1 for Sales, 0 for No-Sales
[Figure: an example binary decision tree. Each node shows an advertising
spend and sales pair (from Adv. = 900 / Sales = 6,310 up to
Adv. = 8,000 / Sales = 91,000), with node percentages of 30%, 50%, 70%,
and 100%.]
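As a minimal sketch of such a binary decision tree (assuming scikit-learn; the advertising-spend numbers below are invented to echo the diagram, not a real dataset):

# A minimal sketch of a binary decision tree classifier with scikit-learn.
# The advertising-spend figures are illustrative stand-ins for the
# numbers in the slide's diagram, not a real dataset.
from sklearn.tree import DecisionTreeClassifier

# Feature: advertising spend; target: 1 for Sales, 0 for No-Sales
X = [[900], [1000], [1500], [2000], [4500], [6000], [8000]]
y = [0, 0, 0, 1, 1, 1, 1]

clf = DecisionTreeClassifier(max_depth=2, random_state=0)
clf.fit(X, y)

# Predict whether a customer at a given advertising spend converts
print(clf.predict([[3000]]))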
4. Continuous target variable
If the target variable is numeric, like income (a continuous variable, not
discrete like Yes or No), we use a regression tree for prediction.
Each split in the tree is chosen to decrease the variance in the values
of the target variable within each child node.
In simple words: if average income is less than $70K, split the data and
create a new subtree under $60K; again, if it is less than $60K, split
once more into child nodes, and so on.
[Figure: an example regression tree splitting on average income at $70K
and then $60K, with Yes/No branches ending in leaf values of 50, 10, 49,
and 38.]
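A minimal regression-tree sketch along the same lines (assuming scikit-learn, with a made-up age/income dataset; the variance-reducing splits happen inside fit):

# A sketch of a regression tree on a hypothetical dataset where income
# (continuous) is predicted from age. Each split is chosen to reduce the
# variance of income within the resulting child nodes.
from sklearn.tree import DecisionTreeRegressor

X = [[22], [25], [30], [35], [40], [45], [50], [55]]  # age
y = [18_000, 24_000, 38_000, 49_000, 60_000, 70_000, 72_000, 75_000]  # income

reg = DecisionTreeRegressor(max_depth=2, random_state=0)
reg.fit(X, y)

# The leaf a sample lands in predicts the mean income of that leaf
print(reg.predict([[33]]))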
5. Continuous target variable
Example: a company wants to impute missing values in the income
field for its customers. The average income of a customer is 30K, but
rather than assigning that average to every missing value, the company
can assign the missing values using the rules created from a decision
tree, for a better estimate.
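One way this could look in code, assuming scikit-learn and a hypothetical pandas DataFrame; the column names (age, years_employed, income) and values are invented for illustration:

# A sketch of tree-based imputation: fit a regression tree on the rows
# where income is known, then predict income for the rows where it is
# missing. All data here is made up.
import pandas as pd
from sklearn.tree import DecisionTreeRegressor

df = pd.DataFrame({
    "age":            [25, 32, 47, 51, 38, 29],
    "years_employed": [2, 8, 20, 25, 12, 4],
    "income":         [22_000, 35_000, 58_000, None, 41_000, None],
})

known = df[df["income"].notna()]
missing = df[df["income"].isna()]

reg = DecisionTreeRegressor(max_depth=2, random_state=0)
reg.fit(known[["age", "years_employed"]], known["income"])

# Fill each missing income with the prediction from the tree's rules
df.loc[df["income"].isna(), "income"] = reg.predict(
    missing[["age", "years_employed"]]
)
print(df)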
Terminology:
• The base node is also known as the root node.
• Any node that can be split further is called a decision node.
• Nodes that cannot be split further are called terminal nodes
or leaf nodes.
• The process of cutting down the tree or removing sections of it is
called pruning.
• The process of adding a whole section to a tree is called grafting.
6. Data preparation for decision trees
Most decision trees can handle categorical & continuous variables, so
not much data transformation is needed.
Classification trees are used when the target variable is discrete.
Regression trees are used when the target variable is continuous.
Removing records due to missing values is likely to create a biased
training set, because the records with missing values are not likely to
be a random sample of the population. Removing them also risks losing
important associated information.
Replacing them with imputed values risks distorting important
properties of the data, which tends to create a biased model.
Treating missing values as a separate category is better than assigning
them the average value (see the sketch below).
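A small sketch of that idea, assuming pandas and an invented 'region' column:

# Treat missing values as their own category instead of dropping rows
# or imputing an average/most-common value, so the tree can split on
# "Missing" itself. The column name and data are hypothetical.
import pandas as pd

df = pd.DataFrame({"region": ["east", None, "west", None, "north"]})
df["region"] = df["region"].fillna("Missing")
print(df["region"].value_counts())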
7. Decision trees are a non-parametric technique.
What is a non-parametric technique?
A parametric statistical test is one that makes assumptions about the parameters
(defining properties) of the population distribution(s) from which one’s data are
drawn, while a non-parametric test is one that makes no such assumptions.
For practical purposes, you can think of “parametric” as referring to tests, such as the t-test
and the analysis of variance, that assume the underlying source population(s) to be
normally distributed; they generally also assume that one’s measures derive from an
equal-interval scale. And you can think of “non-parametric” as referring to tests that do
not make these particular assumptions.
Examples of non-parametric tests include the various forms of chi-square tests,
the Fisher Exact Probability test,
the Mann-Whitney Test,
the Wilcoxon Signed-Rank Test,
the Kruskal-Wallis Test and the Friedman Test.
Non-parametric tests are sometimes spoken of as "distribution-free" tests.
Hence, because splits depend only on the ordering of values rather than their magnitude, decision trees are not affected by outliers.
8. Steps for a decision tree
1. Find the split
- Identify all possible split options
- Choose the best split value for the tree
2. Grow the tree
- Continue growing the tree as far as possible
3. Prune the tree
- Stop/prune the tree using a size-based criterion (see the sketch after this list)
4. Extract the rules
- Extract the rules generated from the tree.
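A sketch of these four steps with scikit-learn, using its built-in iris data and cost-complexity pruning (ccp_alpha) as one example of a size/complexity-based criterion:

# Grow a full tree, prune it with cost-complexity pruning, and extract
# the resulting rules.
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier, export_text

X, y = load_iris(return_X_y=True)

# Steps 1-2: find the splits and grow the tree as far as possible
full_tree = DecisionTreeClassifier(random_state=0).fit(X, y)

# Step 3: prune by refitting with a cost-complexity penalty
pruned_tree = DecisionTreeClassifier(ccp_alpha=0.02, random_state=0).fit(X, y)

# Step 4: extract the rules generated from the tree
print(export_text(pruned_tree))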
9. Finding the right split
The split that creates the most homogeneous sub-populations is considered
the best split.
[Figure: a poor split vs. a good (homogeneous) split.]
There are various decision tree algorithms that help split the data into
smaller and smaller groups in such a way that each new node has greater
purity than its parent node with respect to the target variable.
Splits are evaluated based on node purity in terms of the target variable.
This means the splitting criterion depends on the type of the target
variable and not on the type of the input variables.
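As a small illustration of purity, here is a Gini impurity function (one of the measures listed on the next slide) applied to a made-up poor split and good split:

# Gini impurity: 0 means a perfectly homogeneous node. A good split
# produces children with much lower impurity than a poor one.
def gini(counts):
    total = sum(counts)
    return 1.0 - sum((c / total) ** 2 for c in counts)

# Poor split: both children stay mixed (high impurity)
print(gini([5, 5]), gini([5, 5]))   # 0.5 0.5

# Good split: each child is nearly pure (low impurity)
print(gini([9, 1]), gini([1, 9]))   # 0.18 0.18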
10. Finding the right split algorithm
1. For a categorical target variable we use:
- Gini
- Chi-square
- Information Gain
2. For a continuous target variable:
- Reduction in variance
3. Other methods:
- Gain Ratio, an improvement over the information gain measure.
- F-test, which measures the variance in the distributions between parent &
child nodes. It is used when the target variable is continuous.
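A sketch of two of these criteria from first principles, on made-up data: information gain (entropy decrease) for a categorical target, and reduction in variance for a continuous one.

import math

def entropy(labels):
    n = len(labels)
    probs = [labels.count(c) / n for c in set(labels)]
    return -sum(p * math.log2(p) for p in probs)

def information_gain(parent, children):
    n = len(parent)
    weighted = sum(len(ch) / n * entropy(ch) for ch in children)
    return entropy(parent) - weighted

def variance(values):
    mean = sum(values) / len(values)
    return sum((v - mean) ** 2 for v in values) / len(values)

def variance_reduction(parent, children):
    n = len(parent)
    weighted = sum(len(ch) / n * variance(ch) for ch in children)
    return variance(parent) - weighted

# Categorical target: a split that separates the classes has high gain
print(information_gain([1, 1, 0, 0], [[1, 1], [0, 0]]))          # 1.0

# Continuous target: a split that groups similar values cuts variance
print(variance_reduction([10, 12, 50, 52], [[10, 12], [50, 52]]))  # 400.0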