There are two basic types of decision tree analysis: Classification and Regression. Classification trees are used when the target variable is categorical, to classify/divide data into predefined categories. Regression trees are used when the target variable is numeric. Decision tree analysis is useful for classifying and segmenting markets, customer types and other categories in order to decide where to focus enterprise resources.
5. Terminologies
Decision tree: a powerful and popular tool for classification and prediction that takes the form of a tree structure.
Predictors and target variable:
Target variable, usually denoted by Y, is the variable being predicted. It is also called the dependent variable, output variable, response variable or outcome variable (e.g. the one highlighted in the red box in the image below).
Predictor, sometimes called an independent variable, is a variable used to predict the target variable (e.g. the variables highlighted in green below).
Here the predictors highlighted in the green box, which consist of wine attributes, are used to predict the target variable, the quality of a wine (labeled Quality_category), highlighted in the red box.
6. Terminologies
Root node: the topmost node in a tree.
Interior node: a non-leaf node; also called a decision node.
Leaf node: a terminal node in a decision tree where there are no further splits.
Splitting: the process of dividing a node into two or more sub-nodes.
7. Terminologies
Each internal (non-leaf) node denotes a test on a feature/predictor.
Each branch represents the outcome of a test.
Each leaf represents a value of the target variable (class label) given the values of the input variables represented by the path from the root to the leaf.
9. Types of Decision Tree
There are two basic types of decision tree:
• Classification
• Regression
Classification trees are needed when the target variable is categorical and, as the name implies, are used to classify/divide the data into the predefined categories of the target variable.
Examples:
• Based on historical data related to credit card payments, loan payments, delinquency rate and outstanding balance, we want to classify/divide customers into defaulters and non-defaulters.
• To assess the characteristics of a customer, such as purchase frequency, income, age, type of bank account, occupation etc., that lead to purchase/non-purchase of a particular banking product such as an installment loan, personal loan or checking account. Here a classification tree will classify the customers into purchasers and non-purchasers.
10. Types of Decision Tree
Regression trees are needed when the target variable is numeric.
Example: based on a customer's past behavioral data on a retail website, such as days since last purchase, brand preference, income, age, gender, website visits, location, total amount purchased so far etc., if we want to predict the purchase amount by each customer then regression trees are useful (here the target variable would be purchase amount).
11. Types of Decision Tree
Similarly, a regression tree can also be used to identify the market segments that are more likely to respond to a future mailing.
For instance, the segments having a response rate higher than the overall response rate (green box in the image below) can be targeted first, as they will require little effort to obtain. Whereas a different marketing strategy needs to be devised for the lower segments (segments having a response rate less than the overall rate, red box in the image below).
12. Classification Tree
Let's say we have only two predictors, level of alcohol and free sulfur dioxide in a wine, and we want to predict whether the wine quality (target variable) will be High or Low.
• Since the target variable, wine quality, contains categorical values (High and Low), the classification method is applicable here, as the predictors will be classifying the data into high and low.
• A decision tree tries splits on all available variables and then selects the split which results in the most homogeneous/pure sub-nodes.
• For example, if the target can be either yes or no (will or will not increase spending), the objective is to produce nodes where most of the cases will increase spending, or most of the cases will not increase spending.
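As a concrete sketch, the split-selection idea above can be reproduced with scikit-learn's DecisionTreeClassifier. The alcohol and free sulfur dioxide values below are made up for illustration and are not the deck's actual wine dataset:

```python
# Minimal classification-tree sketch (illustrative data, not the deck's wine data)
from sklearn.tree import DecisionTreeClassifier

# Hypothetical predictors: [alcohol, free_sulfur_dioxide]
X = [[10.5, 20], [10.8, 35], [11.2, 15], [11.5, 40],
     [12.1, 30], [12.4, 45], [12.8, 25], [13.0, 50]]
# Hypothetical target: wine quality category
y = ["low", "low", "low", "low", "high", "high", "high", "high"]

model = DecisionTreeClassifier(max_depth=2, random_state=0)
model.fit(X, y)

# A new wine with alcohol 12.5 falls on the "high" side of the learned split
print(model.predict([[12.5, 30]])[0])  # high
```

Because this toy data is perfectly separable on alcohol level, the tree needs only a single split; on real data the tree would evaluate both predictors at every node and keep the purest split.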
13. How Does A Tree Decide Where To Split
Let's say we have a sample of 30 students with three variables: Gender (Boy/Girl), Class (IX/X) and Height (5 to 6 feet). 15 out of these 30 play cricket in leisure time.
Now, we want to create a model to predict who will play cricket during the leisure period.
This is where the decision tree helps: it will segregate the students based on all values of the three variables and identify the variable which creates the best homogeneous sets of students.
14. Homogeneous Nodes
In the snapshot below, you can see that the variable Gender identifies the best homogeneous sets compared to the other two variables: in one node most of the cases (65%) play cricket, while in the other most of the cases (80%) don't play cricket; hence the nodes are homogeneous.
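The homogeneity described above is commonly measured with Gini impurity (used later in the deck's tuning parameters). A minimal sketch, using only the root count and the 65%/80% node proportions the slide gives:

```python
# Gini impurity for a two-class node: 1 - sum(p_i^2)
# 0.0 = perfectly pure node, 0.5 = maximally mixed 50/50 node
def gini(p_play):
    p_not = 1.0 - p_play
    return 1.0 - (p_play ** 2 + p_not ** 2)

# Root node: 15 of 30 students play cricket
print(round(gini(15 / 30), 3))  # 0.5, maximally impure

# After splitting on Gender (proportions from the slide):
print(round(gini(0.65), 3))     # node where 65% play -> 0.455
print(round(gini(0.20), 3))     # node where 80% don't play -> 0.32
```

Both child nodes have lower impurity than the root, which is why the Gender split is preferred over Class and Height.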
15. How To Interpret The Classification Tree Output
In our data, if it turns out that most of the wines containing alcohol level <11 are of low quality (hence homogeneous), then the first split happens based on alcohol level and it becomes the top node in the tree.
(In the tree image, the leaf counts show the total number of cases with prediction = low and the total number of cases with prediction = high for wine quality.)
16. How To Interpret The Classification Tree Output
Further, if alcohol >=12 then the tree classifies the wine as high quality, else low quality (as seen in the red box in the image below).
The cases/records falling in high quality are further tested on free sulfur dioxide level. If free sulfur dioxide is >=28 and alcohol is also >=12, then such wines are classified as high quality (as seen in the green box in the image below).
But wines with alcohol >=12 that do not have free sulfur dioxide >=28 are classified as low quality (as seen in the blue box in the image below).
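Rules of this if/else form can be printed for any fitted tree; a sketch using scikit-learn's export_text helper (the data is illustrative, constructed so that quality is high only when both thresholds are met):

```python
from sklearn.tree import DecisionTreeClassifier, export_text

# Illustrative data: "high" only when alcohol >= 12 AND free sulfur dioxide >= 28
X = [[11.0, 30], [11.5, 25], [12.5, 20], [12.5, 35],
     [13.0, 40], [13.0, 15], [10.5, 50], [12.2, 29]]
y = ["low", "low", "low", "high", "high", "low", "low", "high"]

model = DecisionTreeClassifier(max_depth=2, random_state=0).fit(X, y)

# Print the learned split rules, analogous to the boxes in the slide image
rules = export_text(model, feature_names=["alcohol", "free_sulfur_dioxide"])
print(rules)
```

The printed output is an indented list of threshold tests ending in class labels, which is exactly how the red/green/blue boxes on the slide should be read.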
17. Method: Regression
Regression-type trees are generally applicable where we attempt to predict the values of a numeric target variable from one or more numeric and/or categorical predictor variables.
For example, we may want to predict the quality of wine (a numeric target variable) from various other predictors such as volatile acidity in wine, alcohol level in wine etc.
In this case the leaf nodes will contain the predicted wine quality based on wine attributes such as alcohol and volatility, as shown below.
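A minimal regression-tree sketch, assuming made-up wine records (not the deck's dataset) with a numeric quality score as the target:

```python
from sklearn.tree import DecisionTreeRegressor

# Illustrative predictors [alcohol, volatile_acidity]; quality scored 3 to 10
X = [[10.0, 0.6], [10.5, 0.5], [11.0, 0.7], [11.5, 0.4],
     [12.0, 0.2], [12.5, 0.3], [13.0, 0.2], [13.5, 0.1]]
y = [4, 5, 4, 5, 7, 7, 8, 8]

# Splits are chosen to minimise variance within sub-nodes (impurity = variance)
model = DecisionTreeRegressor(max_depth=2, random_state=0).fit(X, y)

# Each leaf predicts the mean quality of the training cases falling into it
print(model.predict([[12.8, 0.25]])[0])  # 8.0
```

This is the key contrast with the classification tree: the leaves hold numeric means rather than class labels.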
18. Method: Regression
Let's take an example of predicting the wine quality on a scale of 3 to 10 based on predictors such as alcohol level, free sulfur dioxide level, volatility etc.
19. Method: Regression
As seen in the red box in the image below, the first split is again based on alcohol level, as we observed in the output of the classification tree example.
A similar pattern shows up here, wherein the quality is predicted to be high in case of free sulfur >=24 and alcohol >=12 (blue box in the image below).
An additional pattern observed here is that the wine quality also depends on the volatility level: quality is high in case of volatility level <0.21 (purple box below).
21. Max Depth
It sets the maximum depth of any node of the final tree, with the root node counted as depth 0.
The lower this number, the shallower the tree.
For instance, setting max depth = 2 while generating a classification tree to predict wine quality will lead to the output shown below (nodes at depths 0, 1 and 2).
22. Max Depth
Similarly, setting max depth = 3 will give the following output (nodes at depths 0 through 3).
Hence, the higher the max depth, the deeper the final tree. Deeper trees are generally not reliable as they tend to have nodes with very few records, so the tree would have poor generalizability.
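The effect of the max depth parameter can be sketched by fitting the same (made-up) yes/no data at two different depth limits:

```python
from sklearn.tree import DecisionTreeClassifier

# Illustrative yes/no data with three hypothetical numeric predictors
X = [[i, i % 5, (i * 7) % 11] for i in range(40)]
y = ["yes" if i % 3 else "no" for i in range(40)]

# The same data, grown with two different maximum depths
shallow = DecisionTreeClassifier(max_depth=2, random_state=0).fit(X, y)
deep = DecisionTreeClassifier(max_depth=6, random_state=0).fit(X, y)

# The deeper tree keeps splitting, producing more nodes with fewer records each
print(shallow.get_depth(), shallow.get_n_leaves())
print(deep.get_depth(), deep.get_n_leaves())
```

The shallow tree stops at depth 2 even though its leaves are still mixed; the deep tree grows further, which fits the training data better but risks the poor generalizability noted above.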
24. Input Wizard Sample For Selecting Target Variables And Predictors
Predictors and the target variable should be selected using the input wizard as shown below.
(1) Select the variables you want to use for prediction of the selected target variable: Purchase frequency, Age, Gender, Income, Website visits.
(2) Select the variable you would like to predict (target variable): Purchase frequency, Age, Gender, Income, Website visits, Purchase.
25. Input Wizard Sample For Categorical Predictors' Class Selection & Tuning Parameters
(Steps 3 and 4, assuming the target variable contains yes/no values.)
Sample tree: a root split on Age (<=18 / 18 to 25 / >=25) leading to High, Medium and Low nodes, with a further split on Gender (Male / Female).
Tuning parameters: Method = Classification, Impurity = Gini, # Classes in target variable = Two, plus Max Depth.
Categorical predictors: Purchase Frequency, Age, Gender. Select the classes to include (e.g. for Purchase frequency).
Note: tuning parameters are explained in the next section.
27. Sample Output Formats
Please note: Spark expects these parameters to be given as input instead of auto-detecting them.
1. Method: when the target variable type is numeric, regression should be auto-selected; in case of a categorical target variable type, classification should be auto-selected. Input control type: static label.
2. Impurity: if method = classification then impurity should be set to gini automatically; if method = regression then impurity should be set to variance automatically. Input control type: static label.
3. Categorical predictors info: categorical predictors and their class values should be auto-detected. Input control type for the categorical predictors list: static label. Input control type for class selection: multiple checkbox buttons.
4. Max depth: input control type: editable slider with numeric value label (suggested value: 3 to 5).
5. Number of classes (only for method = classification): this value is based on the total number of classes present in the target variable. For example, in the wine quality classification case, the total classes of wine quality are two: high and low. Input control type: static label.
33. Limitations Of Decision Tree
• Frequent changes to the data lead to substantial differences in the output; hence a decision tree should not be applied to data which fluctuates significantly.
• There have to be predefined classes for the target variable in the dataset (the categories to which each record belongs, for a classification tree).
• Decision trees are prone to errors in classification problems where the target variable contains many classes and the training dataset contains a relatively small number of records. Hence the total records in a training dataset must be large in proportion to the total classes of the target variable (there is no rule of thumb on how much larger the number of records should be compared to the number of target variable classes).
34. Business Problem 1 – Classification Tree
• Which customer segments should be targeted for increasing the subscription rate of a term deposit product?
• In this case, the classification tree can be used to assess the characteristics of customers that lead to subscription / non-subscription of a term deposit product targeted in a direct marketing campaign.
• Here the target variable would be the column indicating whether the customer called during the direct marketing campaign subscribed to a term deposit product ("yes" if subscribed, else "no").
35. The Dataset
• Let's say we have the following customer attributes:
o Age
o Job type
o Marital status
o Education
o Account default status
o Loan status
o Contact type
o Outcome of previous contact
As shown above, we want to use customer attributes (the predictors) such as age (numeric predictor), prior loan status (categorical predictor), marital status (categorical predictor) etc. to classify customers into subscribers and non-subscribers of the term deposit product (the target variable classes).
36. Output Tree 1
The bar plot in each leaf node shows the break-up of yes and no classes in the node, with the 0 to 1 scale on the right side of the bar plot indicating the percentage of yes and no in that node, and n showing the number of records belonging to that leaf node.
37. Interpretation Of Tree Output
As per the tree output, loan status came out to be the best predictor of term deposit product purchase.
Customers with a prior loan and marital status "married" outperform all other segments (highlighted with a blue dashed line).
Also, the customers with no prior loans and age > 60 have the second highest propensity to purchase the term deposit product (highlighted with a green dashed line).
Moreover, within the segment with no prior loans, singles with age <=22 seem to be outperforming the age > 22 segment in terms of term deposit product purchase (highlighted with a black dashed line).
38. How Splits And Terminal Nodes Are Generated
A decision tree chooses the predictor most predictive of the target class.
Here, in our case, most of the records (84% of records in the dataset) have prior loan status: no, and only 16% have loan status: yes. Within loan status: no, 70% of the population don't purchase a term deposit.

                   Term deposit      Term deposit      Total records (%)
                   purchased: Yes    purchased: No
Prior Loan: Yes    10%               6%                16%
Prior Loan: No     14%               70%               84%

None of the other predictors have such homogeneity with respect to term deposit purchase. For example, the marital status categories break up as follows:

Marital Status     No      Yes     Total records (%)
Divorced           13%     26%     39%
Married            21%     19%     40%
Single             6%      15%     21%

Thus, due to the relatively low homogeneity of other variables such as marital status, loan status was chosen as the attribute for the first split.
Similarly, the sub-nodes' splits happen using the same homogeneity criteria.
Terminal nodes are those nodes which can't be split further due to stopping criteria such as max depth (when the maximum depth defined is 3, node splitting stops once tree depth 3 is reached and the last generated nodes become the terminal nodes).
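The choice of loan status over marital status can be checked numerically from the two tables, assuming the percentages are shares of the full dataset and using Gini impurity as the homogeneity measure:

```python
# Gini impurity of one branch; lower = more homogeneous
def gini(yes, no):
    p = yes / (yes + no)
    return 1.0 - (p ** 2 + (1.0 - p) ** 2)

# Weighted impurity of a candidate split over all its branches
def weighted_gini(groups):
    # groups: list of (purchased_yes, purchased_no) shares per branch
    total = sum(y + n for y, n in groups)
    return sum((y + n) / total * gini(y, n) for y, n in groups)

# Shares (% of records) taken from the slide's tables
prior_loan = [(10, 6), (14, 70)]         # prior loan: yes / prior loan: no
marital = [(26, 13), (19, 21), (15, 6)]  # divorced / married / single

print(round(weighted_gini(prior_loan), 3))  # 0.308
print(round(weighted_gini(marital), 3))     # 0.459
```

The prior-loan split has the lower weighted impurity, matching the slide's conclusion that loan status creates the first split.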
39. Output 2: Accuracy Of Prediction
The actual versus predicted table shows how many classes are predicted correctly by the decision tree:

                Predicted: No    Predicted: Yes
Actual: No      38,439           6,772
Actual: Yes     0                0

Accuracy = sum of correctly predicted cases / all cases = 38,439 / (38,439 + 6,772 + 0 + 0) = 85.02%
Hence the sample decision tree model we just built is 85% accurate, and there is a 15% chance of error.
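The accuracy calculation above can be reproduced directly from the confusion-matrix counts:

```python
# Counts from the actual-vs-predicted table on the slide
confusion = {
    ("no", "no"): 38439,   # actual no, predicted no (correct)
    ("no", "yes"): 6772,   # actual no, predicted yes (error)
    ("yes", "no"): 0,
    ("yes", "yes"): 0,     # actual yes, predicted yes (correct)
}

# Accuracy = correct predictions / all predictions
correct = confusion[("no", "no")] + confusion[("yes", "yes")]
total = sum(confusion.values())
accuracy = correct / total
print(round(accuracy * 100, 2))  # 85.02
```

Note that this model predicts well only because "no" dominates the data; the zero counts in the "yes" row mean accuracy alone can hide poor performance on the minority class.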
40. Business Benefits
The segments highlighted in black, blue and green in tree output 1 are the low-hanging fruit requiring less effort to obtain, so there is no need to devise a different target marketing strategy for these segments.
The segments having the highest number of "No"s (which are not highlighted in tree output 1) need to be targeted in a different and more efficient way to convert them into purchasers, for example customers with marital status single/divorced.
Thus, segmenting customers based on their propensity to buy or not buy a product can aid in devising a better and more efficient target marketing strategy, in order to convert more non-purchasers into purchasers and in turn increase product penetration.
41. Use Case 2 – Classification Tree
Business problem:
Based on historical customer attributes such as credit card payments, loan payments, outstanding balance etc., a bank needs to classify customers into defaulters and non-defaulters.
• In this case, the classification tree can be used to assess the characteristics of customers that are likely to default.
• Here the target variable would be a column indicating whether the customer has defaulted previously or not ("yes" if defaulted, else "no").
Business benefit:
• The bank can decide which customer segments are eligible for any type of loan versus which customer segments should be denied any loan as they are likely to default.
• This way riskier customers are identified easily and the bank can avert the risk of delinquencies.
42. Use Case 3 – Regression Tree
Business problem:
Based on customers' attributes and past online shopping behavioral data, an online retail giant such as Amazon/Flipkart wants to predict the future purchase amount of customers.
• Here predictors can be the customer's 'days since last purchase', 'brand preference', 'income', 'age', 'gender', 'website visits', 'location', 'total amount purchased so far' etc.
• As the target variable is numeric (purchase amount), a regression tree can be used to predict the purchase amount for different types of customer segments.
Business benefit:
• Online retailers can identify the customer segments which have a higher capacity to purchase and can design a special marketing strategy for such segments, as these segments are their main revenue drivers.
• This way premium customers can be given special attention to retain their loyalty, and in turn revenue can be increased.
43. Use Case 4 – Regression Tree
Business problem:
• Predicting order completion time for a telecom service provider.
• Predictors in this case can be: user location, workforce availability, distance from the nearest network junction, average time taken in the last 6 months, average historical delay in the last 6 months etc.
• The target variable here would be the turnaround time of order completion.
Business benefit:
• As soon as a new order arrives, the service provider can give an estimated completion time to the customer based on the general pattern observed through the regression tree model.
• Proper workforce allocation and planning.
• Avoiding revenue leakage through prevention of delay fines.
44. Want to Learn More?
Get in touch with us @ support@Smarten.com
And do check out the Learning section on Smarten.com
June 2018