Decision Tree
Classification: Decision Tree
• A decision tree is a popular machine learning algorithm used
for both classification and regression tasks.
• It's a tree-like structure where each internal node represents a
decision based on a feature, each branch represents an
outcome of that decision, and each leaf node represents the
final output (class label or numerical value).
• Decision trees are particularly useful for their simplicity,
interpretability, and ease of visualization.
Decision Tree
• The process of building a decision tree involves recursively splitting the dataset into
subsets based on the most informative features at each node.
• The algorithm selects the feature that provides the best separation of the data,
typically using metrics like Gini impurity for classification tasks or mean squared
error for regression tasks.
• This process continues until a stopping criterion is met, such as a predefined depth
of the tree or when further splitting does not significantly improve the model.
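As an aside, these splitting and stopping choices map directly onto hyperparameters in a typical implementation. The sketch below (not from the slides) uses scikit-learn's DecisionTreeClassifier on synthetic data; the parameter values are arbitrary illustrations, not recommendations.

```python
# A minimal sketch: how the splitting criterion and the stopping rules map onto
# scikit-learn's DecisionTreeClassifier hyperparameters. Dataset and values are
# arbitrary, for illustration only.
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=200, n_features=4, random_state=0)

tree = DecisionTreeClassifier(
    criterion="gini",            # impurity metric used to pick the best split
    max_depth=3,                 # stop at a predefined depth
    min_samples_split=10,        # do not split nodes smaller than 10 samples
    min_impurity_decrease=0.01,  # stop when a split barely improves impurity
    random_state=0,
)
tree.fit(X, y)
print(tree.get_depth(), tree.get_n_leaves())
```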
Pros and Cons of Decision Tree
Pros: Decision trees require little data preprocessing (no need for normalization or scaling) and can handle both numerical and categorical data.
Decision trees are widely used in various applications due to their simplicity, interpretability,
and effectiveness in capturing complex decision boundaries in the data.
Cons: Decision trees are prone to overfitting, especially when the tree becomes too deep.
Techniques like pruning or using ensemble methods like Random Forests can help mitigate
overfitting and improve generalization performance.
Terminologies
In decision trees, several key terminologies and concepts are used to describe
different elements and aspects of the tree-building process.
Root Node: The top node of the decision tree, from which the tree starts branching.
Decision Nodes (Internal Nodes): Nodes in the tree where decisions are made based
on the values of features. These nodes have branches leading to child nodes.
Leaf Nodes (Terminal Nodes): End nodes of the tree where the final decision or prediction is made. They do not have any child nodes.
Branches: The edges connecting nodes in the tree. Each branch corresponds to a possible outcome of a decision.
Splitting: The process of dividing a node into two or more child nodes based on a certain feature and a threshold or condition.
Decision Rule: The condition at each decision node that determines the direction
of the branch. It is based on a feature and a threshold.
Entropy: A measure of impurity or disorder in a set of data. Decision trees use
entropy to find the best feature to split the data.
Information Gain: The reduction in entropy or impurity achieved by splitting the
data based on a particular feature. Decision trees aim to maximize information
gain when choosing features for splitting.
Gini Impurity: Another measure of impurity used in decision trees. It represents
the probability of incorrectly classifying a randomly chosen element.
Pruning: The process of removing branches from the tree to reduce overfitting.
Pruning helps improve the generalization ability of the model.
Depth of the Tree: The length of the longest path from the root node to a leaf
node. It indicates how many decisions or splits are made before reaching a
prediction.
CART (Classification and Regression Trees)
• An algorithm used for constructing decision trees that can be applied to both classification and
regression tasks.
Random Forest
• An ensemble learning method that builds multiple decision trees and combines their predictions to
improve accuracy and reduce overfitting.
• These terminologies provide a foundation for understanding and discussing decision trees and their
construction. Different algorithms and implementations may use variations of these terms, but the
fundamental concepts remain consistent across the field of machine learning.
Example of a Decision Tree
[Figure: a small training-data table (two categorical attributes, Refund and MarSt; one continuous attribute, TaxInc; class label Cheat) alongside the decision-tree model built from it. The splitting attributes are Refund at the root (Yes → NO), then MarSt (Married → NO; Single, Divorced → test TaxInc), then TaxInc (< 80K → NO; ≥ 80K → YES).]
[Figure: an alternative decision tree for the same training data, this time with MarSt at the root and Refund tested below it, again ending in a TaxInc test (< 80K → NO; ≥ 80K → YES).]
There could be more than one tree that fits the same data!
Decision Tree Classification Task
[Figure: the overall workflow: a tree-induction algorithm learns a decision-tree model from the training set, and the model is then applied to the test set to assign class labels.]
Apply Model to Test Data
[Figure: the decision tree from the previous example, shown next to a single test record whose Cheat value is unknown.]
Start from the root of the tree and, at each decision node, follow the branch that matches the test record's attribute value.
The walk ends at a leaf labelled NO, so assign Cheat to “No”.
Decision Tree Classification
A decision tree is created in two phases:
■ Tree Building Phase
Repeatedly partition the training data until all the examples in each partition
belong to one class or the partition is sufficiently small
■ Tree Pruning Phase
Remove dependency on statistical noise or variation that may be particular
only to the training set
Tree Building Phase
General tree-growth algorithm (binary tree)
Partition(Data S)
    if all points in S are of the same class then return;
    for each attribute A do
        evaluate splits on attribute A;
    use the best split to partition S into S1 and S2;
    Partition(S1);
    Partition(S2);
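A runnable, deliberately simplified Python sketch of this Partition() algorithm is shown below. It scores binary splits of the form A ≤ v with Gini impurity and prints the recursion; the helper names and the synthetic data are illustrative assumptions, not part of the original slides.

```python
# Simplified, runnable version of the Partition() tree-growth sketch above.
import numpy as np

def gini(y):
    """Gini impurity of a label array."""
    _, counts = np.unique(y, return_counts=True)
    p = counts / counts.sum()
    return 1.0 - np.sum(p ** 2)

def best_split(X, y):
    """Return (feature index, threshold) of the best split A <= v, or (None, None)."""
    best_a, best_v, best_w = None, None, np.inf
    n = len(y)
    for a in range(X.shape[1]):
        for v in np.unique(X[:, a])[:-1]:          # candidate thresholds
            left = X[:, a] <= v
            w = (left.sum() / n) * gini(y[left]) + ((~left).sum() / n) * gini(y[~left])
            if w < best_w:
                best_a, best_v, best_w = a, v, w
    return best_a, best_v

def partition(X, y, depth=0, max_depth=3):
    """Recursively partition the data: Partition(S) -> Partition(S1), Partition(S2)."""
    classes, counts = np.unique(y, return_counts=True)
    if len(classes) == 1 or depth == max_depth:    # one class left, or depth limit reached
        print("  " * depth + f"leaf: {dict(zip(classes.tolist(), counts.tolist()))}")
        return
    a, v = best_split(X, y)
    if a is None:                                  # no usable split on any attribute
        print("  " * depth + f"leaf: {dict(zip(classes.tolist(), counts.tolist()))}")
        return
    print("  " * depth + f"split: feature {a} <= {v:.2f}")
    left = X[:, a] <= v
    partition(X[left], y[left], depth + 1, max_depth)    # Partition(S1)
    partition(X[~left], y[~left], depth + 1, max_depth)  # Partition(S2)

# Tiny synthetic demonstration.
rng = np.random.default_rng(0)
X = rng.normal(size=(30, 2))
y = (X[:, 0] + X[:, 1] > 0).astype(int)
partition(X, y)
```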
Tree Building Phase (cont.)
• The form of the split depends on the type of the attribute.
• Splits for numeric attributes are of the form A ≤ v, where v is a real number.
• Splits for categorical attributes are of the form A ∈ S′, where S′ is a subset of all possible values of A.
Why Decision Trees?
• Decision trees are powerful and popular tools for classification and prediction.
• Decision trees represent rules, which can be understood by humans and used in knowledge systems such as databases.
Key Requirements
Attribute-value description: each object or case must be expressible in terms of a fixed collection of properties or attributes (e.g., hot, mild, cold).
Predefined classes (target values): the target function has discrete output values (boolean or multiclass).
Sufficient data: enough training cases should be provided to learn the model.
Principled Criterion
Selection of an attribute to test at each node means choosing the most useful attribute for classifying examples.
■ Gini impurity: a measure of impurity or disorder in a set of data used in decision tree algorithms, particularly for classification problems.
■ Entropy: a measure of homogeneity of the set of examples.
■ Information gain: measures how well a given attribute separates the training examples according to their target classification. This measure is used to select among the candidate attributes at each step while growing the tree.
Calculation of Gini Impurity
Gini impurity is a measure of impurity or disorder in a set of data used in decision tree algorithms, particularly for classification problems.
• It indicates the likelihood of new, random data being misclassified if it were given a random class label according to the class distribution in the dataset.
• The Gini impurity for a node is calculated based on the distribution of class labels in that node.
The formula for Gini impurity is: Gini = 1 - Σ (pi)^2, summed over i = 1, ..., c, where:
• c is the number of classes.
• pi is the probability of the occurrence of class i in the node.
For a two-class problem, the Gini impurity ranges from 0 to 0.5, where 0 indicates perfect purity (all instances belong to a single class) and 0.5 indicates maximum impurity (an equal split between the two classes); with c classes the maximum is (c - 1)/c.
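A minimal helper for this formula, sketched in plain Python/NumPy (the function name is just illustrative):

```python
# Gini = 1 - sum_i pi^2, computed from the class labels found in a node.
import numpy as np

def gini_impurity(labels):
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return 1.0 - np.sum(p ** 2)

print(gini_impurity(["Yes"] * 5 + ["No"] * 5))  # 0.5: maximum impurity for two classes
print(gini_impurity(["Yes"] * 10))              # 0.0: perfectly pure node
```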
Here's a step-by-step example of calculating Gini impurity for a hypothetical dataset with two classes, "Yes" and "No".
[Table: the hypothetical dataset and the corresponding Gini calculation shown on the slide.]
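The slide's table is not reproduced here; as a stand-in with assumed counts, suppose the dataset has 6 "Yes" and 4 "No" instances. Then p(Yes) = 0.6 and p(No) = 0.4, so Gini = 1 - (0.6^2 + 0.4^2) = 1 - 0.52 = 0.48.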
• The node with uniform class distribution has the highest impurity.
• The minimum impurity is obtained when all records belong to the same class.
Several examples are given in the following table to demonstrate the Gini Impurity computation.
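The original table is not reproduced; an illustrative set of counts for a node with 6 records of two classes (C1, C2), computed with the formula above, would be: (0, 6) → Gini 0.000; (1, 5) → 0.278; (2, 4) → 0.444; (3, 3) → 0.500 (the uniform distribution, i.e. maximum impurity).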
• Gini Impurity is a measurement used to build Decision Trees to determine how the
features of a dataset should split nodes to form the tree.
• In the context of decision trees, when selecting a feature for splitting, the one that
results in the lowest Gini impurity after the split is chosen, as it indicates a more
"pure" separation of the classes.
Example: Gini Impurity
• For example, say you want to build a classifier that determines if someone will default on
their credit card. You have some labeled data with features, such as bins for age, income,
credit rating, and whether or not each person is a student.
• To find the best feature for the first split of the tree – the root node – you could calculate
how poorly each feature divided the data into the correct class, default ("yes") or didn't
default ("no").
• This calculation would measure the impurity of the split, and the feature with the lowest
impurity would determine the best feature for splitting the current node.
• This process would continue for each subsequent node using the remaining features.
• The attribute with the smallest Gini impurity is selected for splitting the node.
• If a data set D with n records is split on an attribute into two subsets D1 and D2 with sizes n1 and n2, respectively, the Gini impurity of the split is defined as the weighted average:
Gini_split(D) = (n1/n) * Gini(D1) + (n2/n) * Gini(D2)
• When training a decision tree, the attribute that provides the smallest Gini_split is chosen to split the node.
• To obtain the gain for an attribute, the weighted impurities of the branches are subtracted from the original impurity.
• The best split can also be chosen by maximizing the Gini gain, calculated as:
Gini gain = Gini(D) - Gini_split(D)
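As a worked illustration (the numbers here are made up, not from the slides): suppose a parent node D has 6 "Yes" and 4 "No" records, so Gini(D) = 1 - (0.6^2 + 0.4^2) = 0.48. A candidate split produces D1 = {4 Yes, 1 No} and D2 = {2 Yes, 3 No}, giving Gini(D1) = 1 - (0.8^2 + 0.2^2) = 0.32 and Gini(D2) = 1 - (0.4^2 + 0.6^2) = 0.48. The weighted impurity is (5/10)(0.32) + (5/10)(0.48) = 0.40, so the Gini gain of this split is 0.48 - 0.40 = 0.08.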
Gini Impurity and Gini Index
The Gini Index is essentially the same, expressed in percentage terms. It is
calculated as the Gini Impurity multiplied by 100
Gini Index = Gini Impurity ×100
Both Gini Impurity and Gini Index are widely used as criteria for evaluating
the quality of splits in decision trees, and the choice between them often
depends on the convention used in specific implementations or literature.
Entropy
A measure of homogeneity of the set of examples.
Given a set S of positive and negative examples of some target concept (a 2-class
problem), the entropy of set S relative to this binary classification is
E(S) = - p(P) log2 p(P) - p(N) log2 p(N)
Some Intuitions
The entropy is 0 if the outcome is "certain".
The entropy is maximum if we have no knowledge of the system (i.e., any outcome is equally possible).
[Figure: entropy of a 2-class problem plotted against the proportion of one of the two classes.]
Information gain is simply the expected reduction in entropy caused by partitioning the examples according to an attribute.
It is the number of bits saved when encoding the target value of an arbitrary member of S, by knowing the value of attribute A.
• For a two-class problem, the range of entropy is from 0 to 1, whereas the range of Gini impurity is from 0 to 0.5.
Calculation of Entropy
Entropy is a measure of impurity or disorder in a set of data. In the context of decision trees, entropy is used to determine the homogeneity of a node based on the distribution of class labels. The formula for entropy is:
H(S) = - Σ pi log2(pi), summed over the c classes, where pi is the proportion of class i in S.
The entropy is calculated for each node in a decision tree, and the goal is to minimize entropy by choosing the best features for splitting the data.
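A minimal Python sketch of the same calculation (classes with zero count never appear in the probability vector, so the 0 * log2(0) = 0 convention is handled implicitly):

```python
# H(S) = -sum_i pi * log2(pi), computed from the class labels in a node.
import numpy as np

def entropy(labels):
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return float(np.sum(-p * np.log2(p)) + 0.0)  # + 0.0 normalizes a possible -0.0

print(entropy(["Yes"] * 5 + ["No"] * 5))   # 1.0: maximum for two classes
print(entropy(["Yes"] * 10))               # 0.0: perfectly pure node
print(entropy(["Yes"] * 9 + ["No"] * 3))   # ~0.811
```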
Example:
Here's a step-by-step example of calculating entropy for a hypothetical dataset with two classes, "Yes" and "No".
[Table: the hypothetical dataset and calculation from the slide, which reports Gini Index = 0.48 and Entropy = 0.29.]
If all instances in a dataset belong to the same class (either all "Yes" or all "No"), then the dataset is perfectly pure, and there is no uncertainty or disorder in terms of class labels. In such a case, the entropy is 0, indicating complete homogeneity.
Recall the formula H(S) = - Σ pi log2(pi), where c is the number of classes and pi is the probability of class i in set S.
If all instances are of the same class, say "Yes," then p(Yes) = 1 and p(No) = 0. Substituting these values into the entropy formula:
H(S) = -(1 * log2(1) + 0 * log2(0)) = 0
Note that log2(0) is undefined, so we typically define 0 * log2(0) as 0 for the purpose of the calculation. Therefore H(S) = 0.
Similarly, if all instances are of the class "No," the entropy will also be 0.
If a dataset contains only one class, the entropy is 0, indicating perfect homogeneity and no uncertainty.
This is the ideal scenario for a node in a decision tree because it implies a clear and pure separation of
classes.
Entropy measure: let us get a feeling
As the data become purer and purer, the entropy value becomes smaller and smaller.
Difference Between Gini Impurity and Entropy
Both Gini impurity and entropy are measures of impurity or disorder used in decision tree
algorithms to determine the best split at each node. However, there are some differences
in how they calculate impurity.
Formula
Gini Impurity: It is calculated using the formula: Gini(t) = 1 - Σ (pi)^2, where pi is the
proportion of instances of class i in node t.
Entropy: It is calculated using the formula: Entropy(t) = -Σ pi * log2(pi), where pi is the
proportion of instances of class i in node t.
Scale
Gini Impurity: Gini impurity values range from 0 to 0.5 (for a two-class problem)
• where 0 represents perfect purity (all instances belong to a single class), and 0.5 represents maximum impurity (instances are evenly distributed across the classes).
Entropy: Entropy values range from 0 to log2(number of classes), i.e. 0 to 1 for a two-class problem
• where 0 represents perfect purity, and log2(number of classes) represents maximum impurity.
Decision Rule
Gini Impurity: Tends to favor larger partitions with dominant classes. It is less sensitive to imbalances in class distribution.
Entropy: Can be sensitive to imbalances in class distribution. It may result in slightly more balanced trees compared to Gini impurity.
Computation
Gini Impurity: Computationally less expensive compared to entropy, as it doesn't involve logarithmic
calculations.
Entropy: Requires logarithmic calculations, which can be relatively more computationally expensive.
In practice, both Gini impurity and entropy are commonly used, and the choice between them often
depends on the specific characteristics of the dataset and the desired properties of the resulting decision
tree.
Scikit-learn, a popular machine learning library in Python, allows users to choose between Gini impurity
and entropy as the criterion for splitting in decision tree classifiers.
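For example, a hedged sketch of that scikit-learn option:

```python
# The impurity criterion is a constructor argument in scikit-learn.
from sklearn.tree import DecisionTreeClassifier

gini_tree = DecisionTreeClassifier(criterion="gini")        # the default
entropy_tree = DecisionTreeClassifier(criterion="entropy")  # information-gain style splits
```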
Information Gain
Information gain measures the expected reduction in entropy, or uncertainty:
Gain(S, A) = Entropy(S) - Σ over v in Values(A) of (|Sv| / |S|) * Entropy(Sv)
■ Values(A) is the set of all possible values for attribute A, and Sv is the subset of S for which attribute A has value v: Sv = {s in S | A(s) = v}.
■ The first term in the equation for Gain is just the entropy of the original collection S.
■ The second term is the expected value of the entropy after S is partitioned using attribute A.
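A direct Python transcription of this definition, assuming a categorical attribute stored as column a of an array X (the names are illustrative; the entropy helper from earlier is repeated so the snippet is self-contained):

```python
# Gain(S, A) = Entropy(S) - sum_v (|Sv| / |S|) * Entropy(Sv)
import numpy as np

def entropy(labels):
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return float(np.sum(-p * np.log2(p)) + 0.0)

def information_gain(X, y, a):
    """Information gain of the categorical attribute stored in column a of X."""
    total = entropy(y)              # entropy of the original collection S
    n = len(y)
    expected = 0.0
    for v in np.unique(X[:, a]):    # each value v in Values(A)
        mask = X[:, a] == v         # Sv = {s in S | A(s) = v}
        expected += (mask.sum() / n) * entropy(y[mask])
    return total - expected         # expected reduction in entropy
```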
Relationship Between Entropy and Information Gain
Here's how entropy and information gain are related:
Entropy (H)
Entropy is a measure of impurity or disorder in a set of data. It quantifies the
uncertainty associated with the distribution of class labels in a node.
Higher entropy indicates higher disorder and uncertainty.
Information Gain (IG)
Information gain measures the reduction in entropy (uncertainty) achieved by splitting the
data based on a particular feature.
The feature with the highest information gain is chosen as the decision feature for the
split, as it provides the most significant reduction in uncertainty.
• There is a strong relationship between entropy and information gain in the
context of decision trees. Information gain is a concept used to measure the
effectiveness of a feature in reducing the uncertainty or entropy in a dataset. The
higher the information gain, the more effective the feature is for making decisions
in a decision tree.
• The decision tree algorithm seeks to maximize information gain when
selecting features for splitting.
• It chooses the feature that results in the highest reduction of entropy in
the dataset, leading to more homogenous and well-separated subsets.
• This process is repeated recursively, building the decision tree by selecting
features that maximize information gain at each decision node.
A simple example
You want to guess the outcome of next week's game between the MallRats and the
Chinooks.
Available knowledge (attributes):
■ Was the game at home or away?
■ Was the starting time 5 pm, 7 pm, or 9 pm?
■ Did Joe play center or forward?
■ Was the opponent's center tall or not?
■ …
Basketball data
[Table: the basketball training examples (attributes such as where and when, plus the game outcome) used in the worked example below.]
What we know
The game will be away, at 9 pm, and Joe will play center on offense…
A classification problem
Generalizing the learned rule to new examples
Definition
● A decision tree is a classifier in the form of a tree structure
– Decision node: specifies a test on a single attribute
– Leaf node: indicates the value of the target attribute
– Arc/edge: a split on one attribute
– Path: a conjunction of tests that leads to the final decision
● Decision trees classify instances or examples by starting at the root of the tree and moving through it until a leaf node is reached.
Illustration
(1) Which attribute to start with (the root)?
(2) Which node to proceed to next?
(3) When to stop / come to a conclusion?
Random split
• The tree can grow huge.
• These trees are hard to understand.
• Larger trees are typically less accurate than smaller trees.
Example
Before partitioning, the entropy is
■ H(10/20, 10/20) = - 10/20 log(10/20) - 10/20 log(10/20) = 1
Using the "where" attribute, divide into 2 subsets
■ Entropy of the first set: H(home) = - 6/12 log(6/12) - 6/12 log(6/12) = 1
■ Entropy of the second set: H(away) = - 4/8 log(4/8) - 4/8 log(4/8) = 1
Expected entropy after partitioning
■ 12/20 * H(home) + 8/20 * H(away) = 1
■ Information gain is 1 - 1 = 0
Using the "when" attribute, divide into 3 subsets
■ Entropy of the first set: H(5pm) = - 1/4 log(1/4) - 3/4 log(3/4) ≈ 0.81
■ Entropy of the second set: H(7pm) = - 9/12 log(9/12) - 3/12 log(3/12) ≈ 0.81
■ Entropy of the third set: H(9pm) = - 0/4 log(0/4) - 4/4 log(4/4) = 0
Expected entropy after partitioning
■ 4/20 * H(1/4, 3/4) + 12/20 * H(9/12, 3/12) + 4/20 * H(0/4, 4/4) = 0.65
Information gain = 1 - 0.65 = 0.35
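These numbers can be checked with a few lines of Python; the per-subset class counts below are taken directly from the calculation above, and H() mirrors the two-argument entropy notation used on the slide:

```python
# Checking the "where" vs. "when" information-gain computation above.
import numpy as np

def H(counts):
    p = np.array(counts, dtype=float)
    p = p[p > 0] / p.sum()          # drop empty classes, normalize
    return float(np.sum(-p * np.log2(p)) + 0.0)

root = H([10, 10])                                          # 1.0
where = (12 / 20) * H([6, 6]) + (8 / 20) * H([4, 4])        # 1.0
when = (4 / 20) * H([1, 3]) + (12 / 20) * H([9, 3]) + (4 / 20) * H([0, 4])

print(root - where)  # 0.0   (no information gain from "where")
print(root - when)   # ~0.35 (information gain from "when")
```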
Decision
Knowing the "when" attribute values provides larger information gain than "where". Therefore the "when" attribute should be chosen for testing prior to the "where" attribute.
Similarly, we can compute the information gain for other attributes.
At each node, choose the attribute with the largest information gain.
Iris Data Set
• The Iris dataset is a well-known dataset in the field of machine learning and statistics. It was introduced by the
British biologist and statistician Sir Ronald A. Fisher in 1936. The dataset is commonly used for practicing and
demonstrating various classification algorithms, particularly in the context of supervised learning.
• The Iris dataset consists of measurements of four features (attributes) of iris flowers from three different
species: setosa, versicolor, and virginica. The four features are:
• Sepal Length: Length of the iris flower's sepal (the outermost whorl of a flower).
• Sepal Width: Width of the iris flower's sepal.
• Petal Length: Length of the iris flower's petal (the innermost whorl of a flower).
• Petal Width: Width of the iris flower's petal.
• Each sample in the dataset represents an individual iris flower, and the corresponding
label indicates the species to which the flower belongs (setosa, versicolor, or virginica).
• The Iris dataset is commonly used as a beginner's dataset for learning and practicing
classification algorithms due to its simplicity and the clear separation of the three species
based on the provided features.
• The Iris dataset is available in many machine learning libraries, including scikit-learn,
making it easily accessible for experimentation and learning purposes.
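As a hedged end-to-end sketch (not part of the original slides), the snippet below loads the Iris data from scikit-learn, fits a small decision tree, and prints the per-class precision/recall/F1 report discussed in the next section; exact scores depend on the train/test split and the chosen parameters.

```python
# Decision tree on the Iris data with a per-class metrics report.
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import classification_report

iris = load_iris()
X_train, X_test, y_train, y_test = train_test_split(
    iris.data, iris.target, test_size=0.3, random_state=42, stratify=iris.target)

clf = DecisionTreeClassifier(criterion="gini", max_depth=3, random_state=42)
clf.fit(X_train, y_train)

print(classification_report(y_test, clf.predict(X_test), target_names=iris.target_names))
```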
Iris Data Set
Setosa (Iris setosa)
• Setosa is one of the three species in the Iris dataset.
• It is characterized by its distinctive features, including smaller size, a unique petal shape, and color patterns.
• Setosa flowers typically have a shorter sepal length compared to the other two species.
Versicolor (Iris versicolor)
• Versicolor is another species in the Iris dataset.
• It is known for its versatility in terms of petal and sepal characteristics.
• Versicolor flowers have medium-sized petals and sepals.
Virginica (Iris virginica)
• Virginica is the third species included in the Iris dataset.
• It is characterized by larger petal and sepal sizes compared to Setosa and Versicolor.
• Virginica flowers often exhibit different color patterns.
Performance of Classification Algorithms
• Precision, recall, and F1-score are metrics commonly used to evaluate the
performance of classification algorithms, particularly in the context of binary or
multiclass classification problems.
• These metrics provide insights into how well a model is performing in terms of
correctly classifying instances from different classes.
• Let's break down each metric. These metrics are commonly reported together to
provide a comprehensive assessment of a model's performance, especially in
situations where achieving a balance between false positives and false negatives is
crucial.
• Precision: "Of all instances predicted as positive, how many are actually
positive?"
• Recall: "Of all actual positive instances, how many were correctly
predicted?"
• F1-Score: "A balance between precision and recall. It is the harmonic mean
of the two."
• Support: "The number of actual occurrences of the class in the dataset."
Let's use a hypothetical binary classification scenario to illustrate precision, recall, and F1-score. Consider
the following confusion matrix for a binary classification problem:
In this confusion matrix:
True Positives (TP) = 85
True Negatives (TN) = 90
False Positives (FP) = 10
False Negatives (FN) = 15
Let's create a simple example with two columns, "Actual Class" and "Predicted Class," each
containing ten rows. We will use binary classification, where the classes are either 0 or 1.
In this example:
True Positives (TP): Rows where both "Actual Class" and "Predicted Class" are 1. (e.g., row 2, row
5, row 8, row 9)
True Negatives (TN): Rows where both "Actual Class" and "Predicted Class" are 0. (e.g., row 6,
row 10)
False Positives (FP): Rows where "Actual Class" is 0, but "Predicted Class" is 1. (e.g., row 1, row 4)
False Negatives (FN): Rows where "Actual Class" is 1, but "Predicted Class" is 0. (e.g., row 3, row
7)
Now, let's calculate Precision, Recall, and F1-Score:
• Precision = TP / (TP + FP) = 4 / (4 + 2) = 0.67
• Recall = TP / (TP + FN) = 4 / (4 + 2) = 0.67
• F1-Score = 2 * (Precision * Recall) / (Precision + Recall) = 2 * (0.67 * 0.67) / (0.67 + 0.67) = 0.67
In this example, Precision and Recall are both 0.67, indicating that out of the instances
predicted as positive, 67% are actually positive, and out of all actual positive instances,
67% were correctly predicted. The F1-Score provides a balanced measure, and in this case,
it is also 0.67.
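The ten-row table itself is not shown above, but label vectors consistent with the listed TP/TN/FP/FN rows can be reconstructed; the sketch below does so and confirms the 0.67 values with scikit-learn.

```python
# Label vectors reconstructed from the TP/TN/FP/FN rows listed above (rows 1..10).
from sklearn.metrics import precision_score, recall_score, f1_score

actual    = [0, 1, 1, 0, 1, 0, 1, 1, 1, 0]
predicted = [1, 1, 0, 1, 1, 0, 0, 1, 1, 0]

print(precision_score(actual, predicted))  # 4 / (4 + 2) = 0.666...
print(recall_score(actual, predicted))     # 4 / (4 + 2) = 0.666...
print(f1_score(actual, predicted))         # 0.666...
```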
Now, let's calculate precision, recall, and F1-score for the earlier confusion matrix (TP = 85, TN = 90, FP = 10, FN = 15) using the formulas mentioned earlier:
• Precision = TP / (TP + FP) = 85 / (85 + 10) ≈ 0.8947
• Recall = TP / (TP + FN) = 85 / (85 + 15) = 0.85
• F1-Score = 2 * (Precision * Recall) / (Precision + Recall) = 2 * (0.8947 * 0.85) / (0.8947 + 0.85) ≈ 0.872
So, in this example, precision is approximately 0.8947, recall is 0.85, and the F1-score is approximately 0.872.
Precision
• Precision is the ratio of correctly predicted positive observations to the total
predicted positives.
• It is calculated as the number of true positives divided by the sum of true positives
and false positives.
• Precision is a measure of how many of the predicted positive instances are actually
positive.
• The formula for precision is: Precision = TP / (TP + FP)
Recall (Sensitivity or True Positive Rate)
• Recall is the ratio of correctly predicted positive observations to the total actual
positives.
• It is calculated as the number of true positives divided by the sum of true positives
and false negatives.
• Recall measures the ability of the model to capture all the actual positive instances.
• The formula for recall is: Recall = TP / (TP + FN)
F1-Score
• F1-score is the harmonic mean of precision and recall. It provides a balance
between precision and recall.
• It is calculated as 2 * (Precision * Recall) / (Precision + Recall).
• F1-score reaches its best value at 1 (perfect precision and recall) and worst at 0.
• F1-score is particularly useful when the class distribution is imbalanced.
Support
• Support is the number of actual occurrences of the class in the specified dataset.
• It is the number of instances belonging to each class.
Practice Problem
Scenario: Email Spam Classification
• Suppose you have a dataset of emails labeled as either spam or ham (non-spam), and you want to build a classification model to identify spam emails. The initial performance of the model may look like the confusion matrix summarized below.
• What is the total number of rows in the dataset?
In this confusion matrix:
True Positives (TP): 80
True Negatives (TN): 950
False Positives (FP): 50
False Negatives (FN): 20
The total number of rows in the dataset (total instances or observations) can be calculated by adding
up all these values:
Total Rows = TP + TN + FP + FN
Using the values from the confusion matrix:
Total Rows = 80 + 950 + 50 + 20 = 1100
So, there are a total of 1100 rows or instances in the dataset based on the information provided in the
confusion matrix.
Each row corresponds to a specific instance (email in this case) that the model has made predictions
on, and the confusion matrix summarizes how those predictions align with the true classes.
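A quick sketch that checks this total and, as an optional extension using the formulas from earlier in this section, computes the corresponding metrics:

```python
# Counts from the practice-problem confusion matrix.
TP, TN, FP, FN = 80, 950, 50, 20

total = TP + TN + FP + FN
precision = TP / (TP + FP)
recall = TP / (TP + FN)
f1 = 2 * precision * recall / (precision + recall)

print(total)      # 1100
print(precision)  # 80 / 130 ~ 0.615
print(recall)     # 80 / 100 = 0.80
print(f1)         # ~ 0.696
```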
Editor's Notes

• #11 Decision tree technique (ID3 algorithm): construct the tree, then extract classification rules. Training data + classification algorithm → classifier model; evaluate the model's performance and accuracy on a test data set. If-Then classification rules read off the example tree: If Refund is Yes then Cheat is No. If Refund is No and MarSt is Married then Cheat is No. If Refund is No and MarSt is Single or Divorced and TaxInc < 80K then Cheat is No. If Refund is No and MarSt is Single or Divorced and TaxInc >= 80K then Cheat is Yes.