3. Content
● Machine Learning
● Decision Tree Overview
● Examples, Splitting Criteria and Process
● Feature Selection and Extraction, Real-World Problems
● Training and Testing data set
● Advantages and disadvantages of Decision Tree
● Building Decision Tree
● Decision Tree Algorithms
● Missing Data, Effective decision tree
● Conclusion
5. Machine Learning
● Machine learning (ML) investigates how computers can learn from data. ML
approaches have been applied to large language models, computer vision, speech
recognition, email filtering, agriculture and medicine.
● The term machine learning was coined in 1959 by Arthur Samuel. The synonym self-
teaching computers was also used in this time period.
● Machine learning and data mining often employ the same methods and overlap
significantly, but while ML focuses on prediction, based on known properties learned
from the training data, data mining focuses on the discovery of (previously) unknown
properties in the data.
6. ● Machine learning also employs data mining methods as "unsupervised learning" or
as a preprocessing step to improve learner accuracy.
● Modern-day machine learning has two objectives: one is to classify data based on
models which have been developed; the other is to make predictions for future
outcomes based on these models.
● The mathematical foundations of ML are provided by mathematical optimization
(mathematical programming) methods.
● Machine learning approaches are traditionally divided into three broad categories:
supervised learning, unsupervised learning and reinforcement learning.
8. Supervised Learning
● Supervised learning algorithms build a mathematical model of a set of data that
contains both the inputs and the desired outputs. The data is known as training data
and consists of a set of training examples. Each training example is represented by
an array or vector, sometimes called a feature vector, and the training data is
represented by a matrix.
● Types of supervised-learning algorithms include active learning, classification and
regression.
Classification algorithms are used when the outputs are restricted to a limited set of values, and
regression algorithms are used when the outputs may have any numerical value within a range.
For example, for a classification algorithm that filters emails, the input would be an incoming
email and the output would be the name of the folder in which to file the email.
10. Decision Tree
● Decision Tree is a supervised learning technique that can be used for both
classification and regression problems, but mostly it is preferred for solving
classification problems. It is a tree-structured classifier, where internal nodes
represent the features of a dataset, branches represent the decision rules and
each leaf node represents the outcome.
● In a Decision tree, there are two types of nodes: the Decision Node and the Leaf
Node. Decision nodes are used to make decisions and have multiple branches,
whereas Leaf nodes are the outputs of those decisions and do not contain any
further branches.
11. ● The decisions or tests are performed on the basis of features of the given
dataset.
● It is a graphical representation for getting all the possible solutions to a
problem/decision based on given conditions.
● A decision tree simply asks a question, and based on the answer (Yes/No),
it further splits the tree into subtrees.
● The basic algorithm used in decision trees is known as the ID3 (by Quinlan)
algorithm. The ID3 algorithm builds decision trees using a top-down, greedy
approach.
13. Decision Tree code in Python
# Import necessary libraries
from sklearn import tree
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score
# Load the iris dataset (as an example)
iris = load_iris()
X = iris.data
y = iris.target
# Split the dataset into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
# Create a decision tree classifier
clf = tree.DecisionTreeClassifier()
# Train the classifier on the training set
clf.fit(X_train, y_train)
# Make predictions on the test set
y_pred = clf.predict(X_test)
# Evaluate the accuracy of the model
accuracy = accuracy_score(y_test, y_pred)
print(f"Accuracy: {accuracy * 100:.2f}%")
15. OUTPUT
The output of the code is the accuracy of the decision tree classifier on the
test set. Because the train/test split is fixed by random_state=42, the result is
reproducible; with a different seed or dataset the accuracy may vary. Here's an
example of what the output might look like:
OUTPUT:
Accuracy: 100.00%
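As a brief extension (not part of the original slides), the trained tree can also be inspected
directly. The snippet below is a minimal sketch that assumes the clf and iris variables from the
code above:
# Print the learned decision rules as indented text, using the iris feature names
from sklearn.tree import export_text
print(export_text(clf, feature_names=list(iris.feature_names)))
# Alternatively, the tree can be drawn with matplotlib (if installed):
# import matplotlib.pyplot as plt
# tree.plot_tree(clf, feature_names=iris.feature_names, class_names=iris.target_names, filled=True)
# plt.show()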
17. Decision Tree Algorithm
● Decision Tree algorithm belongs to the family of supervised learning algorithms.
● Unlike many other supervised learning algorithms, the decision tree algorithm can be
used for solving both regression and classification problems.
● The goal of using a Decision Tree is to create a training model that can be used to
predict the class or value of the target variable by learning simple decision rules
inferred from the training data.
● In Decision Trees, for predicting a class label for a record we start from the root
of the tree.
18. Types of Decision Trees
Types of decision trees are based on the type of target variable we have. There are two
types:
1. Categorical Variable Decision Tree: a decision tree with a categorical target
variable is called a Categorical Variable Decision Tree.
2. Continuous Variable Decision Tree: a decision tree with a continuous target
variable is called a Continuous Variable Decision Tree.
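As an illustration (a minimal scikit-learn sketch, not from the original slides; the data and
column meanings below are made up for the example):
from sklearn.tree import DecisionTreeClassifier, DecisionTreeRegressor
# Toy feature matrix, e.g. [age, income], shared by both cases below.
X = [[25, 30000], [35, 60000], [45, 80000], [50, 40000]]
# Categorical target -> Categorical Variable Decision Tree (classification).
y_class = ["No", "Yes", "Yes", "No"]
clf = DecisionTreeClassifier().fit(X, y_class)
# Continuous target -> Continuous Variable Decision Tree (regression).
y_value = [200.0, 550.0, 760.0, 320.0]
reg = DecisionTreeRegressor().fit(X, y_value)
print(clf.predict([[40, 70000]]))   # predicted class label
print(reg.predict([[40, 70000]]))   # predicted numeric value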
20. Important Terminology related to Decision Trees
➢ Root Node: It represents the entire population or sample and this further gets
divided into two or more homogeneous sets.
➢ Splitting: It is a process of dividing a node into two or more sub-nodes.
➢ Decision Node: When a sub-node splits into further sub-nodes, it is called
a decision node.
➢ Leaf / Terminal Node: Nodes that do not split are called Leaf or Terminal nodes.
21. ➢ Pruning: When we remove sub-nodes of a decision node, the process is called
pruning. It can be seen as the opposite of splitting.
➢ Branch / Sub-Tree: A subsection of the entire tree is called a branch or sub-tree.
➢ Parent and Child Node: A node which is divided into sub-nodes is called the
parent node of those sub-nodes, and the sub-nodes are the children of the parent node.
22. How do Decision Trees work ?
● The choice of where to make splits heavily affects a tree’s accuracy. The
decision criteria are different for classification and regression trees.
● Decision trees use multiple algorithms to decide to split a node into two or more
sub-nodes.
● The creation of sub-nodes increases the homogeneity of resultant sub-nodes and
increases purity of the node with respect to the target variable.
● The decision tree considers splits on all available variables and then selects the
split which results in the most homogeneous sub-nodes.
23. Node Splitting in a Decision Tree
● Node splitting, or simply splitting, divides a node into multiple sub-nodes to
create relatively pure nodes.
● This is done by finding the best split for a node and can be done in multiple
ways.
The ways of splitting a node can be broadly divided into two categories based on the
type of target variable:
❏ Continuous Target Variable: Reduction in Variance
❏ Categorical Target Variable: Gini Impurity, Information Gain, and Chi-Square
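A minimal sketch of these measures (not from the original slides), using hypothetical class
counts and target values:
import numpy as np
def gini(counts):
    # Gini impurity of a node given its class counts, e.g. [9, 5]; 0 for a pure node.
    p = np.array(counts) / np.sum(counts)
    return 1.0 - np.sum(p ** 2)
def entropy(counts):
    # Entropy of a node given its class counts; used for information gain.
    p = np.array(counts) / np.sum(counts)
    p = p[p > 0]
    return -np.sum(p * np.log2(p))
def variance_reduction(parent_values, child_groups):
    # Reduction in variance, used when the target variable is continuous.
    n = len(parent_values)
    weighted = sum(len(g) / n * np.var(g) for g in child_groups)
    return np.var(parent_values) - weighted
# Hypothetical node with 9 samples of one class and 5 of the other:
print(gini([9, 5]))      # ~0.459
print(entropy([9, 5]))   # ~0.940
# Hypothetical continuous target: a good split gives a large variance reduction.
print(variance_reduction([10, 12, 30, 32], [[10, 12], [30, 32]]))   # 100.0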
24. Example Question
Suppose there is a candidate who has a job offer and wants to decide whether he
should accept the offer or not.
● The root node (Salary) splits further into the next decision node (distance from the office).
● That decision node further splits into one decision node (Cab facility) and one leaf node.
● The last decision node splits into two leaf nodes (Accepted offer and Declined offer).
26. Feature Selection And Feature Extraction,
Real World Problems
Name-Satyabrata Dwivedy
Registration number-2004050002
27. Need for reduction
● Classification of leukemia tumors from microarray gene expression
data
○ 72 patients (data points)
○ 7130 features (expression levels of different genes)
● Text mining, document classification
○ features are words
● Quantitative Structure-Activity Relationship (QSAR)
○ features are molecular descriptors, there exist plenty of them
○ biological activity
■ an expression describing the beneficial or adverse effects of a drug on living
matter
○ Structure-Activity Relationship (SAR)
■ hypotheses that similar molecules have similar activities
○ molecular descriptor
■ mathematical procedure transforms chemical information encoded within a
symbolic representation of a molecule into a useful number
28. Molecular Descriptor
● Adjacency (connectivity) matrix: the total adjacency index AV, the sum of all aij entries,
is a measure of the graph connectedness.
● Randić connectivity indices: a measure of the molecular branching.
29. QSAR
• Form a mathematical/statistical relationship (model) between structural
(physiochemical) properties and activity.
• The mathematical expression can then be used to predict the biological
response of other chemical structures.
30. Selection vs. Extraction
● In feature selection we try to find the best subset of the input feature set.
● In feature extraction we create new features based on transformation or
combination of the original feature set.
● Both selection and extraction lead to the dimensionality reduction.
● There is no clear-cut evidence that one of them is superior to the other on all types
of tasks.
Why do it?
● We’re interested in features – we want to know which are relevant. If
we fit a model, it should be interpretable.
■ facilitate data visualization and data understanding
■ reduce experimental costs (measurements)
● We’re interested in prediction – features are not interesting in
themselves, we just want to build a good predictor.
■ faster training
■ defy the curse of dimensionality
32. Classification of FS methods
• Filter
– Assess the relevance of features only by looking at the intrinsic
properties of the data.
– Usually, calculate the feature relevance score and remove low-
scoring features.
• Wrapper
– Bundle the search for best model with the FS.
– Generate and evaluate various subsets of features. The
evaluation is obtained by training and testing a specific ML
model.
• Embedded
– The search for an optimal subset is built into the classifier
construction (e.g. decision trees).
33. Filter Methods
● Two steps (score-and-filter approach)
○ assess each feature individually for its potential in discriminating among classes
in the data
○ features falling below a threshold are eliminated
● Advantages:
○ easily scale to high-dimensional data
○ simple and fast
○ independent of the classification algorithm
● Disadvantages:
○ ignore the interaction with the classifier
○ most techniques are univariate (each feature is considered separately)
Scores in filter methods (complexity: O(d))
● Information measures: information gain, mutual information
● Distance measures: Euclidean distance
● Dependence measures: Pearson correlation coefficient, χ²-test, t-test, AUC
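A small sketch of the score-and-filter approach in scikit-learn (an assumed setup, not from the
original slides): each feature is scored individually and only the k best-scoring ones are kept.
from sklearn.datasets import load_iris
from sklearn.feature_selection import SelectKBest, mutual_info_classif
X, y = load_iris(return_X_y=True)
# Score each feature on its own (here: mutual information with the class label)
# and keep only the k highest-scoring features.
selector = SelectKBest(score_func=mutual_info_classif, k=2)
X_reduced = selector.fit_transform(X, y)
print(selector.scores_)        # relevance score of each original feature
print(selector.get_support())  # boolean mask of the selected features
print(X_reduced.shape)         # (150, 2)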
34. Wrappers
● Search for the best feature subset in combination with a fixed classification
method.
● The goodness of a feature subset is determined using cross-validation (k-fold,
LOOCV)
● Advantages:
○ interaction between feature subset and model selection
○ take into account feature dependencies
○ generally more accurate
● Disadvantages:
○ higher risk of overfitting than filter methods
○ very computationally intensive
35. Exhaustive Search
• Evaluate all possible subsets using exhaustive search – this leads to the optimum subset.
• For a total of d variables and a subset of size p, the total number of possible subsets is
C(d, p) = d! / (p! (d − p)!).
• Complexity: O(2^d) (exponential)
• There are various strategies to reduce the search space.
– They are still O(2^d), but much faster (at least 1000 times)
– e.g. “branch and bound”
e.g. d = 100, p = 10 → C(100, 10) ≈ 2×10^13
37. Sequential Forward Selection
• SFS
• At the beginning select the best feature using a scalar
criterion function.
• Add one feature at a time which along with already
selected features maximizes the criterion function.
• A greedy algorithm, cannot retract (also called nesting
effect).
• Complexity is O(d)
38. Sequential Backward Selection
• SBS
• At the beginning select all d features.
• Delete one feature at a time and select the
subset which maximize the criterion function.
• Also a greedy algorithm, cannot retract.
• Complexity is O(d).
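A minimal wrapper-style sketch (an assumed setup, not from the original slides): scikit-learn's
SequentialFeatureSelector implements the forward/backward searches described above, scoring each
candidate subset by cross-validation with a fixed classifier.
from sklearn.datasets import load_iris
from sklearn.feature_selection import SequentialFeatureSelector
from sklearn.tree import DecisionTreeClassifier
X, y = load_iris(return_X_y=True)
# Fixed classification method wrapped by the search.
model = DecisionTreeClassifier(random_state=0)
# Forward selection (SFS): add one feature at a time, each subset scored by 3-fold CV.
sfs = SequentialFeatureSelector(model, n_features_to_select=2, direction="forward", cv=3)
sfs.fit(X, y)
print(sfs.get_support())   # mask of the selected features
# Backward selection (SBS) is obtained with direction="backward".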
39. “Plus q take away r” Selection
• At first add q features by forward selection, then
discard r features by backward selection
• Need to decide optimal q and r
• Avoids the subset nesting problem of SFS and SBS
40. Sequential Forward Floating Search
• SFFS
• It is a generalized “plus q take away r” algorithm
• The value of q and r are determined
automatically
• Close to optimal solution
• Affordable computational cost
• Also exists in a backward version (Sequential Backward Floating Search)
41. Embedded FS
● The feature selection process is done inside the ML
algorithm.
● Decision trees
○ In final tree, only a subset of features are used
● Regularization
○ It effectively “shuts down” unnecessary features.
○ Pruning in NN.
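A short sketch of embedded selection (an assumed setup, not from the original slides): the fitted
model itself reveals which features it actually uses.
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.feature_selection import SelectFromModel
X, y = load_iris(return_X_y=True)
# Decision tree: features with zero importance were never used in the final tree.
tree_clf = DecisionTreeClassifier(random_state=0).fit(X, y)
print(tree_clf.feature_importances_)
# L1 regularization: coefficients driven to zero effectively "shut down" those features.
l1_model = LogisticRegression(penalty="l1", solver="liblinear", C=0.1)
selector = SelectFromModel(l1_model).fit(X, y)
print(selector.get_support())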
43. ● FS – identify and select the “best” features with respect to the
target task.
● Selected features retain their original physical interpretation.
● FE – create new features as a transformation (combination) of
original features. Usually followed by FS.
● May provide better discriminatory ability than the best subset.
● Do not retain the original physical interpretation, may not
have clear meaning.
46. (Figure: data plotted in the x1–x2 plane.) The variability in the data is highest along one
direction; this direction is called the 1st principal component. The orthogonal direction is the
2nd principal component.
47. Principal components (PCs) are linear combinations of the original coordinates, e.g.
w0 + w1·x1 + w2·x2 and w'0 + w'1·x1 + w'2·x2.
The coefficients of the linear combination (w0, w1, …) are called loadings.
In the transformed coordinate system, individual data points have new coordinates; these are
called scores.
48. ● PCA - orthogonal linear transformation that changes the data into a new
coordinate system such that the variance is put in order from the greatest to the
least.
● Solve the problem = find new orthogonal coordinate system = find loadings
● PCs (vectors) and their corresponding variances (scalars) are found by
eigenvalue decomposition of the covariance matrix C = XXᵀ of the xi variables.
○ Eigenvector corresponding to the largest eigenvalue is 1st PC.
○ The 2nd eigenvector (the 2nd largest eigenvalue) is orthogonal to the 1st
one. …
● Eigenvalue decomposition is computed using standard algorithms: eigen
decomposition of covariance matrix (e.g. QR algorithm), SVD of mean centered
data matrix.
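A compact sketch of this (toy data assumed, not from the original slides): PCA via
eigendecomposition of the covariance matrix, checked against scikit-learn's SVD-based
implementation.
import numpy as np
from sklearn.decomposition import PCA
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))          # assumed toy data: 100 samples, 3 features
# Eigendecomposition of the covariance matrix of the mean-centered data.
Xc = X - X.mean(axis=0)
C = np.cov(Xc, rowvar=False)
eigvals, eigvecs = np.linalg.eigh(C)   # returned in ascending order
order = np.argsort(eigvals)[::-1]      # sort from largest to smallest variance
loadings = eigvecs[:, order]           # columns are the PCs (loadings)
scores = Xc @ loadings                 # coordinates of the points in the new system
# Same result (up to sign) from scikit-learn, which uses SVD internally.
pca = PCA().fit(X)
print(eigvals[order])                  # variances of the PCs (eigenvalues)
print(pca.explained_variance_)         # should match the eigenvalues above
print(pca.explained_variance_ratio_)   # relative variance explained, lambda_i / sum(lambda)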
49. Interpretation of PCA
● New variables (PCs) have a variance equal to their
corresponding eigenvalue
Var(Yi) = λi for all i = 1…p
● Small λi ⬄ small variance ⬄ data changes little in the
direction of component Y
● The relative variance explained by each PC is given by λi /Σ λi
How many components?
● Enough PCs to have a cumulative variance explained by
the PCs that is >50-70%
● Kaiser criterion: keep PCs with eigenvalues >1
● Scree plot: represents the ability of the PCs to explain the
variation in the data
50. Topic:-Training and Testing sets
Advantages and Disadvantages of Decision Tree
Name-Smriti Panda
Registration number-2004050024
51. Training Set of data
● A decision tree is built to be consistent with the training set, i.e. it correctly classifies the training examples.
• The training set is used to build the decision tree. During this phase:
❑ The algorithm selects the best attribute to split the data based on
metrics like entropy or Gini impurity.
❑ The goal is to find the attribute that maximizes information gain or
reduces impurity after the split.
❑ The decision tree is constructed by recursively partitioning the data
based on attribute values.
❑ Each node in the tree represents a split point based on an attribute.
❑ The tree grows until a stopping criterion (e.g., maximum depth or
minimum samples per leaf) is met.
52. Training Set of data
● Typically, for decision tree classification, the model should be
learned on training data with a predefined set of labels.
● It would predict a label (i.e., class) for new samples.
● So, we have a dataset with different attributes (features). Each
sample has its own combination of the value of the features.
54. Training Procedure
The steps involved in System Model of training are as follows:
1. Analysis and Identification: Analyze and identify the training
needs: who needs training, what they need to learn, the estimated training
cost, etc. The next step is to develop a performance measure on the basis
of which actual performance will be evaluated.
2. Designing:
Design and provide training to meet identified needs. This step requires
developing objectives of training, identifying the learning steps,
sequencing and structuring the contents.
3. Developing:
This phase requires listing the activities in the training program that will
assist the participants to learn, selecting delivery method, examining the
training material and validating information to be imparted to make sure
it accomplishes all the goals and objectives.
55. Training Procedure
4. Implementation: Implementing is the hardest part of the
system because one wrong step can lead to the failure of the whole
training programme.
5. Evaluation: Evaluating each phase so as to make sure it has
achieved its aim in terms of subsequent work performance, and
making necessary amendments to any of the previous stages in
order to remedy or improve failing practices.
56. Testing Set of data
• The test set is used to evaluate the performance of the decision tree.
• After constructing the tree using the training data, you evaluate how well it
performs on unseen data.
• For each instance in the test set, you call a function (often
named classify), passing in the newly-built tree and the data point you
want to classify.
• The function returns the leaf node to which the data point belongs,
effectively assigning a class label.
• By comparing the assigned class label to the actual label, you assess the
tree’s performance.
• A common practice is to shuffle the data and allocate 80% to training and
the remaining 20% to testing.
• The training set helps the decision tree to learn, while the test set
evaluates its accuracy.
59. Model-based testing
• Model-based testing is a software testing technique where the
run time behavior of the software under test is checked against
predictions made by a model.
• A model is a description of a system’s behavior.
• Behavior can be described in terms of input sequences,
actions, conditions, output, and flow of data from input to output.
• It should be practically understandable, reusable and shareable, and
must give a precise description of the system under test.
60. Example: Predicting Diabetes
• To illustrate the use of decision trees, let's consider a simple example of predicting diabetes based
on certain features.
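A minimal sketch of the steps described on the next slide (assuming a diabetes_data.csv file with
feature columns such as age, blood pressure and glucose level, and an Outcome column with
1 = diabetes, 0 = no diabetes):
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score
# Load the dataset (file name and columns as described on the next slide).
data = pd.read_csv("diabetes_data.csv")
X = data.drop(columns=["Outcome"])
y = data["Outcome"]
# Split the data into training and testing sets.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
# Build and train the decision tree model.
model = DecisionTreeClassifier()
model.fit(X_train, y_train)
# Evaluate the model on the testing set and print the accuracy.
y_pred = model.predict(X_test)
print(f"Accuracy: {accuracy_score(y_test, y_pred) * 100:.2f}%")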
61. Example: Predicting Diabetes
• In this example, we used the diabetes_data.csv
dataset, which contains various features related to
diabetes, such as age, blood pressure, and glucose
level.
• The target variable, Outcome, indicates whether the
patient has diabetes (1) or not (0).
• We split the data into training and testing sets and then
built a decision tree model using the
DecisionTreeClassifier class from scikit-learn.
• Finally, we evaluated the model on the testing set and
printed the accuracy.
62. A simple decision tree
The figure shows that the decision tree starts from
the root node and, after a series of tests on the features,
ends in leaf nodes that give the result.
63. Advantages of Decision Tree
● Non-Parametric: Decision trees do not assume specific underlying data
distributions. This flexibility allows them to be applied to diverse problems
without worrying about model assumptions.
● Handling Categorical Values: Decision trees can naturally handle categorical
features without requiring explicit encoding or transformation.
● Minimal Data Preparation: Unlike some other algorithms, decision trees
require minimal data preprocessing. They can work directly with raw features,
reducing the need for extensive feature engineering.
● Non-Linear Models: Decision trees are inherently non-linear. They represent
piece-wise functions of various features in the feature space, making them
suitable for complex problems where linearity cannot be assumed.
64. Advantages of Decision Tree
• Relatively Easy to Interpret: Trained decision trees are generally
intuitive to understand. Their entire structure can be visualized as a
simple flow chart, making it easier for analysts and stakeholders to
grasp the decision-making process.
• Robust to Outliers: Well-regularized decision trees handle outliers
well. Predictions are generated by aggregating over a subsample of
training data, reducing the impact of outliers.
• Can Deal with Missing Values: The CART (Classification and
Regression Trees) algorithm naturally handles missing values.
Decision trees can be constructed without additional preprocessing to
address missing data.
• Combining Features for Predictions: Decision trees combine
decision rules (if-else conditions on input features) via AND
relationships as they traverse the tree. This enables the use of
feature combinations in making predictions.
65. Disadvantages of Decision Tree
● Prone to Overfitting: Decision trees can become overly complex and fit
noise in the training data, leading to poor generalization on unseen data.
● Sensitive to Noise: Decision trees can be sensitive to noisy data,
especially when the tree is deep.
● Sensitive to Changes in Data: Small changes in the training data can
significantly affect the tree’s structure, making it unstable.
● Greedy Algorithm: The tree-building process is greedy, meaning it
makes locally optimal decisions at each split without considering global
implications.
● Non-Continuous Predictions: Decision trees produce step-like
predictions, which may not be suitable for problems requiring smooth
outputs.
66. Building a Decision Tree: A Step-by-Step Approach
Constructing a decision tree for the "Play Golf" dataset.
Name-Swaraj Pradhan
Registration number-2002050082
67. Consider the table below. It represents factors that affect whether John would go out to play golf or not. Using
the data in the table, build a decision tree model that can be used to predict whether John would play golf
or not.
Figure 1: "Play Golf"
dataset
68. Step 1: Determine the Decision Column
➢ Since decision trees are used for classification, you need to determine the classes which are the basis for
the decision.
➢ In this case, it is the last column, that is, the Play Golf column with classes Yes and No.
➢ To determine the root node we need to compute the entropy.
➢ To do this, we create a frequency table for the classes (the Yes/No column).
Table 2: Frequency
Table
69. Step 2: Calculating Entropy for the classes (Play Golf)
Entropy
➢ It is the measure of impurity (or uncertainty) in the data. It lies between 0 and 1 and is calculated
using the formula Entropy(S) = − Σ pᵢ log₂(pᵢ), where pᵢ is the proportion of class i in S.
➢ Compute the entropy for the decision column (Play Golf) using the frequency table:
Entropy(PlayGolf) = E(5,9) = − (5/14) log₂(5/14) − (9/14) log₂(9/14) ≈ 0.94
70. Step 3: Calculate Entropy for Other Attributes After Split (contd..)
For the other four attributes, we need to calculate the entropy after each of the split.
● E(PlayGolf, Outlook)
● E(PlayGolf, Temperature)
● E(PlayGolf, Humidity)
● E(PlayGolf,Windy)
➢ The entropy for two attributes is calculated as the weighted sum E(T, X) = Σc P(c) E(c),
where P(c) is the proportion of examples in group c and E(c) is that group's entropy.
➢ Therefore, to calculate E(PlayGolf, Outlook), we would use:
E(PlayGolf, Outlook) = P(Sunny) E(3,2) + P(Overcast) E(4,0) + P(Rainy) E(2,3)
71. This frequency table is given below:
Table 3: Frequency Table for Outlook
Let’s go ahead and calculate E(3,2):
E(3,2) = − (3/5) log₂(3/5) − (2/5) log₂(2/5) ≈ 0.971
We do not need to calculate the second and third terms separately, because
E(4,0) = 0 and E(2,3) = E(3,2).
73. E (PlayGolf, Temperature) Calculation
Table 4: Frequency Table for Temperature
E(PlayGolf, Temperature) = P(Hot) E(2,2) + P(Cold) E(3,1) + P(Mild) E(4,2)
76. Step 4: Calculating Information Gain for Each Split
● Calculate information gain for each attribute using the formula:
Gain(S, T) = Entropy(S) – Entropy(S, T).
● Then the attribute with the largest information gain is used for the split.
Gain(PlayGolf, Outlook) = Entropy(PlayGolf) – Entropy(PlayGolf, Outlook)
= 0.94 – 0.693 = 0.247
Gain(PlayGolf, Temperature) = Entropy(PlayGolf) – Entropy(PlayGolf, Temperature)
= 0.94 – 0.911 = 0.029
Gain(PlayGolf, Humidity) = Entropy(PlayGolf) – Entropy(PlayGolf, Humidity)
= 0.94 – 0.788 = 0.152
Gain(PlayGolf, Windy) = Entropy(PlayGolf) – Entropy(PlayGolf, Windy)
= 0.94 – 0.892 = 0.048
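These numbers can be reproduced with a short script (a sketch, not part of the original slides),
using the class counts from the frequency tables above:
import math
def entropy(*counts):
    total = sum(counts)
    return -sum(c / total * math.log2(c / total) for c in counts if c > 0)
def split_entropy(groups):
    # Weighted entropy after a split; each group is a tuple of (No, Yes) counts.
    total = sum(sum(g) for g in groups)
    return sum(sum(g) / total * entropy(*g) for g in groups)
e_play = entropy(5, 9)                                   # ~0.940
e_outlook = split_entropy([(3, 2), (4, 0), (2, 3)])      # Sunny, Overcast, Rainy -> ~0.693
e_temperature = split_entropy([(2, 2), (3, 1), (4, 2)])  # Hot, Cold, Mild -> ~0.911
print(round(e_play - e_outlook, 3))       # Gain(PlayGolf, Outlook)     ~0.247
print(round(e_play - e_temperature, 3))   # Gain(PlayGolf, Temperature) ~0.029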
77. Step 5: Perform the First Split (contd..)
➢ Now that we have all the information gain, we then split the tree based on the attribute with the
highest information gain.
➢ From our calculation, the highest information gain comes from Outlook. Therefore the split will
look like this:
Figure 2: Decision Tree after first split
78. ➢ Now that we have the first stage of the decision tree, we see that we have one leaf node. But we still
need to split the tree further.
➢ To do that, we also need to split the original table to create sub-tables. These sub-tables are given below.
➢ NOTE: From Table 3, we can see that the Overcast outlook requires no further split because it is just one
homogeneous group, so we have a leaf node.
79. Step 6: Perform Further Splits (contd..)
➢ The Sunny and the Rainy outlooks need to be split further.
➢ The Rainy outlook can be split using either Temperature, Humidity or Windy.
➢ The Humidity attribute is best used for this split because it produces homogeneous
groups.
Table 8: Split using Humidity
80. ➢ The Rainy outlook can then be split using the High and Normal humidity values, giving the tree
below.
Figure 3: Split using the Humidity Attribute
81. ➢ The Sunny outlook can be split using either Temperature, Humidity or Windy.
➢ The Windy attribute is best used for this split because it produces homogeneous groups.
Table 9: Split using Windy Attribute
➢ NOTE: If we do the split using the Windy attribute, we get the final tree, which requires
no further splitting. This is shown in Figure 4.
82. Step 7: Complete the Decision Tree
The final decision tree is constructed with leaf nodes representing the decision classes (Yes/No).
Figure 4: Final Decision Tree
84. BASIC ALGORITHM OF DECISION TREE
Basic algorithm (a greedy algorithm)
● Tree is constructed in a top-down recursive divide-and-conquer manner
● At start, all the training examples are at the root
● Attributes are categorical (if continuous-valued, they are discretized in
advance)
● Examples are partitioned recursively based on selected attributes
● Test attributes are selected on the basis of a heuristic or statistical measure
(e.g., information gain)
Conditions for stopping partitioning
● All samples for a given node belong to the same class
● There are no remaining attributes for further partitioning - majority voting is
employed for classifying the leaf
● There are no samples left
85. DECISION TREE ALGORITHM
ID3 algorithm
C4.5 algorithm
● A successor of ID3
● Became a benchmark to which newer supervised learning algorithms are
often compared.
● Commercial successor: C5.0
CART (Classification and Regression Trees) algorithm
● The generation of binary decision trees
● Developed by a group of statisticians
86. ID3 is a strong system that
● Uses hill-climbing search based on the information gain measure to search
through the space of decision trees
● Outputs a single hypothesis
● Never backtracks. It converges to locally optimal solutions
● Uses all training examples at each step, contrary to methods that make
decisions incrementally
● Uses statistical properties of all examples: the search is less sensitive to
errors in individual training examples
However, ID3 has some drawbacks
● It can only deal with nominal data; continuous attributes must be discretized first
● It may not be robust in the presence of noise (e.g. it can overfit)
● It cannot deal with incomplete data sets (e.g. missing values)
88. CART(Classification And Regression Trees)
● Developed by Breiman, Friedman, Olshen, Stone in early 80’s.
● Introduced tree based modelling into the statistical mainstream.
● Rigorous approach involving cross validation to select the optimal tree
C4.5
● Be robust in the presence of noise.
● Avoid overfitting.
● Deal with continuous attributes.
● Deal with missing data.
● Convert trees to rules.
89. Overfitting and Tree Pruning
Overfitting: An induced tree may overfit the training data
● Too many branches, some may reflect anomalies due to noise or outliers
● Poor accuracy for unseen samples
Two approaches to avoid overfitting
➢ Prepruning: Halt tree construction early; do not split a node if this would result in the
goodness measure falling below a threshold
● Difficult to choose an appropriate threshold
➢ Postpruning: Remove branches from a "fully grown" tree to get a sequence of
progressively pruned trees
● Use a set of data different from the training data to decide which is the "best pruned
tree"
90. Tree Pruning
Cost Complexity pruning
● Post pruning approach used in CART
● Cost complexity - function of number of leaves and error rate of the tree
● For each internal node cost complexity is calculated wrt original and pruned
versions
● If pruning results in a smaller cost complexity - subtree is pruned
● Uses a separate prune set
Pessimistic Pruning
● Uses training set and adjusts error rates by adding a penalty
Minimum Description Length (MDL) principle
Issues: Repetition and Replication
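Cost-complexity (post-)pruning is available in scikit-learn's CART implementation. A minimal
sketch (assumed dataset and setup, not from the original slides), where the held-out split plays
the role of the separate prune set:
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
X, y = load_breast_cancer(return_X_y=True)
X_train, X_prune, y_train, y_prune = train_test_split(X, y, random_state=0)
# Candidate alpha values from the cost-complexity pruning path of the fully grown tree.
path = DecisionTreeClassifier(random_state=0).cost_complexity_pruning_path(X_train, y_train)
# Keep the alpha whose pruned tree scores best on the prune set.
best_alpha, best_score = 0.0, 0.0
for alpha in path.ccp_alphas:
    clf = DecisionTreeClassifier(random_state=0, ccp_alpha=alpha).fit(X_train, y_train)
    score = clf.score(X_prune, y_prune)
    if score > best_score:
        best_alpha, best_score = alpha, score
print(best_alpha, best_score)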
92. MISSING DATA IN DECISION TREE
Missing data can arise due to various reasons such as incomplete data collection, sensor
malfunctions, or participants opting not to provide certain information. Dealing with missing data in
decision trees involves making decisions at nodes even when some values are missing.
(I) In a decision tree, when a node is being split based on a feature, instances with missing
values for that feature can still be assigned to one of the branches. The decision tree algorithm
considers other available features to determine the best split.
(II) Decision tree algorithms typically include rules for handling instances with missing values.
These rules guide the placement of instances with missing data during the tree-building process.
(III) Some decision tree algorithms may perform automatic imputation for missing values.
Imputation involves estimating or replacing missing values with a predicted or calculated value.
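A small sketch of point (III), imputing missing values before building the tree (an assumed toy
setup, not from the original slides; recent scikit-learn versions can also split on NaN values
directly):
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.pipeline import make_pipeline
from sklearn.tree import DecisionTreeClassifier
# Toy data with a missing value (np.nan) in the first feature.
X = np.array([[1.0, 20.0], [np.nan, 15.0], [3.0, 30.0], [4.0, 10.0]])
y = np.array([0, 0, 1, 1])
# Replace missing entries with the column mean, then fit the tree on the completed data.
model = make_pipeline(SimpleImputer(strategy="mean"), DecisionTreeClassifier())
model.fit(X, y)
print(model.predict([[np.nan, 25.0]]))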
93. EFFECTIVE DECISION TREE
Some key characteristics of an effective decision tree
Interpretability: Decision trees are inherently interpretable. The structure of the tree, with nodes representing
decisions and branches representing outcomes, is easy to understand, making it a valuable tool for explaining
predictions to non-experts.
Simplicity: Effective decision trees are designed to be relatively simple. They aim to capture the essential
patterns in the data without unnecessary complexity. Simplicity helps with both understanding and
implementation.
Versatility: Decision trees can be applied to various types of problems, including classification and regression
tasks. They are capable of handling both categorical and numerical data, making them versatile for a wide range
of applications.
94. Cont.
Handling Missing Values: Effective decision trees can handle datasets with missing values. They have
mechanisms for making decisions at nodes even when certain values are missing, ensuring that available
information is still utilized.
Pruning for Generalization: Effective decision trees often undergo pruning, a process that removes
unnecessary branches. Pruning helps prevent overfitting, allowing the model to generalize well to new, unseen
data.
Scalability: Decision trees are computationally efficient and can handle datasets of varying sizes. This
scalability makes them suitable for applications in both small and large data environments.
Non-Linearity: Decision trees can model non-linear relationships in the data, allowing them to capture complex
patterns and interactions between variables. This is especially beneficial when the relationships are not easily
represented by linear models.
95. CONCLUSION
In conclusion, decision trees are powerful tools in the realm of data-driven decision-making. Their
simplicity, interpretability, and versatility make them valuable in various domains, from business
and finance to healthcare and beyond. Effective decision trees exhibit characteristics such as the
ability to handle both categorical and numerical data, automatic feature selection, scalability, and
the capacity to model non-linear relationships.
Their transparency allows decision-makers to understand and trust the decision-making process,
fostering confidence in the model's predictions. The feature importance insights provided by
decision trees contribute to a deeper understanding of the factors influencing outcomes.
96. CONCLUSION (Cont.)
Furthermore, the adaptability of decision trees to handle missing values and their
incorporation into ensemble methods like Random Forests enhance their robustness and
predictive performance. With proper pruning techniques, decision trees can generalize well
to new data, preventing overfitting.
Ultimately, decision trees offer a clear and accessible framework for navigating complex
decision spaces, making them an indispensable tool for those seeking actionable insights
and informed choices from their data.