Contents
• What is a Decision Tree?
• Types of Decision Trees
• Key Terminology
• Splitting Criteria
• Pruning Techniques
• Decision Tree Algorithms
• Advantages
• Disadvantages
• Applications
What is a Decision Tree?
• A tree-like model used in machine learning to make decisions.
• Composed of:
• Internal nodes = Features
• Branches = Decision rules
• Leaves = Final output (labels or values)
Decision trees are intuitive supervised learning models. They work for
classification and regression by splitting data based on features. The goal
is to make accurate predictions through a series of simple decisions.
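As a concrete illustration, here is a minimal sketch (assuming scikit-learn, which the slides do not name) that fits a shallow tree and prints its learned rules, making the node/branch/leaf structure visible:

```python
# Minimal sketch (scikit-learn assumed): fit a shallow classification
# tree and print its rules to see nodes, branches, and leaves.
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier, export_text

data = load_iris()
X, y = data.data, data.target

# Internal nodes test a feature, branches are the decision rules,
# and leaves carry the final class label.
tree = DecisionTreeClassifier(max_depth=2, random_state=0)
tree.fit(X, y)

print(export_text(tree, feature_names=list(data.feature_names)))
```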
Types of Decision Trees
• Classification Tree: Predicts categorical outputs (e.g., Yes/No).
• Regression Tree: Predicts continuous outputs (e.g., Price).
• Binary Splits: Two branches per node.
• Multi-way Splits: More than two branches (used in some algorithms).
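To make the first two types concrete, a small sketch (scikit-learn assumed; the toy data is invented for illustration) contrasting a classification tree with a regression tree:

```python
# Sketch (scikit-learn assumed; toy data invented for illustration).
from sklearn.tree import DecisionTreeClassifier, DecisionTreeRegressor

# Classification tree: categorical output (0 = "No", 1 = "Yes").
X_cls = [[25, 40000], [35, 60000], [45, 80000], [20, 20000]]  # age, income
y_cls = [0, 1, 1, 0]
clf = DecisionTreeClassifier().fit(X_cls, y_cls)
print(clf.predict([[30, 50000]]))    # predicts a class label

# Regression tree: continuous output (e.g., a price).
X_reg = [[50], [80], [120], [200]]   # e.g., floor area
y_reg = [100000, 150000, 220000, 400000]
reg = DecisionTreeRegressor().fit(X_reg, y_reg)
print(reg.predict([[100]]))          # predicts a numeric value
```

Note that scikit-learn implements CART, so both trees above use binary splits; multi-way splits appear in algorithms such as CHAID, covered later.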
Key Terminology
Root Node
The starting point of the tree representing the entire dataset. It’s where the first split happens.
Leaf Node
The end point of a branch with no further splits. It gives the final prediction or class.
Splitting
Dividing a node into smaller sub-nodes based on a feature to increase purity.
Branch / Sub-Tree
A path or section of the tree connecting nodes, representing decision outcomes.
Pruning
Removing unnecessary branches to reduce overfitting and simplify the tree.
Parent Node / Child Node
A parent node splits into child nodes; child nodes come directly from a parent node.
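These terms map directly onto a fitted tree's data structure. A short sketch (scikit-learn assumed) that walks the nodes and labels each one as a parent that splits or a leaf that predicts:

```python
# Sketch (scikit-learn assumed): walk a fitted tree and label each node
# with the terminology above. Node 0 is the root; a node whose child
# index is -1 has no further splits and is therefore a leaf.
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
t = DecisionTreeClassifier(max_depth=2, random_state=0).fit(X, y).tree_

for node in range(t.node_count):
    left, right = t.children_left[node], t.children_right[node]
    if left == -1:  # leaf node: gives the final prediction
        print(f"node {node}: leaf, predicted class {t.value[node].argmax()}")
    else:           # parent node: splits into two child nodes
        print(f"node {node}: splits on feature {t.feature[node]} "
              f"at {t.threshold[node]:.2f} -> children {left}, {right}")
```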
Pruning Techniques
Pruning helps reduce overfitting by removing unnecessary branches from the decision tree; both styles are sketched in code after this list.
Pre-pruning
• Stops the tree from growing too deep based on conditions like maximum depth or
minimum number of samples.
• It's faster but can lead to underfitting.
Post-pruning
• First builds a full tree, then removes branches that do not improve accuracy on a
validation set.
• More accurate but takes more time.
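A sketch of both styles (scikit-learn assumed, where post-pruning is exposed as cost-complexity pruning via the ccp_alpha parameter):

```python
# Sketch (scikit-learn assumed). Pre-pruning = stopping conditions set
# before growing; post-pruning = grow a full tree, then cut back using a
# validation set to choose the cost-complexity penalty ccp_alpha.
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)
X_tr, X_val, y_tr, y_val = train_test_split(X, y, random_state=0)

# Pre-pruning: limit depth and minimum samples per leaf up front.
pre = DecisionTreeClassifier(max_depth=4, min_samples_leaf=5, random_state=0)
pre.fit(X_tr, y_tr)

# Post-pruning: enumerate candidate alphas from the full tree, then keep
# the pruned tree that scores best on held-out validation data.
path = DecisionTreeClassifier(random_state=0).cost_complexity_pruning_path(X_tr, y_tr)
post = max(
    (DecisionTreeClassifier(ccp_alpha=a, random_state=0).fit(X_tr, y_tr)
     for a in path.ccp_alphas),
    key=lambda t: t.score(X_val, y_val),
)
print("pre-pruned depth:", pre.get_depth(), "| post-pruned depth:", post.get_depth())
```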
Decision Tree Algorithms
• ID3
Uses Information Gain for splits. Works with categorical data. Prone to
overfitting.
• C4.5
Improved version of ID3. Supports continuous and categorical data. Uses
Gain Ratio and pruning.
• CART
Uses the Gini Index. Builds binary trees. Supports both classification and
regression. Prunes trees to avoid overfitting.
• CHAID
Uses Chi-square tests. Handles multi-way splits. Best for categorical data.
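To make the criteria named above concrete, a pure-Python sketch (toy labels invented for illustration) computing the entropy-based Information Gain used by ID3/C4.5 and the Gini Index used by CART:

```python
# Pure-Python sketch of the two split criteria (toy labels invented).
from collections import Counter
from math import log2

def entropy(labels):            # used by ID3 / C4.5
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())

def gini(labels):               # used by CART
    n = len(labels)
    return 1 - sum((c / n) ** 2 for c in Counter(labels).values())

parent = ["yes", "yes", "yes", "no", "no", "no"]
left, right = ["yes", "yes", "yes"], ["no", "no", "no"]  # candidate split

# Information Gain = parent entropy - weighted average child entropy.
child = (len(left) * entropy(left) + len(right) * entropy(right)) / len(parent)
print("information gain:", entropy(parent) - child)        # 1.0, a perfect split
print("gini before:", gini(parent), "after:", gini(left), gini(right))  # 0.5 -> 0.0
```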
Advantages
• Easy to understand and interpret.
• Minimal data preparation (no scaling/normalization).
• Works with both numerical and categorical data.
• Handles missing values and outliers well.
Disadvantages
• High risk of overfitting (especially with deep trees).
• Unstable: Small changes in data can drastically change the tree.
• Greedy algorithm: Doesn't guarantee the globally optimal tree.
• Biased toward features with many levels.
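The instability point is easy to demonstrate: a quick sketch (scikit-learn assumed) fitting unpruned trees on two bootstrap resamples of the same data, which typically produces trees of different shapes:

```python
# Sketch (scikit-learn assumed): nearly identical training data can
# yield noticeably different unpruned trees.
from sklearn.datasets import load_breast_cancer
from sklearn.tree import DecisionTreeClassifier
from sklearn.utils import resample

X, y = load_breast_cancer(return_X_y=True)

for seed in (0, 1):
    Xs, ys = resample(X, y, random_state=seed)  # bootstrap resample
    t = DecisionTreeClassifier(random_state=0).fit(Xs, ys)
    print(f"resample {seed}: depth={t.get_depth()}, nodes={t.tree_.node_count}")
```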
Applications
• Customer Segmentation: Identify customer types for targeting.
• Credit Scoring: Predict likelihood of default.
• Medical Diagnosis: Predict diseases from symptoms.
• Spam Detection: Classify emails as spam or not.
• Fraud Detection: Identify suspicious transactions.