Why Decision Trees?
• Interpretable, white-box models with human-readable rules
• Handle mixed data types (numeric + categorical) and non-linear boundaries
• Minimal preprocessing (no scaling required); robust to outliers
• Foundation for powerful ensembles (Random Forests, Gradient Boosting)
Core Idea
• Recursively split the feature space to create regions with high class purity (classification)
• Choose the split that maximizes impurity reduction / information gain at each node
• Continue until a stopping criterion is met, then assign leaf predictions (see the sketch below)
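A minimal sketch of this idea with scikit-learn; the iris dataset and random_state are illustrative choices:

  from sklearn.datasets import load_iris
  from sklearn.tree import DecisionTreeClassifier

  X, y = load_iris(return_X_y=True)
  clf = DecisionTreeClassifier(random_state=0).fit(X, y)   # recursive partitioning happens inside fit()
  print(clf.get_depth(), clf.get_n_leaves())               # tree depth and number of leaf regions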
Impurity Measures (Classification)
• Gini: G = 1 - sum_k p_k^2
• Entropy: H = -sum_k p_k log_2 p_k
• Information Gain = parent impurity - weighted sum of child impurities
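The formulas above written out as small NumPy helpers; this is a sketch and the function names are my own:

  import numpy as np

  def gini(labels):
      # G = 1 - sum_k p_k^2 over the class proportions in `labels`
      _, counts = np.unique(labels, return_counts=True)
      p = counts / counts.sum()
      return 1.0 - np.sum(p ** 2)

  def entropy(labels):
      # H = -sum_k p_k * log2(p_k)
      _, counts = np.unique(labels, return_counts=True)
      p = counts / counts.sum()
      return -np.sum(p * np.log2(p))

  def information_gain(parent, left, right, impurity=entropy):
      # parent impurity minus the size-weighted impurities of the two children
      w_left = len(left) / len(parent)
      w_right = len(right) / len(parent)
      return impurity(parent) - w_left * impurity(left) - w_right * impurity(right)

  y = np.array([0, 0, 0, 1, 1, 1])
  print(gini(y), entropy(y))                 # 0.5 and 1.0 for a 50/50 node
  print(information_gain(y, y[:3], y[3:]))   # 1.0: this split yields perfectly pure children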
Split Selection
• Numerical features: try candidate thresholds (e.g., midpoints of sorted unique values)
• Categorical features: group categories (may be exhaustive or heuristic)
• Pick the split with the best impurity reduction, with ties broken by secondary criteria (threshold-search sketch below)
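A sketch of the threshold search for a single numeric feature, scoring candidate midpoints by Gini reduction; the helper and toy data are illustrative:

  import numpy as np

  def gini(labels):
      _, counts = np.unique(labels, return_counts=True)
      p = counts / counts.sum()
      return 1.0 - np.sum(p ** 2)

  def best_threshold(x, y):
      # return the midpoint threshold with the largest Gini reduction
      values = np.sort(np.unique(x))
      parent = gini(y)
      best_t, best_gain = None, -np.inf
      for t in (values[:-1] + values[1:]) / 2.0:           # candidate midpoints
          left, right = y[x <= t], y[x > t]
          gain = parent - len(left) / len(y) * gini(left) - len(right) / len(y) * gini(right)
          if gain > best_gain:
              best_t, best_gain = t, gain
      return best_t, best_gain

  x = np.array([1.0, 2.0, 3.0, 10.0, 11.0, 12.0])
  y = np.array([0, 0, 0, 1, 1, 1])
  print(best_threshold(x, y))                # (6.5, 0.5): a clean split of the two classes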
CART vs. ID3/C4.5
• ID3/C4.5: use Entropy / Information Gain (or Gain Ratio), often for categorical features
• CART: uses Gini (classification) and MSE (regression), builds binary trees
• Modern libraries (e.g., scikit-learn) implement CART-style trees
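In scikit-learn the split criterion is simply a parameter on the CART-style tree, so an entropy-based tree is one keyword away (a brief sketch):

  from sklearn.tree import DecisionTreeClassifier

  cart_gini = DecisionTreeClassifier(criterion="gini")        # CART default
  cart_entropy = DecisionTreeClassifier(criterion="entropy")  # ID3/C4.5-style criterion on binary splits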
Stopping Criteria (Pre-pruning)
• max_depth: limit tree depth
• min_samples_split / min_samples_leaf: minimum samples required to split a node / at a leaf
• max_leaf_nodes: cap the number of leaves
• min_impurity_decrease: require sufficient gain to split (example configuration below)
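An example pre-pruned configuration; the specific values below are arbitrary and should be tuned (e.g., by cross-validation):

  from sklearn.tree import DecisionTreeClassifier

  clf = DecisionTreeClassifier(
      max_depth=5,                 # limit tree depth
      min_samples_split=20,        # need at least 20 samples to attempt a split
      min_samples_leaf=10,         # every leaf keeps at least 10 samples
      max_leaf_nodes=32,           # cap the number of leaves
      min_impurity_decrease=1e-3,  # require sufficient gain to split
      random_state=0,
  )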
Post-pruning (Cost-Complexity)
• Grow a large tree, then prune back by penalizing complexity
• Minimize: R_alpha(T) = sum_{leaves} R(t) + alpha * |leaves|
• In scikit-learn: tune ccp_alpha via cross-validation (sketch below)
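A sketch of tuning ccp_alpha by cross-validation; the candidate alphas come from cost_complexity_pruning_path, and the dataset is only an example:

  from sklearn.datasets import load_breast_cancer
  from sklearn.model_selection import GridSearchCV
  from sklearn.tree import DecisionTreeClassifier

  X, y = load_breast_cancer(return_X_y=True)
  path = DecisionTreeClassifier(random_state=0).cost_complexity_pruning_path(X, y)
  grid = GridSearchCV(DecisionTreeClassifier(random_state=0),
                      param_grid={"ccp_alpha": path.ccp_alphas.clip(0)},  # clip guards against tiny negative alphas from rounding
                      cv=5)
  grid.fit(X, y)
  print(grid.best_params_)                   # chosen pruning strength

A fitted grid.best_estimator_ from a search like this is the kind of tree object the visualization snippet near the end expects.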
Regression Trees
• Impurity measure: Mean Squared Error (MSE) or Mean Absolute Error (MAE)
• Leaf prediction: average of the targets in the leaf
• Same control parameters; beware of overfitting on noisy targets (minimal example below)
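A minimal regression-tree sketch; recent scikit-learn versions spell the criteria "squared_error" (MSE) and "absolute_error" (MAE), and the dataset and depth here are illustrative:

  from sklearn.datasets import load_diabetes
  from sklearn.tree import DecisionTreeRegressor

  X, y = load_diabetes(return_X_y=True)
  reg = DecisionTreeRegressor(criterion="squared_error", max_depth=3, random_state=0)
  reg.fit(X, y)
  print(reg.predict(X[:3]))                  # each prediction is the mean target of its leaf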
Bias–Variance Trade-off
• Shallow tree: high bias, low variance (underfits)
• Deep tree: low bias, high variance (overfits)
• Cross-validate depth and leaf sizes to balance performance (sketch below)
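A sketch of cross-validating tree depth to find that balance; the depth grid and dataset are illustrative:

  from sklearn.datasets import load_breast_cancer
  from sklearn.model_selection import cross_val_score
  from sklearn.tree import DecisionTreeClassifier

  X, y = load_breast_cancer(return_X_y=True)
  for depth in [1, 2, 4, 8, None]:           # None lets the tree grow until leaves are pure
      tree_clf = DecisionTreeClassifier(max_depth=depth, random_state=0)
      print(depth, cross_val_score(tree_clf, X, y, cv=5).mean().round(3))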
Interpretability & Explanations
• Path explanations: "IF (conditions) THEN prediction"
• Global feature importance (impurity decrease)
• Caveat: impurity-based importance can be biased toward high-cardinality features (sketch below)
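A sketch of extracting IF/THEN path rules and impurity-based importances from a fitted tree; iris is used purely as an example dataset:

  from sklearn.datasets import load_iris
  from sklearn.tree import DecisionTreeClassifier, export_text

  data = load_iris()
  clf = DecisionTreeClassifier(max_depth=3, random_state=0).fit(data.data, data.target)
  print(export_text(clf, feature_names=list(data.feature_names)))  # human-readable path rules
  print(clf.feature_importances_)            # impurity-decrease importances (see the caveat above)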
Handling Practicalities
• Missing values: impute before training, or use surrogate splits if the library supports them
• Class imbalance: class_weight, balanced subsampling, or threshold tuning
• No need for feature scaling; still wise to encode categoricals consistently (pipeline sketch below)
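A sketch covering imputation and class weighting in one pipeline; the tiny arrays are illustrative:

  import numpy as np
  from sklearn.impute import SimpleImputer
  from sklearn.pipeline import make_pipeline
  from sklearn.tree import DecisionTreeClassifier

  X = np.array([[1.0, 2.0], [np.nan, 3.0], [4.0, np.nan], [5.0, 6.0]])
  y = np.array([0, 0, 0, 1])                 # imbalanced labels

  model = make_pipeline(
      SimpleImputer(strategy="median"),                            # impute before training
      DecisionTreeClassifier(class_weight="balanced", random_state=0),
  )
  model.fit(X, y)
  print(model.predict(X))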
Exporting Rules / Visualization
• Plot the fitted tree (assumes a fitted tree estimator, e.g., grid.best_estimator_ as in the cost-complexity sketch above, plus feature_names and class_names lists):

  from sklearn import tree
  import matplotlib.pyplot as plt

  plt.figure(figsize=(12, 6))
  tree.plot_tree(grid.best_estimator_,       # any fitted decision tree works here
                 feature_names=feature_names,
                 class_names=class_names,
                 filled=True)
  plt.show()
Pros & Cons (Summary)
• Pros: simple, interpretable, minimal preprocessing, handles feature interactions
• Cons: unstable to small data changes, prone to overfitting, axis-aligned splits only
• Often used as base learners for ensembles (RF, GBM, XGBoost); see the Random Forest sketch below
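For comparison, a minimal sketch of trees as base learners in a Random Forest; the dataset and settings are illustrative:

  from sklearn.datasets import load_breast_cancer
  from sklearn.ensemble import RandomForestClassifier
  from sklearn.model_selection import cross_val_score

  X, y = load_breast_cancer(return_X_y=True)
  forest = RandomForestClassifier(n_estimators=200, random_state=0)
  print(cross_val_score(forest, X, y, cv=5).mean())   # averaged accuracy across 5 folds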