Decision Trees: Intuition to Implementation
Classification & Regression Trees (CART/ID3) — Teaching Deck
Why Decision Trees?
• Interpretable, white-box models with human-readable rules
• Handle mixed data types (numeric + categorical) and non-linear boundaries
• Minimal preprocessing (no scaling required); robust to outliers
• Foundation for powerful ensembles (Random Forests, Gradient Boosting)
Core Idea
• Recursively split the feature space to create regions with high class purity (classification)
• Choose the split that maximizes impurity reduction / information gain at each node
• Continue until a stopping criterion is met, then assign leaf predictions (see the sketch below)
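A minimal sketch of this recursion, assuming small NumPy arrays and binary splits on numeric features only; gini, best_split, and build_tree are illustrative names, not any library's implementation.

import numpy as np

def gini(y):
    # Gini impurity of a label vector: G = 1 - sum_k p_k^2
    _, counts = np.unique(y, return_counts=True)
    p = counts / counts.sum()
    return 1.0 - np.sum(p ** 2)

def best_split(X, y):
    # Try every feature and candidate threshold; keep the split with the
    # lowest weighted child impurity (i.e., the largest impurity reduction).
    best, best_score = None, np.inf
    for j in range(X.shape[1]):
        values = np.unique(X[:, j])
        for t in (values[:-1] + values[1:]) / 2.0:   # midpoints of sorted unique values
            left = X[:, j] <= t
            score = (left.sum() * gini(y[left]) + (~left).sum() * gini(y[~left])) / len(y)
            if score < best_score:
                best, best_score = (j, t), score
    return best

def build_tree(X, y, depth=0, max_depth=3, min_samples=2):
    stop = depth >= max_depth or len(y) < min_samples or gini(y) == 0.0
    split = None if stop else best_split(X, y)
    if split is None:
        # Leaf: predict the majority class of the samples that reached this node
        values, counts = np.unique(y, return_counts=True)
        return {"leaf": values[np.argmax(counts)]}
    j, t = split
    mask = X[:, j] <= t
    return {"feature": j, "threshold": t,
            "left": build_tree(X[mask], y[mask], depth + 1, max_depth, min_samples),
            "right": build_tree(X[~mask], y[~mask], depth + 1, max_depth, min_samples)}

# Tiny illustrative dataset: one feature, two well-separated classes
X = np.array([[1.0], [2.0], [3.0], [10.0], [11.0], [12.0]])
y = np.array([0, 0, 0, 1, 1, 1])
print(build_tree(X, y))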
Impurity Measures (Classification)
• Gini: G = 1 - sum_k p_k^2
• Entropy: H = -sum_k p_k log_2 p_k
• Information Gain = Parent Impurity − Weighted Child Impurities (worked example below)
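The three formulas written out on a small, hand-checkable node; a sketch using NumPy only, with parent/child counts chosen purely for illustration.

import numpy as np

def gini(p):
    # G = 1 - sum_k p_k^2, with p the vector of class proportions at a node
    p = np.asarray(p, dtype=float)
    return 1.0 - np.sum(p ** 2)

def entropy(p):
    # H = -sum_k p_k log2 p_k (terms with p_k = 0 contribute nothing)
    p = np.asarray(p, dtype=float)
    p = p[p > 0]
    return -np.sum(p * np.log2(p))

# Parent node: 10 samples, 5 per class (maximally impure for two classes)
parent = [0.5, 0.5]
print("parent gini:", gini(parent), "parent entropy:", entropy(parent))  # 0.5 and 1.0

# Candidate split: left child gets 6 samples (5 vs 1), right child 4 samples (0 vs 4)
n_left, n_right = 6, 4
weighted = (n_left * entropy([5/6, 1/6]) + n_right * entropy([0.0, 1.0])) / (n_left + n_right)
print("information gain:", entropy(parent) - weighted)  # roughly 0.61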
Split Selection
• Numerical features: try candidate thresholds (e.g., midpoints of sorted unique values); see the snippet below
• Categorical features: group categories (may be exhaustive or heuristic)
• Pick the split with the best impurity reduction (ties broken by secondary criteria)
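Candidate-threshold generation for a single numeric feature, scored by weighted Gini impurity; x and y below are made-up values purely to illustrate the midpoint rule.

import numpy as np

def gini(y):
    _, counts = np.unique(y, return_counts=True)
    p = counts / counts.sum()
    return 1.0 - np.sum(p ** 2)

# One numeric feature and binary labels (illustrative data)
x = np.array([2.0, 3.5, 3.5, 6.0, 7.5, 9.0])
y = np.array([0,   0,   1,   1,   1,   1])

# Candidate thresholds: midpoints between consecutive sorted unique values
values = np.unique(x)
candidates = (values[:-1] + values[1:]) / 2.0

def weighted_gini(threshold):
    left = x <= threshold
    return (left.sum() * gini(y[left]) + (~left).sum() * gini(y[~left])) / len(y)

best = min(candidates, key=weighted_gini)
print("best threshold:", best, "weighted Gini:", weighted_gini(best))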
CART vs. ID3/C4.5
• ID3/C4.5: uses Entropy/Information Gain (or Gain Ratio), often for categorical features
• CART: uses Gini (classification) and MSE (regression), builds binary trees
• Modern libraries (e.g., scikit-learn) implement CART-style trees (criterion choice shown below)
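scikit-learn's trees are CART-style (binary splits), but the impurity criterion is a constructor parameter, so the entropy criterion familiar from ID3/C4.5 is available too; a quick comparison on the bundled iris data, with training accuracy printed only for illustration.

from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)

# Same binary-tree builder; only the impurity criterion changes
for criterion in ("gini", "entropy"):
    clf = DecisionTreeClassifier(criterion=criterion, max_depth=3, random_state=42)
    clf.fit(X, y)
    print(criterion, clf.score(X, y))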
Stopping Criteria (Pre-pruning)
• max_depth: limit tree depth
• min_samples_split / min_samples_leaf: minimum samples required to split a node / to form a leaf
• max_leaf_nodes: cap the number of leaves
• min_impurity_decrease: require sufficient gain to split (usage example below)
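These pre-pruning controls map directly onto DecisionTreeClassifier constructor arguments; the specific values below are arbitrary examples, not recommendations.

from sklearn.datasets import load_breast_cancer
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)

clf = DecisionTreeClassifier(
    max_depth=5,                 # limit tree depth
    min_samples_split=10,        # a node needs >= 10 samples to be split
    min_samples_leaf=5,          # every leaf keeps >= 5 samples
    max_leaf_nodes=20,           # cap the number of leaves
    min_impurity_decrease=1e-3,  # require at least this much gain to split
    random_state=42,
)
clf.fit(X, y)
print("depth:", clf.get_depth(), "leaves:", clf.get_n_leaves())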
Post-pruning (Cost-Complexity)
• Grow a large tree, then prune back by penalizing complexity
• Minimize: R_alpha(T) = sum_{leaves t} R(t) + alpha * |leaves|
• In scikit-learn: tune ccp_alpha via cross-validation (sketch below)
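A sketch of tuning ccp_alpha: cost_complexity_pruning_path enumerates the effective alphas for the fully grown tree, and cross-validation picks among them (toy dataset, no held-out test set, purely illustrative).

import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)

# Effective alphas at which subtrees would be pruned away
path = DecisionTreeClassifier(random_state=42).cost_complexity_pruning_path(X, y)

scores = []
for alpha in path.ccp_alphas:
    clf = DecisionTreeClassifier(random_state=42, ccp_alpha=alpha)
    scores.append(cross_val_score(clf, X, y, cv=5).mean())

best_alpha = path.ccp_alphas[int(np.argmax(scores))]
print("best ccp_alpha:", best_alpha)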
Regression Trees
• Impurity measure: Mean Squared Error (MSE) or Mean Absolute Error (MAE)
• Leaf prediction: mean of the targets in the leaf (the median minimizes MAE)
• Same control parameters; beware of overfitting on noisy targets (example below)
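The same idea for regression with scikit-learn's DecisionTreeRegressor; the noisy sine data is synthetic, and the criterion names follow recent scikit-learn versions (squared_error / absolute_error).

import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.RandomState(0)
X = np.sort(rng.uniform(0, 5, size=(200, 1)), axis=0)
y = np.sin(X).ravel() + rng.normal(scale=0.2, size=200)  # noisy sine wave

# MSE criterion -> leaves predict the mean; MAE -> leaves predict the median
for criterion in ("squared_error", "absolute_error"):
    reg = DecisionTreeRegressor(criterion=criterion, max_depth=4, random_state=0)
    reg.fit(X, y)
    print(criterion, "R^2 on training data:", round(reg.score(X, y), 3))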
Bias–Variance Trade-off
• Shallow tree: high bias, low variance (underfits)
• Deep tree: low bias, high variance (overfits)
• Cross-validate depth and leaf sizes to balance performance (sketch below)
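A sketch of balancing the trade-off by cross-validating max_depth; the dataset and the depth grid are arbitrary choices.

from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)

for depth in (1, 2, 3, 5, 10, None):
    clf = DecisionTreeClassifier(max_depth=depth, random_state=42)
    mean_acc = cross_val_score(clf, X, y, cv=5).mean()
    print(f"max_depth={depth}: mean CV accuracy {mean_acc:.3f}")
# Very shallow trees underfit (high bias); unrestricted depth tends to overfit (high variance).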
Interpretability & Explanations
• Path explanations: ‘IF (conditions) THEN prediction’
• Global feature importance (impurity decrease)
• Caveat: impurity-based importance can be biased toward high-cardinality features (example below)
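Both kinds of explanation are available from a fitted scikit-learn tree: export_text prints the IF/THEN paths and feature_importances_ gives the impurity-based global importances (iris data used only as a stand-in).

from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier, export_text

data = load_iris()
clf = DecisionTreeClassifier(max_depth=3, random_state=42).fit(data.data, data.target)

# Human-readable IF (condition) THEN ... rules, one line per node
print(export_text(clf, feature_names=list(data.feature_names)))

# Global impurity-based importances (can favour high-cardinality features)
for name, imp in zip(data.feature_names, clf.feature_importances_):
    print(f"{name}: {imp:.3f}")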
Handling Practicalities
• Missing values: impute before training, or use surrogate splits if the library supports them
• Class imbalance: class_weight, balanced subsampling, or threshold tuning
• No need for feature scaling; still wise to encode categoricals consistently (pipeline sketch below)
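One way to wire these practicalities together in scikit-learn: impute missing numerics, integer-encode categoricals, and reweight classes on the tree itself; the column names and the tiny frame are invented for illustration.

import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OrdinalEncoder
from sklearn.tree import DecisionTreeClassifier

# Tiny illustrative frame with a missing numeric value and a categorical column
df = pd.DataFrame({
    "age": [25, 32, np.nan, 51, 46, 38],
    "plan": ["basic", "pro", "basic", "pro", "basic", "pro"],
    "churned": [0, 0, 0, 1, 1, 0],
})
X, y = df[["age", "plan"]], df["churned"]

preprocess = ColumnTransformer([
    ("num", SimpleImputer(strategy="median"), ["age"]),  # impute missing numerics
    ("cat", OrdinalEncoder(), ["plan"]),                 # integer codes; trees need no scaling
])

model = Pipeline([
    ("prep", preprocess),
    ("tree", DecisionTreeClassifier(class_weight="balanced", random_state=42)),
])
model.fit(X, y)
print(model.predict(X))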
scikit-learn Example (Classification)
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import GridSearchCV

clf = DecisionTreeClassifier(random_state=42)
param_grid = {
    'max_depth': [None, 3, 5, 10],
    'min_samples_leaf': [1, 2, 5, 10],
    'ccp_alpha': [0.0, 0.001, 0.01],
}
grid = GridSearchCV(clf, param_grid, cv=5, n_jobs=-1)
grid.fit(X_train, y_train)
print(grid.best_params_, grid.best_score_)
Exporting Rules / Visualization
from sklearn import tree
import matplotlib.pyplot as plt

plt.figure(figsize=(12, 6))
tree.plot_tree(grid.best_estimator_, feature_names=feature_names,
               class_names=class_names, filled=True)
plt.show()
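To export the rules as text rather than a figure, export_text can be applied to the same tuned estimator; this sketch reuses the grid and feature_names objects from the previous slides and assumes feature_names can be turned into a list of strings.

from sklearn.tree import export_text

# Nested 'feature <= threshold' lines, one block per decision path
rules = export_text(grid.best_estimator_, feature_names=list(feature_names))
print(rules)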
Pros & Cons (Summary)
• Pros: simple, interpretable, minimal preprocessing, handles interactions
• Cons: unstable to small data changes, prone to overfitting, axis-aligned splits only
• Often used as base learners for ensembles (RF, GBM, XGBoost)
